Learning a Deep Predictive Coding Network for a Semi-Supervised 3D-Hand Pose Estimation

Jamal Banzi; Isack Bulugu; Zhongfu Ye

doi:10.1109/JAS.2020.1003090

Volume 7 Issue 5

Sep. 2020

IEEE/CAA Journal of Automatica Sinica



Jamal Banzi, Isack Bulugu and Zhongfu Ye, "Learning a Deep Predictive Coding Network for a Semi-Supervised 3D-Hand Pose Estimation," IEEE/CAA J. Autom. Sinica, vol. 7, no. 5, pp. 1371-1379, Sept. 2020. doi: 10.1109/JAS.2020.1003090


Funds: This work was supported in part by the Fundamental Research Funds for the Central Universities (WK2350000002)


Author Bio:
Jamal Banzi (M’17) received the Ph.D. degree in information and communication engineering from the University of Science and Technology of China in 2019. He is currently working as a Lecturer at the Sokoine University, Tanzania. His research interests include computer vision and pattern recognition, deep learning, sign language recognition, and human-machine interaction.

Isack Bulugu received the B.Sc. degree in electronics from the University of Dar-es-Salaam, Tanzania, in 2007. He received the master's and Ph.D. degrees in signal and information processing engineering from Tianjin University of Technology and Education and the University of Science and Technology of China in 2014 and 2018, respectively. He is currently working at the University of Dar-es-Salaam, Tanzania. His research interests include image processing, hand gesture recognition, and artificial intelligence.

Zhongfu Ye is a Professor in the Department of Electronic Engineering and Information Science, University of Science and Technology of China (USTC). He obtained the Ph.D. degree in signal and information processing from USTC in 1995. He is the author of over 200 academic papers published in major national and international journals and at important international conferences, including IEEE, IET, PASP, PASA, and Acta Electronica Sinica. He has been teaching and conducting research at USTC since 1995 and was appointed Full Professor in 2000. He is currently the Director of the Signal Statistics and Processing Research Center of USTC, Head of the Discipline of Signal and Information Processing of USTC, Ph.D. Supervisor at the Institute of Electronics of CAS, and a Member of the Editorial Boards of Journal of Communication, Acta Armament, and Journal of Data Acquisition and Processing. He is also a Reviewer for IEEE, IET, international conferences, and national academic journals, a Member of the Committee of Instrument Science and Control Technology, Chinese Association of Higher Education, and a Member of the Committee of Photonics, Chinese Society of Astronautics. His current research interests include array signal processing, speech/audio signal processing, and image processing.
Corresponding author: J. Banzi is with the Sokoine University, Morogoro 3000, Tanzania (e-mail: jbanzi@mail.ustc.edu.cn)
Received Date: 2019-01-10
Revised Date: 2019-06-08
Accepted Date: 2019-10-24

Available Online: 2019-12-18

Abstract


In this paper, we present a CNN-based approach for real-time 3D hand pose estimation from depth sequences. Prior discriminative approaches have achieved remarkable success but face two main challenges. First, they are fully supervised and hence require large amounts of annotated training data to extract the dynamic information from a hand representation. Second, they rely on unreliable hand detectors, built either on strong assumptions or on weak detection, which often fail in situations such as complex environments and multiple hands. In contrast, this paper presents an approach that can be considered semi-supervised: it performs predictive coding of image sequences of hand poses to capture the latent features underlying a given image without supervision. The hand is modelled using a novel latent tree dependency model (LDTM), which transforms internal joint locations into an explicit representation. The modelled hand topology is then integrated with the pose estimator using a data-dependent method to jointly learn latent variables of the posterior pose appearance and the pose configuration, respectively. Finally, an unsupervised error term, which is part of the recurrent architecture, ensures smooth estimation of the final pose. Experiments on three challenging public datasets, ICVL, MSRA, and NYU, demonstrate that the proposed method performs comparably to or better than state-of-the-art approaches.
- Convolutional neural networks
- Deep learning
- Hand pose estimation
- Human-machine interaction
- Predictive coding
- Recurrent neural networks
- Unsupervised learning
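
The abstract describes predictive coding of hand-pose image sequences together with an unsupervised error term. The following is a minimal sketch of what such an error term might look like, not the authors' implementation. It assumes a PredNet-style formulation in which the error is split into rectified positive and negative parts, and it uses the previous frame as a stand-in for the output of a learned recurrent predictor. The names `Frame`, `prediction_error`, and `sequence_loss` are hypothetical.

```python
from typing import List, Sequence

# Hypothetical representation: a depth frame flattened to a list of floats.
Frame = List[float]


def relu(x: float) -> float:
    """Rectifier: keeps positive values, maps everything else to zero."""
    return x if x > 0.0 else 0.0


def prediction_error(actual: Sequence[float], predicted: Sequence[float]) -> List[float]:
    """Concatenate the rectified positive and negative prediction errors."""
    if len(actual) != len(predicted):
        raise ValueError("frames must have the same length")
    positive = [relu(a - p) for a, p in zip(actual, predicted)]
    negative = [relu(p - a) for a, p in zip(actual, predicted)]
    return positive + negative


def sequence_loss(frames: Sequence[Frame]) -> float:
    """Mean prediction error over a sequence; needs no pose annotations.

    Assumption: the previous frame serves as the prediction of the next one.
    A real model would replace this with a learned recurrent predictor.
    """
    if len(frames) < 2:
        return 0.0
    total = 0.0
    count = 0
    for previous, current in zip(frames, frames[1:]):
        error = prediction_error(current, previous)
        total += sum(error)
        count += len(error)
    return total / count
```

Because the loss depends only on consecutive frames, it can be computed on unlabelled depth sequences, which is what makes this kind of training signal unsupervised.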




Highlights

A new way of modelling hand topology using the latent tree dependency model (LDTM), which transforms internal joint locations into an explicit hand representation. This representation is more compact and is invariant to scale and view angle.
A strong hand detector is integrated with the deep-learning-based pose estimator in a single pipeline, so that hand pose estimation draws on prior knowledge of the human hand.
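
To illustrate why a tree-structured hand representation can be compact and scale-invariant, here is a small sketch. It is not the LDTM from the paper. It assumes a hypothetical joint ordering with a `parents` array (where `-1` marks the root) and encodes each joint as the unit direction from its parent. This encoding is invariant to translation and uniform scaling; invariance to view angle would additionally require rotating the hand into a canonical frame, which is omitted here.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def bone_directions(joints: Sequence[Point], parents: Sequence[int]) -> List[Point]:
    """Return the unit vector from each joint's parent to the joint.

    The root joint (parent index -1) has no bone and is skipped.
    """
    if len(joints) != len(parents):
        raise ValueError("joints and parents must have the same length")
    directions: List[Point] = []
    for child, parent in enumerate(parents):
        if parent < 0:
            continue  # root joint: nothing to encode
        delta = [c - p for c, p in zip(joints[child], joints[parent])]
        length = math.sqrt(sum(d * d for d in delta))
        if length == 0.0:
            raise ValueError("joint coincides with its parent")
        directions.append((delta[0] / length, delta[1] / length, delta[2] / length))
    return directions
```

Usage: for a three-joint chain with `parents = [-1, 0, 1]`, shifting every joint by the same offset or doubling the size of the hand leaves the output of `bone_directions` unchanged.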
