Dynamic Hand Gesture Recognition Based on Short-Term Sampling Neural Networks

Wenjin Zhang; Jiacun Wang; Fangping Lan

doi:10.1109/JAS.2020.1003465

Volume 8 Issue 1

Jan. 2021

IEEE/CAA Journal of Automatica Sinica

Wenjin Zhang, Jiacun Wang and Fangping Lan, "Dynamic Hand Gesture Recognition Based on Short-Term Sampling Neural Networks," IEEE/CAA J. Autom. Sinica, vol. 8, no. 1, pp. 110-120, Jan. 2021. doi: 10.1109/JAS.2020.1003465

Author Bio:
Wenjin Zhang received the B.S. degree in software engineering from Changshu Institute of Technology in 2018, and the M.S. degree in software engineering from Monmouth University, USA, in 2020. He is now an Adjunct Professor in the Department of Computer Science and Software Engineering at Monmouth University. His research interests include machine learning, deep learning, and computer vision.

Jiacun Wang (M’00–SM’00) received the Ph.D. degree in computer engineering from Nanjing University of Science and Technology (NUST) in 1991. He is currently a Professor of software engineering at Monmouth University, USA. From 2001 to 2004, he was a Member of Scientific Staff with Nortel Networks in Richardson, Texas. Prior to joining Nortel, he was a Research Associate of the School of Computer Science, Florida International University (FIU) at Miami. Prior to joining FIU, he was an Associate Professor at NUST. His research interests include software engineering, discrete event systems, formal methods, wireless networking, and real-time distributed systems. He authored Timed Petri Nets: Theory and Application (Kluwer, 1998), Real-time Embedded Systems (Wiley, 2018), and Formal Methods in Computer Science (CRC, 2019), edited Handbook of Finite State Based Models and Applications (CRC, 2012), and published about 90 research papers in journals and conferences. Dr. Wang was an Associate Editor of IEEE Transactions on Systems, Man and Cybernetics, Part C. He has served as general chair, program chair, special sessions chair, or program committee member for many international conferences.

Fangping Lan received the B.S. degree in computer engineering from Changshu Institute of Technology (CIT) in 2019. She is currently a graduate student of software engineering at Monmouth University, USA. From Sept. 2019 to May 2020, she was a Graduate Research Assistant with Monmouth University. Her research interests include machine learning and deep learning.
Corresponding author: The authors are with the Department of Computer Science and Software Engineering, Monmouth University, New Jersey 07740 USA (e-mail: s1260807@monmouth.edu; jwang@monmouth.edu; fangpinglan0116@gmail.com)
Received Date: 2020-04-08
Revised Date: 2020-07-20
Accepted Date: 2020-08-13

Available Online: 2020-09-02

Abstract

Hand gestures are a natural means of human-robot interaction. Vision-based dynamic hand gesture recognition has become an active research topic because of its wide range of applications. This paper presents a novel deep learning network for hand gesture recognition. The network integrates several well-proven modules to learn both short-term and long-term features from video inputs while avoiding intensive computation. To learn short-term features, each video input is segmented into a fixed number of frame groups. A frame is randomly selected from each group and represented as an RGB image as well as an optical flow snapshot. These two entities are fused and fed into a convolutional neural network (ConvNet) for feature extraction. The ConvNets for all groups share parameters. To learn long-term features, outputs from all ConvNets are fed into a long short-term memory (LSTM) network, which predicts the final classification result. The new model has been tested on two popular hand gesture datasets, namely the Jester dataset and the Nvidia dataset. Compared with other models, our model produced very competitive results. The robustness of the new model has also been demonstrated on an augmented dataset with enhanced diversity of hand gestures.
Keywords
- Convolutional neural network (ConvNet)
- hand gesture recognition
- long short-term memory (LSTM) network
- short-term sampling
- transfer learning
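
The short-term sampling step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_frame_indices`, its parameters, and the use of near-equal contiguous groups are hypothetical. The abstract states only that each video is segmented into a fixed number of frame groups and that one frame is randomly selected from each group.

```python
import random


def sample_frame_indices(num_frames, num_groups, rng=None):
    """Pick one random frame index from each of num_groups contiguous groups.

    Assumption: groups are contiguous and as equal in size as possible.
    The source does not specify how group boundaries are chosen.
    """
    if num_groups <= 0 or num_frames < num_groups:
        raise ValueError("need at least one frame per group")
    rng = rng or random.Random()
    # Group g covers the half-open index range [bounds[g], bounds[g + 1]).
    bounds = [g * num_frames // num_groups for g in range(num_groups + 1)]
    return [rng.randrange(bounds[g], bounds[g + 1]) for g in range(num_groups)]


if __name__ == "__main__":
    # A hypothetical 37-frame clip split into 8 groups.
    print(sample_frame_indices(37, 8, random.Random(0)))
```

Because exactly one frame is drawn per group, the number of frames passed to the ConvNets stays fixed regardless of video length, which is consistent with the stated goal of avoiding intensive computation. In the pipeline the abstract describes, each sampled frame would then be paired with its optical flow snapshot before feature extraction.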

Highlights

This study designed a new deep learning neural network model that integrates several state-of-the-art techniques for action recognition to tackle the complexity and performance issues in dynamic hand gesture recognition. Short-term video sampling, feature fusion, ConvNets with transfer learning, and LSTMs are the key components of the new model.
This study developed a novel approach to “zoom out” the existing dataset, increasing its diversity and thus helping to ensure the robustness of a trained model.
Compared with existing models, the proposed network achieved very competitive recognition performance on the two most popular hand gesture datasets, Jester and Nvidia.
