Human Action Recognition Using Transfer Learning with Spatio-Temporal Templates

Document Type : Original Article

Authors

Electrical Engineering Department, Yazd University, Yazd, Iran

Abstract

A gait energy image (GEI) is a spatial template that collapses the moving regions of a video into a single image in which pixels with more motion appear brighter. The discrete wavelet transform template (DWT-TEMP) is a temporal template that represents how the motion changes over time. The static and dynamic information of each video is compressed using these templates. In the proposed method, every video is divided into N groups of successive frames, and a GEI and a DWT-TEMP are computed for each group, yielding N spatial and N temporal templates. Transfer learning is then used for classification. The method achieves recognition accuracies of 92.40%, 95.30%, and 87.06% on the UCF Sports, UCF-11, and Olympic Sports action datasets, respectively.
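To make the template construction concrete, the following is a minimal Python sketch. It assumes motion masks obtained by simple frame differencing and a temporal template taken from the level-1 DWT approximation along the time axis; the paper's exact segmentation and DWT-TEMP construction may differ, and the function and parameter names (make_templates, n_groups, wavelet, thresh) are illustrative only.

```python
# Minimal sketch (assumptions): motion masks from simple frame differencing,
# GEI as the mean of binary motion masks per group, and a temporal template
# taken as the level-1 DWT approximation along the time axis. The paper's
# exact DWT-TEMP construction may differ; names here are illustrative.
import numpy as np
import pywt


def make_templates(frames, n_groups=4, wavelet="haar", thresh=25):
    """frames: (T, H, W) uint8 grayscale video.
    Returns N spatial templates (GEIs) and N temporal templates (DWT-TEMPs)."""
    geis, dwt_temps = [], []
    for group in np.array_split(frames, n_groups):           # split video into N groups
        diffs = np.abs(np.diff(group.astype(np.int16), axis=0))
        masks = (diffs > thresh).astype(np.float32)           # binary motion regions
        geis.append(masks.mean(axis=0))                       # GEI: more motion -> brighter
        # 1-D DWT along the time axis captures each pixel's temporal variation
        approx, _detail = pywt.dwt(group.astype(np.float32), wavelet, axis=0)
        dwt_temps.append(approx.mean(axis=0))                 # collapse to one temporal image
    return geis, dwt_temps
```

The resulting template images can then be fed to a pretrained convolutional network that is fine-tuned on the target action classes, corresponding to the transfer-learning classification step described above.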

Keywords

