Visual Speech Recognition using Spatial-Temporal Gradient Analysis

Document Type: Original Article


1 Cyber Space Research Institute, Shahid Beheshti University, Tehran, Iran

2 Cyber Space Research Institute, Shahid Beheshti University, Tehran, Iran


Visual information offers an important route to speech recognition when audio information is unavailable. This paper presents a method for visual speech recognition that describes the spatial-temporal changes in the lip region. Image gradients were used for feature extraction. In the proposed method, after the lip region was detected and key points were extracted, the gradient was computed to describe the spatial information at the key points. To describe how the key lip areas change during speaking, curve fitting was applied to the path of the 3D histogram of gradients. The main focus of this research was to provide an adequate description of speech. For this purpose, different classifiers were tested and the best-performing one was identified. The proposed method was evaluated on the MIRACL-VC1 database, and it improved on previous speech recognition methods by about 11 to 17 percent.
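The gradient-based description of key points can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `gradient_histogram_3d`, the cube radius, the bin counts, and the azimuth/elevation binning are all assumptions chosen for illustration. The sketch treats a grayscale clip as a (time, height, width) array and builds a magnitude-weighted orientation histogram of the spatial-temporal gradient around each key point.

```python
import numpy as np

def gradient_histogram_3d(volume, point, radius=2, az_bins=8, el_bins=4):
    """Magnitude-weighted orientation histogram of spatial-temporal
    gradients in a cube centred on one key point (t, y, x).
    Hypothetical helper; all parameters are illustrative assumptions."""
    volume = np.asarray(volume, dtype=float)
    # Finite-difference gradients along time, rows and columns.
    gt, gy, gx = np.gradient(volume)
    # Cube around the key point, clipped to the volume bounds.
    cube = tuple(
        slice(max(c - radius, 0), min(c + radius + 1, n))
        for c, n in zip(point, volume.shape)
    )
    gt, gy, gx = gt[cube], gy[cube], gx[cube]
    magnitude = np.sqrt(gx**2 + gy**2 + gt**2)
    azimuth = np.arctan2(gy, gx)                   # spatial direction, [-pi, pi]
    elevation = np.arctan2(gt, np.hypot(gx, gy))   # temporal tilt, [-pi/2, pi/2]
    hist, _, _ = np.histogram2d(
        azimuth.ravel(),
        elevation.ravel(),
        bins=(az_bins, el_bins),
        range=((-np.pi, np.pi), (-np.pi / 2, np.pi / 2)),
        weights=magnitude.ravel(),
    )
    total = hist.sum()
    # Normalise so descriptors are comparable across key points.
    return (hist / total).ravel() if total > 0 else hist.ravel()

# Usage on synthetic data: two assumed key points in a random 10-frame clip.
rng = np.random.default_rng(0)
clip = rng.random((10, 32, 32))
key_points = [(5, 16, 12), (5, 16, 20)]
descriptor = np.concatenate(
    [gradient_histogram_3d(clip, p) for p in key_points]
)
```

Each key point contributes one 32-bin histogram here, so the concatenated descriptor grows linearly with the number of key points. Tracking such histograms over successive frames and fitting a curve to their path, as the abstract describes, would be a separate step that this sketch does not cover.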

