Robust Speech Recognition Using Long Short-Term Memory Networks and Bottleneck Features

Document Type : Original Article

Authors

1 Faculty of Computer Engineering, Iran University of Science and Technology, Tehran, Iran

2 Faculty of Computer Engineering, K. N. Toosi University of Technology, Tehran, Iran

Abstract

Deep neural networks have been widely used in speech recognition systems in recent years; however, the robustness of these models in the presence of environmental noise has received less attention. In this paper, we propose two approaches for making deep neural network models robust against additive environmental noise. In the first approach, we increase the robustness of long short-term memory (LSTM) networks in the presence of noise by exploiting their ability to learn long-term noise behavior: the models are trained on noisy rather than clean speech, so the LSTMs learn in a noise-aware manner. Results on a noisy version of the TIMIT dataset show that training the models on noisy speech instead of clean speech improves recognition accuracy by up to 18%. In the second approach, we reduce the effect of noise on the extracted features using a denoising autoencoder, and we use bottleneck features to compress the feature vector into a higher-level representation of the input. This method increases the accuracy of the recognition system from the first approach by a further 4% in the presence of noise.
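The noise-aware training described in the first approach depends on corrupting clean utterances with additive noise at controlled signal-to-noise ratios before training. A minimal sketch of such SNR-controlled mixing is shown below; the function name and interface are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` scaled so the mixture has the target SNR in dB."""
    # Tile or truncate the noise so it covers the whole clean signal.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose gain g so that 10*log10(p_clean / (g^2 * p_noise)) == snr_db.
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise
```

Generating several copies of each training utterance at different SNRs in this way yields the multi-condition training data that lets the LSTM observe long-term noise behavior during training.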

Keywords


 
