A Two Phase Speech Enhancement Based on Deep Denoising Autoencoder

Document Type: Original Article

Authors

Department of Computer Engineering and Information Technology, Amirkabir University of Technology, Iran

Abstract

Both short-term and long-term information in the speech signal are useful for speech enhancement, especially when the signal is corrupted by a mixture of stationary and non-stationary noise. This paper proposes a new approach that provides long-term speech input to a deep denoising autoencoder by reducing the number of frequency sub-bands of the input data, so that a wider span of frames fits into a fixed-size network input. The paper also proposes a two-phase enhancement scheme: the first phase performs short-term enhancement with a deep denoising autoencoder, and the second phase applies a long-term denoising autoencoder to the output of the first phase. The proposed models were evaluated on the Aurora-2 speech recognition corpus, where they yield a significant improvement of 0.3 in PESQ score at lower SNR values. On the recognition task, the proposed method reduces the word error rate by 4% under multi-condition training compared with the baseline MFCC front-end.
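The pipeline summarized above can be sketched in a few lines of NumPy. This is only an illustrative mock-up, not the authors' implementation: the network sizes, the context widths, the uniform sub-band averaging, and the random (untrained) weights are all assumptions made for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def reduce_subbands(spec, n_bands):
    """Average adjacent frequency bins into n_bands sub-bands, shrinking
    the per-frame feature size so a longer frame context can be fed to a
    fixed-size network input (the long-term view)."""
    frames, bins = spec.shape
    edges = np.linspace(0, bins, n_bands + 1).astype(int)
    return np.stack([spec[:, a:b].mean(axis=1)
                     for a, b in zip(edges[:-1], edges[1:])], axis=1)

def splice(spec, context):
    """Concatenate +/- context frames around each frame (edge-padded)."""
    padded = np.pad(spec, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(spec)]
                      for i in range(2 * context + 1)])

class DenoisingAutoencoder:
    """Single-hidden-layer autoencoder with random weights, standing in
    for a trained deep denoising autoencoder."""
    def __init__(self, n_in, n_hidden):
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.b2 = np.zeros(n_in)
    def __call__(self, x):
        h = np.tanh(x @ self.W1 + self.b1)
        return h @ self.W2 + self.b2

# Noisy log-spectrogram stand-in: 100 frames x 64 frequency bins.
noisy = rng.normal(size=(100, 64))

# Phase 1: short-term DDAE on full-resolution frames, narrow context.
short_ctx = 1
dae_short = DenoisingAutoencoder(64 * (2 * short_ctx + 1), 128)
phase1 = dae_short(splice(noisy, short_ctx))[:, short_ctx * 64:(short_ctx + 1) * 64]

# Phase 2: long-term DDAE on the sub-band-reduced phase-1 output, wide context.
long_ctx = 5
reduced = reduce_subbands(phase1, 16)   # 64 bins -> 16 sub-bands
dae_long = DenoisingAutoencoder(16 * (2 * long_ctx + 1), 128)
phase2 = dae_long(splice(reduced, long_ctx))[:, long_ctx * 16:(long_ctx + 1) * 16]

print(phase1.shape, phase2.shape)  # (100, 64) (100, 16)
```

The point of the sub-band reduction is visible in the input sizes: with 64 bins, a context of ±1 frame already gives a 192-dimensional input, while with 16 sub-bands a much wider ±5-frame context still needs only 176 dimensions.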

Keywords


References

[1]      S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust. Speech Signal Process., vol. 27, no. 2, pp. 113–120, 1979.
[2]      K. K. Ravi and P. V. Subbaiah, “A survey on speech enhancement methodologies,” Int. J. Intell. Syst. Appl., vol. 8, no. 12, p. 37, 2016.
[3]      V. Sunnydayal, N. Sivaprasad and T. K. Kumar, “A survey on statistical based single channel speech enhancement techniques,” Int. J. Intell. Syst. Appl., vol. 6, no. 12, p. 69, 2014.
[4]      I. Cohen and B. Berdugo, “Noise estimation by minima controlled recursive averaging for robust speech enhancement,” IEEE Signal Process. Lett., vol. 9, no. 1, pp. 12–15, 2002.
[5]      I. Cohen, “Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator,” IEEE Signal Process. Lett., vol. 9, no. 4, pp. 113–116, 2002.
[6]      Y. Ephraim and I. Cohen, “Recent advancements in speech enhancement,” Circuits Signals Speech Image Process., 2006.
[7]      Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Trans. Acoust. Speech Signal Process., vol. 33, no. 2, pp. 443–445, 1985.
[8]      Y. Ephraim and D. Malah, “Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator,” IEEE Trans. Acoust. Speech Signal Process., vol. 32, no. 6, pp. 1109–1121, 1984.
[9]      M. Geravanchizadeh and S. Ghaemi Sardroudi, “Speech enhancement based on particle swarm optimization using masking properties of the human auditory system,” Tabriz Journal of Electrical Engineering, vol. 46, no. 3, pp. 287–297, Winter 1395 (in Persian).
[10]      D. Wang, “Time-Frequency masking for speech separation and Its potential for hearing aid design,” Trends Amplif., vol. 12, no. 4, pp. 332–353, 2008.
[11]      H. Shayeghi and A. Ghasemi, “Day-ahead electricity price forecasting with an improved neural network based on wavelet transform and a chaotic gravitational search method,” Tabriz Journal of Electrical Engineering, vol. 45, no. 4, pp. 103–113, Winter 1394 (in Persian).
[12]      F. Karbalaei, H. Shabani and R. Ebrahimpour, “Offline transient stability evaluation through accurate determination of CCT using a neural network with energy-function-based inputs,” Tabriz Journal of Electrical Engineering, vol. 46, no. 1, pp. 277–285, Winter 1395 (in Persian).
[13]      Y. Xu, J. Du, L. R. Dai and C. H. Lee, “A regression approach to speech enhancement based on deep neural networks,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 23, no. 1, pp. 7–19, Jan. 2015.
[14]      F. Weninger et al., “Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR,” in International Conference on Latent Variable Analysis and Signal Separation, 2015, pp. 91–99.
[15]      B. Li, Y. Tsao and K. C. Sim, “An investigation of spectral restoration algorithms for deep neural networks based noise robust speech recognition,” in Proceedings of Interspeech 2013, pp. 3002–3006, 2013.
[16]      Z. Chen, S. Watanabe, H. Erdogan and J. R. Hershey, “Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks,” in Proceedings of Interspeech 2015, pp. 3274–3278, 2015.
[17]      L. Deng and D. Yu, “Deep learning: methods and applications,” Foundations and Trends in Signal Processing, vol. 7, no. 3–4, pp. 197–387, 2014.
[18]      L. Dehyadegary, S. Ali Seyyedsalehi and I. Nejadgholi, “Nonlinear enhancement of noisy speech, using continuous attractor dynamics formed in recurrent neural networks,” Neurocomputing, vol. 74, no. 17, pp. 2716–2724, Oct. 2011.
[19]      S. Tan and K. C. Sim, “Learning utterance-level normalisation using Variational Autoencoders for robust automatic speech recognition,” in 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 43–49, 2016.
[20]      X. Lu, Y. Tsao, S. Matsuda and C. Hori, “Speech enhancement based on deep denoising autoencoder,” in Proceedings of Interspeech 2013, pp. 436–440, 2013.
[21]      Y. Xu, J. Du, L.-R. Dai and C.-H. Lee, “An experimental study on speech enhancement based on deep neural networks,” IEEE Signal Process. Lett., vol. 21, no. 1, pp. 65–68, 2014.
[22]      G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” science, vol. 313, no. 5786, pp. 504–507, 2006.
[23]      P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio and P.-A. Manzagol, “Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion,” J Mach Learn Res, vol. 11, pp. 3371–3408, Dec. 2010.
[24]      T. Gao, J. Du, Y. Xu, C. Liu, L.-R. Dai and C.-H. Lee, “Improving deep neural network based speech enhancement in low SNR environments,” in International Conference on Latent Variable Analysis and Signal Separation, pp. 75–82, 2015.
[25]      D. Pearce and H.-G. Hirsch, “The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions,” in ISCA ITRW ASR2000, pp. 29–32, 2000.
[26]      S. Vihari, A. S. Murthy, P. Soni and D. C. Naik, “Comparison of speech enhancement algorithms,” Procedia Comput. Sci., vol. 89, pp. 666–676, 2016.