Title 多層次極限學習機於語音訊號處理上的應用
Hierarchical Extreme Learning Machine for Speech Signal Processing
Author 胡泰克 (Hussain, Tassadaq)
Contributors 曹昱 (Yu Tsao); 廖文宏 (Wen-Hung Liao); 胡泰克 (Tassadaq Hussain)
Keywords Speech Signal Processing
Hierarchical Extreme Learning Machines
Highway Extreme Learning Machines
Residual Extreme Learning Machines
Channel Compensation
Multimodal Learning
Model Compression
Date 2020
Deposited 1-Jul-2020 13:49:55 (UTC+8)
Abstract Speech is the most effective and natural medium of communication in human-human interaction. Over the past few decades, a great amount of research has been conducted on various aspects and properties of speech signal processing; however, improving intelligibility for both human listening and machine recognition in real acoustic conditions remains a challenging task. In recent years, voice-controlled personal assistant systems (such as Alexa, Google Home, and HomePod) have been widely adopted and have reshaped human-machine interaction. In practical applications that often involve distant-talking communication (e.g., audio data mining and voice-assisted applications), background noise can severely degrade the quality and intelligibility of speech signals for both human and machine listeners. Therefore, noise suppression that remains robust against changing noise conditions is essential for operation in real-world environments. To address this issue, this dissertation first presents a speech denoising framework that aims (i) to remove background noise from a single-channel speech signal effectively and quickly, (ii) to extract clean speech features from the noisy counterpart even under mismatched testing conditions (stationary and non-stationary noise at various SNR levels), and (iii) to attain strong denoising performance when the amount of training data is limited. The proposed framework, built on hierarchical extreme learning machines (HELM), offers a universal approximation capability. Experimental results demonstrate that, with limited training data, the proposed framework yields comparable or even better speech quality and intelligibility than conventional signal-processing-based and deep-neural-network-based approaches.
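As a rough illustration of the ELM-style regression at the core of such a framework, the sketch below (a minimal Python example with assumed feature dimensions and parameter values, not the dissertation's actual implementation) maps noisy log-spectral features to clean ones: a randomly initialized hidden layer is kept fixed, and only the output weights are solved in closed form by regularized least squares. Stacking several such layers, for example ELM autoencoders followed by a final regression layer, gives the hierarchical variant.

```python
import numpy as np

def train_elm_regressor(x_noisy, y_clean, hidden=512, reg=1e-3, seed=0):
    """One-shot ELM training: fixed random hidden layer plus closed-form output weights.

    x_noisy, y_clean: (frames, dim) matrices of paired noisy/clean log-spectral features.
    Sizes and names are hypothetical; this is an illustrative sketch only.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((x_noisy.shape[1], hidden))  # random input weights, never trained
    b = rng.standard_normal(hidden)                      # random biases
    h = np.tanh(x_noisy @ w + b)                         # hidden-layer activations
    # Regularized least squares for the output weights, the only learned parameters
    beta = np.linalg.solve(h.T @ h + reg * np.eye(hidden), h.T @ y_clean)
    return w, b, beta

def enhance(x_noisy, w, b, beta):
    """Map noisy features to estimated clean features with a trained ELM."""
    return np.tanh(x_noisy @ w + b) @ beta
```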

Besides noise, reverberation is another issue that can affect the effectiveness and robustness of distant-talking communication devices. Reverberation generally refers to the collection of reflected sounds, and it can significantly degrade the performance of speech-related applications. In recent years, the approximation capabilities of deep neural models have been exploited to study the reverberation effect. These studies indicate that neural models have strong regression capabilities and can achieve outstanding speech dereverberation results. However, deep neural models require a large number of reverberant-anechoic training waveform pairs to achieve reasonable performance improvements. Therefore, it is necessary to develop a data-driven solution that generalizes robustly to realistic reverberant conditions and can be optimized with a small amount of training data or, more precisely, adaptation data. Motivated by the promising denoising performance, this dissertation next addresses the reverberation and data-requirement issues with HELM while preserving the advantages of deep neural structures by leveraging an ensemble learning framework. Experimental results reveal that the proposed framework outperforms both traditional methods and a recently proposed integrated deep and ensemble learning algorithm in terms of standardized evaluation metrics under matched and mismatched testing conditions.
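To make the highway and residual variants concrete, the following sketch (illustrative assumptions only, with hypothetical function names and square projections so that dimensions match; not the dissertation's exact formulation) shows how gated or skip connections can be grafted onto stacked random ELM layers before the closed-form output layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_helm_forward(x, layers, beta):
    """Stacked random ELM layers with highway gates (illustrative sketch).

    layers: list of (w, b, w_gate, b_gate) random projections of equal width.
    beta:   output weights, solved in closed form as in a plain ELM.
    """
    h = x
    for w, b, w_gate, b_gate in layers:
        candidate = np.tanh(h @ w + b)           # transformed features
        gate = sigmoid(h @ w_gate + b_gate)      # how much to transform vs. carry
        h = gate * candidate + (1.0 - gate) * h  # highway mix; a residual variant would use candidate + h
    return h @ beta
```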

A common drawback of most modern speech enhancement (SE) approaches is that they are typically evaluated on simulated datasets, where training and testing conditions are generated in controlled environments. Consequently, these approaches suffer from channel mismatch in unseen acoustic conditions and cannot achieve satisfactory performance. In online settings, where data arrive from different channels and environments, an effective solution to the channel mismatch problem is required. This dissertation next addresses the impact of channel mismatch and proposes a HELM-based SE system that converts low-quality bone-conducted microphone utterances into high-quality air-conducted microphone utterances under real acoustic conditions.
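A typical way such a channel-mapping system is wired together is sketched below (a hedged example using librosa, with arbitrary parameter choices that are assumptions rather than the dissertation's settings): the bone-conducted waveform is converted to log-power spectra, a trained regressor predicts the corresponding air-conducted spectra, and the waveform is resynthesized with the input phase.

```python
import numpy as np
import librosa  # assumed available for STFT/ISTFT

def wav_to_logspec(wav, n_fft=512, hop=256):
    """Return (frames, bins) log-power spectra plus the phase for later resynthesis."""
    spec = librosa.stft(wav, n_fft=n_fft, hop_length=hop)
    return np.log(np.abs(spec) ** 2 + 1e-8).T, np.angle(spec)

def logspec_to_wav(logspec, phase, hop=256):
    """Rebuild a waveform from predicted log-power spectra and the input phase."""
    mag = np.sqrt(np.exp(logspec.T))
    return librosa.istft(mag * np.exp(1j * phase), hop_length=hop)

# Usage sketch, assuming enhance() and (w, b, beta) come from the earlier ELM example:
# bcm_logspec, bcm_phase = wav_to_logspec(bcm_wav)
# acm_estimate = logspec_to_wav(enhance(bcm_logspec, w, b, beta), bcm_phase)
```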

Although the effects of noise and reverberation on audio-only frameworks are well examined under diverse sets of synthetically generated conditions, such frameworks must first acquire a large amount of training data, covering as many environmental conditions as possible, to improve robustness against unknown test conditions. Recent literature has exploited the great potential of auxiliary information in human-machine interaction: data obtained from heterogeneous sensors and devices in the Internet of Things (IoT) can support more robust inference, providing further motivation for multimodal learning. Accordingly, multimodal learning has recently been adopted to improve the overall performance of audio-only SE models. This thesis therefore extends the audio-only SE framework and proposes an audio-visual SE system. The results demonstrate that incorporating visual information alongside audio provides clear performance gains over an audio-only SE system under different test conditions.
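One simple way to combine the two streams in an audio-visual SE model is early feature fusion, sketched below (the frame-rate handling and feature choices are assumptions for illustration, not the dissertation's exact pipeline): per-frame visual embeddings are upsampled to the audio frame rate and concatenated with the noisy log-spectral features before the regression layers.

```python
import numpy as np

def fuse_audio_visual(audio_feats, visual_feats):
    """Early fusion of audio and visual features (illustrative sketch).

    audio_feats:  (audio_frames, freq_bins) noisy log-power spectra
    visual_feats: (video_frames, vdim) per-frame lip/face embeddings
    Visual frames are repeated to match the faster audio frame rate,
    then the two streams are concatenated along the feature axis.
    """
    ratio = int(np.ceil(audio_feats.shape[0] / visual_feats.shape[0]))
    visual_up = np.repeat(visual_feats, ratio, axis=0)[: audio_feats.shape[0]]
    return np.concatenate([audio_feats, visual_up], axis=1)
```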

Another emerging focus of deep learning is enabling deep neural models to work in real-world applications. Existing deep neural models are computationally expensive and memory intensive, which limits their deployment on edge devices with scarce memory resources. Building on the successful results of the audio-only and audio-visual SE frameworks, this thesis finally proposes a joint audio-visual SE framework with model and data compression strategies to meet computational demands and enable real-time prediction. The proposed framework demonstrates that incorporating visual information helps retain most of the information lost by the audio-only framework, while model compression further reduces the computational requirements. Compression also brings the model closer to hardware implementation in multimodal environments while retaining efficient regression ability.
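For the compression step, the sketch below shows post-training weight quantization and sign binarization applied to ELM output weights (a minimal illustration under assumed bit widths; the exact binarization and quantization scheme used in the dissertation may differ).

```python
import numpy as np

def quantize_weights(beta, bits=8):
    """Uniform post-training quantization: integer weights plus one per-matrix scale."""
    scale = np.max(np.abs(beta)) / (2 ** (bits - 1) - 1)
    q = np.round(beta / scale).astype(np.int8 if bits <= 8 else np.int16)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for inference."""
    return q.astype(np.float32) * scale

def binarize_weights(beta):
    """Sign binarization with a single scaling factor (illustrative)."""
    alpha = np.mean(np.abs(beta))
    return np.sign(beta).astype(np.int8), alpha
```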
Description Ph.D. dissertation
National Chengchi University
Department of Computer Science
103761507
URI http://nccur.lib.nccu.edu.tw/handle/140.119/130588
Source http://thesis.lib.nccu.edu.tw/record/#G0103761507
Type thesis
Table of Contents Acknowledgements
中文摘要 (Chinese Abstract)
Abstract
Contents
List of Figures
List of Tables

1 SPEECH SIGNAL PROCESSING: AN OVERVIEW
1.1 Background
1.1.1 Speech Denoising
1.1.2 Speech Dereverberation
1.1.3 Channel Compensation
1.1.4 Multimodal Speech Enhancement
1.2 Motivation
1.3 Research Challenges
1.4 Contributions
1.5 Dissertation Outline

2 ELM-BASED SPEECH DENOISING
2.1 Overview
2.2 Introduction
2.3 Proposed Method
2.3.1 Conventional Spectral Restoration Methods
2.3.2 Data Driven Methods
2.3.3 The ELM Model
2.4 Experiments
2.4.1 Experimental Setup
2.4.2 Experimental Results
2.5 Summary

3 ELM-BASED SPEECH DEREVERBERATION
3.1 Overview
3.2 Introduction
3.3 Proposed Method
3.3.1 Ensemble Learning for Speech Signal Processing
3.3.2 HELM-based Speech Dereverberation System
3.3.3 Highway HELM
3.3.4 Residual HELM
3.3.5 Ensemble HELM for Speech Dereverberation
3.4 Experiments
3.4.1 Experimental Setup
3.4.2 Experimental Results
3.5 Summary

4 ELM-BASED CHANNEL COMPENSATION
4.1 Overview
4.2 Introduction
4.3 Proposed Method
4.4 Experiments
4.4.1 Experimental Setup
4.4.2 HELM-based SE System
4.4.3 Spectrogram Analysis
4.4.4 Automatic Speech Recognition
4.4.5 Sensitivity/Stability Towards the Training Data
4.5 Summary

5 ELM-BASED MULTIMODAL SPEECH ENHANCEMENT
5.1 Overview
5.2 Introduction
5.3 Proposed Method
5.3.1 Audio-only SE System
5.3.2 Audio-Visual SE System
5.4 Experiments
5.4.1 Experimental Setup
5.4.2 Experimental Results
5.5 Summary

6 COMPRESSED MULTIMODAL SE
6.1 Overview
6.2 Introduction
6.3 Proposed Method for SE
6.3.1 HELM-based Multimodal System for SE
6.3.2 Binarization and Quantization
6.4 Experimental Evaluations
6.4.1 Experimental Setup
6.4.2 Feature Extraction
6.5 Summary

7 CONCLUSION AND FUTURE WORK
7.1 Conclusion
7.2 Future Work
Bibliography
VITA
dc.relation.reference (參考文獻) [1] J. Benesty, S. Makino, and J. Chen, Speech Enhancement. New York, USA: Springer, 2005.
[2] J. Li, L. Deng, R. HaebUmbach, and Y. Gong, Robust Automatic Speech Recognition: A Bridge to Practical Applications. Academic Press, 2015.
[3] B. Li, Y. Tsao, and K. C. Sim, “An investigation of spectral restoration algorithms for deep neural networks based noise robust speech recognition.,” in Proc. INTERSPEECH, pp. 3002– 006, 2013.
[4] A. ElSolh, A. Cuhadar, and R. Goubran, “Evaluation of speech enhancement techniques for speaker identification in noisy environments,” in Proc. ISMW, pp. 235–239, IEEE, 2007.
[5] J. Li, L. Yang, J. Zhang, Y. Yan, Y. Hu, M. Akagi, and P. C. Loizou, “Comparative intelligibility investigation of single-channel noise reduction algorithms for chinese, japanese, and english,” J. Acoust. Soc. Am., vol. 129, no. 5, pp. 3291–3301, 2011.
[6] F. Yan, A. Men, B. Yang, and Z. Jiang, “An improved ranking-based feature enhancement approach for robust speaker recognition,” IEEE Access, vol. 4, pp. 5258–5267, 2016.
[7] J. Li, S. Sakamoto, S. Hongo, M. Akagi, and Y. Suzuki, “Two-stage binaural speech enhancement with wiener filter for high quality speech communication,” Speech Commun., vol. 53, no. 5, pp. 677–689, 2011.
[8] T. Venema, Compression for clinicians. Delmar Pub, 2006.
[9] H. Levit, “Phd, noise reduction in hearing aids: An overview,” J. Rehabil. Res. Dev.
[10] Y.H. Lai, F. Chen, S.S. Wang, X. Lu, Y. Tsao, and C.H. Lee, “A deep denoising autoencoder approach to improving the intelligibility of vocoded speech in cochlear implant simulation,” IEEE Trans. Biomed. Eng., vol. 64, no. 7, pp. 1568–1578, 2017.
[11] F. Chen, Y. Hu, and M. Yuan, “Evaluation of noise reduction methods for sentence recognition by mandarin speaking cochlear implant listeners,” Ear and hearing, vol. 36, no. 1, pp. 61–71, 2015.
[12] P. Scalart et al., “Speech enhancement based on a priori signal to noise estimation,”in Proc. ICASSP, vol. 2, pp. 629–632, IEEE, 1996.
[13] E. Hänsler and G. Schmidt, Topics in acoustic echo and noise control: selected methods for the cancellation of acoustical echoes, the reduction of background noise, and speech processing. Springer Science & Business Media, 2006.
[14] J. Chen, J. Benesty, Y. Huang, and E. Diethorn, “Fundamentals of noise reduction in spring handbook of speech processing,” Springer, 2008.
[15] R. McAulay and T. Quatieri, “Speech analysis/synthesis based on a sinusoidal representation,” IEEE Trans. Acoust., Speech, Signal Process., vol. 34, no. 4, pp. 744–754, 1986.
[16] T. F. Quatieri and R. J. McAulay, “Shape invariant timescale and pitch modification of speech,” IEEE Trans. Signal Process., vol. 40, no. 3, pp. 497–510, 1992.
[17] J. Makhoul, “Linear prediction: A tutorial review,” Proceedings of the IEEE, vol. 63, no. 4, pp. 561–580, 1975.
[18] S. Suhadi, C. Last, and T. Fingscheidt, “A data-driven approach to a priori SNR estimation,” IEEE Trans. Audio, Speech, Language Process., vol. 19, no. 1, pp. 186– 195, 2011.
[19] T. Lotter and P. Vary, “Speech enhancement by map spectral amplitude estimation using a super gaussian speech model,” EURASIP Journal on Applied Signal Processing, vol. 2005, pp. 1110–1126, 2005.
[20] U. Kjems and J. Jensen, “Maximum likelihood based noise covariance matrix estimation
for multi-microphone speech enhancement,” in Proc. EUSIPCO, pp. 295–299, IEEE, 2012.
[21] R. McAulay and M. Malpass, “Speech enhancement using a soft decision noise suppression filter,” IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 2, pp. 137–145, 1980.
[22] Y.C. Su, Y. Tsao, J.E. Wu, and F.R. Jean, “Speech enhancement using generalized maximum a posteriori spectral amplitude estimator,” in Proc. ICASSP, pp. 7467–7471, IEEE, 2013.
[23] R. Frazier, S. Samsam, L. Braida, and A. Oppenheim, “Enhancement of speech by adaptive filtering,” in Proc. ICASSP, vol. 1, pp. 251–253, IEEE, 1976.
[24] Y. Ephraim, “Statistical model-based speech enhancement systems,” Proceedings of the IEEE, vol. 80, no. 10, pp. 1526–1555, 1992.
[25] B. Atal and M. Schroeder, “Predictive coding of speech signals and subjective error criteria,” IEEE Trans. Acoust., Speech, Signal Process., vol. 27, no. 3, pp. 247–254, 1979.
[26] L. Rabiner and B. Juang, “An introduction to hidden markov models,” ieee asp magazine, vol. 3, no. 1, pp. 4–16, 1986.
[27] C.T. Lin, “Single-channel speech enhancement in variable noise level environment,” IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans, vol. 33, no. 1, pp. 137–143, 2003.
[28] C. F. Stallmann and A. P. Engelbrecht, “Gramophone noise detection and reconstruction using time delay artificial neural networks,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 47, no. 6, pp. 893–905, 2017.
[29] J. Tchorz and B. Kollmeier, “SNR estimation based on amplitude modulation analysis with applications to noise suppression,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 3, pp. 184–192, 2003.
[30] S. Tamura, “An analysis of a noise reduction neural network,” in Proc. ICASSP, pp. 2001–2004, IEEE, 1989.
[31] F. Xie and D. Van Compernolle, “A family of MLP-based nonlinear spectral estimators for noise reduction,” in Proc. ICASSP, vol. 2, pp. II–53, IEEE, 1994.
[32] E. A. Wan and A. T. Nelson, “Networks for speech enhancement,” Handbook of neural networks for speech processing. Artech House, Boston, USA, vol. 139, p. 1, 1999.
[33] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
[34] D. Erhan, Y. Bengio, A. Courville, P.A. Manzagol, P. Vincent, and S. Bengio, “Why does unsupervised pre-training help deep learning?” Journal of Machine Learning Research, vol. 11, pp. 625–660, 2010.
[35] B. Xia and C. Bao, “Speech enhancement with weighted denoising autoencoder,” in Proc. INTERSPEECH, pp. 3444–3448, 2013.
[36] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.
[37] A. L. Maas, Q. V. Le, T. M. O’Neil, O. Vinyals, P. Nguyen, and A. Y. Ng, “Recurrent neural networks for noise reduction in robust ASR,” in Proc. ICASSP, 2012.
[38] M. Wöllmer, Z. Zhang, F. Weninger, B. Schuller, and G. Rigoll, “Feature enhancement by bidirectional LSTM networks for conversational speech recognition in highly nonstationary noise,” in Proc. ICASSP, pp. 6822–6826, IEEE, 2013.
[39] M. Z. Uddin, M. M. Hassan, A. Almogren, A. Alamri, M. Alrubaian, and G. Fortino, “Facial expression recognition utilizing local direction-based robust features and deep belief network,” IEEE Access, vol. 5, pp. 4525–4536, 2017.
[40] S. W. Akhtar, S. Rehman, M. Akhtar, M. A. Khan, F. Riaz, Q. Chaudry, and R. Young, “Improving the robustness of neural networks using k-support norm based adversarial training,” IEEE Access, vol. 4, pp. 9501–9511, 2016.
[41] S.W. Fu, Y. Tsao, X. Lu, and H. Kawai, “Raw waveform-based speech enhancement by fully convolutional networks,” in Proc. APSIPA, pp. 6–12, 2017.
[42] Y. Kim and B. Toomajian, “Hand gesture recognition using micro-Doppler signatures with convolutional neural network,” IEEE Access, vol. 4, pp. 7125–7130, 2016.
[43] Y. Wang and D. Wang, “Towards scaling up classification-based speech separation,” IEEE Trans. Audio, Speech, Language Process., vol. 21, no. 7, pp. 1381–1390, 2013.
[44] N. Wang, M. J. Er, and M. Han, “Generalized single hidden layer feedforward networks for regression problems,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 6, pp. 1161–1176, 2015.
[45] Y. Xu, J. Du, L.R. Dai, and C.H. Lee, “A regression approach to speech enhancement based on deep neural networks,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 23, no. 1, pp. 7–19, 2015.
[46] X. Feng, Y. Zhang, and J. Glass, “Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition,” in Proc. ICASSP, pp. 1759–1763, 2014.
[47] J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach, “An overview of noise-robust automatic speech recognition,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 22, no. 4, pp. 745–777, 2014.
[48] S. M. Siniscalchi and V. M. Salerno, “Adaptation to new microphones using artificial neural networks with trainable activation functions,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 8, pp. 1959–1965, 2017.
[49] Q. Jin, T. Schultz, and A. Waibel, “Far-field speaker recognition,” IEEE Trans. Audio, Speech, Language Process., vol. 15, no. 7, pp. 2023–2032, 2007.
[50] X. Zhao, Y. Wang, and D. Wang, “Robust speaker identification in noisy and reverberant conditions,” IEEE/ACM Trans. Audio, Speech and Language Process., vol. 22, no. 4, pp. 836–845, 2014.
[51] S. O. Sadjadi and J. H. Hansen, “Hilbert envelope based features for robust speaker identification under reverberant mismatched conditions,” in Proc. ICASSP, pp. 5448–5451, 2011.
[52] K. Kokkinakis, O. Hazrati, and P. C. Loizou, “A channel selection criterion for suppressing reverberation in cochlear implants,” J. Acoust. Soc. Am., vol. 129, no. 5, pp. 3221–3232, 2011.
[53] O. Hazrati, S. Omid Sadjadi, P. C. Loizou, and J. H. Hansen, “Simultaneous suppression of noise and reverberation in cochlear implants using a ratio masking strategy,” The Journal of the Acoustical Society of America, vol. 134, no. 5, pp. 3759–3765, 2013.
[54] J. Benesty, M. M. Sondhi, and Y. Huang, Springer handbook of speech processing. Springer, 2007.
[55] B. W. Gillespie, H. S. Malvar, and D. A. Florêncio, “Speech dereverberation via maximum kurtosis subband adaptive filtering,” in Proc. ICASSP, vol. 6, pp. 3701–3704, 2001.
[56] K. Kinoshita, M. Delcroix, T. Nakatani, and M. Miyoshi, “Suppression of late reverberation effect on speech signal using long term multiple-step linear prediction,” IEEE Trans. Audio, Speech, Language Process., vol. 17, no. 4, pp. 534–545, 2009.
[57] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.H. Juang, “Speech dereverberation based on variance normalized delayed linear prediction,” IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 7, pp. 1717–1731, 2010.
[58] T. Nakatani, M. Miyoshi, and K. Kinoshita, “Single-microphone blind dereverberation,” in Speech Enhancement, pp. 247–270, Springer, 2005.
[59] H. Attias, J. C. Platt, A. Acero, and L. Deng, “Speech denoising and dereverberation using probabilistic models,” in Proc. NIPS, pp. 758–764, 2001.
[60] J.T. Chien and Y.C. Chang, “Bayesian learning for speech dereverberation,” in Proc. MLSP, pp. 1–6, 2016.
[61] D. Bees, M. Blostein, and P. Kabal, “Reverberant speech enhancement using cepstral processing,” in Proc. ICASSP, pp. 977–980, 1991.
[62] K. Lebart, J.M. Boucher, and P. Denbigh, “A new method based on spectral subtraction for speech dereverberation,” Acta Acustica, vol. 87, no. 3, pp. 359–366, 2001.
[63] M. Miyoshi and Y. Kaneda, “Inverse filtering of room acoustics,” IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 2, pp. 145–152, 1988.
[64] J. Flanagan, J. Johnston, R. Zahn, and G. Elko, “Computer steered microphone arrays for sound transduction in large rooms,” J. Acoust. Soc. Am., vol. 78, no. 5, pp. 1508–1518, 1985.
[65] J. L. Flanagan, A. C. Surendran, and E.E. Jan, “Spatially selective sound capture for speech and audio processing,” Speech Commun., vol. 13, no. 1–2, pp. 207–222, 1993.
[66] T. J. Cox, F. Li, and P. Darlington, “Extracting room reverberation time from speech using artificial neural networks,” Jour. Audio Eng. Soc., vol. 49, no. 4, pp. 219–230, 2001.
[67] J. Qi, J. Du, S. M. Siniscalchi, and C.H. Lee, “A theory on deep neural network-based vector to vector regression with an illustration of its expressive power in speech enhancement,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 27, no. 12, pp. 1932–1943, 2019.
[68] T. Ishii, H. Komiyama, T. Shinozaki, Y. Horiuchi, and S. Kuroiwa, “Reverberant speech recognition based on denoising autoencoder,” in Proc. INTERSPEECH, pp. 3512–3516, 2013.
[69] Z. Zhang, J. Pinto, C. Plahl, B. Schuller, and D. Willett, “Channel mapping using bidirectional long short-term memory for dereverberation in hands-free voice-controlled devices,” IEEE Trans. Consum. Electron., vol. 60, no. 3, pp. 525–533, 2014.
[70] F. Weninger, S. Watanabe, J. Le Roux, J. Hershey, Y. Tachioka, J. Geiger, B. Schuller, and G. Rigoll, “The MERL/MELCO/TUM system for the REVERB challenge using deep recurrent neural network feature enhancement,” in Proc. REVERB Workshop, 2014.
[71] A. Schwarz, C. Huemmer, R. Maas, and W. Kellermann, “Spatial diffuseness features for DNN-based speech recognition in noisy and reverberant environments,” in Proc. ICASSP, pp. 4380–4384, 2015.
[72] K. Han, Y. Wang, D. Wang, W. S. Woods, I. Merks, and T. Zhang, “Learning spectral mapping for speech dereverberation and denoising,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 23, no. 6, pp. 982–992, 2015.
[73] X. Xiao, S. Zhao, D. H. H. Nguyen, X. Zhong, D. L. Jones, E. S. Chng, and H. Li, “Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation,” EURASIP J. Adv. Signal Process., vol. 2016, no. 1, p. 4, 2016.
[74] B. Wu, K. Li, M. Yang, and C.H. Lee, “A reverberation time-aware approach to speech dereverberation based on deep neural networks,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 25, no. 1, pp. 102–111, 2017.
[75] D. S. Williamson and D. Wang, “Time-frequency masking in the complex domain for speech dereverberation and denoising,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 25, no. 7, pp. 1492–1501, 2017.
[76] T. Shimamura and T. Tomikura, “Quality improvement of bone-conducted speech,” in Proc. ECCTD, vol. 3, pp. III–73, IEEE, 2005.
[77] Z. Zhang, Z. Liu, M. Sinclair, A. Acero, L. Deng, J. Droppo, X. Huang, and Y. Zheng, “Multisensory microphones for robust speech detection, enhancement, and recognition,” in Proc. ICASSP, vol. 3, pp. iii–781, IEEE, 2004.
[78] Y. Zheng, Z. Liu, Z. Zhang, M. Sinclair, J. Droppo, L. Deng, A. Acero, and X. Huang, “Air and bone conductive integrated microphones for robust speech detection and enhancement,” in Proc. ASRU, pp. 249–254, IEEE, 2003.
[79] M. Graciarena, H. Franco, K. Sonmez, and H. Bratt, “Combining standard and throat microphones for robust speech recognition,” IEEE Signal Processing Letters, vol. 10, no. 3, pp. 72–74, 2003.
[80] T. V. Thang, K. Kimura, M. Unoki, and M. Akagi, “A study on the restoration of bone-conducted speech with MTF-based and LP-based models,” Journal of Signal Processing, 2006.
[81] Y. Tajiri, H. Kameoka, and T. Toda, “A noise suppression method for body-conducted soft speech based on the nonnegative tensor factorization of air- and body-conducted signals,” in Proc. ICASSP, pp. 4960–4964, IEEE, 2017.
[82] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in Proc. ICML, pp. 689–696, 2011.
[83] Y. Mroueh, E. Marcheret, and V. Goel, “Deep multimodal learning for audio-visual speech recognition,” in Proc. ICASSP, pp. 2130–2134, 2015.
[84] S. Tamura, H. Ninomiya, N. Kitaoka, S. Osuga, Y. Iribe, K. Takeda, and S. Hayamizu, “Audio-visual speech recognition using deep bottleneck features and high-performance lipreading,” in Proc. APSIPA, pp. 575–582, 2015.
[85] J.C. Hou, S.S. Wang, Y.H. Lai, Y. Tsao, H.W. Chang, and H.M. Wang, “Audio-visual speech enhancement using multimodal deep convolutional neural networks,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 2, pp. 117–128, 2018.
[86] A. Gabbay, A. Shamir, and S. Peleg, “Visual speech enhancement,” in Proc. INTERSPEECH, pp. 1170–1174, 2018.
[87] D. Michelsanti, Z.H. Tan, S. Sigurdsson, and J. Jensen, “Effects of Lombard reflex on the performance of deep learning-based audio-visual speech enhancement systems,” in Proc. ICASSP, pp. 6615–6619, 2019.
[88] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, “Looking to listen at the cocktail party: A speaker-independent audiovisual model for speech separation,” arXiv preprint arXiv:1804.03619, 2018.
[89] K. Hwang and W. Sung, “Fixed-point feedforward deep neural network design using weights +1, 0, and −1,” in Proc. SiPS, pp. 1–6, 2014.
[90] R. Prabhavalkar, O. Alsharif, A. Bruguier, and L. McGraw, “On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embedded speech recognition,” in Proc. ICASSP, pp. 5970–5974, 2016.
[91] Y.T. Hsu, Y.C. Lin, S.W. Fu, Y. Tsao, and T.W. Kuo, “A study on speech enhancement using exponent-only floating-point quantized neural network (EOFP-QNN),” in Proc. SLT, pp. 566–573, 2018.
[92] R. Livni, S. Shalev-Shwartz, and O. Shamir, “On the computational efficiency of training neural networks,” in Proc. NIPS, pp. 855–863, 2014.
[93] L. Perez and J. Wang, “The effectiveness of data augmentation in image classification using deep learning,” arXiv preprint arXiv:1712.04621, 2017.
[94] T. Hussain, S. M. Siniscalchi, C.C. Lee, S.S. Wang, Y. Tsao, and W.H. Liao, “Experimental study on extreme learning machine applications for speech enhancement,” IEEE Access, vol. 5, pp. 25542–25554, 2017.
[95] G.B. Huang, Q.Y. Zhu, and C.K. Siew, “Extreme learning machine: theory and applications,” Neurocomputing, vol. 70, no. 1, pp. 489–501, 2006.
[96] Z. Huang, Y. Yu, J. Gu, and H. Liu, “An efficient method for traffic sign recognition based on extreme learning machine,” IEEE Trans. Cybern., vol. 47, no. 4, pp. 920–933, 2017.
[97] J. Tang, C. Deng, and G.B. Huang, “Extreme learning machine for multilayer perceptron,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 4, pp. 809–821, 2016.
[98] F. Sun, C. Liu, W. Huang, and J. Zhang, “Object classification and grasp planning using visual and tactile sensing,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 46, no. 7, pp. 969–979, 2016.
[99] L. L. C. Kasun, H. Zhou, G.B. Huang, and C. M. Vong, “Representational learning with ELMs for big data,” 2013.
[100] W. Zhao, T. H. Beach, and Y. Rezgui, “Optimization of potable water distribution and wastewater collection networks: A systematic review and future research directions,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 46, no. 5, pp. 659–681, 2016.
[101] D. Wang, L. Bischof, R. Lagerstrom, V. Hilsenstein, A. Hornabrook, and G. Hornabrook, “Automated opal grading by imaging and statistical learning,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 46, no. 2, pp. 185–201, 2016.
[102] N. Wang, M. J. Er, and M. Han, “Parsimonious extreme learning machine using recursive orthogonal least squares,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 10, pp. 1828–1841, 2014.
[103] D. Liu, Q. Wei, and P. Yan, “Generalized policy iteration adaptive dynamic programming for discrete-time nonlinear systems,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 45, no. 12, pp. 1577–1591, 2015.
[104] G.B. Huang, H. Zhou, X. Ding, and R. Zhang, “Extreme learning machine for regression and multiclass classification,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 42, no. 2, pp. 513–529, 2012.
[105] D. Pearce and J. Picone, “Aurora working group: DSR front end LVCSR evaluation au/384/02,” Inst. for Signal & Inform. Process., Mississippi State Univ., Tech. Rep, 2002.
[106] D. D. Lee and H. S. Seung, “Algorithms for nonnegative matrix factorization,” in Advances in neural information processing systems, pp. 556–562, 2001.
[107] L. Finesso and P. Spreij, “Nonnegative matrix factorization and I-divergence alternating minimization,” Linear Algebra and its Applications, vol. 416, no. 2–3, pp. 270–287, 2006.
[108] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech enhancement based on deep denoising autoencoder,” in Proc. INTERSPEECH, pp. 436–440, 2013.
[109] J. Martens, “Deep learning via Hessian-free optimization,” in Proc. ICML, pp. 735–742, 2010.
[110] G.B. Huang, L. Chen, C. K. Siew, et al., “Universal approximation using incremental constructive feedforward networks with random hidden nodes,” IEEE Trans. Neural Networks, vol. 17, no. 4, pp. 879–892, 2006.
[111] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
[112] ITU-T Recommendation, “Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” Rec. ITU-T P.862, 2001.
[113] S. Quackenbush, T. Barnwell, and M. Clements, “Objective measures of speech quality,” Englewood Cliffs, NJ: Prentice-Hall, 1988.
[114] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time-frequency weighted noisy speech,” IEEE Trans. Audio, Speech, Language Process., vol. 19, no. 7, pp. 2125–2136, 2011.
[115] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean square error short-time spectral amplitude estimator,” IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 6, pp. 1109–1121, 1984.
[116] Y. Hu and P. C. Loizou, “A generalized subspace approach for enhancing speech corrupted by colored noise,” IEEE Trans. Speech Audio Process., vol. 11, no. 4, pp. 334–341, 2003.
[117] C. Sun, Q. Zhang, J. Wang, and J. Xie, “Noise reduction based on robust principal component analysis,” Journal of Computational Information Systems, vol. 10, no. 10, pp. 4403–4410, 2014.
[118] R. Minhas, A. Baradarani, S. Seifzadeh, and Q. J. Wu, “Human action recognition using extreme learning machine based on visual vocabularies,” Neurocomputing, vol. 73, no. 10–12, pp. 1906–1917, 2010.
[119] Y. Lan, Z. Hu, Y. C. Soh, and G.B. Huang, “An extreme learning machine approach for speaker recognition,” Neural Comput. Appl., vol. 22, no. 3–4, pp. 417–425, 2013.
[120] B. O. Odelowo and D. V. Anderson, “A framework for speech enhancement using extreme learning machines,” in Proc. WASPAA, pp. 1956–1960, 2017.
[121] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1,” Tech. Rep., vol. 93, 1993.
[122] L. L. Wong, S. D. Soli, S. Liu, N. Han, and M.W. Huang, “Development of the Mandarin hearing in noise test (MHINT),” Ear and Hearing, vol. 28, no. 2, pp. 70S–74S, 2007.
[123] K. Kinoshita, M. Delcroix, T. Yoshioka, T. Nakatani, A. Sehr, W. Kellermann, and R. Maas, “The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech,” in Proc. WASPAA, pp. 1–4, 2013.
[124] W.J. Lee, S.S. Wang, F. Chen, X. Lu, S.Y. Chien, and Y. Tsao, “Speech dereverberation based on integrated deep and ensemble learning algorithm,” in Proc. ICASSP, pp. 5454–5458, 2018.
[125] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Ensemble modeling of denoising autoencoder for speech spectrum restoration,” in Proc. INTERSPEECH, 2014.
[126] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” arXiv preprint arXiv:1505.00387, 2015.
[127] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hoekstra, “Perceptual evaluation of speech quality (PESQ), a new method for speech quality assessment of telephone networks and codecs,” in Proc. ICASSP, vol. 2, pp. 749–752, 2001.
[128] Y. Hu and P. C. Loizou, “Evaluation of objective quality measures for speech enhancement,” IEEE Trans. Audio, Speech, Language Process., vol. 16, no. 1, pp. 229–238, 2008.
[129] T. H. Falk, C. Zheng, and W.Y. Chan, “A non-intrusive quality and intelligibility measure of reverberant and dereverberated speech,” IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 7, pp. 1766–1774, 2010.
[130] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, pp. 770–778, 2016.
[131] M. Wu and D. Wang, “A two-stage algorithm for one microphone reverberant speech enhancement,” IEEE Trans. Audio, Speech, Language Process., vol. 14, no. 3, pp. 774–784, 2006.
[132] A. Schwarz and W. Kellermann, “Coherent to diffuse power ratio estimation for dereverberation,” IEEE Trans. Audio, Speech, Language Process., vol. 23, no. 6, pp. 1006–1018, 2015.
[133] Y.H. Lai, Y. Tsao, and F. Chen, “Effects of adaptation rate and noise suppression on the intelligibility of compressed envelope based speech,” PloS one, vol. 10, no. 7, p. e0133519, 2015.
[134] S.W. Fu, P.C. Li, Y.H. Lai, C.C. Yang, L.C. Hsieh, and Y. Tsao, “Joint dictionary learning-based nonnegative matrix factorization for voice conversion to improve speech intelligibility after oral surgery,” IEEE Trans. Biomed. Eng., vol. 64, no. 11, pp. 2584–2594, 2017.
[135] American National Standards Institute, American National Standard: Methods for Calculation of the Speech Intelligibility Index. Acoustical Society of America, 1997.
[136] R. van Hoesel, M. Böhm, R. D. Battmer, J. Beckschebe, and T. Lenarz, “Amplitude mapping effects on speech intelligibility with unilateral and bilateral cochlear implants,” Ear and Hearing, vol. 26, no. 4, pp. 381–388, 2005.
[137] T. Robinson, J. Fransen, D. Pye, J. Foote, and S. Renals, “WSJCAM0: A British English speech corpus for large vocabulary continuous speech recognition,” in Proc. ICASSP, vol. 1, pp. 81–84, 1995.
[138] G. Hinton, N. Srivastava, and K. Swersky, “RMSProp: Divide the gradient by a running average of its recent magnitude,” Neural Netw. Mach. Learn., Coursera Lecture 6e, 2012.
[139] M. Jeub, M. Schafer, and P. Vary, “A binaural room impulse response database for the evaluation of dereverberation algorithms,” in Proc. Dig. Signal Process, pp. 1–5, 2009.
[140] Linguistic Data Consortium, “CSR-II (WSJ1) complete,” LDC94S13A, Philadelphia, PA, 1994.
[141] J. Li, M.T. Luong, and D. Jurafsky, “A hierarchical neural autoencoder for paragraphs and documents,” in Proc. ACL, pp. 1106–1115, 2015.
[142] Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for skeleton-based action recognition,” in Proc. CVPR, pp. 1110–1118, 2015.
[143] ITU, “Method for the subjective assessment of intermediate quality levels of coding systems (MUSHRA),” International Telecommunication Union, Recommendation BS.1534-1, 2003.
[144] T. Hussain, Y. Tsao, S. M. Siniscalchi, J.C. Wang, H.M. Wang, and W.H. Liao, “Bone conducted speech enhancement using hierarchical extreme learning machine,” in Proc. IWSDS 2019, to be published.
[145] S.W. Fu, Y. Tsao, and X. Lu, “SNR-aware convolutional neural network modeling for speech enhancement,” in Proc. INTERSPEECH, pp. 3768–3772, 2016.
[146] S.W. Fu, T.W. Wang, Y. Tsao, X. Lu, and H. Kawai, “End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 26, no. 9, pp. 1570–1584, 2018.
[147] Z. Chen, S. Watanabe, H. Erdogan, and J. R. Hershey, “Speech enhancement and recognition using multitask learning of long short-term memory recurrent neural networks,” in Proc. INTERSPEECH, 2015.
[148] S. Pascual, A. Bonafonte, and J. Serra, “SEGAN: Speech enhancement generative adversarial network,” arXiv preprint arXiv:1703.09452, 2017.
[149] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
[150] H.P. Liu, Y. Tsao, and C.S. Fuh, “Bone conducted speech enhancement using deep denoising autoencoder,” Speech Communication, vol. 104, pp. 106–112, 2018.
[151] Google, “Cloud speech API, https://cloud.google.com/speech/,” 2017.
[152] D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 26, no. 10, pp. 1702–1726, 2018.
[153] J. Ortega-García and J. González-Rodríguez, “Overview of speech enhancement techniques for automatic speaker recognition,” in Proc. ICSLP, pp. 929–932, 1996.
[154] M. Kolbæk, Z.H. Tan, and J. Jensen, “Speech intelligibility potential of general and specialized deep neural network-based speech enhancement systems,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 25, no. 1, pp. 153–167, 2017.
[155] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, “Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR,” in Proc. LVA/ICA, pp. 91–99, 2015.
[156] L. Sun, J. Du, L.R. Dai, and C.H. Lee, “Multiple target deep learning for LSTM-RNN based speech enhancement,” in Proc. HSCMA, pp. 136–140, 2017.
[157] J. M. Kates and K. H. Arehart, “The hearing aid speech perception index (HASPI),” Speech Communication, vol. 65, pp. 75–93, 2014.
[158] M. Huang, “Development of Taiwan Mandarin hearing in noise test,” Department of Speech-Language Pathology and Audiology, National Taipei University of Nursing and Health Science, 2005.
[159] P. Viola and M. J. Jones, “Robust real-time face detection,” International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
[160] T. Hussain, Y. Tsao, H.M. Wang, J.C. Wang, S. M. Siniscalchi, and W.H. Liao, “Compressed multimodal hierarchical extreme learning machine for speech enhancement,” in Proc. APSIPA 2019, to be published.
[161] D. L. Wang, “Deep learning reinvents the hearing aid,” IEEE Spectrum, pp. 32–37, Mar. 2017.
[162] Z. Zhao, H. Liu, and T. Fingscheidt, “Convolutional neural networks to enhance coded speech,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 27, no. 4, pp. 663–678, 2019.
[163] S. Doclo, M. Moonen, T. Van den Bogaert, and J. Wouters, “Reduced bandwidth and distributed MWF-based noise reduction algorithms for binaural hearing aids,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 1, pp. 38–51, 2009.
[164] Z.Q. Wang and D. Wang, “A joint training framework for robust automatic speech recognition,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 24, no. 4, pp. 796–806, 2016.
[165] D. Michelsanti and Z.H. Tan, “Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification,” in Proc. INTERSPEECH, pp. 2008–2012, 2017.
[166] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust., Speech, Signal Process., vol. 27, no. 2, pp. 113–120, 1979.
[167] K. W. Wilson, B. Raj, P. Smaragdis, and A. Divakaran, “Speech denoising using nonnegative matrix factorization with priors,” in Proc. ICASSP, pp. 4029–4032, 2008.
[168] S.W. Fu, Y. Tsao, and X. Lu, “SNR-aware convolutional neural network modeling for speech enhancement,” in Proc. INTERSPEECH, pp. 3768–3772, 2016.
[169] S.W. Fu, T.Y. Hu, Y. Tsao, and X. Lu, “Complex spectrogram enhancement by a convolutional neural network with multi-metric learning,” in Proc. MLSP, pp. 1–6, 2017.
[170] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, “Phase-sensitive and recognition boosted speech separation using deep recurrent neural networks,” in Proc. ICASSP, pp. 708–712, 2015.
[171] W. Li, S. Wang, M. Lei, S. M. Siniscalchi, and C.H. Lee, “Improving audio-visual speech recognition performance with cross-modal student teacher training,” in Proc. ICASSP, pp. 6560–6564, 2019.
[172] M. Courbariaux, Y. Bengio, and J.P. David, “BinaryConnect: Training deep neural networks with binary weights during propagations,” in Proc. NIPS, pp. 3123–3131, 2015.
[173] C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained ternary quantization,” arXiv preprint arXiv:1612.01064, 2016.
[174] ANSI/IEEE, “IEEE standard for binary floating-point arithmetic,” ANSI/IEEE Std 754-1985, 1985.
[175] P.S. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson, “Singing voice separation from monaural recordings using robust principal component analysis,” in Proc. ICASSP, pp. 57–60, 2012.
DOI 10.6814/NCCU202000466