Classification of Acoustic Scenes with Mixtures of Human Voice and Background Audio (混合人聲之聲音場景辨識)


Title	Classification of Acoustic Scenes with Mixtures of Human Voice and Background Audio (混合人聲之聲音場景辨識)
Author	Li, Yu-Guo (李御國)
Contributors	Liao, Wen-Hung (廖文宏); Li, Yu-Guo (李御國)
Keywords	Convolutional Neural Network; DCASE Dataset; Acoustic Scene Classification; Voice-based Online Identity Verification
Date	2020
Upload time	2-Sep-2020 13:15:07 (UTC+8)
Abstract	Sounds in everyday environments are never isolated events; they consist of multiple overlapping audio sources, which makes environmental sound recognition challenging. This research uses the audio data provided for Task 1 of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE2016) challenge, which comprises environmental recordings of 15 scenes such as beach and tram. These recordings are mixed with 16 human voices to create a new dataset, and the resulting voice-mixed scenes are analyzed and classified. Acoustic features are extracted as log-Mel spectrograms, which are widely used in sound recognition and are chosen to retain as much of the signal's acoustic information as possible. A convolutional neural network (CNN) is employed to distinguish these overlapping sound scenes. The model achieves an overall average accuracy of 79%, and 93% accuracy for the 'car' scene. We expect the approach to serve as a pre-processing stage for voiceprint recognition in voice-based online identity verification.
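Description	Master's thesis; National Chengchi University (國立政治大學); In-service Master's Program, Department of Computer Science (資訊科學系碩士在職專班); 105971016
Source	http://thesis.lib.nccu.edu.tw/record/#G0105971016
Type	thesis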
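
The abstract names the log-Mel spectrogram as the input feature. The following is a minimal sketch of that feature in plain NumPy, not the thesis's own code: the sample rate, FFT size, hop length, number of mel bands, and all function names are assumptions chosen for illustration. It follows the usual three steps: a short-time Fourier transform, a triangular mel filterbank, and a logarithm.

```python
import numpy as np

def hz_to_mel(f):
    # Common HTK-style mel formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters evenly spaced on the mel scale.

    Returns an array of shape (n_mels, n_fft // 2 + 1).
    """
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, center, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(signal, sr, n_fft=1024, hop=512, n_mels=40):
    """Return a (n_frames, n_mels) log-Mel spectrogram in dB."""
    # Step 1: short-time Fourier transform with a Hann window.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[t * hop : t * hop + n_fft] * window for t in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Step 2: project the power spectrum onto the mel filterbank.
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    # Step 3: log compression; the floor avoids log(0).
    return 10.0 * np.log10(np.maximum(mel, 1e-10))

# Usage on a synthetic 1-second, 440 Hz tone (stand-in for a real recording).
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
features = log_mel_spectrogram(tone, sr)
print(features.shape)  # (30, 40)
```

The resulting two-dimensional array can be treated like a single-channel image, which is why it pairs naturally with a convolutional network.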

dc.contributor.advisor	廖文宏	zh_TW
dc.contributor.advisor	Liao, Wen-Hung	en_US
dc.contributor.author (Authors)	李御國	zh_TW
dc.contributor.author (Authors)	Li, Yu-Guo	en_US
dc.creator (Author)	李御國	zh_TW
dc.creator (Author)	Li, Yu-Guo	en_US
dc.date (Date)	2020	en_US
dc.date.accessioned	2-Sep-2020 13:15:07 (UTC+8)	-
dc.date.available	2-Sep-2020 13:15:07 (UTC+8)	-
dc.date.issued (Upload time)	2-Sep-2020 13:15:07 (UTC+8)	-
dc.identifier (Other Identifiers)	G0105971016	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/131936	-
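dc.description (Description)	Master's degree (碩士)	zh_TW
dc.description (Description)	National Chengchi University (國立政治大學)	zh_TW
dc.description (Description)	In-service Master's Program, Department of Computer Science (資訊科學系碩士在職專班)	zh_TW
dc.description (Description)	105971016	zh_TW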
dc.description.abstract (Abstract)	Sounds in everyday environments are never isolated events; they consist of multiple overlapping audio sources, which makes environmental sound recognition challenging. This research uses the audio data provided for Task 1 of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE2016) challenge, which comprises environmental recordings of 15 scenes such as beach and tram. These recordings are mixed with 16 human voices to create a new dataset, and the resulting voice-mixed scenes are analyzed and classified. Acoustic features are extracted as log-Mel spectrograms, which are widely used in sound recognition and are chosen to retain as much of the signal's acoustic information as possible. A convolutional neural network (CNN) is employed to distinguish these overlapping sound scenes. The model achieves an overall average accuracy of 79%, and 93% accuracy for the 'car' scene. We expect the approach to serve as a pre-processing stage for voiceprint recognition in voice-based online identity verification.	en_US
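dc.description.tableofcontents	Chapter 1 Introduction; 1.1 Research Motivation; 1.2 Thesis Organization; Chapter 2 Related Work; 2.1 Literature Review; 2.2 Tools; Chapter 3 Methodology; 3.1 Basic Idea; 3.2 Preliminary Study; 3.2.1 Input Signal; 3.2.2 Short-Time Fourier Transform; 3.2.3 Mel Spectrogram; 3.2.4 Log-Mel Spectrogram; 3.3 Research Framework Design; 3.3.1 Problem Statement; 3.3.2 Research Framework; 3.3.3 Research Tools; 3.3.4 Preliminary Tests; 3.3.4.1 Audio Data Preprocessing; 3.3.4.2 Feature Description Settings; 3.3.4.3 Model Settings; 3.3.4.4 Initial Results and Feature Selection; 3.3.4.5 Model Selection; 3.3.4.6 Data Length Selection; 3.4 Objectives; Chapter 4 Research Process and Result Analysis; 4.1 Research Process; 4.1.1 Audio Preprocessing; 4.1.2 Volume Normalization; 4.1.3 Audio Mixing; 4.1.4 Feature Description; 4.1.5 Model Training; 4.2 Prediction Scenarios; 4.2.1 Scenario 1: Scene audio only (-20 dB); 4.2.2 Scenario 2: Scene volume (-20 dB) lower than voice volume (-13 dB); 4.2.3 Scenario 3: Scene volume (-20 dB) lower than voice volume (-13 dB); 4.2.4 Scenario 4: Scene volume (-20 dB) equal to voice volume (-20 dB); 4.2.5 Scenario 5: Scene volume (-20 dB) greater than voice volume (-35 dB); 4.3 Analysis and Discussion of Results; 4.3.1 Examining Predictions by Overall Accuracy; 4.3.2 Examining Predictions by Confusion Matrix; 4.4 Extended Discussion; Chapter 5 Conclusions and Future Work; 5.1 Conclusions; 5.2 Future Work; References; Appendix	en_US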
dc.format.extent	5105838 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0105971016	en_US
dc.subject (Keywords)	Convolutional Neural Network	en_US
dc.subject (Keywords)	DCASE Dataset	en_US
dc.subject (Keywords)	Acoustic Scene Classification	en_US
dc.subject (Keywords)	Voice-based Online Identity Verification	en_US
dc.title (Title)	混合人聲之聲音場景辨識	zh_TW
dc.title (Title)	Classification of Acoustic Scenes with Mixtures of Human Voice and Background Audio	en_US
dc.type (Type)	thesis	en_US
dc.identifier.doi (DOI)	10.6814/NCCU202001422	en_US
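
The table of contents in this record lists a volume-normalization step, an audio-mixing step, and prediction scenarios that pair a scene level of -20 dB with voice levels of -13 dB, -20 dB, and -35 dB. A minimal sketch of such a mix is given below. It is illustrative only: it assumes the levels are RMS levels relative to full scale (dBFS) on floating-point signals in [-1, 1], and the helper names `dbfs`, `normalize_to`, and `mix` are hypothetical, not taken from the thesis.

```python
import numpy as np

def dbfs(signal):
    """RMS level in dB relative to full scale (amplitude 1.0)."""
    rms = np.sqrt(np.mean(signal ** 2))
    return 20.0 * np.log10(max(rms, 1e-12))

def normalize_to(signal, target_db):
    """Apply a constant gain so the RMS level equals target_db dBFS."""
    gain_db = target_db - dbfs(signal)
    return signal * 10.0 ** (gain_db / 20.0)

def mix(scene, voice, scene_db=-20.0, voice_db=-13.0):
    """Overlay voice on scene after normalizing each to its target level."""
    n = min(len(scene), len(voice))
    mixed = normalize_to(scene[:n], scene_db) + normalize_to(voice[:n], voice_db)
    # Guard against samples leaving the valid range after summation.
    return np.clip(mixed, -1.0, 1.0)

# Synthetic stand-ins: noise for the scene, a tone for the voice.
sr = 16000
rng = np.random.default_rng(0)
scene = rng.normal(0.0, 0.05, sr)
voice = 0.5 * np.sin(2 * np.pi * 220.0 * np.arange(sr) / sr)

# Voice levels taken from the scenario titles in the table of contents.
for voice_db in (-13.0, -20.0, -35.0):
    mixed = mix(scene, voice, scene_db=-20.0, voice_db=voice_db)
    print(voice_db, round(dbfs(mixed), 1))
```

Normalizing both sources before summing keeps the scene-to-voice ratio under explicit control, so each scenario differs only in the chosen voice level.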
