Please use this identifier to cite or link to this item: https://ah.lib.nccu.edu.tw/handle/140.119/139218
Title: On the Robustness of Cross-Lingual Speaker Recognition Using Transformer-Based Approaches
Author: Chen, Wei-Yu (陳威妤)
Contributors: Liao, Wen-Hung (廖文宏)
Chen, Wei-Yu (陳威妤)
Keywords: Speaker Recognition
Cross-lingual Dataset
Deep Neural Networks
Transformer
Adversarial Attack
Date: 2022
Upload Date: 1-Mar-2022
Abstract: Speaker recognition is widely used in daily life, ranging from voice assistants to criminal forensics. With the rapid progress of deep learning, the accuracy of speaker recognition has improved accordingly. However, most studies focus on a single language; cross-lingual speaker recognition is rarely investigated, and cross-lingual datasets are scarce. This study collected a trilingual (Mandarin, English, and Taiwanese) dataset named MET-40, in which each participant recorded in all three languages. Forty participants (20 male, 20 female) contributed a total of 740 minutes of audio. The Mandarin, English, and Taiwanese texts are mainly taken from elementary school textbooks, with some English texts drawn from the TIMIT corpus. Each participant's fluency in each language was evaluated after recording.

We employ transformer- and convolution-based architectures, namely ResNet, Vision Transformer (ViT), and Convolutional Vision Transformer (CvT), in combination with three common acoustic features (spectrogram, Mel spectrogram, and Mel-frequency cepstral coefficients, MFCC) for single-language, mixed-language, and cross-lingual speaker recognition. The two multilingual settings differ in whether the test language appears in the training data: in the mixed-language setting it does, while in the cross-lingual setting it does not. On MET-40, single-language models reach up to 97.16% accuracy, and models trained on two or more languages reach up to 99.17%. In cross-lingual recognition, training on more languages improves generalization: single-language models achieve at most 79.64% accuracy, whereas models trained on two or more languages reach 90.92%. CvT is the least sensitive to the choice of acoustic feature and generalizes best overall across the single-language, mixed-language, and cross-lingual settings.

Model robustness is critical to security in practical applications, so we also analyze how adversarial attacks affect the different speaker recognition models. When the same language is used for training and testing, both FGSM and PGD generate effective attacks. We further examine the transferability of cross-lingual attacks: attacks against models using spectrogram or Mel-spectrogram features are not transferable, whereas MFCC features, despite their excellent performance on the recognition task, remain vulnerable. Attacks can be generated even without access to the training data, and cross-lingual FGSM attacks reduce the recognition rate by 31.57% on average; models using MFCC features therefore require extra protection.
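The following is a minimal sketch of how the three acoustic features named in the abstract (spectrogram, Mel spectrogram, and MFCC) could be extracted with librosa. The parameter values (sampling rate, n_fft, hop_length, n_mels, n_mfcc) are illustrative assumptions, not the settings used in the thesis.

# Sketch: extracting spectrogram, Mel spectrogram and MFCC features with librosa.
# All parameter values below are illustrative assumptions.
import numpy as np
import librosa

def extract_features(wav_path, sr=16000, n_fft=512, hop_length=160):
    y, sr = librosa.load(wav_path, sr=sr)

    # Linear-frequency spectrogram: magnitude of the STFT, converted to dB.
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    spectrogram = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

    # Mel spectrogram: power spectrogram warped onto the Mel scale, in dB.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=80)
    mel_spectrogram = librosa.power_to_db(mel, ref=np.max)

    # Mel-frequency cepstral coefficients.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_fft=n_fft,
                                hop_length=hop_length, n_mfcc=40)

    return spectrogram, mel_spectrogram, mfcc

In a setup like the one described, each of these 2-D feature maps can then be treated as an image and fed to ResNet, ViT, or CvT for speaker classification.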
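The FGSM attack mentioned in the abstract perturbs an input in the direction of the sign of the loss gradient. Below is a minimal PyTorch sketch, assuming a classifier that takes a 2-D feature map (e.g. a Mel spectrogram) as input; the epsilon value and cross-entropy loss are assumptions, not the exact configuration used in the thesis.

# Sketch of the Fast Gradient Sign Method (FGSM): x' = x + epsilon * sign(grad_x L).
import torch
import torch.nn.functional as F

def fgsm_attack(model, features, labels, epsilon=0.01):
    # Enable gradient tracking on the input features.
    features = features.clone().detach().requires_grad_(True)

    logits = model(features)
    loss = F.cross_entropy(logits, labels)

    model.zero_grad()
    loss.backward()

    # Single gradient-sign step in the direction that increases the loss.
    return (features + epsilon * features.grad.sign()).detach()

PGD, the other attack evaluated, can be viewed as iterating this gradient-sign step several times with a smaller step size, projecting back onto an epsilon-ball around the original input after each step.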
Description: Master's thesis
National Chengchi University
Department of Computer Science
108753131
Source: http://thesis.lib.nccu.edu.tw/record/#G0108753131
Type: thesis
Appears in Collections: Theses

Files in This Item:
313101.pdf (4.79 MB, Adobe PDF)
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.