Title: Evaluation of Cross-Lingual Speaker Recognition using SSL-Based Acoustic Features (自監督式聲音特徵在跨語言語者辨識的表現評估)
Author: Chen, Po-Han (陳柏翰)
Advisor: Liao, Wen-Hung (廖文宏)
Keywords: Deep learning; Cross-lingual speaker recognition; Self-supervised learning; Acoustic feature
Date: 2024
Uploaded: 5-Aug-2024 12:46:26 (UTC+8)

Abstract:
Speaker recognition, as a form of biometric identification, has been widely integrated into daily life, for example in security systems and voice assistants. Most prior speaker recognition research has focused on scenarios where the speaker uses a single language. However, as more people use two or more languages in daily life, recognition errors may occur when a speaker is verified in a language different from the one used at enrollment, which calls for cross-lingual speaker recognition models. In recent years, self-supervised learning (SSL) models have shown the ability to learn general-purpose features from large amounts of unlabeled data, but how these pretrained features perform in cross-lingual speaker recognition, compared with spectrograms and Mel-frequency cepstral coefficients (MFCCs), remains to be evaluated.

In this thesis, we use pretrained SSL deep learning models to convert audio into acoustic features and evaluate them for cross-lingual speaker recognition; we also analyze the effect of data augmentation on these features. Specifically, we feed raw audio into pretrained SSL models to produce embedding vectors as acoustic features, and then analyze cross-lingual performance with a ResNet-based speaker recognition model. We evaluate this approach on MET-120, a dataset collected in our laboratory containing recordings from 120 speakers, using the SSL models Wav2Vec 2.0 and BEATs to obtain acoustic features. The fine-tuned Wav2Vec 2.0 model achieved an average accuracy above 90% on MET-120, yielding excellent and stable results, while among models used without fine-tuning, BEATs performed best on MET-120. We also observed that whether the test language is the speaker's native language, as well as the speaker's gender, can affect recognition performance. In the data augmentation experiments, we applied methods recently used on audio data, SpecAugment and ShuffleAugment, to cross-lingual testing. The results show that the latter improves cross-lingual recognition more effectively, and combining it with feature dimensionality reduction yields the best augmentation results, improving performance in both same-language and cross-lingual tests. Finally, in cross-lingual attack tests with synthetic speech, we found that such advanced synthetic speech cannot easily mount a confusion attack via feature transfer against recognition models that use embedding features, indicating their robustness to this type of attack.
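The processing pipeline described in the abstract, raw audio fed into a pretrained SSL model to obtain an embedding vector that a downstream model then classifies by speaker, can be sketched as follows. This is a minimal illustration only: the checkpoint name "facebook/wav2vec2-base", the mean pooling over time, the hypothetical file name, and the single linear classification layer are assumptions made for the example; the thesis itself uses a ResNet-based speaker recognition model and also evaluates BEATs features.

```python
# Minimal sketch of the embedding pipeline described in the abstract.
# Assumptions (not from the thesis): the "facebook/wav2vec2-base" checkpoint,
# mean pooling over time, and a plain linear classifier standing in for
# the ResNet-based speaker recognition model.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
ssl_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def speaker_embedding(wav_path: str) -> torch.Tensor:
    """Convert one utterance into a fixed-length SSL embedding."""
    waveform, sr = torchaudio.load(wav_path)
    waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)
    inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden = ssl_model(**inputs).last_hidden_state   # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0)                 # mean-pool over time -> (768,)

# Hypothetical downstream classifier over the enrolled speakers; the thesis
# uses a ResNet architecture here rather than a single linear layer.
num_speakers = 120  # MET-120 contains 120 speakers
classifier = torch.nn.Linear(768, num_speakers)
logits = classifier(speaker_embedding("enroll_mandarin_001.wav"))  # hypothetical file
```

In a cross-lingual test of the kind evaluated in the thesis, enrollment embeddings would come from utterances in one language and test embeddings from another, with the classifier trained only on the enrollment language.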
Description: Master's thesis, National Chengchi University, Department of Computer Science, 111753208
Identifier: G0111753208
Source: http://thesis.lib.nccu.edu.tw/record/#G0111753208
URI: https://nccur.lib.nccu.edu.tw/handle/140.119/152575
Type: thesis
Format: application/pdf, 4211382 bytes
Table of contents:
Chapter 1 Introduction
1.1 Background and Motivation
1.2 Objectives and Contributions
1.3 Thesis Organization
Chapter 2 Technical Background and Related Work
2.1 Cross-Lingual Speaker Recognition
2.2 MET-40
2.3 Self-Supervised Learning (SSL)
2.4 Wav2Vec 2.0
2.5 BEATs
2.6 Data Augmentation
2.7 Seamless Expressive
Chapter 3 Methodology
3.1 Cross-Lingual Datasets and Preprocessing
3.1.1 MET-40
3.1.2 MET-120
3.1.3 Dataset Preprocessing
3.2 Acoustic Feature Generation and Speaker Recognition Model
3.2.1 Acoustic Feature Generation
3.2.2 Speaker Recognition Model
Chapter 4 Experimental Results
4.1 Parameter Settings
4.2 Recognition Results with Speaker Embedding Features
4.2.1 Comparison with Common Acoustic Features
4.2.2 Results on the MET-40 Dataset
4.2.3 Results on the MET-120 Dataset
4.2.4 Mixed-Language Results
4.2.5 Results with Balanced Trilingual Data
4.2.6 Results with Longer Audio Segments
4.2.7 Comparison of Male and Female Feature Differences
4.3 Comparison of Data Augmentation Results
4.3.1 Audio-Based Augmentation Methods
4.3.2 SpecAugment
4.3.3 ShuffleAugment
4.3.4 2D-2D PCA
4.4 Speech Synthesis Attacks
4.4.1 Test Results
Chapter 5 Conclusions and Future Work
References
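Chapter 4.3 compares the two augmentation methods named in the abstract: SpecAugment, which masks random time and frequency bands, and ShuffleAugment, which shuffles the temporal order of segments. A minimal sketch of the two operations on a generic (frames x bins) feature matrix is given below; the mask widths, segment count, and random seed are illustrative choices, not the settings used in the thesis.

```python
# Illustrative versions of the two augmentations compared in Chapter 4.3.
# Mask widths and segment counts are arbitrary choices, not the thesis settings.
import numpy as np

rng = np.random.default_rng(0)

def spec_augment(feat: np.ndarray, max_t: int = 20, max_f: int = 8) -> np.ndarray:
    """SpecAugment-style masking: zero out one random time band and one
    random frequency band of a (frames, bins) feature matrix."""
    out = feat.copy()
    t0 = rng.integers(0, feat.shape[0] - max_t)
    f0 = rng.integers(0, feat.shape[1] - max_f)
    out[t0:t0 + rng.integers(1, max_t + 1), :] = 0.0   # time mask
    out[:, f0:f0 + rng.integers(1, max_f + 1)] = 0.0   # frequency mask
    return out

def shuffle_augment(feat: np.ndarray, n_segments: int = 4) -> np.ndarray:
    """ShuffleAugment-style time shuffling: split the frame axis into
    segments and reassemble them in random order."""
    segments = np.array_split(feat, n_segments, axis=0)
    rng.shuffle(segments)
    return np.concatenate(segments, axis=0)

features = rng.standard_normal((300, 80))   # e.g. 300 frames x 80 bins
augmented = shuffle_augment(spec_augment(features))
```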