Title Unveiling the Potential of SSL-Generated Audio Embeddings for Cross-Lingual Speaker Recognition
Authors 廖文宏 (Liao, Wen-Hung); Chen, Po-Han; Wu, Yi-Chieh
Contributor 資訊系 (Department of Computer Science)
Keywords Cross-lingual speaker recognition; Self-supervised learning; Audio embeddings; Data augmentation for audio
Date 2024-12
Upload time 19-May-2025 11:44:32 (UTC+8)
Abstract This research explores the effectiveness of SSL-based audio embeddings for cross-lingual speaker recognition. We collected speech data from 120 participants to form a dataset named MET-120, in which each participant recorded utterances in three languages (Mandarin, English, and Taiwanese). We then employ self-supervised learning (SSL) pre-trained models, including Wav2vec 2.0 and BEATs, to extract audio features that characterize each speaker, and train a simple residual neural network (ResNet) to perform cross-lingual speaker recognition. Experimental results show that the fine-tuned Wav2vec 2.0 model achieves over 90% average performance on MET-120, the best overall result. Without fine-tuning, BEATs achieves 80% average performance, suggesting that its embeddings may serve as a soft biometric in cross-lingual scenarios. We also observe that a speaker's native or proficient language influences recognition results. Furthermore, we evaluate acoustic data augmentation schemes such as SpecAugment and ShuffleAugment; when combined with dimensionality-reduction techniques such as PCA, ShuffleAugment significantly improves performance in both same-language and cross-lingual tests.
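The abstract describes a pipeline of SSL embedding extraction followed by dimensionality reduction and a lightweight classifier. The following is a minimal sketch of how such embeddings might be obtained, not the authors' implementation: the checkpoint name (facebook/wav2vec2-base), the mean-pooling strategy, the file paths, and the PCA dimensionality are all illustrative assumptions.

```python
# Sketch: extract Wav2vec 2.0 utterance embeddings and reduce them with PCA
# before training a downstream (e.g., ResNet-style) speaker classifier.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.decomposition import PCA

# Pre-trained (not fine-tuned) Wav2vec 2.0 backbone; checkpoint name is an assumption.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def utterance_embedding(wav_path: str) -> torch.Tensor:
    """Return a fixed-length embedding by mean-pooling frame-level features."""
    waveform, sr = torchaudio.load(wav_path)
    mono = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)
    inputs = extractor(mono.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        frames = model(**inputs).last_hidden_state  # shape (1, T, 768)
    return frames.mean(dim=1).squeeze(0)            # shape (768,)

# Hypothetical usage: paths and PCA size are placeholders, not MET-120 specifics.
# embs = torch.stack([utterance_embedding(p) for p in utterance_paths])
# reduced = PCA(n_components=128).fit_transform(embs.numpy())
```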
Relation 2024 International Symposium on Multimedia (ISM), IEEE Technical Committee on Multimedia (TCMC)
Type conference
DOI https://doi.org/10.1109/ISM63611.2024.00010
dc.contributor 資訊系 (Department of Computer Science)
dc.creator (Author) 廖文宏
dc.creator (Author) Liao, Wen-Hung; Chen, Po-Han; Wu, Yi-Chieh
dc.date (Date) 2024-12
dc.date.accessioned 19-May-2025 11:44:32 (UTC+8)
dc.date.available 19-May-2025 11:44:32 (UTC+8)
dc.date.issued (Upload time) 19-May-2025 11:44:32 (UTC+8)
dc.identifier.uri (URI) https://nccur.lib.nccu.edu.tw/handle/140.119/157013
dc.description.abstract (Abstract) This research explores the effectiveness of SSL-based audio embeddings for cross-lingual speaker recognition. We collected speech data from 120 participants to form a dataset named MET-120, in which each participant recorded utterances in three languages (Mandarin, English, and Taiwanese). We then employ self-supervised learning (SSL) pre-trained models, including Wav2vec 2.0 and BEATs, to extract audio features that characterize each speaker, and train a simple residual neural network (ResNet) to perform cross-lingual speaker recognition. Experimental results show that the fine-tuned Wav2vec 2.0 model achieves over 90% average performance on MET-120, the best overall result. Without fine-tuning, BEATs achieves 80% average performance, suggesting that its embeddings may serve as a soft biometric in cross-lingual scenarios. We also observe that a speaker's native or proficient language influences recognition results. Furthermore, we evaluate acoustic data augmentation schemes such as SpecAugment and ShuffleAugment; when combined with dimensionality-reduction techniques such as PCA, ShuffleAugment significantly improves performance in both same-language and cross-lingual tests.
dc.format.extent 107 bytes
dc.format.mimetype text/html-
dc.relation (Relation) 2024 International Symposium on Multimedia (ISM), IEEE Technical Committee on Multimedia (TCMC)
dc.subject (Keywords) Cross-lingual speaker recognition; Self-supervised learning; Audio embeddings; Data augmentation for audio
dc.title (Title) Unveiling the Potential of SSL-Generated Audio Embeddings for Cross-Lingual Speaker Recognition
dc.type (Type) conference
dc.identifier.doi (DOI) 10.1109/ISM63611.2024.00010
dc.doi.uri (DOI) https://doi.org/10.1109/ISM63611.2024.00010