Title Unveiling the Potential of SSL-Generated Audio Embeddings for Cross-Lingual Speaker Recognition
Authors 廖文宏 (Liao, Wen-Hung); Chen, Po-Han; Wu, Yi-Chieh
Contributor 資訊系 (Department of Computer Science)
Keywords Cross-lingual speaker recognition; Self-supervised learning; Audio embeddings; Data augmentation for audio
Date 2024-12
Upload time 19-May-2025 11:44:32 (UTC+8)
Abstract This research explores the effectiveness of SSL-based audio embeddings for cross-lingual speaker recognition. We collected speech data from 120 participants to form a dataset named MET-120, in which each participant recorded utterances in three languages (Mandarin, English, and Taiwanese). We then employ self-supervised learning (SSL) pre-trained models, including Wav2vec 2.0 and BEATs, to extract audio features that characterize each speaker, and train a simple residual neural network (ResNet) to perform cross-lingual speaker recognition. Experimental results show that the fine-tuned Wav2vec 2.0 model achieves over 90% average performance on MET-120, the best overall result. Without fine-tuning, BEATs achieves 80% average performance, suggesting that its embeddings may serve as a soft biometric in cross-lingual scenarios. We also observe that a speaker's native or proficient language influences recognition results. Furthermore, we evaluate acoustic data augmentation schemes such as SpecAugment and ShuffleAugment; when combined with dimensionality-reduction techniques such as PCA, ShuffleAugment significantly improves performance in both same-language and cross-lingual tests.
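The abstract describes a pipeline of SSL embedding extraction followed by dimensionality reduction and a lightweight classifier. The following is a minimal sketch of how such embeddings might be obtained, not the authors' implementation: the checkpoint name (facebook/wav2vec2-base), the mean-pooling strategy, the file paths, and the PCA dimensionality are all illustrative assumptions.

```python
# Sketch: extract Wav2vec 2.0 utterance embeddings and reduce them with PCA
# before training a downstream (e.g., ResNet-style) speaker classifier.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.decomposition import PCA

# Pre-trained (not fine-tuned) Wav2vec 2.0 backbone; checkpoint name is an assumption.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def utterance_embedding(wav_path: str) -> torch.Tensor:
    """Return a fixed-length embedding by mean-pooling frame-level features."""
    waveform, sr = torchaudio.load(wav_path)
    mono = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)
    inputs = extractor(mono.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        frames = model(**inputs).last_hidden_state  # shape (1, T, 768)
    return frames.mean(dim=1).squeeze(0)            # shape (768,)

# Hypothetical usage: paths and PCA size are placeholders, not MET-120 specifics.
# embs = torch.stack([utterance_embedding(p) for p in utterance_paths])
# reduced = PCA(n_components=128).fit_transform(embs.numpy())
```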
Relation 2024 International Symposium on Multimedia (ISM), IEEE Technical Committee on Multimedia (TCMC)
Type conference
DOI https://doi.org/10.1109/ISM63611.2024.00010
dc.contributor 資訊系 (Department of Computer Science)
dc.creator (Author) 廖文宏
dc.creator (Author) Liao, Wen-Hung; Chen, Po-Han; Wu, Yi-Chieh
dc.date (Date) 2024-12
dc.date.accessioned 19-May-2025 11:44:32 (UTC+8)
dc.date.available 19-May-2025 11:44:32 (UTC+8)
dc.date.issued (Upload time) 19-May-2025 11:44:32 (UTC+8)
dc.identifier.uri (URI) https://nccur.lib.nccu.edu.tw/handle/140.119/157013
dc.description.abstract (Abstract) This research explores the effectiveness of SSL-based audio embeddings for cross-lingual speaker recognition. We collected speech data from 120 participants to form a dataset named MET-120, in which each participant recorded utterances in three languages (Mandarin, English, and Taiwanese). We then employ self-supervised learning (SSL) pre-trained models, including Wav2vec 2.0 and BEATs, to extract audio features that characterize each speaker, and train a simple residual neural network (ResNet) to perform cross-lingual speaker recognition. Experimental results show that the fine-tuned Wav2vec 2.0 model achieves over 90% average performance on MET-120, the best overall result. Without fine-tuning, BEATs achieves 80% average performance, suggesting that its embeddings may serve as a soft biometric in cross-lingual scenarios. We also observe that a speaker's native or proficient language influences recognition results. Furthermore, we evaluate acoustic data augmentation schemes such as SpecAugment and ShuffleAugment; when combined with dimensionality-reduction techniques such as PCA, ShuffleAugment significantly improves performance in both same-language and cross-lingual tests.
dc.format.extent 107 bytes
dc.format.mimetype text/html-
dc.relation (Relation) 2024 International Symposium on Multimedia (ISM), IEEE Technical Committee on Multimedia (TCMC)
dc.subject (Keywords) Cross-lingual speaker recognition; Self-supervised learning; Audio embeddings; Data augmentation for audio
dc.title (Title) Unveiling the Potential of SSL-Generated Audio Embeddings for Cross-Lingual Speaker Recognition
dc.type (Type) conference
dc.identifier.doi (DOI) 10.1109/ISM63611.2024.00010
dc.doi.uri (DOI) https://doi.org/10.1109/ISM63611.2024.00010