| dc.contributor | Department of Computer Science | |
| dc.creator (Author) | 廖文宏 | |
| dc.creator (Author) | Liao, Wen-Hung;Chen, Po-Han;Wu, Yi-Chieh | |
| dc.date (Date) | 2024-12 | |
| dc.date.accessioned | 19-May-2025 11:44:32 (UTC+8) | - |
| dc.date.available | 19-May-2025 11:44:32 (UTC+8) | - |
| dc.date.issued (Upload time) | 19-May-2025 11:44:32 (UTC+8) | - |
| dc.identifier.uri (URI) | https://nccur.lib.nccu.edu.tw/handle/140.119/157013 | - |
| dc.description.abstract (Abstract) | This research explores the effectiveness of SSL-based audio embeddings in cross-lingual speaker recognition. We collected a speech dataset, named MET-120, from 120 participants, each of whom recorded in three languages (Mandarin, English, and Taiwanese). We then employ self-supervised learning (SSL) pre-trained models, including Wav2vec 2.0 and BEATs, to extract audio features that characterize the speaker. A simple residual neural network (ResNet) is trained to perform cross-lingual speaker recognition tasks. Experimental results show that the fine-tuned Wav2vec 2.0 model achieves over 90% average performance on MET-120, the best overall result. Without fine-tuning, BEATs achieves 80% average performance on MET-120, suggesting that it might serve as a soft biometric in cross-lingual scenarios. We also observe that a speaker's native or proficient language influences recognition results. Furthermore, we evaluate the efficacy of acoustic data augmentation schemes such as SpecAugment and ShuffleAugment. The results demonstrate that ShuffleAugment, when used alongside dimensionality-reduction techniques such as PCA, significantly improves performance in both same-language and cross-lingual tests. | |
| dc.format.extent | 107 bytes | - |
| dc.format.mimetype | text/html | - |
| dc.relation (Relation) | 2024 International Symposium on Multimedia (ISM), IEEE Technical Committee on Multimedia (TCMC) | |
| dc.subject (Keywords) | Cross-lingual speaker recognition; Self-supervised learning; Audio embeddings; Data augmentation for audio | |
| dc.title (Title) | Unveiling the Potential of SSL-Generated Audio Embeddings for Cross-Lingual Speaker Recognition | |
| dc.type (Resource type) | conference | |
| dc.identifier.doi (DOI) | 10.1109/ISM63611.2024.00010 | |
| dc.doi.uri (DOI) | https://doi.org/10.1109/ISM63611.2024.00010 | |