| dc.contributor | Department of Computer Science | |
| dc.creator (Author) | 廖文宏 | |
| dc.creator (Author) | Liao, Wen-Hung;Ou, Yen-Chun;Chen, Po-Han;Wu, Yi-Chieh | |
| dc.date (Date) | 2023-12 | |
| dc.date.accessioned | 7-Jan-2025 09:36:43 (UTC+8) | - |
| dc.date.available | 7-Jan-2025 09:36:43 (UTC+8) | - |
| dc.date.issued (Upload Time) | 7-Jan-2025 09:36:43 (UTC+8) | - |
| dc.identifier.uri (URI) | https://nccur.lib.nccu.edu.tw/handle/140.119/155078 | - |
| dc.description.abstract (Abstract) | People use multiple languages in their daily lives across regions worldwide, which motivated us to investigate cross-lingual speaker recognition. In this work, we propose to collect recordings of Mandarin and Spanish, namely the Mandarin-Spanish-Speech Dataset (MSSD-40), to analyze the performance of various audio embeddings for cross-lingual speaker recognition tasks. All participants are fluent in Mandarin, but none of them has prior knowledge of the Spanish language. As such, they have been advised to adopt a parroting mode of Spanish speech production, wherein they simply repeat the sounds emanating from the loudspeaker. Using this approach, variations resulting from individual differences in language fluency can be reduced, enabling us to focus on the anatomical aspects of the speech production mechanism. Embeddings extracted from models pre-trained with a large number of audio segments have become effective solutions for coping with audio analysis tasks using small datasets. Preliminary experimental results using two collected multi-lingual datasets indicate that both the embedding method and the language employed affect the robustness of the speaker recognition task. Specifically, stable performance is observed when familiar languages are used. BEATs embedding generates the best outcome in all languages when no fine-tuning is exercised. | |
| dc.format.extent | 107 bytes | - |
| dc.format.mimetype | text/html | - |
| dc.relation (Relation) | Proceedings of the 25th International Symposium on Multimedia, IEEE Technical Committee on Multimedia (TCMC), pp.193-197 | |
| dc.subject (Keywords) | text-independent speaker recognition; cross-lingual dataset; deep learning; audio embedding; parroting mode | |
| dc.title (Title) | The Impact of Parroting Mode on Cross-Lingual Speaker Recognition | |
| dc.type (Type) | conference | |
| dc.identifier.doi (DOI) | 10.1109/ISM59092.2023.00035 | |
| dc.doi.uri (DOI) | https://doi.org/10.1109/ISM59092.2023.00035 | |