Title Deep Learning-Based Restoration of Voice-Converted Audio for Speech and Speaker Recognition
Authors 廖文宏 (Liao, Wen-Hung); Huang, David
Contributor Department of Computer Science (資訊系)
Keywords Deep Learning; Speaker Recognition; Speech Recognition; Restoration of Transformed Audio
Date 2025-11
Uploaded 11-Feb-2026 09:11:07 (UTC+8)
Abstract Voice conversion alters acoustic features such as pitch, timbre, and rhythm, often degrading the performance of automatic speech and speaker recognition systems. This study explores deep learning–based restoration methods to recover intelligibility and speaker identity from voice-converted audio. We systematically compare generative models including DiscoGAN, CycleGAN, HiFi-GAN, and VITS-SVC, and further introduce a hybrid HiFi-GAN–VITS-SVC architecture. In addition, we evaluate Retrieval-based Voice Conversion (RVC) for its potential in reconstructing both speech quality and speaker characteristics. Experiments on the MET-40 dataset, assessed by character error rate (CER), Perceptual Evaluation of Speech Quality (PESQ), and Top-1/Top-5 speaker identification, show that while HiFi-GAN excels under mild distortions, RVC consistently achieves superior restoration across all conversion types. Importantly, restored audio often retains sufficient speaker identity to enable re-identification, raising privacy and security concerns. Our findings underscore the trade-off between recognition performance and user anonymity, and point toward the need for future research on privacy-preserving speech restoration.
Relation Pattern Recognition and Computer Vision: 8th Asian Conference on Pattern Recognition, ACPR 2025, IAPR, pp. 265-280
Type conference
DOI https://doi.org/10.1007/978-981-95-4398-4_19
URI https://nccur.lib.nccu.edu.tw/handle/140.119/161639
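
Usage note: the two objective metrics named in the abstract, CER and wide-band PESQ, can be computed with standard open-source tooling. The following is a minimal illustrative sketch, not code from the paper; it assumes the jiwer, pesq, and soundfile Python packages, 16 kHz mono WAV files, and hypothetical file names, none of which are specified in this record.

# Illustrative sketch (not from the paper): computing CER and PESQ
# for one restored utterance. Assumes the jiwer, pesq, and soundfile
# packages and 16 kHz mono WAV files; file names are hypothetical.
import soundfile as sf
from jiwer import cer
from pesq import pesq

# Character error rate between the reference transcript and an ASR
# hypothesis obtained from the restored audio.
reference = "the quick brown fox"
hypothesis = "the quick brown box"
print("CER:", cer(reference, hypothesis))

# Wide-band PESQ between the clean original and the restored signal.
ref_audio, sr = sf.read("clean.wav")      # hypothetical path
deg_audio, _ = sf.read("restored.wav")    # hypothetical path
print("PESQ:", pesq(sr, ref_audio, deg_audio, "wb"))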