| dc.contributor.advisor | 胡毓忠 | zh_TW |
| dc.contributor.advisor | Hu, Yuh-Jong | en_US |
| dc.contributor.author (Authors) | 林柏含 | zh_TW |
| dc.contributor.author (Authors) | Lin, Bo-Han | en_US |
| dc.creator (作者) | 林柏含 | zh_TW |
| dc.creator (作者) | Lin, Bo-Han | en_US |
| dc.date (日期) | 2025 | en_US |
| dc.date.accessioned | 3-Nov-2025 14:45:04 (UTC+8) | - |
| dc.date.available | 3-Nov-2025 14:45:04 (UTC+8) | - |
| dc.date.issued (上傳時間) | 3-Nov-2025 14:45:04 (UTC+8) | - |
| dc.identifier (Other Identifiers) | G0109971006 | en_US |
| dc.identifier.uri (URI) | https://nccur.lib.nccu.edu.tw/handle/140.119/160072 | - |
| dc.description (描述) | 碩士 | zh_TW |
| dc.description (描述) | 國立政治大學 | zh_TW |
| dc.description (描述) | 資訊科學系碩士在職專班 | zh_TW |
| dc.description (描述) | 109971006 | zh_TW |
| dc.description.abstract (摘要) | 近年來,生成式人工智慧模型的快速進步推動了多項應用的突破性發展。其中,生成式語音合成技術以極高的精確度模仿人類聲音,這項技術的快速發展也帶來了嚴重的安全隱患,包括身份盜用、詐欺行為以及操縱性內容的傳播。這些威脅不僅危害個人隱私,還可能對社會信任與公共安全造成深遠影響。
本研究提出一種多層次防護策略,結合數位浮水印與關鍵目標擾動雜訊技術,以應對上述風險。數位浮水印可用於驗證語音內容是否經過授權或被他人篡改;而關鍵目標擾動雜訊則能有效防止語音特徵被盜用,從而阻礙未經授權的語音生成。
實驗結果顯示,結合這兩種防禦機制後,數位浮水印的檢測性能依然高度可靠,AUC 值達到 0.993,未受擾動雜訊影響。此外,當語音合成模型使用受保護的語音進行生成時,生成器無法產生可有效辨識的語音,詞錯誤率(WER)高達 1.01,表明生成的語音難以被人理解。
總體而言,這兩種防禦機制能夠協同運作且互不影響性能,為語音合成技術的安全應用提供了有效保障。 | zh_TW |
| dc.description.abstract (摘要) | Recent advances in generative AI have driven breakthroughs in applications, particularly in speech synthesis, which can accurately mimic human voices. However, this progress raises serious security concerns, including identity theft, fraud, and manipulated content, threatening individual privacy and societal trust.
This study proposes a dual defense strategy combining digital watermarking and targeted perturbation noise. Watermarking verifies audio authenticity, while perturbation prevents unauthorized voice synthesis.
Results show that watermark detection remains reliable (AUC 0.9993) despite the added perturbation, and that synthesized voices are unintelligible (WER 1.01). These mechanisms coexist effectively, ensuring robust protection for speech synthesis applications. | en_US |
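The abstract reports a WER of 1.01, i.e. slightly more word errors than reference words — WER is word-level edit distance divided by reference length, so it can exceed 1.0 when the hypothesis contains extra or badly garbled words. As an illustrative aside (not part of the thesis record), a minimal sketch of the standard WER computation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[-1][-1] / len(ref)
```

A WER near 0 means the synthesized speech is transcribed almost perfectly; a WER at or above 1.0, as reported here, means the output is effectively unintelligible.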
| dc.description.tableofcontents | 第一章 緒論 1
1.1 研究背景 1
1.1.1 語音深偽技術的發展與風險 1
1.1.2 電話詐騙的規模與影響 1
1.1.3 防範語音詐騙的對策與措施 2
1.1.4 提取聲音特徵與語音合成的轉換過程 2
1.1.5 防範合成語音的防禦機制 3
1.1.6 使用生成式語音進行中間人攻擊 5
1.2 研究動機 8
1.3 研究目的 8
1.4 研究問題 9
第二章 文獻探討 10
2.1 雜訊保護機制 10
2.1.1 POP 算法如何達到添加擾動,有效降低合成語音的品質? 11
2.2 嵌入浮水印 13
2.2.1 AudioSeal 如何嵌入浮水印 13
2.2.2 浮水印如何驗證語音來源的合法性與是否被竄改的可能性? 14
2.3 TTS 目標生成模型 15
2.3.1 惡意第三方如何進行仿冒? 15
2.4 結合 POP 算法與浮水印 16
2.4.1 如何阻止惡意第三方仿冒? 16
第三章 研究方法 17
3.1 研究流程 17
3.1.1 第一階段 18
3.1.2 第二階段 19
3.1.3 第三階段 20
3.2 實驗流程 21
3.2.1 蒐集訓練數據集 21
3.2.2 訓練模型 22
第四章 研究結果與分析 31
4.1 研究環境 31
4.1.1 硬體環境 31
4.1.2 軟體環境 31
4.2 資料集來源 32
4.2.1 訓練資料集 32
4.2.2 數據前處理 32
4.3 模型訓練配置 33
4.4 效果分析 33
4.4.1 實驗結果 34
4.4.2 平衡參數 Lambda 的相關性分析 44
4.4.3 浮水印檢測與添加擾動效能分析 45
4.4.4 TTS 生成音頻品質分析 47
4.4.5 浮水印檢測分析 47
4.4.6 保護與檢測是否適用於即時串流語音 47
第五章 結論 48
5.1 研究結論 48
5.1.1 技術貢獻 48
5.1.2 實用價值 49
5.1.3 方法優勢 49
5.1.4 方法劣勢 49
5.2 未來展望 50
5.2.1 技術發展方向 50
5.2.2 挑戰與機遇 50
Bibliography 51 | zh_TW |
| dc.format.extent | 13465231 bytes | - |
| dc.format.mimetype | application/pdf | - |
| dc.source.uri (資料來源) | http://thesis.lib.nccu.edu.tw/record/#G0109971006 | en_US |
| dc.subject (關鍵詞) | 語音克隆 | zh_TW |
| dc.subject (關鍵詞) | 反電子欺騙 | zh_TW |
| dc.subject (關鍵詞) | 數位浮水印 | zh_TW |
| dc.subject (關鍵詞) | 關鍵目標擾動雜訊 | zh_TW |
| dc.subject (關鍵詞) | Voice cloning | en_US |
| dc.subject (關鍵詞) | Anti-Spoofing | en_US |
| dc.subject (關鍵詞) | Digital watermarking | en_US |
| dc.subject (關鍵詞) | Perturbation noise | en_US |
| dc.title (題名) | 結合關鍵目標擾動增強浮水印技術以抵抗語音克隆攻擊 | zh_TW |
| dc.title (題名) | Enhancement Watermark with Pivotal Objective Perturbation against Voice Cloning Attack | en_US |
| dc.type (資料類型) | thesis | en_US |
| dc.relation.reference (參考文獻) | [1]Anonymous. “Proactive Detection of Voice Cloning with Localized Watermarking”. In: arXiv preprint arXiv:2401.17264 (2024).
[2]Michael Arnold. Techniques and Applications of Digital Watermarking and Content Protection. Artech House, 2003. ISBN: 9781580531115.
[3]Starling Bank. Starling Bank Launches Safe Phrases Campaign. https://www.starlingbank.com/news/starling-bank-launches-safe-phrases-campaign/. Accessed: 2025-06-29. 2023.
[4]Nicholas Carlini and David Wagner. “Audio Adversarial Examples: Targeted Attacks on Speech-to-Text”. In: 2018 IEEE Security and Privacy Workshops (SPW) (2018), pp. 1–7. DOI: 10.1109/SPW.2018.00009.
[5]Hyeonseung Choi, Jihoon Lee, and Youngjin Park. “Robustness of Mel-Spectrogram Features in Speaker Recognition”. In: IEEE Signal Processing Letters 26.8 (2019), pp. 1187–1191. DOI: 10.1109/LSP.2019.2921912.
[6]Federal Trade Commission. FTC Proposes New Protections to Combat AI Impersonation of Individuals. https://www.ftc.gov/news-events/news/press-releases/2024/02/ftc-proposes-new-protections-combat-ai-impersonation-individuals. Accessed: 2025-06-29. 2024.
[7]Keith Ito and Linda Johnson. LJ Speech Dataset. 2017. URL: https://keithito.com/LJ-Speech-Dataset/.
[8]Jaehyeon Kim, Jungil Kong, and Juhee Son. “Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech”. In: International Conference on Machine Learning (2021). arXiv:2106.06103.
[9]G. Kubin, B. S. Atal, and W. B. Kleijn. “Performance of noise excitation for unvoiced speech”. In: IEEE Workshop on Speech Coding for Telecommunications (1996).
[10]Yixin Liu et al. “Stable Unlearnable Example: Enhancing the Robustness of Unlearnable Examples via Stable Error-Minimizing Noise”. In: arXiv preprint arXiv:2302.04847 (2023).
[11]Trend Micro. Unusual CEO Fraud via Deepfake Audio Steals $243,000 from U.K. Company. https://www.trendmicro.com/vinfo/us/security/news/cyberattacks/unusual-ceo-fraud-via-deepfake-audio-steals-us-243-000-from-u-k-company. Accessed: 2025-06-29. 2019.
[12]Robin San Roman et al. “Proactive Detection of Voice Cloning with Localized Watermarking”. In: arXiv preprint arXiv:2401.17264 (2024).
[13]Pindrop Security. Pindrop Security Raises $100 Million to Expand Deepfake Detection Technology. https://www.securityweek.com/pindrop-security-raises-100-million-to-expand-deepfake-detection-technology/. Accessed: 2025-07-08. 2024.
[14]Xin Shen et al. “Deepfakes: The Coming Infocalypse in Audio and Video”. In: IEEE Transactions on Multimedia 22.10 (2020), pp. 2601–2612. DOI: 10.1109/TMM.2020.2982567.
[15]Kenneth N. Stevens. Acoustic Phonetics. MIT Press, 1998. ISBN: 9780262194044.
[16]Christian Szegedy et al. “Intriguing Properties of Neural Networks”. In: International Conference on Learning Representations (ICLR) (2014). arXiv:1312.6199.
[17]Truecaller. Truecaller Insights 2021 U.S. Spam Scam Report. https://www.truecaller.com/blog/insights/us-spam-scam-report-21. Accessed: 2025-07-08. 2021.
[18]Changhan Wang et al. “VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, Aug. 2021, pp. 993–1003. DOI: 10.18653/v1/2021.acl-long.80. URL: https://aclanthology.org/2021.acl-long.80/.
[19]Rui Wang, Xin Zhang, and Yang Liu. “Detecting Audio Deepfakes Using Mel-Spectrogram Features and Convolutional Neural Networks”. In: Computer Speech & Language 68 (2021), p. 101203. DOI: 10.1016/j.csl.2021.101203.
[20]Zhiyuan Yu et al. “AntiFake: Using Adversarial Audio to Prevent Unauthorized Speech Synthesis”. In: arXiv preprint arXiv:2305.12737 (2023).
[21]Heiga Zen et al. “LibriTTS: A corpus derived from LibriSpeech for text-to-speech”. In: arXiv preprint arXiv:1904.02882 (2019).
[22]Zhisheng Zhang et al. “Mitigating Unauthorized Speech Synthesis for Voice Protection”. In: arXiv preprint arXiv:2405.12686 (2024). | zh_TW |