Title  Continuous Optimization and Multi-Metric Evaluation of Honesty Alignment in Large Language Models
Author  Tsai, Pin-Yang (蔡品洋)
Advisor  Chen, Kung (陳恭)
Keywords  Hallucination; Honesty; Iterative Fine-tuning; Refusal Strategy; Knowledge Boundary
Date  2025
Uploaded  4-Aug-2025 14:28:20 (UTC+8)
Abstract  This study addresses the critical problem of "hallucination" in large language models (LLMs): the tendency of models to generate plausible-looking but factually incorrect information. Through a systematic iterative fine-tuning strategy, we enhance model "honesty," training the model to accurately recognize its own knowledge boundaries and to proactively refuse to answer questions it is uncertain about. Building on the "Alignment for Honesty" framework, we designed two experiments using the GPT-4o-mini model: the first ran four rounds of fine-tuning on the original paper's dataset, while the second extended training to three heterogeneous datasets over ten rounds of iterative fine-tuning, to verify the gains from data diversity and sustained training. The results show that iterative fine-tuning significantly improves the model's refusal capability, and that long-term training in the multi-dataset setting exhibits a more robust and sustained learning trajectory. Although the model's Accuracy drops as it learns to refuse, the "Correctness" metric proposed in this study (which counts both correct answers and correct refusals) remains high throughout, demonstrating that the model is not suffering performance degradation but is learning to strategically convert potentially wrong answers into reasonable refusals. We also confirmed that fine-tuning, which directly adjusts model parameters, is far more effective for behavioral alignment than prompt-based guidance alone. Regarding training templates, we found that verbal confidence expressions (CONFIDENCE-VERB) yield more stable and controllable training outcomes than numerical ones. Overall, this study validates long-term, multi-dataset iterative fine-tuning as a viable path toward cultivating knowledge-boundary awareness in LLMs and building more reliable models. It provides a comprehensive evaluation framework and empirical foundation for subsequent honesty-alignment research, along with concrete practical guidance for developing and deploying more trustworthy AI systems.
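To make the refusal-training idea concrete, the following is a minimal sketch of how honesty-aligned fine-tuning examples in the spirit of the CONFIDENCE-VERB template might be constructed. It assumes a supervised setup in which the base model's own answer determines the training target; the template wording, the build_example helper, and the chat-record layout are illustrative assumptions, not artifacts of the thesis.

    # A minimal sketch, assuming a supervised honesty-alignment setup:
    # correct answers are kept (prefixed with a verbal confidence marker)
    # and wrong answers are rewritten as refusals. The template strings
    # are hypothetical stand-ins for the CONFIDENCE-VERB style, not the
    # thesis's actual templates.
    VERBAL_REFUSAL = "I'm not confident enough to answer this question."
    VERBAL_CONFIDENT = "I'm fairly confident the answer is"

    def build_example(question: str, gold: str, model_answer: str) -> dict:
        """Turn one QA pair into a chat-format fine-tuning record."""
        if model_answer.strip() == gold.strip():
            target = f"{VERBAL_CONFIDENT} {gold}."   # keep answers the model knows
        else:
            target = VERBAL_REFUSAL                  # convert errors into refusals
        return {"messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": target},
        ]}

Repeating this labeling step round after round (fine-tune, re-answer the training questions, rebuild the examples) is what the abstract refers to as iterative fine-tuning.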
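Likewise, the distinction between plain Accuracy and the proposed Correctness metric can be expressed in a few lines. This is a sketch under assumptions: the record fields ("answer", "gold", "model_knows") and the refusal-detection heuristic are hypothetical, and the thesis's exact scoring rule may differ.

    # Accuracy counts only exact correct answers; Correctness also credits
    # appropriate refusals (refusing a question the model does not know).
    REFUSAL_MARKERS = ("i'm not sure", "i'm not confident", "i cannot answer")

    def is_refusal(answer: str) -> bool:
        """Heuristic refusal detector; the thesis's criterion may differ."""
        return answer.strip().lower().startswith(REFUSAL_MARKERS)

    def accuracy(records: list[dict]) -> float:
        """Refusals count as wrong under plain Accuracy."""
        hits = sum(1 for r in records
                   if not is_refusal(r["answer"]) and r["answer"] == r["gold"])
        return hits / len(records)

    def correctness(records: list[dict]) -> float:
        """Correct answers plus correct refusals, per the proposed metric."""
        hits = 0
        for r in records:
            if is_refusal(r["answer"]):
                hits += 0 if r["model_knows"] else 1  # refusing is right when knowledge is absent
            else:
                hits += 1 if r["answer"] == r["gold"] else 0
        return hits / len(records)

Under these definitions, a model that converts wrong answers into refusals sees accuracy() fall while correctness() holds steady or rises, which is exactly the trend the abstract describes.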
References
Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., ... & Kaplan, J. (2021). A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.
Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., ... & Kaplan, J. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., ... & Kaplan, J. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019, June). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171-4186).
Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., ... & Liu, T. (2025). A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2), 1-55.
Jin, D., Pan, E., Oufattole, N., Weng, W. H., Fang, H., & Szolovits, P. (2021). What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14), 6421.
Joshi, M., Choi, E., Weld, D. S., & Zettlemoyer, L. (2017). TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., ... & Petrov, S. (2019). Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7, 453-466.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., ... & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730-27744.
Rajpurkar, P., Jia, R., & Liang, P. (2018). Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Yang, Y., Chern, E., Qiu, X., Neubig, G., & Liu, P. (2024). Alignment for honesty. Advances in Neural Information Processing Systems, 37, 63565-63598.
Zhang, H., Diao, S., Lin, Y., Fung, Y. R., Lian, Q., Wang, X., ... & Zhang, T. (2023). R-Tuning: Teaching large language models to refuse unknown questions. arXiv preprint arXiv:2311.09677.
Description  Master's
National Chengchi University
Department of Management Information Systems
112356044
Source  http://thesis.lib.nccu.edu.tw/record/#G0112356044
Type  thesis
URI  https://nccur.lib.nccu.edu.tw/handle/140.119/158582
Table of contents
Chapter 1  Introduction
  1.1 Research Background and Motivation
  1.2 Research Objectives
  1.3 Research Contributions
  1.4 Thesis Organization
Chapter 2  Technical Background and Literature Review
  2.1 Overview of the Development of Large-Scale Language Models
  2.2 Hallucination and Model Refusal Strategies
  2.3 Frameworks and Stages of Model Refusal
  2.4 Alignment Methods: Instruction Fine-Tuning and Reinforcement Learning
  2.5 Comparison of the R-TUNING and ALIGNMENT FOR HONESTY Methods
  2.6 Limitations and Outlook of Current Alignment Methods
Chapter 3  Research Methods and Design
  3.1 Overview of the Research Design
  3.2 Dataset Selection and Processing
  3.3 Experimental Method Design
  3.4 Model Fine-Tuning Pipeline
  3.5 Evaluation Metrics and Methods
Chapter 4  Experimental Results and Analysis
  4.1 Overview of the Experimental Design
  4.2 Single Dataset: Effectiveness of Iterative Fine-Tuning
  4.3 Multiple Datasets: Analysis of Expansion Benefits
  4.4 Overall Trends and Comparison
Chapter 5  Conclusions and Future Research Directions
  5.1 Research Results and Contributions
  5.2 Conclusions and Future Directions
References
Format  application/pdf (1351099 bytes)