NCCU Library: Publications (Theses)
Title: 結合規則式評分與分群方法之大型語言模型語意風險與合規性評估
Title (English): Semantic risk and compliance evaluation on LLM responses using rule-based scoring and clustering
Author: Chen, Hui-Ying (陳卉縈)
Advisor: Yu, Fang (郁方)
Keywords: Large Language Models; PyRIT; GHSOM; Ethical compliance; Safety evaluation; Adversarial prompts; Jailbreaking
Date: 2025
Uploaded: 4-Aug-2025 14:28:07 (UTC+8)

Abstract
Large Language Models (LLMs) have advanced natural language processing (NLP) applications but remain vulnerable to ethical misalignment and adversarial prompts. This study proposes a dual-layer evaluation framework that integrates rule-based scoring using the Python Risk Identification Tool (PyRIT) with clustering via the Growing Hierarchical Self-Organizing Map (GHSOM). LLM outputs are categorized into Vulgar, Blunt, Deceptive, and Eloquent behaviors based on compliance and semantic risks. The framework also enables cluster-level feature identification and false-positive detection. Across 2,925 responses spanning 10 scenarios and 12 jailbreak scripts, Gemini generated the highest number of Vulgar outputs (119), followed by Perplexity (70) and DeepSeek (59), while Claude and ChatGPT were more ethically aligned. Testing 170 high-risk prompts on API-based versus quantized local models revealed that API models remain susceptible to adversarial inputs, whereas quantized models exhibited lower attack success rates, likely due to reduced comprehension rather than stronger alignment safeguards. These findings underscore the value of layered evaluation frameworks for improving the safety and interpretability of LLMs.

References
AI, D. (2024a). DeepSeek-R1-Distill-Llama-8B [Accessed: 2025-05].
AI, M. (2024b). Meta-Llama-3.1-8B-Instruct [Accessed: 2025-05].
Anthropic. (2023). Claude [Model version: Claude 3.5 Haiku]. https://www.anthropic.com/claude
DeepSeek-AI, Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., et al. (2024). DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. https://arxiv.org/abs/2405.04434
Deng, G., Liu, Y., Li, Y., Wang, K., Zhang, Y., Li, Z., Wang, H., Zhang, T., & Liu, Y. (2024). MasterKey: Automated jailbreaking of large language model chatbots. Proceedings of the 2024 Network and Distributed System Security Symposium. https://doi.org/10.14722/ndss.2024.24188
Dittenbach, M., Merkl, D., & Rauber, A. (2001). Hierarchical clustering of document archives with the growing hierarchical self-organizing map. Proceedings of the International Conference on Artificial Neural Networks (ICANN), 486–491. https://doi.org/10.1007/3-540-44668-0_70
Gehman, S., Gururangan, S., Sap, M., Choi, Y., & Smith, N. A. (2020). RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In T. Cohn, Y. He, & Y. Liu (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 3356–3369). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.301
Google. (2024). Gemini [Model version: Gemini 2.0 Flash-Lite]. https://gemini.google.com/app
Guo, Z., Jin, R., Liu, C., Huang, Y., Shi, D., Supryadi, Yu, L., Liu, Y., Li, J., Xiong, B., & Xiong, D. (2023). Evaluating large language models: A comprehensive survey. https://arxiv.org/abs/2310.19736
Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D., & Steinhardt, J. (2021). Aligning AI with shared human values. International Conference on Learning Representations. https://openreview.net/forum?id=dNy_RKzJacY
Huang, Y., Zhang, Q., Yu, P. S., & Sun, L. (2023). TrustGPT: A benchmark for trustworthy and responsible large language models. https://arxiv.org/abs/2306.11507
Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9), 1464–1480. https://doi.org/10.1109/5.58325
Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., Toutanova, K., Jones, L., Kelcey, M., Chang, M.-W., Dai, A., Uszkoreit, J., Le, Q., & Petrov, S. (2019). Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7, 453–466. https://doi.org/10.1162/tacl_a_00276
Lees, A., Tran, V. Q., Tay, Y., Sorensen, J., Gupta, J., Metzler, D., & Vasserman, L. (2022). A new generation of Perspective API: Efficient multilingual character-level transformers. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 3197–3207. https://doi.org/10.1145/3534678.3539147
Liu, Y., Deng, G., Xu, Z., Li, Y., Zheng, Y., Zhang, Y., Zhao, L., Zhang, T., Wang, K., & Liu, Y. (2024). Jailbreaking ChatGPT via prompt engineering: An empirical study.
Munoz, G. D. L., Minnich, A. J., Lutz, R., Lundeen, R., Dheekonda, R. S. R., Chikanov, N., Jagdagdorj, B.-E., Pouliot, M., Chawla, S., Maxwell, W., Bullwinkel, B., Pratt, K., de Gruyter, J., Siska, C., Bryan, P., Westerhoff, T., Kawaguchi, C., Seifert, C., Kumar, R. S. S., & Zunger, Y. (2024). PyRIT: A framework for security risk identification and red teaming in generative AI system. https://arxiv.org/abs/2410.02828
Nangia, N., Vania, C., Bhalerao, R., & Bowman, S. R. (2020). CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. In B. Webber, T. Cohn, Y. He, & Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1953–1967). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.154
OpenAI. (2023). ChatGPT [Model version: GPT-4o mini]. https://openai.com/chatgpt
Patil, S. G., Zhang, T., Wang, X., & Gonzalez, J. E. (2023). Gorilla: Large language model connected with massive APIs. https://arxiv.org/abs/2305.15334
Perplexity. (2023). Perplexity AI [Model version: Sonar]. https://www.perplexity.ai
Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. https://arxiv.org/abs/1606.05250
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. https://arxiv.org/abs/1908.10084
Rudinger, R., Naradowsky, J., Leonard, B., & Van Durme, B. (2018). Gender bias in coreference resolution. https://arxiv.org/abs/1804.09301
Su, J., Kempe, J., & Ullrich, K. (2024). Mission impossible: A statistical perspective on jailbreaking LLMs. https://arxiv.org/abs/2408.01420
Talmor, A., Herzig, J., Lourie, N., & Berant, J. (2019). CommonsenseQA: A question answering challenge targeting commonsense knowledge. https://arxiv.org/abs/1811.00937
Tang, H., Li, H., Liu, J., Hong, Y., Wu, H., & Wang, H. (2021). DuReader_robust: A Chinese dataset towards evaluating robustness and generalization of machine reading comprehension in real-world applications. https://arxiv.org/abs/2004.11142
Wen, S.-J., Chang, J.-M., & Yu, F. (2024). scGHSOM: Hierarchical clustering and visualization of single-cell and CRISPR data using growing hierarchical SOM. https://arxiv.org/abs/2407.16984
Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., & Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. https://arxiv.org/abs/1809.09600
Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K.-W. (2018). Gender bias in coreference resolution: Evaluation and debiasing methods. In M. Walker, H. Ji, & A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (pp. 15–20). Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-2003
Zhao, Y., Zhao, C., Nan, L., Qi, Z., Zhang, W., Tang, X., Mi, B., & Radev, D. (2023). RobuT: A systematic study of table QA robustness against human-annotated adversarial perturbations. In A. Rogers, J. Boyd-Graber, & N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 6064–6081). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.334
Zhu, K., Wang, J., Zhou, J., Wang, Z., Chen, H., Wang, Y., Yang, L., Ye, W., Zhang, Y., Gong, N. Z., & Xie, X. (2024). PromptRobust: Towards evaluating the robustness of large language models on adversarial prompts. https://arxiv.org/abs/2306.04528
Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. https://arxiv.org/abs/2307.15043

Description: Master's thesis
National Chengchi University
Department of Management Information Systems
Student ID: 112356043
Source: http://thesis.lib.nccu.edu.tw/record/#G0112356043
Type: thesis
URI: https://nccur.lib.nccu.edu.tw/handle/140.119/158581
Format: application/pdf (2,226,221 bytes)

Table of Contents
摘要 (Chinese Abstract); Abstract; Contents; List of Figures; List of Tables
1 Introduction
2 Related Work
  2.1 Knowledge and Capability Evaluation
  2.2 Alignment Evaluation
  2.3 Safety Evaluation
  2.4 Limitations of Existing Approaches
3 Methodology
  3.1 Prompt Generation
    3.1.1 Contextual Prompts
    3.1.2 Jailbreak Prompts
    3.1.3 External Prompt Baseline from AdvBench
  3.2 Response Collection
  3.3 Semantic Embedding Conversion
  3.4 Scoring and Classification with PyRIT
    3.4.1 Binary Compliance Classification
    3.4.2 Likert-Scale Compliance Scoring
    3.4.3 Categorical Compliance Assessment
    3.4.4 Objective Success Evaluation
  3.5 Clustering Analysis with GHSOM
    3.5.1 False Positive Detection
    3.5.2 Feature Identification
  3.6 Integration of PyRIT and GHSOM
4 Evaluation
  4.1 PyRIT Scoring Analysis
    4.1.1 Binary Compliance Classification
    4.1.2 Likert-Scale Compliance Scoring
    4.1.3 Categorical Compliance Assessment
    4.1.4 Objective Success Evaluation
  4.2 GHSOM Clustering Analysis
    4.2.1 False Positive Detection
    4.2.2 Feature Identification
  4.3 Semantic Risk Quadrant Analysis
    4.3.1 Vulgar Responses
    4.3.2 Blunt Responses
    4.3.3 Deceptive Responses
    4.3.4 Eloquent Responses
  4.4 Backtracking Analysis of Adversarial Responses
  4.5 Transferability Evaluation Across Advanced and Quantized Models
  4.6 Comparison with AdvBench Prompts
5 Conclusion
References
Appendix A: Representative Examples (A.1 Vulgar Response; A.2 Blunt Response; A.3 Deceptive Response; A.4 Eloquent Response)
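The four tone-behavior categories in the abstract arise from two axes: a rule-based compliance verdict (PyRIT scorers) and a semantic/tone risk signal (GHSOM cluster features). As a minimal sketch of how such a two-axis labeling could work; the `classify_response` helper, the 0.5 tone-risk threshold, and the exact axis-to-label mapping below are illustrative assumptions, not the thesis's actual scoring rules:

```python
# Illustrative two-axis labeling. In the thesis, the compliance verdict comes
# from PyRIT rule-based scorers and the tone/semantic risk from GHSOM cluster
# features; here both inputs are simplified placeholders.

def classify_response(compliant: bool, tone_risk: float,
                      threshold: float = 0.5) -> str:
    """Map a compliance verdict and a tone-risk score to a behavior type."""
    if not compliant:
        # Non-compliant content: an offensive tone reads as Vulgar,
        # a polished tone as potentially Deceptive.
        return "Vulgar" if tone_risk >= threshold else "Deceptive"
    # Compliant content: an offensive tone is Blunt, a polished tone Eloquent.
    return "Blunt" if tone_risk >= threshold else "Eloquent"

if __name__ == "__main__":
    samples = [
        (False, 0.9),  # non-compliant, harsh tone
        (False, 0.2),  # non-compliant, fluent tone
        (True, 0.8),   # compliant, harsh tone
        (True, 0.1),   # compliant, fluent tone
    ]
    for compliant, risk in samples:
        print(compliant, risk, "->", classify_response(compliant, risk))
```

The point of the quadrant view is that the two signals are independent: a response can pass rule-based compliance yet still carry tone risk (Blunt), or evade keyword-style rules while remaining semantically harmful (Deceptive), which is what the clustering layer is meant to surface.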
