Title: 基於階層式聚類的文本檢索樹於增強式生成系統之應用:以台灣法規為例 (Tree-Based Text Retrieval via Hierarchical Clustering in RAG Frameworks: Application on Taiwanese Regulations)
Author: 余嘉恆 (Yu, Chia-Heng)
Advisor: 蔡炎龍 (Tsai, Yeng-Lung)
Keywords:
Retrieval-Augmented Generation (RAG)
Hierarchical Clustering
Text Retrieval
Semantic Vector Search
AI
Semantic Hierarchy
Date: 2025
Uploaded: 1-Sep-2025 16:31:04 (UTC+8)

Abstract:
This study explores the application of Retrieval-Augmented Generation (RAG) to the retrieval and generation of legal statutes. We compiled a dataset of Taiwanese legal regulations and used a pre-trained model to encode the texts into vector representations. Retrieval is performed by a custom-designed algorithm that applies cosine distance as the metric for hierarchical clustering of the document embeddings. User queries are likewise converted into vectors, and a breadth-first search (BFS) identifies the node in the cluster tree closest to the query vector; all leaf nodes of that node's subtree are returned as the retrieval results. For the generation stage, we consulted domain experts on how legal professionals interpret and decompose complex legal scenarios, and embedded their guidance into the prompt design using Chain-of-Thought (CoT) techniques to steer the model toward more complete and better-structured responses. Through expert evaluation and hypothesis testing, we show that our system improves retrieval precision and answer accuracy over a baseline RAG implementation based on standard semantic search.

References

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
[3] Tomas Mikolov, Kai Chen, Greg S. Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Proceedings of the ICLR Workshop, January 2013.
[4] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1188–1196, Beijing, China, June 2014. PMLR.
[5] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2), January 2025.
[6] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA, 2020. Curran Associates Inc.
[7] Kishore Papineni. Why inverse document frequency? In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 1–8, 2001.
[8] Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jennifer C. Lai, and Robert L. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.
[9] Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. Recent trends in deep learning based natural language processing [review article]. IEEE Computational Intelligence Magazine, 13:55–75, August 2018.
[10] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
[11] Saeed Damadi, Golnaz Moharrer, Mostafa Cham, and Jinglai Shen. The backpropagation algorithm for a math student. In 2023 International Joint Conference on Neural Networks (IJCNN), pages 1–9, 2023.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[13] Jimmy Ba, Jamie Kiros, and Geoffrey Hinton. Layer normalization, 2016. arXiv preprint arXiv:1607.06450.
[14] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 807–814, 2010.
[15] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online, November 2020. Association for Computational Linguistics.
[16] Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. A survey on RAG meeting LLMs: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’24, pages 6491–6501, New York, NY, USA, 2024. Association for Computing Machinery.
[17] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2022. Curran Associates Inc.
[18] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA, 2020. Curran Associates Inc.
[19] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models, 2023. arXiv preprint arXiv:2302.13971.
[20] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2022. Curran Associates Inc.
[21] Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The Faiss library, 2024. arXiv preprint arXiv:2401.08281.
[22] Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Multilingual E5 text embeddings: A technical report, 2024. arXiv preprint arXiv:2402.05672.
[23] OpenAI. GPT-4o-mini models, 2025. Available at: https://platform.openai.com/docs/models/gpt-4o-mini (accessed 2025-04-12).

Description: Master's thesis, National Chengchi University, Department of Applied Mathematics (student ID 112751004)
Identifier: G0112751004
URI: https://nccur.lib.nccu.edu.tw/handle/140.119/159321
Source: http://thesis.lib.nccu.edu.tw/record/#G0112751004
Type: thesis
Table of Contents

Acknowledgements
Chinese Abstract
Abstract
Contents
List of Tables
List of Figures
1 Introduction
2 Neural Network Foundations and Transformer Architecture
2.1 Deep Learning
2.2 Loss Function and Backpropagation
2.3 Transformer Architecture
2.3.1 The Input of the Transformer
2.3.2 The Attention Layer and the Attention Function
2.3.3 Multi-Head Attention
2.3.4 Residual Connection and Layer Normalization
2.3.5 Feed-Forward Network
2.3.6 The Decoder of the Transformer
3 Retrieval-Augmented Generation (RAG)
3.1 Method in RAG
3.2 Decoder in RAG
3.3 The Training Parts in RAG
3.4 RAG without Fine-tuning
3.5 Prompt Engineering
3.6 Chain-of-Thought
4 Text Retrieval with Hierarchical Clustering
4.1 Method
4.1.1 Hierarchical Clustering
4.1.2 Searching Algorithm
5 Experiment and Evaluation
5.1 Corpus Selection and Preprocessing
5.2 Model Selection
5.3 Evaluation Metrics
5.3.1 Performance Comparison of Retrieval Methods
6 Future Work
6.1 Limitations and Potential Extensions
6.2 Technical and Methodological Improvements
6.3 Broader Applications
Bibliography
Appendix A: Generation Examples
Appendix B: Code

Format: application/pdf (1,317,466 bytes)
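As a rough illustration of the retrieval procedure the abstract describes — hierarchical clustering of document embeddings under cosine distance, then a breadth-first search that returns every leaf beneath the best-matching node — here is a minimal Python sketch. It is not the thesis's implementation: the naive O(n³) agglomerative loop, the centroid averaging, and the toy `docs` vectors are stand-ins for the thesis's custom algorithm and its pre-trained encoder output.

```python
import math
from collections import deque

def cosine_distance(u, v):
    """1 - cosine similarity; smaller means more semantically similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

class Node:
    def __init__(self, centroid, doc_ids, children=()):
        self.centroid = centroid    # mean vector of the cluster
        self.doc_ids = doc_ids      # leaf documents covered by this subtree
        self.children = list(children)

def build_tree(vectors):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    whose centroids are closest in cosine distance under a new parent."""
    nodes = [Node(v, [i]) for i, v in enumerate(vectors)]
    while len(nodes) > 1:
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                d = cosine_distance(nodes[i].centroid, nodes[j].centroid)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        b, a = nodes.pop(j), nodes.pop(i)   # pop the higher index first
        centroid = [(x + y) / 2 for x, y in zip(a.centroid, b.centroid)]
        nodes.append(Node(centroid, a.doc_ids + b.doc_ids, [a, b]))
    return nodes[0]

def retrieve(root, query_vec):
    """BFS over the tree for the node whose centroid is closest to the
    query; every leaf under that node is returned as a retrieval hit."""
    best, best_d = root, cosine_distance(root.centroid, query_vec)
    queue = deque([root])
    while queue:
        node = queue.popleft()
        d = cosine_distance(node.centroid, query_vec)
        if d < best_d:
            best, best_d = node, d
        queue.extend(node.children)
    return sorted(best.doc_ids)

# Toy embeddings standing in for encoder output: documents 0-1 share one
# topic direction, documents 2-3 another.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.9, 0.2]]
root = build_tree(docs)
print(retrieve(root, [1.0, 0.05, 0.0]))   # → [0, 1]
```

A query near the first topic lands on the internal node covering documents 0 and 1, so the whole subtree's leaves come back as one result set — the behavior the abstract attributes to the cluster-tree search.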
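The generation stage, as the abstract describes it, folds expert-derived decomposition steps and the retrieved statutes into a Chain-of-Thought prompt. A hypothetical sketch of that assembly follows; the step wording, the `build_cot_prompt` name, and the overall template are invented for illustration and may differ from the thesis's actual prompts.

```python
# Hypothetical paraphrase of the kind of expert decomposition guidance
# the abstract mentions; the thesis's real step list may differ.
EXPERT_STEPS = [
    "Identify the parties and the legal relationship in the scenario.",
    "List the facts that could trigger a statutory provision.",
    "Match each fact to the retrieved articles and quote them.",
    "Reason step by step from the articles to a conclusion.",
]

def build_cot_prompt(query, retrieved_articles):
    """Combine retrieved statutes with expert-informed CoT instructions."""
    context = "\n".join(f"[{i + 1}] {a}" for i, a in enumerate(retrieved_articles))
    steps = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(EXPERT_STEPS))
    return (
        "You are a legal assistant answering questions about Taiwanese regulations.\n\n"
        f"Retrieved articles:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer by reasoning through these steps:\n"
        f"{steps}"
    )
```

The retrieved subtree leaves from the clustering step would be passed in as `retrieved_articles`, so the model is steered to walk the expert's decomposition over exactly the statutes the tree search surfaced.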
