Title: RAG-Free Contextual Extension for LLMs: A Study on KV-Cache and Calculus AI Tutoring
Title (Chinese): 大型語言模型的非檢索式上下文延展機制研究:從鍵值緩存到微積分AI教師
Author: Sun, Yi-Jia (孫翊珈)
Advisor: Tsai, Yen-Lung (蔡炎龍)
Keywords: Large Language Models; Key-Value Cache; Context Extension; AI Tutoring System; Retrieval-Free Generation
Date: 2025
Uploaded: 1-Sep-2025 16:30:03 (UTC+8)

Abstract
With the advancement of Large Language Models (LLMs), their integration into educational applications has attracted increasing attention. However, LLMs are constrained by their fixed context window size, making it difficult to handle long instructional materials and maintain coherent multi-turn teaching dialogues. While Retrieval-Augmented Generation (RAG) alleviates some knowledge limitations by incorporating external retrieval, it often introduces retrieval bias and context fragmentation, reducing its effectiveness in educational scenarios. This study proposes a retrieval-free context extension approach based on the Key-Value Cache (KV-Cache) and implements a calculus-focused AI tutoring system. The system incrementally feeds LaTeX-based calculus textbooks into the model using a chunked prefill strategy, caching intermediate computations to enable consistent context retention and improved semantic coherence in subsequent teaching interactions. The experiments compare the proposed system with RAG-based and non-caching baselines, focusing on response latency and teaching continuity. The results demonstrate that the KV-Cache mechanism effectively enhances contextual coherence and significantly reduces response latency in long-text teaching scenarios, showing strong potential for future AI-driven educational systems.

References
[1] Jie Hu, Shengnan Wang, Yutong He, Ping Gong, Jiawei Yi, Juncheng Zhang, Youhui Bai, Renhai Chen, Gong Zhang, Cheng Li, et al. Efficient long-context LLM inference via KV cache clustering. arXiv preprint arXiv:2506.11418, 2025.
[2] Neusha Javidnia, Bita Darvish Rouhani, and Farinaz Koushanfar. Key, value, compress: A systematic exploration of KV cache compression techniques. In 2025 IEEE Custom Integrated Circuits Conference (CICC), pages 1–3. IEEE, 2025.
[3] Jushi Kai, Boyi Zeng, Yixuan Wang, Haoli Bai, Ziwei He, Bo Jiang, and Zhouhan Lin. FreqKV: Frequency domain key-value compression for efficient context window extension. arXiv preprint arXiv:2505.00570, 2025.
[4] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
[5] Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, and Minyi Guo. ClusterKV: Manipulating LLM KV cache in semantic space for recallable compression. arXiv preprint arXiv:2412.03213, 2024.
[6] A. Palu and B. Smith. KV-cache compression with low-rank projection. In International Conference on Learning Representations (ICLR), 2024.
[7] Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, and Yuxiong He. SwiftKV: Fast prefill-optimized inference with knowledge-preserving model transformation. arXiv preprint arXiv:2410.03960, 2024.
[8] Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, and Beidi Chen. ShadowKV: KV cache in shadows for high-throughput long-context LLM inference. arXiv preprint arXiv:2410.21465, 2024.
[9] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[10] Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Multilingual E5 text embeddings: A technical report. arXiv preprint arXiv:2402.05672, 2024.
[11] Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, and Deyu Zhou. SCOPE: Optimizing key-value cache compression in long-context generation. arXiv preprint arXiv:2412.13649, 2024.
[12] Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, and Shiyu Chang. KVLink: Accelerating large language models via efficient KV cache reuse. arXiv preprint arXiv:2502.16002, 2025.

Degree: Master's (碩士)
Institution: National Chengchi University (國立政治大學)
Department: Department of Applied Mathematics (應用數學系)
Student ID: 111751001
Other Identifier: G0111751001
Source: http://thesis.lib.nccu.edu.tw/record/#G0111751001
URI: https://nccur.lib.nccu.edu.tw/handle/140.119/159317
Type: thesis
Table of Contents
  Acknowledgements
  Chinese Abstract
  Abstract
  Contents
  List of Tables
  List of Figures
  Chapter 1  Introduction
  Chapter 2  Related Techniques
    2.1 Transformer Architecture and the Attention Mechanism
    2.2 Extending the Context Limits of Large Language Models
        2.2.1 Positional-Encoding Extension: Rotary Position Embedding (RoPE)
        2.2.2 Positional-Encoding Extension: Attention with Linear Biases (ALiBi)
    2.3 Overview of Retrieval-Augmented Generation (RAG)
    2.4 Key-Value Cache (KV-Cache)
    2.5 Overview of Cache-Augmented Generation (CAG)
  Chapter 3  System Design and Implementation
    3.1 System Architecture Overview
    3.2 Textbook Preprocessing and Chunk Segmentation
    3.3 Cache Construction and Context-Extension Strategy
    3.4 Student Question Flow and Model Generation
  Chapter 4  Experimental Design and Evaluation
    4.1 Objectives
    4.2 Experimental Design
    4.3 Test Data
    4.4 Evaluation Metrics
    4.5 Implementation Environment
  Chapter 5  Results and Analysis
    5.1 Implementation Results
        5.1.1 KV-Cache System
        5.1.2 No-Cache System
        5.1.3 RAG System
    5.2 Memory-Usage Comparison
    5.3 Response-Latency Analysis
    5.4 Qualitative Observations
    5.5 Summary
    5.6 Discussion
  Chapter 6  Conclusion and Future Work
    6.1 Conclusions
    6.2 Limitations
    6.3 Future Work
  Appendix A
    A.1 Appendix Content
  References

Format: application/pdf (1,129,657 bytes)
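The chunked-prefill and KV-Cache mechanism summarized in the abstract can be illustrated with a minimal, framework-free sketch. The following toy single-head attention (pure NumPy; all class and variable names are illustrative assumptions, not the thesis's actual implementation, which runs on a real LLM) shows the property the system relies on: keys and values projected for earlier textbook chunks are cached and reused, so feeding the material chunk by chunk yields the same attention output as a one-shot prefill while each chunk is encoded only once.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ToyKVCacheAttention:
    """Single-head attention that accumulates projected keys/values in a
    cache, mimicking how an LLM reuses its KV-Cache across prefill chunks."""

    def __init__(self, d_model: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.Wq = rng.normal(size=(d_model, d_model))
        self.Wk = rng.normal(size=(d_model, d_model))
        self.Wv = rng.normal(size=(d_model, d_model))
        self.k_cache = np.empty((0, d_model))  # grows with each prefill step
        self.v_cache = np.empty((0, d_model))

    def prefill_chunk(self, chunk: np.ndarray) -> None:
        # Project only the NEW chunk; earlier chunks stay cached untouched.
        self.k_cache = np.vstack([self.k_cache, chunk @ self.Wk])
        self.v_cache = np.vstack([self.v_cache, chunk @ self.Wv])

    def attend(self, query_tokens: np.ndarray) -> np.ndarray:
        # A student question attends over every cached textbook token.
        q = query_tokens @ self.Wq
        scores = softmax(q @ self.k_cache.T / np.sqrt(self.k_cache.shape[1]))
        return scores @ self.v_cache

# Chunked prefill matches one-shot prefill on the same material:
rng = np.random.default_rng(1)
textbook = rng.normal(size=(12, 8))   # 12 "tokens" of embedded material
question = rng.normal(size=(1, 8))

chunked = ToyKVCacheAttention(8)
for chunk in np.split(textbook, 3):   # three chunked-prefill steps
    chunked.prefill_chunk(chunk)

one_shot = ToyKVCacheAttention(8)
one_shot.prefill_chunk(textbook)

assert np.allclose(chunked.attend(question), one_shot.attend(question))
```

In a real deployment the cache holds per-layer key/value tensors rather than a single projection, but the saving is the same: answering a new question costs attention over cached state instead of re-encoding the entire textbook, which is the latency advantage the experiments measure.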
