Title: RAG-Free Contextual Extension for LLMs: A Study on KV-Cache and Calculus AI Tutoring
Title (Chinese): 大型語言模型的非檢索式上下文延展機制研究:從鍵值緩存到微積分AI教師
Author: Sun, Yi-Jia (孫翊珈)
Advisor: Tsai, Yen-Lung (蔡炎龍)
Keywords: Large Language Models; Key-Value Cache; Context Extension; AI Tutoring System; Retrieval-Free Generation
Date: 2025
Uploaded: 1-Sep-2025 16:30:03 (UTC+8)

Abstract
With the advancement of Large Language Models (LLMs), their integration into educational applications has attracted increasing attention. However, LLMs are constrained by their fixed context window size, making it difficult to handle long instructional materials and maintain coherent multi-turn teaching dialogues. While Retrieval-Augmented Generation (RAG) alleviates some knowledge limitations by incorporating external retrieval, it often introduces retrieval bias and context fragmentation, reducing its effectiveness in educational scenarios. This study proposes a retrieval-free context extension approach based on the Key-Value Cache (KV-Cache) and implements a calculus-focused AI tutoring system. The system incrementally feeds LaTeX-based calculus textbooks into the model using a chunked prefill strategy, caching intermediate computations to enable consistent context retention and improved semantic coherence in subsequent teaching interactions. The experiments compare the proposed system with RAG-based and non-caching baselines, focusing on response latency and teaching continuity. The results demonstrate that the KV-Cache mechanism effectively enhances contextual coherence and significantly reduces response latency in long-text teaching scenarios, showing strong potential for future AI-driven educational systems.

References
[1] Jie Hu, Shengnan Wang, Yutong He, Ping Gong, Jiawei Yi, Juncheng Zhang, Youhui Bai, Renhai Chen, Gong Zhang, Cheng Li, et al. Efficient long-context LLM inference via KV cache clustering. arXiv preprint arXiv:2506.11418, 2025.
[2] Neusha Javidnia, Bita Darvish Rouhani, and Farinaz Koushanfar. Key, value, compress: A systematic exploration of KV cache compression techniques. In 2025 IEEE Custom Integrated Circuits Conference (CICC), pages 1–3. IEEE, 2025.
[3] Jushi Kai, Boyi Zeng, Yixuan Wang, Haoli Bai, Ziwei He, Bo Jiang, and Zhouhan Lin. FreqKV: Frequency domain key-value compression for efficient context window extension. arXiv preprint arXiv:2505.00570, 2025.
[4] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
[5] Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, and Minyi Guo. ClusterKV: Manipulating LLM KV cache in semantic space for recallable compression. arXiv preprint arXiv:2412.03213, 2024.
[6] A. Palu and B. Smith. KV-cache compression with low-rank projection. In International Conference on Learning Representations (ICLR), 2024.
[7] Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, and Yuxiong He. SwiftKV: Fast prefill-optimized inference with knowledge-preserving model transformation. arXiv preprint arXiv:2410.03960, 2024.
[8] Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, and Beidi Chen. ShadowKV: KV cache in shadows for high-throughput long-context LLM inference. arXiv preprint arXiv:2410.21465, 2024.
[9] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[10] Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Multilingual E5 text embeddings: A technical report. arXiv preprint arXiv:2402.05672, 2024.
[11] Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, and Deyu Zhou. SCOPE: Optimizing key-value cache compression in long-context generation. arXiv preprint arXiv:2412.13649, 2024.
[12] Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, and Shiyu Chang. KVLink: Accelerating large language models via efficient KV cache reuse. arXiv preprint arXiv:2502.16002, 2025.

Degree: Master's (碩士)
Institution: National Chengchi University (國立政治大學)
Department: Department of Applied Mathematics (應用數學系)
Student ID: 111751001
Other Identifier: G0111751001
Source: http://thesis.lib.nccu.edu.tw/record/#G0111751001
URI: https://nccur.lib.nccu.edu.tw/handle/140.119/159317
Type: thesis
Table of Contents
  Acknowledgements
  Chinese Abstract
  Abstract
  Contents
  List of Tables
  List of Figures
  Chapter 1  Introduction
  Chapter 2  Related Techniques
    2.1 Transformer Architecture and the Attention Mechanism
    2.2 Extending the Context Limits of Large Language Models
        2.2.1 Positional-Encoding Extension: Rotary Position Embedding (RoPE)
        2.2.2 Positional-Encoding Extension: Attention with Linear Biases (ALiBi)
    2.3 Overview of Retrieval-Augmented Generation (RAG)
    2.4 Key-Value Cache (KV-Cache)
    2.5 Overview of Cache-Augmented Generation (CAG)
  Chapter 3  System Design and Implementation
    3.1 System Architecture Overview
    3.2 Textbook Preprocessing and Chunk Segmentation
    3.3 Cache Construction and Context-Extension Strategy
    3.4 Student Question Flow and Model Generation
  Chapter 4  Experimental Design and Evaluation
    4.1 Objectives
    4.2 Experimental Design
    4.3 Test Data
    4.4 Evaluation Metrics
    4.5 Implementation Environment
  Chapter 5  Results and Analysis
    5.1 Implementation Results
        5.1.1 KV-Cache System
        5.1.2 No-Cache System
        5.1.3 RAG System
    5.2 Memory-Usage Comparison
    5.3 Response-Latency Analysis
    5.4 Qualitative Observations
    5.5 Summary
    5.6 Discussion
  Chapter 6  Conclusion and Future Work
    6.1 Conclusions
    6.2 Limitations
    6.3 Future Work
  Appendix A
    A.1 Appendix Content
  References

Format: application/pdf (1,129,657 bytes)
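The chunked-prefill and KV-Cache mechanism summarized in the abstract can be illustrated with a minimal, framework-free sketch. The following toy single-head attention (pure NumPy; all class and variable names are illustrative assumptions, not the thesis's actual implementation, which runs on a real LLM) shows the property the system relies on: keys and values projected for earlier textbook chunks are cached and reused, so feeding the material chunk by chunk yields the same attention output as a one-shot prefill while each chunk is encoded only once.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ToyKVCacheAttention:
    """Single-head attention that accumulates projected keys/values in a
    cache, mimicking how an LLM reuses its KV-Cache across prefill chunks."""

    def __init__(self, d_model: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.Wq = rng.normal(size=(d_model, d_model))
        self.Wk = rng.normal(size=(d_model, d_model))
        self.Wv = rng.normal(size=(d_model, d_model))
        self.k_cache = np.empty((0, d_model))  # grows with each prefill step
        self.v_cache = np.empty((0, d_model))

    def prefill_chunk(self, chunk: np.ndarray) -> None:
        # Project only the NEW chunk; earlier chunks stay cached untouched.
        self.k_cache = np.vstack([self.k_cache, chunk @ self.Wk])
        self.v_cache = np.vstack([self.v_cache, chunk @ self.Wv])

    def attend(self, query_tokens: np.ndarray) -> np.ndarray:
        # A student question attends over every cached textbook token.
        q = query_tokens @ self.Wq
        scores = softmax(q @ self.k_cache.T / np.sqrt(self.k_cache.shape[1]))
        return scores @ self.v_cache

# Chunked prefill matches one-shot prefill on the same material:
rng = np.random.default_rng(1)
textbook = rng.normal(size=(12, 8))   # 12 "tokens" of embedded material
question = rng.normal(size=(1, 8))

chunked = ToyKVCacheAttention(8)
for chunk in np.split(textbook, 3):   # three chunked-prefill steps
    chunked.prefill_chunk(chunk)

one_shot = ToyKVCacheAttention(8)
one_shot.prefill_chunk(textbook)

assert np.allclose(chunked.attend(question), one_shot.attend(question))
```

In a real deployment the cache holds per-layer key/value tensors rather than a single projection, but the saving is the same: answering a new question costs attention over cached state instead of re-encoding the entire textbook, which is the latency advantage the experiments measure.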
