Title: 結合大型語言模型之代理用於 Android App 錯誤重現任務 (Combining Large Language Models for Agent Tasks in Android App Bug Reproduction)
Author: 黃毓學
Advisor: 蔡子傑
Keywords: Automated Bug Reproduction; Software Testing and Debugging; Large Language Models; Prompt Engineering; Android App
Date: 2025
Uploaded: 3-Mar-2025 14:28:52 (UTC+8)

Abstract:
The research fields of agent tasks and large language models (LLMs) continue to influence each other: agent tasks expand the kinds of data available to LLMs, while LLMs solve agent problems that reinforcement learning and supervised learning previously could not, and combining the two has become a trend. This thesis explores using an LLM as an agent to reproduce bug reports for Android apps when so many reproduction steps are missing that reinforcement learning cannot reproduce the bug reliably. Reformulating the task shifts the difficulty of reward design in reinforcement learning to crafting appropriate prompts for the LLM, including the use of log parsing tools to limit the damage long contexts do to the accuracy of the LLM's generated text.

Drawing on ideas from reinforcement learning while relying heavily on the LLM, the thesis curbs inefficient exploration of a large state space with the concept of subgoal regions: the LLM identifies only the regions highly related to the target sentence, reducing the number of candidates to search and compare. The problem is decomposed into subtasks an LLM agent can execute, in a workflow of subgoal-region mapping, static planning, dynamic adjustment, and dynamic exploration that applies the LLM's planning, reasoning, and text extraction-and-substitution capabilities. The contribution of this thesis is the prompt engineering needed to bring LLMs into bug reproduction when a large portion of the description is missing.

The LLM's planning and reasoning capabilities were evaluated on each subtask in the workflow. In the subgoal-region subtask, GPT-4 mapped to the correct target region with Top-1 accuracy of 57% and Top-2 accuracy of 100%. In the static-planning subtask, the LLM achieved Top-1 accuracy of 42%, Top-2 accuracy of 71%, and Top-3 accuracy of 100%. To reduce the impact of long contexts, which can cause inaccurate generation, the Spell algorithm, an event-log parameter extraction tool, was applied; with it the LLM reached 90% accuracy on the string extraction subtask.

However, in two subtasks, substituting the extracted text and dynamically generating suggested actions, the LLM showed high false-positive rates. This cannot be tolerated in bug reproduction, since it can leave the basis of subsequent reproduction inconsistent with the user's description; the result shows there is still room for improvement before LLM agents fully automate bug reproduction. Future work includes reasoning-oriented language models and OpenAI's recently proposed reinforcement fine-tuning, adjusting through training how the LLM orders its outputs, so that LLM agents perform more accurately on this specific task and bug reproduction becomes fully automated.
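To make the subgoal-region step concrete, here is a minimal sketch (not the thesis's actual prompts or code; the prompt wording, region names, and the `rank_subgoal_regions` helper are assumptions) of asking an LLM to rank candidate UI regions against a target sentence so that exploration can be restricted to the top-ranked regions:

```python
# Hypothetical sketch of subgoal-region ranking with an LLM.
# Assumes the openai Python client (>= 1.0) and OPENAI_API_KEY in the
# environment; prompt wording and parsing are illustrative only.
from openai import OpenAI

client = OpenAI()

def rank_subgoal_regions(target_sentence: str, regions: list[str], k: int = 2) -> list[str]:
    """Ask the LLM which candidate UI regions best match the target sentence."""
    numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(regions))
    prompt = (
        "A bug report step must be reproduced in an Android app.\n"
        f"Step: {target_sentence}\n"
        f"Candidate UI regions:\n{numbered}\n"
        f"Return the {k} most relevant region numbers, comma-separated."
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content
    # Parse the returned numbers back into region names, preserving rank order.
    picked = [int(tok) - 1 for tok in reply.replace(",", " ").split() if tok.isdigit()]
    return [regions[i] for i in picked if 0 <= i < len(regions)][:k]

# Example (hypothetical region names):
# regions = ["Deck list", "Card browser", "Settings > Backup", "Statistics"]
# rank_subgoal_regions("Tap the backup option in settings", regions)
```

Restricting the search to the top-k regions is what turns the Top-1/Top-2 accuracy figures above into a practical bound on wasted exploration.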
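The Spell algorithm [12] matches each incoming log line against learned templates by longest common subsequence and treats the non-matching tokens as extracted parameters. A simplified sketch of that idea follows; positional token matching on equal-length lines stands in for Spell's true LCS matching and prefix-tree lookup, so this is an assumption-laden toy rather than the published algorithm:

```python
# Simplified Spell-style log parsing (after reference [12]): match each
# log line against stored templates; tokens that disagree become "*"
# parameter slots and are returned as extracted parameters.
templates: list[list[str]] = []  # each template is a token list; "*" marks a parameter

def parse_line(line: str) -> tuple[list[str], list[str]]:
    """Return (template, parameters) for one log line, updating templates."""
    tokens = line.split()
    for tpl in templates:
        if len(tpl) != len(tokens):
            continue  # toy restriction; real Spell handles unequal lengths via LCS
        matched = sum(a == b for a, b in zip(tpl, tokens))
        if matched >= len(tokens) / 2:  # Spell's merge threshold
            params = [t for a, t in zip(tpl, tokens) if a != t]
            tpl[:] = [a if a == t else "*" for a, t in zip(tpl, tokens)]
            return tpl[:], params
    templates.append(tokens[:])  # first occurrence of this message shape
    return tokens[:], []

# Two similar event-log lines collapse into one template plus parameters:
# parse_line("Opened deck MyDeck with 42 cards")
# parse_line("Opened deck Vocab with 7 cards")
#   -> (['Opened', 'deck', '*', 'with', '*', 'cards'], ['Vocab', '7'])
```

Handing the LLM the short template-plus-parameters form instead of the raw log is what keeps the context small in the string extraction subtask.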
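The Top-k accuracy figures quoted above follow the usual definition: a case counts as correct when the ground truth appears among the model's k highest-ranked outputs. A self-contained helper for illustration (the example data are hypothetical, not the thesis's evaluation set):

```python
# Top-k accuracy: fraction of cases whose ground truth appears in the
# model's k top-ranked candidates.
def top_k_accuracy(ranked_predictions: list[list[str]], ground_truth: list[str], k: int) -> float:
    hits = sum(truth in preds[:k] for preds, truth in zip(ranked_predictions, ground_truth))
    return hits / len(ground_truth)

# Hypothetical subgoal-region outputs for three bug reports:
# preds = [["Settings", "Deck list"], ["Deck list", "Browser"], ["Browser", "Settings"]]
# truth = ["Settings", "Browser", "Browser"]
# top_k_accuracy(preds, truth, k=1)  # -> 0.67
# top_k_accuracy(preds, truth, k=2)  # -> 1.0
```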
References
[1] Zhang, Z., Winn, R., Zhao, Y., Yu, T., & Halfond, W. G. J. (2023). Automatically Reproducing Android Bug Reports Using Natural Language Processing and Reinforcement Learning. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, Seattle, WA, USA. https://doi.org/10.1145/3597926.3598066
[2] Zhang, Z., Tawsif, F. M., Ryu, K., Yu, T., & Halfond, W. G. J. (2024). Mobile Bug Report Reproduction via Global Search on the App UI Model. Proc. ACM Softw. Eng., 1(FSE), Article 117. https://doi.org/10.1145/3660824
[3] Ran, D., Wang, H., Song, Z., Wu, M., Cao, Y., Zhang, Y., Yang, W., & Xie, T. (2024). Guardian: A Runtime Framework for LLM-Based UI Exploration. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, Vienna, Austria. https://doi.org/10.1145/3650212.3680334
[4] Dziri, N., Lu, X., Sclar, M., Li, X. L., Jiang, L., Yuchen Lin, B., West, P., Bhagavatula, C., Le Bras, R., Hwang, J. D., Sanyal, S., Welleck, S., Ren, X., Ettinger, A., Harchaoui, Z., & Choi, Y. (2023). Faith and Fate: Limits of Transformers on Compositionality. arXiv:2305.18654.
[5] Liu, E. Z., et al. (2018). Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration. arXiv:1802.08802.
[6] Kim, G., et al. (2023). Language Models can Solve Computer Tasks. arXiv:2303.17491.
[7] Xi, Z., et al. (2023). The Rise and Potential of Large Language Model Based Agents: A Survey. arXiv:2309.07864.
[8] Jothimurugan, K., Bastani, O., & Alur, R. (2020). Abstract Value Iteration for Hierarchical Reinforcement Learning. arXiv:2010.15638.
[9] Du, Y., Watkins, O., Wang, Z., Colas, C., Darrell, T., Abbeel, P., Gupta, A., & Andreas, J. (2023). Guiding Pretraining in Reinforcement Learning with Large Language Models. arXiv:2302.06692.
[10] Song, C. H., Wu, J., Washington, C., Sadler, B. M., Chao, W.-L., & Su, Y. (2022). LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models. arXiv:2212.04088.
[11] Lan, Y., Lu, Y., Li, Z., Pan, M., Yang, W., Zhang, T., & Li, X. (2024). Deeply Reinforcing Android GUI Testing with Deep Reinforcement Learning. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, Lisbon, Portugal. https://doi.org/10.1145/3597503.3623344
[12] Du, M., & Li, F. (2016). Spell: Streaming Parsing of System Event Logs. In 2016 IEEE 16th International Conference on Data Mining (ICDM).
[13] Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., & Dormann, N. (2021). Stable-Baselines3: Reliable Reinforcement Learning Implementations. J. Mach. Learn. Res., 22(1), Article 268.
[14] Huang, S., & Ontañón, S. (2020). A Closer Look at Invalid Action Masking in Policy Gradient Algorithms. arXiv:2006.14171.
[15] Vaswani, A., et al. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems.
[16] Gaon, M., & Brafman, R. (2020). Reinforcement Learning with Non-Markovian Rewards. In Proceedings of the AAAI Conference on Artificial Intelligence.
[17] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models Are Few-Shot Learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada.
[18] Ye, J., Chen, X., Xu, N., Zu, C., Shao, Z., Liu, S., Cui, Y., Zhou, Z., Gong, C., Shen, Y., Zhou, J., Chen, S., Gui, T., Zhang, Q., & Huang, X. (2023). A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models. arXiv:2303.10420.
[19] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., Bello, I., … Zoph, B. (2023). GPT-4 Technical Report. arXiv:2303.08774.
[20] Du, M., Li, F., Zheng, G., & Srikumar, V. (2017). DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, Dallas, Texas, USA. https://doi.org/10.1145/3133956.3134015
[21] Wang, W., Bao, H., Huang, S., Dong, L., & Wei, F. (2021). MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.
[22] UI Automator. https://developer.android.com/training/testing/ui-automator
[23] Bug Report: AnkiDroid #6432 (2020). https://github.com/ankidroid/Anki-Android/issues/6432
[24] ReproBot Website (2023). https://sites.google.com/usc.edu/reprobot/home
[25] Wang, D., Zhao, Y., Feng, S., Zhang, Z., Halfond, W. G. J., Chen, C., Sun, X., Shi, J., & Yu, T. (2024). Feedback-Driven Automated Whole Bug Report Reproduction for Android Apps. arXiv:2407.05165.
[26] Peng, A., Sucholutsky, I., Li, B. Z., Sumers, T. R., Griffiths, T. L., Andreas, J., & Shah, J. A. (2024). Learning with Language-Guided State Abstractions. arXiv:2402.18759.

Description: Master's thesis
National Chengchi University
In-service Master's Program, Department of Computer Science
Student ID: 110971022
Source: http://thesis.lib.nccu.edu.tw/record/#G0110971022
URI: https://nccur.lib.nccu.edu.tw/handle/140.119/155990
Type: thesis
Table of Contents
Chapter 1: Introduction
 1.1 Background
 1.2 The Bug Reproduction Task
 1.3 Motivation and Goals
 1.4 Related Work
 1.5 Chapter Overview
Chapter 2: Theoretical Foundations of the Method
 2.1 Reference Works for Method Construction
 2.2 A Framework for the Bug Reproduction Agent Task: Text Normalization
 2.3 Reducing Search-Space Complexity: Partitioning Subgoal Regions
 2.4 Subgoal-Region Matching with a Pretrained Language Model
 2.5 Generating Low-Level Action Plans
 2.6 Replacing the Exploration Reward Function with Language Goals
 2.7 Large Language Models for Computer Tasks in Human Language
 2.8 Information Extraction to Shorten Context
Chapter 3: Experimental Method
 3.1 Section Overview
 3.2 Space Partitioning and Subgoal-Region Paths
 3.3 System Architecture
 3.4 Converting Sentences into Bug Reproduction Steps and Matching
 3.5 Partitioning Subgoal Regions
 3.6 Dynamic Exploration and Static-Plan Revision
Chapter 4: Experimental Results and Analysis
 4.1 Experiment Description
 4.2 Results
 4.3 Analysis
Chapter 5: Conclusion and Future Work
References
