Title: Token-wise Attention Mechanism for Long Input Transformer Models (基於詞組的注意力機制用於長文轉換器模型)
Author: Lai, Jian-Jyun (賴建郡)
Advisor: Huang, Hen-Hsen (黃瀚萱)
Keywords: Natural language processing; Long text processing; Transformer; Attention mechanism; Token-wise analysis
Date: 2022
Uploaded: 2-Dec-2022 15:20:46 (UTC+8)

Abstract: Transformer-based models are the mainstream architecture in natural language processing (NLP): pre-training on a large corpus and then fine-tuning on each downstream task has proven effective. In these models, the attention mechanism is how information is gathered across a sequence, but because of its structure, computation becomes slow and memory use grows drastically as sequences get longer, and the performance of Transformer-based models on long-sequence tasks still leaves considerable room for improvement. This work redefines the scope that the attention mechanism observes on a token-wise basis, in two variants: POS tagging and an independent token-wise attention mechanism. In addition, the computation of the attention matrix is split into pieces so that the model occupies less memory. On long-sequence classification and question-answering tasks, models with the independent token-wise attention mechanism achieve performance competitive with Longformer, a strong long-sequence model, while using less memory, which makes them easier to apply to NLP tasks.
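The following is a minimal sketch of the two ideas described in the abstract: giving each token its own attention window (the independent token-wise attention) and computing the masked attention matrix in chunks of queries so the full L x L score matrix is never held in memory at once. It is not the thesis's released implementation; the PyTorch framing, the function name, and the window assignments in the usage example are illustrative assumptions only.

```python
# Illustrative sketch only: per-token ("independent") attention windows plus
# chunked computation of the masked attention matrix. Names and shapes are
# assumptions for this example, not the thesis implementation.
import torch
import torch.nn.functional as F


def chunked_tokenwise_attention(q, k, v, window_sizes, chunk_size=128):
    """Scaled dot-product attention where query token i may only attend to
    positions j with |i - j| <= window_sizes[i], computed in query chunks so
    only a (chunk_size x L) slice of the score matrix exists at a time.

    q, k, v: (L, d) tensors; window_sizes: (L,) non-negative integer tensor.
    """
    L, d = q.shape
    positions = torch.arange(L, device=q.device)
    outputs = []
    for start in range(0, L, chunk_size):
        end = min(start + chunk_size, L)
        # Per-token window mask for this chunk of queries: |i - j| <= w[i].
        allowed = (positions[None, :] - positions[start:end, None]).abs() \
            <= window_sizes[start:end, None]                    # (chunk, L)
        scores = q[start:end] @ k.T / d ** 0.5                  # (chunk, L)
        scores = scores.masked_fill(~allowed, float("-inf"))
        outputs.append(F.softmax(scores, dim=-1) @ v)           # (chunk, d)
    return torch.cat(outputs, dim=0)                            # (L, d)


# Hypothetical usage: most tokens keep a small local window, while a few
# selected tokens attend over the whole sequence.
L, d = 1024, 64
q = k = v = torch.randn(L, d)
w = torch.full((L,), 32)   # small local window for ordinary tokens
w[::128] = L               # global-style window for every 128th token (illustrative)
out = chunked_tokenwise_attention(q, k, v, w)
print(out.shape)           # torch.Size([1024, 64])
```

With per-token windows, tokens judged important (for example, by their POS tag) could be assigned a window covering the whole sequence while ordinary tokens keep a small local window, which is the spirit of the global/large-local/small-local split in the methodology chapter.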
Description: Master's thesis, Department of Computer Science, National Chengchi University (國立政治大學), 109753205
Source: http://thesis.lib.nccu.edu.tw/record/#G0109753205
Type: thesis
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/142642
Format: application/pdf (36,365,732 bytes)

Table of Contents:
  Acknowledgements (誌謝)
  Abstract (Chinese)
  Abstract (English)
  Contents
  List of Figures
  1 Introduction
    1.1 Background
    1.2 Motivation
    1.3 Research Goals
  2 Related Work
    2.1 Reducing the Computation Complexity of Transformer-based Models
      2.1.1 Fixed Patterns
      2.1.2 Combination of Patterns
      2.1.3 Learnable Patterns
      2.1.4 Memory
      2.1.5 Low-Rank Methods
      2.1.6 Kernels
      2.1.7 Recurrence
    2.2 Transformer-based Models for Long Input Sequences
    2.3 Replacement of the Attention Matrix
    2.4 Importance of the Attention Matrix and Input Sequences
  3 Datasets
    3.1 The First Stage
    3.2 The Second Stage
    3.3 The Third Stage
  4 Methodology
    4.1 Part-of-Speech (POS) Tagging
      4.1.1 Global Attention
      4.1.2 Large Local Attention
      4.1.3 Small Local Attention
      4.1.4 Mask Language Modeling
    4.2 Independent Attention Window Size
      4.2.1 Transform Attention Window Size without Limitation
      4.2.2 Transform Attention Window Size with Limitation
      4.2.3 Decrease Memory Consuming with Independent Limitation
    4.3 A Three Stage of Computing Attention
  5 Experiments
    5.1 Hyperpartisan
      5.1.1 Evaluation of the Models Pre-trained with Continuous Task and Data
      5.1.2 Evaluation of the Models Pre-trained on the Third Stage Data
    5.2 TriviaQA
  6 Analysis
    6.1 Training with POS Taggings
    6.2 The Independent Attention Tokens' Distribution
    6.3 The Independent Attention Tokens' Diversification during Pre-training
    6.4 The Tendency of Hyperpartisan Score on Continuous Dataset
    6.5 The Tendency of Hyperpartisan Score on Not Continuous Dataset
    6.6 The TriviaQA Performance on Each Answer Position
    6.7 The F1-score and EM-score on Each Answer Position in TriviaQA
    6.8 McNemar's Test in Hyperpartisan
    6.9 McNemar's Test in TriviaQA
    6.10 VRAM Occupied
  7 Conclusions
  References

References:
[1] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer, 2020.
[2] Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147.
[3] Johannes Kiesel, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. SemEval-2019 task 4: Hyperpartisan news detection. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 829–839, Minneapolis, Minnesota, USA, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/S19-2145. URL https://aclanthology.org/S19-2145.
[4] Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. Attention is not only a weight: Analyzing transformers with vector norms, 2020.
[5] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents, 2014.
[6] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space, 2013.
[7] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey, 2020.
[8] Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthesizer: Rethinking self-attention in transformer models, 2021.
[9] Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning, 2019.
[10] Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news, 2020.
[11] Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books, 2015.

DOI: 10.6814/NCCU202201686