Title 基於詞組的注意力機制用於長文轉換器模型
Token-wise Attention Mechanism for Long Input Transformer Models
Author 賴建郡
Lai, Jian-Jyun
Contributors 黃瀚萱
Huang, Hen-Hsen
賴建郡
Lai, Jian-Jyun
Keywords 自然語言處理
長文處理
轉換器
注意力機制
基於詞組分析
Natural language processing
Long text processing
Transformer
Attention mechanism
Token-wise analysis
Date 2022
Uploaded 2-Dec-2022 15:20:46 (UTC+8)
Abstract 在現今的自然語言處理的領域當中,以轉換器作為基礎的模型是一個經常被使用的架構,通常來說依照使用該架構來針對大型文本進行預訓練,再針對下游不同的任務分別再進行微調被視為是有效的;在轉換器模型當中,注意力機制是該模型得以獲得資訊的關鍵,而由於注意力機制本身的架構,當字串的長度增加,使用的記憶體也會巨幅的成長,同時,轉換器模型在執行長字串的任務的表現仍舊有進步的空間。

本文嘗試以個別詞組來重新定義注意力機制觀測的範圍,分別為詞性標記和獨立的詞組注意力機制,並以一個切割注意力機制的矩陣計算方式來達到降低記憶體使用。

在長字串分類和長字串問答中,使用獨立的詞組注意力機制的模型能達到與現今的傑出長字串模型—Longformer相互競爭的表現,並相較於該模型使用較少的記憶體,使其能夠更輕易的應用於自然語言任務。
Transformer-based models are the mainstream architecture in natural language processing (NLP): pre-training on large corpora and then fine-tuning for each downstream task has proven effective. In Transformer-based models, the attention mechanism is the key to gathering information from the input sequence, but because of its architecture, its time and memory costs grow dramatically as the sequence length increases. Moreover, the performance of Transformer-based models on long-sequence tasks still leaves considerable room for improvement.

In this work, we use a token-wise approach to redefine the scope of the attention mechanism, namely POS tagging and independent token-wise attention. In addition, by splitting the computation of the attention matrix, the model occupies less memory.

On long-sequence classification and question-answering tasks, models with the independent token-wise attention mechanism achieve performance competitive with Longformer, a leading long-sequence model, while using less memory, which makes them easier to apply to NLP tasks.
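The abstract describes the approach only at a high level. As a purely illustrative sketch (not code from the thesis), the snippet below shows one way a token-wise attention limit combined with a split attention-matrix computation could be realized: each token carries its own attention window size, which could for instance be assigned from its POS tag, and the score matrix is computed one block of query rows at a time so the full L × L matrix is never materialized. The function name `tokenwise_local_attention` and the parameters `window` and `block` are hypothetical, not names from the thesis.

```python
# Illustrative sketch only -- not code from the thesis.
import torch
import torch.nn.functional as F

def tokenwise_local_attention(q, k, v, window, block=128):
    """Local self-attention where every token has its own (half-)window size.

    q, k, v: (L, d) tensors for a single head; window: (L,) integer tensor.
    Scores are computed one block of query rows at a time, so the full
    L x L attention matrix is never held in memory at once.
    """
    L, d = q.shape
    out = torch.empty_like(q)
    pos = torch.arange(L)
    for start in range(0, L, block):
        end = min(start + block, L)
        # (block, L) score slab instead of the full (L, L) matrix.
        scores = (q[start:end] @ k.T) / d ** 0.5
        # Per-token limit: key j is visible to query i only if |i - j| <= window[i].
        dist = (pos[start:end, None] - pos[None, :]).abs()
        scores = scores.masked_fill(dist > window[start:end, None], float("-inf"))
        out[start:end] = F.softmax(scores, dim=-1) @ v
    return out

# Toy usage: per-token windows could, for instance, be assigned from POS tags
# (wider windows for content words, narrower ones for function words).
L, d = 512, 64
q, k, v = (torch.randn(L, d) for _ in range(3))
window = torch.randint(2, 32, (L,))      # hypothetical per-token window sizes
print(tokenwise_local_attention(q, k, v, window).shape)  # torch.Size([512, 64])
```

With blocks of 128 query rows, the peak size of the score tensor drops from L × L to 128 × L per head, which is in the spirit of the memory reduction the abstract describes.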
References [1] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer, 2020.
[2] Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147.
[3] Johannes Kiesel, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. SemEval-2019 task 4: Hyperpartisan news detection. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 829–839, Minneapolis, Minnesota, USA, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/S19-2145. URL https://aclanthology.org/S19-2145.
[4] Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. Attention is not only a weight: Analyzing transformers with vector norms, 2020.
[5] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents, 2014.
[6] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space, 2013.
[7] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey, 2020.
[8] Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthesizer: Rethinking self-attention in transformer models, 2021.
[9] Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning, 2019.
[10] Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news, 2020.
[11] Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books, 2015.
Description Master's degree
National Chengchi University
Department of Computer Science
109753205
Source http://thesis.lib.nccu.edu.tw/record/#G0109753205
Type thesis
dc.contributor.advisor 黃瀚萱zh_TW
dc.contributor.advisor Huang, Hen-Hsenen_US
dc.contributor.author (Authors) 賴建郡zh_TW
dc.contributor.author (Authors) Lai, Jian-Jyunen_US
dc.creator (Author) 賴建郡zh_TW
dc.creator (Author) Lai, Jian-Jyunen_US
dc.date (Date) 2022en_US
dc.date.accessioned 2-Dec-2022 15:20:46 (UTC+8)-
dc.date.available 2-Dec-2022 15:20:46 (UTC+8)-
dc.date.issued (Date uploaded) 2-Dec-2022 15:20:46 (UTC+8)-
dc.identifier (Other Identifiers) G0109753205en_US
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/142642-
dc.description (Description) 碩士zh_TW
dc.description (Description) 國立政治大學zh_TW
dc.description (Description) 資訊科學系zh_TW
dc.description (Description) 109753205zh_TW
dc.description.abstract (Abstract) 在現今的自然語言處理的領域當中,以轉換器作為基礎的模型是一個經常被使用的架構,通常來說依照使用該架構來針對大型文本進行預訓練,再針對下游不同的任務分別再進行微調被視為是有效的;在轉換器模型當中,注意力機制是該模型得以獲得資訊的關鍵,而由於注意力機制本身的架構,當字串的長度增加,使用的記憶體也會巨幅的成長,同時,轉換器模型在執行長字串的任務的表現仍舊有進步的空間。

本文嘗試以個別詞組來重新定義注意力機制觀測的範圍,分別為詞性標記和獨立的詞組注意力機制,並以一個切割注意力機制的矩陣計算方式來達到降低記憶體使用。

在長字串分類和長字串問答中,使用獨立的詞組注意力機制的模型能達到與現今的傑出長字串模型—Longformer相互競爭的表現,並相較於該模型使用較少的記憶體,使其能夠更輕易的應用於自然語言任務。
zh_TW
dc.description.abstract (Abstract) Transformer-based models are the mainstream architecture in natural language processing (NLP): pre-training on large corpora and then fine-tuning for each downstream task has proven effective. In Transformer-based models, the attention mechanism is the key to gathering information from the input sequence, but because of its architecture, its time and memory costs grow dramatically as the sequence length increases. Moreover, the performance of Transformer-based models on long-sequence tasks still leaves considerable room for improvement.

In this work, we use a token-wise approach to redefine the scope of the attention mechanism, namely POS tagging and independent token-wise attention. In addition, by splitting the computation of the attention matrix, the model occupies less memory.

On long-sequence classification and question-answering tasks, models with the independent token-wise attention mechanism achieve performance competitive with Longformer, a leading long-sequence model, while using less memory, which makes them easier to apply to NLP tasks.
en_US
dc.description.tableofcontents Acknowledgements i
Abstract (Chinese) ii
Abstract iii
Contents v
List of Figures viii
1 Introduction 1
1.1 Background 1
1.2 Motivation 2
1.3 Research Goals 3
2 Related Work 5
2.1 Reducing the Computation Complexity of Transformer-based Models 5
2.1.1 Fixed Patterns 5
2.1.2 Combination of Patterns 6
2.1.3 Learnable Patterns 6
2.1.4 Memory 6
2.1.5 Low-Rank Methods 6
2.1.6 Kernels 7
2.1.7 Recurrence 7
2.2 Transformer-based Models for Long Input Sequences 7
2.3 Replacement of the Attention Matrix 8
2.4 Importance of the Attention Matrix and Input Sequences 8
3 Datasets 10
3.1 The First Stage 11
3.2 The Second Stage 12
3.3 The Third Stage 13
4 Methodology 16
4.1 Part-of-speech (POS) tagging 16
4.1.1 Global attention 17
4.1.2 Large local attention 18
4.1.3 Small local attention 18
4.1.4 Masked language modeling 18
4.2 Independent attention window size 18
4.2.1 Transform attention window size without limitation 19
4.2.2 Transform attention window size with limitation 20
4.2.3 Decrease memory consumption with independent limitation 20
4.3 A Three-Stage Computation of Attention 21
5 Experiments 23
5.1 Hyperpartisan 23
5.1.1 Evaluation of the Models Pre-trained with Continuous Task and Data 24
5.1.2 Evaluation of the Models Pre-trained on the Third Stage Data 24
5.2 TriviaQA 25
6 Analysis 28
6.1 Training with POS tagging 28
6.2 The Independent Attention Tokens’ Distribution 30
6.3 The Independent Attention Tokens’ Diversification during Pre-training 31
6.4 The Tendency of the Hyperpartisan Score on the Continuous Dataset 34
6.5 The Tendency of the Hyperpartisan Score on the Non-continuous Dataset 37
6.6 The TriviaQA Performance on Each Answer Position 37
6.7 The F1-score and EM-score on Each Answer Position in TriviaQA 42
6.8 McNemar’s Test in Hyperpartisan 42
6.9 McNemar’s Test in TriviaQA 43
6.10 VRAM Occupied 43
7 Conclusions 45
References 46
zh_TW
dc.format.extent 36365732 bytes-
dc.format.mimetype application/pdf-
dc.source.uri (Source) http://thesis.lib.nccu.edu.tw/record/#G0109753205en_US
dc.subject (Keyword) 自然語言處理zh_TW
dc.subject (Keyword) 長文處理zh_TW
dc.subject (Keyword) 轉換器zh_TW
dc.subject (Keyword) 注意力機制zh_TW
dc.subject (Keyword) 基於詞組分析zh_TW
dc.subject (Keyword) Natural language processingen_US
dc.subject (Keyword) Long text processingen_US
dc.subject (Keyword) Transformeren_US
dc.subject (Keyword) Attention mechanismen_US
dc.subject (Keyword) Token-wise analysisen_US
dc.title (Title) 基於詞組的注意力機制用於長文轉換器模型zh_TW
dc.title (Title) Token-wise Attention Mechanism for Long Input Transformer Modelsen_US
dc.type (Type) thesisen_US
dc.relation.reference (References) [1] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer, 2020.
[2] Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147.
[3] Johannes Kiesel, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. SemEval-2019 task 4: Hyperpartisan news detection. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 829–839, Minneapolis, Minnesota, USA, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/S19-2145. URL https://aclanthology.org/S19-2145.
[4] Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. Attention is not only a weight: Analyzing transformers with vector norms, 2020.
[5] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents, 2014.
[6] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space, 2013.
[7] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey, 2020.
[8] Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthesizer: Rethinking self-attention in transformer models, 2021.
[9] Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning, 2019.
[10] Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news, 2020.
[11] Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books, 2015.
zh_TW
dc.identifier.doi (DOI) 10.6814/NCCU202201686en_US