Title 基於詞組的注意力機制用於長文轉換器模型
Token-wise Attention Mechanism for Long Input Transformer Models
Author 賴建郡
Lai, Jian-Jyun
Contributors 黃瀚萱
Huang, Hen-Hsen
賴建郡
Lai, Jian-Jyun
Keywords 自然語言處理
長文處理
轉換器
注意力機制
基於詞組分析
Natural language processing
Long text processing
Transformer
Attention mechanism
Token-wise analysis
Date 2022
Uploaded 2-Dec-2022 15:20:46 (UTC+8)
Abstract 在現今的自然語言處理的領域當中,以轉換器作為基礎的模型是一個經常被使用的架構,通常來說依照使用該架構來針對大型文本進行預訓練,再針對下游不同的任務分別再進行微調被視為是有效的;在轉換器模型當中,注意力機制是該模型得以獲得資訊的關鍵,而由於注意力機制本身的架構,當字串的長度增加,使用的記憶體也會巨幅的成長,同時,轉換器模型在執行長字串的任務的表現仍舊有進步的空間。

本文嘗試以個別詞組來重新定義注意力機制觀測的範圍,分別為詞性標記和獨立的詞組注意力機制,並以一個切割注意力機制的矩陣計算方式來達到降低記憶體使用。

在長字串分類和長字串問答中,使用獨立的詞組注意力機制的模型能達到與現今的傑出長字串模型—Longformer相互競爭的表現,並相較於該模型使用較少的記憶體,使其能夠更輕易的應用於自然語言任務。
Transformer-based models are the mainstream architecture in natural language processing (NLP): pre-training on large corpora and then fine-tuning for each downstream task has proven effective. In Transformer-based models, the attention mechanism is the key to gathering information from the input sequence, but because of its architecture, its time and memory costs grow dramatically as the sequence length increases. Moreover, the performance of Transformer-based models on long-sequence tasks still leaves considerable room for improvement.

In this work, we use a token-wise approach to redefine the scope of the attention mechanism, namely POS tagging and independent token-wise attention. In addition, by splitting the computation of the attention matrix, the model occupies less memory.

On long-sequence classification and question-answering tasks, models with the independent token-wise attention mechanism achieve performance competitive with Longformer, a leading long-sequence model, while using less memory, which makes them easier to apply to NLP tasks.
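The abstract describes the approach only at a high level. As a purely illustrative sketch (not code from the thesis), the snippet below shows one way a token-wise attention limit combined with a split attention-matrix computation could be realized: each token carries its own attention window size, which could for instance be assigned from its POS tag, and the score matrix is computed one block of query rows at a time so the full L × L matrix is never materialized. The function name `tokenwise_local_attention` and the parameters `window` and `block` are hypothetical, not names from the thesis.

```python
# Illustrative sketch only -- not code from the thesis.
import torch
import torch.nn.functional as F

def tokenwise_local_attention(q, k, v, window, block=128):
    """Local self-attention where every token has its own (half-)window size.

    q, k, v: (L, d) tensors for a single head; window: (L,) integer tensor.
    Scores are computed one block of query rows at a time, so the full
    L x L attention matrix is never held in memory at once.
    """
    L, d = q.shape
    out = torch.empty_like(q)
    pos = torch.arange(L)
    for start in range(0, L, block):
        end = min(start + block, L)
        # (block, L) score slab instead of the full (L, L) matrix.
        scores = (q[start:end] @ k.T) / d ** 0.5
        # Per-token limit: key j is visible to query i only if |i - j| <= window[i].
        dist = (pos[start:end, None] - pos[None, :]).abs()
        scores = scores.masked_fill(dist > window[start:end, None], float("-inf"))
        out[start:end] = F.softmax(scores, dim=-1) @ v
    return out

# Toy usage: per-token windows could, for instance, be assigned from POS tags
# (wider windows for content words, narrower ones for function words).
L, d = 512, 64
q, k, v = (torch.randn(L, d) for _ in range(3))
window = torch.randint(2, 32, (L,))      # hypothetical per-token window sizes
print(tokenwise_local_attention(q, k, v, window).shape)  # torch.Size([512, 64])
```

With blocks of 128 query rows, the peak size of the score tensor drops from L × L to 128 × L per head, which is in the spirit of the memory reduction the abstract describes.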
References [1] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer, 2020.
[2] Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147.
[3] Johannes Kiesel, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. SemEval-2019 task 4: Hyperpartisan news detection. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 829–839, Minneapolis, Minnesota, USA, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/S19-2145. URL https://aclanthology.org/S19-2145.
[4] Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. Attention is not only a weight: Analyzing transformers with vector norms, 2020.
[5] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents, 2014.
[6] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space, 2013.
[7] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey, 2020.
[8] Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthesizer: Rethinking self-attention in transformer models, 2021.
[9] Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning, 2019.
[10] Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news, 2020.
[11] Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books, 2015.
Description Master's degree
National Chengchi University
Department of Computer Science
109753205
Source http://thesis.lib.nccu.edu.tw/record/#G0109753205
Type thesis
dc.contributor.advisor 黃瀚萱zh_TW
dc.contributor.advisor Huang, Hen-Hsenen_US
dc.contributor.author (Authors) 賴建郡zh_TW
dc.contributor.author (Authors) Lai, Jian-Jyunen_US
dc.creator (Author) 賴建郡zh_TW
dc.creator (Author) Lai, Jian-Jyunen_US
dc.date (Date) 2022en_US
dc.date.accessioned 2-Dec-2022 15:20:46 (UTC+8)-
dc.date.available 2-Dec-2022 15:20:46 (UTC+8)-
dc.date.issued (Date uploaded) 2-Dec-2022 15:20:46 (UTC+8)-
dc.identifier (Other Identifiers) G0109753205en_US
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/142642-
dc.description (Description) 碩士zh_TW
dc.description (Description) 國立政治大學zh_TW
dc.description (Description) 資訊科學系zh_TW
dc.description (Description) 109753205zh_TW
dc.description.abstract (Abstract) 在現今的自然語言處理的領域當中,以轉換器作為基礎的模型是一個經常被使用的架構,通常來說依照使用該架構來針對大型文本進行預訓練,再針對下游不同的任務分別再進行微調被視為是有效的;在轉換器模型當中,注意力機制是該模型得以獲得資訊的關鍵,而由於注意力機制本身的架構,當字串的長度增加,使用的記憶體也會巨幅的成長,同時,轉換器模型在執行長字串的任務的表現仍舊有進步的空間。

本文嘗試以個別詞組來重新定義注意力機制觀測的範圍,分別為詞性標記和獨立的詞組注意力機制,並以一個切割注意力機制的矩陣計算方式來達到降低記憶體使用。

在長字串分類和長字串問答中,使用獨立的詞組注意力機制的模型能達到與現今的傑出長字串模型—Longformer相互競爭的表現,並相較於該模型使用較少的記憶體,使其能夠更輕易的應用於自然語言任務。
zh_TW
dc.description.abstract (Abstract) Transformer-based models are the mainstream architecture in natural language processing (NLP): pre-training on large corpora and then fine-tuning for each downstream task has proven effective. In Transformer-based models, the attention mechanism is the key to gathering information from the input sequence, but because of its architecture, its time and memory costs grow dramatically as the sequence length increases. Moreover, the performance of Transformer-based models on long-sequence tasks still leaves considerable room for improvement.

In this work, we use a token-wise approach to redefine the scope of the attention mechanism, namely POS tagging and independent token-wise attention. In addition, by splitting the computation of the attention matrix, the model occupies less memory.

On long-sequence classification and question-answering tasks, models with the independent token-wise attention mechanism achieve performance competitive with Longformer, a leading long-sequence model, while using less memory, which makes them easier to apply to NLP tasks.
en_US
dc.description.tableofcontents Acknowledgements i
Abstract (Chinese) ii
Abstract iii
Contents v
List of Figures viii
1 Introduction 1
1.1 Background 1
1.2 Motivation 2
1.3 Research Goals 3
2 Related Work 5
2.1 Reducing the Computation Complexity of Transformer-based Models 5
2.1.1 Fixed Patterns 5
2.1.2 Combination of Patterns 6
2.1.3 Learnable Patterns 6
2.1.4 Memory 6
2.1.5 Low-Rank Methods 6
2.1.6 Kernels 7
2.1.7 Recurrence 7
2.2 Transformer-based Models for Long Input Sequences 7
2.3 Replacement of the Attention Matrix 8
2.4 Importance of the Attention Matrix and Input Sequences 8
3 Datasets 10
3.1 The First Stage 11
3.2 The Second Stage 12
3.3 The Third Stage 13
4 Methodology 16
4.1 Part-of-speech (POS) tagging 16
4.1.1 Global attention 17
4.1.2 Large local attention 18
4.1.3 Small local attention 18
4.1.4 Masked language modeling 18
4.2 Independent attention window size 18
4.2.1 Transform attention window size without limitation 19
4.2.2 Transform attention window size with limitation 20
4.2.3 Decrease memory consumption with independent limitation 20
4.3 A Three-Stage Computation of Attention 21
5 Experiments 23
5.1 Hyperpartisan 23
5.1.1 Evaluation of the Models Pre-trained with Continuous Task and Data 24
5.1.2 Evaluation of the Models Pre-trained on the Third Stage Data 24
5.2 TriviaQA 25
6 Analysis 28
6.1 Training with POS tagging 28
6.2 The Independent Attention Tokens’ Distribution 30
6.3 The Independent Attention Tokens’ Diversification during Pre-training 31
6.4 The Tendency of the Hyperpartisan Score on the Continuous Dataset 34
6.5 The Tendency of the Hyperpartisan Score on the Non-continuous Dataset 37
6.6 The TriviaQA Performance on Each Answer Position 37
6.7 The F1-score and EM-score on Each Answer Position in TriviaQA 42
6.8 McNemar’s Test in Hyperpartisan 42
6.9 McNemar’s Test in TriviaQA 43
6.10 VRAM Occupied 43
7 Conclusions 45
References 46
zh_TW
dc.format.extent 36365732 bytes-
dc.format.mimetype application/pdf-
dc.source.uri (Source) http://thesis.lib.nccu.edu.tw/record/#G0109753205en_US
dc.subject (Keyword) 自然語言處理zh_TW
dc.subject (Keyword) 長文處理zh_TW
dc.subject (Keyword) 轉換器zh_TW
dc.subject (Keyword) 注意力機制zh_TW
dc.subject (Keyword) 基於詞組分析zh_TW
dc.subject (Keyword) Natural language processingen_US
dc.subject (Keyword) Long text processingen_US
dc.subject (Keyword) Transformeren_US
dc.subject (Keyword) Attention mechanismen_US
dc.subject (Keyword) Token-wise analysisen_US
dc.title (Title) 基於詞組的注意力機制用於長文轉換器模型zh_TW
dc.title (Title) Token-wise Attention Mechanism for Long Input Transformer Modelsen_US
dc.type (Type) thesisen_US
dc.relation.reference (References) [1] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer, 2020.
[2] Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147.
[3] Johannes Kiesel, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. SemEval-2019 task 4: Hyperpartisan news detection. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 829–839, Minneapolis, Minnesota, USA, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/S19-2145. URL https://aclanthology.org/S19-2145.
[4] Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. Attention is not only a weight: Analyzing transformers with vector norms, 2020.
[5] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents, 2014.
[6] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space, 2013.
[7] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey, 2020.
[8] Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthesizer: Rethinking self-attention in transformer models, 2021.
[9] Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning, 2019.
[10] Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news, 2020.
[11] Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books, 2015.
zh_TW
dc.identifier.doi (DOI) 10.6814/NCCU202201686en_US