Title 高壓縮比超長文本抽象式摘要生成
High Density Abstraction for Very Long Text Summarization
Author 蕭郁君 (Hsiao, Yu-Chun)
Contributors 黃瀚萱 (Huang, Hen-Hsen), advisor; 蕭郁君 (Hsiao, Yu-Chun)
Keywords 自然語言生成 (natural language generation); 抽象式摘要 (abstractive summarization); 長文本摘要 (long text summarization)
Natural language processing; Abstractive summarization; Long text summarization
Date 2022
Deposited 1-Mar-2022 17:21:36 (UTC+8)
Abstract This work addresses a new text summarization task: generating abstractive summaries of very long text at a high compression ratio.
Automatic summarization is a widely studied topic in natural language processing, and Transformer-based neural models have achieved solid results on abstractive summarization of news articles.
This study instead targets a more challenging type of input: books.
Compared with news articles of only a few hundred words, a book runs to tens of thousands of words or more, and modeling such long inputs and outputs is a major challenge for current neural network models.
In addition, a book summary must restate a large amount of text with a small amount of general, high-level wording.
However, existing extractive and abstractive summarizers alike work mainly by selecting and reordering sentences from the input, so they struggle to condense large amounts of detailed, fragmentary text into broad, macro-level concepts.
The high compression ratio of book summarization therefore poses a second challenge for existing summarization techniques.

To address these two challenges, we propose a multi-level architecture based on Transformer neural networks that is suited to very-long-text summarization and can operate in both supervised and unsupervised modes.
To train our model, we propose a pseudo-labeling strategy that trains the generation model without additional manual annotation, and we further propose a self-supervised learning task that, through multi-task learning, encourages the abstractive summarization model to restate specific, detailed wording with broad, macro-level expressions.
Experimental results show that, compared with existing methods, the approach proposed in this thesis generates better summaries.
Text summarization is a widely studied topic in natural language processing. Most summarization work focuses on news or document summarization, where the input text is usually limited to a few hundred words. This work tackles a much more challenging case: book summarization. Compared with a news article, a book typically runs to tens of thousands of words or more, which poses a barrier for current neural network models with limited input lengths. The high compression ratio of book summarization forms another challenge for most current extractive and abstractive summarization models, which generate the summary by selecting and reordering sentences or words from the input and therefore fail to condense details into broad, macro-level concepts. To address these two issues, we present a novel hierarchical model for very long text summarization that operates in both unsupervised and supervised settings. We train the Transformer-based generation model on pseudo-labeled data in a hierarchical manner to handle very long input text. A self-supervised learning task is further proposed to improve the abstractive summarization model's ability to rephrase specific, detailed wording with broad, macro-level expressions. Experimental results show that our approach generates better summaries than existing methods.
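As a rough illustration of the multi-level processing described above, the sketch below chunks a long input, summarizes each chunk, and then re-summarizes the concatenated chunk summaries until the text fits a single model pass. It is only a sketch of the general hierarchical idea under stated assumptions, not the thesis's implementation: the Hugging Face transformers pipeline, the facebook/bart-large-cnn checkpoint, the chunk size, the generation lengths, and the book.txt input file are all illustrative.

```python
from transformers import pipeline

# Generic pretrained abstractive summarizer (assumed checkpoint; the thesis instead
# trains its own Transformer-based generation model on pseudo-labeled data).
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def split_into_chunks(text: str, max_words: int = 700) -> list[str]:
    """Split a long text into word-bounded chunks small enough for one model pass."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def hierarchical_summary(text: str, max_words: int = 700) -> str:
    """Summarize each chunk, join the chunk summaries, and repeat on the joined text
    until it is short enough to be summarized in a single pass."""
    while len(text.split()) > max_words:
        chunk_summaries = [
            summarizer(chunk, max_length=128, min_length=32, truncation=True)[0]["summary_text"]
            for chunk in split_into_chunks(text, max_words)
        ]
        text = " ".join(chunk_summaries)  # the next, more compressed level of the hierarchy
    return summarizer(text, max_length=256, min_length=64, truncation=True)[0]["summary_text"]

if __name__ == "__main__":
    with open("book.txt", encoding="utf-8") as f:  # hypothetical book-length input
        print(hierarchical_summary(f.read()))
```

The pseudo-label construction and the self-supervised rephrasing task described in the abstract are not shown here; the sketch only conveys how a hierarchy of summarization passes works around the input-length limit of Transformer models.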
References [1] Stefanos Angelidis and Mirella Lapata. Summarizing opinions: Aspect extraction meets sentiment prediction and they are both weakly supervised. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3675–3686, Brussels, Belgium, October-November 2018. Association for Computational Linguistics.
[2] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv:2004.05150, 2020.
[3] Jan A. Botha, Manaal Faruqui, John Alex, Jason Baldridge, and Dipanjan Das. Learning to split and rephrase from wikipedia edit history, 2018.
[4] Yue Dong, Andrei Mircea, and Jackie C. K. Cheung. Discourse-aware unsupervised summarization of long scientific documents, 2021.
[5] Quentin Grail, Julien Perez, and Eric Gaussier. Globalizing BERT-based transformer architectures for long document summarization. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1792–1810, Online, April 2021. Association for Computational Linguistics.
[6] Max Grusky, Mor Naaman, and Yoav Artzi. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies, 2020.
[7] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
[8] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer, 2020.
[9] Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, and Dragomir Radev. Booksum: A collection of datasets for long-form narrative summarization, 2021.
[10] Faisal Ladhak, Bryan Li, Yaser Al-Onaizan, and Kathleen McKeown. Exploring content selection in summarization of novel chapters, 2021.
[11] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online, July 2020. Association for Computational Linguistics.
[12] Chin-Yew Lin and Franz Josef Och. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 605–612, Barcelona, Spain, July 2004.
[13] Peter J. Liu*, Mohammad Saleh*, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. In International Conference on Learning Representations, 2018.
[14] Yang Liu and Mirella Lapata. Text summarization with pretrained encoders, 2019.
[15] Erwin Marsi and Emiel Krahmer. Explorations in sentence fusion. In Proceedings of the Tenth European Workshop on Natural Language Generation (ENLG-05), 2005.
[16] Ani Nenkova, Sameer Maskey, and Yang Liu. Automatic summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, page 3, Portland, Oregon, June 2011. Association for Computational Linguistics.
[17] Dragomir R. Radev, Eduard Hovy, and Kathleen McKeown. Introduction to the special issue on summarization. Computational Linguistics, 28(4):399–408, December 2002.
[18] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683, 2019.
[19] Noraini Mohd Razali, John Geraghty, et al. Genetic algorithm performance with different selection strategies in solving TSP. In Proceedings of the World Congress on Engineering, volume 2, pages 1–6. International Association of Engineers Hong Kong, 2011.
[20] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks, 2019.
[21] Amin Riazi. Genetic algorithm and a double-chromosome implementation to the traveling salesman problem. SN Applied Sciences, 1(11):1–7, 2019.
[22] Tobias Rohde, Xiaoxia Wu, and Yinhan Liu. Hierarchical learning for generation with long source sequences, 2021.
[23] Hasim Sak, Andrew W Senior, and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. 2014.
[24] Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Summarization with pointer-generator networks, 2017.
[25] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[26] Peter West, Ari Holtzman, Jan Buys, and Yejin Choi. BottleSum: Unsupervised and self-supervised sentence summarization using the information bottleneck principle. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3752–3761, Hong Kong, China, November 2019. Association for Computational Linguistics.
[27] Wei Xu and Ralph Grishman. A parse-and-trim approach with information significance for Chinese sentence compression. In Proceedings of the 2009 Workshop on Language Generation and Summarisation (UCNLG+Sum 2009), pages 48–55, Suntec, Singapore, August 2009. Association for Computational Linguistics.
[28] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33, 2020.
[29] Jingqing Zhang, Yao Zhao, Mohammad Ahmad Saleh, and Peter J. Liu. Pegasus: Pretraining with extracted gap-sentences for abstractive summarization by sequence-to-sequence models, 2020.
[30] Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, and Xuanjing Huang. Extractive summarization as text matching. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6197–6208, Online, July 2020. Association for Computational Linguistics.
[31] Chenguang Zhu, Ziyi Yang, Robert Gmyr, Michael Zeng, and Xuedong Huang. Leveraging Lead Bias for Zero-Shot Abstractive News Summarization, pages 1462–1471. Association for Computing Machinery, New York, NY, USA, 2021.
[32] Markus Zopf. Auto-hMDS: Automatic construction of a large heterogeneous multilingual multi-document summarization corpus. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 2018. European Language Resources Association (ELRA).
Description Master's thesis, National Chengchi University, Department of Computer Science, 109753203
Source http://thesis.lib.nccu.edu.tw/record/#G0109753203
Type thesis
URI http://nccur.lib.nccu.edu.tw/handle/140.119/139219
Table of contents Chinese Abstract
Abstract
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
Section 1 Research Background
Section 2 Research Motivation
Section 3 Research Objectives
Chapter 2 Literature Review
Section 1 Extractive Summarization
Section 2 Abstractive Summarization
Section 3 Long Text Summarization
Section 4 Sentence Fusion and Sentence Simplification
Chapter 3 Methodology
Section 1 Overview
Section 2 Summarization Task
1. Overview
2. Wikipedia Outline Structure: Introduction and Definitions
3. Instance Generation
Section 3 Knowledge Fusion Task
Section 4 Generation Model
Section 5 Inference Model
1. Overview
2. Iterative Model Architecture
Chapter 4 Experiments
Section 1 Datasets
1. Introduction
2. Dataset Description
Section 2 Evaluation Metrics
Section 3 Models, Hyperparameters, and Experimental Setup
1. Generation Model
2. Iterative Model
Section 4 Model Performance
1. Base Models
2. Ablation Study
3. Iterative Model Experiments
4. Comparison with Other Models
5. Result Analysis
Chapter 5 Conclusion
References
Format application/pdf, 1,312,263 bytes
DOI 10.6814/NCCU202200337