Please use this identifier to cite or link to this item: https://ah.lib.nccu.edu.tw/handle/140.119/139219
Title: 高壓縮比超長文本抽象式摘要生成
High Density Abstraction for Very Long Text Summarization
Author: 蕭郁君
Hsiao, Yu-Chun
Contributors: 黃瀚萱
Huang, Hen-Hsen
蕭郁君
Hsiao, Yu-Chun
Keywords: Natural language generation
Natural language processing
Abstractive summarization
Long text summarization
Date: 2022
Upload date: 1-Mar-2022
Abstract: This thesis addresses a new summarization task: generating abstractive summaries of very long texts at a high compression ratio. Text summarization is a widely studied topic in natural language processing, and most existing work focuses on news or document summarization, where the input is usually limited to a few hundred words; Transformer-based neural models have shown solid results in that setting. This work tackles a much more challenging type of input: books. Compared with a news article, a book typically runs to tens of thousands of words or more, which poses a barrier to current neural network models with limited input length. The high compression ratio required by book summarization forms a second challenge: most extractive and abstractive summarizers generate a summary mainly by selecting and reordering sentences or words from the input, and therefore struggle to condense large volumes of fine-grained detail into broad, macro-level concepts.

To address these two issues, we present a hierarchical, Transformer-based model for very long text summarization that operates in both unsupervised and supervised settings. We train the generation model hierarchically on pseudo-labeled data, so no additional manual annotation is required, and we further propose a self-supervised learning task, combined through multi-task learning, that improves the abstractive model's ability to rephrase specific, detailed wording with broad, macro-level expressions. Experimental results show that the proposed approach generates better summaries than existing methods.
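As a rough illustration of the hierarchical, summarize-then-resummarize processing the abstract describes, a minimal sketch in Python with the Hugging Face transformers library might look like the following. This is not the thesis's actual model: the checkpoint name, chunk size, and generation lengths are assumptions, and the pseudo-labeling and self-supervised rephrasing objectives are omitted.

```python
# Minimal sketch of hierarchical ("summarize, then summarize the summaries")
# processing for a book-length input. The checkpoint, chunk size, and
# generation lengths are illustrative assumptions, not the thesis's settings.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def chunk_text(text, max_words=700):
    """Split a very long text into word-bounded chunks that fit the encoder."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def hierarchical_summary(book_text, max_rounds=3):
    """Summarize each chunk, concatenate the partial summaries, and repeat
    until the text fits in a single pass, yielding a high compression ratio."""
    text = book_text
    for _ in range(max_rounds):
        chunks = chunk_text(text)
        partials = [
            summarizer(c, max_length=128, min_length=32, truncation=True)[0]["summary_text"]
            for c in chunks
        ]
        text = " ".join(partials)
        if len(chunks) == 1:  # the input already fit in a single chunk
            break
    return text
```

In the setting the abstract describes, each level would additionally apply the learned rephrasing objective, so that concatenated chunk summaries are restated in broader, macro-level terms rather than merely stitched together.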
Description: Master's thesis
National Chengchi University
Department of Computer Science
109753203
Source: http://thesis.lib.nccu.edu.tw/record/#G0109753203
Data type: thesis
Appears in Collections: Theses

Files in This Item:
320301.pdf (1.28 MB, Adobe PDF)