Title 中文新聞標題自動生成之研究
A Study on the Automatic Generation for Headlines of Chinese News Articles
Author 江珮翎 (Chiang, Pei-ling)
Advisors 劉吉軒 (Liu, Jyi-Shane); 陳光華 (Chen, Kuang-hua)
Keywords headline; automatic generation; natural language; generation; news headline
Date 2002
Date uploaded 17-Sep-2009 13:51:52 (UTC+8)
Abstract In this era of exploding information on the Internet, the analysis and organization of data have become increasingly important. The goal of this thesis is headline generation: automatically generating a headline for a document, thereby adding value to raw data and turning it into information. We first reviewed the related English-language literature and concluded that Chinese text calls for different processing than English; this thesis therefore proposes preprocessing steps and an automatic headline generation method designed for Chinese rather than English.
For automatic headline generation we consider several features: candidate-word weights, trained headline-text word associations, headline length, and the gaps between candidate words. The research proceeds in two stages. In the training stage, documents are preprocessed and segmented into words, and the headline-text word probabilities and the distribution of headline lengths are estimated. In the execution stage, the candidate-word weights of a new document are computed and, consulting the headline-text word and headline-length probability tables built during training and taking word gaps into account, a headline is generated automatically for the document. The training collection consists of 84,211 articles drawn from five newspapers published between 1998 and 1999, covering a range of topics; the test documents are divided into an Outside Test and an Inside Test.
Two evaluations were carried out. The first is an automatic evaluation, in which the generated headlines are compared against the headlines written by journalists and precision, recall, and F1 are computed. For the Outside Test, precision is 14.21%, recall is 11.43%, and F1 is 12.67%; for the Inside Test, precision is 15.84%, recall is 12.94%, and F1 is 14.21%. These results are comparable to those reported for English headline generation (F1 = 3.2%~24%), but the generated headlines still differ noticeably from the actual ones, leaving substantial room for future work. The second is a human evaluation, in which users read and rated the automatically generated headlines; the fluency of the generated headlines was judged reasonably good. Overall, this research is still at an early stage, and we hope it can mature further through additional innovation and improvement.
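The F1 values above are the harmonic mean of precision and recall. As a sanity check (this is not code from the thesis, which reports only the final figures), the reported numbers can be reproduced from the stated precision and recall:

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Outside Test: reported P = 14.21%, R = 11.43%
print(round(f1(0.1421, 0.1143) * 100, 2))  # 12.67, matching the reported F1
# Inside Test: reported P = 15.84%, R = 12.94%
print(round(f1(0.1584, 0.1294) * 100, 2))  # 14.24, close to the reported 14.21 (P and R are rounded)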
As the number of digital documents on the Internet grows, the analysis and organization of documents become increasingly important. In this thesis, we propose an approach to headline generation for documents, with the aim of turning raw document data into information. We review the related literature and present an approach for Chinese documents that differs from those used for English documents.
The approach consists of two steps: a training step and an execution step. In the training step, the documents are preprocessed; we then estimate the probability that a text word also appears in the headline, together with the distribution of headline lengths. In the execution step, we score the headline candidate words and their gaps, consult the trained headline-text word and headline-length probabilities, and automatically generate a headline for each document. The training documents are selected from CIRB, a test collection for information retrieval; in total, 84,211 Chinese news articles published between 1998 and 1999 are used. The test documents are divided into two parts, one for the outside test and one for the inside test.
We conducted two evaluations: an automatic evaluation using precision, recall, and F1, and a human assessment. For the outside test, precision is 14.21%, recall is 11.43%, and F1 is 12.67%; for the inside test, precision is 15.84%, recall is 12.94%, and F1 is 14.21%. The automatic evaluation shows that accuracy is still not good enough, while the human assessment shows that our approach can produce human-readable headlines.
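To make the two-step procedure described above concrete, the sketch below illustrates one way extractive headline generation from trained headline-text word statistics could be organized. It is an illustrative reconstruction under stated assumptions, not the implementation from the thesis: the function names, the assumption that articles and headlines arrive already segmented into word lists, and the specific way headline length and word gaps are handled are all hypothetical.

from collections import Counter

def train(pairs):
    """Estimate, from (headline_words, body_words) pairs, the probability that a
    body word also appears in the headline, plus the distribution of headline lengths."""
    in_headline, in_body, length_counts = Counter(), Counter(), Counter()
    for headline, body in pairs:
        length_counts[len(headline)] += 1
        head_set, body_set = set(headline), set(body)
        for w in body_set:
            in_body[w] += 1
            if w in head_set:
                in_headline[w] += 1
    p_word = {w: in_headline[w] / in_body[w] for w in in_body}
    total = sum(length_counts.values())
    p_length = {n: c / total for n, c in length_counts.items()}
    return p_word, p_length

def generate(body, p_word, p_length, max_gap=10):
    """Choose the most probable headline length, then pick that many candidate words
    from the body, scoring each by its trained probability and skipping candidates
    that lie too far (in word positions) from the previous pick."""
    target_len = max(p_length, key=p_length.get)
    scored = sorted(((p_word.get(w, 0.0), pos, w) for pos, w in enumerate(body)), reverse=True)
    top = sorted(scored[:3 * target_len], key=lambda t: t[1])  # best candidates, in document order
    headline, last_pos = [], None
    for score, pos, w in top:
        if last_pos is not None and pos - last_pos > max_gap:
            continue  # gap constraint: avoid stitching together distant words
        headline.append(w)
        last_pos = pos
        if len(headline) == target_len:
            break
    return headline

A real system would first apply Chinese word segmentation and the preprocessing and candidate-word weighting described in Chapter 3; the sketch assumes that work has already been done and collapses the weighting into a single trained probability per word.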
References
[1] Michele Banko, Vibhu O. Mittal, and Michael J. Witbrock. 2000. "Headline Generation Based on Statistical Translation". In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong, China, October 1-8.
[2] Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. "The Mathematics of Statistical Machine Translation: Parameter Estimation". Computational Linguistics, 19(2):263-311.
[3] P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, J. Lafferty, R. Mercer, and P. Roossin. 1990. "A Statistical Approach to Machine Translation". Computational Linguistics, 16(2), June.
[4] Kuang-hua Chen and Hsin-Hsi Chen. 2001. "The Chinese Text Retrieval Tasks of NTCIR Workshop 2". In Proceedings of the Second NTCIR Workshop Meeting on Evaluation of Chinese & Japanese Text Retrieval and Text Summarization (NTCIR 2), pp. 51-72.
[5] G. D. Forney. 1973. "The Viterbi Algorithm". Proceedings of the IEEE, 61(3):268-278.
[6] Rong Jin and Alexander G. Hauptmann. 2001. "Headline Generation Using a Training Corpus". In Proceedings of the Second International Conference on Intelligent Text Processing and Computational Linguistics.
[7] R. Jin and A. G. Hauptmann. 2000. "Title Generation for Spoken Broadcast News Using a Training Corpus". In Proceedings of ICSLP 2000, Beijing, China.
[8] S. Katz. 1987. "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer". IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3):400-401.
[9] Paul E. Kennedy and Alexander G. Hauptmann. 2000. "Automatic Title Generation for EM". In Proceedings of the Fifth ACM Conference on Digital Libraries.
[10] G. J. McLachlan and K. E. Basford. 1988. Mixture Models. Marcel Dekker, New York.
[11] M. Mitra, Amit Singhal, and Chris Buckley. 1997. "Automatic Text Summarization by Paragraph Extraction". In Proceedings of the ACL'97/EACL'97 Workshop on Intelligent Scalable Text Summarization, Madrid, Spain.
[12] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. "Bleu: A Method for Automatic Evaluation of Machine Translation". IBM Research Division Technical Report RC22176 (W0109-022), Yorktown Heights, New York.
[13] Gerard Salton, A. Singhal, M. Mitra, and C. Buckley. 1997. "Automatic Text Structuring and Summarization". Information Processing and Management, 33(2):193-207.
[14] T. Strzalkowski, J. Wang, and B. Wise. 1998. "A Robust Practical Text Summarization System". In AAAI Intelligent Text Summarization Workshop, pp. 26-30, Stanford, CA.
[15] M. Witbrock and V. Mittal. 1999. "Ultra-Summarization: A Statistical Approach to Generating Highly Condensed Non-Extractive Summaries". In Proceedings of SIGIR '99, Berkeley, CA, August.
[16] David Zajic, Bonnie Dorr, and Richard Schwartz. 2002. "Automatic Headline Generation for Newspaper Stories". In Proceedings of the Workshop on Text Summarization, Post-Conference Workshop of ACL-02, Philadelphia, PA.
[17] Kuang-hua Chen. Automatic Identification of Subject Categories for Electronic Documents (電子文獻資料主題分類之自動辨識). National Science Council Research Project Report, NSC 86-2621-E-002-025T, September 1997.
Description Master's thesis
National Chengchi University
Department of Computer Science
Student ID: 89753004
Academic year: 91
Source http://thesis.lib.nccu.edu.tw/record/#G0089753004
Type thesis
Other Identifiers G0089753004
URI https://nccur.lib.nccu.edu.tw/handle/140.119/32615
Table of Contents

Chapter 1 Introduction 1
1.1 Background 1
1.2 Problem Statement 3
1.3 Motivation 3
1.4 Objectives 4
1.5 Assumptions 5
1.6 Research Methods and Procedures 6
1.7 Terminology 9
1.8 Thesis Structure and Contributions 9
Chapter 2 Literature Review 11
2.1 Related Research on Automatic Headline Generation 11
2.2 Techniques for Automatic Headline Generation 14
2.2.1 Words as Headline Units 14
2.2.2 Sentences as Headline Units 18
2.2.3 Summary Table of Approaches in the Literature 19
2.3 Comparison of Evaluation Methods and Experimental Results in the Literature 21
2.4 Summary 25
Chapter 3 Generation of Chinese Headlines 27
3.1 Concept Formation for Automatic Headline Generation 27
3.2 Preprocessing 29
3.3 Feature Analysis 30
3.3.1 Candidate-Word Weights 30
3.3.2 Headline Length 32
3.3.3 Gaps between Candidate Words 32
3.3.4 Algorithm Steps 34
3.4 Summary 38
Chapter 4 Implementation and Analysis of Results 39
4.1 Training and Test Documents 40
4.2 Experiment 1 42
4.2.1 Preprocessing 42
4.2.2 Headline Length 43
4.3 Experiment 2 45
4.3.1 Preprocessing 45
4.3.2 Headline Length 46
4.3.3 Probability that a Text Word Also Appears in the Headline 48
4.3.4 Evaluation of Headline Words Generated for Test Documents 49
4.4 Experiment 3 52
4.4.1 Evaluation of Headline Words Generated for Test Documents 52
4.5 Analysis and Discussion 53
4.6 Summary 53
Chapter 5 Human Evaluation Results and Discussion 54
5.1 Length of Automatically Generated Headlines in the Questionnaire Study 54
5.2 Questionnaire Respondents, Scoring Method, and Content 55
5.2.1 Respondents and Scoring Method 55
5.2.2 Questionnaire Content 55
5.3 Human Evaluation and Analysis of Examples 56
5.3.1 Questionnaire Results and Discussion 56
5.3.2 Case Analysis and Discussion 58
5.4 Summary 59
Chapter 6 Conclusions and Future Work 60
6.1 Conclusions 60
6.2 Future Work 61
References 63
Appendix 1 Automatically Generated Headlines and Original News Articles 65
Files 6 PDF files (application/pdf); sizes: 85,569 / 289,029 / 187,889 / 340,548 / 82,908 / 196,927 bytes