A Study on Application of Text Mining for Automatic Text Summarization of Film Review (應用文字探勘於影評文章自動摘要之研究)

Title	A Study on Application of Text Mining for Automatic Text Summarization of Film Review (應用文字探勘於影評文章自動摘要之研究)
Author	鄧亦安 (Teng, I An)
Contributors	楊建民 (advisor); 鄧亦安 (Teng, I An)
Keywords	Text mining (文字探勘); Film review summary (電影影評摘要); Automatic text summarization (自動文章摘要)
Date	2016
Upload time	20-Jul-2016 17:15:39 (UTC+8)
Abstract (translated from Chinese)	With the rise of the internet, people facing a choice no longer rely only on word of mouth; they also search online by keyword for the information they need. In the resulting flood of data, however, integrating information quickly is a major challenge. A summary of film reviews can help people learn about a film before going to the cinema and confirm that it is one they are interested in. This study uses reviews of five films as its data source: Avengers: Age of Ultron (66 reviews, 4,616 sentences), Batman v Superman: Dawn of Justice (60 reviews, 9,345 sentences), Zootopia (60 reviews, 5,545 sentences), Interstellar (50 reviews, 4,616 sentences), and The Intern (62 reviews, 5,622 sentences). It generates review summaries by combining clustering with sentence extraction. The K-Means algorithm first clusters the feature terms and sentences of the multiple reviews of the five films. TF-IDF then rates the importance of the sentences in each cluster so that high-weight sentences can be selected, and the WWA method picks sentences that cover different aspects within a cluster. Finally, the similarity between the best template and the content of each cluster determines the order of the clusters, producing a multi-review summary whose paragraph content and paragraph order resemble those of the template. The results show that the original reviews of the five films have a similarity of 15.87% to the best template, whereas the summaries produced by this method have a similarity of 21.19% to the best-template single-review summary. In addition, because the clusters are ordered by their similarity to the best template, the summary follows a paragraph order similar to the template's. It contains the content most widely mentioned across the reviews, and the distinct paragraphs give the summary broader coverage. Through automatically aggregated and extracted summaries, this method can help people quickly understand film-related information and support their decisions.
Abstract (English)	With the rise of big data, the web holds more information than readers can absorb, and summarizing the essential information quickly is a challenge. People deciding which film to see face the same problem: before choosing, they search for information about the film, but reviews are scattered across many websites. Automatic text summarization can extract the important information efficiently and condense the main ideas of those reviews, so that readers can grasp the key points of all the reviews and save time. This research presents a multi-concept, extractive film review summary for readers. It builds the summary from reviews on the most popular blog platform, PIXNET, using an extraction-based method combined with clustering. For a given film, the K-Means algorithm clusters the review sentences by their features; statistical weighting and the WWA method then measure sentence weights in order to select representative sentences. In the last step, the clusters are compared with templates to decide their order, and the representative sentences from each cluster are assembled into the summary. Across the five films, the method raises the average similarity between the film review summaries and the template summaries to 21.19%. This indicates that automatic film review summarization can extract the important sentences from the reviews. Ordering the clusters by comparison with the template also lets the clustered sentences be listed in sequence as a movie review, which saves readers time and is easy to follow.
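Description	Master's thesis; National Chengchi University (國立政治大學); Department of Management Information Systems (資訊管理學系); 103356032
Source	http://thesis.lib.nccu.edu.tw/record/#G0103356032
Type	thesis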
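
The pipeline outlined in the abstract (cluster the review sentences, weight them, pick a representative from each cluster, then order the clusters against a template) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the thesis's implementation: sentences are assumed to be already word-segmented into token lists, the TF-IDF variant, the cosine-based cluster assignment, the "first k sentences" seeding, and the descending-similarity ordering are all choices made here for brevity, and the WWA adjustment step is omitted because the record does not describe it. All function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Build one sparse TF-IDF vector (term -> weight) per tokenized sentence."""
    n = len(sentences)
    df = Counter(term for s in sentences for term in set(s))
    vectors = []
    for s in sentences:
        tf = Counter(s)
        # Assumed smoothed variant: (count / length) * log(1 + N / df).
        vectors.append({t: (c / len(s)) * math.log(1 + n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def kmeans(vectors, k, iterations=20):
    """K-Means-style clustering that assigns by cosine similarity (an assumption)."""
    centroids = [dict(v) for v in vectors[:k]]  # deterministic seeding for the sketch
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                total = Counter()
                for v in members:
                    total.update(v)
                centroids[c] = {t: w / len(members) for t, w in total.items()}
    return labels


def summarize(sentences, template, k):
    """Pick the highest-weight sentence per cluster, ordered by template similarity."""
    vectors = tfidf_vectors(sentences)
    labels = kmeans(vectors, k)
    template_terms = Counter(template)
    picks = []
    for c in range(k):
        members = [i for i, label in enumerate(labels) if label == c]
        if not members:
            continue
        # Sentence weight = sum of its TF-IDF weights (assumed scoring rule).
        best = max(members, key=lambda i: sum(vectors[i].values()))
        cluster_terms = Counter(t for i in members for t in sentences[i])
        picks.append((cosine(cluster_terms, template_terms), best))
    picks.sort(reverse=True)  # clusters most similar to the template come first
    return [" ".join(sentences[i]) for _, i in picks]


if __name__ == "__main__":
    reviews = [
        ["plot", "story", "twist"],
        ["actor", "cast", "performance"],
        ["plot", "story"],
        ["actor", "performance"],
    ]
    print(summarize(reviews, template=["actor", "performance", "plot"], k=2))
```

In practice a library such as scikit-learn would replace the hand-written TF-IDF and K-Means steps; the plain-Python version is used here only to keep the sketch self-contained.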

dc.contributor.advisor	楊建民	zh_TW
dc.contributor.author (Authors)	鄧亦安	zh_TW
dc.contributor.author (Authors)	Teng, I An	en_US
dc.creator (Author)	鄧亦安	zh_TW
dc.creator (Author)	Teng, I An	en_US
dc.date (Date)	2016	en_US
dc.date.accessioned	20-Jul-2016 17:15:39 (UTC+8)	-
dc.date.available	20-Jul-2016 17:15:39 (UTC+8)	-
dc.date.issued (Upload time)	20-Jul-2016 17:15:39 (UTC+8)	-
dc.identifier (Other Identifiers)	G0103356032	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/99338	-
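dc.description (Description)	Master's (碩士)	zh_TW
dc.description (Description)	National Chengchi University (國立政治大學)	zh_TW
dc.description (Description)	Department of Management Information Systems (資訊管理學系)	zh_TW
dc.description (Description)	103356032	zh_TW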
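dc.description.tableofcontents	Abstract (Chinese); Abstract (English); List of Figures; List of Tables; Chapter 1 Introduction: 1.1 Research Background and Motivation; 1.2 Research Objectives; 1.3 Thesis Structure; Chapter 2 Literature Review: 2.1 Text Mining and Document Clustering and Classification Methods; 2.2 Text Mining and Automatic Text Summarization; 2.3 Categories of Automatic Text Summarization; 2.3.1 Single-Document and Multi-Document Summarization; 2.3.2 Monolingual and Multilingual Document Summarization; 2.3.3 Extractive and Abstractive Summarization; 2.3.4 Informative and Indicative Summaries; 2.3.5 Generic and Query-Based Summaries; 2.4 Chinese Word Segmentation Methods; 2.5 Text Summarization Methods; Chapter 3 Research Method and Design: 3.1 Research Framework; 3.2 Review Collection; 3.3 Preprocessing; 3.3.1 Content Filtering; 3.3.2 Building the Word-Segmentation Lexicon; 3.3.3 Sentence Splitting and Word Segmentation; 3.4 Sentence Clustering; 3.4.1 Determining the Number of Clusters; 3.4.2 K-Means Clustering; 3.5 Determining Sentence Importance; 3.5.1 Sentence Weight Calculation; 3.5.2 Sentence Weight Adjustment Algorithm (WWA); 3.6 Review Summary Integration; 3.7 Best-Template Review; Chapter 4 Research and Results: 4.1 Experimental Design and Validation of Automatic Summarization; 4.2 Generation of Automatic Film Review Summaries; 4.3 Validation of Automatic Film Review Summaries; 4.4 Multi-Review Summaries Produced in This Study; Chapter 5 Conclusions, Suggestions, and Future Research Directions: 5.1 Conclusions and Suggestions; 5.2 Research Limitations and Future Research Directions; References	zh_TW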
dc.format.extent	2392262 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0103356032	en_US
dc.subject (Keywords)	文字探勘	zh_TW
dc.subject (Keywords)	電影影評摘要	zh_TW
dc.subject (Keywords)	自動文章摘要	zh_TW
dc.subject (Keywords)	Text-mining	en_US
dc.subject (Keywords)	Film review summary	en_US
dc.subject (Keywords)	Automatic text summarization	en_US
dc.title (Title)	應用文字探勘於影評文章自動摘要之研究	zh_TW
dc.title (Title)	A Study on Application of Text Mining for Automatic Text Summarization of Film Review	en_US
dc.type (Type)	thesis	en_US
