Title Exploring Topic Models for Analyzing the Contents of Social Media (基於主題模型之社群媒體內容分析探索)
Author Liao, Shu Ting (廖舒婷)
Contributors Chen, Kung (陳恭)
Liao, Shu Ting (廖舒婷)
Keywords Topic Models
Text Mining
Social Media
Date 2016
Upload time 22-Aug-2016 13:40:38 (UTC+8)
Abstract The volume of text published online has grown so quickly that traditional content analysis can no longer process it and extract high-quality insights in a reasonable amount of time. To address this, we build a data analysis tool centered on unsupervised topic modeling and aimed at Chinese text. By integrating the Chinese tokenization tool jieba, the system lets researchers explore and analyze large collections of Chinese documents quickly and effectively. The system also runs an automated quantitative evaluation of each generated model, which gives the user a quick indication of how well the model works. Finally, because the output of a topic model still requires human interpretation, we visualize the results to help end users understand and interpret the discovered topics.
     To evaluate the system, we use six Chinese text data sets collected during Taiwan's Sunflower Movement from different online media sources: Facebook, Twitter, and four major real-time news outlets. The results show that the system can be applied to large volumes of unlabeled Chinese text, reducing manual work and shortening the overall analysis time. We then compare the topics found in social media with those found in online news. We observe that the Sunflower Movement drew attention not only from the public and the news media in Taiwan: users in Hong Kong, mainland China, and elsewhere also followed it and expressed their opinions through social media. The distribution of topics can also serve as an indicator of how popular a topic is.
     Finally, we conduct experiments to evaluate the feasibility and limitations of applying topic models to Chinese text of different kinds. An interesting finding is that the system can act as a data filter: the topic composition of a data set, computed by classifying its documents, can serve as a feature for quickly selecting relevant data sets from a large collection. We use this finding to suggest directions for future work.
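The data-filter idea in the last paragraph of the abstract can be illustrated with a small sketch. It assumes that each document already has a topic distribution from a trained topic model, assigns every document to its most probable topic, and treats the resulting shares as the data set's topic composition. The function names, the threshold, and the sample numbers are all hypothetical and are not taken from the thesis.

```python
from collections import Counter


def dominant_topic(doc_topics):
    """Return the index of the most probable topic for one document."""
    return max(range(len(doc_topics)), key=lambda k: doc_topics[k])


def topic_composition(dataset):
    """Share of documents assigned to each topic (by dominant topic)."""
    total = len(dataset)
    if total == 0:
        return {}
    counts = Counter(dominant_topic(d) for d in dataset)
    return {k: n / total for k, n in counts.items()}


def select_datasets(datasets, topic, min_share):
    """Names of data sets in which `topic` covers at least `min_share` of documents."""
    return [
        name
        for name, docs in datasets.items()
        if topic_composition(docs).get(topic, 0.0) >= min_share
    ]


# Hypothetical per-document topic distributions; not data from the thesis.
datasets = {
    "source_a": [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.5, 0.25, 0.25]],
    "source_b": [[0.1, 0.1, 0.8], [0.2, 0.2, 0.6], [0.6, 0.3, 0.1], [0.3, 0.3, 0.4]],
}

composition_a = topic_composition(datasets["source_a"])
selected = select_datasets(datasets, topic=0, min_share=0.5)
```

Assigning each document to a single dominant topic is only one way to define composition; averaging the full distributions over a data set would be an equally plausible reading of the abstract.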
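Description Master's
National Chengchi University
Department of Computer Science, In-service Master's Program (資訊科學系碩士在職專班)
103971002
Source http://thesis.lib.nccu.edu.tw/record/#G0103971002
Type thesis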
dc.contributor.advisor Chen, Kung (陳恭)
dc.identifier (Other Identifiers) G0103971002
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/100571
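dc.description.tableofcontents Chapter 1 Introduction
     1.1 Research Background and Motivation
     1.2 Research Objectives
     1.3 Research Results
     1.4 Chapter Overview
     Chapter 2 Related Work and Technical Background
     2.1 Text Mining
     2.2 Overview of Topic Models
     2.2.1 Latent Semantic Analysis
     2.2.2 Probabilistic Latent Semantic Analysis
     2.2.3 Latent Dirichlet Allocation
     2.3 Topic Model Evaluation Methods
     2.3.1 Perplexity
     2.3.2 Topic Coherence
     2.3.3 Topic Distance
     2.4 Bag-of-Words Model
     2.5 Modularization
     2.6 Work Queue Technology
     Chapter 3 System Design and Architecture
     3.1 Analysis Workflow
     3.2 System Architecture
     3.3 Data Sources
     3.4 Preprocessing Module
     3.4.1 Word Segmentation
     3.4.2 Part-of-Speech Tagging
     3.4.3 Stop Word Removal
     3.4.4 Term Frequency Counting
     3.5 Data Format Conversion
     3.6 Topic Model Building Module
     3.7 Evaluation Module
     3.8 Visualization Module
     3.9 User Operation Module
     3.10 Work Queue Architecture
     Chapter 4 Experimental Results and Evaluation
     4.1 Implementation Environment
     4.2 Data Overview
     4.3 Analysis and Discussion of Topic Model Results
     4.3.1 Facebook Topic Model
     4.3.2 Twitter Topic Model
     4.3.3 Topic Models for the Four Major Newspapers
     4.3.4 Overall Comparison
     4.4 Topic Model Evaluation
     Chapter 5 Conclusions and Recommendations
     5.1 Conclusions
     5.2 Future Work and Recommendations
     5.2.1 System Limitations
     5.2.2 Extended Applications of the System
     References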
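The outline names the preprocessing steps (word segmentation, part-of-speech tagging, stop word removal, term frequency counting) and a bag-of-words model, but this record does not show how they are implemented. The following is a minimal sketch of how such steps could be chained, assuming the text has already been segmented and tagged into (word, tag) pairs, for example by a tokenizer such as jieba. The stop word list, the kept tags, the function names, and the sample sentences are all hypothetical.

```python
from collections import Counter

# Hypothetical stop word list and part-of-speech whitelist (assumptions).
STOP_WORDS = {"的", "了", "是"}
KEEP_POS = {"n", "v"}  # keep nouns and verbs, as an illustrative choice


def preprocess(tagged_doc):
    """Keep content words from one segmented, POS-tagged document."""
    return [w for w, pos in tagged_doc if pos in KEEP_POS and w not in STOP_WORDS]


def term_frequencies(docs):
    """Count how often each term occurs across the whole corpus."""
    return Counter(w for doc in docs for w in doc)


def to_bag_of_words(doc, vocab):
    """Convert a token list to sorted (term_id, count) pairs."""
    counts = Counter(vocab[w] for w in doc if w in vocab)
    return sorted(counts.items())


# Made-up input, already segmented and tagged as (word, tag) pairs.
tagged_corpus = [
    [("學生", "n"), ("的", "u"), ("訴求", "n"), ("是", "v"), ("退回", "v"), ("服貿", "n")],
    [("媒體", "n"), ("報導", "v"), ("了", "u"), ("學生", "n"), ("訴求", "n")],
]

docs = [preprocess(d) for d in tagged_corpus]
tf = term_frequencies(docs)
vocab = {w: i for i, w in enumerate(sorted(tf))}
bows = [to_bag_of_words(d, vocab) for d in docs]
```

The (term_id, count) pairs produced at the end are the kind of input that common topic modeling libraries accept, which is why the conversion is shown as the final step.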
dc.format.extent 4155002 bytes
dc.format.mimetype application/pdf