Title: 以LDA機率模型進行PTT論壇文章主題分類並分析文章留言與文章主題之關聯
(Using Latent Dirichlet Allocation Model for Topic Modeling with Articles of PTT Forum and Analyzing Relevance of Article Comments)
Author: 郭泓志 (Kuo, Haung-Chi)
Advisors: 江玥慧, 劉昭麟
Keywords: Topic Modeling (文件主題模型); Social Network Analysis (社群網路分析); Latent Dirichlet Allocation; PTT
Date: 2020; uploaded 2-Sep-2020 12:16:01 (UTC+8)
Abstract
With the rapid development of technology, people's interaction on social networking platforms has become increasingly common. People from different fields and countries gather in the same online spaces to discuss and share opinions ever more frequently, but automatically classifying the topics that each group discusses is difficult. This study uses Taiwan's well-known online forum PTT as a data source and adopts the LDA (Latent Dirichlet Allocation) model to classify articles into topic groups, with a Word2Vec model used to identify the discussion topics of the comments responding to each article. Results of the models are used to investigate whether the comments on an article are related to the article in terms of topic groups. Analyzing the association between comments and articles provides a basis for further understanding communication in the PTT forum.

References
[1] Hong, L., & Davison, B. D. (2010, July). Empirical study of topic modeling in Twitter. In Proceedings of the First Workshop on Social Media Analytics (pp. 80-88). ACM.
[2] Everett, B. (2013). An introduction to latent variable models. Springer Science & Business Media.
[3] Landauer, T. K., McNamara, D. S., Dennis, S., & Kintsch, W. (Eds.). (2013). Handbook of latent semantic analysis. Psychology Press.
[4] Manning, C., Raghavan, P., & Schütze, H. (2010). Introduction to information retrieval. Natural Language Engineering, 16(1), 100-103.
[5] Hofmann, T. (2000). Learning the similarity of documents: An information-geometric approach to document retrieval and categorization. In Advances in Neural Information Processing Systems (pp. 914-920).
[6] Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.
[7] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119).
[8] PTT. (1995, September 14). Retrieved December 23, 2019, from https://www.ptt.cc/bbs/index.html
[9] Jurafsky, D. (2000). Speech & language processing. Pearson Education India.
[10] Nenkova, A., & McKeown, K. (2012). A survey of text summarization techniques. In Mining text data (pp. 43-76). Springer, Boston, MA.
[11] Chaffar, S., & Inkpen, D. (2011, May). Using a heterogeneous dataset for emotion analysis in text. In Canadian Conference on Artificial Intelligence (pp. 62-67). Springer, Berlin, Heidelberg.
[12] 廖經庭 (2007). BBS站的客家族群認同建構:以PTT「Hakka Dream」版為例 [Master's thesis, National Central University, Taoyuan, Taiwan].
[13] 蔣佳峰 (2017). PTT災害事件擷取系統 [Master's thesis, National Central University, Taoyuan, Taiwan].
[14] 陳弘君 (2017). 社群媒體中鄉民對於政治議題之迴聲室效應:以PTT八卦版為例 [Master's thesis, Yuan Ze University, Taoyuan, Taiwan].
[15] Pritchard, J. K., Stephens, M., & Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2), 945-959.
[16] Heller, K. A., & Ghahramani, Z. (2001). Bayesian hierarchical clustering. University College London.
[17] Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1-22.
[18] Jensen, J. L. W. V. (1906). Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Mathematica, 30, 175-193.
[19] Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79-86.
[20] 沈裕傑 (2008). 以語句為主之LDA模型於文件摘要之應用 (Sentence-based latent Dirichlet allocation for text summarization) [Master's thesis, National Cheng Kung University, Tainan, Taiwan].
[21] Rajaraman, A., & Ullman, J. D. (2011). Mining of massive datasets. Cambridge University Press.
[22] Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. (2010, June). Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 100-108).
[23] Moody, C. E. (2016). Mixing Dirichlet topic models and word embeddings to make lda2vec. arXiv preprint arXiv:1605.02019.
[24] Wang, X., Wei, F., Liu, X., Zhou, M., & Zhang, M. (2011, October). Topic sentiment analysis in Twitter: A graph-based hashtag sentiment classification approach. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (pp. 1031-1040). ACM.
[25] Quercia, D., Askham, H., & Crowcroft, J. (2012, June). TweetLDA: Supervised topic classification and link prediction in Twitter. In Proceedings of the 4th Annual ACM Web Science Conference (pp. 247-250). ACM.
[26] Ramage, D., Hall, D., Nallapati, R., & Manning, C. D. (2009, August). Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (pp. 248-256). Association for Computational Linguistics.
[27] Pavitt, C., & Johnson, K. K. (1999). An examination of the coherence of group discussions. Communication Research, 26(3), 303-321.
[28] Li, W., Xu, J., He, Y., Yan, S., & Wu, Y. (2019). Coherent comment generation for Chinese articles with a graph-to-sequence model. arXiv preprint arXiv:1906.01231.
[29] Gensim. (n.d.). Retrieved December 23, 2019, from https://radimrehurek.com/gensim/models/word2vec.html
[30] Crummy. (1996). Retrieved December 24, 2019, from https://www.crummy.com/software/BeautifulSoup/
[31] MongoDB. (2009). Retrieved December 31, 2019, from https://www.mongodb.com/
[32] Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2), 159-165.
[33] Jieba. (n.d.). Retrieved December 31, 2019, from https://github.com/fxsjy/jieba
[34] Wikipedia. (2001). Retrieved May 22, 2020, from https://dumps.wikimedia.org/zhwiki/20200501/
[35] Singhal, A. (2001). Modern information retrieval: A brief overview. IEEE Data Engineering Bulletin, 24(4), 35-43.
[36] Gibbs, N. E., Poole Jr., W. G., & Stockmeyer, P. K. (1975). A comparison of several bandwidth and profile reduction algorithms (No. TR-6). College of William and Mary, Williamsburg, VA.
[37] Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., & Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems (pp. 288-296).
[38] Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. (2010, June). Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 100-108).
[39] Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011, July). Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (pp. 262-272).
[40] Sievert, C., & Shirley, K. (2014, June). LDAvis: A method for visualizing and interpreting topics. In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces (pp. 63-70).

Degree: Master's (碩士)
Institution: National Chengchi University (國立政治大學), Department of Computer Science
Student ID: 107753013
Source: http://thesis.lib.nccu.edu.tw/record/#G0107753013
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/131634
Other identifier: G0107753013
Type: thesis
Table of Contents (translated)
Chapter 1 Introduction: 1.1 Research Background; 1.2 Purpose and Motivation; 1.3 Thesis Structure
Chapter 2 Literature Review: 2.1 PTT-Related Research; 2.2 LDA (Latent Dirichlet Allocation) Classification Research; 2.3 Research on Foreign Forums; 2.4 Research on Article-Comment Coherence
Chapter 3 Methodology: 3.1 Data Source; 3.2 Preprocessing; 3.3 Model Training; 3.4 Determining the Optimal Number of Topics; 3.5 LDA Visualization; 3.6 Presentation of Results
Chapter 4 Experiments: 4.1 Purpose; 4.2 Corpus; 4.3 Comparing Word-Segmentation Tools; 4.4 Segmentation Excluding Adverbs; 4.5 NER Tagging; 4.6 Validating the Optimal Number of Topics (each experiment with Design and Results subsections)
Chapter 5 Results: 5.1 LDA Model Topics; 5.2 Most Similar Words via Word2Vec; 5.3 Topic Distribution within Articles; 5.4 Determining Article Topics; 5.5 Determining Comment Topics; 5.6 Venn Diagrams of Article-Comment Relations; 5.7 Article-Comment Intersection Ratios
Chapter 6 Conclusions and Future Work
References; Appendix A

File: application/pdf, 4266011 bytes
DOI: 10.6814/NCCU202001523
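The record itself contains no code, so as an illustration of the core technique named in the title — inferring per-article topic distributions with LDA — here is a minimal collapsed Gibbs sampler over a toy corpus. This is a self-contained sketch only: the toy documents and all function names are hypothetical, and the thesis itself builds its models with gensim and jieba per its references [29][33].

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA over a list of tokenized documents.

    Returns (theta, topic_word_counts): per-document topic distributions
    and raw topic-word count tables.
    """
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})          # vocabulary size
    ndk = [[0] * n_topics for _ in docs]           # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                            # tokens per topic
    z = []                                         # topic assignment per token

    # Random initialization of every token's topic.
    for di, d in enumerate(docs):
        zd = []
        for w in d:
            t = rng.randrange(n_topics)
            zd.append(t)
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zd)

    # Gibbs sweeps: resample each token from its full conditional.
    for _ in range(iters):
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                t = z[di][wi]
                ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                weights = [(ndk[di][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                           for k in range(n_topics)]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[di][wi] = t
                ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1

    # Smoothed per-document topic distributions (theta).
    theta = [[(ndk[di][k] + alpha) / (len(d) + n_topics * alpha)
              for k in range(n_topics)]
             for di, d in enumerate(docs)]
    return theta, nkw

# Toy stand-in for PTT articles: two obvious themes, sports vs. politics.
docs = [["ball", "game", "score", "team"],
        ["vote", "party", "policy", "vote"],
        ["game", "team", "ball"],
        ["policy", "party", "election"]]
theta, topic_words = lda_gibbs(docs, n_topics=2)
```

Each row of `theta` is a probability distribution over the two topics; assigning an article (or, in the thesis's setting, a comment thread) to its highest-probability topic is the basic classification step the abstract describes. Gensim's `LdaModel` performs the same inference with variational Bayes rather than Gibbs sampling.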