Title 以LDA機率模型進行PTT論壇文章主題分類並分析文章留言與文章主題之關聯
Using Latent Dirichlet Allocation Model for Topic Modeling with Articles of PTT Forum and Analyzing Relevance of Article Comments
Author 郭泓志
Kuo, Haung-Chi
Contributors 江玥慧
劉昭麟
郭泓志
Kuo, Haung-Chi
Keywords Document topic model
Social Network Analysis
PTT
Topic Modeling
Latent Dirichlet Allocation
Date 2020
Upload time 2-Sep-2020 12:16:01 (UTC+8)
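Abstract As technology advances, posting on online social platforms and forums has become increasingly common, and people from different countries and fields increasingly gather in the same space to discuss and share opinions. Automatically classifying what each group of participants is discussing, however, remains difficult. Building on existing classification methods, this study uses PTT, a well-known Taiwanese forum, as its data source. It applies the LDA (Latent Dirichlet Allocation) model to group articles into topics and uses a Word2Vec model to classify the discussion topics of the comments posted in reply to the same article. Observing how the comments relate to the article topics can serve as a basis for further understanding of the exchanges within the forum.
With the rapid development of technology, interaction on social networking platforms has become increasingly common. People from different fields and countries gather in the same space to discuss and share opinions more and more frequently, but automatically classifying the topics under discussion remains difficult. This study uses PTT, a well-known Taiwanese online forum, as its data source and adopts the LDA (Latent Dirichlet Allocation) model to classify articles into topic groups. The results of the model are then used to investigate whether the comments on an article are related to the article in terms of topic groups. Analyzing the association between comments and articles can serve as a basis for further understanding of communication in the PTT forum.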
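The pipeline outlined in the abstract (fit LDA on articles, take each article's main topic, then check which topic a comment falls into) can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration and is not taken from the thesis: the function names, the hyperparameters `alpha` and `beta`, and the toy English tokens (real PTT text would first need Chinese word segmentation). Inference uses collapsed Gibbs sampling only because it fits in a few lines, and the comment step scores tokens against the learned topic-word distributions as a stand-in for the Word2Vec step, which is not reproduced.

```python
import math
import random


def fit_lda(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (theta, phi, vocab): per-document topic mixtures, per-topic
    word distributions, and the vocabulary indexing the columns of phi.
    alpha and beta are hypothetical smoothing values, not from the thesis.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]       # tokens per (doc, topic)
    topic_word = [[0] * V for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                     # tokens per topic
    assignments = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, wid = assignments[d][i], word_id[w]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1
    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in row]
        for row, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(n + beta) / (topic_total[k] + V * beta) for n in topic_word[k]]
        for k in range(num_topics)
    ]
    return theta, phi, vocab


def dominant_topic(theta_row):
    """Index of the topic with the largest share in one document."""
    return max(range(len(theta_row)), key=theta_row.__getitem__)


def comment_topic(tokens, phi, vocab):
    """Topic whose word distribution best explains the comment, or None
    if no token is in the vocabulary (stand-in for the Word2Vec step)."""
    word_id = {w: i for i, w in enumerate(vocab)}
    ids = [word_id[w] for w in tokens if w in word_id]
    if not ids:
        return None
    scores = [sum(math.log(row[i]) for i in ids) for row in phi]
    return max(range(len(phi)), key=scores.__getitem__)


# Toy, already-tokenized "articles" (hypothetical data).
articles = [
    ["team", "game", "score", "player", "game"],
    ["player", "team", "win", "score"],
    ["stock", "market", "price", "fund"],
    ["market", "fund", "price", "stock", "price"],
]
theta, phi, vocab = fit_lda(articles, num_topics=2)
article_topics = [dominant_topic(row) for row in theta]
print(article_topics, comment_topic(["price", "stock"], phi, vocab))
```

Comparing `comment_topic(...)` with the `dominant_topic(...)` of the article a comment replies to gives a simple per-comment "same topic or not" signal, which is one possible way to read the comment-to-article relation the abstract describes.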
References [1] Hong, L., & Davison, B. D. (2010, July). Empirical study of topic modeling in twitter. In Proceedings of the first workshop on social media analytics (pp. 80-88). ACM.
[2] Everett, B. (2013). An introduction to latent variable models. Springer Science & Business Media.
[3] Landauer, T. K., McNamara, D. S., Dennis, S., & Kintsch, W. (Eds.). (2013). Handbook of latent semantic analysis. Psychology Press.
[4] Manning, C., Raghavan, P., & Schütze, H. (2010). Introduction to information retrieval. Natural Language Engineering, 16(1), 100-103.
[5] Hofmann, T. (2000). Learning the similarity of documents: An information-geometric approach to document retrieval and categorization. In Advances in neural information processing systems (pp. 914-920).
[6] Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.
[7] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
[8] PTT. (1995.9.14). Retrieved December 23, 2019, from https://www.ptt.cc/bbs/index.html
[9] Jurafsky, D. (2000). Speech & language processing. Pearson Education India.
[10] Nenkova, A., & McKeown, K. (2012). A survey of text summarization techniques. In Mining text data (pp. 43-76). Springer, Boston, MA.
[11] Chaffar, S., & Inkpen, D. (2011, May). Using a heterogeneous dataset for emotion analysis in text. In Canadian conference on artificial intelligence (pp. 62-67). Springer, Berlin, Heidelberg.
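[12] 廖經庭. 2007. BBS 站的客家族群認同建構: 以 PTT 「Hakka Dream」版為例 [Constructing Hakka ethnic identity on a BBS: The case of PTT's "Hakka Dream" board]. Master's thesis. National Central University, Taoyuan, Taiwan.
[13] 蔣佳峰. 2017. PTT災害事件擷取系統 [A PTT disaster event extraction system]. Master's thesis. National Central University, Taoyuan, Taiwan.
[14] 陳弘君. 2017. 社群媒體中鄉民對於政治議題之迴聲室效應:以PTT八卦版為例 [The echo chamber effect among netizens on political issues in social media: The case of PTT's Gossiping board]. Master's thesis. Yuan Ze University, Taoyuan, Taiwan.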
[15] Pritchard, J. K., Stephens, M., & Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2), 945-959.
[16] Heller, K. A., & Ghahramani, Z. (2001). Bayesian hierarchical clustering. University College London, 17 Queen Square, London WC1N 3AR, UK.
[17] Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1-22.
[18] Jensen, J. L. W. V. (1906). Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta mathematica, 30, 175-193.
[19] Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The annals of mathematical statistics, 22(1), 79-86.
[20] 沈裕傑. 2008. 以語句為主之LDA模型於文件摘要之應用 [Sentence-Based Latent Dirichlet Allocation for Text Summarization]. Master's thesis. National Cheng Kung University, Tainan, Taiwan.
[21] Rajaraman, A., & Ullman, J. D. (2011). Mining of massive datasets. Cambridge University Press.
[22] Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. (2010, June). Automatic evaluation of topic coherence. In Human language technologies: The 2010 annual conference of the North American chapter of the association for computational linguistics (pp. 100-108).
[23] Moody, C. E. (2016). Mixing dirichlet topic models and word embeddings to make lda2vec. arXiv preprint arXiv:1605.02019.
[24] Wang, X., Wei, F., Liu, X., Zhou, M., & Zhang, M. (2011, October). Topic sentiment analysis in twitter: a graph-based hashtag sentiment classification approach. In Proceedings of the 20th ACM international conference on Information and knowledge management (pp. 1031-1040). ACM.
[25] Quercia, D., Askham, H., & Crowcroft, J. (2012, June). TweetLDA: supervised topic classification and link prediction in Twitter. In Proceedings of the 4th Annual ACM Web Science Conference (pp. 247-250). ACM.
[26] Ramage, D., Hall, D., Nallapati, R., & Manning, C. D. (2009, August). Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1 (pp. 248-256). Association for Computational Linguistics.
[27] Pavitt, C., & Johnson, K. K. (1999). An examination of the coherence of group discussions. Communication Research, 26(3), 303-321.
[28] Li, W., Xu, J., He, Y., Yan, S., & Wu, Y. (2019). Coherent comment generation for chinese articles with a graph-to-sequence model. arXiv preprint arXiv:1906.01231.
[29] Gensim. (n.d.). Retrieved December 23, 2019, from https://radimrehurek.com/gensim/models/word2vec.html
[30] Crummy. (1996). Retrieved December 24, 2019, from https://www.crummy.com/software/BeautifulSoup/
[31] MongoDB. (2009). Retrieved December 31, 2019, from https://www.mongodb.com/
[32] Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of research and development, 2(2), 159-165.
[33] Jieba. (n.d.). Retrieved December 31, 2019, from https://github.com/fxsjy/jieba
[34] Wikipedia. (2001). Retrieved May 22, 2020, from https://dumps.wikimedia.org/zhwiki/20200501/
[35] Singhal, A. (2001). Modern information retrieval: A brief overview. IEEE Data Eng. Bull., 24(4), 35-43.
[36] Gibbs, N. E., Poole Jr, W. G., & Stockmeyer, P. K. (1975). A Comparison of Several Bandwidth and Profile Reduction Algorithms (No. TR-6). COLLEGE OF WILLIAM AND MARY WILLIAMSBURG VA.
[37] Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., & Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In Advances in neural information processing systems (pp. 288-296).
[38] Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. (2010, June). Automatic evaluation of topic coherence. In Human language technologies: The 2010 annual conference of the North American chapter of the association for computational linguistics (pp. 100-108).
[39] Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011, July). Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (pp. 262-272).
[40] Sievert, C., & Shirley, K. (2014, June). LDAvis: A method for visualizing and interpreting topics. In Proceedings of the workshop on interactive language learning, visualization, and interfaces (pp. 63-70).
Description Master's
National Chengchi University
Department of Computer Science
107753013
Source http://thesis.lib.nccu.edu.tw/record/#G0107753013
Type thesis
dc.contributor.advisor 江玥慧; 劉昭麟
dc.identifier (Other Identifiers) G0107753013
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/131634
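dc.description.tableofcontents Chapter 1 Introduction
1.1 Research Background
1.2 Research Purpose and Motivation
1.3 Structure of This Thesis
Chapter 2 Literature Review
2.1 Related Research on PTT
2.2 Research on Classification with LDA (Latent Dirichlet Allocation)
2.3 Research on Forums Outside Taiwan
2.4 Research on the Coherence of Article Comments
Chapter 3 Research Method
3.1 Data Source
3.2 Preprocessing
3.3 Model Training
3.4 Determining the Optimal Number of Topics
3.5 LDA Visualization
3.6 Presentation of Results
Chapter 4 Experiments
4.1 Purpose of the Experiments
4.2 Experimental Corpus
4.3 Experiment with Different Word Segmentation Tools
4.3.1 Experimental Design
4.3.2 Experimental Results
4.4 Experiment Excluding Adverbs from the Segmented Parts of Speech
4.4.1 Experimental Design
4.4.2 Experimental Results
4.5 NER Tag Experiment
4.5.1 Experimental Design
4.5.2 Experimental Results
4.6 Experiment Verifying the Model's Optimal Number of Topics
4.6.1 Experimental Design
4.6.2 Experimental Results
Chapter 5 Research Results
5.1 LDA Model Topics
5.2 Finding the Most Similar Words with Word2Vec
5.3 Topic Distribution within Articles
5.4 Determining Article Topics
5.5 Determining Comment Topics
5.6 Venn Diagrams of the Relation between Articles and Comments
5.7 Intersection Ratio of the Relation between Articles and Comments
Chapter 6 Conclusion and Future Work
References
Appendix A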
dc.format.extent 4266011 bytes
dc.format.mimetype application/pdf
dc.identifier.doi (DOI) 10.6814/NCCU202001523