Using Latent Dirichlet Allocation Model for Topic Modeling with Articles of PTT Forum and Analyzing Relevance of Article Comments


Title	Using Latent Dirichlet Allocation Model for Topic Modeling with Articles of PTT Forum and Analyzing Relevance of Article Comments (original title: 以LDA機率模型進行PTT論壇文章主題分類並分析文章留言與文章主題之關聯)
Author	Kuo, Haung-Chi (郭泓志)
Contributors	江玥慧 (advisor); 劉昭麟 (advisor); Kuo, Haung-Chi (郭泓志)
Keywords	Document topic model; Social network analysis; PTT; Topic Modeling; Latent Dirichlet Allocation
Date	2020
Uploaded	2-Sep-2020 12:16:01 (UTC+8)
Abstract	With the rapid development of technology, posting on online social platforms and forums has become increasingly common, and people from different fields and countries gather ever more often in the same place to discuss and share opinions. Automatically classifying what each group of participants is discussing, however, remains difficult. Building on existing classification methods, this study uses PTT, a well-known Taiwanese online forum, as its data source and adopts the LDA (Latent Dirichlet Allocation) model to group articles into topics. A Word2Vec model is then used to identify the discussion topics of the comments posted in response to the same article, and the relationship between the comments and the article's topic is examined. Analyzing this association can serve as a basis for further understanding communication within the PTT forum.
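As a rough illustration of the LDA step named in the abstract, the sketch below fits a small topic model with collapsed Gibbs sampling in plain Python. It is not the thesis's implementation: the function name `fit_lda`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy English documents (standing in for already word-segmented PTT articles) are all assumptions made for illustration.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (doc_topic, topic_word_counts): per-document topic proportions
    and per-topic word counts. All parameter defaults are illustrative.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic_counts = [[0] * num_topics for _ in docs]
    topic_word_counts = [Counter() for _ in range(num_topics)]
    topic_totals = [0] * num_topics
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        doc_assign = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assign.append(k)
            doc_topic_counts[d][k] += 1
            topic_word_counts[k][word] += 1
            topic_totals[k] += 1
        assignments.append(doc_assign)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic_counts[d][k] -= 1
                topic_word_counts[k][word] -= 1
                topic_totals[k] -= 1
                # Weight of each topic: (doc-topic term) * (topic-word term).
                weights = [
                    (doc_topic_counts[d][t] + alpha)
                    * (topic_word_counts[t][word] + beta)
                    / (topic_totals[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic_counts[d][k] += 1
                topic_word_counts[k][word] += 1
                topic_totals[k] += 1

    # Smoothed topic proportions for each document (each row sums to 1).
    doc_topic = [
        [(c + alpha) / (len(doc) + alpha * num_topics) for c in counts]
        for doc, counts in zip(docs, doc_topic_counts)
    ]
    return doc_topic, topic_word_counts


# Toy corpus: each document is a list of tokens (hypothetical data).
docs = [
    ["game", "team", "score", "team", "win"],
    ["team", "score", "game", "player"],
    ["stock", "market", "price", "stock"],
    ["market", "price", "invest", "stock"],
]
doc_topic, topic_words = fit_lda(docs, num_topics=2)
for row in doc_topic:
    # Index of the topic with the largest proportion in this document.
    print(max(range(len(row)), key=row.__getitem__), [round(p, 2) for p in row])
```

Each printed row gives a document's dominant topic index and its topic proportions. The same proportions could in principle be computed for an article's comments and compared with the article's own dominant topic, which is the kind of article-comment comparison the abstract describes.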
Description	Master's thesis; National Chengchi University; Department of Computer Science; 107753013
Source	http://thesis.lib.nccu.edu.tw/record/#G0107753013
Type	thesis

dc.contributor.advisor	江玥慧; 劉昭麟	zh_TW
dc.contributor.author (Authors)	郭泓志	zh_TW
dc.contributor.author (Authors)	Kuo, Haung-Chi	en_US
dc.creator (Author)	郭泓志	zh_TW
dc.creator (Author)	Kuo, Haung-Chi	en_US
dc.date (Date)	2020	en_US
dc.date.accessioned	2-Sep-2020 12:16:01 (UTC+8)	-
dc.date.available	2-Sep-2020 12:16:01 (UTC+8)	-
dc.date.issued (Uploaded)	2-Sep-2020 12:16:01 (UTC+8)	-
dc.identifier (Other Identifiers)	G0107753013	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/131634	-
dc.description (Description)	Master's thesis	zh_TW
dc.description (Description)	National Chengchi University	zh_TW
dc.description (Description)	Department of Computer Science	zh_TW
dc.description (Description)	107753013	zh_TW
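dc.description.tableofcontents	Chapter 1 Introduction: 1.1 Research Background; 1.2 Research Purpose and Motivation; 1.3 Structure of This Thesis. Chapter 2 Literature Review: 2.1 Related Research on PTT; 2.2 LDA (Latent Dirichlet Allocation) Classification Research; 2.3 Research on Forums Outside Taiwan; 2.4 Research on the Coherence of Article Comments. Chapter 3 Research Method: 3.1 Data Source; 3.2 Preprocessing; 3.3 Model Training; 3.4 Determining the Optimal Number of Topics; 3.5 LDA Visualization; 3.6 Presentation of Results. Chapter 4 Experiments: 4.1 Purpose of the Experiments; 4.2 Experimental Corpus; 4.3 Experiment with Different Word Segmentation Tools (4.3.1 Experiment Design; 4.3.2 Experiment Results); 4.4 Experiment Excluding Adverbs from the Segmented Parts of Speech (4.4.1 Experiment Design; 4.4.2 Experiment Results); 4.5 NER Tag Experiment (4.5.1 Experiment Design; 4.5.2 Experiment Results); 4.6 Experiment Verifying the Model's Optimal Number of Topics (4.6.1 Experiment Design; 4.6.2 Experiment Results). Chapter 5 Results: 5.1 LDA Model Topics; 5.2 Finding the Most Similar Words with Word2Vec; 5.3 Topic Distribution within Articles; 5.4 Determining Article Topics; 5.5 Determining Comment Topics; 5.6 Venn Diagrams of the Article-Comment Relationship; 5.7 Intersection Ratio of the Article-Comment Relationship. Chapter 6 Conclusion and Future Work. References. Appendix A.	zh_TW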
dc.format.extent	4266011 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0107753013	en_US
dc.subject (Keywords)	Document topic model	zh_TW
dc.subject (Keywords)	Social network analysis	zh_TW
dc.subject (Keywords)	PTT	zh_TW
dc.subject (Keywords)	Topic Modeling	zh_TW
dc.subject (Keywords)	Latent Dirichlet Allocation	zh_TW
dc.subject (Keywords)	Latent Dirichlet Allocation	en_US
dc.subject (Keywords)	PTT	en_US
dc.subject (Keywords)	Topic Modeling	en_US
dc.subject (Keywords)	Social Network Analysis	en_US
dc.title (Title)	以LDA機率模型進行PTT論壇文章主題分類並分析文章留言與文章主題之關聯	zh_TW
dc.title (Title)	Using Latent Dirichlet Allocation Model for Topic Modeling with Articles of PTT Forum and Analyzing Relevance of Article Comments	en_US
dc.type (Type)	thesis	en_US
dc.identifier.doi (DOI)	10.6814/NCCU202001523	en_US
