Application of Topic Mining and Tag Clustering for Tag Recommendation (應用主題探勘與標籤聚合於標籤推薦之研究)

Title	Application of Topic Mining and Tag Clustering for Tag Recommendation (應用主題探勘與標籤聚合於標籤推薦之研究)
Author	高挺桂 (Kao, Ting Kuei)
Contributors	楊建民 (Yang, Jiann Min), advisor; 高挺桂 (Kao, Ting Kuei), author
Keywords	Tag recommendation (標籤推薦); Topic model (主題模型); Hierarchical clustering (階層式分群)
Date	2017
Upload time	31-Jul-2017 10:58:56 (UTC+8)
Abstract	Social tagging has been a popular way for users to interpret and share information since Web 2.0. As an alternative to traditional classification, its convenience and flexibility let users tag content easily. It also has drawbacks: a large amount of content carries no tags at all, and many existing tags are vague or imprecise, which weakens a system's ability to organize content by tag. To address these two problems, this study proposes an automated tag recommendation method that combines topic mining with tag clustering, with the aim of establishing recommendation rules that need no manual step and that suggest suitable tags to users. The study collected 2,500 popular Chinese-language articles from Pixnet (痞客邦) blogs, each with more than 5,000 views. After preprocessing, 1,939 articles were used to train the model and 400 were used as a test corpus to validate the method. For topic mining, the study uses an LDA topic model to compute the topic semantics of each article and associate them with existing tags, so that the topics of a new article can be predicted and topic-related tags recommended for it. Perplexity, a measure of model performance, is used to help select the number of LDA topics, which reduces the need to choose that number subjectively. For tag clustering, the study uses hierarchical clustering to group tags that have appeared together, in order to find tags with similar semantic concepts. The stopping condition is a minimum co-occurrence count of 1, which removes the need to fix the number of clusters in advance and lets the method find a suitable number of clusters automatically. The experimental results show that recommending tags according to an article's topic semantics is feasible to a degree, and that the number of topics selected with the help of perplexity gives more consistent results. Within the clusters produced by hierarchical clustering, tags in the same cluster do share similar concepts. Finally, combining topic mining with tag clustering raised Top-1 to Top-5 accuracy by 14.1% on average, with Top-1 accuracy reaching 72.25%. These results indicate that an approach grounded in how people write articles and assign tags helps improve tag recommendation accuracy, and that the proposed automated rules can suggest suitable tags so that users can tag an article more conveniently and precisely after writing it.
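Description	Master's (碩士); National Chengchi University, Department of Management Information Systems (國立政治大學資訊管理學系); 104356004
Source	http://thesis.lib.nccu.edu.tw/record/#G0104356004
Type	thesis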
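
The abstract states that perplexity guides the choice of the number of LDA topics, but this record does not give the formula used. As a hedged reference point, held-out perplexity for a topic model is commonly defined as follows. The symbols M (number of test documents), w_d (the words of document d), and N_d (the number of words in document d) are notation introduced here, not taken from the thesis.

```latex
\[
\mathrm{perplexity}(D_{\text{test}})
  = \exp\!\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)
\]
```

Lower values indicate a better fit on the test documents, so under this definition one would train models over a range of topic counts and favor the count at which perplexity is lowest or stops improving.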

dc.contributor.advisor	楊建民	zh_TW
dc.contributor.advisor	Yang, Jiann Min	en_US
dc.contributor.author (Authors)	高挺桂	zh_TW
dc.contributor.author (Authors)	Kao, Ting Kuei	en_US
dc.creator (Author)	高挺桂	zh_TW
dc.creator (Author)	Kao, Ting Kuei	en_US
dc.date (Date)	2017	en_US
dc.date.accessioned	31-Jul-2017 10:58:56 (UTC+8)	-
dc.date.available	31-Jul-2017 10:58:56 (UTC+8)	-
dc.date.issued (Upload time)	31-Jul-2017 10:58:56 (UTC+8)	-
dc.identifier (Other Identifiers)	G0104356004	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/111454	-
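dc.description (Description)	Master's (碩士)	zh_TW
dc.description (Description)	National Chengchi University (國立政治大學)	zh_TW
dc.description (Description)	Department of Management Information Systems (資訊管理學系)	zh_TW
dc.description (Description)	104356004	zh_TW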
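dc.description.tableofcontents	Chapter 1. Introduction: Research Background; Research Motivation; Research Objectives. Chapter 2. Literature Review: Tag Recommendation; Tag Clustering; Summary. Chapter 3. Research Method and Design: Data Collection and Processing; LDA Model; Topic Translation Model; Tag Concept Sets; Tag Recommendation Validation. Chapter 4. Results: Experimental Design; LDA Topic Training Results; Topic Translation Model Results; Tag Concept Set Results; Tag Recommendation Validation. Chapter 5. Conclusions and Future Work: Conclusions; Future Research Directions. References.	zh_TW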
dc.format.extent	1701939 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0104356004	en_US
dc.subject (Keywords)	標籤推薦	zh_TW
dc.subject (Keywords)	主題模型	zh_TW
dc.subject (Keywords)	階層式分群	zh_TW
dc.subject (Keywords)	Tag recommendation	en_US
dc.subject (Keywords)	Topic model	en_US
dc.subject (Keywords)	Hierarchical clustering	en_US
dc.title (Title)	應用主題探勘與標籤聚合於標籤推薦之研究	zh_TW
dc.title (Title)	Application of topic mining and tag clustering for tag recommendation	en_US
dc.type (Type)	thesis	en_US
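
The tag clustering step in the abstract (group tags that have appeared together, and stop when the co-occurrence count falls below 1) can be illustrated with a small sketch. This is not the thesis code: the function names, the single-linkage similarity, and the toy tags are all assumptions made for illustration, since the record does not say which linkage the author used.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tagged_articles):
    """Count, for each unordered tag pair, how many articles carry both tags."""
    counts = Counter()
    for tags in tagged_articles:
        for a, b in combinations(sorted(set(tags)), 2):
            counts[(a, b)] += 1
    return counts


def cluster_similarity(c1, c2, counts):
    """Single linkage (an assumption): the highest co-occurrence count
    between any tag in c1 and any tag in c2."""
    return max(counts.get(tuple(sorted((a, b))), 0) for a in c1 for b in c2)


def cluster_tags(tagged_articles, min_cooccurrence=1):
    """Agglomerative clustering that merges the most strongly co-occurring
    pair of clusters and stops once no pair reaches min_cooccurrence."""
    counts = cooccurrence_counts(tagged_articles)
    all_tags = sorted({t for tags in tagged_articles for t in tags})
    clusters = [frozenset([t]) for t in all_tags]
    while len(clusters) > 1:
        best_score, best_pair = max(
            ((cluster_similarity(c1, c2, counts), (c1, c2))
             for c1, c2 in combinations(clusters, 2)),
            key=lambda item: item[0],
        )
        if best_score < min_cooccurrence:
            break  # stopping condition: no remaining pair co-occurs enough
        c1, c2 = best_pair
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters


# Hypothetical toy data: each inner list is the tag set of one article.
articles = [
    ["travel", "food"],
    ["food", "recipe"],
    ["movie", "review"],
    ["beauty"],
]
for cluster in cluster_tags(articles):
    print(sorted(cluster))
```

Because the stopping rule depends only on a co-occurrence threshold, the number of clusters is an outcome of the data and is not set beforehand, which is the property the abstract highlights. With single linkage and a threshold of 1, the result equals the connected components of the tag co-occurrence graph; a different linkage would give different groups.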
