Title A study of identifying useful review comments for food recommendation with text mining approach (應用文字探勘於實用推薦文辨別之研究 -以愛評網美食評論為例)
Author Huang, Yi-Chen (黃怡蓁)
Contributors Yang, Chien-Min (楊建民) and Hung, Wei-Hsi (洪為璽), advisors
Huang, Yi-Chen (黃怡蓁), author
Keywords Chinese word segmentation
Data mining
Web crawling
Cluster analysis
User-generated content
Date 2019
Uploaded 7-Aug-2019 16:07:15 (UTC+8)
Abstract The internet has become an increasingly important tool for looking up information. With the rapid growth of e-commerce, consumers often read online reviews of products and shops before purchasing, and share their own experiences online afterwards. As this interaction makes user-generated reviews ever easier to find, distinguishing information from noise becomes increasingly important.
Product reviews influence both individual purchasing behavior and companies' product decisions. This study proposes an iterative, supervised learning framework that examines whether unstructured recommendation reviews are useful to potential consumers, with the goal of classifying each review as useful or non-useful.
The reviews are food reviews from iPeen (愛評網) published between January 2008 and December 2018, collected with a web crawler written in Python. The experimental data comprise 1,219 useful reviews and 478 non-useful reviews, filtered in two layers, at the user level and at the review level. Topic analysis is used to build a feature lexicon, and Support Vector Machine, Naive Bayes, and Random Forest classifiers are then applied. A prediction model is built from the classification results, and the lexicon is expanded periodically so that the model adaptively learns the patterns of new useful reviews and keeps up with changes in vocabulary over time.
The best model reaches an accuracy of 80.20%, a precision of 0.924, a recall of 0.6886, and an F-score of 0.7891. Future work could extend the approach to reviews in other domains.
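The reported F-score can be checked against the reported precision and recall. The following is a minimal sketch, assuming the F-score is the balanced F1 (the harmonic mean of precision and recall), which the abstract does not state explicitly. The function name `f_score` is hypothetical and does not come from the thesis.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1 gives the balanced F1 (assumed here)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when both inputs are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract.
print(round(f_score(0.924, 0.6886), 4))  # 0.7891
```

The result agrees with the reported 0.7891 to four decimal places, which is consistent with the F1 assumption.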
Description Master's
National Chengchi University
Department of Management Information Systems
106356027
Source http://thesis.lib.nccu.edu.tw/record/#G0106356027
Type thesis
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/124712
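dc.description.tableofcontents Abstract (Chinese)
Abstract (English)
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
Section 1 Research Motivation and Background
Section 2 Research Objectives
Section 3 Thesis Structure
Chapter 2 Literature Review
Section 1 Research on Online Reviews and Purchase Decisions
Section 2 Research on Text Mining for Information Extraction and Review Analysis
1. Information Extraction Techniques
2. Topic Analysis of Consumer Behavior
3. Analysis of Online Review Types
Chapter 3 Research Method
Section 1 Research Process
Section 2 Data Preprocessing
1. Dataset Description
2. Regular Expressions
3. Chinese Word Segmentation
4. Training and Test Set Split
Section 3 User-Level Detection
Section 4 Feature Selection and Keyword Lexicon Construction
1. Topic Model Analysis
2. Term Frequency-Inverse Document Frequency (TF-IDF)
3. Keyword Threshold Setting and Feature Selection
Section 5 Batch-Time Model Updating
Section 6 Classifiers and Evaluation Methods
1. Classifiers
2. Evaluation Methods
Chapter 4 Research Results
Section 1 Data Preprocessing
1. Data Collection Results
2. User-Level Filtering and Keyword Threshold Setting
Section 2 Results
1. Topic Model Analysis
2. Model Validation Results
3. Model Prediction Results
Chapter 5 Conclusions and Future Directions
1. Conclusions and Recommendations
2. Future Directions
References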
dc.format.extent 1020533 bytes
dc.format.mimetype application/pdf
dc.identifier.doi (DOI) 10.6814/NCCU201900451