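Title 基於文字探勘技術及模型組合比較結果之旅館推薦應用
Hotel recommendation application based on text mining technology and model combination comparison results
Author 陳麒仲 (Chen, Chi-Chung)
Contributors 周珮婷
陳麒仲 (Chen, Chi-Chung)
Keywords Travel reviews
Conditional entropy
Cosine similarity
TF-IDF
Word2Vec
SVM
Date 2022
Uploaded 1-Jul-2022 16:58:15 (UTC+8)
Abstract (zh_TW version) In this age of widespread Internet access, booking hotels through online reservation websites has become routine, and a hotel's reviews on those sites directly influence travelers' booking choices. Raising their ratings and reducing negative reviews from guests is a goal every hotel operator pursues, with particular emphasis on reducing negative reviews. Drawing up improvement plans that address the problems raised in negative reviews is therefore an effective way to improve a hotel's reputation at the root. Travelers, for their part, want to stay in a hotel that satisfies them and does not spoil their trip, yet booking still requires checking the information for each hotel, so a system that recommends suitable hotels saves both time and effort. This study uses a web crawler to collect hotel reviews from the booking site Booking.com for one popular tourist country in northern Europe and one in southern Europe. It combines the text-mining method TF-IDF with the information measure conditional entropy to find negative keywords for hotels in a given country, which helps local hotel operators plan how to reduce negative reviews, and it defines a label for travelers who write genuinely negative reviews. Word-vector models and popular machine-learning classification algorithms are then used to predict that label. Because the emphasis is on catching genuinely negative reviewers, Recall, F1 score, and AUC score are used as the evaluation metrics. The results show that the combination of a word-vector model trained with Word2Vec and an SVM classifier, which handles imbalanced data well, performs better; in particular, the Skip-gram model, which predicts the surrounding words from the center word, outperforms CBOW. Finally, for each traveler predicted to be a genuinely negative reviewer, the cosine similarity between the negative reviews that traveler has left and the negative keywords of each popular hotel is computed, and the hotels with lower similarity scores are recommended.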
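The recommendation step at the end of the abstract can be illustrated with a minimal sketch. It assumes plain term-count vectors and whitespace tokenization; the record does not say which vector representation the thesis uses for this step. The function names, the `top_n` parameter, and all hotel names and keywords are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def recommend(complaints, hotel_keywords, top_n=2):
    """Rank hotels by ascending similarity to the traveler's complaints.

    A low score means the hotel's negative keywords overlap little with
    what this traveler has complained about before.
    """
    traveler = Counter(w for text in complaints for w in text.lower().split())
    scores = [
        (hotel, cosine_similarity(traveler, Counter(words)))
        for hotel, words in hotel_keywords.items()
    ]
    scores.sort(key=lambda pair: (pair[1], pair[0]))  # lowest similarity first
    return scores[:top_n]


# Hypothetical data for illustration only, not taken from the thesis.
complaints = ["noisy room and dirty bathroom", "room was noisy at night"]
hotel_keywords = {
    "Hotel A": ["noisy", "dirty", "room"],
    "Hotel B": ["breakfast", "parking", "expensive"],
    "Hotel C": ["noisy", "staff", "wifi"],
}
ranking = recommend(complaints, hotel_keywords)
```

In this toy example "Hotel B" ranks first because none of its negative keywords appear in the traveler's past complaints, while "Hotel A", whose keywords match those complaints closely, is ranked last and left out of the top two.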
Abstract (en_US version) In the Internet era, booking hotels through online reservation websites is commonplace, and a hotel's reviews on those sites directly affect travelers' booking choices. Every hotel operator wants to raise its rating and reduce the negative reviews left by guests, and reducing negative reviews matters most. Operators should therefore build improvement plans around the problems raised in negative reviews. The goal of this research is to help local hoteliers develop plans to reduce negative reviews. A web crawler was used to collect hotel reviews from Booking.com. TF-IDF text mining was combined with conditional entropy to find the negative keywords of hotels in a specific country. Word-vector models and popular machine-learning classification algorithms were used to identify travelers who leave negative reviews. The evaluation metrics were Recall, F1 score, and AUC score. The results show that the word-vector model trained with Word2Vec, combined with the SVM classifier, performs better on imbalanced data. The Skip-gram model, which predicts the surrounding words from the center word, outperforms CBOW. Finally, a cosine similarity score was calculated against the negative keywords of each popular hotel and used to recommend hotels.
Description Master's
National Chengchi University
Department of Statistics
109354022
Source http://thesis.lib.nccu.edu.tw/record/#G0109354022
Type thesis
dc.contributor.advisor 周珮婷 zh_TW
dc.date.accessioned 1-Jul-2022 16:58:15 (UTC+8)
dc.date.available 1-Jul-2022 16:58:15 (UTC+8)
dc.identifier (Other Identifiers) G0109354022 en_US
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/140754
dc.description.tableofcontents
1 Introduction
1.1 Research background and motivation
1.2 Research objectives
2 Literature review
2.1 Travel reviews
2.2 Feature extraction
2.3 Model performance
2.4 Recommendation methods
3 Research methods
3.1 Research workflow
3.2 Data collection and preprocessing
3.2.1 Data collection
3.2.2 Rating labels
3.2.3 Text preprocessing
3.2.4 Sentiment package
3.2.5 Negative-review labels
3.3 Text mining
3.3.1 TF-IDF
3.3.2 Conditional Entropy
3.4 Word-vector models
3.4.1 Bag of words
3.4.2 TF-IDF
3.4.3 Word2vec
3.5 Classification models
3.5.1 Random Forest
3.5.2 GBDT
3.5.3 SVM
3.5.4 Model performance evaluation
4 Results
4.1 Feature words
4.2 Model results
4.2.1 Class weights
4.2.2 Model comparison
4.3 Hotel recommendation
5 Conclusions and recommendations
5.1 Overall conclusions
5.2 Suggestions and future directions
References
dc.format.extent 3201024 bytes
dc.format.mimetype application/pdf
dc.identifier.doi (DOI) 10.6814/NCCU202200539 en_US
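The table of contents lists conditional entropy (section 3.3.2) as part of the keyword-mining step, but this record does not give the thesis's exact formulation. The sketch below therefore assumes one common reading: the entropy of the review label given whether a word is present or absent, where a lower value means the word tells us more about the label. The function names and the toy reviews are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def conditional_entropy(reviews, word):
    """H(label | word present or absent), in bits.

    reviews is a list of (tokens, label) pairs, where tokens is a set of words.
    """
    groups = {True: Counter(), False: Counter()}
    for tokens, label in reviews:
        groups[word in tokens][label] += 1
    n = len(reviews)
    h = 0.0
    for label_counts in groups.values():
        size = sum(label_counts.values())
        if size:
            # Weight each group's label entropy by the share of reviews in it.
            h += (size / n) * entropy(list(label_counts.values()))
    return h


# Hypothetical labelled reviews, for illustration only.
reviews = [
    ({"dirty", "room"}, "neg"),
    ({"dirty", "noisy"}, "neg"),
    ({"clean", "room"}, "pos"),
    ({"friendly", "staff"}, "pos"),
]
h_dirty = conditional_entropy(reviews, "dirty")
h_room = conditional_entropy(reviews, "room")
```

Here "dirty" appears only in negative reviews, so knowing whether it occurs removes all uncertainty about the label, whereas "room" appears equally in both classes and leaves the label as uncertain as before. Under this assumption, words with low conditional entropy that occur mainly in negative reviews would be the candidates for negative keywords.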