Hotel recommendation application based on text mining technology and model combination comparison results

Title	Hotel recommendation application based on text mining technology and model combination comparison results (基於文字探勘技術及模型組合比較結果之旅館推薦應用)
Author	Chen, Chi-Chung (陳麒仲)
Contributors	周珮婷 (advisor); Chen, Chi-Chung (陳麒仲)
Keywords	Travel reviews (旅遊評論); Conditional entropy (條件熵); Cosine similarity (餘弦相似度); TF-IDF; Word2Vec; SVM
Date	2022
Upload time	1-Jul-2022 16:58:15 (UTC+8)
Abstract	Booking hotels through online reservation websites is now routine, and a hotel's reviews on those sites directly influence travelers' booking choices. Every hotel operator aims to raise its rating and reduce negative reviews from guests, and reducing negative reviews is the greater priority; drawing up improvement plans that target the problems raised in negative reviews is therefore an effective way to address the root cause. Travelers, in turn, want to stay at a hotel that satisfies them and does not spoil their trip, yet booking requires checking the information for each hotel, so a system that recommends suitable hotels saves both time and effort. This study uses a web crawler to collect hotel reviews from Booking.com for one popular tourist country in Southern Europe and one in Northern Europe. TF-IDF text mining, combined with conditional entropy as an information measure, is used to find the negative keywords of hotels in a specific country, which helps local hoteliers plan how to reduce negative reviews, and to define the label for genuinely negative reviewers. Word vector models and popular machine learning classification algorithms are then used to predict that label. Because the emphasis is on identifying genuinely negative reviewers, Recall, F1 Score, and AUC Score are used as the evaluation metrics. The results show that the combination of a word vector model trained with Word2Vec and an SVM classifier, which handles imbalanced data well, performs better; in particular, the Skip-gram model, which predicts the surrounding words from the center word, outperforms CBOW. Finally, for each traveler predicted to be a genuinely negative reviewer, the cosine similarity between that traveler's past negative reviews and the negative keywords of each popular hotel is computed, and the hotels with the lower similarity scores are recommended.
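The final recommendation step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes plain term counts in place of the thesis's feature weighting, and the hotel names, keyword lists, and the helper names `cosine_similarity` and `rank_hotels` are hypothetical. It only shows the idea of scoring a traveler's past complaints against each hotel's negative keywords and preferring the hotels with the lowest similarity.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as sharing nothing
    return dot / (norm_a * norm_b)


def rank_hotels(complaint_tokens, hotel_keywords):
    """Return (hotel, score) pairs sorted by ascending similarity.

    A lower score means the hotel's negative keywords overlap less
    with what this traveler has complained about before.
    """
    complaint = Counter(complaint_tokens)
    scores = {
        name: cosine_similarity(complaint, Counter(words))
        for name, words in hotel_keywords.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1])


# Hypothetical data, for illustration only.
traveler_complaints = ["noisy", "dirty", "noisy", "small"]
hotel_negative_keywords = {
    "Hotel A": ["noisy", "dirty", "smell"],
    "Hotel B": ["expensive", "parking"],
    "Hotel C": ["small", "breakfast"],
}

ranking = rank_hotels(traveler_complaints, hotel_negative_keywords)
for hotel, score in ranking:
    print(f"{hotel}: {score:.3f}")
```

In this toy data, the hotel whose negative keywords share no terms with the traveler's complaints is ranked first, which matches the abstract's rule of recommending hotels with lower similarity scores.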
Description	Master's thesis; National Chengchi University; Department of Statistics; 109354022
Source	http://thesis.lib.nccu.edu.tw/record/#G0109354022
Type	thesis

dc.contributor.advisor	周珮婷	zh_TW
dc.contributor.author (Authors)	陳麒仲	zh_TW
dc.contributor.author (Authors)	Chen, Chi-Chung	en_US
dc.creator (Author)	陳麒仲	zh_TW
dc.creator (Author)	Chen, Chi-Chung	en_US
dc.date (Date)	2022	en_US
dc.date.accessioned	1-Jul-2022 16:58:15 (UTC+8)	-
dc.date.available	1-Jul-2022 16:58:15 (UTC+8)	-
dc.date.issued (Upload time)	1-Jul-2022 16:58:15 (UTC+8)	-
dc.identifier (Other Identifiers)	G0109354022	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/140754	-
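dc.description (Description)	碩士 (Master's degree)	zh_TW
dc.description (Description)	國立政治大學 (National Chengchi University)	zh_TW
dc.description (Description)	統計學系 (Department of Statistics)	zh_TW
dc.description (Description)	109354022	zh_TW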
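dc.description.tableofcontents	1 Introduction; 1.1 Research background and motivation; 1.2 Research objectives; 2 Literature review; 2.1 Travel reviews; 2.2 Feature extraction; 2.3 Model performance; 2.4 Recommendation methods; 3 Methodology; 3.1 Research workflow; 3.2 Data collection and preprocessing; 3.2.1 Data collection; 3.2.2 Rating labels; 3.2.3 Text preprocessing; 3.2.4 Sentiment package; 3.2.5 Negative review labels; 3.3 Text mining; 3.3.1 TF-IDF; 3.3.2 Conditional Entropy; 3.4 Word vector models; 3.4.1 Bag of words; 3.4.2 TF-IDF; 3.4.3 Word2vec; 3.5 Classification models; 3.5.1 Random Forest; 3.5.2 GBDT; 3.5.3 SVM; 3.5.4 Model performance evaluation; 4 Results; 4.1 Feature words; 4.2 Model results; 4.2.1 Class weights; 4.2.2 Model comparison; 4.3 Hotel recommendation; 5 Conclusions and suggestions; 5.1 Overall conclusions; 5.2 Suggestions and future directions; References	zh_TW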
dc.format.extent	3201024 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0109354022	en_US
dc.subject (Keywords)	旅遊評論 (Travel reviews)	zh_TW
dc.subject (Keywords)	條件熵 (Conditional entropy)	zh_TW
dc.subject (Keywords)	餘弦相似度 (Cosine similarity)	zh_TW
dc.subject (Keywords)	TF-IDF	en_US
dc.subject (Keywords)	Word2Vec	en_US
dc.subject (Keywords)	SVM	en_US
dc.subject (Keywords)	Travel reviews	en_US
dc.subject (Keywords)	Cosine similarity	en_US
dc.title (Title)	基於文字探勘技術及模型組合比較結果之旅館推薦應用	zh_TW
dc.title (Title)	Hotel recommendation application based on text mining technology and model combination comparison results	en_US
dc.type (Type)	thesis	en_US
dc.identifier.doi (DOI)	10.6814/NCCU202200539	en_US
