Citation Information

Title 基於文字探勘技術及模型組合比較結果之旅館推薦應用
Hotel recommendation application based on text mining technology and model combination comparison results
Author 陳麒仲
Chen, Chi-Chung
Contributors 周珮婷
陳麒仲
Chen, Chi-Chung
Keywords Travel reviews
Conditional entropy
Cosine similarity
TF-IDF
Word2Vec
SVM
Date 2022
Upload time 1-Jul-2022 16:58:15 (UTC+8)
Abstract In an era of ubiquitous Internet access, booking hotels through online reservation websites is commonplace, and a hotel's ratings on those sites directly affect travelers' booking choices. Every hotel operator aims to raise its rating and reduce negative reviews, with particular emphasis on the latter, so drawing up improvement plans for the problems raised in negative reviews is an effective way to address the root cause. Travelers, for their part, want to stay in a satisfactory hotel so that their trip is not spoiled, yet booking requires checking the information for every hotel; a system that recommends suitable hotels therefore saves both time and effort.

This study used a web crawler to collect hotel reviews from Booking.com for one popular tourist country in Southern Europe and one in Northern Europe. TF-IDF text mining, combined with the information-theoretic measure of conditional entropy, was used to find the negative keywords of hotels in each country, both to help local hoteliers plan how to reduce negative reviews and to define the label for "true negative-review travelers". Word vector models and popular machine learning classification algorithms were then used to predict that label. Because the emphasis was on catching true negative-review travelers, Recall, F1 score, and AUC were chosen as the evaluation metrics. The results show that the combination of a word vector model trained with Word2Vec and an SVM classifier, which handles imbalanced data well, performed better than the other combinations; in particular, the Skip-gram model, which predicts the surrounding words from the center word, outperformed CBOW. Finally, for each traveler predicted to be a true negative reviewer, the cosine similarity between the traveler's past negative reviews and the negative keywords of each popular hotel was computed, and the hotels with the lower similarity scores were recommended.
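The final recommendation step described in the abstract can be illustrated with a minimal sketch. It assumes plain term counts as the vector representation (the thesis itself works with TF-IDF and Word2Vec features), and the function names, hotel names, and review texts are hypothetical, not taken from the thesis. Hotels are ranked by ascending cosine similarity, so the hotel whose negative keywords overlap least with the traveler's past complaints comes first.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as sharing nothing
    return dot / (norm_a * norm_b)


def rank_hotels(traveler_complaints, hotel_keywords):
    """Return (hotel, score) pairs sorted by ascending similarity.

    A lower score means the hotel's negative keywords overlap less with
    what this traveler has complained about before.
    """
    # Assumption: whitespace tokenization and raw counts stand in for
    # the thesis's actual preprocessing and weighting.
    traveler_vec = Counter(" ".join(traveler_complaints).lower().split())
    scores = {
        hotel: cosine_similarity(traveler_vec, Counter(words))
        for hotel, words in hotel_keywords.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1])


# Hypothetical toy data for illustration only.
complaints = ["noisy room and dirty bathroom", "noisy street"]
hotels = {
    "Hotel A": ["noisy", "dirty", "small"],
    "Hotel B": ["expensive", "parking"],
}
ranking = rank_hotels(complaints, hotels)
print(ranking[0][0])  # the hotel recommended first
```

Sorting in ascending order reflects the abstract's choice to recommend hotels with lower similarity scores; a content-based recommender that matches preferences rather than complaints would sort in the opposite direction.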
Description Master's
National Chengchi University
Department of Statistics
109354022
Source http://thesis.lib.nccu.edu.tw/record/#G0109354022
Type thesis
dc.contributor.advisor 周珮婷 [zh_TW]
dc.contributor.author (Authors) 陳麒仲 [zh_TW]
dc.contributor.author (Authors) Chen, Chi-Chung [en_US]
dc.creator (Author) 陳麒仲 [zh_TW]
dc.creator (Author) Chen, Chi-Chung [en_US]
dc.date (Date) 2022 [en_US]
dc.date.accessioned 1-Jul-2022 16:58:15 (UTC+8)
dc.date.available 1-Jul-2022 16:58:15 (UTC+8)
dc.date.issued (Upload time) 1-Jul-2022 16:58:15 (UTC+8)
dc.identifier (Other Identifiers) G0109354022 [en_US]
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/140754
dc.description (Description) Master's (碩士) [zh_TW]
dc.description (Description) National Chengchi University (國立政治大學) [zh_TW]
dc.description (Description) Department of Statistics (統計學系) [zh_TW]
dc.description (Description) 109354022 [zh_TW]
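dc.description.tableofcontents
1 Introduction
1.1 Research background and motivation
1.2 Research objectives
2 Literature review
2.1 Travel reviews
2.2 Feature extraction
2.3 Model performance
2.4 Recommendation methods
3 Research methods
3.1 Research process
3.2 Data collection and preprocessing
3.2.1 Data collection
3.2.2 Rating labels
3.2.3 Text preprocessing
3.2.4 Sentiment package
3.2.5 Negative review labels
3.3 Text mining
3.3.1 TF-IDF
3.3.2 Conditional Entropy
3.4 Word vector models
3.4.1 Bag of words
3.4.2 TF-IDF
3.4.3 Word2vec
3.5 Classification models
3.5.1 Random Forest
3.5.2 GBDT
3.5.3 SVM
3.5.4 Model performance evaluation
4 Results
4.1 Feature words
4.2 Model results
4.2.1 Class weights
4.2.2 Model comparison
4.3 Hotel recommendation
5 Conclusions and recommendations
5.1 Overall conclusions
5.2 Suggestions and future directions
References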
dc.format.extent 3201024 bytes
dc.format.mimetype application/pdf
dc.source.uri (Source) http://thesis.lib.nccu.edu.tw/record/#G0109354022 [en_US]
dc.subject (Keywords) Travel reviews (旅遊評論) [zh_TW]
dc.subject (Keywords) Conditional entropy (條件熵) [zh_TW]
dc.subject (Keywords) Cosine similarity (餘弦相似度) [zh_TW]
dc.subject (Keywords) TF-IDF [en_US]
dc.subject (Keywords) Word2Vec [en_US]
dc.subject (Keywords) SVM [en_US]
dc.subject (Keywords) Travel reviews [en_US]
dc.subject (Keywords) Cosine similarity [en_US]
dc.title (Title) 基於文字探勘技術及模型組合比較結果之旅館推薦應用 [zh_TW]
dc.title (Title) Hotel recommendation application based on text mining technology and model combination comparison results [en_US]
dc.type (Type) thesis [en_US]
dc.identifier.doi (DOI) 10.6814/NCCU202200539 [en_US]