Title: A comparison of nominal and ordinal classification models with application to online reviews
Original title: 名目型與次序型資料之分類模型比較及其在網路文本評論之應用
Author: Liou, Ruei-Yu (柳瑞俞)
Contributors: Weng, Chiu-Hsing (翁久幸); Liou, Ruei-Yu (柳瑞俞)
Keywords: Ordered Logit Model; Multinomial Logit Model; Word2Vec; TF-IDF; FastText
Date: 2021
Upload time: 4 August 2021 14:42:47 (UTC+8)
Abstract:
With the rapid development of information technology, machine learning techniques are increasingly widely used. Ordinal data, however, are still mostly handled with nominal classification models rather than with ordinal classification models that correctly account for the ordering of the categories. McCullagh (1980) proposed a generalization of the logistic model to ordinal target variables, called the ordered logit model. This study uses three ordered logit models as ordinal classifiers, and Naïve Bayes and the multinomial logit model as nominal classifiers, to predict 13 datasets whose target variable is ordinal. Accuracy, Macro-F1, and mean squared error (MSE) serve as evaluation metrics. The ordinal classification models perform better on only six of the datasets, and in these six datasets more variables satisfy the proportional odds assumption of the ordered logit model. A statistical simulation then confirms that, for data that satisfy the model assumption, ordinal classification models do give better predictions than nominal classification models.

Finally, we extend the ordinal data problem to text classification. Movie reviews and Google reviews carry user comments together with ratings, usually from 1 to 5. We use Word2Vec, TF-IDF, and FastText word embeddings to convert the text into vectors that the models can take as input. The results show that ordinal classification models perform better on Chinese reviews, while nominal classification models perform better on English reviews. The embedding method also affects the predictions: Word2Vec performs better the more surrounding words it considers, and TF-IDF performs worst. Word2Vec takes longer to train, however; when time is limited, the publicly available Wiki pretrained word vectors trained with FastText also give reasonably good results.

References:
Agresti, A. (2003). Categorical Data Analysis, 3rd Edition. John Wiley & Sons, Inc.
Jones, K. S. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation.
Liu, B. (2020). Text sentiment analysis based on CBOW model and deep learning in big data environment. Journal of Ambient Intelligence and Humanized Computing, 11(2), 451-458.
McCullagh, P. (1980). Regression models for ordinal data. Journal of the Royal Statistical Society: Series B (Methodological), 42(2), 109-127.
Cardoso, J., & da Costa, J. P. (2007). Learning to classify ordinal data: The data replication method. Journal of Machine Learning Research, 8, 1393-1429.
Chu, W., & Keerthi, S. S. (2005). New approaches to support vector ordinal regression. In Proceedings of the 22nd International Conference on Machine Learning, 145-152.
Frank, E., & Hall, M. (2001). A simple approach to ordinal classification. ECML '01: Proceedings of the 12th European Conference on Machine Learning, 145-156.
Jain, A. P., & Dandannavar, P. (2016). Application of machine learning techniques to sentiment analysis. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), 628-632.
Koren, Y., & Sill, J. (2011). OrdRec: An ordinal model for predicting personalized item rating distributions. In Proceedings of the Fifth ACM Conference on Recommender Systems, 117-124.
Opitz, J., & Burst, S. (2019). Macro F1 and Macro F1. arXiv preprint arXiv:1911.03347.
Rennie, J. D., & Srebro, N. (2005). Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI Multidisciplinary Workshop on Advances in Preference Handling, 1.
Saad, S. E., & Yang, J. (2019). Twitter sentiment analysis based on ordinal regression. IEEE Access, 7, 163677-163685.
Jing, L. P., Huang, H. K., & Shi, H. B. (2002). Improved feature selection approach TFIDF in text mining. In Proceedings. International Conference on Machine Learning and Cybernetics, 2, 944-946.
Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
Vargas, V. M., Gutiérrez, P. A., & Hervás-Martínez, C. (2020). Cumulative link models for deep ordinal classification. Neurocomputing, 401, 48-58.
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146.
Liu, C., Li, Y., Li, P., & Fei, H. (2019). Deep skip-gram networks for text classification. In Proceedings of the 2019 SIAM International Conference on Data Mining, 145-153.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546.

Description: Master's
National Chengchi University
Department of Statistics
108354021
Source: http://thesis.lib.nccu.edu.tw/record/#G0108354021
Type: thesis
Identifier: G0108354021
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/136322
DOI: 10.6814/NCCU202100932
Format: application/pdf, 18360199 bytes
Table of contents:
1 Introduction
1.1 Research Motivation
1.2 Research Objectives
2 Literature Review
3 Methodology
3.1 Evaluation Metrics
3.2 Classification Models
3.2.1 Cumulative Logit Model
3.2.2 Continuation-Ratio Logit Model
3.2.3 Adjacent-Category Logit Model
3.2.4 Naïve Bayes
3.2.5 Multinomial Logistic Model
3.3 Hypothesis Testing and Simulation Method
3.3.1 Proportional Odds Assumption and Its Test
3.3.2 Simulation Method
3.4 Word Embedding
3.4.1 Bag-of-Words Model and TF-IDF
3.4.2 CBOW (Continuous Bag Of Words)
3.4.3 Skip-gram
3.4.4 Wiki Pretrain Word Embedding
4 Datasets
4.1 Numerical and Categorical Data
4.1.1 Summary Table of All Datasets
4.1.2 Overview of Each Dataset
4.2 Text Data
4.2.1 Chinese Text Data - Yahoo Movie Reviews
4.2.2 English Text Data - Trip Advisor Hotel Reviews
5 Empirical Analysis and Simulation Study
5.1 Empirical Analysis
5.2 Statistical Simulation
5.3 Application to Text Data
5.3.1 Chinese Text - Yahoo Movie Reviews
5.3.2 English Text - Trip Advisor Hotel Reviews
6 Conclusions and Recommendations
6.1 Conclusions
6.2 Recommendations
References
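
Illustrative note on the evaluation metrics: the abstract names accuracy, Macro-F1, and MSE as the metrics used to compare nominal and ordinal classifiers. The sketch below is not taken from the thesis; it is a minimal illustration, on made-up 1-to-5 ratings, of why MSE can separate two predictors that accuracy and Macro-F1 score identically. The function names and the toy data are hypothetical, and Macro-F1 is assumed here to be the unweighted mean of per-class F1 scores (the record does not say which variant the thesis uses).

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true rating exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true.

    Assumption: Macro-F1 is taken as the average of per-class F1 scores.
    """
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mse(y_true, y_pred):
    """Mean squared distance between true and predicted ratings."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy ratings on a 1-to-5 scale (not data from the thesis).
y_true = [1, 2, 3, 4, 5, 5]
near_miss = [1, 2, 3, 4, 4, 5]  # one 5-star review predicted as 4
far_miss = [1, 2, 3, 4, 1, 5]   # the same review predicted as 1

for name, y_pred in [("near miss", near_miss), ("far miss", far_miss)]:
    print(name, accuracy(y_true, y_pred), macro_f1(y_true, y_pred), mse(y_true, y_pred))
```

In this toy case both predictors make exactly one error, so accuracy and Macro-F1 agree, while MSE penalizes the prediction that lands four rating levels away much more heavily than the one that lands one level away. That distance sensitivity is the property that makes MSE a natural companion metric when the target is ordinal.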