Title 名目型與次序型資料之分類模型比較及其在網路文本評論之應用
A comparison of nominal and ordinal classification models with application to online reviews
Author 柳瑞俞
Liou, Ruei-Yu
Contributors 翁久幸
Weng, Chiu-Hsing
柳瑞俞
Liou, Ruei-Yu
Keywords Ordered Logit Model (次序邏輯斯模型)
Multinomial Logit Model (多元邏輯斯模型)
Word2Vec
TF-IDF
FastText
Date 2021
Upload time 4-Aug-2021 14:42:47 (UTC+8)
Abstract With the rapid development of information technology, machine learning techniques are increasingly used in practice. However, ordinal data are still commonly handled with nominal classification models rather than with ordinal classification models that properly account for the ordering of the response. McCullagh (1980) proposed an extension of the logistic model to ordinal target variables, known as the ordered logit model. This study compares three ordered logit models, as ordinal classifiers, against two nominal classifiers, Naïve Bayes and the multinomial logit model, on 13 datasets with ordinal target variables, using accuracy, Macro-F1, and mean squared error (MSE) as evaluation metrics. Only six of the datasets are predicted better by the ordinal models, and in those six, more covariates satisfy the proportional odds assumption of the ordered logit model. A simulation study then confirms that when the data satisfy the model assumption, the ordinal classification models do yield better predictions than the nominal ones.
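The cumulative logit model underlying these comparisons can be sketched in a few lines. A minimal illustration, using made-up thresholds and a made-up linear predictor rather than estimates from the thesis: each cutpoint shifts the same sigmoid, which is exactly the proportional odds assumption.

```python
import math

def ordered_logit_probs(x_beta, thresholds):
    """Class probabilities under a cumulative logit (proportional odds) model.

    P(Y <= j) = sigmoid(theta_j - x_beta); the slope x_beta is shared across
    all categories, which is the proportional odds assumption.
    """
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    # Cumulative probabilities for the first K-1 categories, then 1 for the last.
    cum = [sigmoid(t - x_beta) for t in thresholds] + [1.0]
    # Differences of adjacent cumulative probabilities give the class probabilities.
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

# Hypothetical thresholds for a 5-level rating and a linear predictor of 0.4.
p = ordered_logit_probs(0.4, thresholds=[-2.0, -0.5, 1.0, 2.5])
assert abs(sum(p) - 1.0) < 1e-9 and all(q > 0 for q in p)
```

Raising `x_beta` shifts probability mass toward the higher rating levels, which is how the model exploits the ordering that a multinomial logit ignores.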
Finally, we extend the ordinal-data problem to the currently popular task of text classification. Movie and Google reviews come with free-text comments and star ratings, usually on a 1-to-5 scale. We use Word2Vec, TF-IDF, and FastText word embeddings to convert the text into vectors the models can take as input. The results show that ordinal classification models work better for the Chinese reviews, while nominal models work better for the English reviews. The choice of embedding also affects prediction: Word2Vec, which takes more of the surrounding words into account, performs best, while TF-IDF performs worst. Word2Vec takes longer to train, however, so when training time matters, the pretrained FastText Wiki word vectors available online also give reasonable results.
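Of the three text representations above, TF-IDF is the simplest to state directly. A toy sketch of the weighting (not the thesis's implementation, and using an invented two-word corpus) shows how terms common to many reviews are down-weighted:

```python
import math
from collections import Counter

def tfidf(docs):
    """Per-document TF-IDF weights: term frequency times log inverse document frequency."""
    n = len(docs)
    df = Counter()                      # number of documents containing each word
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: (c / len(doc)) * math.log(n / df[w])
                        for w, c in tf.items()})
    return weights

docs = [["great", "movie"], ["bad", "movie"], ["great", "hotel"]]
w = tfidf(docs)
# "movie" appears in 2 of 3 documents, so it is weighted below "bad",
# which appears in only 1 — a rarer, more discriminative term.
assert w[1]["bad"] > w[1]["movie"]
```

Because TF-IDF vectors carry no information about neighboring words, it is consistent with the finding above that the context-aware Word2Vec embeddings predict ratings better.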
References Agresti, A. (2003). Categorical Data Analysis (3rd ed.). John Wiley & Sons.
Jones, K. S. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation.
Liu, B. (2020). Text sentiment analysis based on CBOW model and deep learning in big data environment. Journal of Ambient Intelligence and Humanized Computing, 11(2), 451-458.
McCullagh, P. (1980). Regression models for ordinal data. Journal of the Royal Statistical Society: Series B (Methodological), 42(2), 109-127.
Cardoso, J., & da Costa, J. P. (2007). Learning to classify ordinal data: The data replication method. Journal of Machine Learning Research, 8, 1393-1429.
Chu, W., & Keerthi, S. S. (2005). New approaches to support vector ordinal regression. In Proceedings of the 22nd International Conference on Machine Learning, 145-152.
Frank, E., & Hall, M. (2001). A simple approach to ordinal classification. In Proceedings of the 12th European Conference on Machine Learning (ECML '01), 145-156.
Jain, A. P., & Dandannavar, P. (2016). Application of machine learning techniques to sentiment analysis. In 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), 628-632.
Koren, Y., & Sill, J. (2011). OrdRec: An ordinal model for predicting personalized item rating distributions. In Proceedings of the Fifth ACM Conference on Recommender Systems, 117-124.
Opitz, J., & Burst, S. (2019). Macro F1 and macro F1. arXiv preprint arXiv:1911.03347.
Rennie, J. D., & Srebro, N. (2005). Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI Multidisciplinary Workshop on Advances in Preference Handling, 1.
Saad, S. E., & Yang, J. (2019). Twitter sentiment analysis based on ordinal regression. IEEE Access, 7, 163677-163685.
Jing, L. P., Huang, H. K., & Shi, H. B. (2002). Improved feature selection approach TFIDF in text mining. In Proceedings of the International Conference on Machine Learning and Cybernetics, 2, 944-946.
Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
Vargas, V. M., Gutiérrez, P. A., & Hervás-Martínez, C. (2020). Cumulative link models for deep ordinal classification. Neurocomputing, 401, 48-58.
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146.
Liu, C., Li, Y., Li, P., & Fei, H. (2019). Deep skip-gram networks for text classification. In Proceedings of the 2019 SIAM International Conference on Data Mining, 145-153.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546.
Description Master's thesis
National Chengchi University
Department of Statistics
108354021
Source http://thesis.lib.nccu.edu.tw/record/#G0108354021
Data type thesis
dc.identifier (Other Identifiers) G0108354021
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/136322
dc.description.tableofcontents 1 Introduction 1
1.1 Research motivation 1
1.2 Research objectives 2
2 Literature review 3
3 Methodology 5
3.1 Evaluation metrics 5
3.2 Classification models 7
3.2.1 Cumulative Logit Model 7
3.2.2 Continuation-Ratio Logit Model 8
3.2.3 Adjacent-Category Logit Model 9
3.2.4 Naïve Bayes 10
3.2.5 Multinomial Logistic Model 10
3.3 Hypothesis testing and simulation 11
3.3.1 Proportional odds assumption and its test 11
3.3.2 Simulation method 13
3.4 Word Embedding 13
3.4.1 Bag of words and TF-IDF 13
3.4.2 CBOW (Continuous Bag of Words) 16
3.4.3 Skip-gram 18
3.4.4 Wiki Pretrain Word Embedding 19
4 Datasets 21
4.1 Numerical and categorical data 21
4.1.1 Overview of all datasets 21
4.1.2 Description of each dataset 23
4.2 Text data 27
4.2.1 Chinese text - Yahoo movie reviews 27
4.2.2 English text - TripAdvisor Hotel Reviews 28
5 Empirical analysis and simulation study 31
5.1 Empirical analysis 31
5.2 Statistical simulation 37
5.3 Applications to text data 41
5.3.1 Chinese text - Yahoo movie reviews 42
5.3.2 English text - TripAdvisor Hotel Reviews 49
6 Conclusions and suggestions 57
6.1 Conclusions 57
6.2 Suggestions 59
References 61
dc.format.extent 18360199 bytes
dc.format.mimetype application/pdf
dc.identifier.doi (DOI) 10.6814/NCCU202100932