Please use this identifier to cite or link to this item: https://ah.nccu.edu.tw/handle/140.119/132064


Title: 跨語言遷移學習在惡意留言偵測上的應用
Cross-lingual Transfer Learning for Toxic Comment Detection
Authors: 陳冠宇
Chen, Kuan-Yu
Contributors: 蔡炎龍
陳冠宇
Chen, Kuan-Yu
Keywords: Transformer
XLM-R
cross-lingual prediction
toxic comment
imbalanced data
deep learning
conversational safety
Date: 2020
Issue Date: 2020-10-05 15:16:14 (UTC+8)
Abstract: The Transformer opened a new door in natural language processing and moved the field a significant step forward by letting models better capture the relationships between words in a text. Its architecture has been extended into many language models, including the cross-lingual models XLM and XLM-R, which have achieved strong results on a variety of tasks. In this thesis, we show that data from a high-resource language can compensate for the scarcity of data in a low-resource language, using toxic comment detection as an example. The training data are the English data released for the Jigsaw Multilingual Toxic Comment Classification competition and comments from the PTT Hate board; the model is asked to predict whether Chinese comments are toxic, and the English data set is much larger than the Chinese one. We compare three settings: fine-tuning the model on English data only, fine-tuning on Chinese data only, and fine-tuning on the two data sets combined. Fine-tuning on English data alone reaches a best accuracy of 75.9%, which we attribute to the large amount of English data, while the combined data give the highest overall accuracy, 88.3%. Overall, cross-lingual models can make up for the lack of data in low-resource languages and offer another way to address the low-resource language problem.
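The three training settings named in the abstract can be sketched as follows. This is a minimal illustration only, not the thesis code: the helper names `build_training_sets` and `accuracy`, the toy comments, and the always-non-toxic stand-in predictor are all assumptions introduced here. In the thesis the predictor would be a fine-tuned XLM-R model, which this sketch does not attempt to reproduce.

```python
from typing import Callable, Dict, List, Tuple

# (comment text, label): 1 = toxic, 0 = non-toxic
Example = Tuple[str, int]


def build_training_sets(
    english: List[Example], chinese: List[Example]
) -> Dict[str, List[Example]]:
    """Assemble the three training settings: English only, Chinese only, mixed."""
    return {
        "english_only": list(english),
        "chinese_only": list(chinese),
        "mixed": list(english) + list(chinese),
    }


def accuracy(predict: Callable[[str], int], test_set: List[Example]) -> float:
    """Fraction of test comments whose predicted label matches the gold label."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    correct = sum(1 for text, label in test_set if predict(text) == label)
    return correct / len(test_set)


# Toy placeholder data (hypothetical, not taken from the thesis data sets).
english_train = [("you are an idiot", 1), ("thanks for the help", 0), ("nice post", 0)]
chinese_train = [("你真的很笨", 1), ("謝謝分享", 0)]
chinese_test = [("你很笨", 1), ("謝謝你", 0)]

training_sets = build_training_sets(english_train, chinese_train)

# Stand-in predictor that always answers "non-toxic"; a real experiment would
# plug in a model fine-tuned on one of the three training sets instead.
baseline_accuracy = accuracy(lambda _text: 0, chinese_test)
```

In every setting the model is evaluated on Chinese comments only, so the three accuracy figures are comparable even though the training sets differ in language and size.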
Description: Master's
National Chengchi University
Department of Applied Mathematics
107751010
Source URI: http://thesis.lib.nccu.edu.tw/record/#G0107751010
Data Type: thesis
Appears in Collections: [Department of Applied Mathematics] Theses

Files in This Item:

File: 101001.pdf
Size: 1346 Kb
Format: Adobe PDF


All items in 學術集成 are protected by copyright, with all rights reserved.

