深度學習於中文句子之表示法學習 (Deep learning techniques for Chinese sentence representation learning)

Title	深度學習於中文句子之表示法學習 Deep learning techniques for Chinese sentence representation learning
Author	管芸辰 Kuan, Yun Chen
Contributors	蔡銘峰 Tsai, Ming Feng (advisor); 管芸辰 Kuan, Yun Chen (author)
Keywords	Deep learning; Distributed representation; Sentiment analysis
Date	2018
Upload time	2-Mar-2018 12:05:00 (UTC+8)
Abstract	This thesis examines how recently developed deep learning techniques can be applied to learning distributed representations of Chinese sentences. Deep learning has attracted great attention in recent years, and related techniques have developed rapidly. However, most distributed representation methods have been evaluated mainly on English and other Indo-European languages, and were developed around the properties of those languages. Beyond the Indo-European family, the Sino-Tibetan and Altaic language families also have large numbers of speakers, and there are languages such as Japanese and Korean that belong to independent families, each with its own characteristics. Chinese belongs to the Sino-Tibetan family and has quite distinct properties: it is an isolating language, it is tonal, and it uses measure words. Many recent papers use multilingual datasets for evaluation, but few discuss how performance differs across languages. This thesis uses sentence-level sentiment classification experiments on a Chinese Weibo dataset to compare recent deep learning techniques with traditional word-vector representations. With TF-IDF as the baseline, we compare the performance of three other models, PVDM, Siamese-CBOW, and FastText, and examine in depth how these models perform on Chinese sentence sentiment classification.
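Description	Master's thesis; 國立政治大學 (National Chengchi University); 資訊科學系碩士在職專班 (Department of Computer Science, in-service master's program); 103971010
Source	http://thesis.lib.nccu.edu.tw/record/#G0103971010
Type	thesis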
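
The abstract names TF-IDF as the baseline representation for sentence sentiment classification. The following is a minimal sketch of that idea, not the thesis's implementation. Its assumptions: character bigrams stand in for word segmentation, a nearest-centroid classifier replaces the SVM listed in the table of contents so the example needs only the standard library, the four training sentences are invented toy data rather than Weibo posts, and all function names are hypothetical.

```python
import math
from collections import Counter

# Invented toy sentences (NOT from the Weibo dataset), for illustration only.
TRAIN = [
    ("今天天氣真好，心情很開心", "pos"),
    ("這部電影很好看，我很喜歡", "pos"),
    ("服務很差，我很生氣", "neg"),
    ("今天好倒楣，心情很糟", "neg"),
]

def char_ngrams(text, n=2):
    """Overlapping character n-grams; avoids needing a Chinese word segmenter."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

def fit_idf(docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}

def l2_normalise(vec):
    """Scale a sparse vector to unit length; an all-zero vector becomes empty."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}

def tfidf_vector(tokens, idf):
    """Sparse unit-length TF-IDF vector; tokens unseen during fitting are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    return l2_normalise({t: c * idf[t] for t, c in tf.items()})

def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(t, 0.0) for t, v in a.items())

def train(samples):
    """Fit IDF weights and one unit-length centroid per sentiment label."""
    docs = [char_ngrams(text) for text, _ in samples]
    idf = fit_idf(docs)
    sums = {}
    for tokens, (_, label) in zip(docs, samples):
        centroid = sums.setdefault(label, Counter())
        for t, v in tfidf_vector(tokens, idf).items():
            centroid[t] += v
    return idf, {label: l2_normalise(c) for label, c in sums.items()}

def predict(text, idf, centroids):
    """Return the label whose centroid is most similar to the sentence vector."""
    vec = tfidf_vector(char_ngrams(text), idf)
    return max(sorted(centroids), key=lambda label: cosine(vec, centroids[label]))

if __name__ == "__main__":
    idf, centroids = train(TRAIN)
    print(predict("我很開心", idf, centroids))  # pos
    print(predict("服務很差", idf, centroids))  # neg
```

Character n-grams are used here only because they work without a segmentation step. A segmenter and a margin-based classifier would be closer to the "TF-IDF + SVM" setup that the table of contents names.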

dc.contributor.advisor	蔡銘峰	zh_TW
dc.contributor.advisor	Tsai, Ming Feng	en_US
dc.contributor.author (Authors)	管芸辰	zh_TW
dc.contributor.author (Authors)	Kuan, Yun Chen	en_US
dc.creator (Author)	管芸辰	zh_TW
dc.creator (Author)	Kuan, Yun Chen	en_US
dc.date (Date)	2018	en_US
dc.date.accessioned	2-Mar-2018 12:05:00 (UTC+8)	-
dc.date.available	2-Mar-2018 12:05:00 (UTC+8)	-
dc.date.issued (Upload time)	2-Mar-2018 12:05:00 (UTC+8)	-
dc.identifier (Other Identifiers)	G0103971010	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/116164	-
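dc.description (Description)	碩士 (Master's)	zh_TW
dc.description (Description)	國立政治大學 (National Chengchi University)	zh_TW
dc.description (Description)	資訊科學系碩士在職專班 (Department of Computer Science, in-service master's program)	zh_TW
dc.description (Description)	103971010	zh_TW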
dc.description.tableofcontents	1 Introduction; 1.1 Background; 1.2 Purpose; 2 Related Work; 2.1 Traditional Approach; 2.2 Chinese Related Sentiment Analysis; 2.3 Advanced Approach; 3 Methodology; 3.1 TF-IDF + SVM; 3.2 Fasttext; 3.3 Paragraph Vector; 3.4 Siamese-CBOW; 4 Experiments; 4.1 Experimental Settings; 4.2 Preprocess; 4.3 PVDM; 4.4 FastText; 4.5 Siamese-CBOW; 4.6 Experimental Results; 5 Discussions; 5.1 Discussion; 5.2 Baseline; 5.3 Siamese-CBOW; 5.4 PVDM; 5.5 FastText; 5.5.1 N-grams Evaluation; 5.5.2 Subword Information; 6 Conclusion	zh_TW
dc.format.extent	1442897 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0103971010	en_US
dc.subject (Keywords)	深度學習	zh_TW
dc.subject (Keywords)	分散式表示	zh_TW
dc.subject (Keywords)	情緒分類	zh_TW
dc.subject (Keywords)	Deep learning	en_US
dc.subject (Keywords)	Distributed representation	en_US
dc.subject (Keywords)	Sentiment analysis	en_US
dc.title (Title)	深度學習於中文句子之表示法學習	zh_TW
dc.title (Title)	Deep learning techniques for Chinese sentence representation learning	en_US
dc.type (Type)	thesis	en_US