Title 以詞性組合為基礎之中文語言特徵研究
A Study of Part-of-Speech Pair-based Language Features in Chinese Texts
Author 江易倫
Jiang, Yi Lun
Contributors 劉吉軒
Liu, Jyi Shane
江易倫
Jiang, Yi Lun
Keywords Authorship attribution
Language features
Random forest
Date 2017
Upload time 28-Aug-2017 11:41:25 (UTC+8)
Abstract In authorship attribution research, the choice of language features has always been important because it directly affects prediction performance. Commonly used language features such as high-frequency words, n-grams, and punctuation classify well, but the phrases within these features cannot explain the causal relationships or the differences between classes. To address this problem, this thesis proposes three linguistically meaningful language features as supplementary verification: part-of-speech combinations, negation-degree combinations, and modal-word combinations. Using the texts of the author Lei Chen (雷震) as the baseline, it examines whether these features are applicable in two research settings: "same topic, different authors" and "same author, different topics". The thesis uses the random forest algorithm to build classification models, evaluates their classification performance with the out-of-bag (OOB) error rate, and uses feature importance values to find the weight of each phrase as a decision point. Finally, it aims to identify from the classification rules the phrases that are distinctive of different authors and different genres, and to explain them.
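The part-of-speech pair features named in the title can be illustrated with a minimal sketch. It assumes that a pair is formed from the tags of two adjacent tokens in an already segmented and tagged sentence, and that a text vector holds relative pair frequencies. This record does not state the thesis's exact definition, and the function name `pos_pair_vector`, the tag labels, and the sample sentence are hypothetical.

```python
from collections import Counter


def pos_pair_vector(tagged_tokens):
    """Return relative frequencies of adjacent part-of-speech tag pairs.

    Assumption: a "pair" is the tags of two neighbouring tokens; the thesis
    may define its combinations differently.
    """
    tags = [tag for _word, tag in tagged_tokens]
    # Pair each tag with the tag that follows it.
    pairs = Counter(zip(tags, tags[1:]))
    total = sum(pairs.values())
    if total == 0:
        return {}
    return {pair: count / total for pair, count in pairs.items()}


# Hypothetical tagged sentence: (word, tag) tuples with illustrative tag labels.
sentence = [("我", "PRON"), ("不", "ADV"), ("能", "MODAL"), ("同意", "VERB")]
vector = pos_pair_vector(sentence)
```

Vectors of this kind, one per text, could then serve as input to a random forest classifier, whose out-of-bag error and feature importances correspond to the evaluation described in the abstract.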
References 杜協昌,〈利用文本採礦探討《紅樓夢》的後四十回作者爭議〉,2012數位典藏與數位人文國際研討會,頁135-162,國立台灣大學,2012。
A. Abbasi, and H. Chen, “Writeprints: A Stylometric Approach to Identity-Level Identification and Similarity Detection in Cyberspace,” ACM Transactions on Information Systems, vol. 26, no. 2, pp. 1-29, Mar. 2008.
J. Wang, “A critical discourse analysis of Barack Obama’s speeches,” Journal of Language Teaching and Research, vol. 1, no. 3, pp. 254-261, May 2010.
薛化元,《自由中國與民主憲政:1950年代台灣思想史的一個考察》,臺北縣板橋市:稻鄉出版社,頁1-11,1996。
M. Koppel, J. Schler, and S. Argamon, “Authorship Attribution: What’s Easy and What’s Hard?” Journal of Law & Policy, vol. 21, no. 2, pp. 317-331, Jun. 2013.
M. Koppel, J. Schler, and S. Argamon, “Authorship attribution in the wild,” Language Resources and Evaluation, vol. 45, no. 1, pp. 83-94, Mar. 2011.
N. Zechner, “The past, present and future of text classification,” in 2013 European Intelligence and Security Informatics Conference. EISIC’13, Aug. 2013, pp. 230-230.
郉義田,〈居延漢簡資料庫的建立與展望〉,2015數位典藏與數位人文國際研討會,頁1-7,國立台灣大學,2015。
胡適,《中國章回小說考證》,天津市:南開大學出版社,頁187-328,2014。
E. Stamatatos, “A survey of modern authorship attribution methods,” Journal of the American Society for Information Science and Technology, vol. 60, no. 3, pp. 538-556, Mar. 2009.
M. Koppel and Y. Winter, “Determining if two documents are written by the same author,” Journal of the Association for Information Science and Technology, vol. 65, no. 1, pp. 178-187, Jan. 2014.
V. G. Ashok, S. Feng, and Y. Choi, “Success with style: Using writing style to predict the success of novels,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Oct. 2013, pp. 1753–1764.
S. Bird and E. Loper, “NLTK: the natural language toolkit,” Proceedings of the ACL-02 Workshop on Effective tools and methodologies for teaching natural language processing and computational linguistics-Volume 1, Association for Computational Linguistics, pp. 63-70, 2002.
B. Yu, “Function words for Chinese authorship attribution,” Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature, Association for Computational Linguistics, pp. 45-53, 2012.
A. Rocha, et al., “Authorship Attribution for Social Media Forensics,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 1, pp. 5-33, Jan. 2017.
L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5-32, Oct. 2001.
T. M. Oshiro, P. S. Perez, and J. A. Baranauskas, “How many trees in a random forest?” in Machine Learning and Data Mining in Pattern Recognition, Jul. 2012, pp. 154-168.
L. Breiman, “Bagging predictors,” Machine learning, vol. 24, no. 2, pp. 123-140, Aug. 1996.
A. Caliskan-Islam, Stylometric Fingerprints and Privacy Behavior in Textual Data. Drexel University, pp. 81-85, 2015.
M. L. Pacheco, K. Fernandes, and A. Porco, “Random forest with increased generalization: A universal background approach for authorship verification,” in Conference and Labs of the Evaluation Forum, 2015.
M. Popescu and C. Grozea, “Kernel methods and string kernels for authorship analysis,” in Conference and Labs of the Evaluation Forum, 2012.
L. Marujo, et al., “Textual Event Detection using Fuzzy Fingerprints,” in Intelligent Systems’2014, Springer International Publishing, pp. 825-836, 2015.
T. R. Reddy, B. V. Vardhan, and P. V. Reddy, “A Survey on Authorship Profiling Techniques,” International Journal of Applied Engineering Research, vol. 11, no. 5, pp. 3092-3102, 2016.
M. Kuta, B. Puto, and J. Kitowski, “Authorship Attribution of Polish Newspaper Articles,” in Artificial Intelligence and Soft Computing, Springer International Publishing, 29 May 2016, pp. 474-483.
A. Palomino-Garibay, et al., “A Random Forest Approach for Authorship Profiling,” in Conference and Labs of the Evaluation Forum, 2015.
P. Galán-García, et al., “Supervised Machine Learning for the Detection of Troll Profiles in Twitter Social Network: Application to a Real Case of Cyberbullying,” Logic Journal of the IGPL, vol. 24, no. 1, pp. 42–53, Feb. 2016.
孙雪、韩蕾、李昆仑,〈基于类别特征选择与反馈学习随机森林算法的邮件过滤系统研究〉,计算机应用与软件,第32卷,第4期,頁67-71,2015。
P. Maitra, S. Ghosh, and D. Das, “Authorship Verification – An Approach based on Random Forest,” in Conference and Labs of the Evaluation Forum, 2015.
任函、冯文贺、刘茂福等,〈基于语言现象的文本蕴涵识别〉,中文信息学报,第31卷,第1期,頁184-191,2017。
孟雪井、孟祥兰、胡杨洋,〈基于文本挖掘和百度指数的投资者情绪指数研究〉,宏观经济研究,第1期,頁144-153,2016。
周强、俞士汶,〈汉语短语标注标记集的确定〉,中文信息学报,第10卷,第4期,頁1-11,1996。
丁声树,《现代汉语语法讲话》,北京:商务印书馆,頁180,1961。
呂叔湘、朱德熙,《語法研究和探索》,北京:北京大學出版社,頁85,1983。
劉月華、故韡、潘文娛,《實用現代漢語語法》,臺北市:師大書苑出版,頁124,1996。
李泉,《汉语语法考察与分析》,北京市:北京語言文化大學,頁71,2001。
张谊生,《现代汉语副词分析》,上海市:上海三聯書店,頁6,2010。
謝佳玲,〈漢語情態詞的語意界定:語料庫為本的研究〉,中國語文研究,第1期,頁45-63,2006。
张华伟、王明文、甘丽新,〈基于随机森林的文本分类模型研究〉,山东大学学报 (理学版),第41卷,第3期,頁139-143,2006。
Description Master's thesis
National Chengchi University
Department of Computer Science
104753018
Source http://thesis.lib.nccu.edu.tw/record/#G0104753018
Type thesis
dc.contributor.advisor 劉吉軒
dc.contributor.advisor Liu, Jyi Shane
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/112205
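dc.description.tableofcontents Chapter 1 Introduction
1.1 Research Background
1.2 Research Purpose and Motivation
1.3 Research Data
1.4 Thesis Structure
Chapter 2 Literature Review
2.1 Authorship Attribution Research
2.2 Introduction to Chinese Word Segmentation
2.3 Language Features
2.4 Vector Space Model
2.5 Random Forest Classification Algorithm
2.5.1 Introduction to Machine Learning
2.5.2 Introduction to Decision Trees and Random Forests
2.5.3 Related Research on Random Forests
Chapter 3 Research Method for Language Features
3.1 Data Preprocessing
3.1.1 Text Selection, Word Segmentation, and Part-of-Speech Tagging
3.1.2 Language Feature Selection and Text Vector Construction
3.2 Classification Model Selection and Construction
3.3 How Results Are Evaluated
Chapter 4 Research Results and Analysis
4.1 Classification Model Evaluation
4.1.1 Multi-class Models
4.1.2 Two-class Models
4.1.3 Overall Evaluation and Analysis
4.2 Finding Class-specific Unique Phrases
4.2.1 Finding Important Feature Phrases for Each Class
4.2.2 Finding Unique Phrases for Each Class
4.2.3 Analysis of Results
4.3 Integrated Prediction with Language Features
4.4 Summary
Chapter 5 Conclusion and Future Work
References
Appendix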
dc.format.extent 3108077 bytes
dc.format.mimetype application/pdf