Title: 維度縮減於文本風格之應用研究 / A Study of Data Reduction on Text Mining
Author: 林志軒 (Lin, Chih-Hsuan)
Contributors: 余清祥 (Yue, Ching-Syang), advisor; 鄭文惠 (Cheng, Wen-Huei), advisor; 林志軒 (Lin, Chih-Hsuan), author
Keywords: Text Mining; Writing Style; Data Reduction; Chi-Square Test; Cross-Validation
Date: 2020
Date uploaded: 2-Sep-2020 11:43:26 (UTC+8)
Abstract (translated from the Chinese original): Writing style is a common topic in text analysis. Whether in personal writing, academic journals, or newspapers and magazines, most texts have a distinctive style of their own, and the differences can often be seen in word choice and arrangement. Quantitative studies of writing style usually rely on classification models to determine which author a text comes from; because such models typically take in a very large number of variables, computation becomes slow, and some studies therefore apply data reduction methods such as principal component analysis, which, however, usually makes it impossible to interpret textual differences concretely. Taking the classification of writing style as its goal, this thesis screens relevant variables with methods such as the chi-square test and compares them with linear and nonlinear data reduction methods, aiming for both classification accuracy and substantive interpretability.
All texts used in this thesis are written in modern vernacular Chinese and come from Taiwanese and Chinese periodicals: headline news of the Apple Daily, Liberty Times, and China Times from 2012 to 2019; front-page news of the People's Daily from 1971-1975 and 1989-1993; and Volumes 7 and 11 of New Youth (1919 and 1926). Each text is first segmented with jieba, and variables are then selected with methods such as a fold-change index and the chi-square test; these selections are compared with variables obtained from linear and nonlinear dimension reduction, fed into statistical learning and machine learning models, and evaluated by classification accuracy under cross-validation. The analysis shows that the proposed chi-square screening method is more stable and achieves higher classification accuracy, and that ensemble models such as XGBoost perform best. In addition, judging text style from the selected words, the Apple Daily, Liberty Times, and China Times lean respectively toward social issues, party politics, and cross-strait relations; the People's Daily leans toward revolutionary topics in the 1970s and toward economic reform in the 1990s; and Volumes 7 and 11 of New Youth lean respectively toward ideological reform and capitalism.
Abstract (English): Writing style is a popular research topic in text mining, and experts can often identify the author of an article from the use of certain words. In addition to choosing proper words, statistical and machine learning models are also important in the study of writing style. In practice, many variables (e.g., words or phrases) are usually plugged into the models, which is computationally expensive, so data reduction methods are recommended to speed up the analysis. However, it is difficult to give a reasonable interpretation of the variables after data reduction. In this study, we propose two methods for selecting variables, which take into account both the accuracy and the interpretability of classification models.
The texts used in this study are all modern Chinese writing, including the headlines of Apple Daily, Liberty Times, and China Times (2012-2019), articles of People's Daily (1971-1975 and 1989-1993), and Volumes 7 and 11 of New Youth Magazine (1919 and 1926). We first apply jieba to all articles for word segmentation, then perform variable selection (the proposed methods as well as linear and nonlinear dimension reduction), and finally plug the chosen variables into statistical and machine learning models. The models are compared by F1 measures obtained via cross-validation. We found that the proposed variable selection methods and the ensemble models generally give the best classification performance. As for the interpretation of the selected variables, Apple Daily, Liberty Times, and China Times focus on social affairs, party politics, and cross-strait relations, respectively; People's Daily emphasized revolution in the 1970s and economic reform in the 1990s; and New Youth Magazine focused on ideological reform in Volume 7 and capitalism in Volume 11.
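To make the workflow concrete, the sketch below illustrates the kind of pipeline the abstract describes: jieba segmentation, a term-count matrix, chi-square selection of discriminative terms, an XGBoost classifier, and cross-validated F1. It is a minimal sketch, not the thesis's actual code; the function name, the inputs `docs` and `labels`, and all parameter values are illustrative assumptions.

```python
import jieba
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier


def style_classification_score(docs, labels, k_terms=500, folds=5):
    """Mean cross-validated macro-F1 for classifying articles by source.

    docs   : list of raw article strings (assumed input)
    labels : list of source names, e.g. newspaper titles (assumed input)
    k_terms should not exceed the vocabulary size of the segmented corpus.
    """
    # 1. Word segmentation: join jieba tokens with spaces so a standard
    #    bag-of-words vectorizer can treat each token as one term.
    segmented = [" ".join(jieba.cut(doc)) for doc in docs]

    # 2-4. Term-count matrix, chi-square selection of the k terms most
    #      associated with the source label (refit inside each CV fold),
    #      and an XGBoost ensemble classifier.
    model = make_pipeline(
        CountVectorizer(token_pattern=r"(?u)\S+"),  # keep every segmented token
        SelectKBest(chi2, k=k_terms),
        XGBClassifier(n_estimators=200),
    )

    y = LabelEncoder().fit_transform(labels)  # XGBoost expects integer classes
    scores = cross_val_score(model, segmented, y, cv=folds, scoring="f1_macro")
    return scores.mean()
```

Called with the segmented corpora and their source labels, such a function mirrors the comparison reported above; replacing the SelectKBest step with, for example, TruncatedSVD would give a linear dimension-reduction baseline, at the cost of components that are harder to interpret than individual words.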
Description: Master's thesis, National Chengchi University, Department of Statistics, 107354025
Source: http://thesis.lib.nccu.edu.tw/record/#G0107354025
Type: thesis
Identifier: G0107354025
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/131479
Table of Contents:
Chapter 1: Introduction
 Section 1: Research Motivation
 Section 2: Research Objectives
Chapter 2: Literature Review
 Section 1: Literature Review
 Section 2: Data Description
Chapter 3: Research Methods
 Section 1: Word Segmentation System
 Section 2: Dimension Reduction Methods
 Section 3: Exploratory Data Analysis
 Section 4: Word Vectors and Document Vectors
 Section 5: t-SNE
 Section 6: Classification Models
Chapter 4: Text Analysis
 Section 1: Binary Classification
 Section 2: Three-Class Classification
 Section 3: Clustering of Taiwanese Newspapers
 Section 4: Similar Texts
Chapter 5: Conclusions and Suggestions
 Section 1: Conclusions
 Section 2: Suggestions for Future Research
References
Format: application/pdf, 6129917 bytes
DOI: 10.6814/NCCU202001336