維度縮減於文本風格之應用研究 (A Study of Data Reduction on Text Mining)

Title	維度縮減於文本風格之應用研究 (A Study of Data Reduction on Text Mining)
Author	林志軒 (Lin, Chih-Hsuan)
Contributors	Advisors: 余清祥 (Yue, Ching-Syang); 鄭文惠 (Cheng, Wen-Huei). Author: 林志軒 (Lin, Chih-Hsuan)
Keywords	文字探勘 (Text Mining); 寫作風格 (Writing Style); 資料縮減 (Data Reduction); 卡方檢定 (Chi-Square Test); 交叉驗證 (Cross-Validation)
Date	2020
Uploaded	2 September 2020 11:43:26 (UTC+8)
Abstract	Writing style is a common topic in text mining. Whether in personal writing, academic journals, or newspapers and magazines, most texts have a distinctive style, and the differences often show in word choice and layout. Quantitative studies of writing style usually rely on classification models that identify which author a text comes from. In practice, many variables (e.g., words or phrases) are plugged into the models, which makes computation slow, so some studies recommend data reduction methods such as principal component analysis to speed up the analysis. However, the variables obtained after data reduction are difficult to interpret in terms of concrete differences between texts. In this study, we propose two methods for selecting variables, using tools such as a ratio index (倍數指標) and the chi-square test, and compare them with linear and nonlinear data reduction methods, aiming to balance classification accuracy with substantive interpretation. The texts used in this study all belong to modern vernacular Chinese writing and come from newspapers and periodicals in Taiwan and China: headline news of Apple Daily, Liberty Times, and China Times (2012-2019), front-page news of People’s Daily (1971-1975 and 1989-1993), and Volumes 7 and 11 of New Youth Magazine (1919 and 1926). We first apply jieba to all articles for word segmentation, then select variables with the proposed methods and with linear and nonlinear dimension reduction methods, and finally plug the chosen variables into statistical and machine learning models. The models are compared by their F1 measures under cross-validation.
We found that the proposed chi-square selection method is more stable and gives higher classification accuracy, and that ensemble methods such as XGBoost generally perform best. As for the interpretation of the selected words, Apple Daily, Liberty Times, and China Times focused on issues related to social affairs, party politics, and cross-strait relations, respectively. People’s Daily emphasized topics related to revolution in the 1970s and to economic reform in the 1990s. New Youth Magazine focused on issues related to ideological reform in Volume 7 and to capitalism in Volume 11.
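The chi-square variable selection mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation used in the thesis, so the code below assumes one common variant: for two classes, each word is scored with the 2x2 chi-square statistic of (word present or absent) by (class), and the highest-scoring words are kept. The documents are assumed to be already segmented into tokens (for example by jieba); the toy tokens, labels, and the function names `chi_square_scores` and `select_words` are hypothetical and not taken from the thesis.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Score each word with a 2x2 chi-square statistic.

    docs   : list of token lists (assumed already segmented)
    labels : list of class labels, exactly two distinct values
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    first, second = classes
    n = len(docs)
    n_first = sum(1 for y in labels if y == first)
    n_second = n - n_first

    # Document frequency of each word within each class.
    df = {first: Counter(), second: Counter()}
    for tokens, y in zip(docs, labels):
        df[y].update(set(tokens))

    scores = {}
    for word in set(df[first]) | set(df[second]):
        a = df[first][word]    # first-class documents containing the word
        b = df[second][word]   # second-class documents containing the word
        c = n_first - a        # first-class documents without the word
        d = n_second - b       # second-class documents without the word
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # A word that appears in every document carries no information.
        scores[word] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_words(docs, labels, k):
    """Return the k highest-scoring words (ties broken alphabetically)."""
    scores = chi_square_scores(docs, labels)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Hypothetical toy corpus: two "politics" and two "society" documents.
docs = [
    ["election", "party", "vote"],
    ["party", "policy", "today"],
    ["crime", "police", "today"],
    ["crime", "court", "today"],
]
labels = ["politics", "politics", "society", "society"]

scores = chi_square_scores(docs, labels)
selected = select_words(docs, labels, 2)
print(selected)
```

Because the selected variables are the words themselves, they remain directly interpretable, which is the advantage the abstract claims over projections such as principal component analysis. The selected words would then be used as features for the classifiers and compared under cross-validation.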
Description	Master's thesis; National Chengchi University; Department of Statistics; 107354025
Source	http://thesis.lib.nccu.edu.tw/record/#G0107354025
Type	thesis

dc.contributor.advisor	余清祥; 鄭文惠	zh_TW
dc.contributor.advisor	Yue, Ching-Syang; Cheng, Wen-Huei	en_US
dc.contributor.author (Author)	林志軒	zh_TW
dc.contributor.author (Author)	Lin, Chih-Hsuan	en_US
dc.creator (Author)	林志軒	zh_TW
dc.creator (Author)	Lin, Chih-Hsuan	en_US
dc.date (Date)	2020	en_US
dc.date.accessioned	2 September 2020 11:43:26 (UTC+8)	-
dc.date.available	2 September 2020 11:43:26 (UTC+8)	-
dc.date.issued (Uploaded)	2 September 2020 11:43:26 (UTC+8)	-
dc.identifier (Other Identifier)	G0107354025	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/131479	-
dc.description (Description)	Master's degree	zh_TW
dc.description (Description)	National Chengchi University	zh_TW
dc.description (Description)	Department of Statistics	zh_TW
dc.description (Description)	107354025	zh_TW
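dc.description.tableofcontents	Chapter 1. Introduction: Section 1, Research Motivation; Section 2, Research Objectives. Chapter 2. Literature Review: Section 1, Review of Prior Work; Section 2, Data Description. Chapter 3. Research Methods: Section 1, Word Segmentation System; Section 2, Dimension Reduction Methods; Section 3, Exploratory Data Analysis; Section 4, Word Vectors and Document Vectors; Section 5, t-SNE; Section 6, Classification Models. Chapter 4. Text Analysis: Section 1, Two-Class Classification; Section 2, Three-Class Classification; Section 3, Clustering of Taiwanese Newspapers; Section 4, Similar Texts. Chapter 5. Conclusions and Suggestions: Section 1, Conclusions; Section 2, Suggestions for Future Research. References.	zh_TW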
dc.format.extent	6129917 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0107354025	en_US
dc.subject (Keyword)	文字探勘	zh_TW
dc.subject (Keyword)	寫作風格	zh_TW
dc.subject (Keyword)	資料縮減	zh_TW
dc.subject (Keyword)	卡方檢定	zh_TW
dc.subject (Keyword)	交叉驗證	zh_TW
dc.subject (Keyword)	Text Mining	en_US
dc.subject (Keyword)	Writing Style	en_US
dc.subject (Keyword)	Data Reduction	en_US
dc.subject (Keyword)	Chi-Square Test	en_US
dc.subject (Keyword)	Cross-Validation	en_US
dc.title (Title)	維度縮減於文本風格之應用研究	zh_TW
dc.title (Title)	A Study of Data Reduction on Text Mining	en_US
dc.type (Type)	thesis	en_US
dc.identifier.doi (DOI)	10.6814/NCCU202001336	en_US