以文字探勘技術分析台灣四大報文字風格 (A Case Study of Text Mining on Taiwan’s Newspapers)

葉昱廷; Ye, Yu-Ting

Please use this identifier to cite or link to this item: https://ah.lib.nccu.edu.tw/handle/140.119/125516

DC Field	Value	Language
dc.contributor.advisor	余清祥; 鄭文惠	zh_TW
dc.contributor.advisor	Yue, Ching-Syang; Cheng, Wen-Huei	en_US
dc.contributor.author	葉昱廷	zh_TW
dc.contributor.author	Ye, Yu-Ting	en_US
dc.creator	葉昱廷	zh_TW
dc.creator	Ye, Yu-Ting	en_US
dc.date	2019	en_US
dc.date.accessioned	2019-09-05T07:41:55Z	-
dc.date.available	2019-09-05T07:41:55Z	-
dc.date.issued	2019-09-05T07:41:55Z	-
dc.identifier	G0106354021	en_US
dc.identifier.uri	http://nccur.lib.nccu.edu.tw/handle/140.119/125516	-
dc.description	Master's thesis	zh_TW
dc.description	National Chengchi University	zh_TW
dc.description	Department of Statistics	zh_TW
dc.description	106354021	zh_TW
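dc.description.abstract	Like an author's writing style, newspapers often differ markedly in how they report the same topic, because of differences in angle, word choice, and narrative structure, and a reader can often tell which outlet an article came from. This study examines newspaper reporting using text-mining statistical methods such as similarity indices and multivariate analysis. Ignoring word meaning and considering only word frequency, it compares the writing styles of Taiwan's four major newspapers (Apple Daily, Liberty Times, United Daily News, and China Times) over the period 2012 to 2018. To limit interference from differences in subject matter, the analysis uses each paper's daily front-page report. Because of restrictions on data download, the headline titles cover all four newspapers, but the body-text comparison covers only Apple Daily and Liberty Times.\nExploratory data analysis and the Jaccard and Yue indices are used to measure similarity and to assess word-usage style across the four papers' front-page headlines; the analysis shows that the four newspapers do differ in the words used in their titles. For headline titles, similarity indices of word usage are first computed for the four papers and then visualized as clusters with t-SNE and Generalized Association Plots (GAP). The Jaccard and Yue indices give clustering results from different angles: the former tends to group papers from the same period together, while the latter separates the four papers into three groups. The body-text analysis is based on word vectors; the words used by Liberty Times and Apple Daily map to five or six topic areas, with Liberty Times leaning toward political issues and Apple Daily toward social news.\nHigh-frequency words that Liberty Times and Apple Daily used more than 50 times from 2012 to 2017, and whose usage differed by a factor of two or more between the two papers, were taken as classification variables to predict whether a 2018 headline article came from Liberty Times or Apple Daily. Machine learning models (e.g., SVM) reached a prediction accuracy of 95.35%. The analysis also found that Liberty Times favors political vocabulary while Apple Daily favors social-news vocabulary, so statistical analysis can indeed distinguish the writing styles of the two newspapers' headline articles.	zh_TW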
dc.description.abstract	Like an author’s writing style, every newspaper has its own viewpoint and narrative approach, and its articles can often be identified simply by reading them. In this study, our goal is to explore the news reporting styles of Taiwan’s four major newspapers (Apple Daily, Liberty Times, United Daily News and China Times) and compare their differences. We analyze front-page headline news in order to reduce the influence of nuisance factors, such as differences in political positions and target audience. The headlines considered were published between 2012 and 2018. Headline titles could be downloaded for all four newspapers, but headline body text was available only for Apple Daily and Liberty Times.\nWe first applied exploratory data analysis (EDA) and similarity indices, namely the Jaccard and Yue indices, to word frequencies and word types to evaluate the similarities among the four newspapers. We also considered multivariate tools, including t-SNE (t-distributed Stochastic Neighbor Embedding), GAP (Generalized Association Plots), cluster analysis, and neural networks. We fed the similarity indices into these multivariate tools to visualize the differences among the newspapers and to classify observations into groups.\nFor both headline titles and headline body text, the results show significant differences in word usage among the newspapers. However, the grouping results based on the similarity indices are quite different. For headline titles, the Jaccard index grouped titles by time period, whereas the Yue index grouped titles by newspaper (into 3 groups). For headline body text, the words used in Apple Daily and Liberty Times can be classified into five or six topic classes, with Liberty Times emphasizing political terms and Apple Daily focusing on social affairs and crime.\nWe also applied machine learning methods, with cross-validation, to distinguish headline articles of Apple Daily and Liberty Times, treating the 2012-2017 data as the training set and the 2018 data as the testing set. A Support Vector Machine (SVM) achieved 95.35% prediction accuracy with 3,316 variables.	en_US
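dc.description.tableofcontents	Chapter 1 Introduction 1\nSection 1 Research Motivation 1\nSection 2 Research Objectives 3\nChapter 2 Literature Review 5\nSection 1 Review of the Literature 5\nSection 2 Data Description 6\nChapter 3 Research Methods 8\nSection 1 Similarity Analysis 8\nSection 2 Generalized Association Plots (GAP) 10\nSection 3 t-SNE 12\nSection 4 Word Vectors and Document Vectors 13\nSection 5 Social Network Analysis 16\nSection 6 Support Vector Machine 16\nSection 7 Decision Tree 18\nSection 8 Random Forest 19\nChapter 4 Analysis of Front-Page Headline Titles 20\nSection 1 Research Methods for Titles 21\nSection 2 Exploratory Data Analysis 22\nSection 3 Time Series Models 27\nSection 4 Clustering of Front-Page Titles 29\nChapter 5 Analysis of Headline Body Text 31\nSection 1 Research Methods for Body Text 32\nSection 2 Exploratory Data Analysis 34\nSection 3 Quantification of Body Text 43\nSection 4 Clustering of Body Text 46\nSection 5 Word Network Structure of Body Text 50\nSection 6 Classification Results for Body Text 54\nChapter 6 Conclusions and Suggestions 57\nSection 1 Conclusions 57\nSection 2 Suggestions 58\nReferences 60	zh_TW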
dc.format.extent	10675770 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri	http://thesis.lib.nccu.edu.tw/record/#G0106354021	en_US
dc.subject	寫作風格	zh_TW
dc.subject	相似指標	zh_TW
dc.subject	台灣四大報	zh_TW
dc.subject	探索性資料分析	zh_TW
dc.subject	社會網路分析	zh_TW
dc.subject	Writing Style	en_US
dc.subject	Similarity Index	en_US
dc.subject	Taiwan’s Newspaper	en_US
dc.subject	Exploratory Data Analysis	en_US
dc.subject	Social Network	en_US
dc.title	以文字探勘技術分析台灣四大報文字風格	zh_TW
dc.title	A Case Study of Text Mining on Taiwan’s Newspapers	en_US
dc.type	thesis	en_US
dc.relation.reference	I. Chinese-language references\n1. 張筱涵(2009)「2008年北京奧運期間兩岸報紙呈現中國國家形象之研究—以自由時報、《人民日報》為例」。輔仁大學大眾傳播學研究所。\n2. 楊佳寧(2011)「解讀報紙中的「大陸遊客」—以《自由時報》、《聯合報》為例」。政治大學新聞研究所。\n3. 楊堯為(2014)「平面媒體對太陽花事件報導之內容分析－以《聯合報》、《中國時報》、《自由時報》、《蘋果日報》為例」。政治大學國家發展研究所。\n4. 鄧孟涵(2004)「中共領導人之媒體形象研究(2001-2004)：以中國時報與《人民日報》為例」。淡江大學中國大陸研究所。\n5. 蔡貴如(2008)「語言與政治立場：臺灣電視新聞之分析」。臺灣師範大學英語學系。\n6. 蔡佳青(2006)「八面玲瓏：台灣蘋果日報政治立場之初探」。臺北大學社會學系。\n\nII. English-language references\n1. Boyce, G., Curran, J. and Wingate, P. (Eds.) (1978). Newspaper History from the 17th Century to the Present Day, Acton Society, Press group.\n2. Cryer, J.D. and Chan, K. (2008). Time Series Analysis with Applications in R, Springer-Verlag New York.\n3. Chen, C.H. (2002). “Generalized Association Plots for Information Visualization: The Applications of the Convergence of Iteratively Formed Correlation Matrices,” Statistica Sinica 12: 1-23.\n4. Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge, Cambridge University Press.\n5. Faust, K. (1994). Social Network Analysis in the Social and Behavioral Sciences, in Social Network Analysis: Methods and Applications, Cambridge University Press.\n6. Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Biometrics, Second Edition. Springer-Verlag New York.\n7. Huang, T.-M., Kecman, V. and Kopriva, I. (2006). Kernel Based Algorithms for Mining Huge Data Sets: Supervised, Semi-supervised, and Unsupervised Learning (Studies in Computational Intelligence), Springer-Verlag.\n8. Ho, T.K. (1995). “Random Decision Forests.” Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, 14-16 August 1995, 278-282.\n9. Lebret, R. and Collobert, R. (2013). “Word Embeddings through Hellinger PCA.” The Association for Computational Linguistics, EACL, pp. 482-490.\n10. Levy, O. and Goldberg, Y. (2014). “Neural Word Embedding as Implicit Matrix Factorization.” Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. Montreal, Canada, MIT Press: 2177-2185.\n11. Li, Y., Xu, L., Tian, F., Jiang, L., Zhong, X. and Chen, E. (2015). “Word Embedding Revisited: A New Representation Learning and Explicit Matrix Factorization Perspective.” Proceedings of the 24th International Conference on Artificial Intelligence. Buenos Aires, Argentina, AAAI Press: 3650-3656.\n12. Liaw, A. and Wiener, M. (2001). “Classification and Regression by RandomForest.” R NEWS 2 (3): 18-22.\n13. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. and Dean, J. (2013). “Distributed Representations of Words and Phrases and their Compositionality.” Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2. Lake Tahoe, Nevada, Curran Associates Inc.: 3111-3119.\n14. Real, R. and Vargas, J. M. (1996). “The Probabilistic Basis of Jaccard’s Index of Similarity,” Systematic Biology, 45(3): 380-385.\n15. Rokach, L. and Maimon, O. (2008). Data Mining with Decision Trees: Theory and Applications, World Scientific Publishing Co., Inc.\n16. Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge, Cambridge University Press.\n17. Simpson, E.H. (1949). “Measurement of Diversity,” Nature 163: 688.\n18. Singhal, A. (2001). “Modern Information Retrieval: A Brief Overview,” Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 24(4): 35-43.\n19. Ho, T.K. (1998). “The Random Subspace Method for Constructing Decision Forests.” IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8): 832-844.\n20. Wu, H.M. and Chen, C.H. (2005). “GAP: Generalized Association Plots for Dimension Free Data Visualization,” the Workshops of 5th Asian Conference on Statistical Computing (IASC-ARS 2005), Hong Kong, Dec. 15-17.\n21. Wu, H.M. and Chen, C.H. (2004). “Matrix Visualization with Nonlinear Association,” 中國統計學社93年社員大會暨統計研討會, November 2004. Chiayi, Taiwan.\n22. Wasserman, S. and Faust, K. (1994). Social Network Analysis in the Social and Behavioral Sciences. In Social Network Analysis: Methods and Applications (Structural Analysis in the Social Sciences, pp. 3-27). Cambridge: Cambridge University Press.\n23. Yue, J.C. and Clayton, M.K. (2005). “A Similarity Measure based on Species Proportions,” Communications in Statistics-Theory and Methods 34(11): 2123-2131.	zh_TW
dc.identifier.doi	10.6814/NCCU201900992	en_US
item.fulltext	With Fulltext	-
item.openairetype	thesis	-
item.cerifentitytype	Publications	-
item.openairecristype	http://purl.org/coar/resource_type/c_46ec	-
item.grantfulltext	open	-
Appears in Collections:	Theses (學位論文)
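
Illustrative note on the similarity indices named in the abstract. The abstract compares word usage with the Jaccard and Yue indices but does not give formulas. The sketch below is one plausible reading, not the thesis's own code: it assumes the Jaccard index is computed on sets of word types and the Yue index follows the species-proportion form of Yue and Clayton (2005), with words playing the role of species. The function names and the toy tokens are hypothetical.

```python
from collections import Counter


def jaccard_index(counts_a, counts_b):
    """Share of word types that occur in both texts (frequencies ignored)."""
    types_a, types_b = set(counts_a), set(counts_b)
    union = types_a | types_b
    return len(types_a & types_b) / len(union) if union else 0.0


def yue_index(counts_a, counts_b):
    """Yue-Clayton similarity on word proportions p_i and q_i:
    sum(p_i * q_i) / (sum(p_i^2) + sum(q_i^2) - sum(p_i * q_i)).
    """
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        return 0.0  # assumption: an empty text is treated as fully dissimilar
    vocab = set(counts_a) | set(counts_b)
    p = {w: counts_a.get(w, 0) / total_a for w in vocab}
    q = {w: counts_b.get(w, 0) / total_b for w in vocab}
    cross = sum(p[w] * q[w] for w in vocab)
    sum_p2 = sum(v * v for v in p.values())
    sum_q2 = sum(v * v for v in q.values())
    return cross / (sum_p2 + sum_q2 - cross)


# Toy example with made-up, already-tokenized titles from two papers.
paper_a = Counter(["election", "election", "budget"])
paper_b = Counter(["election", "police"])
print(jaccard_index(paper_a, paper_b), yue_index(paper_a, paper_b))
```

Because the Jaccard index looks only at which words appear while the Yue index weights them by relative frequency, the two can rank the same pair of texts differently, which is consistent with the abstract's remark that they yield different groupings.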

Files in This Item:

File	Size	Format
402101.pdf	10.43 MB	Adobe PDF
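
Illustrative note on the classification variables described in the abstract. The abstract states that words used more than 50 times in 2012-2017 and differing by a factor of two or more between Liberty Times and Apple Daily served as classification variables. The sketch below shows one way such a filter could be written. It assumes, since the abstract does not say, that the count threshold applies to the paper that uses the word more and that a word absent from one paper counts as a large enough difference. The function name, parameters, and counts are hypothetical.

```python
def select_marker_words(counts_a, counts_b, min_count=50, min_ratio=2.0):
    """Return words whose larger per-paper count exceeds min_count and is
    at least min_ratio times the smaller count, in alphabetical order."""
    selected = []
    for word in sorted(set(counts_a) | set(counts_b)):
        a = counts_a.get(word, 0)
        b = counts_b.get(word, 0)
        larger, smaller = max(a, b), min(a, b)
        # Assumption: a word missing from one paper (smaller == 0) qualifies.
        if larger > min_count and larger >= min_ratio * smaller:
            selected.append(word)
    return selected


# Made-up training-period counts for two papers.
liberty = {"election": 120, "police": 30, "today": 200, "legislator": 51}
apple = {"election": 40, "police": 90, "today": 180}
markers = select_marker_words(liberty, apple)
print(markers)
```

The per-article counts of the selected words would then form the feature vector passed to a classifier such as an SVM; words like "today" in the toy data, used about equally by both papers, are dropped because they carry little information about the source.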
