以文字探勘技術分析台灣四大報文字風格 (A Case Study of Text Mining on Taiwan’s Newspapers)

葉昱廷; Ye, Yu-Ting

Please use this identifier to cite or link to this item: https://ah.lib.nccu.edu.tw/handle/140.119/125516

Title:	以文字探勘技術分析台灣四大報文字風格 (A Case Study of Text Mining on Taiwan’s Newspapers)
Author:	葉昱廷 (Ye, Yu-Ting)
Contributors:	余清祥 (Yue, Ching-Syang); 鄭文惠 (Cheng, Wen-Huei); 葉昱廷 (Ye, Yu-Ting)
Keywords:	Writing Style; Similarity Index; Taiwan’s Four Major Newspapers; Exploratory Data Analysis; Social Network Analysis
Date:	2019
Upload date:	5-Sep-2019
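Abstract (translated from the Chinese):	Much like an author's writing style, news coverage often differs noticeably from one newspaper to another even when the topic is the same, because of factors such as the angle taken and the choice and arrangement of words, and a reader can often tell which outlet an article came from. This study takes newspaper reporting as its subject and applies statistical text-mining methods, including similarity indices and multivariate analysis, to compare the writing styles of Taiwan's four major newspapers (Apple Daily, Liberty Times, United Daily News, and China Times). The comparison considers only word frequency and ignores word meaning. The data cover 2012 to 2018. To limit interference from differences in subject matter, the analysis is based on each newspaper's daily front-page story. Because of download restrictions, front-page titles are available for all four newspapers, but the comparison of article bodies covers only Apple Daily and Liberty Times.

Exploratory data analysis and the Jaccard and Yue indices are used to measure similarity and assess word usage in the front-page headlines. The analysis shows that the four newspapers do differ in the words used in their titles. For the front-page titles, similarity index values are first computed from each newspaper's word usage and then visualized as clusters with t-SNE and Generalized Association Plots (GAP). The Jaccard and Yue indices give clusterings from different perspectives: the former tends to group the newspapers from the same period together, while the latter divides the four newspapers into three groups. The analysis of front-page article bodies is based on word vectors. The words used by Liberty Times and Apple Daily map onto five to six topic areas, with Liberty Times leaning toward political issues and Apple Daily toward social news.

High-frequency words that appeared more than 50 times in Liberty Times and Apple Daily from 2012 to 2017, and whose frequency differed between the two newspapers by a factor of two or more, are used as classification variables to predict whether a 2018 headline article belongs to Liberty Times or Apple Daily. Machine learning models (for example, SVM) reach a prediction accuracy of 95.35%. The analysis also finds that Liberty Times favors vocabulary tied to political issues while Apple Daily favors vocabulary tied to social news, so statistical analysis can indeed distinguish the writing styles of the two newspapers' headline articles.

Abstract (author's English):	Like an author’s writing style, every newspaper has its own opinions and narrative methods, and it can often be identified just by reading its articles. In this study, our goal is to explore the news reporting styles of Taiwan’s four major newspapers (Apple Daily, Liberty Times, United Daily News, and China Times) and compare their differences. We chose the headline news for analysis in order to limit the influence of nuisance factors, such as differences in political positions and target audience. The newspaper headlines considered are from 2012 to 2018. Headline titles can be downloaded for all four newspapers, but headline contents are available only for Apple Daily and Liberty Times.

We first applied methods of Exploratory Data Analysis (EDA), together with the Jaccard and Yue indices, to the word frequencies and word types in order to evaluate the similarities among the four newspapers. We also considered multivariate tools, including t-SNE (t-distributed Stochastic Neighbor Embedding), GAP (Generalized Association Plots), Cluster Analysis, and Neural Networks. We plugged the similarity indices into these multivariate tools to visualize the differences among the newspapers and to classify observations into groups.

For both headline titles and headline contents, the results show that there are significant differences in word usage among the four newspapers. However, the grouping results based on the similarity indices differ considerably. For the headline titles, the Jaccard index grouped titles by time period, and the Yue index grouped titles by media outlet (three groups). For the headline contents, the words used in Apple Daily and Liberty Times can be classified into five or six topic classes, with Liberty Times emphasizing political terms and Apple Daily focusing on social affairs and crime. We also applied machine learning methods to distinguish headline articles of Apple Daily and Liberty Times, treating the data from 2012 to 2017 as the training set and those from 2018 as the testing set. A Support Vector Machine (SVM) achieved 95.35% prediction accuracy with 3,316 variables.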
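The two similarity indices named in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's own code: it assumes each newspaper is represented by a word-count table, uses the standard set-based Jaccard index on word types, and uses the Yue and Clayton (2005) proportion-based index, θ = Σ pᵢqᵢ / (Σ pᵢ² + Σ qᵢ² − Σ pᵢqᵢ). The function names and the toy word counts are hypothetical.

```python
from collections import Counter


def jaccard_index(counts_a, counts_b):
    """Share of word types that two sources have in common (ignores frequency)."""
    types_a = {w for w, c in counts_a.items() if c > 0}
    types_b = {w for w, c in counts_b.items() if c > 0}
    union = types_a | types_b
    if not union:
        return 0.0
    return len(types_a & types_b) / len(union)


def yue_index(counts_a, counts_b):
    """Yue-Clayton similarity based on relative word frequencies (proportions)."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        return 0.0
    # Convert raw counts to proportions p_i and q_i.
    p = {w: c / total_a for w, c in counts_a.items() if c > 0}
    q = {w: c / total_b for w, c in counts_b.items() if c > 0}
    # Only words present in both sources contribute to the cross term.
    cross = sum(p[w] * q[w] for w in p.keys() & q.keys())
    sq_p = sum(v * v for v in p.values())
    sq_q = sum(v * v for v in q.values())
    return cross / (sq_p + sq_q - cross)


if __name__ == "__main__":
    # Toy word counts standing in for two newspapers' headline vocabularies.
    paper_a = Counter({"government": 2, "election": 1, "legislator": 1})
    paper_b = Counter({"government": 1, "accident": 1, "police": 2})
    print("Jaccard:", jaccard_index(paper_a, paper_b))
    print("Yue:", yue_index(paper_a, paper_b))
```

Because the Jaccard index only looks at which word types occur while the Yue index weights words by how often they occur, the two can rank the same pair of sources differently, which is consistent with the different groupings the abstract reports.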
Description:	Master's thesis; National Chengchi University; Department of Statistics; 106354021
Source:	http://thesis.lib.nccu.edu.tw/record/#G0106354021
Type:	thesis
Appears in Collections:	Theses and Dissertations (學位論文)

Files in This Item:

File	Size	Format
402101.pdf	10.43 MB	Adobe PDF
