Please use this identifier to cite or link to this item: https://ah.lib.nccu.edu.tw/handle/140.119/125516
DC Field Value Language
dc.contributor.advisor余清祥; 鄭文惠zh_TW
dc.contributor.advisorYue, Ching-Syang; Cheng, Wen-Hueien_US
dc.contributor.author葉昱廷zh_TW
dc.contributor.authorYe, Yu-Tingen_US
dc.creator葉昱廷zh_TW
dc.creatorYe, Yu-Tingen_US
dc.date2019en_US
dc.date.accessioned2019-09-05T07:41:55Z-
dc.date.available2019-09-05T07:41:55Z-
dc.date.issued2019-09-05T07:41:55Z-
dc.identifierG0106354021en_US
dc.identifier.urihttp://nccur.lib.nccu.edu.tw/handle/140.119/125516-
dc.descriptionMaster's degreezh_TW
dc.descriptionNational Chengchi Universityzh_TW
dc.descriptionDepartment of Statisticszh_TW
dc.description106354021zh_TW
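dc.description.abstractMuch like an author's writing style, news reports from different newspapers often differ markedly even on the same topic, owing to factors such as angle of approach and word choice, and one can often tell which outlet an article comes from just by reading it. This study likewise examines newspaper reporting. Using text-mining statistical methods such as similarity indices and multivariate analysis, and considering only word frequencies rather than word meaning, it compares the writing styles of Taiwan's four major newspapers (Apple Daily, Liberty Times, United Daily News and China Times) over the period 2012 to 2018. To avoid interference from differences in subject matter, the analysis is based on each newspaper's daily front-page reports. Because of download restrictions, front-page headlines cover all four newspapers, but the body-text comparison covers only Apple Daily and Liberty Times.

Exploratory data analysis and the Jaccard and Yue indices are used to measure similarity and assess word-usage style across the four newspapers' front-page headlines; the analysis shows that the four newspapers do differ in headline wording. For front-page headlines, similarity index values for the four newspapers' word usage are computed first and then clustered and visualized with t-SNE and Generalized Association Plots (GAP). The Jaccard and Yue indices yield clusterings from different perspectives: the former tends to group newspapers from the same period together, whereas the latter divides the four newspapers into three groups. The analysis of front-page body text is based on word vectors. Word usage in Liberty Times and Apple Daily maps to five or six topic areas: Liberty Times leans toward political issues, and Apple Daily toward social news.

High-frequency words that appear more than 50 times in Liberty Times and Apple Daily from 2012 to 2017 and whose counts differ by a factor of two or more are used as classification variables to predict whether a 2018 headline article belongs to Liberty Times or Apple Daily; machine learning models (e.g., SVM) reach a prediction accuracy of 95.35%. The analysis also finds that Liberty Times favors political vocabulary while Apple Daily favors the vocabulary of social news, so statistical analysis can indeed distinguish the writing styles of the two newspapers' headline articles.zh_TW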
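The word-selection rule stated in the abstract above (words used more than 50 times whose counts differ by a factor of two or more between the two newspapers) can be illustrated with a minimal sketch. The record does not say how the thesis implemented it, so several points here are assumptions: the function name `select_discriminative_words` is hypothetical, the threshold of 50 is applied to the larger of the two raw counts, and counts are not normalized by corpus size.

```python
from collections import Counter


def select_discriminative_words(tokens_a, tokens_b, min_count=50, min_ratio=2.0):
    """Return words that are frequent in one corpus and at least
    `min_ratio` times more frequent there than in the other corpus.

    Assumption: thresholds are applied to raw counts; the thesis may
    have used normalized frequencies instead.
    """
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    selected = []
    for word in sorted(set(count_a) | set(count_b)):
        a, b = count_a[word], count_b[word]
        high, low = max(a, b), min(a, b)
        # A word absent from one corpus (low == 0) passes the ratio test
        # as long as it is frequent enough in the other corpus.
        if high > min_count and high >= min_ratio * low:
            selected.append(word)
    return selected


# Toy token lists standing in for segmented training text from two papers.
paper_a = ["government"] * 60 + ["police"] * 10 + ["today"] * 55
paper_b = ["government"] * 20 + ["police"] * 70 + ["today"] * 50
print(select_discriminative_words(paper_a, paper_b))
# → ['government', 'police']
```

The selected words would then serve as the columns of a word-count matrix passed to a classifier; that step is omitted because the record gives no details of the model configuration.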
dc.description.abstractLike an author’s writing style, every newspaper has its own viewpoint and narrative approach, and its articles can often be identified simply by reading them. This study explores the news reporting styles of Taiwan’s four major newspapers (Apple Daily, Liberty Times, United Daily News and China Times) and compares their differences. We analyze headline news in order to limit the influence of nuisance factors, such as differences in political positions and target audience. The headlines considered were published between 2012 and 2018. Headline titles could be downloaded for all four newspapers, but headline content was available only for Apple Daily and Liberty Times.

We first applied Exploratory Data Analysis (EDA) methods, including the Jaccard and Yue indices, to word frequencies and word types to evaluate the similarities among the four newspapers. In addition, we considered multivariate tools, including t-SNE (t-distributed Stochastic Neighbor Embedding), GAP (Generalized Association Plots), cluster analysis, and neural networks. We fed the similarity indices into these tools to visualize the differences among newspapers and to classify observations into groups.

For both headline titles and contents, the results show significant differences in word usage among the newspapers. However, the grouping results for titles and contents based on similarity indices are quite different. For headline titles, the Jaccard index grouped titles by time, whereas the Yue index grouped titles by newspaper (i.e., three groups). For headline contents, the words used in Apple Daily and Liberty Times can be classified into five or six topic classes, with Liberty Times emphasizing political terms and Apple Daily focusing on social affairs and crime.

We also applied machine learning methods to distinguish headline articles of Apple Daily and Liberty Times via cross-validation, treating the data from 2012-2017 as the training set and the data from 2018 as the testing set. A Support Vector Machine (SVM) achieved 95.35% prediction accuracy with 3,316 variables.en_US
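The two similarity indices named in the abstract can be sketched as follows. This is an illustration, not the thesis's code: the function names are hypothetical, the Jaccard index is computed on word types (shared vocabulary over combined vocabulary), and the Yue index is assumed to take the commonly cited proportion-based form, sum(p_i * q_i) / (sum(p_i^2) + sum(q_i^2) - sum(p_i * q_i)), where p_i and q_i are the relative frequencies of word i in the two texts. The record does not state which variant the thesis used.

```python
from collections import Counter


def jaccard_index(words_a, words_b):
    """Share of word types common to both texts (ignores frequencies)."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def yue_index(words_a, words_b):
    """Proportion-based similarity (assumed Yue-Clayton form).

    Unlike Jaccard, it weights each word by its relative frequency,
    so common words matter more than rare ones.
    """
    count_a, count_b = Counter(words_a), Counter(words_b)
    total_a, total_b = sum(count_a.values()), sum(count_b.values())
    if total_a == 0 or total_b == 0:
        return 0.0
    vocab = set(count_a) | set(count_b)
    p = {w: count_a[w] / total_a for w in vocab}
    q = {w: count_b[w] / total_b for w in vocab}
    cross = sum(p[w] * q[w] for w in vocab)
    denom = sum(v * v for v in p.values()) + sum(v * v for v in q.values()) - cross
    return cross / denom


# Toy headline tokens for two newspapers.
title_a = ["tax", "vote", "vote", "court"]
title_b = ["vote", "court", "court", "fire"]
print(jaccard_index(title_a, title_b))  # → 0.5
print(yue_index(title_a, title_b))      # → 0.5
```

Because Jaccard looks only at which words occur while the Yue form also looks at how often they occur, the two can rank the same pairs of texts differently, which is consistent with the abstract's observation that they produce different groupings.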
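dc.description.tableofcontentsChapter 1 Introduction
Section 1 Research Motivation
Section 2 Research Objectives
Chapter 2 Literature Review
Section 1 Review of the Literature
Section 2 Data Description
Chapter 3 Research Methods
Section 1 Similarity Analysis
Section 2 Generalized Association Plots (GAP)
Section 3 t-SNE
Section 4 Word Vectors and Document Vectors
Section 5 Social Network Analysis
Section 6 Support Vector Machine
Section 7 Decision Tree
Section 8 Random Forest
Chapter 4 Analysis of Front-Page Headlines
Section 1 Methods for Headlines
Section 2 Exploratory Data Analysis
Section 3 Time Series Models
Section 4 Clustering of Front-Page Headlines
Chapter 5 Analysis of Headline Article Text
Section 1 Methods for Article Text
Section 2 Exploratory Data Analysis
Section 3 Quantification of Article Text
Section 4 Clustering of Article Text
Section 5 Word Network Structure of Article Text
Section 6 Classification Results for Article Text
Chapter 6 Conclusions and Suggestions
Section 1 Conclusions
Section 2 Suggestions
Referenceszh_TW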
dc.format.extent10675770 bytes-
dc.format.mimetypeapplication/pdf-
dc.source.urihttp://thesis.lib.nccu.edu.tw/record/#G0106354021en_US
dc.subjectWriting Stylezh_TW
dc.subjectSimilarity Indexzh_TW
dc.subjectTaiwan's Four Major Newspaperszh_TW
dc.subjectExploratory Data Analysiszh_TW
dc.subjectSocial Network Analysiszh_TW
dc.subjectWriting Styleen_US
dc.subjectSimilarity Indexen_US
dc.subjectTaiwan’s Newspaperen_US
dc.subjectExploratory Data Analysisen_US
dc.subjectSocial Networken_US
dc.titleAnalyzing the Writing Styles of Taiwan's Four Major Newspapers with Text Mining Techniqueszh_TW
dc.titleA Case Study of Text Mining on Taiwan’s Newspapersen_US
dc.typethesisen_US
dc.identifier.doi10.6814/NCCU201900992en_US
item.fulltextWith Fulltext-
item.openairetypethesis-
item.cerifentitytypePublications-
item.openairecristypehttp://purl.org/coar/resource_type/c_46ec-
item.grantfulltextopen-
Appears in Collections: Theses and Dissertations
Files in This Item:
File          Size        Format
402101.pdf    10.43 MB    Adobe PDF
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.