Please use this identifier to cite or link to this item: https://ah.nccu.edu.tw/handle/140.119/136830


Title: 運用文字探勘分析人民日報的風格變遷
A Study of Writing Style of The People’s Daily
Authors: 陳庭偉
Chen, Ting-Wei
Contributors: 陳麗霞
余清祥

陳庭偉
Chen, Ting-Wei
Keywords: Writing Style
Style Change
Cluster Analysis
Keyword
Variable Selection
Date: 2021
Issue Date: 2021-09-02 15:38:09 (UTC+8)
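Abstract: The growth of big data has driven the digitization of all kinds of data, and text mining is a leading example, with applications across many fields; writing style is one of its common topics. However, the style of an article is easily influenced by its subject, and even for the same author or text, word usage may differ with the time and setting. Take the People's Daily (人民日報), the official newspaper of the Chinese Communist Party: its content and subject matter not only reflect the character of each era but also take the official position and purpose into account, so changes in the paper's character can reflect the political and social changes in China since the founding of the People's Republic. This thesis therefore takes stylistic change in the People's Daily as its research target. By comparing differences in wording across years, it uses statistical methods and clustering to delimit distinct periods. It also applies several keyword-detection indices to select representative words for each period as explanatory variables for classification, aiming to balance accuracy, computation speed, and interpretability.
The research material is the front-page reports of the People's Daily from 1949 to 2019, because front-page content mostly concerns major national and international affairs, which avoids heterogeneity in wording caused by local matters. The thesis begins with exploratory data analysis, including characters, words, and the Jaccard and Yue similarity indices computed on characters and words, to uncover the basic textual properties of the People's Daily. It then applies cluster analysis to divide modern China into several periods and compares the result with the periodization proposed by experts. The study finds that two-character words reveal the differences between periods more clearly. When clustering on two-character words or on the similarity indices, the People's Daily can be divided into four periods (which might be named "Founding of the State", "Cultural Revolution", "Reform and Opening Up", and "Modernization"). The results of different clustering methods are highly consistent, and wording style differs markedly between periods. In addition, for selecting explanatory variables for classification, the representative-word detection index proposed in this thesis performs best: in accuracy, computation speed, and interpretability alike, it outperforms the chi-square index and dimensionality reduction methods.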
Big data has driven the quantitative analysis of many kinds of data, and text mining is a prominent example. Identifying an author's writing style is a popular topic in text mining. However, writing style can be affected by factors such as the theme and language of an article. Take the People's Daily, the official newspaper of the Central Committee of the Chinese Communist Party, as an example. The Chinese Communist Party attaches great importance to the People's Daily and has given strong guidance to its work in all periods of revolution, construction, and reform. In other words, text analysis of the People's Daily may reveal changes in the political and social environment under the Chinese Communist Party, and we want to know whether different periods of China (1949~2019) can be distinguished through text analysis of its articles.
We first conduct exploratory data analysis, covering characters, words, and the Jaccard and Yue similarity indices. We then use cluster analysis to divide modern China into several periods and compare the result with the periodization proposed by experts. We find that differences between periods are seen more clearly through two-character words. When clustering on two-character words or on the similarity indices, the People's Daily can be divided into four periods. In addition, we use several keyword-detection indices to select representative words for each period and use them as explanatory variables for classification. In terms of accuracy, computation speed, and interpretability, this approach outperforms chi-square indices and dimensionality reduction methods.
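The similarity comparison described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes that "Yue's index" refers to the Yue-Clayton similarity measure, and it uses overlapping character bigrams as a stand-in for segmented two-character words. The function names and the toy yearly texts are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Count overlapping two-character sequences, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def jaccard(x, y):
    """Jaccard index on the sets of distinct terms: |A & B| / |A | B|."""
    a, b = set(x), set(y)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def yue_clayton(x, y):
    """Yue-Clayton similarity on relative frequencies p and q (assumed
    reading of "Yue's index"):
    sum(p*q) / (sum(p^2) + sum(q^2) - sum(p*q))."""
    nx, ny = sum(x.values()), sum(y.values())
    if nx == 0 or ny == 0:
        return 0.0
    p = {k: v / nx for k, v in x.items()}
    q = {k: v / ny for k, v in y.items()}
    cross = sum(p[k] * q[k] for k in p.keys() & q.keys())
    return cross / (sum(v * v for v in p.values())
                    + sum(v * v for v in q.values()) - cross)


# Hypothetical toy "yearly corpora"; real input would be a year's front pages.
corpus = {
    1950: "建設新中國 建設新社會",
    1951: "建設新中國 支援前線",
    1980: "改革開放 發展經濟",
}
counts = {year: bigrams(text) for year, text in corpus.items()}
years = sorted(counts)

# Compare each year with the next; low similarity hints at a style break.
for prev, curr in zip(years, years[1:]):
    print(prev, curr,
          round(jaccard(counts[prev], counts[curr]), 3),
          round(yue_clayton(counts[prev], counts[curr]), 3))
```

Jaccard looks only at which terms occur, while the Yue-Clayton form also weights how often they occur. A full year-by-year similarity matrix built this way could then be passed to a clustering routine to propose period boundaries.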
Description: Master's thesis
National Chengchi University
Department of Statistics
108354007
Source URI: http://thesis.lib.nccu.edu.tw/record/#G0108354007
Data Type: thesis
Appears in Collections: [Department of Statistics] Theses

Files in This Item:

File: 400701.pdf
Size: 4793Kb
Format: Adobe PDF


All items in 學術集成 are protected by copyright, with all rights reserved.