Title 從文字探勘比較臺灣與中國之寫作風格—以《聯合報》與《人民日報》為例
Comparing the Writing Style of Newspaper between Taiwan and China
Author 吳蒨芸
Wu, Qian Yun
Contributors 陳麗霞
余清祥
吳蒨芸
Wu, Qian Yun
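Keywords Text analysis (文字分析)
Style change (風格變遷)
Exploratory data analysis (探索性資料分析)
Association index (關聯指標)
Cross-strait differences (兩岸差異)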
Text mining
Style change
Exploratory data analysis
Related index
Areal differentiation
Date 2022
Upload time 2-Sep-2022 14:45:06 (UTC+8)
Abstract Taiwan and China are both ethnic Chinese societies, and Chinese is the common language on both sides of the Taiwan Strait. Even so, written and spoken usage differ considerably, and the differences seem to have grown over time; recent shifts in the international situation make the separate governance of the two sides all the more apparent. This study uses Taiwan's United Daily News (《聯合報》) and China's People's Daily (《人民日報》) as research material to explore how writing styles on the two sides have changed and diverged since the first-generation leaders Chiang Kai-shek and Mao Zedong, and uses text analysis to examine changes in word usage and what they signify. In addition to comparing the diversity and unevenness of words through exploratory data analysis (EDA), the study proposes a method for measuring the association between two-character words and between multi-character words. These associations are used to extract concepts representative of Taiwanese and Chinese thought in recent years and to compare the major changes and differences between them. The data are front-page reports in People's Daily from 1946 to 2020 and editorials in United Daily News from 1960 to 2020; these articles mostly concern international relations and national-level affairs and seldom address local matters or social news.
The analysis finds a clear difference in word diversity between People's Daily and United Daily News. Word diversity in People's Daily was lowest in the 1960s, during the Cultural Revolution, then rose year by year until the late 1980s and declined afterward; word diversity in United Daily News declined until 1990 and rose sharply afterward. Using the 500 most frequent two-character words as explanatory variables in a clustering model divides the texts from Taiwan and China into four eras. In the proposed two-character word association analysis, pairing a word with "the next two-character word in the same sentence" yields the better word pairs, and the results agree well with historical developments. For example, with "China" (中國) as the antecedent in People's Daily, the most strongly associated words in the four eras are "people, ambassador, characteristics, characteristics" (人民、大使、特色、特色). This matches the nationalist notion of "China→people" emphasized in the early years after the Chinese Communists founded the state, the later appearance of "China→ambassador" as China sought to enter the international arena (e.g., the United Nations), and, after economic opening, the emphasis on "China→characteristics" as "socialism with Chinese characteristics" was promoted to the world. Such word-to-word relationships can be used to describe a concept or an issue, and future work with humanities scholars could use word-cluster associations to identify the characteristics and main ideas of articles.
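The abstract describes scoring the association between an antecedent two-character word and the next two-character word in the same sentence, and the table of contents mentions chi-square association indices, but this record does not give the formula. As a rough illustration only, the sketch below applies the standard Pearson chi-square statistic for a 2x2 contingency table to adjacent word pairs. The function names, the toy sentences, and the assumption that sentences are already segmented into words are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def next_word_pairs(sentences):
    """Yield (word, following word) for adjacent words within each sentence."""
    for words in sentences:
        for first, second in zip(words, words[1:]):
            yield first, second


def chi_square_association(sentences, antecedent):
    """Score every word that directly follows `antecedent` in some sentence.

    Assumption: the score is the ordinary 2x2 Pearson chi-square statistic,
    which may differ from the indices actually used in the thesis.
    """
    pair_counts = Counter(next_word_pairs(sentences))
    n = sum(pair_counts.values())  # total number of adjacent pairs

    # Marginal counts: how often each word appears as first / second element.
    first_counts = Counter()
    second_counts = Counter()
    for (first, second), count in pair_counts.items():
        first_counts[first] += count
        second_counts[second] += count

    scores = {}
    for (first, second), a in pair_counts.items():
        if first != antecedent:
            continue
        b = first_counts[first] - a    # antecedent followed by a different word
        c = second_counts[second] - a  # a different word followed by `second`
        d = n - a - b - c              # pairs involving neither
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[second] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


if __name__ == "__main__":
    # Toy, hand-made sentences (not data from the thesis).
    toy_sentences = [
        ["中國", "人民", "團結"],
        ["中國", "人民", "勝利"],
        ["世界", "人民", "團結"],
        ["中國", "特色", "道路"],
    ]
    scores = chi_square_association(toy_sentences, "中國")
    for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(word, round(score, 3))
```

Because the chi-square statistic is also large when two words avoid each other, a practical version would probably keep only pairs with `a * d > b * c` (positive association) and require a minimum pair count before ranking.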
References I. Chinese-language references
1.何立行、余清祥、鄭文惠(2014),「從文言到白話:《新青年》雜誌語言變化統計研究」,《東亞觀念史集刊》,7,頁427-454。
2.余清祥(1998),「統計在紅樓夢的應用」,《政大學報》,76,頁303-327。
3.余清祥、葉昱廷(2020),「以文字探勘技術分析臺灣四大報文字風格」,《數位典藏與數位人文》,第六卷。
4.林志軒(2020)。「維度縮減於文本風格之應用研究」,政治大學統計學系學位論文。
5.林晏辰(2020)。「中文關鍵詞偵測的探討」,政治大學統計學系學位論文。
6.金觀濤(2011),「數位人文研究的理論基礎」,收錄於《數位人文研究的新視野:基礎與想像》,項潔主編,頁45-61,臺灣大學。
7.洪嘉馡、黃居仁、馬偉雲、中央研究院語言學研究所、中央研究院資訊科學研究所 (2008) ,「語料庫為本的兩岸對應詞彙發掘」,《語言暨語言學》,9(2), 頁221-238。
8.夏天(2013)。「詞語位置加權TextRank的關鍵詞抽取研究」,《現代圖書情報技術》,9,頁30-34。
9.徐超(2017)。「《人民日報》社論詞彙統計與分析」,《采寫編》,(3),頁144-145。
10.梁家安(2017)。「從國共內戰到改革開放: 人民日報風格變遷之量化研究」,政治大學統計學系學位論文。
11.馮建三(2020)。「分析台灣主要報紙的兩岸新聞與言論聚焦在《聯合報》(1951-2019)」,《台灣社會研究季刊》,115,頁151-235。
12.黃秋林、吳本虎(2009)。「政治隱喻的歷時分析——基於《人民日報》(1978—2007) 兩會社論的研究」,《語言教學與研究》,(5),頁91-96。

II. English-language references
1.Archer, J. and Jockers, M.L. (2016). The Bestseller Code, New York: St. Martin’s Press.
2.Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1.
3.Faust, K. (1994). Social Network Analysis in the Social and Behavioral Sciences, in Social Network Analysis: Methods and Applications, Cambridge University Press.
4.Kumar, A., Dabas, V., and Hooda, P. (2020). “Text classification algorithms for mining unstructured data: a SWOT analysis”, International Journal of Information Technology, Vol. 12, 1159–1169.
5.Manschreck, T. C., Maher, B. A. and Ader, D. N. (1981). “Formal thought disorder, the type-token ratio and disturbed voluntary motor movement in schizophrenia”, British Journal of Psychiatry, Vol. 139, 7–15.
6.Mihalcea, R. and Tarau, P. (2004). “TextRank: Bringing order into texts”, In: Proceedings of Conference on Empirical Methods in Natural Language Processing, Vol. 4(4), 404-411.
7.Namugenyi, C., Nimmagadda, S.L., and Reiners, T. (2019). “Design of a SWOT Analysis Model and its Evaluation in Diverse Digital Business Ecosystem Contexts”, Procedia Computer Science, Vol. 159, 1145-1154.
8.Real, R., & Vargas, J. M. (1996). “The probabilistic basis of Jaccard's index of similarity”, Systematic Biology, Vol. 45(3), 380-385.
9.Siddiqi, S., & Sharan, A. (2015). “Keyword and keyphrase extraction techniques: a literature review”, International Journal of Computer Applications, Vol. 109(2), 18-23.
10.Yue, C.J., Ho, L., Pan, Y., and Cheng, W. (2016). “A Quantitative Study of Chinese Writing Style based on the New Youth Magazine”, Concepts & Context in East Asia, Vol. 5, 87-102.
11.Yue, J.C. and Clayton, M.K. (2005). “A similarity measure based on species proportions”, Communications in Statistics-Theory and Methods, Vol. 34(11), 2123-2131.
Description Master's
National Chengchi University
Department of Statistics
109354001
Source http://thesis.lib.nccu.edu.tw/record/#G0109354001
Type thesis
dc.contributor.advisor 陳麗霞; 余清祥 zh_TW
dc.contributor.author (Authors) 吳蒨芸 zh_TW
dc.contributor.author (Authors) Wu, Qian Yun en_US
dc.creator (Author) 吳蒨芸 zh_TW
dc.creator (Author) Wu, Qian Yun en_US
dc.date (Date) 2022 en_US
dc.date.accessioned 2-Sep-2022 14:45:06 (UTC+8)
dc.date.available 2-Sep-2022 14:45:06 (UTC+8)
dc.date.issued (Upload time) 2-Sep-2022 14:45:06 (UTC+8)
dc.identifier (Other Identifiers) G0109354001 en_US
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/141544
dc.description (Description) Master's zh_TW
dc.description (Description) National Chengchi University zh_TW
dc.description (Description) Department of Statistics zh_TW
dc.description (Description) 109354001 zh_TW
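dc.description.tableofcontents Chapter 1 Introduction 1
Section 1 Research Motivation 1
Section 2 Research Objectives 2
Chapter 2 Literature Review and Research Methods 4
Section 1 Literature Review 4
Section 2 Data Description 5
Section 3 Research Methods 6
Chapter 3 Exploratory Data Analysis 11
Section 1 Richness 11
Section 2 Unevenness 18
Section 3 Similarity and Era Clustering 22
Section 4 Structural Differences (Sentence Length) 26
Chapter 4 Changes in and Comparison of Word Usage in China and Taiwan 28
Section 1 Word Pairs Found by Two Chi-Square Association Indices 28
Section 2 Shifts in Word Usage across Eras in United Daily News 35
Section 3 Shifts in Word Usage across Eras in People's Daily 39
Section 4 Differences in Word Usage between United Daily News and People's Daily 43
Chapter 5 Conclusions and Suggestions 45
Section 1 Conclusions 45
Section 2 Suggestions 46
References 48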
zh_TW
dc.format.extent 4099109 bytes
dc.format.mimetype application/pdf
dc.source.uri (Source) http://thesis.lib.nccu.edu.tw/record/#G0109354001 en_US
dc.subject (Keywords) Text analysis zh_TW
dc.subject (Keywords) Style change zh_TW
dc.subject (Keywords) Exploratory data analysis zh_TW
dc.subject (Keywords) Association index zh_TW
dc.subject (Keywords) Cross-strait differences zh_TW
dc.subject (Keywords) Text mining en_US
dc.subject (Keywords) Style change en_US
dc.subject (Keywords) Exploratory data analysis en_US
dc.subject (Keywords) Related index en_US
dc.subject (Keywords) Areal differentiation en_US
dc.title (Title) 從文字探勘比較臺灣與中國之寫作風格—以《聯合報》與《人民日報》為例 zh_TW
dc.title (Title) Comparing the Writing Style of Newspaper between Taiwan and China en_US
dc.type (Type) thesis en_US
dc.identifier.doi (DOI) 10.6814/NCCU202201492 en_US