Title From the Chinese Civil War to Reform and Opening-Up: A Quantitative Study of Style Change in People’s Daily (從國共內戰到改革開放:人民日報風格變遷之量化研究)
A Quantitative Study of Concept Change in People’s Daily
Author 梁家安
Contributors 余清祥
梁家安
Keywords Digital humanities
Biodiversity
Text mining
People’s Daily
Ecological change
Date 2017
Uploaded 13-Sep-2017 14:11:28 (UTC+8)
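Abstract With the rise of digital humanities research, digitized material is easier to obtain than ever before, and techniques for processing text data are advancing rapidly. Yet most current quantitative studies of writing first fix the research topic, determine the items to analyze, and then select variables (and related features). If the direction and topic of selection are not specified in advance, can results be discovered in a data-driven way? Because text is unstructured data, most studies take the characters and words in a document as the unit of analysis and use the occurrence counts of particular characters and words as quantitative variables, assessing whether they can distinguish the characteristics of a given topic. These analyses, however, rarely focus on the relationships between words. For example, two words that appear together may indicate some degree of association, or even reveal distinctive concepts and characteristics; computing the distance between words should therefore yield information on word correlation and even on writing style. In view of this, this thesis takes a perspective different from current text analysis: it sets out to explore the properties of key words and the relationships between words and, in the spirit of exploratory data analysis, to uncover features of words that can serve as a basis for quantifying and analyzing text.
This thesis studies about 170,000 front-page reports from People’s Daily (《人民日報》) over roughly 58 years, from 1946 to 2003, and examines changes in the paper's writing style by identifying words and the relationships between them. In addition to the explanatory variables common in text mining, namely word-frequency rankings and the number of frequently used characters in each year, it borrows the idea of ecological change from biodiversity research: the frequent two-character words of each year are compiled and, by analogy with species turnover, classified by their occurrence counts as common, newly emerged, or extinct, as a supplementary basis for judging stylistic change. The results show that the 58 years of People’s Daily can be divided roughly into four periods, that the frequent two-character words of each period differ markedly, and that the style shifts between periods very quickly. In addition, computing the distance between words reveals their associations: some words are symbiotic, while others are mutually exclusive. For example, when early People’s Daily reports mention 「美國」 (the United States), two-character words such as 「經濟」 (economy) and 「社會」 (society) usually do not appear. This shows that word distance carries important information; further mining of these relationships and contexts could make it a powerful tool for judging changes in writing style and for interpreting meaning.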
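The word-distance idea in the abstract can be illustrated with a small sketch. The record does not state how the thesis defines distance, so everything here is an assumption: distance is taken as the gap in characters between the start positions of two terms, and the function names (`positions`, `min_distance`, `near_rate`) and the `window` parameter are hypothetical.

```python
def positions(text, term):
    """Return the start index of every occurrence of term in text."""
    found = []
    start = text.find(term)
    while start != -1:
        found.append(start)
        start = text.find(term, start + 1)
    return found


def min_distance(text, term_a, term_b):
    """Smallest character gap between the two terms, or None if either is absent."""
    pos_a, pos_b = positions(text, term_a), positions(text, term_b)
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)


def near_rate(articles, term_a, term_b, window):
    """Share of articles containing term_a where term_b starts within `window` characters."""
    with_a = [t for t in articles if term_a in t]
    if not with_a:
        return 0.0
    near = 0
    for t in with_a:
        d = min_distance(t, term_a, term_b)
        if d is not None and d <= window:
            near += 1
    return near / len(with_a)
```

Under these assumptions, a low `near_rate(articles, "美國", "經濟", window)` for the early years would correspond to the mutually exclusive pattern the abstract describes, while a high value would suggest a symbiotic pair.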
Digital humanities has received a lot of attention in recent years, since texts are now easier to acquire in digital form and the computer technology for processing text data has improved significantly over the last decades. However, most quantitative studies are not truly data-driven and depend heavily on the researchers: the study goal and the related variables (or features) are determined first, and quantitative methods and models, such as word frequency, are then applied to these variables. In other words, text analysis often aims to identify differences or connections between files based on pre-selected variables, and rarely concentrates on the relationships between the variables (words) themselves. This study uses the distances between words to evaluate their relationships and to explore the connections between files based on those relationships.
Our study is based on the front-page articles of People’s Daily from 1946 to 2003, 169,739 articles in total. By identifying relationships between words, we explore changes in the literary style of People’s Daily. In addition to information commonly used in text mining, such as term frequency and overlapping words, we also borrow the notions of new and extinct species from the study of species diversity. The results show that the writing style of People’s Daily can be divided into four distinct periods and that the changes between periods are rapid. Furthermore, the new and extinct words in different periods suggest that changes in the writing style of People’s Daily are highly correlated with China’s modernization, especially in economics.
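The species-turnover analogy can be sketched as a comparison of frequent-term sets between two years. This is a minimal illustration, not the thesis's procedure: it assumes overlapping two-character sequences as a stand-in for two-character words (no word segmentation), and the function names and the `top_n` cutoff are hypothetical.

```python
from collections import Counter


def char_bigrams(text):
    """Overlapping two-character sequences, skipping any that contain whitespace."""
    pairs = (text[i:i + 2] for i in range(len(text) - 1))
    return [p for p in pairs if not any(c.isspace() for c in p)]


def frequent_bigrams(articles, top_n):
    """The top_n most frequent bigrams across one year's articles, as a set."""
    counts = Counter()
    for text in articles:
        counts.update(char_bigrams(text))
    return {bigram for bigram, _ in counts.most_common(top_n)}


def classify_turnover(previous, current):
    """Split two years' frequent-bigram sets into common, new, and extinct terms."""
    return {
        "common": previous & current,   # frequent in both years
        "new": current - previous,      # frequent only in the later year
        "extinct": previous - current,  # frequent only in the earlier year
    }
```

Applying `classify_turnover` to consecutive years and tracking the sizes of the "new" and "extinct" sets is one way a rapid shift between periods could show up, assuming the cutoff is held fixed across years.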
Description Master's
National Chengchi University
Department of Statistics
104354031
Source http://thesis.lib.nccu.edu.tw/record/#G0104354031
Type thesis
dc.contributor.advisor 余清祥 [zh_TW]
dc.contributor.author (Authors) 梁家安 [zh_TW]
dc.creator (Author) 梁家安 [zh_TW]
dc.date (Date) 2017 [en_US]
dc.date.accessioned 13-Sep-2017 14:11:28 (UTC+8)
dc.date.available 13-Sep-2017 14:11:28 (UTC+8)
dc.date.issued (Uploaded) 13-Sep-2017 14:11:28 (UTC+8)
dc.identifier (Other Identifiers) G0104354031 [en_US]
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/112615
dc.description (Description) Master's [zh_TW]
dc.description (Description) National Chengchi University [zh_TW]
dc.description (Description) Department of Statistics [zh_TW]
dc.description (Description) 104354031 [zh_TW]
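dc.description.tableofcontents Chapter 1 Introduction
Section 1 Research Motivation
Section 2 Research Objectives
Chapter 2 Literature Review
Chapter 3 Research Methods and Data
Section 1 Data Sources
Section 2 Research Methods
Chapter 4 Year-by-Year Changes in Word Usage
Section 1 Changes in Single-Character Words
Section 2 Changes in Two-Character Words
Chapter 5 Associations Between Words
Section 1 Methods for Measuring Word Association
Section 2 Measuring Word Association by Order Relations
Chapter 6 Conclusions and Recommendations
Section 1 Conclusions
Section 2 Recommendations
References
Appendix
[zh_TW]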
dc.format.extent 1015742 bytes
dc.format.mimetype application/pdf
dc.source.uri (Source) http://thesis.lib.nccu.edu.tw/record/#G0104354031 [en_US]
dc.subject (Keywords) Digital humanities [en_US]
dc.subject (Keywords) Biodiversity [en_US]
dc.subject (Keywords) Text mining [en_US]
dc.subject (Keywords) People’s Daily [en_US]
dc.subject (Keywords) Ecological change [en_US]
dc.title (Title) From the Chinese Civil War to Reform and Opening-Up: A Quantitative Study of Style Change in People’s Daily (從國共內戰到改革開放:人民日報風格變遷之量化研究) [zh_TW]
dc.title (Title) A Quantitative Study of Concept Change in People’s Daily [en_US]
dc.type (Type) thesis [en_US]