A Study of Exploratory Data Analysis on Text Data: A Case Study Based on New Youth Magazine (探索性資料分析方法在文本資料中的應用─以「新青年」雜誌為例)


Title	探索性資料分析方法在文本資料中的應用─以「新青年」雜誌為例 (A Study of Exploratory Data Analysis on Text Data: A Case Study Based on New Youth Magazine)
Author	潘艷艷 (Pan, Yan Yan)
Contributors	余清祥 (Yue, Jack); 潘艷艷 (Pan, Yan Yan)
Keywords	Unstructured Data; Text Analysis; Exploratory Data Analysis; Principal Component Analysis; Logistic Regression
Date	2015
Upload time	3-Feb-2016 11:16:27 (UTC+8)
Abstract	Enormous volumes of data are produced every moment, online and offline, as the economy and the Internet develop. Roughly 80 percent of these data are unstructured, such as text and images, so quantifying them and choosing suitable analysis methods is key to extracting useful information. For text data, this thesis proposes an exploratory data analysis (EDA) procedure and uses language change in New Youth magazine (《新青年》) as a case study to show how text features can be selected, quantified, and analyzed. First, taking the volume as the unit of analysis, we quantify the text structure of each volume from several angles: character usage, sentence usage, classical and vernacular function characters, and the sharing of common characters and words across volumes. Combining several kinds of charts, we trace how the language of New Youth changed, covering both the shift from classical Chinese (文言文) to vernacular Chinese (白話文) and the evolution of the vernacular itself. Second, based on these volume-level results, we look for text variables that separate classical from vernacular writing. Using Volumes 1 and 7 as training samples, we combine principal component analysis with logistic regression to train a classifier for articles in the two forms, and test it on Volume 4. The results confirm that the extracted variables distinguish classical from vernacular articles effectively. Third, drawing on the exploratory results and the experience of humanities scholars, we examine the later change in the magazine's language, from the vernacular of the May Fourth Movement to a vernacular characterized by "Red Chinese" (紅色中文, the vernacular used in China after the Second World War). A classifier trained on Volumes 7 and 11 confirms that the two volumes differ clearly in language. Applying it to the United Daily News (《聯合報》) from Taiwan and the People's Daily (《人民日報》) from mainland China shows a clear difference between the two newspapers: the style of the United Daily News is closer to that of Volume 7, while the style of the People's Daily is closer to that of Volume 11. The author reads this as a sign that the People's Daily was likely influenced by the Soviet Union, and notes that the finding deserves further study.
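The classification step described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the marker characters, the toy sentences, and the function names (`marker_rates`, `fit_logistic`, `prob_vernacular`) are hypothetical and do not come from the thesis. To stay dependency-free, the sketch omits the principal component step and fits a plain logistic regression by batch gradient descent on two character-rate features.

```python
import math
from collections import Counter

# Hypothetical marker characters, chosen for illustration only;
# the thesis's actual text variables are not listed in this record.
CLASSICAL_MARKERS = "之乎者也矣焉哉"
VERNACULAR_MARKERS = "的了是們這那麼"


def marker_rates(text):
    """Return (classical, vernacular) marker shares among the Chinese characters of text."""
    chars = [c for c in text if "\u4e00" <= c <= "\u9fff"]
    if not chars:
        return (0.0, 0.0)
    counts = Counter(chars)
    classical = sum(counts[c] for c in CLASSICAL_MARKERS) / len(chars)
    vernacular = sum(counts[c] for c in VERNACULAR_MARKERS) / len(chars)
    return (classical, vernacular)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(features, labels, lr=1.0, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n, d = len(features), len(features[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def prob_vernacular(text, weights, bias):
    """Estimated probability that text is vernacular (label 1)."""
    x = marker_rates(text)
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Toy training sentences (not thesis data): label 0 = classical, 1 = vernacular.
TRAIN = [
    ("學而時習之不亦說乎", 0),
    ("有朋自遠方來不亦樂乎", 0),
    ("人不知而不慍不亦君子乎", 0),
    ("我們的學校是新的", 1),
    ("這是他們的書了", 1),
    ("那麼我們走了", 1),
]

WEIGHTS, BIAS = fit_logistic(
    [marker_rates(t) for t, _ in TRAIN], [y for _, y in TRAIN]
)

# Held-out sentences play the role that Volume 4 plays in the abstract.
print(prob_vernacular("他們的問題是這樣的", WEIGHTS, BIAS))  # vernacular test sentence
print(prob_vernacular("知之者不如好之者", WEIGHTS, BIAS))    # classical test sentence
```

With real data, each article would yield a longer feature vector (for example sentence length and many function-character rates), and a principal component step before the regression would reduce correlated features to a few components, as the abstract describes.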
Description	Master's thesis; National Chengchi University (國立政治大學); Department of Statistics; 102354031
Source	http://thesis.lib.nccu.edu.tw/record/#G0102354031
Type	thesis

dc.contributor.advisor	余清祥	zh_TW
dc.contributor.advisor	Yue, Jack	en_US
dc.contributor.author (Authors)	潘艷艷	zh_TW
dc.contributor.author (Authors)	Pan, Yan Yan	en_US
dc.creator (Author)	潘艷艷	zh_TW
dc.creator (Author)	Pan, Yan Yan	en_US
dc.date (Date)	2015	en_US
dc.date.accessioned	3-Feb-2016 11:16:27 (UTC+8)	-
dc.date.available	3-Feb-2016 11:16:27 (UTC+8)	-
dc.date.issued (Upload time)	3-Feb-2016 11:16:27 (UTC+8)	-
dc.identifier (Other Identifiers)	G0102354031	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/81107	-
dc.description (Description)	碩士 (Master's degree)	zh_TW
dc.description (Description)	國立政治大學 (National Chengchi University)	zh_TW
dc.description (Description)	統計學系 (Department of Statistics)	zh_TW
dc.description (Description)	102354031	zh_TW
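dc.description.tableofcontents	Front matter: Abstract; Table of Contents; List of Tables; List of Figures. Chapter 1, Introduction: 1.1 Research Background and Motivation; 1.2 Organization of the Thesis. Chapter 2, Literature Review and Research Methods: 2.1 Literature Review; 2.2 Research Methods. Chapter 3, A Preliminary Exploration of the Texts of New Youth: 3.1 Changes in Character Usage; 3.2 Changes in Sentence Usage; 3.3 Changes in Function-Character Usage; 3.4 Sharing of Common Characters and Words across Volumes; 3.5 Chapter Summary. Chapter 4, Classification of Classical and Vernacular Chinese: 4.1 Text Variable Selection and Principal Component Extraction; 4.2 Classification Training on Volumes 1 and 7 of New Youth; 4.3 Classification Prediction for Volume 4 of New Youth. Chapter 5, Classification of May Fourth Vernacular and "Red Chinese": 5.1 Text Variables and Text Classification; 5.2 Stylistic Leaning of the United Daily News and the People's Daily. Chapter 6, Conclusions and Suggestions: 6.1 Conclusions; 6.2 Suggestions for Future Research. Back matter: References; Appendix Tables.	zh_TW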
dc.format.extent	1706325 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0102354031	en_US
dc.subject (Keywords)	Unstructured Data	en_US
dc.subject (Keywords)	Text Analysis	en_US
dc.subject (Keywords)	Exploratory Data Analysis	en_US
dc.subject (Keywords)	Principal Component Analysis	en_US
dc.subject (Keywords)	Logistic Regression	en_US
dc.title (Title)	探索性資料分析方法在文本資料中的應用─以「新青年」雜誌為例	zh_TW
dc.title (Title)	A Study of Exploratory Data Analysis on Text Data: A Case Study Based on New Youth Magazine	en_US
dc.type (Type)	thesis	en_US
