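Title: 運用社會網絡技術由文集中探勘觀念:以新青年為例
Concept Discovery from Essays based on Social Network Mining: Using New Youth as an Example
Author: 陳柏聿 (Chen, Po Yu)
Contributors: 沈錳坤 (Shan, Man Kwan); 陳柏聿 (Chen, Po Yu)
Keywords: 社會網絡分析 (Social Network Analysis); 觀念探勘 (Concept Mining); 文字探勘 (Text Mining)
Date: 2014
Upload time: 2-Mar-2015 10:13:20 (UTC+8)
Abstract (translated from the Chinese): Scholars in the humanities and history have traditionally studied and analyzed their sources by hand. That approach works while the volume of material is small, but with the progress of digital archiving and the rise of big data, books, ancient texts, and documents have been digitized in large numbers, and item-by-item manual analysis now costs a great deal of time and labor. Because the material is digital, however, researchers in information science can assist with computational techniques.
In the history of ideas, the study of keyword clusters is one of the central topics. A concept can be expressed by keywords or by sentences containing those keywords, so studying keywords helps humanities scholars understand the meaning behind historical documents and grasp the context of the period. This thesis therefore takes a collection of many essays and examines how terms occur together across those essays. Using five kinds of co-occurrence relationships, it brings the idea of social networks into text analysis: each term is treated as a node and the association between two terms as an edge, forming a term network from which the concepts formed by the terms are identified. Finally, a system for mining concepts from an essay collection is implemented. It provides three main analysis functions: multi-keyword concept query, single-keyword concept query, and latent concept mining.
The study uses the magazine New Youth (《新青年》) as its main corpus and case study. The concepts in New Youth shifted from liberalism to Marxism-Leninism, and the system is indeed able to trace this shift and to mine the key terms under each of the two concepts.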
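The term network described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the abstract does not spell out its five co-occurrence relationships, so the sketch assumes the simplest one (two terms appearing in the same essay) and uses made-up English tokens in place of segmented text. The names `build_term_network`, `neighbors`, and `min_cooccurrence` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_term_network(documents, min_cooccurrence=1):
    """Return {(term_a, term_b): count} for term pairs sharing a document.

    `documents` is a list of token lists (already segmented and filtered).
    Each pair is stored once, with the two terms in sorted order.
    """
    edge_weights = Counter()
    for tokens in documents:
        # Count each pair at most once per document.
        for a, b in combinations(sorted(set(tokens)), 2):
            edge_weights[(a, b)] += 1
    return {pair: n for pair, n in edge_weights.items() if n >= min_cooccurrence}


def neighbors(edges, term):
    """Terms linked to `term`, strongest co-occurrence first."""
    linked = []
    for (a, b), weight in edges.items():
        if term == a:
            linked.append((b, weight))
        elif term == b:
            linked.append((a, weight))
    return sorted(linked, key=lambda item: (-item[1], item[0]))


# Toy, made-up token lists standing in for segmented essays.
docs = [
    ["freedom", "democracy", "science"],
    ["freedom", "democracy"],
    ["labor", "class", "revolution"],
    ["class", "revolution", "freedom"],
]
edges = build_term_network(docs, min_cooccurrence=2)
print(edges)
print(neighbors(edges, "class"))
```

Storing each edge once under a sorted pair keeps the network undirected; a real corpus would first need word segmentation and stop-word removal, which this sketch leaves out.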
Abstract (English): With the development of digital archives, essays have been digitized. Analyzing the contents of essays by hand takes a great deal of time, so computer-assisted analysis is beneficial. This thesis investigates an approach to discovering the concepts in essays based on social network mining techniques. Because a concept can be represented as a set of keywords, the proposed approach measures the co-occurrence relationships between pairs of keywords and represents the relationships among keywords as keyword networks. Social network mining techniques are then employed to discover the concepts of the essays. We also develop a concept discovery system that provides discovery by multiple keywords, discovery by single keyword, and latent concept mining. New Youth is taken as an example to demonstrate the capability of the developed system.
References:
[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, D.C., May 1993.
[2] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proceedings of the 20th International Conference on Very Large Data Bases, 1994.
[3] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, "Fast Unfolding of Communities in Large Networks," Journal of Statistical Mechanics: Theory and Experiment, P10008, 2008.
[4] P. Bonacich, "Factoring and Weighting Approaches to Status Scores and Clique Identification," Journal of Mathematical Sociology, Vol. 2, No. 1, pp. 113-120, 1972.
[5] R. L. Breiger, "The Analysis of Social Networks," Handbook of Data Analysis, London: Sage Publications, pp. 505-526, 2004.
[6] H. Cramér, Mathematical Methods of Statistics, Princeton University Press, Princeton, p. 282, 1946.
[7] L. C. Freeman, "Centrality in Social Networks: Conceptual Clarification," Social Networks, Vol. 1, No. 3, pp. 215-239, 1979.
[8] M. Girvan and M. E. J. Newman, "Community Structure in Social and Biological Networks," Proceedings of the National Academy of Sciences (PNAS), pp. 7821-7826, 2002.
[9] J.-W. Huang, B.-R. Dai, and M.-S. Chen, "Twain: Two-End Association Miner with Precise Frequent Exhibition Periods," ACM Transactions on Knowledge Discovery from Data, Vol. 1, No. 2, 2007.
[10] K. S. Jones, "A Statistical Interpretation of Term Specificity and Its Application in Retrieval," Journal of Documentation, Vol. 28, pp. 11-24, 1972.
[11] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, Cambridge, 2008.
[12] M. E. J. Newman, "Fast Algorithm for Detecting Community Structure in Networks," Physical Review E, Vol. 69, No. 6, 066133, 2004.
[13] M. E. J. Newman, "The Structure and Function of Complex Networks," SIAM Review, Vol. 45, No. 2, 2003.
[14] K. Pearson, "Note on Regression and Inheritance in the Case of Two Parents," Proceedings of the Royal Society of London, Vol. 58, pp. 240-242, 1895.
[15] K. Pearson, "On the Criterion that a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is Such that it Can be Reasonably Supposed to have Arisen from Random Sampling," Philosophical Magazine, Series 5, Vol. 50, No. 302, pp. 157-175, 1900.
[16] C. Spearman, "The Proof and Measurement of Association Between Two Things," American Journal of Psychology, Vol. 15, pp. 72-101, 1904.
[17] G. Salton, A. Wong, and C. S. Yang, "A Vector Space Model for Automatic Indexing," Communications of the ACM, Vol. 18, No. 11, pp. 613-620, 1975.
[18] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications, Cambridge University Press, Cambridge, 1994.
[19] 項潔、涂豐恩,〈導論——什麼是數位人文〉 [Introduction: What Is Digital Humanities], in 《從保存到創造:開啟數位人文研究》 [From Preservation to Creation: Opening Up Digital Humanities Research], pp. 9-28, Taipei: National Taiwan University Press, 2011.
[20] 金觀濤、劉青峰,〈觀念史研究:中國現代重要政治術語的形成〉 [Studies in the History of Ideas: The Formation of Important Modern Chinese Political Terms], Chinese University Press, 2008.
[21] 金觀濤、梁穎誼、姚育松、劉昭麟,〈統計偏離值分析於人文研究上的應用:以《新青年》為例〉 [Applying Statistical Outlier Analysis to Humanities Research: The Case of New Youth], 4th International Conference of Digital Archives and Digital Humanities, 2012.
[22] 金觀濤、邱偉雲、劉昭麟,〈「共現」詞頻分析及其運用-以「華人」觀念起源為例〉 [Co-occurrence Word-Frequency Analysis and Its Application: The Origin of the Concept "華人" as an Example], 4th International Conference of Digital Archives and Digital Humanities, 2012.
[23] 余清祥,〈統計在紅樓夢的應用〉 [Applications of Statistics to Dream of the Red Chamber], 《政人學報》, No. 76, pp. 303-327, 1998.
[24] 中國近現代思想及文學史專業數據 (1830-1930) [database of modern Chinese thought and literary history], http://digibase.ssic.nccu.edu.tw/?m=2302&wsn=0101
[25] 《新青年》文獻簡介 [introduction to the New Youth documents], http://digibase.ssic.nccu.edu.tw/?m=2302&wsn=0304
[26] Jieba (結巴) Chinese word segmentation package, https://github.com/fxsjy/jieba
[27] community package, https://bitbucket.org/taynaud/python-louvain
[28] Networkx package, https://networkx.github.io/
Description: Master's
National Chengchi University
Department of Computer Science
100753013
103
Source: http://thesis.lib.nccu.edu.tw/record/#G0100753013
Type: thesis
dc.contributor.advisor: 沈錳坤 (Shan, Man Kwan)
dc.identifier (Other Identifiers): G0100753013
dc.identifier.uri (URI): http://nccur.lib.nccu.edu.tw/handle/140.119/73570
dc.description.tableofcontents:
Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation and Objectives
1.3 Corpus Used in This Study
1.4 Thesis Organization
Chapter 2 Digital Humanities Background and Literature Review
2.1 Digital Humanities
2.2 Capturing Concepts with Keywords
Chapter 3 Mining Latent Concepts from Text
3.1 System Architecture
3.2 Preprocessing
3.2.1 Word Segmentation
3.2.2 Part-of-Speech Analysis and Stop-Word Handling
3.2.3 Building the Inverted Index
3.3 Extracting Frequent Keyword Sets from Text
3.3.1 Single-Term Keyword Cluster Extraction
3.3.1.1 Computing Term Relatedness
3.3.1.2 Computing Term Importance
3.3.1.3 Combining Term Relatedness with Term Importance
3.3.2 Latent-Concept Keyword Cluster Extraction
3.4 Mining Concepts from Keyword Clusters
3.4.1 Building the Term Network
3.4.2 Centrality Analysis of the Term Network
3.4.2.1 Degree Centrality
3.4.2.2 Closeness Centrality
3.4.2.3 Betweenness Centrality
3.4.2.4 Eigenvector Centrality
3.4.3 Community Detection in the Term Network
Chapter 4 System Implementation and Case Study
4.1 Corpus Data
4.2 System Implementation
4.2.1 Overview of System Functions
4.2.2 Multi-Keyword Concept Query
4.2.3 Single-Keyword Concept Query
4.2.4 Latent Concept Mining
4.2.5 Interface Platform and Packages Used
4.3 Case Study
Chapter 5 Conclusions and Future Work
References
Appendices
Appendix 1: ICTCLAS Part-of-Speech Tag Table
Appendix 2: Critical Values of the Chi-Square Distribution
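Sections 3.4.2.1 and 3.4.2.2 of the table of contents name degree and closeness centrality as measures applied to the term network. This record does not show how the thesis computes them, so the sketch below uses the standard textbook definitions on an unweighted, undirected toy network. The function names and the sample terms are assumptions for illustration only.

```python
from collections import deque


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly linked to.

    Assumes the graph has at least two nodes.
    """
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def closeness_centrality(graph, source):
    """Reachable-node count divided by the sum of shortest-path lengths.

    For a connected graph this is (n - 1) / (sum of distances from source).
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search over unweighted edges
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


# Toy undirected term network (made-up terms), stored as adjacency sets.
graph = {
    "freedom": {"democracy", "science", "rights"},
    "democracy": {"freedom", "science"},
    "science": {"freedom", "democracy"},
    "rights": {"freedom"},
}
degree = degree_centrality(graph)
closeness = {term: closeness_centrality(graph, term) for term in graph}
print(max(degree, key=degree.get))
print(closeness["rights"])
```

In this toy network "freedom" touches every other term, so it scores highest on both measures, while "rights" hangs off a single edge and scores lowest. Betweenness and eigenvector centrality (sections 3.4.2.3 and 3.4.2.4) need more machinery and are left out of the sketch.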