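Title: 臺灣碩博士論文之文字分析—以商業及管理學門摘要為例
Text Analysis of Master’s and Doctoral Theses in Taiwan: A Study on Abstracts in the Field of Business and Administration
Author: 劉貞莉 (Liu, Chen-Li)
Contributors: 陳怡如 (Chen, Yi-Ju) and 余清祥 (Yue, Ching-Syang), advisors; 劉貞莉 (Liu, Chen-Li), author
Keywords: 文字分析 (Text Analysis); 中文斷詞 (Word Segmentation); 探索性資料分析 (Exploratory Data Analysis); 文本分類 (Classification); 關聯性分析 (Association Analysis)
Date: 2024
Upload time: 4-Sep-2024 14:55:57 (UTC+8)
Abstract (translated from the Chinese original): Ever since writing was invented, it has been an important tool for passing on knowledge, stories, and emotions. Text analysis makes it possible to explore the cultural and technological development, social characteristics, and trajectories of change of each era, and to uncover the key factors behind them in detail. An abstract is a miniature of an article or book: its wording and content usually give a glimpse of the key points of the full text. For an academic paper, readers should be able to learn the research purpose, conclusions, and major insights from the abstract. This study examines the abstracts of master’s and doctoral theses in the field of business and management in Taiwan from academic years 107 to 109 of the Republic of China calendar (2018 to 2020). Besides summarizing writing style such as word usage, it applies tools such as cluster analysis to dissect the textual style of the three units of an abstract and to compare the characteristics of theses across the disciplines of the field, with the aim of helping readers write and read business and management theses.
Because modern Chinese is written mainly in the vernacular, whose basic unit is usually a word made up of two or more characters, analysis of vernacular text starts with word segmentation to obtain the terms closest to the intended meaning. This study first compares two segmentation packages, Jieba and CKIP, in terms of execution time, number of words, word proportions, word types, and segmentation accuracy, as a reference for users analyzing Chinese text. The text analysis of the abstracts relies mainly on exploratory data analysis: each abstract is manually labeled into three units, "Motivation and Purpose", "Methods and Materials", and "Conclusions and Suggestions", and the writing style of the ten disciplines is explored through common words, word diversity, and co-occurring word clusters derived from the segmentation results. The analysis shows that CKIP segmentation captures the idiomatic terms of Taiwanese business and management thesis abstracts, and its overall results are closer to what this study expected. The three units of an abstract have clearly distinct features and formats, and the ten disciplines fall into three clusters: healthcare management, accounting, and the remaining disciplines. In addition, with the common words and co-occurring word clusters of each cluster and unit as explanatory variables, classification models can effectively separate the three clusters of disciplines and the three units of the abstract.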
Abstract (author's English version): Writing has long been a crucial tool for exchanging knowledge and expressing emotions. Through text analysis, we can explore the cultural and technological developments of different eras and understand social characteristics and changes. An abstract serves as the epitome of an article or book, often providing key insights into the full text. For example, readers can usually discern the research objectives, conclusions, and significant insights from the abstract of an academic paper. This thesis studies the abstracts of master’s and doctoral theses in the field of business and administration (BA) in Taiwan between 2018 and 2020, using cluster analysis to dissect the textual styles of the three sections of the abstracts. The goal is to compare the characteristics of theses across BA disciplines and to assist readers in writing and understanding BA academic papers. Modern written Chinese typically uses words of two or more characters as its basic unit, so word segmentation is the first step in analyzing Chinese text. We evaluate two word segmentation tools, Jieba and CKIP, and compare them in terms of execution time and segmentation accuracy to provide a reference for users analyzing Chinese text. For the study of textual style, we apply exploratory data analysis tools and examine common terms, word diversity, and co-occurring terms in the abstracts, based on the word segmentation results. Each abstract is divided into three sections: Motivations & Purposes, Methods & Materials, and Conclusions & Suggestions. The results show that CKIP captures the terms commonly used in master’s and doctoral thesis abstracts in Taiwan and aligns better with the expectations of this study. The ten BA disciplines fall into three clusters: healthcare management, accounting, and the remaining disciplines. Additionally, by using the common terms and co-occurring terms as explanatory variables in classification models, we can effectively distinguish the three clusters of BA disciplines and the three sections of the abstracts.
Description: Master’s
National Chengchi University
Department of Statistics
111354014
Source: http://thesis.lib.nccu.edu.tw/record/#G0111354014
Type: thesis
Identifier: G0111354014
URI: https://nccur.lib.nccu.edu.tw/handle/140.119/153362
Format: application/pdf, 16218020 bytes
Table of contents:
Chapter 1. Introduction
  Section 1. Motivation and Purpose
  Section 2. Research Process
Chapter 2. Literature Review and Research Methods
  Section 1. Literature Review
  Section 2. Research Materials
  Section 3. Manual Annotation
  Section 4. Chinese Word Segmentation Packages
  Section 5. Exploratory Data Analysis of Vocabulary
  Section 6. Classification Models
Chapter 3. Comparison of Chinese Word Segmentation Packages
  Section 1. Execution Time
  Section 2. Word Types, Counts, and Distribution
  Section 3. Word Cloud Analysis
  Section 4. Comparison of Segmentation Performance
  Section 5. Summary of the Segmentation Comparison
Chapter 4. Writing Style Analysis
  Section 1. Structural Analysis
  Section 2. Analysis of Textual Characteristics
  Section 3. Discipline Clustering and Validation
Chapter 5. Identification of Abstract Units
  Section 1. Data Preprocessing
  Section 2. Variable Description and Exploration
  Section 3. Model Results and Validation
Chapter 6. Conclusions and Suggestions
  Section 1. Conclusions
  Section 2. Research Limitations and Suggestions
References
Appendix 1. Supplementary Figures and Tables on Sample Representativeness
Appendix 2. Supplementary Figures and Tables on the Comparison of Chinese Word Segmentation Packages
Appendix 3. Supplementary Figures and Tables on Associated Word Clusters
Appendix 4. Supplementary Figures and Tables on Feature Importance for Abstract Unit Identification
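Note on the measures named in the abstract: the abstract lists word diversity and co-occurring terms among the exploratory measures computed from the segmentation output, but this record does not say which formulas the thesis uses. The following is only a minimal sketch under stated assumptions: Shannon entropy and the Simpson index stand in for "word diversity", two tokens are treated as co-occurring when they appear in the same abstract unit, and the function names and toy tokens are hypothetical. The input is already-segmented text, so neither Jieba nor CKIP is called here.

```python
from collections import Counter
from itertools import combinations
from math import log


def shannon_entropy(tokens):
    """Shannon entropy (in nats) of the token frequency distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())


def simpson_index(tokens):
    """Simpson index: chance that two tokens drawn with replacement are the same word."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())


def cooccurrence_counts(units):
    """Count unordered pairs of distinct tokens that appear in the same unit."""
    pairs = Counter()
    for tokens in units:
        # Deduplicate and sort so each pair has one canonical key per unit.
        for a, b in combinations(sorted(set(tokens)), 2):
            pairs[(a, b)] += 1
    return pairs


# Toy, hand-segmented units standing in for segmenter output (illustrative only).
units = [
    ["本", "研究", "探討", "企業", "績效"],
    ["研究", "結果", "顯示", "企業", "績效", "提升"],
]
all_tokens = [t for unit in units for t in unit]
print(round(shannon_entropy(all_tokens), 3))
print(round(simpson_index(all_tokens), 3))
print(cooccurrence_counts(units).most_common(3))
```

A higher entropy or a lower Simpson index indicates a more varied vocabulary. Comparing these values across the three abstract units or across disciplines is one way such measures could feed an exploratory analysis.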