Exploring the interdisciplinary potential of research through LLM-driven subject area analysis (基於大型語言模型的論文主題分析與跨領域應用探索) | Publication

Title	Exploring the interdisciplinary potential of research through LLM-driven subject area analysis (基於大型語言模型的論文主題分析與跨領域應用探索)
Author	羅延康 Lo, In-Hong
Contributors	廖文宏 Liao, Wen-Hung (advisor); 羅延康 Lo, In-Hong (author)
Keywords	Large Language Models; Cross-domain Collaboration; Knowledge Graph; Keyword Co-occurrence
Date	2025
Upload time	4-Aug-2025 15:10:04 (UTC+8)
Abstract	This study combines keyword co-occurrence knowledge graph construction with Large Language Models (LLMs) to assess the interdisciplinary nature of academic papers and the cross-domain dynamics of researchers. Using Python, NLTK, py2neo, and Sentence-Transformers, we retrieved the titles and abstracts of papers affiliated with National Chengchi University (NCCU) from the Scopus database. After BERT-based semantic mining, synonym consolidation, and clustering, we built a keyword co-occurrence matrix and a knowledge graph. Neo4j was used for visualization and centrality analysis to identify influential keywords and “bridge nodes” across domains. To illustrate the method, several subgraph analyses were conducted. For example, the “Nation and Geopolitics” subgraph revealed the international-relations and economic context behind co-occurring keywords such as China, Taiwan, the United States, and Hong Kong, which supports tracking research trends and identifying emerging topics.
To evaluate interdisciplinarity, the study used the journal subject area labels provided by Scopus and also applied the GPT-4o LLM to assign multiple subject area labels to each paper under two input modes: (1) title only and (2) title plus abstract. The “At Least One Accuracy” metric measures whether the model correctly predicts at least one subject area for each paper. The model performed consistently on this metric under both input modes, indicating sound semantic understanding and judgment: it covers the key topics of a paper well enough for preliminary classification and topic exploration. In the multi-label setting, raising the classification threshold improved precision but markedly reduced recall, a precision-recall tradeoff.
To capture the interdisciplinary character of researchers and institutions more fully, entropy was introduced as a quantitative diversity measure. It was used to analyze the breadth and evolution of interdisciplinarity across colleges and individual researchers, to identify potential collaborative groups and bridging authors, and to trace shifts in research focus and future collaboration opportunities over time.
Building on this framework, the study designed a collaborator recommendation system that combines LLM-based journal classification with symmetric KL divergence (SymKL). Taking the screening of potential collaborators for National Science and Technology Council (NSTC) projects as an example, an LLM classifies each project abstract semantically, determines its key application areas, and maps them to Scopus standard subject area scores. SymKL then measures the distributional similarity between each researcher and the project topic, so that higher education and research institutions can shortlist candidates who closely match the project. The approach identifies researchers with interdisciplinary coverage quickly and objectively, helps assemble complementary, application-oriented teams, improves the efficiency and accuracy of the recommendation process, and gives project leaders timely decision support.
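The three quantitative measures named in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the dictionary-based representation of subject area distributions, the smoothing constant `eps`, the base-2 logarithm for entropy, and the choice to define SymKL as the sum (rather than the average) of the two directed KL divergences are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Set


def at_least_one_accuracy(predicted: List[Set[str]], actual: List[Set[str]]) -> float:
    """Share of papers for which at least one predicted label is a reference label."""
    if not predicted:
        return 0.0
    hits = sum(1 for p, a in zip(predicted, actual) if p & a)
    return hits / len(predicted)


def shannon_entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy in bits of a subject area distribution (weights need not sum to 1)."""
    total = sum(dist.values())
    if total <= 0:
        return 0.0
    probs = [v / total for v in dist.values() if v > 0]
    return -sum(p * math.log2(p) for p in probs)


def sym_kl(p: Dict[str, float], q: Dict[str, float], eps: float = 1e-9) -> float:
    """Symmetric KL divergence, here taken as KL(p||q) + KL(q||p).

    eps is a hypothetical smoothing constant so that subject areas missing
    from one distribution do not produce a division by zero.
    """
    areas = sorted(set(p) | set(q))
    ps = [p.get(a, 0.0) + eps for a in areas]
    qs = [q.get(a, 0.0) + eps for a in areas]
    p_sum, q_sum = sum(ps), sum(qs)
    ps = [x / p_sum for x in ps]
    qs = [x / q_sum for x in qs]
    return sum(pi * math.log(pi / qi) + qi * math.log(qi / pi)
               for pi, qi in zip(ps, qs))


if __name__ == "__main__":
    # Hypothetical subject area profiles for a researcher and a project.
    researcher = {"Computer Science": 0.6, "Social Sciences": 0.4}
    project = {"Computer Science": 0.5, "Economics": 0.5}
    print("entropy:", shannon_entropy(researcher))
    print("SymKL:", sym_kl(researcher, project))
```

Under these assumptions, a higher entropy means a researcher's output is spread over more subject areas, and a lower SymKL value means a researcher's subject area profile is closer to that of the project.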
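Description	Master's thesis; National Chengchi University (國立政治大學); In-service Master's Program, Department of Computer Science (資訊科學系碩士在職專班); 111971011
Source	http://thesis.lib.nccu.edu.tw/record/#G0111971011
Type	thesis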

dc.identifier.uri (URI)	https://nccur.lib.nccu.edu.tw/handle/140.119/158708	-
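dc.description.tableofcontents
Acknowledgements
Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Research Background and Motivation
  1.2 Research Objectives
  1.3 Research Contributions
  1.4 Thesis Organization
Chapter 2 Literature Review
  2.1 Early Quantitative Indicators of Interdisciplinarity
  2.2 Large Language Models
  2.3 Entropy Analysis
  2.4 Kullback–Leibler Divergence (KL Divergence)
  2.5 Spectral Clustering
  2.6 Knowledge Graphs
  2.7 Social Network Analysis
Chapter 3 Methodology
  3.1 Dataset
  3.2 Item 1: Entropy Analysis
  3.3 Item 2: Symmetric KL Divergence for Assessing Academic Similarity and Collaboration Potential of Interdisciplinary Researchers
  3.4 Item 3: NSTC Project Collaborator Recommendation Workflow
  3.5 Item 4: Knowledge Graph Construction
Chapter 4 Case Studies
  4.1 Item 1: Entropy Analysis
  4.2 Item 2: Applying Symmetric KL Divergence to Collaboration Potential
  4.3 Item 3: NSTC Project Collaborator Recommendation Case: Combining LLM Paper Classification with SymKL Analysis Results
  4.4 Item 4: Analyzing Researchers' Degree of Interdisciplinarity with the Knowledge Graph
Chapter 5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
References
Appendix A
  A.1 Building the Keyword Co-occurrence Knowledge Graph: Full Example Using Paper Titles Only
  A.2 Building the Keyword Co-occurrence Knowledge Graph: Full Example Using Paper Titles and Abstracts
  A.3 LLM Journal Classification: Full Example Using Paper Titles Only
  A.4 LLM Journal Classification: Full Example Using Paper Titles and Abstracts
  A.5 Full Connected Component Scoring Example
  A.6 Full Classification Content for the NSTC Project Case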
dc.format.extent	6311076 bytes	-
dc.format.mimetype	application/pdf	-
