Title 以階層式狄利克雷歷程混合模型進行文本分析
Application of Hierarchical Dirichlet Process Mixture Model in Text Analysis
Author 楊立行
Contributor Department of Psychology
Keywords Text analysis; Topic model; Hierarchical Dirichlet process mixture model
Text Analysis; LDA; Hierarchical Dirichlet Process Mixture Model
Date 2022-03
Uploaded 28-May-2025 14:06:55 (UTC+8)
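Abstract Thanks to advances in probability models and algorithms, modern text analysis has greatly expanded researchers' capacity to handle textual data. Beyond extracting structural features from texts, such as keywords, topic models can extract content features, namely topics, and use the co-occurrence of words to give researchers an understanding of texts at the level of meaning. The well-known topic model LDA (Latent Dirichlet Allocation) builds on the Dirichlet process mixture model, treats each topic as a probability distribution, and uses Bayesian inference to estimate the most suitable number of topics and the parameters of each topic model, so as to predict the probability distribution of words in texts as well as possible. Regrettably, however, LDA cannot compare the similarities and differences in topics across multiple groups of texts. By contrast, the hierarchical Dirichlet process mixture model can compare multiple groups of data simultaneously. This project therefore proposes the hierarchical Dirichlet process mixture model as the topic model for comparing topics across multiple groups of texts. To examine this idea, the project holds that the model should achieve three goals. First, it should reveal topic differences between groups of texts. Second, it should reveal topic consistency between groups of texts. Third, it should capture the trajectory of topic change over time. The pilot study shows that the model can indeed detect differences between the topics of two groups of texts. Building on the pilot study, the project plans to use three different databases to examine the feasibility of each of the three goals, and hopes thereby to offer suggestions for improving tools for analyzing textual data.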
Text analysis has become a powerful tool thanks to progress in probability models and computational algorithms. It can extract not only the structural features of texts (e.g., keywords) but also their content features (i.e., topics). A model used to extract the topics of texts is called a topic model. The best-known topic model is LDA (Latent Dirichlet Allocation), which, based on the Dirichlet process, treats a text as a collection of topics and each topic as a collection of words. LDA assumes that each topic is a distribution over the words in it and that different topics correspond to different distributions. The parameters of each topic distribution are estimated by Bayesian inference. However, LDA cannot compare the topics of different sets of texts. To extend the utility of topic models in the social sciences, this proposal suggests substituting the hierarchical Dirichlet process mixture model (HDPMM) for LDA, as its hierarchical structure allows different data sets to share the same base models. To examine how well HDPMM, used as a topic model, can compare topics across different text sets, three main goals are proposed. First, HDPMM should be able to reveal differences in topics between text sets. Second, HDPMM should be able to reveal consistency in topics between text sets. Third, HDPMM should be able to reveal changes in topics over time within the same text set. A pilot study supports this idea, showing that HDPMM can distinguish between the topics of two text sets. Following the pilot study, HDPMM will be applied to three different databases to examine the plausibility of using HDPMM as a topic model.
Relation Ministry of Science and Technology (MOST), MOST109-2410-H004-078, 109.08-110.07 (ROC calendar; 2020.08-2021.07)
Type report
dc.contributor Department of Psychology
dc.creator (Author) 楊立行
dc.date (Date) 2022-03
dc.date.accessioned 28-May-2025 14:06:55 (UTC+8)
dc.date.available 28-May-2025 14:06:55 (UTC+8)
dc.date.issued (Uploaded) 28-May-2025 14:06:55 (UTC+8)
dc.identifier.uri (URI) https://nccur.lib.nccu.edu.tw/handle/140.119/157128
dc.format.extent 116 bytes
dc.format.mimetype text/html
dc.relation (Relation) Ministry of Science and Technology (MOST), MOST109-2410-H004-078, 109.08-110.07
dc.subject (Keywords) Text analysis; Topic model; Hierarchical Dirichlet process mixture model
dc.subject (Keywords) Text Analysis; LDA; Hierarchical Dirichlet Process Mixture Model
dc.title (Title) 以階層式狄利克雷歷程混合模型進行文本分析
dc.title (Title) Application of Hierarchical Dirichlet Process Mixture Model in Text Analysis
dc.type (Type) report