An Exploration of Integrating Clustering and BERT Models for Document Recommendation (綜合分群技術與 BERT 模型於文件推薦的探索) | Academic Output

Title	An Exploration of Integrating Clustering and BERT Models for Document Recommendation (綜合分群技術與 BERT 模型於文件推薦的探索)
Author	Chen, Yun (陳筠)
Contributors	Liu, Chao-Lin (劉昭麟); Chen, Yun (陳筠)
Keywords	Deep learning; BERT; document embeddings; semi-supervised clustering
Date	2024
Upload time	1-Mar-2024 13:41:20 (UTC+8)
Abstract	Selecting content of interest from a large collection usually requires people to browse the data or to label it for classification. Clustering, which groups similar documents together, is a faster and cheaper alternative. To find similar documents more efficiently for document recommendation, this study vectorizes documents with fine-tuned BERT models and clusters the embeddings with K-means. It also experiments with "seed clustering", in which the initial centroids are specified, aiming for effective clustering of unlabeled data from only a few hints. The experiments show that clustering on fine-tuned BERT embeddings far outperforms clustering on embeddings from BERT without fine-tuning and on TF-IDF vectors. However, K-means clustering of BERT embeddings proved highly stable: repeated runs produced nearly identical results. This stability also carried over to seed clustering, so the seed clustering methods tested in this study improved the clustering only marginally. Future work could therefore keep fine-tuned BERT embeddings as the document representation while trying other clustering and seed clustering methods.
Description	Master's degree; National Chengchi University; Department of Computer Science; 109753140
Source	http://thesis.lib.nccu.edu.tw/record/#G0109753140
Type	thesis
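
The abstract describes "seed clustering" as K-means with appointed initial centroids. The following is a minimal sketch of that idea, not the thesis's implementation: it is a plain-Python K-means written for illustration, the 2-D vectors are toy stand-ins for BERT document embeddings, and the names `kmeans`, `embeddings`, and `seed_indices` are hypothetical.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(points: List[Vector], init_centroids: List[Vector],
           max_iter: int = 100) -> List[int]:
    """Lloyd's K-means starting from the given initial centroids."""
    centroids = [list(c) for c in init_centroids]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[k] = mean(members)
    return labels


# Toy stand-ins for document embeddings (assumption: two topics, 2-D vectors).
embeddings = [
    (0.9, 0.1), (1.0, 0.2), (0.8, 0.0),  # documents on topic A
    (0.1, 0.9), (0.2, 1.0), (0.0, 0.8),  # documents on topic B
]

# Seed clustering: one hand-picked document per cluster serves as the
# initial centroid, so cluster 0 is anchored on topic A and 1 on topic B.
seed_indices = [0, 3]
seeded_labels = kmeans(embeddings, [embeddings[i] for i in seed_indices])

# Unseeded baseline: initial centroids are drawn at random from the data.
rng = random.Random(0)
random_labels = kmeans(embeddings, rng.sample(embeddings, 2))

print("seeded:", seeded_labels)
print("random:", random_labels)
```

With seeds, the cluster identifiers are tied to the chosen documents, which is what lets a few hints steer an otherwise unlabeled clustering. When the embeddings already separate the topics well, random initialization tends to reach the same partition, which is consistent with the abstract's finding that seeding changed little.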

dc.contributor.advisor	劉昭麟	zh_TW
dc.contributor.advisor	Liu, Chao-Lin	en_US
dc.contributor.author (Author)	陳筠	zh_TW
dc.contributor.author (Author)	Chen, Yun	en_US
dc.creator (Author)	陳筠	zh_TW
dc.creator (Author)	Chen, Yun	en_US
dc.date (Date)	2024	en_US
dc.date.accessioned	1-Mar-2024 13:41:20 (UTC+8)	-
dc.date.available	1-Mar-2024 13:41:20 (UTC+8)	-
dc.date.issued (Upload time)	1-Mar-2024 13:41:20 (UTC+8)	-
dc.identifier (Other identifier)	G0109753140	en_US
dc.identifier.uri (URI)	https://nccur.lib.nccu.edu.tw/handle/140.119/150166	-
dc.description (Description)	Master's degree (碩士)	zh_TW
dc.description (Description)	National Chengchi University (國立政治大學)	zh_TW
dc.description (Description)	Department of Computer Science (資訊科學系)	zh_TW
dc.description (Description)	109753140	zh_TW
dc.description.tableofcontents	Chapter 1: Introduction (Section 1: Research Motivation; Section 2: Objectives; Section 3: Main Contributions; Section 4: Thesis Structure). Chapter 2: Related Work and Literature (Section 1: BERT Embeddings and Clustering; Section 2: Clustering Techniques). Chapter 3: Methodology (Section 1: Research Framework; Section 2: Corpora; Section 3: Preprocessing; Section 4: Vectorization and Embeddings; Section 5: Clustering; Section 6: Evaluation Metrics). Chapter 4: Evaluation and Analysis of Results (Section 1: Fine-Tuned BERT Results; Section 2: Clustering Results). Chapter 5: Conclusion. Chapter 6: Future Work. References. Appendix 1: Degree Examination Record and Thesis Revisions.	zh_TW
dc.format.extent	8987457 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0109753140	en_US
dc.subject (Keywords)	Deep learning	en_US
dc.subject (Keywords)	BERT	en_US
dc.subject (Keywords)	document embeddings	en_US
dc.subject (Keywords)	semi-supervised clustering	en_US
dc.title (Title)	綜合分群技術與 BERT 模型於文件推薦的探索	zh_TW
dc.title (Title)	An Exploration of Integrating Clustering and BERT Models for Document Recommendation	en_US
dc.type (Type)	thesis	en_US
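
The abstract reports that repeated K-means runs on BERT embeddings gave nearly identical results. One way to put a number on that kind of run-to-run stability is the Rand index, which compares two labelings pair by pair and ignores how the clusters happen to be numbered. This is a sketch under stated assumptions: the record does not say which stability measure the thesis used, and the labelings `run_1`, `run_2`, and `run_3` are made-up illustrations rather than results from the study.

```python
from itertools import combinations
from typing import Sequence


def rand_index(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Fraction of item pairs on which two clusterings agree.

    A pair counts as agreement when both clusterings put the two items in
    the same cluster, or both put them in different clusters.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical labelings of six documents from three clustering runs.
run_1 = [0, 0, 0, 1, 1, 1]
run_2 = [1, 1, 1, 0, 0, 0]  # same partition as run_1, cluster ids swapped
run_3 = [0, 0, 1, 1, 1, 1]  # one document moved to the other cluster

print(rand_index(run_1, run_2))  # identical partitions give 1.0
print(rand_index(run_1, run_3))  # 10 of the 15 pairs agree
```

A value near 1.0 across many runs with different random initializations would indicate the high stability the abstract describes, and it would also suggest that changing only the starting centroids has little room to change the outcome.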
