Title: 基於BERT模型的專利相似度計算:以台灣金融科技專利為例
Title (English): Using BERT to Analyze Patent Similarity: A Study of Taiwan’s FinTech Patents
Author: Tsai, Meng-Chun (蔡孟純)
Advisors: Sung, Huang-Chih (宋皇志); Tsai, Yen-Lung (蔡炎龍)
Keywords: Patent Retrieval; Natural Language Processing; Deep Learning; BERT; Chinese Patents
Date: 2023
Uploaded: 6-Jul-2023 16:22:52 (UTC+8)

Abstract
With rapid technological advancement and intensified global competition, the value and protection of patents are increasingly emphasized across industries. This holds particularly true in FinTech, where traditional financial institutions face significant competitive threats from technology companies, compelling them to reassess their competitiveness and innovative capabilities. Patents serve not only as vital means of protecting technological innovation but also as key elements for businesses to gain a competitive edge and market share. In particular, patent retrieval provides crucial business and legal information that helps companies make decisions and protect their interests. However, ever-expanding patent databases and the technical terminology and complex language of patent documents make patent retrieval costly in time and demanding in professional expertise.

This research uses the BERT model to propose a more accurate and efficient method for patent similarity calculation, aiming to ease the workload of patent professionals. To further explore the applicability of deep learning models to Chinese texts, the study focuses on FinTech patents in Taiwan and employs BERT-Base-Chinese. Patent claims and titles serve as the primary inputs to the model; after BERT transforms them into representative vectors, similarity retrieval analysis is performed on those vectors. To evaluate the experimental results, two categories of test data are extracted from a total dataset of 13,478 patents: a set of patents with citations (2,123 records) and a set of rejected patent cases (640 records).

The experimental results show that the Chinese text vectors produced by BERT preserve semantics to a certain extent. In the testing phase, both test sets perform well in the early part of the ranking, particularly before the first quartile of the dataset: similarity retrieval can filter out approximately 99% of the data and surface the critical 1%. However, the experiment reveals limitations in ranking performance and standard deviation values in the middle and later stages. Overall, this study confirms the potential of the BERT model for Chinese patent retrieval. It can serve as an auxiliary tool that helps patent searchers find relevant patent documents more effectively under limited conditions, reducing the risk of overlooking important patents.
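The abstract describes the retrieval pipeline only at a high level: patent titles and claims are encoded into vectors with BERT-Base-Chinese, and candidate patents are then ranked by similarity (the table of contents below names cosine similarity as the measure). The following is a minimal sketch of that idea, assuming the Hugging Face transformers library; the mean pooling, the 512-token truncation, and the helper names embed and rank_by_similarity are illustrative assumptions, not the thesis's actual implementation, which additionally applies its own self-attention weighting (Section 3.3.3 in the contents below).

```python
# Minimal sketch: encode a patent's title + claims with bert-base-chinese and
# rank candidate patents by cosine similarity. Pooling, truncation length, and
# helper names are illustrative choices, not the thesis's exact pipeline.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Encode text into one vector by mean-pooling BERT's last hidden states."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)  # shape: (768,)

def rank_by_similarity(query_text: str, corpus: dict) -> list:
    """Rank corpus entries (patent id -> title + claims text) by cosine similarity."""
    q = embed(query_text)
    scored = []
    for patent_id, text in corpus.items():
        score = torch.nn.functional.cosine_similarity(q, embed(text), dim=0).item()
        scored.append((patent_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy usage; in the study the corpus would be the 13,478 Taiwanese FinTech patents.
corpus = {
    "TW-A": "行動支付系統 一種利用生物特徵驗證之行動支付方法……",
    "TW-B": "保險理賠審核裝置 一種自動化保險理賠審核之系統……",
}
print(rank_by_similarity("一種行動支付之身分驗證方法", corpus))
```

In practice the corpus vectors would be computed once and cached, so each query requires only a single forward pass through the model.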
Description
Degree: Master's
Institution: National Chengchi University (國立政治大學)
Department: 科技管理與智慧財產研究所 (Institute of Technology Management and Intellectual Property)
Student ID: 109364101
Source: http://thesis.lib.nccu.edu.tw/record/#G0109364101
Type: thesis
Identifier: G0109364101
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/145744
Format: application/pdf, 1041695 bytes

Table of Contents
Acknowledgements
Chinese Abstract
Abstract
Contents
List of Tables
List of Figures
1 Introduction
1.1 Background and Significance of the Study
1.2 Statement of the Purpose
1.3 Statement of the Problem
2 Literature Review
2.1 Patent retrieval methods and techniques
2.1.1 Typical patent retrieval methods
2.1.2 International research trends
2.2 Natural Language Processing and Deep Learning Models
2.3 Summary
3 Methodology
3.1 Deep Learning
3.1.1 Seq2Seq and Attention
3.1.2 Transformer
3.1.3 BERT
3.2 Data Collection
3.2.1 Data Source
3.2.2 Data Preprocessing
3.2.3 Test Dataset
3.3 Research Framework
3.3.1 Overview of Methods
3.3.2 BERT-Base-Chinese
3.3.3 Self-Attention Mechanisms Used in Experiments
3.3.4 Cosine Similarity Calculation
3.3.5 Evaluation Metrics
4 Results Analysis
4.1 Initial Experiment Results
4.2 Test Dataset 1 – Citation Patent Set
4.2.1 Statistical Performance
4.2.2 Ranking Value N & Accuracy
4.3 Test Dataset 2 – Rejection Patent Set
4.3.1 Statistical Performance
4.3.2 Ranking Value N & Accuracy
4.4 Summary
5 Conclusion
5.1 Pedagogical Implications
5.2 Limitations of the study
5.3 Directions for Future Research
Bibliography
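The abstract evaluates retrieval quality by where known related patents (cited patents and rejection cases) land in the ranked results, reporting for example that roughly 99% of the corpus can be filtered out. As a purely illustrative companion to the "Cosine Similarity Calculation" and "Ranking Value N & Accuracy" sections listed above, here is a hedged sketch of such a rank-based check on a precomputed similarity matrix; the function relative_ranks and the quantile summary are hypothetical and not the thesis's exact metric.

```python
# Illustrative rank-based evaluation: for each query patent, find where its known
# related patent (e.g. a cited or rejection-cited patent) ranks among all candidates,
# and summarize how much of the corpus could be filtered out before reaching it.
# Function and variable names are hypothetical.
import numpy as np

def relative_ranks(sim: np.ndarray, related: list) -> np.ndarray:
    """sim[i, j]: similarity of query i to corpus patent j.
    related[i]: corpus index of the known related patent for query i.
    Returns each related patent's rank as a fraction of the corpus size
    (0.01 means it appears within the top 1% of results)."""
    n_queries, n_corpus = sim.shape
    fractions = []
    for i, target in enumerate(related):
        order = np.argsort(-sim[i])                       # best-scoring patents first
        rank = int(np.where(order == target)[0][0]) + 1   # 1-based rank of the target
        fractions.append(rank / n_corpus)
    return np.array(fractions)

# Toy usage with random scores; the study's corpus has 13,478 patents and
# 2,123 / 640 query-target pairs in its two test sets.
rng = np.random.default_rng(0)
sim = rng.random((5, 1000))
targets = [3, 10, 42, 7, 500]
fr = relative_ranks(sim, targets)
print("first quartile of rank fractions:", np.quantile(fr, 0.25))
print("median fraction of the corpus to inspect:", np.median(fr))
```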