
Title 中文訴訟文書檢索系統雛形實作
A Prototype of Information Services for Chinese Judicial Documents
Author 藍家樑
Lan, Chia Liang
Contributors 劉昭麟
Liu, Chao Lin
藍家樑
Lan, Chia Liang
Keywords Legal information systems
Natural language processing
Hierarchical clustering
k-nearest neighbor method
Date 2008
Upload time 19-Sep-2009 12:12:11 (UTC+8)
Abstract The number of court judgments keeps growing, and reading all of them is clearly impractical, so users need a more capable retrieval system. Building on earlier research, we implement a prototype of a clustering-based retrieval system for Chinese judicial documents. The system searches for cases that match the query and presents the results in clusters, so that users can examine each cluster in turn, focus on the information they need, and skip the rest. This reduces the reading burden while still giving a fairly complete picture. We also provide document marking and annotation functions so that users can build a personalized database for later retrieval.
When the query is a keyword, the system groups the search results with hierarchical clustering and, using an index built on co-occurring terms (collocations), lists possibly related terms that the user can query next. The query can also be a description of the facts of a crime: the system applies the k-nearest neighbor method to find similar cases and groups them by cause of action (the prosecution reason). In addition, users can view the distribution of prison sentence lengths and retrieve the documents that fall within a specific interval.
A formal experiment is hard to conduct because the system is interactive and there is no clear standard for judging whether it is useful. We therefore evaluated it through the efficiency of human subjects operating the system, the similarity within the resulting clusters, and statistics on the prison sentence lengths in the clustering results. From these we discuss how the system helps users and where it still needs improvement.
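The abstract describes classifying a crime description by cause of action with a k-nearest neighbor approach and returning the similar cases. The record does not say which features, weights, or similarity measure the thesis uses, so the following is only a minimal sketch under stated assumptions: TF-IDF weights, cosine similarity, and a majority vote among the k neighbors. The function names, the toy English tokens (standing in for segmented Chinese words), and the cause labels are all hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF dicts (assumed weighting)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_cause(query_tokens, docs, causes, k=3):
    """Return the majority cause among the k most similar cases, plus their indices."""
    vectors, idf = tf_idf_vectors(docs)
    # Terms unseen in the collection carry no weight in this sketch.
    q = {t: c * idf[t] for t, c in Counter(query_tokens).items() if t in idf}
    ranked = sorted(range(len(docs)),
                    key=lambda i: cosine(q, vectors[i]),
                    reverse=True)[:k]
    votes = Counter(causes[i] for i in ranked)
    return votes.most_common(1)[0][0], ranked


# Hypothetical toy collection: each case is a list of tokens with a known cause.
docs = [
    ["steal", "wallet", "store"],
    ["steal", "bicycle", "night"],
    ["hit", "victim", "injury"],
    ["punch", "victim", "injury", "bar"],
    ["forge", "document", "seal"],
]
causes = ["theft", "theft", "assault", "assault", "forgery"]

cause, neighbors = knn_cause(["steal", "bicycle", "store"], docs, causes, k=3)
print(cause, neighbors)
```

The returned neighbor indices play the role of the "similar cases" shown to the user, and the voted label is the group they would be filed under. A real system would replace the toy tokens with the output of a Chinese word segmenter.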
Description Master's
National Chengchi University
Department of Computer Science
95753042
97
Source http://thesis.lib.nccu.edu.tw/record/#G0957530421
Type thesis
dc.contributor.advisor 劉昭麟 (zh_TW)
dc.contributor.advisor Liu, Chao Lin (en_US)
dc.contributor.author (Authors) 藍家樑 (zh_TW)
dc.contributor.author (Authors) Lan, Chia Liang (en_US)
dc.creator (Author) 藍家樑 (zh_TW)
dc.creator (Author) Lan, Chia Liang (en_US)
dc.date (Date) 2008 (en_US)
dc.date.accessioned 19-Sep-2009 12:12:11 (UTC+8)
dc.date.available 19-Sep-2009 12:12:11 (UTC+8)
dc.date.issued (Upload time) 19-Sep-2009 12:12:11 (UTC+8)
dc.identifier (Other Identifiers) G0957530421 (en_US)
dc.identifier.uri (URI) https://nccur.lib.nccu.edu.tw/handle/140.119/37123
dc.description (Description) Master's (zh_TW)
dc.description (Description) National Chengchi University (zh_TW)
dc.description (Description) Department of Computer Science (zh_TW)
dc.description (Description) 95753042 (zh_TW)
dc.description (Description) 97 (zh_TW)
dc.description.tableofcontents Chapter 1 Introduction
1.1 Research Background and Motivation
1.2 Research Methods and Results
1.3 Thesis Organization
Chapter 2 Related Work
2.1 Document Clustering Retrieval
2.2 Legal Document Classification
2.3 Legal Information Retrieval Systems
Chapter 3 Background Knowledge and Data Sources
3.1 Introduction to Criminal Judgment Documents
3.2 Data Sources and Preprocessing
Chapter 4 System Requirements and Functional Design
4.1 System Requirements
4.2 System Functions and User Interface
4.3 Overview of the System Architecture
Chapter 5 Related Techniques
5.1 Building the Lucene Index
5.2 Converting Judgment Documents into Document Feature Vector Files
5.3 Hierarchical Clustering Algorithm
5.4 Clustering of Similar Cases
5.5 Extraction and Clustering of Prison Sentence Lengths
5.6 Building the Co-occurring Term Index
Chapter 6 Evaluation and Discussion of the Cause-of-Action Classification Algorithm
6.1 Experimental Design
6.2 Experimental Results and Discussion
Chapter 7 System Performance Evaluation
7.1 Program Execution Efficiency
7.2 Evaluation of Similarity-Based Clustering Retrieval
7.2.1 User Operation Efficiency
7.2.2 Cluster Similarity Evaluation
7.3 Evaluation of Similar-Case Retrieval
7.4 Evaluation of the Sentencing-Assistance Retrieval Function
Chapter 8 Conclusion and Future Work
8.1 Conclusion
8.2 Future Work
Appendix I Part-of-Speech Tag Set of the Academia Sinica Balanced Corpus
Appendix II Target Judgment Documents for the User Operation Efficiency Experiment
Appendix III TermSpotter Term List
Appendix IV Prison Sentence Statistics for Similar-Case Retrieval (Different Numbers of Returned Cases)
dc.format.extent 352975 bytes
dc.format.extent 164211 bytes
dc.format.extent 382365 bytes
dc.format.extent 368525 bytes
dc.format.extent 406074 bytes
dc.format.extent 498070 bytes
dc.format.extent 726705 bytes
dc.format.extent 938478 bytes
dc.format.extent 977484 bytes
dc.format.extent 741424 bytes
dc.format.extent 630935 bytes
dc.format.extent 726248 bytes
dc.format.extent 421854 bytes
dc.format.extent 1971169 bytes
dc.format.mimetype application/pdf
dc.language.iso en_US
dc.source.uri (Source) http://thesis.lib.nccu.edu.tw/record/#G0957530421 (en_US)
dc.subject (Keywords) Legal information systems (zh_TW)
dc.subject (Keywords) Natural language processing (zh_TW)
dc.subject (Keywords) Hierarchical clustering (zh_TW)
dc.subject (Keywords) k-nearest neighbor method (zh_TW)
dc.title (Title) 中文訴訟文書檢索系統雛形實作 (zh_TW)
dc.title (Title) A Prototype of Information Services for Chinese Judicial Documents (en_US)
dc.type (Type) thesis (en)