Title 文件距離為基礎kNN分群技術與新聞事件偵測追蹤之研究
A study of relative text-distance-based kNN clustering technique and news events detection and tracking
Author 陳柏均
Chen, Po Chun
Contributors 楊建民 (advisor)
陳柏均
Chen, Po Chun
Keywords Text Mining
kNN
Events Detection and Tracking
Classification and Clustering
Date 2010
Uploaded 4-Sep-2013 16:59:37 (UTC+8)
Abstract A news event can be described as "a set of similar news articles on the same topic within a given time interval". Most news articles cover only a fragment of a complete event, and their content varies with the stance of the outlet and the angle of the reporter; in addition, the sheer volume of news makes it much harder to grasp the full picture of an event. This research therefore uses text mining techniques to cluster related news articles into events and so increase the value that news provides.
Classification and clustering are common steps in text mining and are the main methods this research uses to group news into events. The k-nearest neighbor (kNN) search is one of the most common classification algorithms, but it must compare each article against every other article and sort the results before it can select the nearest neighbors, which creates a performance bottleneck in practice. This research proposes RTD-based kNN (Relative Text-Distance-based kNN), a method that establishes a base point as a distance reference in the vector space. Every document is positioned by its relative distance to the base point, so that the more likely candidate documents can be shortlisted by this relative relationship before the k nearest neighbors are selected. Using relative distance in this way reduces the number of comparisons and improves efficiency.
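The reference-point idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: it assumes Euclidean distance between document vectors and uses the triangle inequality, |d(q, b) - d(x, b)| <= d(q, x), as the filtering rule. The class name `ReferencePointIndex`, the helper `euclidean`, and the `comparisons` counter are hypothetical names introduced for illustration.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ReferencePointIndex:
    """Hypothetical index that stores each document's distance to one base point."""

    def __init__(self, docs, base):
        self.docs = docs
        self.base = base
        # Distance from every document to the base point, computed once.
        self.ref = [euclidean(d, base) for d in docs]

    def knn(self, query, k):
        """Return (neighbors, comparisons).

        neighbors is a list of (distance, index) pairs, nearest first;
        comparisons counts the full query-to-document distances computed.
        """
        q_ref = euclidean(query, self.base)
        # Visit documents whose base distance is closest to the query's first.
        order = sorted(range(len(self.docs)),
                       key=lambda i: abs(self.ref[i] - q_ref))
        heap = []  # max-heap via negated distances; holds the best k so far
        comparisons = 0
        for i in order:
            # Triangle inequality: the true distance is at least this large.
            lower_bound = abs(self.ref[i] - q_ref)
            if len(heap) == k and lower_bound > -heap[0][0]:
                break  # all remaining documents have an even larger bound
            d = euclidean(query, self.docs[i])
            comparisons += 1
            if len(heap) < k:
                heapq.heappush(heap, (-d, i))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, i))
        neighbors = sorted((-neg_d, i) for neg_d, i in heap)
        return neighbors, comparisons
```

In this sketch the per-query sort is over scalar differences only, which is cheap compared with computing full distances between high-dimensional document vectors. How well the filter prunes depends on where the base point sits relative to the documents.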
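This research samples 62 events (742 news articles in total) from Google News and uses their grouping as the basis for testing and evaluation, comparing the performance of RTD-based kNN and kNN in clustering news events. The results show that the base point is better built from commonly used vocabulary and that merging clusters after the initial clustering helps improve the results. With no significant difference in F-measure between RTD-based kNN and kNN (α=0.05), the computation time of RTD-based kNN is 28.13% lower than that of kNN. This indicates that RTD-based kNN offers a better method for clustering news events. Finally, this research suggests some directions for future work.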
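The evaluation above compares the two methods by F-measure. The abstract does not say which variant was used, so the sketch below assumes one common choice, a pairwise F-measure: two articles count as a true positive when they share both a reference event and a predicted cluster. The function name `pairwise_f_measure` and its argument names are hypothetical.

```python
from itertools import combinations


def pairwise_f_measure(true_labels, cluster_labels):
    """Pairwise precision, recall and F1 of a clustering against reference events."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_event = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_cluster and same_event:
            tp += 1  # pair correctly grouped together
        elif same_cluster:
            fp += 1  # pair grouped together but from different events
        elif same_event:
            fn += 1  # pair from the same event but split apart
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

For example, `pairwise_f_measure(["a", "a", "a", "b", "b"], [0, 0, 1, 1, 1])` has two correctly grouped pairs, two wrongly grouped pairs, and two wrongly split pairs, so precision, recall, and F1 are all 0.5.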
Description Master's degree
National Chengchi University
Graduate Institute of Information Management
98356015
99
Source http://thesis.lib.nccu.edu.tw/record/#G0098356015
Type thesis
dc.contributor.advisor 楊建民 [zh_TW]
dc.contributor.author (Authors) 陳柏均 [zh_TW]
dc.contributor.author (Authors) Chen, Po Chun [en_US]
dc.creator (Author) 陳柏均 [zh_TW]
dc.creator (Author) Chen, Po Chun [en_US]
dc.date (Date) 2010 [en_US]
dc.date.accessioned 4-Sep-2013 16:59:37 (UTC+8)
dc.date.available 4-Sep-2013 16:59:37 (UTC+8)
dc.date.issued (Uploaded) 4-Sep-2013 16:59:37 (UTC+8)
dc.identifier (Other Identifiers) G0098356015 [en_US]
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/60218
dc.description (Description) Master's degree
dc.description (Description) National Chengchi University
dc.description (Description) Graduate Institute of Information Management
dc.description (Description) 98356015
dc.description (Description) 99
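dc.description.tableofcontents Chapter 1 Introduction
Section 1 Research Background
Section 2 Research Motivation
Section 3 Research Objectives
Chapter 2 Literature Review
Section 1 Data Mining
2.1.1 Definition of Data Mining
2.1.2 Common Data Mining Methods
Section 2 Text Mining
2.2.1 Definition of Text Mining
2.2.2 Word Segmentation and Weight Calculation
2.2.3 Use of the Vector Space Model (VSM)
2.2.4 Similarity Calculation
2.2.5 Classification Techniques
2.2.6 Clustering Techniques
Section 3 k-Nearest Neighbor Algorithm (kNN)
2.3.1 kNN Classification in Text Mining
2.3.2 kNN for News Event Detection and Tracking
Chapter 3 Research Method and Design
Section 1 Research Design
Section 2 RTD-based kNN Algorithm
3.2.1 Description of kNN Classification
3.2.2 Problems with kNN
3.2.3 The Concept of Reference Distance
Section 3 Merging of Clustering Results
Section 4 News Detection and Tracking
Section 5 Experimental Procedure and Content
Section 6 Evaluation Method
Section 7 News Sources and Characteristics
Chapter 4 Experimental Results
Section 1 Establishing the Base Point
Section 2 Event Detection Threshold
Section 3 Document Similarity Threshold
Section 4 Increasing the Value of k
Section 5 Differences Before and After Merging
Section 6 Comparison with kNN
Chapter 5 Conclusions and Future Outlook
Section 1 Conclusions and Suggestions
Section 2 Future Outlook
References
Appendix A: Google News Sources and Events
Appendix B: Event Clustering Results of RTD-based kNN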
dc.format.extent 1023455 bytes
dc.format.mimetype application/pdf
dc.language.iso en_US
dc.source.uri (Source) http://thesis.lib.nccu.edu.tw/record/#G0098356015 [en_US]
dc.subject (Keywords) Text Mining [en_US]
dc.subject (Keywords) kNN [en_US]
dc.subject (Keywords) Events Detection and Tracking [en_US]
dc.subject (Keywords) Classification and Clustering [en_US]
dc.title (Title) 文件距離為基礎kNN分群技術與新聞事件偵測追蹤之研究 [zh_TW]
dc.title (Title) A study of relative text-distance-based kNN clustering technique and news events detection and tracking [en_US]
dc.type (Type) thesis [en]