Applying the MMB Algorithm to Clean Web Page Noise and Extract Web Page Metadata

駱思安; 徐俊傑

Please use this identifier to cite or link to this item: https://ah.lib.nccu.edu.tw/handle/140.119/113845

Title:	Applying the MMB Algorithm to Clean Web Page Noise and Extract Web Page Metadata
Authors:	駱思安; 徐俊傑
Keywords:	Multimembership Bayesian Algorithm; Web Mining; Web Page Cleaning; Information Extraction; Metadata; TF/IDF; Entropy
Date:	2006
Upload time:	19-Oct-2017
Abstract:	Traditional methods for extracting important terms from Web pages rely mainly on TF/IDF and Entropy. We find, however, that a high TF value does not mean a term is important, and that although Entropy discriminates well, its computation is cumbersome. This study therefore proposes the MMB algorithm as a replacement for both methods. Experiments show that the MMB algorithm improves both the probability of identifying important terms and the accuracy of automatic Web page classification. A Web site contains a large number of terms distributed across its pages. Some of these terms describe the category a page belongs to and can be used to classify it; the rest are noise unrelated to that category. Effectively removing these noisy terms can therefore improve the performance of automatic classification of Chinese Web pages.
Relation:	Proceedings of the TANET 2006 Taiwan Internet Conference, Internet Technology
Type:	conference
Appears in Collections:	Conference Papers
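
The abstract names TF/IDF and Entropy as the two baseline term-weighting methods. The sketch below illustrates them on a toy corpus, to show why a frequent term is not necessarily an important one. It is not the MMB algorithm, which this record does not describe, and the formulas are common textbook variants, not necessarily the ones used in the paper. The function names, the `pages` corpus, and its tokens are all hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: tf * idf} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return scores


def term_entropy(docs):
    """Return {term: entropy of the term's spread over documents, scaled to [0, 1]}.

    A value near 1 means the term is spread evenly over all documents
    (typical of navigation text); 0 means it occurs in a single document.
    """
    n_docs = len(docs)
    per_doc = [Counter(doc) for doc in docs]
    totals = Counter()
    for counts in per_doc:
        totals.update(counts)
    result = {}
    for term, total in totals.items():
        h = 0.0
        for counts in per_doc:
            if counts[term]:
                p = counts[term] / total
                h -= p * math.log(p)
        result[term] = h / math.log(n_docs) if n_docs > 1 else 0.0
    return result


# Hypothetical tokenized pages of one site: "home" and "login" are
# navigation noise repeated on every page.
pages = [
    ["home", "login", "python", "tutorial"],
    ["home", "login", "stock", "market"],
    ["home", "login", "python", "python"],
]

weights = tf_idf(pages)
entropy = term_entropy(pages)
for term in ("home", "python", "tutorial"):
    print(term, round(weights[0][term], 3), round(entropy[term], 3))
```

In this toy corpus, "home" occurs on every page, so its idf is zero and its scaled entropy is at the maximum, while "tutorial" occurs on one page only and has zero entropy. The entropy pass has to visit every page for every term, which is one way to read the abstract's remark that the Entropy computation is cumbersome.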

Files in This Item:

File	Description	Size	Format
598.pdf		390.25 kB	Adobe PDF
