Publications-Theses

Article View/Open

Publication Export

Google ScholarTM

NCCU Library

Citation Infomation

Related Publications in TAIR

題名 在高度分散式環境下進行Top-k相似文件檢索
Similar Top-k documents retrieval in highly distributed environments
作者 王俊閎
Wang, Chun Hung
貢獻者 陳良弼
Chen, Arbee L.P.
王俊閎
Wang, Chun Hung
關鍵詞 分散式環境
Tok-k 相似文件檢索
端對端網路
distributed environments
similar top-k documents retrieval
peer-to-peer network
日期 2012
上傳時間 3-Dec-2012 11:27:23 (UTC+8)
摘要 在文件資料庫的查詢處理上,Top-k相似文件查詢主要是協助使用者可以從龐大的文件集合中,檢索出和查詢文件具有高度相關性的文件集合。將資料庫內的文件依據和查詢文件之相似度程度,選擇出相似度最高的前k篇文件回傳給使用者。然而過去集中式資料庫,因其覆蓋性和可擴充性的不足,使得這種排名傾向的文件查詢處理,需耗費大量時間及運算成本。近年來,使用端對端(Peer-to-peer, P2P)架構解決相關的文件檢索問題已成為一種趨勢,但在高度分散式環境下,支援排名傾向的相似文件查詢是困難的,因為缺乏全域資訊和適當的系統協調者。
在本研究中,我們先針對各節點資料庫作分群前處理,並提出一個利用區域切割的作法[1],將P2P環境劃分成數個子區塊後,建立特徵索引表。因此在查詢處理時,可透過索引表加快挑選出Top-k相似群集的速度,並且確保有適當數量的回傳結果。最後在實驗中,我們提出的方法會與傳統集中式搜尋引擎以及SON-based [1] 做比較,在高度分散式環境下,我們的方法在執行Top-k相似文件查詢時,會比上述兩種作法有較為優異的表現。
On query processing in a large database, similar top-k documents query is an important mechanism to retrieve the highly correlated document collection with query for users. It ranks documents with a similarity ranking function and reports the k documents with highest similarity. However, the former approach in web searching, i.e., centralized search engines, rises some issues such as lack of coverage and scalability, impact provides rank-based query become a costly operation. Recently, using Peer-to-peer (P2P) architectures to tackle above issues has emerged as a trend of solution, but due to the shortage of global knowledge and some appropriate central coordinators, support rank-based query in highly distributed environment has been difficulty.
In this paper, we proposed a framework to solve these problems. First, we performed the local cluster pre-processing on each peer, followed by the zone creation process, forming sub-zones over P2P network, and then constructing the feature index table to improve the performance of selecting similar top-k cluster results. The experiments show that our approach performs similar top-k documents query outperforms than SON-based approach in highly distributed environment.
參考文獻 [1] Christos Doulkeridis, Kjetil Nørvåg, Michalis Vazirgiannis. 2008. Peer-to-peer similarity search over widely distributed document collections. LSDS-IR 35-42.
[2] Stoica, I., Morris, R., Karger, D., Kaashoek, M.F., Balakrishnan, H. 2001. Chord : A scalable peer-to-peer lookup service for internet applications. In Proceedings of the ACM SIGCOMM 149-160.
[3] Ratnasamy, S., Francis, P., Handley, M., Karp, R., Schenker, S. 2001. A scalable contentaddressable network. In Proceedings of the ACM SIGCOMM 161-172.
[4] Rowstron, A., Druschel, P. 2001. Pastry : Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proceedings of the Middleware
[5] Chunqiang Tang, Zhichen Xu, Sandhya Dwarkadas. 2003. Peer-to-peer information retrieval using self-organizing semantic overlay networks. In Proceedings of the ACM SIGCOMM 175-186.
[6] BitTorrent. .
[7] eMula. .
[8] Beverly Yang, Hector Garcia-Molina. 2003. Designing a Super-Peer Network. ICDE 49-60.
[9] The Gnutella protocol specification v0.6. .
[10] KaZaA. .
[11] Salton, G., Wong, A., Yang, C.S. 1975. A vector space model for automatic indexing. Communications of the ACM Volume 18 Issue 11 613-620.
[12] Bernard J. Jansen, Soumen Chakrabarti. 2006. Mining the Web : Discovering Knowledge from Hypertext Data. Morgan-Kaufmann Publishers, 352 pp., ISBN: 1-55860-754-4. Inf. Process. Manage. (IPM) 42(1) 317-318.
[13] Christos. Doulkeridis, Kjetil Nørvåg, and Michalis Vazirgiannis. 2007. DESENT: Decentralized and distributed semantic overlay generation in P2P networks. IEEE Journal on Selected Areas in Communications (J-SAC) 25(1) 25–34.
[14] Hersh, W.R., Buckley, C., J.Leone, T., Hickam, D.H. 1994. Ohsumed: An interactive retrieval evaluation and new large test collection for research. In Proceedings of the ACM SIGIR. 192–201
[15] GT-ITM : Georgia Tech Internetwork Topology Models. .
[16] Wolf-Tilo Balke, Wolfgang Nejdl, Wolf Siberski, Uwe Thaden. 2005. Progressive Distributed Top k Retrieval in Peer-to-Peer Networks. ICDE 174-185.
[17] Wolf-Tilo Balke. 2005. Supporting Information Retrieval in Peer-to-Peer Systems. Peer-to-Peer Systems and Applications 337-352.
[18] C. Gkantsidis, M. Mihail, and A. Saberi. 2005. Hybrid search schemes for unstructured peer-to-peer networks. In Proceedings of INFOCOM.
[19] Inderjit S. Dhillon, Dharmendra S. Modha. 2001. Concept Decompositions for Large Sparse Text Data Using Clustering. Machine Learning 42(1/2): 143-175.
[20] Akrivi Vlachou, Christos Doulkeridis, Kjetil Nørvåg, Michalis Vazirgiannis. 2008. On efficient top-k query processing in highly distributed environments. SIGMOD 753-764.
[21] Shiwei Zhu, Junjie Wu, Hui Xiong, Guoping Xia. 2011. Scaling up top-K cosine similarity search. Data Knowl. Eng. (DKE) 70(1) 60-83.
[22] Aoying Zhou, Rong Zhang, Weining Qian, Quang Hieu Vu, Tianming Hu. 2008. Adaptive indexing for content-based search in P2P systems. Data Knowl. Eng. (DKE) 67(3) 381-398.
描述 碩士
國立政治大學
資訊科學學系
99753034
101
資料來源 http://thesis.lib.nccu.edu.tw/record/#G0099753034
資料類型 thesis
dc.contributor.advisor 陳良弼zh_TW
dc.contributor.advisor Chen, Arbee L.P.en_US
dc.contributor.author (Authors) 王俊閎zh_TW
dc.contributor.author (Authors) Wang, Chun Hungen_US
dc.creator (作者) 王俊閎zh_TW
dc.creator (作者) Wang, Chun Hungen_US
dc.date (日期) 2012en_US
dc.date.accessioned 3-Dec-2012 11:27:23 (UTC+8)-
dc.date.available 3-Dec-2012 11:27:23 (UTC+8)-
dc.date.issued (上傳時間) 3-Dec-2012 11:27:23 (UTC+8)-
dc.identifier (Other Identifiers) G0099753034en_US
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/56330-
dc.description (描述) 碩士zh_TW
dc.description (描述) 國立政治大學zh_TW
dc.description (描述) 資訊科學學系zh_TW
dc.description (描述) 99753034zh_TW
dc.description (描述) 101zh_TW
dc.description.abstract (摘要) 在文件資料庫的查詢處理上,Top-k相似文件查詢主要是協助使用者可以從龐大的文件集合中,檢索出和查詢文件具有高度相關性的文件集合。將資料庫內的文件依據和查詢文件之相似度程度,選擇出相似度最高的前k篇文件回傳給使用者。然而過去集中式資料庫,因其覆蓋性和可擴充性的不足,使得這種排名傾向的文件查詢處理,需耗費大量時間及運算成本。近年來,使用端對端(Peer-to-peer, P2P)架構解決相關的文件檢索問題已成為一種趨勢,但在高度分散式環境下,支援排名傾向的相似文件查詢是困難的,因為缺乏全域資訊和適當的系統協調者。
在本研究中,我們先針對各節點資料庫作分群前處理,並提出一個利用區域切割的作法[1],將P2P環境劃分成數個子區塊後,建立特徵索引表。因此在查詢處理時,可透過索引表加快挑選出Top-k相似群集的速度,並且確保有適當數量的回傳結果。最後在實驗中,我們提出的方法會與傳統集中式搜尋引擎以及SON-based [1] 做比較,在高度分散式環境下,我們的方法在執行Top-k相似文件查詢時,會比上述兩種作法有較為優異的表現。
zh_TW
dc.description.abstract (摘要) On query processing in a large database, similar top-k documents query is an important mechanism to retrieve the highly correlated document collection with query for users. It ranks documents with a similarity ranking function and reports the k documents with highest similarity. However, the former approach in web searching, i.e., centralized search engines, rises some issues such as lack of coverage and scalability, impact provides rank-based query become a costly operation. Recently, using Peer-to-peer (P2P) architectures to tackle above issues has emerged as a trend of solution, but due to the shortage of global knowledge and some appropriate central coordinators, support rank-based query in highly distributed environment has been difficulty.
In this paper, we proposed a framework to solve these problems. First, we performed the local cluster pre-processing on each peer, followed by the zone creation process, forming sub-zones over P2P network, and then constructing the feature index table to improve the performance of selecting similar top-k cluster results. The experiments show that our approach performs similar top-k documents query outperforms than SON-based approach in highly distributed environment.
en_US
dc.description.tableofcontents 第一章 導論及研究動機 1
第二章 相關研究 3
2.1 端對端網路的相關研究 3
2.1.1 超級節點網路 3
2.1.2 層疊式網路 5
2.2 內文檢索的相關研究 6
2.2.1 向量空間模型 6
2.2.2 語義層疊式網路之內文檢索方法 7
第三章 方法描述與實作 11
3.1 文章前處理 11
3.1.1 分群的概念 13
3.1.2 分群演算法 14
3.2 特徵索引表建立 16
3.2.1 挑選啟動節點的想法 17
3.2.2 執行區域切割的方法 19
3.2.3 建立特徵索引表演算法 20
3.3 查詢處理 23
第四章 實驗方法與驗證 25
4.1 實驗設計 26
4.1.1 分群演算法實驗 28
4.1.2 準確率與查詢成本比較實驗 28
4.2 實驗結果 30
第五章 結論 34
參考文獻 35
zh_TW
dc.language.iso en_US-
dc.source.uri (資料來源) http://thesis.lib.nccu.edu.tw/record/#G0099753034en_US
dc.subject (關鍵詞) 分散式環境zh_TW
dc.subject (關鍵詞) Tok-k 相似文件檢索zh_TW
dc.subject (關鍵詞) 端對端網路zh_TW
dc.subject (關鍵詞) distributed environmentsen_US
dc.subject (關鍵詞) similar top-k documents retrievalen_US
dc.subject (關鍵詞) peer-to-peer networken_US
dc.title (題名) 在高度分散式環境下進行Top-k相似文件檢索zh_TW
dc.title (題名) Similar Top-k documents retrieval in highly distributed environmentsen_US
dc.type (資料類型) thesisen
dc.relation.reference (參考文獻) [1] Christos Doulkeridis, Kjetil Nørvåg, Michalis Vazirgiannis. 2008. Peer-to-peer similarity search over widely distributed document collections. LSDS-IR 35-42.
[2] Stoica, I., Morris, R., Karger, D., Kaashoek, M.F., Balakrishnan, H. 2001. Chord : A scalable peer-to-peer lookup service for internet applications. In Proceedings of the ACM SIGCOMM 149-160.
[3] Ratnasamy, S., Francis, P., Handley, M., Karp, R., Schenker, S. 2001. A scalable contentaddressable network. In Proceedings of the ACM SIGCOMM 161-172.
[4] Rowstron, A., Druschel, P. 2001. Pastry : Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proceedings of the Middleware
[5] Chunqiang Tang, Zhichen Xu, Sandhya Dwarkadas. 2003. Peer-to-peer information retrieval using self-organizing semantic overlay networks. In Proceedings of the ACM SIGCOMM 175-186.
[6] BitTorrent. .
[7] eMula. .
[8] Beverly Yang, Hector Garcia-Molina. 2003. Designing a Super-Peer Network. ICDE 49-60.
[9] The Gnutella protocol specification v0.6. .
[10] KaZaA. .
[11] Salton, G., Wong, A., Yang, C.S. 1975. A vector space model for automatic indexing. Communications of the ACM Volume 18 Issue 11 613-620.
[12] Bernard J. Jansen, Soumen Chakrabarti. 2006. Mining the Web : Discovering Knowledge from Hypertext Data. Morgan-Kaufmann Publishers, 352 pp., ISBN: 1-55860-754-4. Inf. Process. Manage. (IPM) 42(1) 317-318.
[13] Christos. Doulkeridis, Kjetil Nørvåg, and Michalis Vazirgiannis. 2007. DESENT: Decentralized and distributed semantic overlay generation in P2P networks. IEEE Journal on Selected Areas in Communications (J-SAC) 25(1) 25–34.
[14] Hersh, W.R., Buckley, C., J.Leone, T., Hickam, D.H. 1994. Ohsumed: An interactive retrieval evaluation and new large test collection for research. In Proceedings of the ACM SIGIR. 192–201
[15] GT-ITM : Georgia Tech Internetwork Topology Models. .
[16] Wolf-Tilo Balke, Wolfgang Nejdl, Wolf Siberski, Uwe Thaden. 2005. Progressive Distributed Top k Retrieval in Peer-to-Peer Networks. ICDE 174-185.
[17] Wolf-Tilo Balke. 2005. Supporting Information Retrieval in Peer-to-Peer Systems. Peer-to-Peer Systems and Applications 337-352.
[18] C. Gkantsidis, M. Mihail, and A. Saberi. 2005. Hybrid search schemes for unstructured peer-to-peer networks. In Proceedings of INFOCOM.
[19] Inderjit S. Dhillon, Dharmendra S. Modha. 2001. Concept Decompositions for Large Sparse Text Data Using Clustering. Machine Learning 42(1/2): 143-175.
[20] Akrivi Vlachou, Christos Doulkeridis, Kjetil Nørvåg, Michalis Vazirgiannis. 2008. On efficient top-k query processing in highly distributed environments. SIGMOD 753-764.
[21] Shiwei Zhu, Junjie Wu, Hui Xiong, Guoping Xia. 2011. Scaling up top-K cosine similarity search. Data Knowl. Eng. (DKE) 70(1) 60-83.
[22] Aoying Zhou, Rong Zhang, Weining Qian, Quang Hieu Vu, Tianming Hu. 2008. Adaptive indexing for content-based search in P2P systems. Data Knowl. Eng. (DKE) 67(3) 381-398.
zh_TW