學術產出-學位論文
文章檢視/開啟
書目匯出
-
題名 由搜尋引擎使用記錄探勘與時序相關的多義查詢
Identifying Time Sensitive Ambiguous Query from Query Logs of Search Engines作者 徐國獻
Hsu, Kuo Hsien貢獻者 沈錳坤
Shan, Man Kwan
徐國獻
Hsu, Kuo Hsien關鍵詞 搜尋引擎
搜尋意圖
與時序相關的查詢關鍵詞
多義的查詢關鍵詞
search engine
search intent
time sensitive query
ambiguous query日期 2012 上傳時間 1-七月-2014 12:14:57 (UTC+8) 摘要 在使用搜尋引擎時,同一個查詢關鍵詞可能會有不同的語意。舉例來說,「蘋果」這個查詢關鍵詞可以代表「蘋果日報」也可以代表「蘋果電腦」,這就是所謂的「多義查詢」。除此之外,一個查詢關鍵詞的多個語意在每一個時段會有不同的使用比例,例如:當使用者在早上搜尋「蘋果」的時候,有比較高的比例是想要去查詢「蘋果日報」,比較少的比例想要看到「蘋果電腦」;在下午搜尋「蘋果」的時候,則是比較想要看到與「蘋果電腦」相關的商品及網站。我們將這種查詢關鍵詞稱為「與時序相關的多義查詢」。 本研究主要的目的就是透過搜尋引擎使用記錄,探勘出這種查詢關鍵詞 。我們一共提出了兩種方法,第一種方法是針對每個查詢關鍵詞分別計算多義程度以及時序敏感度,並設定一個門檻值去找出超過門檻值的查詢關鍵詞;第二種方法則是對每個查詢關鍵詞的搜尋結果網頁,去計算兩兩之間的多義距離以及時序敏感距離,再透過階層式分群演算法分別建立出多義階層樹以及時序敏感階層樹,然後再分析這兩個階層樹的相似度,以找出兼具多義性及時序敏感性的查詢。 從實驗結果發現,第二種方法表現比第一種方法好。實驗的基準值,是直接使用全部的查詢關鍵詞,去重新排名整個搜尋引擎的改善效能,結果為0.75%;而使用第二種方法偵測出來的查詢關鍵詞,去重新排名的改善效能為5.61%,比基準值高,也比第一種方法3.34%高出許多。第二種方法,一共偵測出632個查詢關鍵詞,其中可以歸納出四種不同的類型:「入口網站類別」、「快速連結網站類別」、「集團網站類別」以及「同名網站類別」。 偵測出來的查詢關鍵詞,不僅可以應用在搜尋引擎的效能調整上,以及用來最佳化搜尋引擎的廣告,也可以用來幫助搜尋引擎決定要呈現哪些搜尋快現,以及搜尋快現的顯示順序。
In search engine, a query term could indicate different intents. For example, query term “apple” could stand for Apple store website and also Apple Daily newspaper website. This kind of queries are called “ambiguous query.” Through exploring the usage of search engine, we found the different intents of the same query term may be distributed in different time interval with different usage ratio. For example, more users search “apple” for apple dairy newspaper website in the morning, and search “apple” for Apple store website in the afternoon. We categorize this kind of query as “time sensitive ambiguous query.” In this thesis, we target on discovering “time sensitive ambiguous query” and propose two methods, MATISD and SHATIS. The first method, MATISD, evaluates the ambiguity degree and time sensitivity degree respectively for each query term, and finds out the time sensitive ambiguous query terms with threshold for both degrees. The second method, SHATIS, is to generate the distance between pairs of search results belonging to the same query term, and use that to build ambiguity hierarchy and time sensitivity hierarchy for each query term. Then, we identify time sensitive ambiguous query by the similarity of tree structure between ambiguity hierarchy and time sensitivity hierarchy. According to the experimental result, we found SHATIS performs better than MATISD. The baseline is to re-rank search engine by the order of users clicks with whole the query terms, and the improvement result is 0.75%. With the queries detected by SHATIS to re-rank search engine, we got improvement result 5.61%, and that’s better than baseline 0.75% and MATISD result 3.34%. SHATIS totally identifies 632 time sensitive ambiguous query terms, most of which are related to big website, such as “yahoo”, “google” and “pchome”. Some are the queries related to big company group, such as “Far East group.” Identifying time sensitive ambiguous query could be used to improve ranking performance of search engines and advertisement. It could be also used to improve the slotting and the trigger of Direct Display on search engine.參考文獻 [1] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong, Diversifying Search Results, 2nd ACM International Conference on Web Search and Data Mining WSDM, Barcelona, Spain, 2009. [2] A. Bernstein, E. Kaufmann, C. Bürki, and M. Klein, How Similar Is It? Towards Personalized Similarity Measures in Ontologies, Wirtschaftsinformatik 2005, Physica-Verlag HD, 2005. [3] G. Capannini, F. M. Nardini, R. Perego, and F. Silvestri, Efficient Diversification of Search Results using Query Logs, International Conference on World Wide Web WWW, Hyderabad, India, 2011. [4] J. G. Carbonell, and J. Goldstein, The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries, ACM SIGIR International Conference on Information Retrieval, 1998. [5] P. Clough, M. Sanderson, M. Abouammoh, S. Navarro, and M. Paramita, Multiple Approaches to Analyzing Query Diversity, ACM SIGIR International Conference on Information Retrieval, 2009. [6] W. Dakka, L. Gravano, and P. G. Ipeirotis, Answering General Time-Sensitive Queries, IEEE Transactions on Knowledge and Data Engineering, Vol. 24, Issue 2, 2011. [7] B. Edelman, M. Ostrovsky, and M. Schwarz. Internet Advertising and The Generalized Second Price Auction: Selling Billions of Dollars Worth of Keywords, American Economic Review, American Economic Association, vol. 97(1), 2007. [8] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2nd Edition, 2006. [9] R. Jones, and F. Diaz, Temporal Profiles of Queries, ACM Transactions on Information Systems, Vol. 25, No. 3, 2007. [10] Y. Kalfoglou, and M. Schorlemmer. Ontology Mapping: the State of the Art. Knowledge Engineering Review. Vol.18, No.1, 2003 [11] A. Kulkarni, J. Teevan, K. M. Svore, S. T. Dumais, Understanding Temporal Query Dynamics, 4th ACM International Conference on Web Search and Data Mining WSDM, Hong Kong, China, 2011. [12] S. Kullback, and R. A. Leibler, On Information and Sufficiency, Annals of Mathematical Statistics. Vol.22, No. 1, 1951 [13] C. D. Manning, P. Raghavan, and H. Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008. [14] A. Mehta, A. Saberi, U. Vazirani, and V. Vazirani, Adwords and Generalized Online Matching. Journal of the ACM, Vol.54, No.5, 2007 [15] D. Metzler, R. Jones, F. Peng, and R. Zhang, Improving Search Relevance for Implicitly Temporal Queries, ACM SIGIR International Conference on Information Retrieval, 2009. [16] U. Priss, Formal Concept Analysis in Information Science. ARIST. Vol.40, No.1, 2006. [17] D. Rose, and D. Levinson. Understanding User Goals in Web Search, International Conference on World Wide Web WWW, 2004. [18] R. Song, Z. Luo, J. Y. Nie, Y. Yu, and H. W. Hon, Identification of Ambiguous Queries in Web Search, Information Processing and Management, Vol. 45, Issue. 2, 2009. [19] M. J. Welch, J. Cho, and C. Olston, Search Result Diversity for Information Queries, International World Wide Web Conference WWW, Hyderabad, India, 2011. [20] J. Yang, and J. Leskovec. Patterns of Temporal Variation in Online Media, 4th ACM International Conference on Web Search and Data Mining WSDM, 2011. [21] Jaccard Distance http://en.wikipedia.org/wiki/Jaccard_index [22] Spearman’s rank correlation coefficient http://en.wikipedia.org/wiki/Spearman`s_rank-_correlation_coefficient 描述 碩士
國立政治大學
資訊科學學系
99971001
101資料來源 http://thesis.lib.nccu.edu.tw/record/#G0999710011 資料類型 thesis dc.contributor.advisor 沈錳坤 zh_TW dc.contributor.advisor Shan, Man Kwan en_US dc.contributor.author (作者) 徐國獻 zh_TW dc.contributor.author (作者) Hsu, Kuo Hsien en_US dc.creator (作者) 徐國獻 zh_TW dc.creator (作者) Hsu, Kuo Hsien en_US dc.date (日期) 2012 en_US dc.date.accessioned 1-七月-2014 12:14:57 (UTC+8) - dc.date.available 1-七月-2014 12:14:57 (UTC+8) - dc.date.issued (上傳時間) 1-七月-2014 12:14:57 (UTC+8) - dc.identifier (其他 識別碼) G0999710011 en_US dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/67157 - dc.description (描述) 碩士 zh_TW dc.description (描述) 國立政治大學 zh_TW dc.description (描述) 資訊科學學系 zh_TW dc.description (描述) 99971001 zh_TW dc.description (描述) 101 zh_TW dc.description.abstract (摘要) 在使用搜尋引擎時,同一個查詢關鍵詞可能會有不同的語意。舉例來說,「蘋果」這個查詢關鍵詞可以代表「蘋果日報」也可以代表「蘋果電腦」,這就是所謂的「多義查詢」。除此之外,一個查詢關鍵詞的多個語意在每一個時段會有不同的使用比例,例如:當使用者在早上搜尋「蘋果」的時候,有比較高的比例是想要去查詢「蘋果日報」,比較少的比例想要看到「蘋果電腦」;在下午搜尋「蘋果」的時候,則是比較想要看到與「蘋果電腦」相關的商品及網站。我們將這種查詢關鍵詞稱為「與時序相關的多義查詢」。 本研究主要的目的就是透過搜尋引擎使用記錄,探勘出這種查詢關鍵詞 。我們一共提出了兩種方法,第一種方法是針對每個查詢關鍵詞分別計算多義程度以及時序敏感度,並設定一個門檻值去找出超過門檻值的查詢關鍵詞;第二種方法則是對每個查詢關鍵詞的搜尋結果網頁,去計算兩兩之間的多義距離以及時序敏感距離,再透過階層式分群演算法分別建立出多義階層樹以及時序敏感階層樹,然後再分析這兩個階層樹的相似度,以找出兼具多義性及時序敏感性的查詢。 從實驗結果發現,第二種方法表現比第一種方法好。實驗的基準值,是直接使用全部的查詢關鍵詞,去重新排名整個搜尋引擎的改善效能,結果為0.75%;而使用第二種方法偵測出來的查詢關鍵詞,去重新排名的改善效能為5.61%,比基準值高,也比第一種方法3.34%高出許多。第二種方法,一共偵測出632個查詢關鍵詞,其中可以歸納出四種不同的類型:「入口網站類別」、「快速連結網站類別」、「集團網站類別」以及「同名網站類別」。 偵測出來的查詢關鍵詞,不僅可以應用在搜尋引擎的效能調整上,以及用來最佳化搜尋引擎的廣告,也可以用來幫助搜尋引擎決定要呈現哪些搜尋快現,以及搜尋快現的顯示順序。 zh_TW dc.description.abstract (摘要) In search engine, a query term could indicate different intents. For example, query term “apple” could stand for Apple store website and also Apple Daily newspaper website. This kind of queries are called “ambiguous query.” Through exploring the usage of search engine, we found the different intents of the same query term may be distributed in different time interval with different usage ratio. For example, more users search “apple” for apple dairy newspaper website in the morning, and search “apple” for Apple store website in the afternoon. We categorize this kind of query as “time sensitive ambiguous query.” In this thesis, we target on discovering “time sensitive ambiguous query” and propose two methods, MATISD and SHATIS. The first method, MATISD, evaluates the ambiguity degree and time sensitivity degree respectively for each query term, and finds out the time sensitive ambiguous query terms with threshold for both degrees. The second method, SHATIS, is to generate the distance between pairs of search results belonging to the same query term, and use that to build ambiguity hierarchy and time sensitivity hierarchy for each query term. Then, we identify time sensitive ambiguous query by the similarity of tree structure between ambiguity hierarchy and time sensitivity hierarchy. According to the experimental result, we found SHATIS performs better than MATISD. The baseline is to re-rank search engine by the order of users clicks with whole the query terms, and the improvement result is 0.75%. With the queries detected by SHATIS to re-rank search engine, we got improvement result 5.61%, and that’s better than baseline 0.75% and MATISD result 3.34%. SHATIS totally identifies 632 time sensitive ambiguous query terms, most of which are related to big website, such as “yahoo”, “google” and “pchome”. Some are the queries related to big company group, such as “Far East group.” Identifying time sensitive ambiguous query could be used to improve ranking performance of search engines and advertisement. It could be also used to improve the slotting and the trigger of Direct Display on search engine. en_US dc.description.tableofcontents 第一章 緒論 1 1.1 研究動機 1 1.2 研究目的 1 第二章 文獻探討 3 2.1 AMBIGUOUS QUERY 相關研究 3 2.2 SEARCH RESULT DIVERSITY 相關研究 4 2.3 TIME SENSITIVE QUERY 相關研究 4 第三章 研究方法 7 3.1 方法概述 7 3.2 資料說明及資料前處理 8 3.3 方法一:MATISD 9 3.3.1 AMBIGUITY DEGREE 10 3.3.2 TIME SENSITIVITY DEGREE 11 3.4 方法二:SHATIS 14 3.4.1 HIERARCHICAL CLUSTERING 15 3.4.2 AMBIGUITY HIERARCHY 16 3.4.3 TIME SENSITIVITY HIERARCHY 19 3.4.4 HIERARCHY SIMILARITY 20 第四章 研究結果 25 4.1 實驗資料及設定 25 4.2 實驗設計 25 4.3 實驗結果 27 4.3.1 方法一:MATISD 27 4.3.2 方法二:SHATIS 29 4.4 實際案例 29 第五章 結論與未來研究 38 參考文獻 40 zh_TW dc.language.iso en_US - dc.source.uri (資料來源) http://thesis.lib.nccu.edu.tw/record/#G0999710011 en_US dc.subject (關鍵詞) 搜尋引擎 zh_TW dc.subject (關鍵詞) 搜尋意圖 zh_TW dc.subject (關鍵詞) 與時序相關的查詢關鍵詞 zh_TW dc.subject (關鍵詞) 多義的查詢關鍵詞 zh_TW dc.subject (關鍵詞) search engine en_US dc.subject (關鍵詞) search intent en_US dc.subject (關鍵詞) time sensitive query en_US dc.subject (關鍵詞) ambiguous query en_US dc.title (題名) 由搜尋引擎使用記錄探勘與時序相關的多義查詢 zh_TW dc.title (題名) Identifying Time Sensitive Ambiguous Query from Query Logs of Search Engines en_US dc.type (資料類型) thesis en dc.relation.reference (參考文獻) [1] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong, Diversifying Search Results, 2nd ACM International Conference on Web Search and Data Mining WSDM, Barcelona, Spain, 2009. [2] A. Bernstein, E. Kaufmann, C. Bürki, and M. Klein, How Similar Is It? Towards Personalized Similarity Measures in Ontologies, Wirtschaftsinformatik 2005, Physica-Verlag HD, 2005. [3] G. Capannini, F. M. Nardini, R. Perego, and F. Silvestri, Efficient Diversification of Search Results using Query Logs, International Conference on World Wide Web WWW, Hyderabad, India, 2011. [4] J. G. Carbonell, and J. Goldstein, The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries, ACM SIGIR International Conference on Information Retrieval, 1998. [5] P. Clough, M. Sanderson, M. Abouammoh, S. Navarro, and M. Paramita, Multiple Approaches to Analyzing Query Diversity, ACM SIGIR International Conference on Information Retrieval, 2009. [6] W. Dakka, L. Gravano, and P. G. Ipeirotis, Answering General Time-Sensitive Queries, IEEE Transactions on Knowledge and Data Engineering, Vol. 24, Issue 2, 2011. [7] B. Edelman, M. Ostrovsky, and M. Schwarz. Internet Advertising and The Generalized Second Price Auction: Selling Billions of Dollars Worth of Keywords, American Economic Review, American Economic Association, vol. 97(1), 2007. [8] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2nd Edition, 2006. [9] R. Jones, and F. Diaz, Temporal Profiles of Queries, ACM Transactions on Information Systems, Vol. 25, No. 3, 2007. [10] Y. Kalfoglou, and M. Schorlemmer. Ontology Mapping: the State of the Art. Knowledge Engineering Review. Vol.18, No.1, 2003 [11] A. Kulkarni, J. Teevan, K. M. Svore, S. T. Dumais, Understanding Temporal Query Dynamics, 4th ACM International Conference on Web Search and Data Mining WSDM, Hong Kong, China, 2011. [12] S. Kullback, and R. A. Leibler, On Information and Sufficiency, Annals of Mathematical Statistics. Vol.22, No. 1, 1951 [13] C. D. Manning, P. Raghavan, and H. Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008. [14] A. Mehta, A. Saberi, U. Vazirani, and V. Vazirani, Adwords and Generalized Online Matching. Journal of the ACM, Vol.54, No.5, 2007 [15] D. Metzler, R. Jones, F. Peng, and R. Zhang, Improving Search Relevance for Implicitly Temporal Queries, ACM SIGIR International Conference on Information Retrieval, 2009. [16] U. Priss, Formal Concept Analysis in Information Science. ARIST. Vol.40, No.1, 2006. [17] D. Rose, and D. Levinson. Understanding User Goals in Web Search, International Conference on World Wide Web WWW, 2004. [18] R. Song, Z. Luo, J. Y. Nie, Y. Yu, and H. W. Hon, Identification of Ambiguous Queries in Web Search, Information Processing and Management, Vol. 45, Issue. 2, 2009. [19] M. J. Welch, J. Cho, and C. Olston, Search Result Diversity for Information Queries, International World Wide Web Conference WWW, Hyderabad, India, 2011. [20] J. Yang, and J. Leskovec. Patterns of Temporal Variation in Online Media, 4th ACM International Conference on Web Search and Data Mining WSDM, 2011. [21] Jaccard Distance http://en.wikipedia.org/wiki/Jaccard_index [22] Spearman’s rank correlation coefficient http://en.wikipedia.org/wiki/Spearman`s_rank-_correlation_coefficient zh_TW