學術產出-學位論文

題名 高效率常見超集合探勘演算法之研究
Efficient Algorithms for the Discovery of Frequent Superset
作者 廖忠訓
Liao, Zhung-Xun
貢獻者 沈錳坤
Shan, Man-Kwan
廖忠訓
Liao, Zhung-Xun
關鍵詞 常見超集合探勘
常見樣式探勘
關聯法則
資料探勘
Frequent Superset Mining
Frequent Pattern Mining
Association Rule
Data Mining
日期 2004
上傳時間 17-九月-2009 13:54:33 (UTC+8)
摘要 過去對於探勘常見項目集的研究僅限於找出資料庫中交易紀錄的子集合,在這篇論文中,我們提出一個新的探勘主題:常見超集合探勘。常見超集合意指它包含資料庫中各筆紀錄的筆數多於最小門檻值,而原本用來探勘常見子集合的演算法並無法直接套用,因此我們以補集合的角度,提出了三個快速的演算法來解決這個新的問題。首先為Apriori-C:此為使用先廣後深搜尋的演算法,並且以掃描資料庫的方式來決定具有相同長度之候選超集合的支持度,第二個方法是Eclat-C:此為採用先深後廣搜尋的演算法,並且搭配交集法來計算倏選超集合的支持度,最後是DCT:此方法可利用過去常見子集合探勘的演算法來進行探勘,如此可以省下開發新系統的成本。
常見超集合的探勘可以應用在電子化的遠距學習系統,生物資訊及工作排程的問題上。尤其在線上學習系統,我們可以利用常見超集合來代表一群學生的學習行為,並且藉以預測學生的學習成就,使得老師可以及時發現學生的學習迷失等行為;此外,透過常見超集合的探勘,我們也可以為學生推薦個人化的課程,以達到因材施教的教學目標。
在實驗的部份,我們比較了各演算法的效率,並且分別改變實驗資料庫的下列四種變因:1) 交易資料的筆數、2) 每筆交易資料的平均長度、3) 資料庫中項目的總數和4) 最小門檻值。在最後的分析當中,可以清楚地看出我們提出的各種方法皆十分有效率並且具有可延伸性。
The algorithms for the discovery of frequent itemset have been investigated widely. These frequent itemsets are subsets of database. In this thesis, we propose a novel mining task: mining frequent superset from the database of itemsets that is useful in bioinformatics, E-learning systems, jobshop scheduling, and so on. A frequent superset means that the number of transactions contained in it is not less than minimum support threshold. Intuitively, according to the Apriori algorithm, the level-wise discovering starts from 1-itemset, 2-itemset, and so forth. However, such steps cannot utilize the property of Apriori to reduce search space, because if an itemset is not frequent, its superset maybe frequent. In order to solve this problem, we propose three methods. The first is the Apriori-based approach, called Apriori-C. The second is the Eclat-based approach, called Eclat-C, which is a depth-first approach. The last is the proposed data complement technique (DCT) that we utilize original frequent itemset mining approach to discover frequent superset.
The experimental studies compare the performance of the proposed three methods by considering the effect of the number of transactions, the average length of transactions, the number of different items, and minimum support. The analysis shows that the proposed algorithms are time efficient and scalable.
參考文獻 [1] R. Agrawal, T. Imielinski, and A. Swami, “Mining Association Rules between Sets of Items in Large Databases,” Proc. of ACM Special Interest Group on Management of Data, ACM SIGMOD’93, 1993.
[2] R. Agrawal and R. Srikant, ”Fast Algorithms for Mining Association Rules,” Proc. of the 20th International Conference on Very Large Data Bases, VLDB’94, 1994.
[3] J. Han, J. Pei, and Y. Yin, “Mining Frequent Patterns without Candidate Generation,” Proc. of ACM Special Interest Group on Management of Data, ACM SIGMOD’00, 2000.
[4] M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, “New Algorithms for Fast Discovery of Association Rules,” Proc. of the 3rd ACM Special Interest Group on Knowledge Discovery in Data and Data Mining, ACM SIGKDD’97, 1997.
[5] R. Agarwal, C. Aggarwal, and V. Prasad, “A Tree Projection Algorithm for Generation of Frequent Itemsets,” Journal on Parallel and Distributed Computing (Special Issue on High Performance Data Mining), Vol. 61, Issue 3, 2000.
[6] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang, “H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Database,” Proc. of the 1st IEEE International Conference on Data Mining, ICDM’01, 2001.
[7] J. Park, M. Chen, and P. Yu, “An Effective Hash-Based Algorithm for Mining Association Rules,” Proc. of ACM Special Interest Group on Management of Data, ACM SIGMOD’95, 1995.
[8] J. Han and M. Kamber, “Data Mining: Concepts and Techniques,” Morgan Kaufman Publishers, 2000.
[9] J. Hipp, U. Guntzer, and G. Nakhaeizadeh, “Algorithms for Association Rule Mining – A General Survey and Comparison,” ACM SIGKDD Explorations, Vol. 2, Issue 1, 2000.
[10] D. Burdick, M. Calimlim, and J. Gehrke, “MAFIA: A Maximum Frequent Itemset Algorithm for Transactional Databases,” Proc. of the 17th International Conference on Data Engineering, ICDE’01, 2001.
[11] R. Agarwal, C. Aggarwal, and V. Prasad, “Depth First Generation of Long Patterns,” Proc. of ACM Special Interest Group on Knowledge Discovery in Data and Data Mining, ACM SIGKDD’00, 2000.
[12] M. Holsheimer, M. Kersten, H. Mannila, and H. Toivonen, “A Perspective on Databases and Data Mining,” Proc. of the 1st International Conference on Knowledge Discovery and Data Mining, KDD’95, 1995.
[13] G. Gardarin, P. Pucheral, and F. Wu, “Bitmap Based Algorithm for Mining Association Rules,” Proc. of Actes des journes Bases de Donnes Avances, 1998.
[14] A. Savasere, E. Omiecinski, and S. Navathe, “An Efficient Algorithm for Mining Association Rules in Large Databases,” Proc. of the 21st International Conference on Very Large Data Bases, VLDB’95, 1995.
[15] F. Bodon, “A Fast APRIORI Implementation,” Proc. of the 3rd IEEE International Conference on Data Mining, ICDM’03, Workshop on Frequent Itemset Mining Implementions, FIMI’03, 2003.
[16] Synthetic Data Generation Code for Associations and Sequential Patterns. http://www. almaden.ibm.com/software/quest/Resources/index.shtml Intelligent Information System, IBM Almaden Research Center.
[17] T. Urdan and C. Weggen, “Corporate E-learning: Exploring a New Frontier,” WR Hambrecht + Co, 2000.
[18] R. Beasley and M. Waugh, “The Effects of Content-Structure Focusing on Learner Structural Knowledge Acquisition, Retention, and Disorientation in a Hypermedia Environment,” Journal of Research on Computing in Education, Vol. 28, Issue 3, 1996.
[19] R. Beasley and M. Waugh, “Cognitive Mapping Architectures and Hypermedia Disorientation: An Empirical Study,” Journal of Educational Multimedia and Hypermedia, Vol. 4, Issue 2-3, 1995.
[20] C. Chang, “Cognitive Problem and Suggestions in Information Space Navigation–An Introduction and Survey,” Hua Fan Annual Journal, Vol. 4, No. 1, 1997.
[21] G. Chen, C. Liu, K. Ou, and M. Lin, “Web Learning Portfolios: A Tool for Supporting Performance Awareness,” Innovations in Education and Training International, IETI, Vol. 38, Issue 1, 2000.
[22] C. Liu, G. Chen, K. Ou, B. Liu, and J. Horng, "Managing Activity Dynamics of Web Based Collaborative Applications," International Journal of Artificial Intelligence Tools, JAIT, Vol. 8, Issue 2, 1999.
[23] D. Wang, Y. Bao, G. Yu, and G. Wang, “Using Page Classification and Association Rule mining for Personalized recommendation in distance learning,” Proc. of the 1st International Conference on Web-based learning, ICWL’02, 2002.
[24] D. Chiu, Y. Wu, and A. Chen, “An Efficient Algorithm for Mining Frequent Sequences by New Strategy without Support Counting,” Proc. of the 20th International Conference on Data Engineering, ICDE’03, 2003.
[25] R. Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc. of the 12th International Conference on Data Engineering, ICDE’95, 1995.
[26] J. Setubal and J. Meidanis, Introduction to Computational Molecular Biology, PWS Publishing Company, 1997.
[27] F. Gandon, “Ontology Engineering: a Survey and a Return on Experience,” Research Report of INRIA, RR4396, 2002.
[28] N. Noy and M. Musen, “SMART: Automated Support for Ontology Merging and Alignment,” Proc. of the 12th Banff Workshop on Knowledge Acquisition, Modeling, and Management, KAW’99, 1999.
[29] N. Noy and M. Musen, “PROMPT: Algorithm and Tool for Automated Ontology Merging and Alignment,” Proc. of the 7th National Conference on Artificial Intelligence and 12th Conference on Innovative Applications of Artificial Intelligence, AAAI’00, 2000.
[30] G. Stumme and A. Maedche, “FCA-MERGE: Bottom-Up Merging of Ontologies,” Proc. of the 17th Internation Joint Conference on Artificial Intelligence, IJCAI’01, 2001.
[31] S. Itoga, “The String Merging Problem,” BIT, Vol. 21, No. 1, 1981.
描述 碩士
國立政治大學
資訊科學學系
91753013
93
資料來源 http://thesis.lib.nccu.edu.tw/record/#G0091753013
資料類型 thesis
dc.contributor.advisor 沈錳坤zh_TW
dc.contributor.advisor Shan, Man-Kwanen_US
dc.contributor.author (作者) 廖忠訓zh_TW
dc.contributor.author (作者) Liao, Zhung-Xunen_US
dc.creator (作者) 廖忠訓zh_TW
dc.creator (作者) Liao, Zhung-Xunen_US
dc.date (日期) 2004en_US
dc.date.accessioned 17-九月-2009 13:54:33 (UTC+8)-
dc.date.available 17-九月-2009 13:54:33 (UTC+8)-
dc.date.issued (上傳時間) 17-九月-2009 13:54:33 (UTC+8)-
dc.identifier (其他 識別碼) G0091753013en_US
dc.identifier.uri (URI) https://nccur.lib.nccu.edu.tw/handle/140.119/32638-
dc.description (描述) 碩士zh_TW
dc.description (描述) 國立政治大學zh_TW
dc.description (描述) 資訊科學學系zh_TW
dc.description (描述) 91753013zh_TW
dc.description (描述) 93zh_TW
dc.description.abstract (摘要) 過去對於探勘常見項目集的研究僅限於找出資料庫中交易紀錄的子集合,在這篇論文中,我們提出一個新的探勘主題:常見超集合探勘。常見超集合意指它包含資料庫中各筆紀錄的筆數多於最小門檻值,而原本用來探勘常見子集合的演算法並無法直接套用,因此我們以補集合的角度,提出了三個快速的演算法來解決這個新的問題。首先為Apriori-C:此為使用先廣後深搜尋的演算法,並且以掃描資料庫的方式來決定具有相同長度之候選超集合的支持度,第二個方法是Eclat-C:此為採用先深後廣搜尋的演算法,並且搭配交集法來計算倏選超集合的支持度,最後是DCT:此方法可利用過去常見子集合探勘的演算法來進行探勘,如此可以省下開發新系統的成本。
常見超集合的探勘可以應用在電子化的遠距學習系統,生物資訊及工作排程的問題上。尤其在線上學習系統,我們可以利用常見超集合來代表一群學生的學習行為,並且藉以預測學生的學習成就,使得老師可以及時發現學生的學習迷失等行為;此外,透過常見超集合的探勘,我們也可以為學生推薦個人化的課程,以達到因材施教的教學目標。
在實驗的部份,我們比較了各演算法的效率,並且分別改變實驗資料庫的下列四種變因:1) 交易資料的筆數、2) 每筆交易資料的平均長度、3) 資料庫中項目的總數和4) 最小門檻值。在最後的分析當中,可以清楚地看出我們提出的各種方法皆十分有效率並且具有可延伸性。
zh_TW
dc.description.abstract (摘要) The algorithms for the discovery of frequent itemset have been investigated widely. These frequent itemsets are subsets of database. In this thesis, we propose a novel mining task: mining frequent superset from the database of itemsets that is useful in bioinformatics, E-learning systems, jobshop scheduling, and so on. A frequent superset means that the number of transactions contained in it is not less than minimum support threshold. Intuitively, according to the Apriori algorithm, the level-wise discovering starts from 1-itemset, 2-itemset, and so forth. However, such steps cannot utilize the property of Apriori to reduce search space, because if an itemset is not frequent, its superset maybe frequent. In order to solve this problem, we propose three methods. The first is the Apriori-based approach, called Apriori-C. The second is the Eclat-based approach, called Eclat-C, which is a depth-first approach. The last is the proposed data complement technique (DCT) that we utilize original frequent itemset mining approach to discover frequent superset.
The experimental studies compare the performance of the proposed three methods by considering the effect of the number of transactions, the average length of transactions, the number of different items, and minimum support. The analysis shows that the proposed algorithms are time efficient and scalable.
en_US
dc.description.tableofcontents ABSTRACT IN CHINESE...........................................i
ABSTRACT.....................................................ii
ACKNOWLEDGEMENTS............................................iii
TABLE OF CONTENTS............................................iv
LIST OF TABLES...............................................vi
LIST OF FIGURES............................................viii
CHAPTER 1 Introduction........................................1
CHAPTER 2 Review of Frequent Itemset Mining...................5
2.1 Overview..........................................5
2.2 Algorithm Apriori.................................7
2.3 Algorithm Eclat..................................11
2.4 Algorithm FP-Growth..............................14
2.5 Summary..........................................18
CHAPTER 3 Discovery of Frequent Superset.....................20
3.1 Introduction.....................................20
3.2 Baseline Method..................................21
3.3 Algorithm Apriori-C..............................25
3.4 Algorithm Eclat-C................................31
3.5 Algorithm DCT....................................33
3.6 Summary..........................................36
CHAPTER 4 Experiment Results.................................37
4.1 Introduction.....................................37
4.2 Synthetic Data Generation........................38
4.3 Performance Analysis.............................39
4.3.1 Minimum Support............................39
4.3.2 Number of Transaction......................42
4.3.3 Number of Items............................45
4.3.4 Average Size of Transactions...............52
4.3.5 Number of Candidates.......................55
CHAPTER 5 Applications.......................................57
5.1 Introduction.....................................57
5.2 Online Learning..................................58
CHAPTER 6 Conclusions........................................63
6.1 Summary..........................................63
6.2 Future Work......................................64
REFERENCE....................................................66
PUBLICATION LIST.............................................71
zh_TW
dc.format.extent 76765 bytes-
dc.format.extent 103643 bytes-
dc.format.extent 118211 bytes-
dc.format.extent 44143 bytes-
dc.format.extent 44573 bytes-
dc.format.extent 46159 bytes-
dc.format.extent 59481 bytes-
dc.format.extent 207381 bytes-
dc.format.extent 166505 bytes-
dc.format.extent 113553 bytes-
dc.format.extent 66543 bytes-
dc.format.extent 47989 bytes-
dc.format.extent 51668 bytes-
dc.format.extent 43658 bytes-
dc.format.mimetype application/pdf-
dc.format.mimetype application/pdf-
dc.format.mimetype application/pdf-
dc.format.mimetype application/pdf-
dc.format.mimetype application/pdf-
dc.format.mimetype application/pdf-
dc.format.mimetype application/pdf-
dc.format.mimetype application/pdf-
dc.format.mimetype application/pdf-
dc.format.mimetype application/pdf-
dc.format.mimetype application/pdf-
dc.format.mimetype application/pdf-
dc.format.mimetype application/pdf-
dc.format.mimetype application/pdf-
dc.language.iso en_US-
dc.source.uri (資料來源) http://thesis.lib.nccu.edu.tw/record/#G0091753013en_US
dc.subject (關鍵詞) 常見超集合探勘zh_TW
dc.subject (關鍵詞) 常見樣式探勘zh_TW
dc.subject (關鍵詞) 關聯法則zh_TW
dc.subject (關鍵詞) 資料探勘zh_TW
dc.subject (關鍵詞) Frequent Superset Miningen_US
dc.subject (關鍵詞) Frequent Pattern Miningen_US
dc.subject (關鍵詞) Association Ruleen_US
dc.subject (關鍵詞) Data Miningen_US
dc.title (題名) 高效率常見超集合探勘演算法之研究zh_TW
dc.title (題名) Efficient Algorithms for the Discovery of Frequent Superseten_US
dc.type (資料類型) thesisen
dc.relation.reference (參考文獻) [1] R. Agrawal, T. Imielinski, and A. Swami, “Mining Association Rules between Sets of Items in Large Databases,” Proc. of ACM Special Interest Group on Management of Data, ACM SIGMOD’93, 1993.zh_TW
dc.relation.reference (參考文獻) [2] R. Agrawal and R. Srikant, ”Fast Algorithms for Mining Association Rules,” Proc. of the 20th International Conference on Very Large Data Bases, VLDB’94, 1994.zh_TW
dc.relation.reference (參考文獻) [3] J. Han, J. Pei, and Y. Yin, “Mining Frequent Patterns without Candidate Generation,” Proc. of ACM Special Interest Group on Management of Data, ACM SIGMOD’00, 2000.zh_TW
dc.relation.reference (參考文獻) [4] M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, “New Algorithms for Fast Discovery of Association Rules,” Proc. of the 3rd ACM Special Interest Group on Knowledge Discovery in Data and Data Mining, ACM SIGKDD’97, 1997.zh_TW
dc.relation.reference (參考文獻) [5] R. Agarwal, C. Aggarwal, and V. Prasad, “A Tree Projection Algorithm for Generation of Frequent Itemsets,” Journal on Parallel and Distributed Computing (Special Issue on High Performance Data Mining), Vol. 61, Issue 3, 2000.zh_TW
dc.relation.reference (參考文獻) [6] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang, “H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Database,” Proc. of the 1st IEEE International Conference on Data Mining, ICDM’01, 2001.zh_TW
dc.relation.reference (參考文獻) [7] J. Park, M. Chen, and P. Yu, “An Effective Hash-Based Algorithm for Mining Association Rules,” Proc. of ACM Special Interest Group on Management of Data, ACM SIGMOD’95, 1995.zh_TW
dc.relation.reference (參考文獻) [8] J. Han and M. Kamber, “Data Mining: Concepts and Techniques,” Morgan Kaufman Publishers, 2000.zh_TW
dc.relation.reference (參考文獻) [9] J. Hipp, U. Guntzer, and G. Nakhaeizadeh, “Algorithms for Association Rule Mining – A General Survey and Comparison,” ACM SIGKDD Explorations, Vol. 2, Issue 1, 2000.zh_TW
dc.relation.reference (參考文獻) [10] D. Burdick, M. Calimlim, and J. Gehrke, “MAFIA: A Maximum Frequent Itemset Algorithm for Transactional Databases,” Proc. of the 17th International Conference on Data Engineering, ICDE’01, 2001.zh_TW
dc.relation.reference (參考文獻) [11] R. Agarwal, C. Aggarwal, and V. Prasad, “Depth First Generation of Long Patterns,” Proc. of ACM Special Interest Group on Knowledge Discovery in Data and Data Mining, ACM SIGKDD’00, 2000.zh_TW
dc.relation.reference (參考文獻) [12] M. Holsheimer, M. Kersten, H. Mannila, and H. Toivonen, “A Perspective on Databases and Data Mining,” Proc. of the 1st International Conference on Knowledge Discovery and Data Mining, KDD’95, 1995.zh_TW
dc.relation.reference (參考文獻) [13] G. Gardarin, P. Pucheral, and F. Wu, “Bitmap Based Algorithm for Mining Association Rules,” Proc. of Actes des journes Bases de Donnes Avances, 1998.zh_TW
dc.relation.reference (參考文獻) [14] A. Savasere, E. Omiecinski, and S. Navathe, “An Efficient Algorithm for Mining Association Rules in Large Databases,” Proc. of the 21st International Conference on Very Large Data Bases, VLDB’95, 1995.zh_TW
dc.relation.reference (參考文獻) [15] F. Bodon, “A Fast APRIORI Implementation,” Proc. of the 3rd IEEE International Conference on Data Mining, ICDM’03, Workshop on Frequent Itemset Mining Implementions, FIMI’03, 2003.zh_TW
dc.relation.reference (參考文獻) [16] Synthetic Data Generation Code for Associations and Sequential Patterns. http://www. almaden.ibm.com/software/quest/Resources/index.shtml Intelligent Information System, IBM Almaden Research Center.zh_TW
dc.relation.reference (參考文獻) [17] T. Urdan and C. Weggen, “Corporate E-learning: Exploring a New Frontier,” WR Hambrecht + Co, 2000.zh_TW
dc.relation.reference (參考文獻) [18] R. Beasley and M. Waugh, “The Effects of Content-Structure Focusing on Learner Structural Knowledge Acquisition, Retention, and Disorientation in a Hypermedia Environment,” Journal of Research on Computing in Education, Vol. 28, Issue 3, 1996.zh_TW
dc.relation.reference (參考文獻) [19] R. Beasley and M. Waugh, “Cognitive Mapping Architectures and Hypermedia Disorientation: An Empirical Study,” Journal of Educational Multimedia and Hypermedia, Vol. 4, Issue 2-3, 1995.zh_TW
dc.relation.reference (參考文獻) [20] C. Chang, “Cognitive Problem and Suggestions in Information Space Navigation–An Introduction and Survey,” Hua Fan Annual Journal, Vol. 4, No. 1, 1997.zh_TW
dc.relation.reference (參考文獻) [21] G. Chen, C. Liu, K. Ou, and M. Lin, “Web Learning Portfolios: A Tool for Supporting Performance Awareness,” Innovations in Education and Training International, IETI, Vol. 38, Issue 1, 2000.zh_TW
dc.relation.reference (參考文獻) [22] C. Liu, G. Chen, K. Ou, B. Liu, and J. Horng, "Managing Activity Dynamics of Web Based Collaborative Applications," International Journal of Artificial Intelligence Tools, JAIT, Vol. 8, Issue 2, 1999.zh_TW
dc.relation.reference (參考文獻) [23] D. Wang, Y. Bao, G. Yu, and G. Wang, “Using Page Classification and Association Rule mining for Personalized recommendation in distance learning,” Proc. of the 1st International Conference on Web-based learning, ICWL’02, 2002.zh_TW
dc.relation.reference (參考文獻) [24] D. Chiu, Y. Wu, and A. Chen, “An Efficient Algorithm for Mining Frequent Sequences by New Strategy without Support Counting,” Proc. of the 20th International Conference on Data Engineering, ICDE’03, 2003.zh_TW
dc.relation.reference (參考文獻) [25] R. Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc. of the 12th International Conference on Data Engineering, ICDE’95, 1995.zh_TW
dc.relation.reference (參考文獻) [26] J. Setubal and J. Meidanis, Introduction to Computational Molecular Biology, PWS Publishing Company, 1997.zh_TW
dc.relation.reference (參考文獻) [27] F. Gandon, “Ontology Engineering: a Survey and a Return on Experience,” Research Report of INRIA, RR4396, 2002.zh_TW
dc.relation.reference (參考文獻) [28] N. Noy and M. Musen, “SMART: Automated Support for Ontology Merging and Alignment,” Proc. of the 12th Banff Workshop on Knowledge Acquisition, Modeling, and Management, KAW’99, 1999.zh_TW
dc.relation.reference (參考文獻) [29] N. Noy and M. Musen, “PROMPT: Algorithm and Tool for Automated Ontology Merging and Alignment,” Proc. of the 7th National Conference on Artificial Intelligence and 12th Conference on Innovative Applications of Artificial Intelligence, AAAI’00, 2000.zh_TW
dc.relation.reference (參考文獻) [30] G. Stumme and A. Maedche, “FCA-MERGE: Bottom-Up Merging of Ontologies,” Proc. of the 17th Internation Joint Conference on Artificial Intelligence, IJCAI’01, 2001.zh_TW
dc.relation.reference (參考文獻) [31] S. Itoga, “The String Merging Problem,” BIT, Vol. 21, No. 1, 1981.zh_TW