學術產出-Theses

Article View/Open

Publication Export

Google ScholarTM

政大圖書館

Citation Infomation

題名 以規則為基礎的分類演算法:應用粗糙集
A Rule-Based classification algorithm: a rough set approach
作者 廖家奇
Liao, Chia Chi
貢獻者 徐國偉
Hsu, Kuo Wei
廖家奇
Liao, Chia Chi
關鍵詞 資料探勘
分類
粗糙集
規則學習
規則歸納
data mining
classification
rough set
rule learning
separate-and-conquer
rule induction
日期 2011
上傳時間 30-Oct-2012 11:28:18 (UTC+8)
摘要 在本論文中,我們提出了一個以規則為基礎的分類演算法,名為ROUSER(ROUgh SEt Rule),它利用粗糙集理論作為搜尋啟發的基礎,進而建立規則。我們使用一個已經被廣泛利用的工具實作ROUSER,也使用數個公開資料集對它進行實驗,並將它應用於真實世界的案例。
本論文的初衷可被追溯到一個真實世界的案例,而此案例的目標是從感應器所蒐集的資料中找出與機械故障之間的關聯。為了能支援機械故障的根本原因分析,我們設計並實作了一個以規則為基礎的分類演算法,它所產生的模型是由人類可理解的決策規則所組成,而故障的徵兆與原因則被決策規則所連結。此外,資料中存在著矛盾。舉例而言,不同時間點所蒐集的兩筆紀錄極為相似、甚至相同(除了時間戳記),但其中一筆紀錄與機械故障相關,另一筆則否。本案例的挑戰在於分析矛盾的資料。我們使用粗糙集理論克服這個難題,因為它可以處理不完美知識。
研究者們已經提出了各種不同的分類演算法,而實踐者們則已經將它們應用於各種領域,然而多數分類演算法的設計並不強調演算法所產生模型的可解釋性與可理解性。ROUSER的設計是專門從名目資料中萃取人類可理解的決策規則。而ROUSER與其它多數規則分類演算法不同的地方是利用粗糙集方法選取特徵。ROUSER也提供了數種方式來選擇合宜的屬性與值配對,作為規則的前項。此外,ROUSER的規則產生方法是基於separate-and-conquer策略,因此比其它基於粗糙集的分類演算法所廣泛採用的不可分辨矩陣方法還有效率。
我們進行延伸實驗來驗證ROUSER的能力。對於名目資料的實驗裡,ROUSER在半數的結果中的準確率可匹敵、甚至勝過其他以規則為基礎的分類演算法以及決策樹分類演算法。ROUSER也可以在一些離散化的資料集之中達到可匹敵甚至超越的準確率。我們也提供了內建的特徵萃取方法與其它方法的比較的實驗結果,以及數種用來決定規則前項的方法的實驗結果。
In this thesis, we propose a rule-based classification algorithm named ROUSER (ROUgh SEt Rule), which uses the rough set theory as the basis of the search heuristics in the process of rule generation. We implement ROUSER using a well developed and widely used toolkit, evaluate it using several public data sets, and examine its applicability using a real-world case study.
The origin of the problem addressed in this thesis can be traced back to a real-world problem where the goal is to determine whether a data record collected from a sensor corresponds to a machine fault. In order to assist in the root cause analysis of the machine faults, we design and implement a rule-based classification algorithm that can generate models consisting of human understandable decision rules to connect symptoms to the cause. Moreover, there are contradictions in data. For example, two data records collected at different time points are similar, or the same (except their timestamps), while one is corresponding to a machine fault but not the other. The challenge is to analyze data with contradictions. We use the rough set theory to overcome the challenge, since it is able to process imperfect knowledge.
Researchers have proposed various classification algorithms and practitioners have applied them to various application domains, while most of the classification algorithms are designed without a focus on interpretability or understandability of the models built using the algorithms. ROUSER is specifically designed to extract human understandable decision rules from nominal data. What distinguishes ROUSER from most, if not all, other rule-based classification algorithms is that it utilizes a rough set approach to select features. ROUSER also provides several ways to decide an appropriate attribute-value pair for the antecedents of a rule. Moreover, the rule generation method of ROUSER is based on the separate-and-conquer strategy, and hence it is more efficient than the indiscernibility matrix method that is widely adopted in the classification algorithms based on the rough set theory.
We conduct extensive experiments to evaluate the capability of ROUSER. On about half of the nominal data sets considered in experiments, ROUSER can achieve comparable or better accuracy than do classification algorithms that are able to generate decision rules or trees. On some of the discretized data sets, ROUSER can achieve comparable or better accuracy. We also present the results of the experiments on the embedded feature selection method and several ways to decide an appropriate attribute-value pair for the antecedents of a rule.
參考文獻 [1] J. G. Bazan, H. S. Nguyen, S. H. Nguyen, P. Synak, J. Wróblewski, and blewski, "Rough set algorithms in classification problem," in Rough set methods and applications, ed: Physica-Verlag GmbH, 2000, pp. 49-88.
[2] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming: Athena Scientific, 1996.
[3] W.W. Cohen, “Fast Effective Rule Induction,” Proc. 12th Int`l Conf. Machine Learning (ICML), pp. 115-123, 1995.
[4] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.
[5] J. Dai, Q. Xu, and W. Wang, "A comparative study on strategies of rule induction for incomplete data based on rough set approach," International Journal of Advancements in Computing Technology, vol. 3, p. 176–183, 2011.
[6] U. M. Fayyad, K. B. Irani, “Multi-interval discretization of continuous-valued attributes for classification learning”, presented at the Proceedings of 13th international joint conference on Artificial intelligence, 1022-1027, 1993.
[7] J. Fürnkranz, "Separate-and-Conquer Rule Learning," Artif. Intell. Rev., vol. 13, pp. 3-54, 1999.
[8] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," SIGKDD Explor. Newsl., vol. 11, pp. 10-18, 2009.
[9] L. Huan and R. Setiono, "Chi2: feature selection and discretization of numeric attributes," in Tools with Artificial Intelligence, 1995. Proceedings of IEEE Seventh International Conference on, 1995, pp. 388-391.
[10] J. C. Huhn and E. Hullermeier, "FR3: A Fuzzy Rule Learner for Inducing Reliable Classifiers," Fuzzy Systems, IEEE Transactions on, vol. 17, pp. 138-149, 2009.
[11] W. Jiabing, Z. Pei, W. Guihua, and W. Jia, "Classifying Categorical Data by Rule-Based Neighbors," in Data Mining (ICDM), 2011 IEEE 11th International Conference on, 2011, pp. 1248-1253.
[12] X. Jin, A. Xu, R. Bie, and P. Guo, "Machine Learning Techniques and Chi-Square Feature Selection for Cancer Classification Using SAGE Gene Expression Profiles Data Mining for Biomedical Applications." vol. 3916, J. Li, Q. Yang, and A.-H. Tan, Eds., ed: Springer Berlin / Heidelberg, 2006, pp. 106-115.
[13] M. T. Mitchell, Machine Learning, 1997 :McGraw-Hill
[14] G. Pagallo and D. Haussler, "Boolean Feature Discovery in Empirical Learning," Machine Learning, vol. 5, pp. 71-99, 1990.
[15] Z. Pawlak, "Some Issues on Rough Sets,” Transactions on Rough Sets I, vol. 3100, J. Peters, A. Skowron, J. Grzymala-Busse, B. Kostek, R. Swiniarski, and M. Szczuka, Eds., ed: Springer Berlin / Heidelberg, 2004, pp. 1-58.
[16] Z. Pawlak, A. Skowron, "Rudiments of rough sets", Information Sciences, vol.177, no.1, pp.3-27, 2007.
[17] J. R. Quinlan, "Induction of Decision Trees," Machine Learning, vol. 1, pp. 81-106, 1986.
[18] J. R. Quinlan, C4.5: programs for machine learning: Morgan Kaufmann Publishers Inc., 1993.
[19] J. Stefanowski and K. Slowiński, "Rough Set Theory and Rule Induction Techniques For Discovery of Attribute Dependencies in Medical Information Systems,” Principles of Data Mining and Knowledge Discovery. vol. 1263, J. Komorowski and J. Zytkow, Eds., ed: Springer Berlin / Heidelberg, 1997, pp. 36-46.
[20] S. M. Weiss and N. Indurkhya, "Reduced complexity rule induction," presented at the Proceedings of the 12th international joint conference on Artificial intelligence - Volume 2, Sydney, New South Wales, Australia, 1991.
[21] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. McLachlan, A. Ng, B. Liu, P. Yu, Z.-H. Zhou, M. Steinbach, D. Hand, and D. Steinberg, "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, pp. 1-37, 2008.
[22] L. A. Zadeh, "Fuzzy Sets," Information and Control, vol. 8, pp. 338–353, 1965.
[23] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[24] "Data Mining Curriculum". ACM SIGKDD. 2006-04-30. Retrieved 2011-10-28.
描述 碩士
國立政治大學
資訊科學學系
99753005
100
資料來源 http://thesis.lib.nccu.edu.tw/record/#G0099753005
資料類型 thesis
DOI http://dx.doi.org/10.1109/CyberneticsCom.2012.6381605
dc.contributor.advisor 徐國偉zh_TW
dc.contributor.advisor Hsu, Kuo Weien_US
dc.contributor.author (Authors) 廖家奇zh_TW
dc.contributor.author (Authors) Liao, Chia Chien_US
dc.creator (作者) 廖家奇zh_TW
dc.creator (作者) Liao, Chia Chien_US
dc.date (日期) 2011en_US
dc.date.accessioned 30-Oct-2012 11:28:18 (UTC+8)-
dc.date.available 30-Oct-2012 11:28:18 (UTC+8)-
dc.date.issued (上傳時間) 30-Oct-2012 11:28:18 (UTC+8)-
dc.identifier (Other Identifiers) G0099753005en_US
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/54649-
dc.description (描述) 碩士zh_TW
dc.description (描述) 國立政治大學zh_TW
dc.description (描述) 資訊科學學系zh_TW
dc.description (描述) 99753005zh_TW
dc.description (描述) 100zh_TW
dc.description.abstract (摘要) 在本論文中,我們提出了一個以規則為基礎的分類演算法,名為ROUSER(ROUgh SEt Rule),它利用粗糙集理論作為搜尋啟發的基礎,進而建立規則。我們使用一個已經被廣泛利用的工具實作ROUSER,也使用數個公開資料集對它進行實驗,並將它應用於真實世界的案例。
本論文的初衷可被追溯到一個真實世界的案例,而此案例的目標是從感應器所蒐集的資料中找出與機械故障之間的關聯。為了能支援機械故障的根本原因分析,我們設計並實作了一個以規則為基礎的分類演算法,它所產生的模型是由人類可理解的決策規則所組成,而故障的徵兆與原因則被決策規則所連結。此外,資料中存在著矛盾。舉例而言,不同時間點所蒐集的兩筆紀錄極為相似、甚至相同(除了時間戳記),但其中一筆紀錄與機械故障相關,另一筆則否。本案例的挑戰在於分析矛盾的資料。我們使用粗糙集理論克服這個難題,因為它可以處理不完美知識。
研究者們已經提出了各種不同的分類演算法,而實踐者們則已經將它們應用於各種領域,然而多數分類演算法的設計並不強調演算法所產生模型的可解釋性與可理解性。ROUSER的設計是專門從名目資料中萃取人類可理解的決策規則。而ROUSER與其它多數規則分類演算法不同的地方是利用粗糙集方法選取特徵。ROUSER也提供了數種方式來選擇合宜的屬性與值配對,作為規則的前項。此外,ROUSER的規則產生方法是基於separate-and-conquer策略,因此比其它基於粗糙集的分類演算法所廣泛採用的不可分辨矩陣方法還有效率。
我們進行延伸實驗來驗證ROUSER的能力。對於名目資料的實驗裡,ROUSER在半數的結果中的準確率可匹敵、甚至勝過其他以規則為基礎的分類演算法以及決策樹分類演算法。ROUSER也可以在一些離散化的資料集之中達到可匹敵甚至超越的準確率。我們也提供了內建的特徵萃取方法與其它方法的比較的實驗結果,以及數種用來決定規則前項的方法的實驗結果。
zh_TW
dc.description.abstract (摘要) In this thesis, we propose a rule-based classification algorithm named ROUSER (ROUgh SEt Rule), which uses the rough set theory as the basis of the search heuristics in the process of rule generation. We implement ROUSER using a well developed and widely used toolkit, evaluate it using several public data sets, and examine its applicability using a real-world case study.
The origin of the problem addressed in this thesis can be traced back to a real-world problem where the goal is to determine whether a data record collected from a sensor corresponds to a machine fault. In order to assist in the root cause analysis of the machine faults, we design and implement a rule-based classification algorithm that can generate models consisting of human understandable decision rules to connect symptoms to the cause. Moreover, there are contradictions in data. For example, two data records collected at different time points are similar, or the same (except their timestamps), while one is corresponding to a machine fault but not the other. The challenge is to analyze data with contradictions. We use the rough set theory to overcome the challenge, since it is able to process imperfect knowledge.
Researchers have proposed various classification algorithms and practitioners have applied them to various application domains, while most of the classification algorithms are designed without a focus on interpretability or understandability of the models built using the algorithms. ROUSER is specifically designed to extract human understandable decision rules from nominal data. What distinguishes ROUSER from most, if not all, other rule-based classification algorithms is that it utilizes a rough set approach to select features. ROUSER also provides several ways to decide an appropriate attribute-value pair for the antecedents of a rule. Moreover, the rule generation method of ROUSER is based on the separate-and-conquer strategy, and hence it is more efficient than the indiscernibility matrix method that is widely adopted in the classification algorithms based on the rough set theory.
We conduct extensive experiments to evaluate the capability of ROUSER. On about half of the nominal data sets considered in experiments, ROUSER can achieve comparable or better accuracy than do classification algorithms that are able to generate decision rules or trees. On some of the discretized data sets, ROUSER can achieve comparable or better accuracy. We also present the results of the experiments on the embedded feature selection method and several ways to decide an appropriate attribute-value pair for the antecedents of a rule.
en_US
dc.description.tableofcontents CHAPTER 1 INTRODUCTION 1
1.1 Classification Problem 1
1.2 Tree-Based and Rule-Based Classification Algorithms 2
1.3 Rough Set Based Classification Algorithms 3
1.4 Data Mining 4
1.5 Thesis Organization 5
CHAPTER 2 PRELIMINARY 6
2.1 Rule-Based Classification Algorithms 6
2.1.1 The Basics 6
2.1.2 Separate-and-Conquer 7
2.1.3 Search Heuristics 7
2.1.4 Pruning and Optimization 9
2.2 The Rough Set Theory 9
2.2.1 Information System and Decision Table 10
2.2.2 Indiscernibility Relation 11
2.2.3 Rough Set 12
2.2.4 Reduct and Core 14
2.2.5 Indiscernibility Matrix 17
CHAPTER 3 DESIGN OF THE PROPOSED METHOD 18
3.1 Potential Boundary Region and Discernibility Power 18
3.2 ROUSER 22
CHAPTER 4 IMPLEMENTATION OF THE PROPOSED METHOD 28
4.1 WEKA 28
4.1.1 Import Data 28
4.1.2 Classifier 28
4.1.3 Cross-Validation 29
4.2 Data Structure 29
4.2.1 Rough Set 29
4.2.2 Decision Rule 30
4.3 ROUSER 32
4.3.1 BuildClassifier( ) 32
4.3.2 OneClassRule( ) 32
4.3.3 grow( ) 34
4.3.4 BestAntd( ) 35
CHAPTER 5 EXPERIMENT AND RESULTS 36
5.1 Environmental Setting 36
5.2 Data Sets 36
5.3 Results 39
5.4 Summary 49
CHAPTER 6 CASE STUDY 50
6.1 Introduction 50
6.1.1 Back Ground Knowledge 50
6.1.2 Problem 51
6.1.3 Data 51
6.2 Model Building 52
6.2.1 Data Preprocessing 52
6.2.2 Classification Algorithms 53
6.3 Inspect the Result Rules 54
6.4 Summary 56
CHAPTER 7 CONCLUSIONS AND FUTURE WORK 57
REFERENCE 58
zh_TW
dc.language.iso en_US-
dc.source.uri (資料來源) http://thesis.lib.nccu.edu.tw/record/#G0099753005en_US
dc.subject (關鍵詞) 資料探勘zh_TW
dc.subject (關鍵詞) 分類zh_TW
dc.subject (關鍵詞) 粗糙集zh_TW
dc.subject (關鍵詞) 規則學習zh_TW
dc.subject (關鍵詞) 規則歸納zh_TW
dc.subject (關鍵詞) data miningen_US
dc.subject (關鍵詞) classificationen_US
dc.subject (關鍵詞) rough seten_US
dc.subject (關鍵詞) rule learningen_US
dc.subject (關鍵詞) separate-and-conqueren_US
dc.subject (關鍵詞) rule inductionen_US
dc.title (題名) 以規則為基礎的分類演算法:應用粗糙集zh_TW
dc.title (題名) A Rule-Based classification algorithm: a rough set approachen_US
dc.type (資料類型) thesisen
dc.relation.reference (參考文獻) [1] J. G. Bazan, H. S. Nguyen, S. H. Nguyen, P. Synak, J. Wróblewski, and blewski, "Rough set algorithms in classification problem," in Rough set methods and applications, ed: Physica-Verlag GmbH, 2000, pp. 49-88.
[2] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming: Athena Scientific, 1996.
[3] W.W. Cohen, “Fast Effective Rule Induction,” Proc. 12th Int`l Conf. Machine Learning (ICML), pp. 115-123, 1995.
[4] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.
[5] J. Dai, Q. Xu, and W. Wang, "A comparative study on strategies of rule induction for incomplete data based on rough set approach," International Journal of Advancements in Computing Technology, vol. 3, p. 176–183, 2011.
[6] U. M. Fayyad, K. B. Irani, “Multi-interval discretization of continuous-valued attributes for classification learning”, presented at the Proceedings of 13th international joint conference on Artificial intelligence, 1022-1027, 1993.
[7] J. Fürnkranz, "Separate-and-Conquer Rule Learning," Artif. Intell. Rev., vol. 13, pp. 3-54, 1999.
[8] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," SIGKDD Explor. Newsl., vol. 11, pp. 10-18, 2009.
[9] L. Huan and R. Setiono, "Chi2: feature selection and discretization of numeric attributes," in Tools with Artificial Intelligence, 1995. Proceedings of IEEE Seventh International Conference on, 1995, pp. 388-391.
[10] J. C. Huhn and E. Hullermeier, "FR3: A Fuzzy Rule Learner for Inducing Reliable Classifiers," Fuzzy Systems, IEEE Transactions on, vol. 17, pp. 138-149, 2009.
[11] W. Jiabing, Z. Pei, W. Guihua, and W. Jia, "Classifying Categorical Data by Rule-Based Neighbors," in Data Mining (ICDM), 2011 IEEE 11th International Conference on, 2011, pp. 1248-1253.
[12] X. Jin, A. Xu, R. Bie, and P. Guo, "Machine Learning Techniques and Chi-Square Feature Selection for Cancer Classification Using SAGE Gene Expression Profiles Data Mining for Biomedical Applications." vol. 3916, J. Li, Q. Yang, and A.-H. Tan, Eds., ed: Springer Berlin / Heidelberg, 2006, pp. 106-115.
[13] M. T. Mitchell, Machine Learning, 1997 :McGraw-Hill
[14] G. Pagallo and D. Haussler, "Boolean Feature Discovery in Empirical Learning," Machine Learning, vol. 5, pp. 71-99, 1990.
[15] Z. Pawlak, "Some Issues on Rough Sets,” Transactions on Rough Sets I, vol. 3100, J. Peters, A. Skowron, J. Grzymala-Busse, B. Kostek, R. Swiniarski, and M. Szczuka, Eds., ed: Springer Berlin / Heidelberg, 2004, pp. 1-58.
[16] Z. Pawlak, A. Skowron, "Rudiments of rough sets", Information Sciences, vol.177, no.1, pp.3-27, 2007.
[17] J. R. Quinlan, "Induction of Decision Trees," Machine Learning, vol. 1, pp. 81-106, 1986.
[18] J. R. Quinlan, C4.5: programs for machine learning: Morgan Kaufmann Publishers Inc., 1993.
[19] J. Stefanowski and K. Slowiński, "Rough Set Theory and Rule Induction Techniques For Discovery of Attribute Dependencies in Medical Information Systems,” Principles of Data Mining and Knowledge Discovery. vol. 1263, J. Komorowski and J. Zytkow, Eds., ed: Springer Berlin / Heidelberg, 1997, pp. 36-46.
[20] S. M. Weiss and N. Indurkhya, "Reduced complexity rule induction," presented at the Proceedings of the 12th international joint conference on Artificial intelligence - Volume 2, Sydney, New South Wales, Australia, 1991.
[21] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. McLachlan, A. Ng, B. Liu, P. Yu, Z.-H. Zhou, M. Steinbach, D. Hand, and D. Steinberg, "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, pp. 1-37, 2008.
[22] L. A. Zadeh, "Fuzzy Sets," Information and Control, vol. 8, pp. 338–353, 1965.
[23] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[24] "Data Mining Curriculum". ACM SIGKDD. 2006-04-30. Retrieved 2011-10-28.
zh_TW
dc.identifier.doi (DOI) 10.1109/CyberneticsCom.2012.6381605en_US
dc.doi.uri (DOI) http://dx.doi.org/10.1109/CyberneticsCom.2012.6381605en_US