Linköping Studies in Science and Technology
Dissertation No. 1086

Obtaining Accurate and Comprehensible Data Mining Models – An Evolutionary Approach

by

Ulf Johansson

Department of Computer and Information Science
Linköpings universitet
SE-581 83 Linköping, Sweden

Linköping 2007

Abstract

When performing predictive data mining, the use of ensembles is claimed to virtually guarantee increased accuracy compared to the use of single models. Unfortunately, the problem of how to maximize ensemble accuracy is far from solved. In particular, the relationship between ensemble diversity and accuracy is not completely understood, making it hard to utilize diversity efficiently for ensemble creation. Furthermore, most high-accuracy predictive models are opaque, i.e. it is not possible for a human to follow and understand the logic behind a prediction. For some domains this is unacceptable, since models need to be comprehensible. To obtain comprehensibility, accuracy is often sacrificed by using simpler but transparent models; a trade-off termed the accuracy vs. comprehensibility trade-off. With this trade-off in mind, several researchers have suggested rule extraction algorithms, where opaque models are transformed into comprehensible models while keeping acceptable accuracy.

In this thesis, two novel algorithms based on Genetic Programming are suggested. The first algorithm (GEMS) is used for ensemble creation, and the second (G-REX) is used for rule extraction from opaque models. The main property of GEMS is the ability to combine smaller ensembles and individual models in an almost arbitrary way. Moreover, GEMS can use base models of any kind, and the optimization function is very flexible, easily permitting the inclusion of, for instance, diversity measures. In the experimentation, GEMS obtained accuracies higher than both straightforward design choices and published results for Random Forests and AdaBoost. The key quality of G-REX is the inherent ability to explicitly control the accuracy vs. comprehensibility trade-off. Compared to the standard tree inducers C5.0 and CART, and to some well-known rule extraction algorithms, rules extracted by G-REX are significantly more accurate and compact. Most importantly, G-REX is thoroughly evaluated and found to meet all relevant evaluation criteria for rule extraction algorithms, thus establishing G-REX as the algorithm to benchmark against.
Contents

Chapter 1 Introduction
  1.1 Problem statement
  1.2 Main contributions
  1.3 Thesis outline

Chapter 2 Data Mining
  2.1 A generic description of data mining algorithms
  2.2 Data
  2.3 Predictive regression
  2.4 Predictive classification
  2.5 Clustering
  2.6 Concept description
  2.7 Evaluation and comparison of classifiers

Chapter 3 Basic Data Mining Techniques
  3.1 Linear regression
  3.2 Decision trees
  3.3 Neural networks
  3.4 Genetic algorithms
  3.5 Genetic programming

Chapter 4 Rule Extraction
  4.1 Sensitivity analysis
  4.2 Rule extraction from trained neural networks
  4.3 Related work concerning rule extraction

Chapter 5 Ensembles
  5.1 Motivation for ensembles
  5.2 Ensemble construction
  5.3 Diversity
  5.4 Related work concerning ensemble creation

Chapter 6 A Novel Technique for Rule Extraction
  6.1 The G-REX technique
  6.2 Representation languages
  6.3 Evaluation criteria
  6.4 Using oracle data for rule extraction
  6.5 G-REX evaluation using public data sets

Chapter 7 Impact of Advertising Case Study
  7.1 High accuracy prediction in the marketing domain
  7.2 Increased performance and basic understanding
  7.3 Initial rule extraction in the marketing domain
  7.4 Rule extraction in another marketing domain
  7.5 Extending G-REX

Chapter 8 A Novel Technique for Ensemble Creation
  8.1 Study 1 – Building ensembles using GAs
  8.2 Study 2 – Introducing GEMS
  8.3 Study 3 – Evaluating the use of a validation set
  8.4 Study 4 – Two GEMS variants
  8.5 Study 5 – Evaluating diversity measures
  8.6 Study 6 – A unified GEMS

Chapter 9 Conclusions and Future Work
  9.1 Conclusions
  9.2 Future work

List of Figures

Figure 1: The KDD process
Figure 2: Virtuous cycle of data mining (adopted from [BL97])
Figure 3: Predictive modeling
Figure 4: A generic score function balancing accuracy and complexity
Figure 5: Bias and variance
Figure 6: A sample confusion matrix
Figure 7: A sample ROC curve
Figure 8: A sample ANN
Figure 9: A tapped delay neural network with two tapped inputs
Figure 10: An SRN
Figure 11: GA (single-point) crossover
Figure 12: Genetic program representing a Boolean expression
Figure 13: S-expression representing a Boolean expression
Figure 14: GP crossover
Figure 15: GP mutation
Figure 16: Black-box rule extraction
Figure 17: Schematic ensemble
Figure 18: Averaging decision trees to obtain a diagonal decision boundary
Figure 19: G-REX GUI
Figure 20: Representation language for Boolean trees (n continuous inputs)
Figure 21: Representation language for decision trees
Figure 22: BNF file for decision trees
Figure 23: Extracted rule for Benign
Figure 24: Graphical representation of extracted rule for Benign
Figure 25: Extracted decision tree for Iris
Figure 26: BNF used in Experiment 1
Figure 27: Illustrating a multi-split test in C5.0
Figure 28: A 1-of-N test in C5.0 with complex size 1
Figure 29: A 1-of-N test in C5.0 with complex size 4
Figure 30: Sample G-REX tree with complex size 7
Figure 31: TOM against total investment for Volvo
Figure 32: Total investment over 100 weeks
Figure 33: The TOM as a result of the investments
Figure 34: SRN with input and output (in italics) for this problem
Figure 35: SRN long-term prediction of Ford TOM
Figure 36: TDNN short-term prediction of Volvo TOM
Figure 37: The problem with using R² as quality measure
Figure 38: A sample rule for impact of advertising
Figure 39: G-REX rule for high impact for Volkswagen
Figure 40: Trepan rule for high impact for Volkswagen
Figure 41: TOM/total investment for Fritidsresor
Figure 42: Short-term prediction of IM for Apollo
Figure 43: Extracted rule for high TOM for Apollo
Figure 44: Extracted tree for IM for Always with three classes
Figure 45: Confusion matrix for Ving IM (test set)
Figure 46: Confusion matrix for Apollo TOM (test set)
Figure 47: Confusion matrix for Always IM (test set)
Figure 48: Confusion matrix for Apollo TOM (test set)
Figure 49: Representation language for regression trees
Figure 50: Fuzzification of input variables
Figure 51: Representation language for fuzzy trees
Figure 52: ANN prediction for Ford IM. Training and test set
Figure 53: G-REX prediction for Ford IM. Test set only
Figure 54: Evolved regression tree for Ford IM
Figure 55: Evolved Boolean rule for Toyota high IM
Figure 56: Evolved fuzzy rule for Ford high IM
Figure 57: Friedman test Study 1
Figure 58: A sample GEMS ensemble
Figure 59: Representation language for GEMS
Figure 60: Test accuracy vs. validation accuracy for Tic-Tac-Toe
Figure 61: Test accuracy vs. validation accuracy for Vehicle
Figure 62: Small GEMS ensemble
Figure 63: Averaged-size GEMS ensemble
Figure 64: Ensembles sorted on validation accuracy. r = 0.24
Figure 65: Test accuracy vs. validation accuracy for Vehicle. r = 0.77
Figure 66: Test accuracy vs. validation accuracy for Waveform. r = 0.13
Figure 67: ANNs sorted on validation accuracy. r = 0.13
Figure 68: Comparison of model sets
Figure 69: Grammar for GEMS ensemble trees
Figure 70: Sample ensemble tree in GEMS syntax
Figure 71: Friedman test Study 4
Figure 72: A sample fold with very strong correlation (-0.85)
Figure 73: A sample fold with typical correlation (-0.27)
Figure 74: Fitness function used in Study 6
Figure 75: A decision line requiring tests between input variables
Figure 76: A decision line requiring arithmetic operators

List of Tables

Table 1: Thermometer coding
Table 2: Critical values of the Wilcoxon T statistic, two-tailed test
Table 3: Critical values for the two-tailed Nemenyi test
Table 4: Critical values for the two-tailed Bonferroni-Dunn test
Table 5: Integration strategies
Table 6: GP parameters for G-REX
Table 7: WBC - percent correct on test set
Table 8: Percent correct on the test set for Iris
Table 9: UCI data set characteristics
Table 10: GP parameters for Experiment 1
Table 11: Comparing G-REX, C5.0 and CART accuracies
Table 12: Ranks for techniques producing transparent models
Table 13: Wilcoxon signed-ranks test between G-REX and C5.0-gen
Table 14: Results for experiment with oracle data
Table 15: Ranks for techniques producing transparent models
Table 16: Complexity measured as number of questions
Table 17: Wilcoxon signed-ranks test between G-REX and C5.0. Size
Table 18: Complexity measured as number of tests
Table 19: Trepan parameters
Table 20: RX parameters
Table 21: Results G-REX and Trepan
Table 22: Results G-REX and RX
Table 23: Intra-model consistency. One fold Zoo problem
Table 24: Inter-model consistency. One fold PID problem
Table 25: Average consistency over all pairs and all folds for each data set
Table 26: R² values for multiple linear regression weeks 1-100
Table 27: Long-term forecast using multiple linear regression
Table 28: Short-term forecast using multiple linear regression
Table 29: R² values for long-term forecast using a feed-forward net
Table 30: R² values for long-term forecast
Table 31: R² values for forecast, one to four weeks ahead, using TDNNs
Table 32: R² values for the TDNN architecture
Table 33: R² values for the SRN architecture
Table 34: Media categories found to be important
Table 35: Results for reduced data set using MA2 post-processing
Table 36: Companies and media categories used
Table 37: Percent correct on the test set for impact of advertising
Table 38: Complexity measured as interior nodes
Table 39: Results for long-term predictions given as R² values
Table 40: Results for short-term predictions given as R² values
Table 41: Binary classification. Percent correct on test set
Table 42: Classification with three classes. Percent correct on test set
Table 43: G-REX fidelity on the binary classification problem
Table 44: G-REX fidelity on the classification problem with 3 classes
Table 45: Results for the regression task
Table 46: Results for the classification task (TOM)
Table 47: Results for the classification task (IM)
Table 48: UCI data set characteristics
Table 49: Properties for setups not using GAs
Table 50: Properties for setups using GAs
Table 51: Results for uniform ensembles
Table 52: Results for mixed and selected ensembles
Table 53: Results for setups using GAs
Table 54: GP parameters for GEMS
Table 55: Number of ANNs in fixed ensembles
Table 56: Results using 50 ANNs
Table 57: Results using 20 ANNs
Table 58: Correlation between validation and test accuracy: ensembles
Table 59: ANOVA results for ensembles
Table 60: Correlation between validation and test accuracy: ANNs
Table 61: ANOVA results for ANNs
Table 62: Mean test set accuracy for top 5% ensembles, size 10
Table 63: Mean test set accuracy for top 5% ensembles, random size
Table 64: Tombola training parameters
Table 65: GP parameters for tombola training
Table 66: Results for Study 4
Table 67: Result summary for GEMS, AdaBoost and Random Forest
Table 68: Comparison with AdaBoost and Random Forest
Table 69: Experiments in ensemble Study 5
Table 70: Diversity measures
Table 71: Measures on training set. Enumerated ensembles
Table 72: Measures on validation set. Enumerated ensembles
Table 73: Measures on training set. Randomized ensembles
Table 74: Measures on validation set. Randomized ensembles
Table 75: Comparing the top 1% diverse ensembles with all
Table 76: Codings used in Study 6
Table 77: Results Study 6
Table 78: Comparison of the different algorithms for rule extraction

List of Publications

Thesis

Johansson, U., Rule Extraction - the Key to Accurate and Comprehensible Data Mining Models, Licentiate thesis, Institute of Technology, Linköping University, 2004.

Journal papers

Johansson, U., Löfström, T., König, R. and Niklasson, L., Why Not Use an Oracle When You Got One?, Neural Information Processing - Letters and Reviews, Vol. 10, No. 8-9:227-236, 2006.

Löfström, T. and Johansson, U., Predicting the Benefit of Rule Extraction - A Novel Component in Data Mining, Human IT, Vol. 7.3:78-108, 2005.

König, R., Johansson, U. and Niklasson, L., Increasing rule extraction comprehensibility, International Journal of Information Technology and Intelligent Computing, Vol. 1, No. 2:303-314, 2006.

International conference papers

Johansson, U. and Niklasson, L., Predicting the impact of advertising - a neural network approach, The International Joint Conference on Neural Networks, IEEE Press, Washington D.C., pp. 1799-1804, 2001.

Johansson, U. and Niklasson, L., Increased Performance with Neural Nets - An Example from the Marketing Domain, The International Joint Conference on Neural Networks, IEEE Press, Honolulu, HI, pp. 1684-1689, 2002.

Johansson, U. and Niklasson, L., Neural Networks - from Prediction to Explanation, IASTED International Conference Artificial Intelligence and Applications, IASTED, Malaga, Spain, pp. 93-98, 2002.

Johansson, U., König, R. and Niklasson, L., Rule Extraction from Trained Neural Networks using Genetic Programming, 13th International Conference on Artificial Neural Networks, Istanbul, Turkey, supplementary proceedings, pp. 13-16, 2003.

Johansson, U., Sönströd, C., König, R. and Niklasson, L., Neural Networks and Rule Extraction for Prediction and Explanation in the Marketing Domain, The International Joint Conference on Neural Networks, IEEE Press, Portland, OR, pp. 2866-2871, 2003.

Johansson, U., König, R. and Niklasson, L., The Truth is in There - Rule Extraction from Opaque Models Using Genetic Programming, 17th Florida Artificial Intelligence Research Society Conference (FLAIRS) 04, Miami, FL, AAAI Press, pp. 658-662, 2004.

Johansson, U., Niklasson, L. and König, R., Accuracy vs. Comprehensibility in Data Mining Models, 7th International Conference on Information Fusion, Stockholm, Sweden, pp. 295-300, 2004.

Johansson, U., Sönströd, C. and Niklasson, L., Why Rule Extraction Matters, 8th IASTED International Conference on Software Engineering and Applications, MIT, Cambridge, MA, pp. 47-52, 2004.

Löfström, T., Johansson, U. and Niklasson, L., Rule Extraction by Seeing Through the Model, 11th International Conference on Neural Information Processing (ICONIP), Calcutta, India, pp. 555-560, 2004.