
Data mining neural networks with genetic algorithms

Ajit Narayanan, Edward Keedwell and Dragan Savic
School of Engineering and Computer Science
University of Exeter
Exeter EX4 4PT
United Kingdom
[email protected]
tel: (+)1392 264064

Abstract

It is an open question as to what is the best way to extract symbolic rules from trained neural networks in domains involving classification. Previous approaches based on an exhaustive analysis of network connection and output values have already been demonstrated to be intractable in that the scale-up factor increases exponentially with the number of nodes and connections in the network. A novel approach using genetic algorithms to search for symbolic rules in a trained neural network is demonstrated in this paper. Preliminary experiments involving classification are reported here, with the results indicating that our proposed approach is successful in extracting rules. While it is accepted that further work is required to convincingly demonstrate the superiority of our approach over others, there is nevertheless sufficient novelty in these results to justify early dissemination. (If the paper is accepted, the latest results will be reported, together with sufficient information to aid replicability and verification.)

Introduction

Artificial neural networks (ANNs) are increasingly used in problem domains involving classification. They are adept at finding commonalities in a set of seemingly unrelated data and for this reason are used in a growing number of classification tasks. Unfortunately, a commonly perceived problem with ANNs when used for classification is that, while a trained ANN can indeed classify the data, sometimes with more accuracy than a traditional, symbolic machine learning approach, the reasons for their classification cannot be found easily.
Trained ANNs are commonly perceived to be ‘black boxes’ which map input data onto a class through a number of mathematically weighted connections between layers of neurons. While the idea of ANNs as black boxes may not be a problem in applications where there is little interest in the reasons behind classification, this can be a major obstacle in applications where it is important to have symbolic rules or other forms of knowledge structure, such as identification or decision trees, which are easily interpretable by human experts. In particular, it may be important to identify knowledge not previously known to domain experts and which may therefore lie at the periphery of domain expertise. Also, safety-critical systems (such as air traffic control or missile firing) which use neural networks successfully to classify data face difficulty in being accepted because of the reluctance of managers and administrators to accept a system which is not open to symbolic verification. Often, there is a legal requirement that such safety-critical systems be demonstrated to be correct to a certain degree of confidence. It is often claimed that neural networks, because of their plasticity and use of soft constraints, can handle noisy data better than their symbolic counterparts and should therefore be used precisely in those areas which are likely to benefit most from their application, such as safety-critical systems and data mining.

In general, an ANN can be said to make its decisions by using the activation of the units (input and hidden) combined with the weights of the connections between these units. The topology of the network can also be used. Andrews et al. (1996) identify three types of rule extraction techniques: ‘decompositional’, ‘pedagogical’ and ‘eclectic’, each of which refers to a different method of extracting information from the network.
A decompositional approach is distinguished by its focus on extracting rules at the level of individual (hidden and output) units. The computed output from each hidden and output unit is mapped onto a binary ‘yes/no’ outcome corresponding to the notion of a rule consequent. The major problem with this approach is the apparent exponential behaviour of associated algorithms (Towell and Shavlik, 1993). Extracting rules from complex ANNs may therefore be intractable. A pedagogical approach is distinguished by its treatment of a trained ANN as a ‘black box’ where the knowledge to be extracted deals directly with the way that input is mapped onto output by the internal weights (i.e. no ‘yes/no rules’ are extracted – just rules dealing with the changes in the levels of the input and output units). The major problem with this approach is the sheer number of rules generated for even the simplest domains. Finally, the eclectic approach is characterised by any use of knowledge concerning the internal architecture and/or weight vectors in a trained ANN to complement a symbolic learning algorithm. There is currently very little understanding of available methods for constructing an eclectic approach, of the domains where eclectic approaches may outperform their traditional symbolic and ANN counterparts, and how to evaluate the results of an eclectic approach. In this paper we propose a novel, evolutionary eclectic approach which integrates traditional ANNs with genetic algorithms for extracting simple, intelligible and useful rules from trained ANNs. It is claimed that this approach adopts the advantages of ANNs (gradual, incremental training which overcomes inconsistencies and ambiguities in the data) as well as symbolic learning (intelligible output, rules for verification). In brief, the paper proposes the use of a genetic algorithm to search the weight space of a trained neural network to identify the best rules for classification. 
The genetic algorithm uses chromosomes which can be mapped directly onto intelligible rules (phenotypes). Two major constraints are the following. First, the goal of many rule-extraction techniques is to find a comprehensive rule base for the network so that it can be encoded as a set of ‘expert system’ rules in which the attributes causing a particular classification can be precisely and fully determined. In this paper we propose that this is not necessary in the majority of applications. Algorithms attempting to produce comprehensive rule sets have a tendency to become exponential in complexity as network size increases. This has been recognised by researchers, and in a recent paper (Arbatli and Akin, 1997) the search space available to the symbolic algorithm has been decreased by optimizing the topology of the network using genetic algorithms. The approach described here differs in that it uses GAs to search a trained neural network for the extraction of symbolic rules directly and not to optimise the network for another set of rule extraction techniques to be applied. Secondly, the experiments below have been performed on categorical rather than continuous data. Many datasets of significance in the real world do indeed have continuous attributes, but datasets with large numbers of unpartitioned continuous attributes are unlikely to be successfully classified by a neural network in any case.

The Genetic Algorithm/Neural Network System

The starting point of any rule-extraction system is firstly to train the network on the data required, i.e. the ANN is trained so that a satisfactory error level is reached. For classification problems, each input unit typically corresponds to a single feature in the real world, and each output unit to a class value or class. The first objective of our approach is to encode the network in such a way that a genetic algorithm can be run over the top of it.
This is achieved by creating an n-dimensional weight space where n is the number of layers of weights. The network can be represented by simply enumerating each of the nodes and/or connections. For example, Figure 1 depicts a simple neural network with five input units (input features, data attributes), three hidden units, and one output unit (class or class value), with each node enumerated in this case except the output. Typically, there will be more than one output class or class value and therefore more than one output node.

Figure 1 - A typical encoding of a simple neural network with only one class value (one output node)

From this encoding, genes can be created which, in turn, are used to construct chromosomes where there is at least one gene representing a node at the input layer and at least one gene representing a node at the hidden layer. A typical chromosome for the network depicted in Figure 1 could look something like this (Figure 2):

Figure 2 - A typical chromosome generated from the encoded network for only one class value

This chromosome corresponds to the fifth unit in the input layer and the third unit in the hidden layer. That is, the first gene contains the weight connecting input node 5 to hidden unit 3, and the second gene contains the weight connecting hidden unit 3 to the output class. Fitness is computed as a direct function of the weights which the chromosome represents. For chromosomes containing just two genes (one for the input unit, the other for the hidden unit), the fitness function is:

Fitness = Weight(Input → Hidden) * Weight(Hidden → Output)

where ‘→’ signifies the weight between the two enumerated nodes. So the fitness of the chromosome in Figure 2 is:

Fitness = Weight(5 → 3) * Weight(3 → Output)

This fitness is computed for an initial set of random chromosomes, and the population is sorted according to fitness.
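The two-gene encoding and its fitness can be illustrated with a short Python sketch. This is not the implementation used for the experiments: the weight values are random placeholders standing in for a trained network, and the names `w_input_hidden`, `w_hidden_output`, `fitness` and `random_chromosome` are hypothetical.

```python
import random

# Network shape taken from Figure 1: 5 input units, 3 hidden units, 1 output unit.
N_INPUT, N_HIDDEN = 5, 3

rng = random.Random(0)  # seeded only so that the sketch is repeatable

# Placeholder weights (assumption): a real run would read these from the trained ANN.
# w_input_hidden[i][h] is the weight from input unit i+1 to hidden unit h+1.
w_input_hidden = [[rng.uniform(-3, 3) for _ in range(N_HIDDEN)] for _ in range(N_INPUT)]
# w_hidden_output[h] is the weight from hidden unit h+1 to the single output unit.
w_hidden_output = [rng.uniform(-3, 3) for _ in range(N_HIDDEN)]


def fitness(chromosome):
    """Weight(Input -> Hidden) * Weight(Hidden -> Output) for a two-gene chromosome."""
    input_unit, hidden_unit = chromosome  # 1-based unit numbers, as in Figure 1
    return (w_input_hidden[input_unit - 1][hidden_unit - 1]
            * w_hidden_output[hidden_unit - 1])


def random_chromosome():
    """Pick one input unit and one hidden unit at random."""
    return (rng.randint(1, N_INPUT), rng.randint(1, N_HIDDEN))


# Initial random population, sorted so that the fittest chromosome comes first.
population = [random_chromosome() for _ in range(6)]
ranked = sorted(population, key=fitness, reverse=True)

for chromosome in ranked:
    print(chromosome, round(fitness(chromosome), 3))
```

The chromosome of Figure 2 corresponds to the tuple `(5, 3)` in this representation, so its fitness is `fitness((5, 3))`.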
An elitist strategy is then used whereby a subset of the top chromosomes is selected for inclusion in the next generation. Crossover and mutation are then performed on these chromosomes to create the rest of the next population. The chromosome is then easily converted into IF…THEN rules with an attached weighting. This is achieved by using the template: ‘IF <gene1> THEN output is <class> (weighting)’, with the weighting being the fitness value and the class signifying which output unit is switched on. The weighting is a major part of the rule generation procedure because its value is a direct measure of how the network interprets the data. Since ‘Gene 1’ above corresponds to the weight between an input unit and a hidden unit, the template is essentially stating that the consequent of the rule is caused by the activation on that particular input node and its connection to a hidden unit (not specified explicitly in the rule). The rule template above therefore allows the extraction of single-condition rules. The number of extracted rules in each population can be set by the user, according to the complexity of the network and/or the data. A larger number of rules will yield less fit chromosomes and thus less important rules. This property is essential in extracting rules which represent knowledge at the periphery of expertise.

Experimentation

Three experiments are described here. The first two experiments use a toy example to show that our approach can find rules comparable to those found with purely symbolic methods of data-mining. The third experiment was performed on a larger data set to show that this method is generalisable to real-world domains. All GA programs are written in C++. Neural network packages used were Neurodimensions’ Neurosolutions v3.0 and Thinkspro v1.05 by Logical Designs Consulting.
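The C++ programs themselves are not reproduced in this paper. As a rough illustration only, the elitist step, the crossover and mutation operators and the rule template described in the previous section might be sketched in Python as follows. The operator details (which gene is swapped, the mutation probability `p_mutate`, the elite size `n_elite`) and the small weight tables are assumptions made for the example, not values taken from the experiments.

```python
import random


def next_generation(population, fitness, n_input, n_hidden,
                    n_elite=2, p_mutate=0.5, rng=None):
    """Build the next population: elite survivors plus crossover/mutation offspring."""
    rng = rng or random.Random()
    ranked = sorted(population, key=fitness, reverse=True)
    elite = ranked[:n_elite]  # elitism: the top chromosomes pass through unchanged
    offspring = []
    while len(elite) + len(offspring) < len(population):
        a, b = rng.choice(elite), rng.choice(elite)
        child = (a[0], b[1])  # crossover: input gene from a, hidden gene from b
        if rng.random() < p_mutate:  # mutation: redraw one of the two genes
            if rng.random() < 0.5:
                child = (rng.randint(1, n_input), child[1])
            else:
                child = (child[0], rng.randint(1, n_hidden))
        offspring.append(child)
    return elite + offspring


def to_rule(chromosome, class_name, weighting):
    """Fill the template 'IF <gene1> THEN output is <class> (weighting)'."""
    input_unit, _hidden_unit = chromosome  # the hidden unit is not shown in the rule
    return f"IF unit {input_unit} is 1 THEN output is {class_name} (weighting {weighting:.3f})"


# Hypothetical weights for a 2-input, 2-hidden, 1-output network.
W_IH = [[0.5, -1.0], [2.0, 0.1]]  # W_IH[i][h]: input unit i+1 -> hidden unit h+1
W_HO = [1.5, -2.0]                # W_HO[h]: hidden unit h+1 -> output unit


def demo_fitness(chromosome):
    i, h = chromosome
    return W_IH[i - 1][h - 1] * W_HO[h - 1]


pop = [(1, 1), (1, 2), (2, 2), (2, 1)]
new_pop = next_generation(pop, demo_fitness, n_input=2, n_hidden=2, rng=random.Random(1))
print(to_rule(new_pop[0], "class 1", demo_fitness(new_pop[0])))
```

Because the elite is copied across unchanged, the best fitness in the population can never decrease from one generation to the next; the offspring only add candidates around the current best input and hidden units.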
Experiment 1

The dataset refers to named individuals for whom there are four attributes and two possible class values (Figure 3 - adapted from Winston, 1992):

Name    Hair     Height    Weight    Lotion   Result
Sarah   Blonde   Average   Light     No       Sunburned
Dana    Blonde   Tall      Average   Yes      Not sunburned
Alex    Brown    Short     Average   Yes      Not sunburned
Annie   Blonde   Short     Average   No       Sunburned
Emily   Red      Average   Heavy     No       Sunburned
Pete    Brown    Tall      Heavy     No       Not sunburned
John    Brown    Average   Average   No       Not sunburned
Katie   Blonde   Short     Light     Yes      Not sunburned

Figure 3 - The Sunburn Dataset

This dataset is converted as follows into a form suitable for input to the ANN (Figure 4):

Attribute   Value           Code
Hair        Blonde          100
Hair        Brown           010
Hair        Red             001
Height      Short           100
Height      Average         010
Height      Tall            001
Weight      Light           100
Weight      Average         010
Weight      Heavy           001
Lotion      No              10
Lotion      Yes             01
Class       Sunburned       10
Class       Not sunburned   01

Figure 4 - Neural Network Conversion of Data in Figure 3.

One example of input is therefore: 10001010010, which represents a blonde-haired (100), average height (010), light (100), no-lotion used (10) individual (i.e. Sarah). Note that we are dealing with a supervised learning network, where the class in which the sample falls is explicitly represented for training purposes. So, in the case of Sarah, the output 10 (sunburned) is used for supervised training. ‘10’ here signifies that the first output node is switched on and the second is not. A neural network with 11 input, 5 hidden and 2 output units was created. The input to the network was a string of 0s and 1s which corresponded to the records in the data set above. The network was then trained (using back-propagation) until a mean square error of 0.001 was achieved. The network weights were then recorded and the genetic algorithm process started.
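The conversion in Figure 4 is a one-of-n (one-hot) encoding of each attribute value. A minimal Python sketch of it is given below; the dictionary layout and the names `ENCODING`, `encode` and `UNIT_LABELS` are illustrative choices, not part of the original system. The sketch also numbers the resulting input units from 1 to 11, which is the numbering used when rules refer to a unit.

```python
# Attribute order and codes follow Figure 4.
ATTRIBUTES = ("hair", "height", "weight", "lotion")
ENCODING = {
    "hair":   {"Blonde": "100", "Brown": "010", "Red": "001"},
    "height": {"Short": "100", "Average": "010", "Tall": "001"},
    "weight": {"Light": "100", "Average": "010", "Heavy": "001"},
    "lotion": {"No": "10", "Yes": "01"},
}
CLASS_ENCODING = {"Sunburned": "10", "Not sunburned": "01"}


def encode(record):
    """Concatenate the one-hot codes of a record's attribute values."""
    return "".join(ENCODING[attr][record[attr]] for attr in ATTRIBUTES)


# Map each input unit number (1..11) back to the attribute value it represents.
UNIT_LABELS = {}
first_unit = 1
for attr in ATTRIBUTES:
    for value, code in ENCODING[attr].items():
        UNIT_LABELS[first_unit + code.index("1")] = f"{attr}={value}"
    first_unit += len(ENCODING[attr])

sarah = {"hair": "Blonde", "height": "Average", "weight": "Light", "lotion": "No"}
print(encode(sarah), CLASS_ENCODING["Sunburned"])
print(UNIT_LABELS)
```

Under this numbering, input units 1 to 3 carry hair colour, 4 to 6 height, 7 to 9 weight, and 10 and 11 the absence or use of lotion.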
The weights between the 11 input and 5 hidden units are as follows:

Hidden Unit 1 (all eleven input units): -2.029721 1.632389 -1.702274 -1.369853 0.133539 0.296253 -0.465295 0.680639 -0.610233 -1.432447 -1.462687
Hidden Unit 2: 0.960469 1.304169 -0.558034 -0.870080 0.394558 0.537783 0.047991 0.575487 -1.571345 0.476647 -0.003466
Hidden Unit 3: 0.952550 -2.791922 1.133562 0.518217 1.647397 -1.801673 -1.518900 -0.245973 0.450328 -0.169588 -1.979129
Hidden Unit 4: -1.720175 1.247111 1.095436 0.365523 0.350067 0.584151 0.773993 1.216627 -1.174810 -1.624518 2.342727
Hidden Unit 5: -1.217552 2.288170 -1.088214 -0.389681 -0.919714 1.168223 0.579115 1.039906 1.499586 -2.902985 2.754642

The weights between the five hidden units and the two output units are as follows:

Output Unit 1 (all 5 hidden units): -2.299536 -0.933331 2.137592 -2.556154 -4.569341
Output Unit 2: 2.235369 -0.597022 -3.967368 1.887921 3.682286

A random number generator was used to create the initial population of five chromosomes for the detection of rules, where an extra gene is added to the end of the chromosome to represent one of the two output class values. The alleles for this gene are either 1 or 2 (to represent the output node values of 10 (sunburned) and 01 (not sunburned)). The following decisions were taken:

1. The fittest chromosome of each generation goes through to the next generation.
2. The next chromosome is chosen at random, but a greater fitness gives a greater chance of being chosen. Negative fitnesses were not included. (A ‘roulette wheel’ selection.)
3. The remaining four chromosomes are created as a mutation of the two chosen above and crossover on these same two. Duplicate chromosomes are removed.
4. Fitness was computed simply as Weight(input_to_hidden)*Weight(hidden_to_output). The more positive the number, the greater the fitness.

An example run (first three generations only) for extracting rules dealing with the first output node only (i.e.
for sunburn cases only) is given in Figure 5.

Figure 5 – First three generations of chromosome evolution in the extraction of rules dealing with sunburn cases (output node 1) only

Results

A traditional symbolic learning algorithm running on this dataset will find the following four rules:

(a) If person has red hair then person is sunburned;
(b) If person is brown haired then person is not sunburned;
(c) If person has blonde hair and no lotion used then person is sunburned; and
(d) If person has blonde hair and lotion used then person is not sunburned.

Our approach identified the following five single condition rules in ten generations, with a maximum population of 6 in each generation:

(i) ‘IF unit 1 is 1 THEN output is 1 (fitness 4.667)’, which corresponds to: ‘IF hair colour=blonde THEN result is sunburned’. The fitness here is calculated as follows: input unit 1 to hidden unit 1 weight of -2.029721 * hidden unit 1 to output unit 1 weight of -2.299536.

(ii) ‘IF unit 3 is 1 THEN output is 1 (fitness 3.908)’, which corresponds to ‘IF hair colour=red THEN result is sunburned’ (input unit 3 to hidden unit 1 weight of -1.702274 * hidden unit 1 to output unit 1 weight of -2.299536).

(iii) ‘IF unit 10 is 1 THEN output is 1 (fitness 4.154)’, which corresponds to ‘IF no lotion used THEN result is sunburned’ (input unit 10 to hidden unit 4 weight of -1.624518 * hidden unit 4 to output weight of -2.556154).

(iv) ‘IF unit 2 is 1 THEN output is 2 (fitness 8.43)’, which corresponds to: ‘IF hair colour=brown THEN result is not sunburned’ (input unit 2 to hidden unit 5 weighting of 2.288170 * hidden unit 5 to output unit 2 weighting of 3.682286, with rounding).

(v) ‘IF unit 11 is 1 THEN output is 2 (fitness 10.12)’, which corresponds to ‘IF lotion used THEN result is not sunburned’ (input unit 11 to hidden unit 5 weighting of 2.754642 * hidden unit 5 to output unit 2 weighting of 3.682286, with rounding).
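The weight products behind rules (i) to (v) can be recomputed directly from the recorded weights listed earlier for this experiment. The Python sketch below transcribes those weights and evaluates the fitness of a given input, hidden and output path; the names `W_INPUT_HIDDEN`, `W_HIDDEN_OUTPUT`, `path_fitness` and `CITED_PATHS` are hypothetical, and the sketch only checks the arithmetic, it does not rerun the genetic algorithm.

```python
# W_INPUT_HIDDEN[h][i]: weight from input unit i+1 to hidden unit h+1 (Experiment 1).
W_INPUT_HIDDEN = [
    [-2.029721, 1.632389, -1.702274, -1.369853, 0.133539, 0.296253,
     -0.465295, 0.680639, -0.610233, -1.432447, -1.462687],
    [0.960469, 1.304169, -0.558034, -0.870080, 0.394558, 0.537783,
     0.047991, 0.575487, -1.571345, 0.476647, -0.003466],
    [0.952550, -2.791922, 1.133562, 0.518217, 1.647397, -1.801673,
     -1.518900, -0.245973, 0.450328, -0.169588, -1.979129],
    [-1.720175, 1.247111, 1.095436, 0.365523, 0.350067, 0.584151,
     0.773993, 1.216627, -1.174810, -1.624518, 2.342727],
    [-1.217552, 2.288170, -1.088214, -0.389681, -0.919714, 1.168223,
     0.579115, 1.039906, 1.499586, -2.902985, 2.754642],
]
# W_HIDDEN_OUTPUT[o][h]: weight from hidden unit h+1 to output unit o+1.
W_HIDDEN_OUTPUT = [
    [-2.299536, -0.933331, 2.137592, -2.556154, -4.569341],
    [2.235369, -0.597022, -3.967368, 1.887921, 3.682286],
]


def path_fitness(input_unit, hidden_unit, output_unit):
    """Weight(input -> hidden) * Weight(hidden -> output), with 1-based unit numbers."""
    return (W_INPUT_HIDDEN[hidden_unit - 1][input_unit - 1]
            * W_HIDDEN_OUTPUT[output_unit - 1][hidden_unit - 1])


# (input unit, hidden unit, output unit) for rules (i) to (v) as described in the text.
CITED_PATHS = {"i": (1, 1, 1), "ii": (3, 1, 1), "iii": (10, 4, 1),
               "iv": (2, 5, 2), "v": (11, 5, 2)}

for label, path in CITED_PATHS.items():
    print(f"rule ({label}): {path_fitness(*path):.3f}")
```

Direct multiplication reproduces the quoted 4.667 for rule (i) and 8.43 for rule (iv); the remaining products agree with the quoted fitnesses to about two significant figures, which is consistent with the rounding mentioned in the rule descriptions.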
Figure 5 shows that, for the sunburnt cases (rules (i) – (iii) above), there is early convergence (within three generations) to these rules. The fitness values cited in the rule set above may not be the maximum attainable but are nevertheless significantly above 0.

Experiment 2

Another toy example was chosen from the machine learning literature, again with only 8 records and four attributes (Figure 6).

Run   Supervisor   Overtime   Operator   Output
1     Sally        Yes        Joe        High
2     John         No         Samantha   High
3     Sally        Yes        Joe        High
4     John         No         Joe        Low
5     Sally        Yes        Samantha   High
6     Patrick      No         Samantha   Low
7     Sally        Yes        Joe        High
8     Patrick      No         Samantha   Low

Figure 6: Second experimental dataset

The conversion between data and neural network representation was performed as before (Figure 7).

Attribute    Value      Code
Supervisor   Sally      100
Supervisor   John       010
Supervisor   Patrick    001
Overtime     Yes        10
Overtime     No         01
Operator     Joe        10
Operator     Samantha   01
Output       High       10
Output       Low        01

Figure 7: Conversion of second dataset into a neural network format

The rules involved in this classification are complex and there is some repetition so that only very few records actually make a contribution to a rule. Symbolic algorithms do not produce good results over this data set. See5 creates the ruleset:

IF overtime = Yes THEN output = High [0.833]
IF overtime = No THEN output = Low [0.667]

CN2 creates these single-condition rules, along with some dual condition rules:

IF supervisor = Sally THEN output = High [0 4]
IF supervisor = Patrick THEN output = Low [2 0]

where the numbers in brackets signify how many cases of each class are captured by that rule. For instance, ‘[0 4]’ after the first rule above signifies that this rule captures none of the low output cases and 4 of the high output cases. The ANN with 7 input, 4 hidden and 2 output units was trained over a series of 1522 epochs to achieve a mean squared error of 0.040. Below is the weight space for the network.
Hidden Unit 1 (all seven input to hidden connections): -0.836101 -0.437469 -0.972496 -0.977659 0.265379 -0.459824 0.313158
Hidden Unit 2: -2.508566 -2.855611 1.858439 -1.711295 2.86410 2.675891 -1.834709
Hidden Unit 3: 1.726850 0.421753 -0.725803 1.372710 -1.471043 0.338697 0.652326
Hidden Unit 4: -1.738682 -1.385388 2.255858 -0.626335 2.316902 0.007883 -3.285211

Output Unit 1 (all four hidden to output connections): 0.491153 -4.961958 2.423375 -2.589325
Output Unit 2: -0.687410 4.479441 -2.092269 3.477822

The genetic algorithm was started with a population of 10 and run for just 20 generations. The top rules for each classification were as follows:

IF Supervisor = John THEN output = High (12.948)
IF Supervisor = Sally THEN output = High (10.966)
IF Operator = Samantha THEN output = High (7.847)
IF Overtime = No THEN output = Low (11.498)
IF Operator = Joe THEN output = Low (10.706)
IF Supervisor = Patrick THEN output = Low (7.120)

As before, the fitness measures for each rule are quoted to allow decisions to be made as to the validity of each of the rules. As can be seen from the ruleset, the results from the symbolic algorithms have largely been reproduced and the algorithm has also found some extra rules.

Experiment 3

The dataset used was the mushroom dataset - a well-known collection of data used for classifying mushrooms into an edible or poisonous class. The data contains 125 categories spanning 23 attributes. As before, the data was converted into a neural network input format. The network was first trained on this full dataset for 41 epochs, reaching an error of 0.0161. However, the test results from these runs were very poor, which prompted an investigation of the network weights, revealing that the network was not learning successfully. Several solutions to this problem were hypothesised and implemented with little success.
The problem turned out to be that the data set has a large number of unused categories and these were translated along with the rest of the data, resulting in a network with a very sparse distribution of information since over half of the categories were not present. These categories were eliminated from the data and a smaller network with 30 hidden units was trained on the smaller 62 category data set for 69 epochs. The error was higher than before at 0.03 but testing was, on average, better. The genetic algorithm was run for 100 iterations with a population of 20. There were 7 operations per population, 4 crossover and 3 mutation. The mutation rate was randomly set between –40 and +40. The rules found by the GAs were encouragingly similar to those found by traditional algorithms, but the system also supplemented the most obvious rules with some previously undiscovered ones, exclusive to our approach:

IF odour=p THEN poisonous (max 2.23) (found by CN2 and See5)
IF gill-size=n THEN poisonous (max 1.13) (exclusive)
IF stalk-root=e THEN poisonous (max 1.13) (exclusive)
IF gill-size=b THEN edible (max 2.3) (found by CN2)
IF odour=n THEN edible (max 1.58) (exclusive)
IF cap-surface=f THEN edible (max 1.58) (found by CN2)

The weightings specify maximum values since each rule surfaces frequently in the rule list with different fitness values, depending on which hidden unit the input was connected to. The rules correlate well with the ones found by traditional packages. In fact, they are almost identical to the rules found by CN2. The exciting aspect here is that there are some totally new rules extracted regarding each classification. The algorithms used in traditional classification programs found only the odour=p rule for poisonous classification, whereas our approach found two other rules.
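Reporting a maximum weighting per rule amounts to grouping the extracted rules by condition and class and keeping the largest fitness in each group. A minimal Python sketch of that bookkeeping is shown below. The function name `best_weighting_per_rule` is hypothetical, and the example list mixes two maxima quoted above with invented lower values (0.90 and 0.41) that stand in for the same rule surfacing through other hidden units.

```python
def best_weighting_per_rule(scored_rules):
    """Keep the highest fitness seen for each (condition, class) pair.

    scored_rules is an iterable of (condition, class_name, fitness) tuples, one per
    chromosome, so the same rule may appear several times via different hidden units.
    """
    best = {}
    for condition, class_name, fit in scored_rules:
        key = (condition, class_name)
        if key not in best or fit > best[key]:
            best[key] = fit
    # Strongest rules first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Illustrative input: 2.23 and 1.13 are the maxima quoted in the text; 0.90 and 0.41
# are made-up values for the same rules reached through other hidden units.
extracted = [
    ("odour=p", "poisonous", 0.90),
    ("gill-size=n", "poisonous", 1.13),
    ("odour=p", "poisonous", 2.23),
    ("gill-size=n", "poisonous", 0.41),
]

summary = best_weighting_per_rule(extracted)
for (condition, class_name), weighting in summary:
    print(f"IF {condition} THEN {class_name} (max {weighting})")
```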
The need to adapt the neural network to deal with a subset of the original data highlights an inherent problem in any approach which attempts to integrate neural network learning with symbolic rule extraction: the genetic algorithm can only generate rules from the neural network if they already exist. If a network has not been trained properly on the data set then the algorithm will not find the required associations. This means that users must be very sure that the trained network is an accurate model of the domain they are trying to mine. If this is not the case then the system will find spurious rules.

Discussion

Work is currently underway to amend the chromosome representation to extract two-condition and multi-condition rules from the neural network trained on the mushroom dataset, as well as to improve the behaviour of the trained neural network even further when tested with examples not previously seen. It is an open question as to how well the trained neural network has to perform on unseen examples before the process of rule extraction can begin. Together, the preliminary results reported here provide evidence of the feasibility of integrating GAs with trained neural networks, both technically and in terms of efficiency. The approach can be scaled up easily, with the major constraint on scale being the accuracy of the trained neural network when dealing with large datasets. What was particularly interesting was the extraction of rules not captured by traditional symbolic learning techniques. While such rules may not be totally accurate in that they do not capture all or even most of the samples in a dataset, there is no doubt that the approach outlined here can perform the useful function of extracting rules which lie at the periphery of domain expertise or which capture exceptions (which can then be further analysed to identify reasons for being exceptions). One of the major advantages of this