Data mining neural networks with genetic algorithms
Ajit Narayanan, Edward Keedwell and Dragan Savic
School of Engineering and Computer Science
University of Exeter
Exeter EX4 4PT
United Kingdom
ajit@dcs.ex.ac.uk
tel: (+)1392 264064
Abstract
It is an open question as to what is the best way to extract symbolic rules from trained
neural networks in domains involving classification. Previous approaches based on an
exhaustive analysis of network connection and output values have already been
demonstrated to be intractable in that the scale-up factor increases exponentially with the
number of nodes and connections in the network. A novel approach using genetic
algorithms to search for symbolic rules in a trained neural network is demonstrated in
this paper. Preliminary experiments involving classification are reported here, with the
results indicating that our proposed approach is successful in extracting rules. While it is
accepted that further work is required to convincingly demonstrate the superiority of our
approach over others, there is nevertheless sufficient novelty in these results to justify
early dissemination. (If the paper is accepted, the latest results will be reported, together
with sufficient information to aid replicability and verification.)
Introduction
Artificial neural networks (ANNs) are increasingly used in problem domains
involving classification. They are adept at finding commonalities in a set of seemingly
unrelated data and for this reason are used in a growing number of classification tasks.
Unfortunately, a commonly perceived problem with ANNs when used for classification is
that, while a trained ANN can indeed classify the data, sometimes with more accuracy
than a traditional, symbolic machine learning approach, the reasons for their
classification cannot be found easily. Trained ANNs are commonly perceived to be
‘black boxes’ which map input data onto a class through a number of mathematically
weighted connections between layers of neurons. While the idea of ANNs as black boxes
may not be a problem in applications where there is little interest in the reasons behind
classification, this can be a major obstacle in applications where it is important to have
symbolic rules or other forms of knowledge structure, such as identification or decision
trees, which are easily interpretable by human experts. In particular, it may be important
to identify knowledge not previously known to domain experts and which may therefore
lie at the periphery of domain expertise. Also, safety-critical systems (such as air traffic
control or missile firing) which use neural networks successfully to classify data face
difficulty in being accepted because of the reluctance by managers and administrators to
accept a system which is not open to symbolic verification. Often, there is a legal
requirement that such safety-critical systems be demonstrated to be correct to a certain
degree of confidence. It is often claimed that neural networks, because of their plasticity
and use of soft constraints, can handle noisy data better than their symbolic counterparts
and should therefore be used precisely in those areas which are likely to benefit most
from their application, such as safety-critical systems and data mining.
In general, an ANN can be said to make its decisions by using the activation of the units
(input and hidden) combined with the weights of the connections between these units.
The topology of the network can also be used. Andrews et al. (1996) identify three types
of rule extraction techniques: ‘decompositional’, ‘pedagogical’ and ‘eclectic’, each of
which refers to a different method of extracting information from the network. A
decompositional approach is distinguished by its focus on extracting rules at the level of
individual (hidden and output) units. The computed output from each hidden and output
unit is mapped onto a binary ‘yes/no’ outcome corresponding to the notion of a rule
consequent. The major problem with this approach is the apparent exponential behaviour
of associated algorithms (Towell and Shavlik, 1993). Extracting rules from complex
ANNs may therefore be intractable. A pedagogical approach is distinguished by its
treatment of a trained ANN as a ‘black box’ where the knowledge to be extracted deals
directly with the way that input is mapped onto output by the internal weights (i.e. no
‘yes/no rules’ are extracted – just rules dealing with the changes in the levels of the input
and output units). The major problem with this approach is the sheer number of rules
generated for even the simplest domains. Finally, the eclectic approach is characterised
by any use of knowledge concerning the internal architecture and/or weight vectors in a
trained ANN to complement a symbolic learning algorithm. There is currently very little
understanding of available methods for constructing an eclectic approach, of the domains
where eclectic approaches may outperform their traditional symbolic and ANN
counterparts, and how to evaluate the results of an eclectic approach.
In this paper we propose a novel, evolutionary eclectic approach which integrates
traditional ANNs with genetic algorithms for extracting simple, intelligible and useful
rules from trained ANNs. It is claimed that this approach adopts the advantages of ANNs
(gradual, incremental training which overcomes inconsistencies and ambiguities in the
data) as well as symbolic learning (intelligible output, rules for verification). In brief, the
paper proposes the use of a genetic algorithm to search the weight space of a trained
neural network to identify the best rules for classification. The genetic algorithm uses
chromosomes which can be mapped directly onto intelligible rules (phenotypes).
Two major constraints are the following. First, the goal of many rule-extraction
techniques is to find a comprehensive rule base for the network so that it can be encoded
as a set of ‘expert system’ rules in which the attributes causing a particular classification
can be precisely and fully determined. In this paper we propose that this is not necessary
in the majority of applications. Algorithms attempting to produce comprehensive rule sets
have a tendency to become exponential in complexity as network size increases. This
has been recognised by researchers, and in a recent paper (Arbatli and Akin, 1997) the
search space available to the symbolic algorithm has been decreased by optimizing the
topology of the network using genetic algorithms. The approach described here differs in
that it uses GAs to search a trained neural network for the extraction of symbolic rules
directly and not to optimise the network for another set of rule extraction techniques to
be applied. Secondly, the experiments below have been performed on categorical rather
than continuous data. Many datasets of significance in the real world do indeed have
continuous attributes, but datasets with large numbers of unpartitioned continuous
attributes are unlikely to be successfully classified by a neural network in any case.
The Genetic Algorithm/Neural Network System
The starting point of any rule-extraction system is to train the network on the data in question, i.e. the ANN is trained until a satisfactory error level is reached. For
classification problems, each input unit typically corresponds to a single feature in the
real world, and each output unit to a class value or class. The first objective of our
approach is to encode the network in such a way that a genetic algorithm can be run over
the top of it. This is achieved by creating an n-dimensional weight space where n is the
number of layers of weights. The network can be represented by simply enumerating
each of the nodes and/or connections. For example, Figure 1 depicts a simple neural
network with five input units (input features, data attributes), three hidden units, and one
output unit (class or class value), with each node enumerated in this case except the
output. Typically, there will be more than one output class or class value and therefore
more than one output node.
Figure 1 - A typical encoding of a simple
neural network with only one class value
(one output node)
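As a concrete illustration, the enumerated network can be thought of as a pair of weight matrices. The following is a minimal C++ sketch of such an encoding (the struct and field names are our own illustrative assumptions, not part of the system described here):

```cpp
#include <vector>

// Minimal sketch of the encoded network of Figure 1 held as two weight
// matrices (names and layout are illustrative assumptions).
// weightsIH[i][h] is the weight from input node i to hidden node h;
// weightsHO[h][o] is the weight from hidden node h to output node o.
struct EncodedNetwork {
    std::vector<std::vector<double>> weightsIH; // input -> hidden weights
    std::vector<std::vector<double>> weightsHO; // hidden -> output weights
};
```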
From this encoding, genes can be created which, in turn, are used to construct
chromosomes where there is at least one gene representing a node at the input layer and
at least one gene representing a node at the hidden layer. A typical chromosome for the
network depicted in Figure 1 could look something like this (Figure 2):
Figure 2 - A typical chromosome generated from
the encoded network for only one class value
This chromosome corresponds to the fifth unit in the input layer and the third unit in the
hidden layer. That is, the first gene contains the weight connecting input node 5 to hidden
unit 3, and the second gene contains the weight connecting hidden unit 3 to the output
class. Fitness is computed as a direct function of the weights which the chromosome
represents. For chromosomes containing just two genes (one for the input unit, the other
for the hidden unit), the fitness function is:
Fitness = Weight(Input → Hidden) * Weight(Hidden → Output)

where ‘→’ signifies the weight between the two enumerated nodes. So the fitness of the chromosome in Figure 2 is:

Fitness = Weight(5 → 3) * Weight(3 → Output)
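In code, this fitness computation is a single lookup-and-multiply. A minimal sketch, assuming the two weight matrices from the encoding sketch above and zero-based indices:

```cpp
#include <vector>

// Fitness of a two-gene chromosome: the weight from input node i to
// hidden node h, multiplied by the weight from hidden node h to output
// node o (zero-based indices into the matrices sketched earlier).
double fitness(const std::vector<std::vector<double>>& weightsIH,
               const std::vector<std::vector<double>>& weightsHO,
               int i, int h, int o) {
    return weightsIH[i][h] * weightsHO[h][o];
}
```

The chromosome of Figure 2 would then score fitness(weightsIH, weightsHO, 4, 2, 0), i.e. Weight(5 → 3) * Weight(3 → Output) with zero-based indexing.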
This fitness is computed for an initial set of random chromosomes, and the population is
sorted according to fitness. An elitist strategy is then used whereby a subset of the top
chromosomes is selected for inclusion in the next generation. Crossover and mutation
are then performed on these chromosomes to create the rest of the next population.
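A generation step under this elitist strategy might look like the following sketch (the operator details here are our own assumptions; the exact crossover and mutation choices used in the experiments are described later):

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

struct Chromosome {
    int inputNode;   // gene 1: enumerated input node
    int hiddenNode;  // gene 2: enumerated hidden node
    double fitness;  // product of the two weights the genes select
};

// One generation under an elitist strategy: sort by fitness, carry the
// top eliteCount chromosomes forward unchanged, and rebuild the rest of
// the population by crossover and mutation of the elite (a sketch only;
// new fitnesses would be recomputed from the network weights afterwards).
void nextGeneration(std::vector<Chromosome>& pop, int eliteCount,
                    int numInputs, int numHidden) {
    std::sort(pop.begin(), pop.end(),
              [](const Chromosome& a, const Chromosome& b) {
                  return a.fitness > b.fitness;
              });
    for (std::size_t i = eliteCount; i < pop.size(); ++i) {
        const Chromosome& a = pop[std::rand() % eliteCount];
        const Chromosome& b = pop[std::rand() % eliteCount];
        if (std::rand() % 2) {
            // crossover: gene 1 from one elite parent, gene 2 from another
            pop[i] = {a.inputNode, b.hiddenNode, 0.0};
        } else if (std::rand() % 2) {
            // mutation: replace the input-node gene with a random node
            pop[i] = {std::rand() % numInputs + 1, a.hiddenNode, 0.0};
        } else {
            // mutation: replace the hidden-node gene with a random node
            pop[i] = {a.inputNode, std::rand() % numHidden + 1, 0.0};
        }
    }
}
```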
The chromosome is then easily converted into IF…THEN rules with an attached
weighting. This is achieved by using the template: ‘IF <gene1> THEN output is
<class> (weighting)’, with the weighting being the fitness of the gene and the class signifying which output unit is being switched on. The weighting is a major part of the rule
generation procedure because the value of this is a direct measure of how the network
interprets the data. Since ‘Gene 1’ above corresponds to the weight between an input unit
and a hidden unit, the template is essentially stating that the consequent of the rule is
caused by the activation on that particular input node and its connection to a hidden unit
(not specified explicitly in the rule). The rule template above therefore allows the
extraction of single-condition rules. The number of extracted rules in each population can
be set by the user, according to the complexity of the network and/or the data. A larger
number of rules will yield less fit chromosomes and thus less important rules. This
property is essential in extracting rules which represent knowledge at the periphery of
expertise.
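Rendering a chromosome with this template is straightforward. A sketch, reusing the illustrative Chromosome struct from above (attribute names such as ‘hair colour=blonde’ would come from the dataset’s encoding table):

```cpp
#include <sstream>
#include <string>

// Render a chromosome as a single-condition rule using the template
// 'IF <gene1> THEN output is <class> (weighting)' from the text.
std::string toRule(const Chromosome& c, int outputClass) {
    std::ostringstream rule;
    rule << "IF unit" << c.inputNode << " is 1 THEN output is "
         << outputClass << " (" << c.fitness << ")";
    return rule.str();
}
```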
Experimentation
Three experiments are described here. The first two experiments use a toy
example to show that our approach can find rules comparable to those found with purely
symbolic methods of data-mining. The third experiment was performed on a larger data
set to show that this method is generalisable to real-world domains. All GA programs are
written in C++. Neural network packages used were Neurodimensions’ Neurosolutions
v3.0 and Thinkspro v1.05 by Logical Designs Consulting.
Experiment 1
The dataset refers to named individuals for whom there are four attributes and
two possible class values (Figure 3 - adapted from Winston, 1992):
Name Hair Height Weight Lotion Result
Sarah Blonde Average Light No Sunburned
Dana Blonde Tall Average Yes Not sunburned
Alex Brown Short Average Yes Not sunburned
Annie Blonde Short Average No Sunburned
Emily Red Average Heavy No Sunburned
Pete Brown Tall Heavy No Not sunburned
John Brown Average Average No Not sunburned
Katie Blonde Short Light Yes Not sunburned
Figure 3 - The Sunburn Dataset
This dataset is converted as follows into a form suitable for input to the ANN (Figure 4):
Hair Blonde 100
Brown 010
Red 001
Height Short 100
Average 010
Tall 001
Weight Light 100
Average 010
Heavy 001
Lotion No 10
Yes 01
Class Sunburned 10
Not sunburned 01
Figure 4 - Neural network conversion of the data in Figure 3.
One example of input is therefore: 10001010010, which represents a blonde-haired (100), average-height (010), light-weight (100), no-lotion-used (10) individual (i.e. Sarah). Note that we
are dealing with a supervised learning network, where the class in which the sample falls
is explicitly represented for training purposes. So, in the case of Sarah, the output 10
(sunburned) is used for supervised training. ‘10’ here signifies that the first output node is
switched on and the second is not. A neural network with 11 input, 5 hidden and 2
output units was created. The input to the network was a string of 0’s and 1’s which
corresponded to the records in the data set above. The network was then trained (using
back-propagation) until a mean square error of 0.001 was achieved. The network
weights were then recorded and the genetic algorithm process started. The weights
between the 11 input and 5 hidden units are as follows:
Hidden Unit 1 (all eleven input units):
-2.029721 1.632389 -1.702274 -1.369853 0.133539 0.296253 -0.465295 0.680639 -0.610233 -1.432447 -1.462687
Hidden Unit 2:
0.960469 1.304169 -0.558034 -0.870080 0.394558 0.537783 0.047991 0.575487 -1.571345 0.476647 -0.003466
Hidden Unit 3:
0.952550 -2.791922 1.133562 0.518217 1.647397 -1.801673 -1.518900 -0.245973 0.450328 -0.169588 -1.979129
Hidden Unit 4:
-1.720175 1.247111 1.095436 0.365523 0.350067 0.584151 0.773993 1.216627 -1.174810 -1.624518 2.342727
Hidden Unit 5:
-1.217552 2.288170 -1.088214 -0.389681 -0.919714 1.168223 0.579115 1.039906 1.499586 -2.902985 2.754642
The weights between the five hidden units and the two output units are as follows:
Output Unit 1 (all 5 hidden units):
-2.299536 -0.933331 2.137592 -2.556154 -4.569341
Output Unit 2:
2.235369 -0.597022 -3.967368 1.887921 3.682286
A random number generator was used to create the initial population of five
chromosomes for the detection of rules, where an extra gene is added to the end of the
chromosome to represent one of the two output class values. The alleles for this gene are
either 1 or 2 (to represent the output node values of 10 (sunburned) and 01 (not sunburned)).
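Creating this initial population might look like the following sketch (field names and ranges are ours, taken from the 11-5-2 network described above):

```cpp
#include <cstdlib>
#include <vector>

// A chromosome for Experiment 1: an input-node gene (1-11), a
// hidden-node gene (1-5) and a class gene whose allele is 1
// (sunburned, output 10) or 2 (not sunburned, output 01).
struct SunburnChromosome {
    int inputNode;   // 1..11
    int hiddenNode;  // 1..5
    int classAllele; // 1 or 2
};

// Build the initial random population (five chromosomes in this run).
std::vector<SunburnChromosome> initialPopulation(int size) {
    std::vector<SunburnChromosome> pop;
    for (int i = 0; i < size; ++i) {
        pop.push_back({std::rand() % 11 + 1,   // random input node
                       std::rand() % 5 + 1,    // random hidden node
                       std::rand() % 2 + 1});  // class allele 1 or 2
    }
    return pop;
}
```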
The following decisions were taken:
1. The fittest chromosome of each generation goes through to the next generation.
2. The next chromosome is chosen at random, with greater fitness giving a greater chance of being chosen; negative fitnesses were excluded (a ‘roulette wheel’ selection, sketched after this list).
3. The remaining four chromosomes are created by mutating the two chromosomes chosen above and by crossover between these same two. Duplicate chromosomes are removed.
4. Fitness was computed simply as Weight(input_to_hidden)*Weight(hidden_to_output); the more positive the number, the greater the fitness.
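A sketch of the roulette-wheel selection in decision 2, with negative fitnesses excluded from the wheel (an illustration under our own assumptions, not the authors' exact code):

```cpp
#include <cstdlib>
#include <vector>

// Roulette-wheel selection: the chance of picking a chromosome is
// proportional to its fitness; chromosomes with non-positive fitness
// are excluded. Returns the selected index, or -1 if none qualifies.
int rouletteSelect(const std::vector<double>& fitnesses) {
    double total = 0.0;
    for (double f : fitnesses)
        if (f > 0.0) total += f;
    double spin = total * (std::rand() / (RAND_MAX + 1.0));
    for (std::size_t i = 0; i < fitnesses.size(); ++i) {
        if (fitnesses[i] <= 0.0) continue;
        spin -= fitnesses[i];
        if (spin <= 0.0) return static_cast<int>(i);
    }
    return -1; // no chromosome with positive fitness
}
```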
An example run (first three generations only) for extracting rules dealing with the first
output node only (i.e. for sunburn cases only) is given in Figure 5.
Results
A traditional symbolic learning algorithm running on this dataset will find the following
four rules: (a) If person has red hair then person is sunburned; (b) If person is brown
haired then person is not sunburned; (c) If person has blonde hair and no lotion used then
person is sunburned; and (d) If person has blonde hair and lotion used then person is not
sunburned. Our approach identified the following five single-condition rules in ten
generations, with a maximum population of 6 in each generation:
(i) ‘IF unit1 is 1 THEN output is 1 (fitness 4.667)’, which corresponds to: ‘IF hair
colour=blonde THEN result is sunburned’. The fitness here is calculated as follows:
input unit 1 to hidden unit 1 weight of -2.029721 * hidden unit 1 to output unit 1 weight of -2.299536.
Figure 5 – First three generations of chromosome evolution in the extraction of rules
dealing with sunburn cases (output node 1) only
(ii) ‘IF unit 3 is 1 THEN output is 1 (fitness 3.908)’, which corresponds to `IF hair
colour=red THEN result is sunburned’ (input unit 3 to hidden unit 1 weight of -1.702274
* hidden unit 1 to output unit 1 weight of -2.299536).
(iii) ‘IF unit 10 is 1 THEN output is 1 (fitness 4.154)’, which corresponds to ‘IF no lotion used THEN result is sunburned’ (input unit 10 to hidden unit 4 weight of -1.624518 * hidden unit 4 to output weight of -2.556154).
(iv) ‘IF unit 2 is 1 THEN output is 2 (fitness 8.43)’, which corresponds to: ‘IF hair colour=brown THEN result is not sunburned’ (input unit 2 to hidden unit 5 weighting of 2.288170 * hidden unit 5 to output unit 2 weighting of 3.682286, with rounding).
(v) ‘IF unit 11 is 1 THEN output is 2 (fitness 10.12)’, which corresponds to ‘IF lotion
used THEN result is not sunburned’ (input unit 11 to hidden unit 5 weighting of
2.754642 * hidden unit 5 to output unit 2 weighting of 3.682286, with rounding).
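As a check, the quoted fitness values can be reproduced directly from the weight listings given earlier (the products differ very slightly from some quoted figures because the printed weights are themselves rounded):

```cpp
#include <cstdio>

// Reproduce the quoted rule fitnesses from the printed weights.
int main() {
    std::printf("rule (i):   %f\n", -2.029721 * -2.299536); // approx. 4.667
    std::printf("rule (ii):  %f\n", -1.702274 * -2.299536); // approx. 3.91
    std::printf("rule (iii): %f\n", -1.624518 * -2.556154); // approx. 4.15
    std::printf("rule (iv):  %f\n",  2.288170 *  3.682286); // approx. 8.43
    std::printf("rule (v):   %f\n",  2.754642 *  3.682286); // approx. 10.14
    return 0;
}
```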
Figure 5 shows that, for the sunburnt cases (rules (i) – (iii) above), there is early
convergence (within three generations) to these rules. The fitness values cited in the rule
set above may not be the maximum attainable but are nevertheless significantly above 0.
Experiment 2
Another toy example was chosen from the machine learning literature, again with only 8 records, this time with three attributes (Figure 6).
Dataset
Run Supervisor Overtime Operator Output
1 Sally Yes Joe High
2 John No Samantha High
3 Sally Yes Joe High
4 John No Joe Low
5 Sally Yes Samantha High
6 Patrick No Samantha Low
7 Sally Yes Joe High
8 Patrick No Samantha Low
Figure 6: Second experimental dataset
The conversion between data and neural network representation was performed as before
(Figure 7).
Supervisor Sally 100
John 010
Patrick 001
Overtime Yes 10
No 01
Operator Joe 10
Samantha 01
Output High 10
Low 01
Figure 7: Conversion of second dataset into a neural network format
The rules underlying this classification are complex, and there is some repetition in the data, so very few records actually contribute to a rule. Symbolic algorithms do not produce good results on this data set. See5 creates the ruleset:
IF overtime = Yes THEN output = High [0.833]
IF overtime = No THEN output = Low [0.667]
CN2 creates these single-condition rules, along with some dual condition rules:
IF supervisor = Sally THEN output = High [0 4]
IF supervisor = Patrick THEN output = Low [2 0]
where the numbers in brackets signify how many cases of each class are captured by that rule. For instance, ‘[0 4]’ after the first rule above signifies that this rule captures none of the low output cases and 4 of the high output cases. An ANN with 7 input, 4
hidden and 2 output units was trained over a series of 1522 epochs to achieve a mean
squared error of 0.040. Below is the weight space for the network.
Hidden Unit 1 (all seven input to hidden connections)
-0.836101 -0.437469 -0.972496 -0.977659 0.265379 -0.459824 0.313158
Hidden Unit 2
-2.508566 -2.855611 1.858439 -1.711295 2.86410 2.675891 -1.834709
Hidden Unit 3
1.726850 0.421753 -0.725803 1.372710 -1.471043 0.338697 0.652326
Hidden Unit 4
-1.738682 -1.385388 2.255858 -0.626335 2.316902 0.007883 -3.285211
Output Unit 1 (all four hidden to output connections)
0.491153 -4.961958 2.423375 -2.589325
Output Unit 2
-0.687410 4.479441 -2.092269 3.477822
The genetic algorithm was started with a population of 10 and run for just 20 generations.
The top rules for each classification were as follows:
IF Supervisor = John THEN output = High (12.948)
IF Supervisor = Sally THEN output = High (10.966)
IF Operator = Samantha THEN output = High (7.847)
IF Overtime = No THEN output = Low (11.498)
IF Operator = Joe THEN output = Low (10.706)
IF Supervisor = Patrick THEN output = Low (7.120)
As before, the fitness measures for each rule are quoted to allow decisions to be made as
to the validity of each of the rules. As can be seen from the ruleset, the results from the
symbolic algorithms have largely been reproduced and the algorithm has also found some
extra rules.
Experiment 3
The dataset used was the mushroom dataset - a well-known collection of data
used for classifying mushrooms into an edible or poisonous class. The data contains 125
categories spanning 23 attributes.
As before, the data was converted into a neural network input format. The network was first trained on this full dataset for 41 epochs, reaching an error of 0.0161. However, the test results from these runs were very poor, prompting an investigation of the network weights, which revealed that the network was not learning successfully. Several solutions to
this problem were hypothesised and implemented with little success. The problem turned
out to be that the data set has a large number of unused categories and these were
translated along with the rest of the data, resulting in a network with a very sparse
distribution of information since over half of the categories were not present. These
categories were eliminated from the data and a smaller network with 30 hidden units was
trained on the smaller 62 category data set for 69 epochs. The error was higher than
before at 0.03 but testing was, on average, better. The genetic algorithm was run for 100
iterations with a population of 20. There were 7 operations per population: 4 crossovers and 3 mutations. The mutation amount was set randomly between –40 and +40. The rules
found by the GAs were encouragingly similar to those found by traditional algorithms,
but the system also supplemented the most obvious rules with some previously
undiscovered ones, exclusive to our approach:
IF odour=p THEN poisonous. (max 2.23) (found by CN2 and See5)
IF gill-size=n THEN poisonous. (max 1.13) (exclusive)
IF stalk-root = e THEN poisonous (max 1.13) (exclusive)
IF gill-size=b THEN edible. (max 2.3) (found by CN2)
IF odour=n THEN edible (max 1.58) (exclusive)
IF cap-surface=f THEN edible (max 1.58) (found by CN2)
The weightings quoted are maximum values, since the same rules surface repeatedly in the rule list with different fitness values depending on which hidden unit the input is connected to. The rules correlate well with those found by traditional packages; in fact, they are almost identical to the rules found by CN2. The exciting aspect is that entirely new rules were extracted for each classification: the algorithms used in traditional classification programs found only the odour=p rule for the poisonous classification, whereas our approach found two further rules.
The need to adapt the neural network to deal with a subset of the original data highlights
an inherent problem in any approach which attempts to integrate neural network learning
with symbolic rule extraction: the genetic algorithm can only extract rules whose associations already exist in the neural network. If a network has not been trained properly on the data set, then the algorithm will not find the required associations. This means that users
must be very sure that the trained network is an accurate model of the domain they are
trying to mine. If this is not the case then the system will find spurious rules.
Discussion
Work is currently underway to amend the chromosome representation to extract two-
condition and multi-condition rules from the neural network trained on the mushroom
dataset, as well as to improve the behaviour of the trained neural network even further
when tested with examples not previously seen. It is an open question as to how well the
trained neural network has to perform on unseen examples before the process of rule
extraction can begin.
Together, the preliminary results reported here provide evidence of the feasibility of
integrating GAs with trained neural networks, both technically and in terms of efficiency.
The approach can be scaled up easily, with the major constraint on scale being the
accuracy of the trained neural network when dealing with large datasets. What was
particularly interesting was the extraction of rules not captured by traditional symbolic
learning techniques. While such rules may not be totally accurate, in that they do not capture all or even most of the samples in a dataset, there is no doubt that the approach
outlined here can perform the useful function of extracting rules which lie at the
periphery of domain expertise or which capture exceptions (which can then be further
analysed to identify reasons for being exceptions). One of the major advantages of this