NeuroRule: A Connectionist Approach to Data Mining

Hongjun Lu, Rudy Setiono, Huan Liu
Department of Information Systems and Computer Science
National University of Singapore
{luhj,rudys,liuh}@iscs.nus.sg

Proceedings of the 21st VLDB Conference, Zurich, Switzerland, 1995

Abstract

Classification, which involves finding rules that partition a given data set into disjoint groups, is one class of data mining problems. Approaches proposed so far for mining classification rules for large databases are mainly decision tree based symbolic learning methods. The connectionist approach based on neural networks has been thought not well suited for data mining. One of the major reasons cited is that knowledge generated by neural networks is not explicitly represented in the form of rules suitable for verification or interpretation by humans. This paper examines this issue. With our newly developed algorithms, rules which are similar to, or more concise than, those generated by the symbolic methods can be extracted from the neural networks. The data mining process using neural networks, with the emphasis on rule extraction, is described. Experimental results and a comparison with previously published work are presented.

1 Introduction

With the wide use of advanced database technology developed during the past decades, it is not difficult to efficiently store huge volumes of data in computers and retrieve them whenever needed. Although the stored data are a valuable asset of an organization, most organizations may face the problem of being data rich but knowledge poor sooner or later. This situation aroused the recent surge of research interest in the area of data mining [1, 9, 2].

One of the data mining problems is classification. Data items in databases, such as tuples in relational database systems, usually represent real-world entities. The values of the attributes of a tuple represent the properties of the entity. Classification is the process of finding the common properties among different entities and classifying them into classes. The results are often expressed in the form of rules, the classification rules. By applying the rules, entities represented by tuples can be easily classified into the different classes they belong to. We can restate the problem formally defined by Agrawal et al. [1] as follows. Let A be a set of attributes A_1, A_2, ..., A_n, and let dom(A_i) refer to the set of possible values for attribute A_i. Let C be a set of classes c_1, c_2, ..., c_m. We are given a data set, the training set, whose members are (n+1)-tuples of the form (a_1, a_2, ..., a_n, c_k) where a_i ∈ dom(A_i) (1 ≤ i ≤ n) and c_k ∈ C (1 ≤ k ≤ m). Hence, the class to which each tuple in the training set belongs is known for supervised learning. We are also given a second large database of (n+1)-tuples, the testing set. The classification problem is to obtain a set of rules R using the given training data set. By applying these rules to the testing set, the rules can be checked for whether they generalize well (measured by the predictive accuracy). The rules that generalize well can be safely applied to the application database with unknown classes to determine each tuple's class.

This problem has been widely studied by researchers in the AI field [28]. It has recently been re-examined by database researchers in the context of large database systems [5, 7, 14, 15, 13].
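As a concrete reading of this problem statement, the sketch below shows one way a mined rule set R could be applied to a testing set to estimate predictive accuracy. It is an illustrative Python sketch under our own rule representation (a predicate plus a class label); it is not code from the paper, and the attribute names are assumptions borrowed from the data set used later.

```python
# Illustrative sketch (not from the paper): apply a rule set R to a testing
# set and measure predictive accuracy. A rule is modelled as a (predicate,
# class label) pair; unmatched tuples fall back to a default class.

def predict(rules, default_class, attrs):
    """Return the class label of the first rule whose condition holds."""
    for condition, class_label in rules:
        if condition(attrs):
            return class_label
    return default_class

def predictive_accuracy(rules, default_class, testing_set):
    """Fraction of (attributes, class) pairs classified correctly."""
    correct = sum(1 for attrs, true_class in testing_set
                  if predict(rules, default_class, attrs) == true_class)
    return correct / len(testing_set)

# Toy example.
rules = [(lambda t: t["age"] < 40 and 50000 <= t["salary"] <= 100000, "A")]
testing_set = [({"age": 35, "salary": 80000}, "A"),
               ({"age": 45, "salary": 30000}, "B")]
print(predictive_accuracy(rules, "B", testing_set))  # -> 1.0
```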
Two basic approaches to the classification problem studied by AI researchers are the symbolic approach and the connectionist approach. The symbolic approach is based on decision trees, and the connectionist approach mainly uses neural networks. In general, neural networks give a lower classification error rate than decision trees but require longer learning time [17, 24, 18]. While both approaches have been well received by the AI community, the general impression among the database community is that the connectionist approach is not well suited for data mining. The major criticisms include the following:

1. Neural networks learn the classification rules by multiple passes over the training data set, so the learning time, or the training time needed for a neural network to obtain high classification accuracy, is usually long.

2. A neural network is usually a layered graph with the output of one node feeding into one or many other nodes in the next layer. The classification rules are buried in both the structure of the graph and the weights assigned to the links between the nodes. Articulating the classification rules becomes a difficult problem.

3. For the same reason, available domain knowledge is rather difficult to incorporate into a neural network.

Among the above three major disadvantages of the connectionist approach, the articulation problem is the most urgent one to be solved for applying the technique to data mining. Without explicit representation of classification rules, it is very difficult to verify or interpret them. More importantly, with explicit rules, tuples of a certain pattern can be easily retrieved using a database query language. Access methods such as indexing can be used or built for efficient retrieval, as those rules usually involve only a small set of attributes. This is especially important for applications involving a large volume of data.

In this paper, we present the results of our study on applying neural networks to mine classification rules for large databases, with the focus on articulating the classification rules represented by neural networks. The contributions of our study include the following:

• Different from previous research work that excludes the connectionist approach entirely, we argue that the connectionist approach should have its position in data mining because of its merits such as low classification error rates and robustness to noise [17, 18].

• With our newly developed algorithms, explicit classification rules can be extracted from a neural network. The rules extracted usually have a lower classification error rate than those generated by the decision tree based methods. For a data set with a strong relationship among attributes, the rules extracted are generally more concise.

• A data mining system, NeuroRule, based on neural networks was developed. The system successfully solved a number of classification problems in the literature.

To better suit large database applications, we also developed algorithms for input data pre-processing and for fast neural network training to reduce the time needed to learn the classification rules [22, 19]. Limited by space, those algorithms are not presented in this paper.

The remainder of the paper is organized as follows. Section 2 gives a discussion on using the connectionist approach to learn classification rules. Section 3 describes our algorithms to extract classification rules from a neural network. Section 4 presents some experimental results obtained and a comparison with previously published results. Finally, a conclusion is given in Section 5.
2 Mining classification rules using neural networks

Artificial neural networks are densely interconnected networks of simple computational elements, neurons. There exist many different network topologies [10]. Among them, the multi-layer perceptron is especially useful for implementing a classification function. Figure 1 shows a three layer feedforward network. It consists of an input layer, a hidden layer and an output layer. A node (neuron) in the network has a number of inputs and a single output. For example, a neuron H_j in the hidden layer has x^i_1, x^i_2, ..., x^i_n as its inputs and α_j as its output. The input links of H_j have weights w^j_1, w^j_2, ..., w^j_n. A node computes its output, the activation value, by summing up its weighted inputs, subtracting a threshold, and passing the result to a non-linear function f, the activation function. Outputs from neurons in one layer are fed as inputs to neurons in the next layer. In this manner, when an input tuple is applied to the input layer, an output tuple is obtained at the output layer. For a well trained network which represents the classification function, if tuple (x_1, x_2, ..., x_n) is applied to the input layer of the network, the output tuple (c_1, c_2, ..., c_m) should be obtained, where c_i has value 1 if the input tuple belongs to class c_i and 0 otherwise.

Figure 1: A three layer feedforward neural network. (The figure shows the input layer, the hidden layer connected to it by weights w, and the output layer connected to the hidden layer by weights v.)

Our approach that uses neural networks to mine classification rules consists of three steps:

1. Network training
A three layer neural network is trained in this step. The training phase aims to find the best set of weights for the network, which allow the network to classify input tuples with a satisfactory level of accuracy. An initial set of weights is chosen randomly in the interval [-1, 1]. Updating these weights is normally done using information involving the gradient of an error function. This phase is terminated when the norm of the gradient of the error function falls below a pre-specified value.

2. Network pruning
The network obtained from the training phase is fully connected and could have too many links and sometimes too many nodes as well. It is impossible to extract concise rules which are meaningful to users and can be used to form database queries from such a network. The pruning phase aims at removing redundant links and nodes without increasing the classification error rate of the network. The smaller number of nodes and links left in the network after pruning provides for extracting concise and comprehensible rules that describe the classification function.

3. Rule extraction
This phase extracts the classification rules from the pruned network. The rules generated are in the form of "if (a_1 θ v_1) and (a_2 θ v_2) and ... and (a_n θ v_n) then C_j", where the a_i's are the attributes of an input tuple, the v_i's are constants, the θ's are relational operators (=, ≤, ≥, <>), and C_j is one of the class labels. It is expected that the rules are concise enough for human verification and are easily applicable to large databases.

In this section, we will briefly discuss the first two phases. The third phase, the rule extraction phase, will be discussed in the next section.
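To make the division of labour between the three phases explicit, here is a minimal Python skeleton of the pipeline under assumed names (train_network, prune_network, extract_rules). The bodies are placeholders that only document what each phase consumes and produces; this is not the paper's implementation.

```python
# Skeleton of the three-phase process, with assumed function names.
# Only the data flow is shown; the actual algorithms appear in Sections 2-3.

def train_network(training_set, num_hidden_nodes):
    """Phase 1: fit weights (w, v) of a three layer network until the
    gradient of the error function is sufficiently small."""
    raise NotImplementedError("see Section 2.1")

def prune_network(network, training_set, accuracy_threshold=0.9):
    """Phase 2: remove redundant links and nodes while the accuracy on the
    training set stays above the threshold."""
    raise NotImplementedError("see Section 2.2 and Figure 2")

def extract_rules(pruned_network, training_set):
    """Phase 3: discretize hidden activations and derive if-then rules."""
    raise NotImplementedError("see Section 3 and Figure 4")

def neurorule(training_set, num_hidden_nodes=4):
    network = train_network(training_set, num_hidden_nodes)
    network = prune_network(network, training_set)
    return extract_rules(network, training_set)
```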
2.1 Network training

Assume that input tuples in an n-dimensional space are to be classified into three disjoint classes A, B, and C. We construct a network as shown in Figure 1 which consists of three layers. The number of nodes in the input layer corresponds to the dimensionality of the input tuples. The number of nodes in the output layer equals the number of classes to be classified, which is three in this example. The network is trained with target values equal to {1,0,0} for all patterns in set A, {0,1,0} for all patterns in B, and {0,0,1} for all tuples in C. An input tuple will be classified as a member of class A, B or C if the largest activation value is obtained by the first, second or third output node, respectively.

There is still no clear-cut rule to determine the number of hidden nodes to be included in the network. Too many hidden nodes may lead to overfitting of the data and poor generalization, while too few hidden nodes may not give rise to a network that learns the data. Two different approaches have been proposed to overcome the problem of determining the optimal number of hidden nodes required by a neural network to solve a given problem. The first approach begins with a minimal network and adds more hidden nodes only when they are needed to improve the learning capability of the network [3, 11, 19]. The second approach begins with an oversized network and then prunes redundant hidden nodes and connections between the layers of the network. We adopt the second approach since we are interested in finding a network with a small number of hidden nodes as well as the fewest number of input nodes. An input node with no connection to any of the hidden nodes after pruning plays no role in the outcome of the classification process and hence can be removed from the network.

The activation value of a node in the hidden layer is computed by passing the weighted sum of input values to a non-linear activation function. Let w^m_ℓ be the weight for the connection from input node ℓ to hidden node m. Given an input pattern x^i, i ∈ {1, 2, ..., k}, where k is the number of tuples in the data set, the activation value of the m-th hidden node is

    α_m = f( Σ_{ℓ=1}^{n} x^i_ℓ w^m_ℓ − τ_m ),

where f(·) is an activation function. In our study, we use the hyperbolic tangent function

    f(x) = δ(x) = (e^x − e^{−x}) / (e^x + e^{−x})

as the activation function for the hidden nodes, which makes the range of activation values of the hidden nodes [−1, 1].
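The hidden-layer computation above can be written out directly. The following is a small illustrative sketch (our own code, not the paper's) of α_m = f(Σ_ℓ x^i_ℓ w^m_ℓ − τ_m) with the hyperbolic tangent activation; the example weights and thresholds are made up.

```python
import math

def hidden_activations(x, w, tau):
    """Compute alpha_m = tanh(sum_l x[l] * w[m][l] - tau[m]) for each hidden node m.

    x   : list of input values (one tuple), length n
    w   : w[m][l] is the weight from input node l to hidden node m
    tau : tau[m] is the threshold of hidden node m
    """
    return [math.tanh(sum(x_l * w_ml for x_l, w_ml in zip(x, w_m)) - tau_m)
            for w_m, tau_m in zip(w, tau)]

# Made-up example: 3 binary inputs, 2 hidden nodes.
x = [1, 0, 1]
w = [[0.8, -0.4, 0.3],   # weights into hidden node 1
     [-0.6, 0.9, 0.2]]   # weights into hidden node 2
tau = [0.1, -0.2]
print(hidden_activations(x, w, tau))  # values fall in (-1, 1)
```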
Once the activation values of all the hidden nodes have been computed, the p-th output of the network for input tuple x^i is computed as

    S^i_p = σ( Σ_{m=1}^{h} α_m v^m_p ),

where v^m_p is the weight of the connection between hidden node m and output node p, and h is the number of hidden nodes in the network. The activation function used here is the sigmoid function,

    σ(x) = 1 / (1 + e^{−x}),

which yields activation values of the output nodes in the range [0, 1].

A tuple will be correctly classified if the following condition is satisfied:

    max_p |e^i_p| = max_p |S^i_p − t^i_p| ≤ η_1,    (1)

where t^i_p = 0, except that t^i_1 = 1 if x^i ∈ A, t^i_2 = 1 if x^i ∈ B, and t^i_3 = 1 if x^i ∈ C, and η_1 is a small positive number less than 0.5. The ultimate objective of the training phase is to obtain a set of weights that make the network classify the input tuples correctly.

To measure the classification error, an error function is needed so that the training process becomes a process of adjusting the weights (w, v) to minimize this function. Furthermore, to facilitate the pruning phase, it is desired to have many weights with very small values so that they can be set to zero. This is achieved by adding a penalty term to the error function.

In our training algorithm, the cross entropy function

    E(w, v) = − Σ_{i=1}^{k} Σ_{p=1}^{o} ( t^i_p log S^i_p + (1 − t^i_p) log(1 − S^i_p) )    (2)

is used as the error function. In this example, o equals 3 since we have 3 different classes. The cross entropy function is chosen because faster convergence can be achieved by minimizing this function instead of the widely used sum of squared errors function [26].

The penalty term P(w, v) we used is

    P(w, v) = ε_1 ( Σ_{m=1}^{h} Σ_{ℓ=1}^{n} β(w^m_ℓ)^2 / (1 + β(w^m_ℓ)^2) + Σ_{m=1}^{h} Σ_{p=1}^{o} β(v^m_p)^2 / (1 + β(v^m_p)^2) )
             + ε_2 ( Σ_{m=1}^{h} Σ_{ℓ=1}^{n} (w^m_ℓ)^2 + Σ_{m=1}^{h} Σ_{p=1}^{o} (v^m_p)^2 ),    (3)

where ε_1 and ε_2 are two positive weight decay parameters. Their values reflect the relative importance of the accuracy of the network versus its complexity. With larger values of these two parameters, more weights may be removed later from the network at the cost of a decrease in its accuracy.

The training phase starts with an initial set of weights (w, v)^(0) and iteratively updates the weights to minimize E(w, v) + P(w, v). Any unconstrained minimization algorithm can be used for this purpose. In particular, the gradient descent method has been the most widely used, in the training algorithm known as the backpropagation algorithm. A number of alternative algorithms for neural network training have been proposed [4]. To reduce the network training time, which is very important in data mining as the data set is usually large, we employed a variant of the quasi-Newton algorithm [27], the BFGS method. This algorithm has a superlinear convergence rate, as opposed to the linear rate of the gradient descent method. Details of the BFGS algorithm can be found in [6, 23].

The network training is terminated when a local minimum of the function E(w, v) + P(w, v) has been reached, that is, when the gradient of the function is sufficiently small.
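As an illustration of the objective being minimized, the sketch below evaluates E(w, v) + P(w, v) for given weights. It is our own illustrative code, not the paper's training routine; the parameter values (β, ε_1, ε_2) and the toy weights are placeholders, and the optimizer itself (e.g. BFGS) is not shown.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w, v, tau):
    """One forward pass: tanh hidden layer, sigmoid output layer."""
    alpha = [math.tanh(sum(xl * wl for xl, wl in zip(x, w_m)) - t_m)
             for w_m, t_m in zip(w, tau)]
    return [sigmoid(sum(a * v_p[m] for m, a in enumerate(alpha))) for v_p in v]

def cross_entropy(data, targets, w, v, tau):
    """E(w, v) of equation (2), summed over tuples and output nodes."""
    E = 0.0
    for x, t in zip(data, targets):
        S = forward(x, w, v, tau)
        E -= sum(tp * math.log(Sp) + (1 - tp) * math.log(1 - Sp)
                 for tp, Sp in zip(t, S))
    return E

def penalty(w, v, beta=10.0, eps1=0.1, eps2=1e-4):
    """P(w, v) of equation (3), over both groups of weights."""
    ws = [wml for w_m in w for wml in w_m]
    vs = [vpm for v_p in v for vpm in v_p]
    small = sum(beta * u * u / (1 + beta * u * u) for u in ws + vs)
    decay = sum(u * u for u in ws + vs)
    return eps1 * small + eps2 * decay

# Tiny made-up example: 2 inputs, 2 hidden nodes, 2 output classes.
data = [[1, 0], [0, 1]]
targets = [[1, 0], [0, 1]]
w = [[0.5, -0.3], [-0.2, 0.7]]
tau = [0.0, 0.1]
v = [[0.9, -0.4], [-0.8, 0.6]]
# The training phase would minimize this value with an unconstrained
# optimizer such as BFGS.
print(cross_entropy(data, targets, w, v, tau) + penalty(w, v))
```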
2.2 Network pruning

A fully connected network is obtained at the end of the training process. There are usually a large number of links in the network. With n input nodes, h hidden nodes, and m output nodes, there are h(m + n) links. It is very difficult to articulate such a network. The network pruning phase aims at removing some of the links without affecting the classification accuracy of the network.

It can be shown [20] that if a network is fully trained to correctly classify an input tuple x^i with condition (1) satisfied, we can set w^m_ℓ to zero without deteriorating the overall accuracy of the network if the product |v^m_p w^m_ℓ| is sufficiently small. If max_p |v^m_p w^m_ℓ| ≤ 4η_2 and the sum (η_1 + η_2) is less than 0.5, then the network can still classify x^i correctly. Similarly, if max_p |v^m_p| ≤ 4η_2, then v^m_p can be removed from the network.

Our pruning algorithm based on this result is shown in Figure 2. The two conditions (4) and (5) for pruning depend on the magnitude of the weights for connections between input nodes and hidden nodes and between hidden nodes and output nodes. It is imperative that during training these weights be prevented from getting too large. At the same time, small weights should be encouraged to decay rapidly to zero. By using penalty function (3), we can achieve both.

Neural network pruning algorithm (NP)

1. Let η_1 and η_2 be positive scalars such that η_1 + η_2 < 0.5.

2. Pick a fully connected network. Train this network until a predetermined accuracy rate is achieved and for each correctly classified pattern the condition (1) is satisfied. Let (w, v) be the weights of this network.

3. For each w^m_ℓ, if

       max_p |v^m_p × w^m_ℓ| ≤ 4η_2,    (4)

   then remove w^m_ℓ from the network.

4. For each v^m_p, if

       |v^m_p| ≤ 4η_2,    (5)

   then remove v^m_p from the network.

5. If no weight satisfies condition (4) or condition (5), then remove the w^m_ℓ with the smallest product max_p |v^m_p × w^m_ℓ|.

6. Retrain the network. If the accuracy of the network falls below an acceptable level, then stop. Otherwise, go to Step 3.

Figure 2: Neural network pruning algorithm
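A minimal sketch of one pruning sweep (Steps 3 and 4 of algorithm NP) is given below, assuming the same weight layout as the earlier sketches. This is illustrative code of our own, not the paper's implementation; the retraining between sweeps and the fallback of Step 5 are omitted.

```python
def prune_step(w, v, eta2):
    """One sweep of conditions (4) and (5): zero out prunable weights in place.

    w[m][l] : weight from input node l to hidden node m
    v[p][m] : weight from hidden node m to output node p
    """
    threshold = 4 * eta2
    removed = 0
    # Condition (4): w[m][l] is removable if max_p |v[p][m] * w[m][l]| <= 4*eta2.
    for m, w_m in enumerate(w):
        for l, w_ml in enumerate(w_m):
            if w_ml != 0 and max(abs(v_p[m] * w_ml) for v_p in v) <= threshold:
                w_m[l] = 0.0
                removed += 1
    # Condition (5): v[p][m] is removable if |v[p][m]| <= 4*eta2.
    for v_p in v:
        for m, v_pm in enumerate(v_p):
            if v_pm != 0 and abs(v_pm) <= threshold:
                v_p[m] = 0.0
                removed += 1
    return removed  # the full algorithm retrains and repeats while accuracy holds
```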
2.3 An example

We have chosen to use a function described in [2] as an example to show how a neural network can be trained and pruned for solving a classification problem. The input tuple consists of nine attributes defined in Table 1. Ten classification problems are given in [2]. Limited by space, we will present and discuss a few functions and the experimental results.

Table 1: Attributes of the test data adapted from Agrawal et al. [2]

  Attribute    Description            Value
  salary       salary                 uniformly distributed from 20,000 to 150,000
  commission   commission             if salary ≥ 75000 then commission = 0,
                                      else uniformly distributed from 10000 to 75000
  age          age                    uniformly distributed from 20 to 80
  elevel       education level        uniformly distributed from [0, 1, ..., 4]
  car          make of the car        uniformly distributed from [1, 2, ..., 20]
  zipcode      zip code of the town   uniformly chosen from 9 available zipcodes
  hvalue       value of the house     uniformly distributed from 0.5k × 100000 to 1.5k × 100000,
                                      where k ∈ {0, ..., 9} depends on zipcode
  hyears       years house owned      uniformly distributed from [1, 2, ..., 30]
  loan         total amount of loan   uniformly distributed from 1 to 500000

Function 2 classifies a tuple in Group A if

    ((age < 40) ∧ (50000 ≤ salary ≤ 100000)) ∨
    ((40 ≤ age < 60) ∧ (75000 ≤ salary ≤ 125000)) ∨
    ((age ≥ 60) ∧ (25000 ≤ salary ≤ 75000)).

Otherwise, the tuple is classified in Group B.

The training data set consisted of 1000 tuples. The values of the attributes of each tuple were generated randomly according to the distributions given in Table 1. Following Agrawal et al. [2], we also included a perturbation factor as one of the parameters of the random data generator. This perturbation factor was set at 5 percent. For each tuple, a class label was determined according to the rules that define the function above.

To facilitate the rule extraction in the later phase, the values of the numeric attributes were discretized. Each of the six attributes with numeric values was discretized by dividing its range into subintervals. The attribute salary, for example, which was uniformly distributed from 25000 to 150000, was divided into 6 subintervals: subinterval 1 contained all salary values that were strictly less than 25000, subinterval 2 contained those greater than or equal to 25000 and strictly less than 50000, etc. The thermometer coding scheme was then employed to get the binary representations of these intervals for inputs to the neural network. Hence, a salary value less than 25000 was coded as {000001}, a salary value in the interval [25000, 50000) was coded as {000011}, etc. The second attribute, commission, was similarly coded. The interval from 10000 to 75000 was divided into 7 subintervals, each having a width of 10000 except for the last one, [70000, 75000]. Zero commission was coded by all zero values for the seven inputs. The coding schemes for the other attributes are given in Table 2.

Table 2: Binarization of the attribute values

  Attribute    Input number   Interval width
  salary       I1 - I6        25000
  commission   I7 - I13       10000
  age          I14 - I19      10
  elevel       I20 - I23      -
  car          I24 - I43      -
  zipcode      I44 - I52      -
  hvalue       I53 - I66      100000
  hyears       I67 - I76      3
  loan         I77 - I86      50000
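The thermometer coding just described can be illustrated with a short sketch (our own code, not the paper's preprocessing module) for the salary attribute; the cut points below are implied by Table 2's six subintervals of width 25000.

```python
def thermometer_code(value, boundaries):
    """Thermometer coding of a numeric value.

    `boundaries` are the ascending cut points between subintervals; a value in
    subinterval j (1-based, counted from the lowest) is coded as j trailing 1s,
    e.g. salary < 25000 -> 000001 and salary in [25000, 50000) -> 000011.
    """
    j = 1 + sum(1 for b in boundaries if value >= b)   # subinterval index
    n_bits = len(boundaries) + 1
    return [1 if pos >= n_bits - j else 0 for pos in range(n_bits)]

# Salary cut points implied by Table 2.
salary_cuts = [25000, 50000, 75000, 100000, 125000]
print(thermometer_code(20000, salary_cuts))   # [0, 0, 0, 0, 0, 1]
print(thermometer_code(30000, salary_cuts))   # [0, 0, 0, 0, 1, 1]
print(thermometer_code(130000, salary_cuts))  # [1, 1, 1, 1, 1, 1]
```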
With this coding scheme, we had a total of 86 binary inputs. The 87th input was added to the network to incorporate the bias or threshold in each of the hidden nodes. The value of this input was set to one. Therefore the input layer of the initial network consisted of 87 input nodes. Two nodes were used at the output layer. The target output of the network was {1,0} if the tuple belonged to Group A, and {0,1} otherwise. The number of hidden nodes was initially set at four.

There were a total of 386 links in the network. The weights for these links were given initial values that were randomly generated in the interval [-1,1]. The network was trained until a local minimum point of the error function had been reached.

The fully connected trained network was then pruned by the pruning algorithm described in Section 2.2. We continued removing connections from the neural network as long as the accuracy of the network was still higher than 90%.

Figure 3 shows the pruned network. Of the 386 links in the original network, only 17 remained in the pruned network. One of the four hidden nodes was removed. The small number of links from the input nodes to the hidden nodes made it possible to extract compact rules with the same accuracy level as the neural network.

Figure 3: Pruned network for Function 2. Its accuracy rate on the 1000 training samples is 96.30% and it contains only 17 connections. (The figure shows the remaining input nodes I1, I2, I4, I5, I13, I15 and I17, with positive and negative weights drawn differently.)

3 Extracting rules from a neural network

Network pruning results in a relatively simple network. In the example shown in the last section, the pruned network has only 7 input nodes, 3 hidden nodes, and 2 output nodes. The number of links is 17. However, it is still very difficult to articulate the network, i.e., to find the explicit relationship between the input tuples and the output tuples. Research work in this area has been reported [25, 8]. However, to our best knowledge, there is no method available in the literature that can extract explicit and concise rules as the algorithm we will describe in this section.

3.1 Rule extracting algorithm

A number of reasons contribute to the difficulty of extracting rules from a pruned network. First, even with a pruned network, the links may still be too many to express the relationship between an input tuple and its class label in the form of if ... then ... rules. If a node has n input links with binary values, there could be as many as 2^n distinct input patterns. The rules could be quite lengthy or complex even with a small n, say 7. Second, the activation values of a hidden node could be anywhere in the range [-1, 1] depending on the input tuple. With a large number of testing data, the activation values are virtually continuous. It is rather difficult to derive the explicit relationship between the activation values of the hidden nodes and the output values of a node in the output layer.

Our rule extracting algorithm is outlined in Figure 4. The algorithm first discretizes the activation values of hidden nodes into a manageable number of discrete values without sacrificing the classification accuracy of the network. A small set of discrete activation values makes it possible to determine both the dependency between the output values and the hidden node values and the dependency between the hidden node activation values and the input values. From the dependencies, rules can be generated [12].

Rule extraction algorithm (RX)

1. Activation value discretization via clustering:

   (a) Let ε ∈ (0, 1). Let D be the number of discrete activation values in the hidden node. Let δ_1 be the activation value in the hidden node for the first pattern in the training set. Let H(1) = δ_1, count(1) = 1, sum(1) = δ_1, and set D = 1.

   (b) For all patterns i = 2, 3, ..., k in the training set:
       • Let δ be its activation value.
       • If there exists an index j such that |δ − H(j)| = min_{j' ∈ {1,2,...,D}} |δ − H(j')| and |δ − H(j)| ≤ ε, then set count(j) := count(j) + 1 and sum(j) := sum(j) + δ; else set D := D + 1, H(D) = δ, count(D) = 1, sum(D) = δ.

   (c) Replace H(j) by the average of all activation values that have been clustered into this cluster: H(j) := sum(j)/count(j), j = 1, 2, ..., D.

   (d) Check the accuracy of the network with the activation value δ^i at each hidden node replaced by δ_d, the activation value of the cluster to which the activation value belongs.

   (e) If the accuracy falls below the required level, decrease ε and repeat Step 1.

2. Enumerate the discretized activation values and compute the network output. Generate perfect rules that have a perfect cover of all the tuples from the hidden node activation values to the output values.

3. For the discretized hidden node activation values appearing in the rules found in the above step, enumerate the input values that lead to them, and generate perfect rules.

4. Generate rules that relate the input values and the output values by rule substitution based on the results of the above two steps.

Figure 4: Rule extraction algorithm (RX)
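Step 1 of RX is a simple one-pass clustering of activation values. The sketch below is an illustrative rendering of steps 1(a)-(c) in Python (our own code); the accuracy check of steps 1(d)-(e) is left out.

```python
def cluster_activations(activations, eps):
    """One-pass clustering of a hidden node's activation values (RX steps 1a-1c)."""
    H, count, total = [], [], []              # cluster centres, sizes, sums
    for delta in activations:
        if not H:                             # first pattern starts cluster 1
            H.append(delta); count.append(1); total.append(delta)
            continue
        j = min(range(len(H)), key=lambda idx: abs(delta - H[idx]))
        if abs(delta - H[j]) <= eps:          # join the nearest cluster
            count[j] += 1
            total[j] += delta
        else:                                 # start a new cluster
            H.append(delta); count.append(1); total.append(delta)
    return [s / c for s, c in zip(total, count)]   # averaged cluster values

# Made-up activations that collapse into three clusters, loosely resembling
# the three discrete values found below for hidden node 3.
acts = [-0.98, -1.0, 0.2, 0.3, 0.25, 0.99, 1.0, -0.97]
print(cluster_activations(acts, eps=0.6))
```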
Here we show the process of extracting rules from the pruned network in Figure 3 obtained for the classification problem Function 2.

The network has three hidden nodes. The activation values of the 1000 tuples were discretized. The value of ε was set to 0.6. The results of discretization are shown in the following table.

  Node   No of clusters   Cluster activation values
  1      3                (-1, 0, 1)
  2      2                ( 0, 1)
  3      3                (-1, 0.24, 1)

The classification accuracy of the network was checked by replacing the individual activation values with their discretized activation values. The value of ε = 0.6 was sufficiently small to preserve the accuracy of the neural network and large enough to produce only a small number of clusters. For the three hidden nodes, the numbers of discrete activation values (clusters) are 3, 2 and 3, so a total of 18 different outcomes at the two output nodes are possible. We tabulate the outputs C_j (1 ≤ j ≤ 2) of the network according to the hidden node activation values α_m (1 ≤ m ≤ 3) as follows.

  α_1   α_2   α_3    C_1    C_2
  -1    1     -1     0.92   0.08
  -1    1     1      0.00   1.00
  -1    1     0.24   0.01   0.99
  -1    0     -1     1.00   0.00
  -1    0     1      0.11   0.89
  -1    0     0.24   0.93   0.07
   1    1     -1     0.00   1.00
   1    1     1      0.00   1.00
   1    1     0.24   0.00   1.00
   1    0     -1     0.89   0.11
   1    0     1      0.00   1.00
   1    0     0.24   0.00   1.00
   0    1     -1     0.18   0.82
   0    1     1      0.00   1.00
   0    1     0.24   0.00   1.00
   0    0     -1     1.00   0.00
   0    0     1      0.00   1.00
   0    0     0.24   0.18   0.82

Following Algorithm RX step 2, the predicted outputs of the network are taken to be C_1 = 1 and C_2 = 0 if the activation values α_m satisfy one of the following conditions (since the table is small, the rules can be checked manually):

  R11: C_1 = 1, C_2 = 0 ⇐ α_2 = 0, α_3 = −1.
  R12: C_1 = 1, C_2 = 0 ⇐ α_1 = −1, α_2 = 1, α_3 = −1.
  R13: C_1 = 1, C_2 = 0 ⇐ α_1 = −1, α_2 = 0, α_3 = 0.24.

Otherwise, C_1 = 0 and C_2 = 1.

The activation values of a hidden node are determined by the inputs connected to it. In particular, the three activation values of hidden node 1 are determined by 4 inputs, I1, I13, I15, and I17. The activation values of hidden node 2 are determined by 2 inputs, I2 and I17, and the activation values of hidden node 3 are determined by I4, I5, I13, I15 and I17. Note that only 5 different activation values appear in the above three rules. Following Algorithm RX step 3, we obtain rules that show how a hidden node is activated for the five different activation values at the three hidden nodes:

Hidden node 1:
  R21: α_1 = −1 ⇐ I13 = 1
  R22: α_1 = −1 ⇐ I1 = I13 = I15 = 0, I17 = 1

Hidden node 2:
  R23: α_2 = 1 ⇐ I2 = 1
  R24: α_2 = 1 ⇐ I17 = 1
  R25: α_2 = 0 ⇐ I2 = I17 = 0

Hidden node 3:
  R26: α_3 = −1 ⇐ I13 = 0
  R27: α_3 = −1 ⇐ I5 = I15 = 1
  R28: α_3 = 0.24 ⇐ I4 = I13 = 1, I17 = 0
  R29: α_3 = 0.24 ⇐ I5 = 0, I13 = I15 = 1

With all the intermediate rules obtained above, we can derive the classification rules as in Algorithm RX step 4. For example, substituting rule R11 with rules R25, R26, and R27, we have the following two rules in terms of the original inputs:

  R1:  C_1 = 1, C_2 = 0 ⇐ I2 = I17 = 0, I13 = 0
  R1′: C_1 = 1, C_2 = 0 ⇐ I2 = I17 = 0, I5 = I15 = 1

Recall that the input values of I14 to I19 represent coded age groups, where I15 = 1 if age is in [60, 80) and I17 = 1 if age is in [20, 40). Therefore rule R1′ can in fact never be satisfied by any tuple, and hence it is redundant. Similarly, replacing rule R12 with R21, R22, R23, R24, R26 and R27, we have the following two rules:

  R2: C_1 = 1, C_2 = 0 ⇐ I5 = I13 = I15 = 1.
  R3: C_1 = 1, C_2 = 0 ⇐ I1 = I13 = I15 = 0, I17 = 1.

Substituting R13 with R21, R22, R25, R28 and R29, we have another rule:

  R4: C_1 = 1, C_2 = 0 ⇐ I2 = I17 = 0, I4 = I13 = 1.

It is now trivial to obtain the rules in terms of the original attributes. Conditions of the rules after substitution can be rewritten in terms of the original attributes and the classification problem as shown in Figure 5.

  Rule 1. If (salary < 100000) ∧ (commission = 0) ∧ (age ≤ 40), then Group A.
  Rule 2. If (salary ≥ 25000) ∧ (commission > 0) ∧ (age ≥ 60), then Group A.
  Rule 3. If (salary < 125000) ∧ (commission = 0) ∧ (40 ≤ age ≤ 60), then Group A.
  Rule 4. If (50000 ≤ salary < 100000) ∧ (age < 40), then Group A.
  Default Rule. Group B.

Figure 5: Rules generated by NeuroRule for Function 2.

Given the fact that salary ≥ 75000 ⇔ commission = 0, the above four rules obtained by the pruned network are identical to the classification Function 2.

3.2 Hidden node splitting and creation of a subnetwork

After network pruning and activation value discretization, rules can be extracted by examining the possible combinations in the network outputs as shown in the previous section. However, when there are still too many connections between a hidden node and the input nodes, it is not trivial to extract rules, and even if we can, the rules may not be easy to understand. To address the problem, a three layer feedforward subnetwork can be employed to simplify rule extraction for the hidden node. The number of output nodes of this subnetwork is the number of discrete values of the hidden node, while the input nodes are those connected to the hidden node in the original network. Tuples in the training set are grouped according to their discretized activation values. Given d discrete activation values D_1, D_2, ..., D_d, all training tuples with activation values equal to D_j are given a d-dimensional target value of all zeros except for one 1 in position j. A new hidden layer is introduced for this subnetwork. This subnetwork is trained and pruned in the same way as the original network. The rule extracting process is applied to the subnetwork to obtain the rules describing the input and the discretized activation values.

This process is applied recursively to those hidden nodes with too many input links until the number of connections is small enough or the new subnetwork cannot simplify the connections between the inputs and the hidden node at the higher level. For most problems that we have solved, this step is not necessary. One problem where this step is required by the algorithm is a genetic classification problem with 60 attributes. The details of the experiment can be found in [21].
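The construction of training targets for such a subnetwork, as described above, is straightforward to illustrate. The sketch below is our own illustrative code: it groups tuples by their discretized activation value and assigns the d-dimensional one-in-position-j targets.

```python
def subnetwork_targets(discrete_values, activations):
    """Build one-hot targets for the subnetwork of one hidden node.

    discrete_values : the d discrete activation values D_1, ..., D_d
    activations     : the discretized activation value of each training tuple
    Returns a d-dimensional target per tuple: all zeros except a 1 in the
    position of the tuple's discrete value.
    """
    index = {value: j for j, value in enumerate(discrete_values)}
    d = len(discrete_values)
    targets = []
    for a in activations:
        t = [0] * d
        t[index[a]] = 1
        targets.append(t)
    return targets

# Example with the three discrete values found for hidden node 3.
print(subnetwork_targets([-1, 0.24, 1], [0.24, -1, 1, 0.24]))
# -> [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```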
4 Preliminary experimental results

Unlike the pattern classification research in the AI community, where a set of classic problems has been studied by a large number of researchers, fewer well documented benchmark problems are available for data mining. In this section, we report the experimental results of applying the approach described in the previous sections to the data mining problem defined in [2]. As mentioned earlier, the database tuples consisted of nine attributes (see Table 1). Ten classification functions of Agrawal et al. [2] were used to generate classification problems with different complexities. The training set consisted of 1000 tuples and the testing data sets had 1000 tuples. Efforts were made to generate the data sets as described in the original functions. Among the 10 functions described, we found that functions 8 and 10 produced highly skewed data that made classification not meaningful. We will only discuss the functions other than these two. To assess our approach, we compare the results with those of C4.5, a decision tree-based classifier [16].

4.1 Classification accuracy

The following table reports the classification accuracy using both our system and C4.5 for eight functions. Here, classification accuracy is defined as

    accuracy = (number of tuples correctly classified) / (total number of tuples).    (6)

  Func.   Pruned Networks         C4.5
  no      Training   Testing     Training   Testing
  1       98.1       100.0       98.3       100.0
  2       96.3       100.0       98.7       96.0
  3       98.5       100.0       99.5       99.1
  4       90.6       92.9        94.0       89.7
  5       90.4       93.1        96.8       94.4
  6       90.1       90.9        94.0       91.7
  7       91.9       91.4        98.1       93.6
  9       90.1       90.9        94.4       91.8

From the table we can see that the classification accuracy of the neural network based approach and C4.5 is comparable. In fact, the network obtained after the training phase has higher accuracy than what is listed here, which is mainly determined by the threshold set for the network pruning phase. In our experiments, it is set to 90%. That is, a network will be pruned until further pruning would cause the accuracy to fall below this threshold. For applications where high classification accuracy is desired, the threshold can be set higher so that fewer nodes and links will be pruned. Of course, this may lead to more complex classification rules. The tradeoff between the accuracy and the complexity of the classification rule set is one of the design issues.

4.2 Rules extracted

Here we present some of the classification rules extracted from our experiments.

For simple classification functions, the rules extracted are exactly the same as the classification functions. These include functions 1, 2 and 3. One interesting example is Function 2. The detailed process of finding the classification rules is described as an example in Sections 2 and 3. The resulting rules are the same as the original function. As reported by Agrawal et al. [2], ID3 generated a relatively large number of strings for Function 2 when the decision tree is built. We observed similar results when C4.5rules (a member of the ID3 family) was used. C4.5rules generated 18 rules. Among the 18 rules, 8 rules define the conditions for Group A. Another 10 rules define Group B. Tuples that do not satisfy the conditions specified are classified as the default class, Group B. Figure 6 shows the rules that define tuples to be a member of Group A.

By comparing the rules generated by C4.5rules (Figure 6) with the rules generated by NeuroRule (Figure 5), it is obvious that our approach generates better rules in the sense that they are more compact, which makes the verification and application of the rules much easier.

Functions 4 and 5 are another two functions for which ID3 generates a large number of strings. CDP [2] also generates a relatively larger number of strings for these than for the other functions. The original classification function 4 and the rule sets that define Group A tuples extracted using NeuroRule and C4.5, respectively, are shown in Figure 7.

The five rules extracted by NeuroRule are not exactly the same as the original function description (Function 4). To test the rules extracted, the rules were applied to three test data sets of different sizes, shown in Table 3. The column Total is the total number of tuples that are classified as Group A by each rule. The column Correct is the percentage of correctly classified tuples. For example, rule R1 classifies all tuples correctly. On the other hand, among the 165 tuples that were classified as Group A by rule R2, 6.1% of them belong to Group B, i.e., they were misclassified.
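Per-rule statistics like the Total and Correct columns of Table 3 can be computed with a few lines. The sketch below is illustrative code of our own, using the same rule representation as the earlier sketches; the rule and data shown are made-up examples.

```python
def rule_statistics(rule, target_class, data_set):
    """Compute Table 3 style statistics for one rule.

    Returns (total, correct_pct): the number of tuples the rule classifies as
    `target_class`, and the percentage of those that truly belong to it.
    """
    condition, _ = rule
    covered = [(attrs, cls) for attrs, cls in data_set if condition(attrs)]
    total = len(covered)
    if total == 0:
        return 0, 0.0
    correct = sum(1 for _, cls in covered if cls == target_class)
    return total, 100.0 * correct / total

# Hypothetical rule resembling R1 for Function 4, and a toy labelled data set.
r1 = (lambda t: 40 <= t["age"] < 60 and t["elevel"] <= 1
                and 75000 <= t["salary"] < 100000, "A")
data = [({"age": 45, "elevel": 1, "salary": 80000}, "A"),
        ({"age": 50, "elevel": 0, "salary": 90000}, "A"),
        ({"age": 30, "elevel": 2, "salary": 60000}, "B")]
print(rule_statistics(r1, "A", data))  # -> (2, 100.0)
```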
That is, a network will be pruned until (Function 4). To test the rules extracted, the rules further pruning will cause the accuracy to fall below were applied to three test data sets of different sizes, this threshold. For applications where high classifi- showninTable3. ThecolumnTotal isthetotalnum- cation accuracy is desired, the threshold can be set ber of tuples that are classified as group A by each higher so that less nodes and links will be pruned. Of rule. The column Correct is the percentage of cor- course, this may lead to more complex classification rectly classified tuples. E.g., rule R1 classifies all tu- rules. Tradeoff between the accuracy and the com- ples correctly. On the other hand, among 165 tuples plexityoftheclassificationrulesetisoneofthedesign that were classified as Group A by rule R2, 6.1% of issues. them belong to Group B, i.e. they were misclassified. Rule 16: (salary > 45910) ∧ (commission > 0) ∧ (age > 59) Rule 10: (51638 < salary ≤ 98469)∧ (age age ≤ 39) Rule 13: (salary ≤ 98469) ∧ (commission ≤ 0) ∧ (age ≤ 60) Rule 6: (26812 < salary ≤ 45910)∧ (age > 61) Rule 20: (98469 < salary ≤ 121461)∧ (39 < age ≤ 57) Rule 7: (45910 < salary ≤ 98469)∧ (commission ≤ 51486)∧ (age ≤ 39) ∧ (hval ≤ 705560) Rule 26: (125706 < salary ≤ 127088)∧ (age ≤ 51) Rule 4: (23873 salary ≤ 26812) ∧ (age > 61) ∧ (loan > 237756) Figure 6: Group A rules generated by C4.5rules for Function 2. (a) Original classification rules defining Group A tuples Group A: ((age<40)∧ (((elevel∈[0..1])?(25K ≤salary≤75K)):(50K ≤salary≤100K))))∨ ((40≤age<60)∧ (((elevel∈[1..3])?(50K ≤salary≤100K)):(75K ≤salary≤125K))))∨ ((age≥60)∧ (((elevel∈[2..4])?(50K ≤salary≤100K)):(25K ≤salary≤75K)))) (b) Rules generated by NeuroRule R1: if (40 ≤ age < 60) ∧ (elevel ≤ 1) ∧ (75K ≤ salary <100K) then Group A R2: if ( age <60) ∧ (elevel ≥ 2) ∧ (50K ≤ salary <100K) then Group A R3: if (age <60) ∧ (elevel ≤ 1) ∧ (50K ≤ salary < 75K ) then Group A R4: if (age ≥ 60) ∧ ( elevel ≤ 1) ∧ (salary <75K) then Group A R5: if (age ≥ 60) ∧ (elevel ≥ 2) ∧ (50K ≤ salary < 100K) then Group A (C) Rules generated by C4.5rules Rule 30: (elevel = 2) ∧ (50762 < salary ≤ 98490) Rule 25: (elevel = 3) ∧ (48632 < salary ≤ 98490) Rule 23: (elevel = 4) ∧ (60357 < salary ≤ 98490) Rule 32: (33 < age ≤ 60) ∧ (48632 < salary ≤ 98490)∧ (elevel = 1) Rule 57: (age > 38) ∧ (102418 < salary ≤ 124930 ∧ (age ≤ 59) ∧ elevel = 4) Rule 37: (salary > 48632) ∧ (commission > 18543) Rule 14: (age ≤ 39) ∧ (elevel = 0) ∧ (salary ≤ 48632) Rule 16: (age > 59) ∧ (elevel = 0) ∧ (salary ≤ 48632) Rule 12: (age > 65) ∧ (elevel = 1) ∧ (salary ≤ 48632) Rule 48: (car = 4) ∧ (98490 < salary ≤ 102418) Figure 7: Classification function 4 and rules extracted.
