Feature Selection for Microarray Gene Expression Data Using Simulated Annealing Guided by the Multivariate Joint Entropy

Félix Fernando González-Navarro (1) and Lluís A. Belanche-Muñoz (2)

1 Instituto de Ingeniería, Universidad Autónoma de Baja California, Mexicali, Mexico
2 Dept. de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, Barcelona, Spain

[email protected], [email protected]

Abstract. Microarray classification poses many challenges for data analysis, given that a gene expression data set may consist of dozens of observations with thousands or even tens of thousands of genes. In this context, feature subset selection techniques can be very useful to reduce the representation space to one that is manageable by classification techniques. In this work we use the discretized multivariate joint entropy as the basis for a fast evaluation of gene relevance in a microarray gene expression context. The proposed algorithm combines a simulated annealing schedule specially designed for feature subset selection with the incrementally computed joint entropy, reusing previous values to compute current feature subset relevance. This combination turns out to be a powerful tool when applied to the maximization of gene subset relevance. Our method delivers highly interpretable solutions that are more accurate than competing methods. The algorithm is fast, effective and has no critical parameters. The experimental results in several public-domain microarray data sets show a remarkably high classification performance and small subset sizes, formed mostly by biologically meaningful genes. The technique is general and could be used in other similar scenarios.

Keywords. Feature selection, microarray gene expression data, multivariate joint entropy, simulated annealing.
1 Introduction

In cancer diagnosis, accurate classification of the different tumor types is of paramount importance. An accurate prediction of different tumor types provides better treatment and toxicity minimization for patients. Traditional methods of tackling this situation are based primarily on morphological characteristics of tumorous tissue [13]. These conventional methods are reported to have several diagnostic limitations. In order to analyze the problem of cancer classification using gene expression data, more systematic approaches were developed [34].

Pioneering work in cancer classification by gene expression using DNA microarrays showed the possibility of aiding the diagnosis by means of Machine Learning or more general Data Mining methods [22], which are now extensively used for this task [16]. However, in this setting gene expression data analysis entails a heavy computational consumption of resources, due to the extreme sparseness compared to standard data sets in classification tasks [51]. Classifying cancer types using such a very high ratio of the number of variables to the number of observations is a delicate process. As a result, dimensionality reduction and in particular feature subset selection (FSS) techniques may be very useful. Finding small subsets of very relevant genes among a huge quantity of genes could result in more specific and thus more efficient treatments.

This work addresses the problem of selecting a subset of features by using the TAFS (Thermodynamic Algorithms for Feature Selection) family of methods for the FSS problem.
Given a suitable objective function, these algorithms make use of a special-purpose simulated annealing (SA) technique to find a good subset of features that maximizes the objective function. A distinctive characteristic over other search algorithms for FSS is the probabilistic capability to momentarily accept worse solutions, which in the end may result in better hypotheses.

Despite their powerful optimization capability, SA-based search algorithms usually lack execution speed, involving long convergence times. In consequence, they have been generally excluded as an option in FSS problems, let alone in highly complex domains such as microarray gene expression data. Nonetheless, a few contributions using the classical SA algorithm for FSS are found in prostate protein mass spectrometry data [32], marketing applications [35], or parameter optimization in clustering gene expression analysis [18].

Our answer to these computational problems is twofold. First, we use a filter objective function for FSS (thus avoiding the development of a predictive model for every subset evaluation). Second, the objective function itself is evaluated very efficiently, based on the reutilization of previous computations. Specifically, a way to calculate the multivariate joint entropy for categorical variables is presented that is both exact and very efficient. This measure is then used by an SA-based TAFS algorithm to search for small subsets of highly relevant genes. Classification experiments in five public-domain microarray data sets yield some of the best prediction results reported so far for these problems while offering a drastic reduction in subset sizes.

The paper is organized as follows. Section 2 briefly reviews the necessary background. Section 3 develops the proposed method, the new information-theoretic measure for feature relevance, its efficient implementation and its embedding into a TAFS-like algorithm, which we name µ-TAFS. Section 4 describes the data sets and the experimental settings, and Section 5 presents the results and their interpretation. The paper ends with the conclusions and directions for future work.

2 Preliminaries

In this section we briefly review the necessary background: the Simulated Annealing technique, basic Information Theory concepts and the TAFS family of thermodynamic algorithms for feature subset selection.

2.1 Simulated Annealing

Simulated Annealing (SA) is a stochastic technique inspired by statistical mechanics for finding (near) globally optimal solutions to large optimization problems. SA is a weak method in that it needs almost no information about the structure of the search space. The algorithm works by assuming that some parts of the current solution belong to a potentially better one, and thus these parts should be retained by exploring neighbors of the current solution. Assuming the objective function is to be minimized, SA would then jump from hill to hill and hence escape or simply avoid sub-optimal solutions.

When a system S (considered as a set of possible states) is in thermal equilibrium (at a given temperature T), the probability that it is in a certain state s, called P_T(s), depends on T and on the energy E(s) of the state s. This probability follows a Boltzmann distribution:

P_T(s) = \exp(-E(s)/kT) / Z, \quad \text{with} \quad Z = \sum_{s \in S} \exp(-E(s)/kT),

where k is the Boltzmann constant and Z acts as a normalization factor. Metropolis and his co-workers developed a stochastic relaxation method that works by simulating the behavior of a system at a given temperature T [36]. Letting s be the current state and s' a neighboring state, the probability of making a transition from s to s' is the ratio P_T(s -> s') of the probability of being in s' to the probability of being in s:

P_T(s \to s') = P_T(s') / P_T(s) = \exp(-\Delta E / kT), \qquad (1)

where we have defined \Delta E = E(s') - E(s).
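In code, the Metropolis rule of Eq. (1) embedded in a cooling loop takes only a few lines. The following Python sketch is purely illustrative: the energy and neighbor functions, the constants and all names are hypothetical placeholders, and the Boltzmann constant k is absorbed into the temperature, as is customary in combinatorial optimization.

```python
import math
import random

def simulated_annealing(energy, neighbor, s0, t0=1.0, t_min=1e-4, alpha=0.9, steps_per_t=100):
    """Generic SA skeleton: Metropolis acceptance inside a geometric cooling loop."""
    s, t = s0, t0
    best, best_e = s0, energy(s0)
    while t > t_min:                      # outer loop: cooling schedule
        for _ in range(steps_per_t):      # inner loop: approximate thermal equilibrium at T
            s_new = neighbor(s)
            delta_e = energy(s_new) - energy(s)
            # Metropolis rule (Eq. 1): improvements are always accepted,
            # worse states only with probability exp(-delta_e / t)
            if delta_e <= 0 or random.random() < math.exp(-delta_e / t):
                s = s_new
                if energy(s) < best_e:
                    best, best_e = s, energy(s)
        t *= alpha                        # geometric cooling
    return best
```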
Therefore, the acceptance or rejection of s' as the new state depends on the difference of the energies of both states at temperature T. If P_T(s') >= P_T(s) then the "move" is always accepted. If P_T(s') < P_T(s) then it is accepted with probability P_T(s -> s') < 1 (this situation corresponds to a transition to a higher-energy state).

Note that this probability depends upon the current temperature T and decreases as T does. In the end, there will be a value of T low enough (the freezing point) wherein these transitions will be very unlikely and the system will be considered frozen. In order to maximize the probability of finding states of minimal energy at every value of T, thermal equilibrium must be reached. To do this, according to Metropolis, an annealing schedule is designed to prevent the process from getting stuck at a local minimum. The SA algorithm introduced in [30] consists in using the Metropolis idea at each temperature T for a finite amount of time. In this algorithm T is first set at an initially high value, spending enough time at it in order to approximate thermal equilibrium. Then a small decrement of T is performed and the process is iterated until the system is considered frozen.

If the cooling schedule is well designed, the final reached state may be considered a near-optimal solution. However, the whole process is inherently slow, mainly because of the thermal equilibrium requirement at every temperature T.

2.2 Information Theory

Entropy, a main concept in Information Theory [47], can be seen as an average of the uncertainty in a random variable. If X is a discrete random variable with probability mass function (PMF) p, its entropy is defined by (1)

H(X) = -\sum_x p(x) \log p(x) = -E_X[\log p(X)], \qquad (2)

where E[.] is the expectation operator over the PMF of X.

(1) All logarithms are taken in base 2.

If a variable (X) is known and another one (Y) is not, the conditional entropy of Y with respect to X is the expected entropy of Y under the corresponding conditional distribution:

H(Y|X) = -\sum_x \sum_y p(x, y) \log p(y|x). \qquad (3)

The mutual information (MI) can be interpreted as a measure of the information that a random variable has or explains about another one:

I(X;Y) = H(Y) - H(Y|X) = E_{X,Y}[\log (p(x,y) / (p(x)\,p(y)))], \qquad (4)

where H denotes the entropy. Note that I(X;X) = H(X), since H(X|X) = 0, and that I(X;Y) = I(Y;X). The conditional MI is expressed in the natural way, by conditioning in (4):

I(X;Y|Z) = H(Y|Z) - H(Y|X,Z). \qquad (5)

MI has been successfully used in feature selection as a way to measure the influence that a feature has over the class or target variable. Sometimes a normalized variant is used, given by C_XY = I(X;Y)/H(Y), where Y is the class or target variable, commonly known as the coefficient of constraint or uncertainty coefficient (note that the maximum value that I(X;Y) can take is H(Y)). This coefficient can be understood by analyzing Fig. 1.

Fig. 1. Mutual information (Eq. 4) between X and Y

Increasing the value of I(X;Y) decreases H(Y|X), which means that the uncertainty about a variable Y is reduced by the knowledge of X. There is an index of relevance that exploits this property to measure the relevance of a feature X (with respect to a class or target value Y) by conditioning on a third variable Z [4]:

R(X;Y|Z) = I(X;Y|Z) / H(Y|Z) = (H(Y|Z) - H(Y|X,Z)) / H(Y|Z), \qquad (6)

where R(X;Y|Z) = 0 if H(Y|Z) = 0. Used within a forward selection strategy that maximizes it, Eq. (6) has been applied with some success to low-dimensional data sets [4].
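The quantities in Eqs. (2)-(4) and the uncertainty coefficient can be estimated from discretized data with simple frequency counts. The NumPy sketch below is only a direct transcription of these formulas under our own naming; it is not the authors' implementation.

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Empirical entropy H(X) in bits (Eq. 2), from the observed value frequencies."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), an equivalent form of Eq. (4) for discrete data."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def uncertainty_coefficient(x, y):
    """C_XY = I(X;Y) / H(Y); lies in [0, 1] because I(X;Y) <= H(Y)."""
    hy = entropy(y)
    return mutual_information(x, y) / hy if hy > 0 else 0.0
```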
2.3 Thermodynamic Algorithms for Feature Selection

In this section we review TAFS (Thermodynamic Algorithm for Feature Selection), an algorithm for FSS that was originally designed for problems of moderate feature size (up to one hundred) [23]. If we consider FSS as a search over possible feature subsets of the full feature set X, then SA acts as a combinatorial optimization process [41]. In this sense, TAFS finds a subset of features that optimizes the value of a given objective function J : P(X) -> R, which we assume to be non-negative and to be maximized (2).

(2) This is the case of accuracy, mutual information, inter-class distances and many other useful measures.

To this end, a special-purpose forward/backward mechanism is embedded into an SA algorithm, taking advantage of its most distinctive characteristic: the probabilistic acceptance of worse scenarios over a finite time. This characteristic is enhanced by the notion of an ε-improvement: a feature ε-improves a current solution if it yields a higher value of the objective function or a value not worse than ε%. This mechanism is intended to account for noise in the evaluation of the objective function (caused either by the finiteness of the data set or introduced by the chosen resampling method).

The pseudo-code of TAFS is presented in Algorithm 1. The algorithm consists of two major loops. The outer loop waits for the inner loop to finish and then updates T according to the chosen cooling schedule. When this loop reaches T_min, the algorithm halts. It keeps track of the best solution found (which is not necessarily the current one).

Algorithm 1: TAFS algorithm for feature selection
  input: X: full feature set {X_1, ..., X_n}; J(): objective function; α(): cooling schedule;
         ε: epsilon; T_0: initial temperature; T_min: final temperature
  1   X_cur <- ∅                     // initial current subset
  2   J_cur <- 0                     // initial objective function value
  3   T <- T_0                       // initial temperature
  4   while T > T_min do
  5      repeat
  6         Y <- X_cur
  7         Forward(X_cur, J_cur)
  8         Backward(X_cur, J_cur)
  9      until Y = X_cur
  10     T <- α(T)

The inner loop is the core of the algorithm and is composed of two interleaved procedures, Forward and Backward, that iterate until an equilibrium point is found. These procedures work independently of each other, but share information about the results of their respective searches (the current solution). Within each of them, FSS takes place and the mechanism to escape from local minima starts working. The pseudo-code for the Forward and Backward procedures and for the ε-improvement test is outlined in Algorithms 2, 3 and 4. These procedures iteratively add or remove features one at a time in such a way that an ε-improvement is accepted unconditionally, whereas a non ε-improvement is accepted probabilistically. When Forward and Backward finish their respective tasks, TAFS checks whether the current solution is the same as it was prior to their execution. If this is the case, then we consider that thermal equilibrium has been reached and T is adjusted according to the cooling schedule. If it is not, another loop of Forward and Backward is carried out.

Algorithm 2: Procedure Forward (Z, J_Z are modified)
  input: Z, J_Z
  1   repeat
  2      x <- argmax_{X_i ∈ X\Z} J(Z ∪ {X_i})
  3      if >_ε(Z, x, true) then
  4         accept <- true
  5      else
  6         ΔJ <- J(Z ∪ {x}) - J(Z)
  7         accept <- rand(0,1) < e^{ΔJ/T}
  8      if accept then
  9         Z <- Z ∪ {x}
  10        if J(Z) > J_Z then
  11           J_Z <- J(Z)
  12  until not accept

Algorithm 3: Procedure Backward (Z, J_Z are modified)
  input: Z, J_Z
  1   repeat
  2      x <- argmax_{X_i ∈ Z} J(Z \ {X_i})
  3      if >_ε(Z, x, false) then
  4         accept <- true
  5      else
  6         ΔJ <- J(Z \ {x}) - J(Z)
  7         accept <- rand(0,1) < e^{ΔJ/T}
  8      if accept then
  9         Z <- Z \ {x}
  10        if J(Z) > J_Z then
  11           J_Z <- J(Z)
  12  until not accept

Algorithm 4: Function >_ε
  input: Z, x, d
  output: boolean
  1   if d then
  2      Z' <- Z ∪ {x}
  3   else
  4      Z' <- Z \ {x}
  5   Δx <- J(Z') - J(Z)
  6   if Δx > 0 then
  7      return true
  8   else
  9      return -Δx / J(Z) < ε
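The ε-improvement test of Algorithm 4 and the acceptance rule shared by Forward and Backward can be written compactly. The Python fragment below is a sketch under our own naming (J is assumed to be a strictly positive, set-valued objective evaluated elsewhere), not the reference implementation.

```python
import math
import random

def epsilon_improves(J, Z, x, add, eps=0.01):
    """Algorithm 4: the move improves J, or worsens it by less than eps (relative)."""
    Z_new = Z | {x} if add else Z - {x}
    delta = J(Z_new) - J(Z)
    return delta > 0 or (-delta / J(Z)) < eps

def accept_move(J, Z, x, add, T, eps=0.01):
    """Acceptance rule of Algorithms 2-3: unconditional for eps-improvements, probabilistic otherwise."""
    if epsilon_improves(J, Z, x, add, eps):
        return True
    delta = (J(Z | {x}) if add else J(Z - {x})) - J(Z)   # delta <= 0 in this branch
    return random.random() < math.exp(delta / T)
```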
3 Proposed Method

The proposed method represents a development of the TAFS algorithm in three ways: first, we enhance the algorithm to make it much faster and more effective; second, we derive a new information-theoretic measure for feature relevance; and third, we present an efficient incremental implementation of this measure. We name this new algorithm µ-TAFS.

3.1 eTAFS: an Enhanced TAFS Algorithm

A modification to Algorithm 1 aimed at speeding up relaxation time is presented in this section. The algorithm, named eTAFS (see Algorithms 5 and 6), is endowed with a feature search window (of size l) in the backward step, as follows. In forward steps the best feature is always added (by looking at all possible additions). In backward steps this search is limited to l tries at random (without replacement). The value of l is incremented by one at every thermal-equilibrium point. This mechanism is an additional source of non-determinism and a bias towards adding a feature only when it is the best option available. On the contrary, to remove one, it suffices that its removal ε-improves the current solution. Another direct consequence is of course a considerable speed-up of the algorithm. Note that the design of eTAFS is such that it grows more and more deterministic, informed and costly as it converges toward the final configuration.

Algorithm 5: eTAFS algorithm for feature selection
  input: X: full feature set {X_1, ..., X_n}; J(): objective function; α(): cooling schedule;
         ε: epsilon; T_0: initial temperature; T_min: final temperature
  1   X_cur <- ∅                     // initial current subset
  2   J_cur <- 0                     // initial objective function value
  3   T <- T_0                       // initial temperature
  4   l <- 2                         // window size (for backward steps)
  5   while T > T_min do
  6      repeat
  7         Y <- X_cur
  8         Forward(X_cur, J_cur, l)
  9         Backward(X_cur, J_cur, l)
  10     until Y = X_cur
  11     T <- α(T)
  12     l <- l + 1

Algorithm 6: eTAFS Backward procedure (Z, J_Z are modified). Note that X_0 can be efficiently computed while in the for loop
  input: Z, J_Z, l
  1   A <- ∅; A_B <- ∅
  2   repeat
  3      for i := 1 to min(l, |Z|) do
  4         select x ∈ Z \ A_B at random
  5         if >_ε(Z, x, false) then
  6            A <- A ∪ {x}
  7         A_B <- A_B ∪ {x}
  8      X_0 <- argmax_{X ∈ A_B} J(Z \ {X})
  9      if X_0 ∈ A then
  10        accept <- true
  11     else
  12        ΔJ <- J(Z \ {X_0}) - J(Z)
  13        accept <- rand(0,1) < e^{ΔJ/T}
  14     if accept then
  15        Z <- Z \ {X_0}
  16        if J(Z) > J_Z then
  17           J_Z <- J(Z)
  18  until not accept
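One backward pass of eTAFS (Algorithm 6) can then be paraphrased as below. This sketch reuses the hypothetical epsilon_improves helper from the previous fragment and only illustrates the sampling-without-replacement window; all names and details are our own.

```python
import math
import random

def etafs_backward_step(J, Z, T, window, eps=0.01):
    """Examine at most `window` random removal candidates and drop the best one if accepted."""
    if not Z:
        return Z
    improving, examined = set(), set()
    for _ in range(min(window, len(Z))):
        x = random.choice(list(Z - examined))          # sample without replacement
        examined.add(x)
        if epsilon_improves(J, Z, x, add=False, eps=eps):
            improving.add(x)
    x0 = max(examined, key=lambda f: J(Z - {f}))       # best of the examined removals
    if x0 in improving:
        return Z - {x0}                                # accepted unconditionally
    delta = J(Z - {x0}) - J(Z)
    if random.random() < math.exp(delta / T):          # accepted probabilistically
        return Z - {x0}
    return Z
```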
3.2 Information-Theoretic Feature Relevance

The definition in Eq. (6) lacks some components. Fig. 2 represents a three-variable interaction diagram. The darker shaded area represents Eq. (6), where two components are missing, I(X;Y;Z) and I(Y;Z|X), and therefore normalization is done over a bigger area than it should be, namely over H(Y|Z) rather than over H(Y|X,Z). As a consequence, establishing a measure with respect to the reference entropy H(Y) is incomplete.

Fig. 2. Conditional mutual information (Eq. 5) between X, Y and Z

The shaded area in Fig. 3 represents I(Y;X,Z) = H(Y) - H(Y|X,Z), the information that two variables explain about a third one. Remarkably, this quantity can be calculated without explicit conditioning as follows:

I(Y;X,Z) = H(Y) - H(Y|X,Z)
         = \sum_{Y} P(Y) \log (1 / P(Y)) - \sum_{X,Y,Z} P(X,Y,Z) \log (1 / P(Y|X,Z))
         = \sum_{X,Y,Z} P(X,Y,Z) \log (1 / P(Y)) - \sum_{X,Y,Z} P(X,Y,Z) \log (1 / P(Y|X,Z))
         = \sum_{X,Y,Z} P(X,Y,Z) \log (P(Y|X,Z) / P(Y))
         = \sum_{X,Y,Z} P(X,Y,Z) \log (P(X,Y,Z) / (P(Y)\,P(X,Z)))
         = H(Y) + H(X,Z) - H(X,Y,Z),

from which we obtain

I(Y;X,Z) = H(Y) + H(X,Z) - H(X,Y,Z). \qquad (7)

Fig. 3. Mutual information between two variables X and Z and a class variable Y, as in Eq. (7)

In this work we calculate the conditional MI between a class variable Y and two variables X and Z as the joint information that X and Z explain about Y. Given that I(Y;X,Z) <= H(Y) and that H(Y) acts as the baseline reference, it is wise to normalize Eq. (7) as

J(Y;X,Z) = (H(Y) + H(X,Z) - H(X,Y,Z)) / H(Y). \qquad (8)

An index of relevance is then obtained which evaluates the influence of two variables X, Z with respect to a class variable Y. It takes values between zero (no relevance) and one (maximum relevance). This index acts as the objective function J to be optimized. The reward of using this objective function within a TAFS-like algorithm is the possibility of applying it to highly complex domains such as microarray data sets. We name the combination of eTAFS and the objective function in Eq. (8) the µ-TAFS algorithm.
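In code, Eqs. (7)-(8) reduce to three entropy estimates over discretized columns. The sketch below is our own illustration, reusing the entropy helper from Section 2.2; for a subset of more than two features, the pair (x, z) is simply replaced by the tuple of all selected columns.

```python
def joint_relevance(y, x, z):
    """J(Y; X, Z) = (H(Y) + H(X,Z) - H(X,Y,Z)) / H(Y), Eqs. (7)-(8); lies in [0, 1]."""
    hy = entropy(y)
    h_xz = entropy(list(zip(x, z)))
    h_xyz = entropy(list(zip(x, y, z)))
    return (hy + h_xz - h_xyz) / hy if hy > 0 else 0.0
```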
3.3 Incremental Multivariate Joint Entropy

For a pair of discrete random variables X, Y, it is known that the joint entropy obeys

H(X,Y) \geq H(X). \qquad (9)

This property says that the joint entropy is always at least equal to the entropy of the original system: adding a new variable can never reduce the available uncertainty. If we rewrite (9) as an equation,

H(X,Y) = H(X) + \Delta_X(Y), \qquad (10)

then \Delta_X(Y) \geq 0 represents the increment in entropy due to the addition of the variable Y to the system. In a feature selection setting, given Z a class variable, τ ⊂ X the current subset and H(τ) its joint entropy, if a new feature X_i ∈ X \ τ is considered for possible inclusion in the current subset, then:

H(Z, \tau \cup \{X_i\}) = H(Z, \tau) + \Delta_{Z,\tau}(X_i). \qquad (11)

It turns out that, to obtain the next calculation, it is computationally far more advantageous to store H(Z,τ) and calculate the quantity Δ_{Z,τ}(X_i) than to compute the full joint entropy H(Z, τ ∪ {X_i}) directly. In order to obtain this value, an incremental procedure to calculate the multivariate joint entropy has been developed, as described in what follows.

The incremental multivariate joint entropy (11) must be computed at every evaluation step involving a possible candidate feature X_i to be included in the current subset τ. Throughout the process, τ is associated with its current Marginal Entropy Scheme (MES), a table storing the unique values contained in the data set for its forming features and their corresponding entropy contributions. An example of a MES table for two binary variables {X_1, X_2} is shown in Table 1.

Table 1. Marginal Entropy Scheme (MES) tables for one variable (left) and the addition of a second variable (right). P(·) is the probability mass function, obtained from the data (all entropies are in bits)

  X_1 | P(X_1) | -P(X_1) log P(X_1)
   0  | 0.538  | 0.481
   1  | 0.462  | 0.515
  H(X_1) = 0.996

  X_1 | X_2 | P(X_1,X_2) | -P(X_1,X_2) log P(X_1,X_2)
   0  |  0  | 0.231      | 0.488
   0  |  1  | 0.308      | 0.523     H(X_1=0, X_2) = 1.011
   1  |  0  | 0.154      | 0.415
   1  |  1  | 0.308      | 0.523     H(X_1=1, X_2) = 0.939

At the initial step (τ = ∅) the MES table for the addition of X_1 to ∅ is indicated in the left part of Table 1. The two unique values and their entropy contributions, H(X_1 = 0) = 0.481 and H(X_1 = 1) = 0.515, are calculated. Let us suppose that a feature X_2 is to be evaluated with respect to the current subset τ = {X_1}. The MES table with its unique forming patterns is indicated in the right part of Table 1. We can see that by introducing X_2 into the current subset τ, four partitions are generated from the unique values of X_1: {00, 01, 10, 11}. In the particular case of X_1 = 0, a change in its entropy contribution is produced by the action of X_2, which splits it into two entropy values, H(X_1 = 0, X_2 = 0) = 0.488 and H(X_1 = 0, X_2 = 1) = 0.523, for a total entropy of H(X_1 = 0, X_2) = 1.011. The increment in entropy Δ_τ is obtained as the difference between the current MES (considering the addition of X_2) and the previous scheme (without it), see Table 2.

Table 2. Δ_τ computations from the Marginal Entropy Scheme, see Table 1

              | H(X_1, X_2) | -P(X_1) log P(X_1) | difference
  Δ_τ(X_1=0)  | 1.011       | 0.481              | 0.531
  Δ_τ(X_1=1)  | 0.939       | 0.515              | 0.424
  Δ_τ         |             |                    | 0.954

Finally, this last value is applied to Eq. (11) to obtain the joint entropy H(X_1,X_2) = H(X_1) + Δ_τ(X_2) = 0.996 + 0.954 = 1.950. The listings in Algorithms 7 and 8 show the pseudo-code to compute the procedure explained above. The notation D|τ stands for the restriction of the data set D to the features in τ.

Algorithm 7: Incremental Multivariate Joint Entropy
  input:  τ: current subset; X_i: feature to be added; H_τ: current subset joint entropy;
          E_τ: marginal entropy scheme of H_τ; D: data set
  output: τ, H_τ, E_τ
  1   if |τ| = 0 then
  2      τ <- {X_i}
  3      D <- Sort(D)
  4      H_τ <- joint entropy of D
  5      E_τ <- MarginalEntropyScheme(D|τ)
  6   else
  7      τ+ <- τ ∪ {X_i}
  8      Sort(D|τ+)
  9      E_τ+ <- MarginalEntropyScheme(D|τ+)
  10     E_τ- <- Σ_j E_τ[j]               // j runs through the values of τ
  11     Δ_τ <- Σ_i (E_τ+[i] - E_τ-[i])
  12     τ <- τ+
  13     H_τ <- H_τ + Δ_τ
  14     E_τ <- E_τ+                      // new MES

Algorithm 8: MarginalEntropyScheme function
  input:  D: data set
  output: E
  1   for each unique value v in D do
  2      Υ[v] <- fraction of instances in D with value v
  3   E <- H(Υ)                           // calculate the entropy of this distribution

Initial entropy is evaluated in lines 2-5. This first step calculates the starting joint entropy as well as its first MES (lines 4-5), which will be taken as input to the next computation. Note that these two lines can be efficiently implemented as one function using only one loop cycle with complexity Θ(|D|), where |D| is the number of training instances.

In the else part of the if clause, the MES is calculated with the addition of X_i to the current subset τ (named E_τ+). Taking into account that the previous MES inherits the ordering sequence derived from a previous stage (because of lines 5 and 9), the entropies generated by changes in the MES given by τ ∪ {X_i} are summed (E_τ-) in groups (line 11) defined by the newly formed patterns, rendering a one-to-one correspondence between the previous MES and the current MES.

Thus, the entropy contribution Δ_τ(X_i), showing the effect of adding X_i to τ, is computed as the difference between both MESs (line 11) and is finally added to the current entropy H_τ (line 13). The implementation of lines 10-11 follows the same considerations as that of lines 4-5, and hence its complexity is of the same order.
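A compact way to realize Algorithms 7-8 is to keep, for the currently selected columns, the per-row value pattern and its counts; adding a candidate feature only refines those patterns, so the increment Δ of Eq. (11) is the difference between the refined and the stored scheme entropies. The Python class below is our own paraphrase of this bookkeeping (the group-by-previous-pattern computation used in the paper yields the same Δ), not the authors' code.

```python
import math
from collections import Counter

def scheme_entropy(counts, n):
    """Entropy (in bits) of a pattern distribution given its counts over n instances."""
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

class MarginalEntropyScheme:
    """Joint entropy of the selected columns (plus the class), updated incrementally (Eq. 11)."""
    def __init__(self, column):
        self.n = len(column)
        self.keys = [(v,) for v in column]       # per-row pattern of the selected columns
        self.H = scheme_entropy(Counter(self.keys), self.n)

    def delta_if_added(self, column):
        """Increment Δ produced by refining the current patterns with a candidate column."""
        refined = Counter(k + (v,) for k, v in zip(self.keys, column))
        return scheme_entropy(refined, self.n) - self.H

    def add(self, column):
        """Commit the candidate: H(τ ∪ {X_i}) = H(τ) + Δ."""
        self.H += self.delta_if_added(column)
        self.keys = [k + (v,) for k, v in zip(self.keys, column)]
```

Starting the scheme from the class column, delta_if_added evaluates the increment needed by Eq. (8) for every remaining candidate in a single pass over the data, without recomputing the full joint entropy of the already selected columns.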
4 Experimental Work

4.1 Data Sets

Five public-domain microarray gene expression data sets are used to test and validate the approach proposed in this work:

- Colon Tumor: 62 observations of colon tissue, of which 40 are tumorous and 22 normal; 2,000 genes [2].

- Leukemia: 72 bone marrow observations and 7,129 probes: 6,817 human genes and 312 control genes [22]. The goal is to tell acute myeloid leukemia (AML) from acute lymphoblastic leukemia (ALL).

- Lung Cancer: distinction between malignant pleural mesothelioma and adenocarcinoma of the lung [25]; 181 observations with 12,533 genes.

- Prostate Cancer: used in [49] to analyze differences in pathological features of prostate cancer and to identify genes that might anticipate its clinical behavior; 136 observations and 12,600 genes.

- Breast Cancer: 97 patients with primary invasive breast carcinoma; 24,481 genes have to be analyzed [52].

To compute the necessary entropies described in previous sections, a discretization process is needed. This change of representation does not often result in a significant loss of accuracy (sometimes it significantly improves it [39], [40]); it also offers a reduction in learning time [10]. In this work, the CAIM algorithm was selected for two reasons: it is designed to work with supervised data and it does not require the user to define a specific number of intervals [31].

4.2 Settings

Provided that the core nature of the µ-TAFS algorithm resides in its stochasticity, multiple runs can be performed and used to obtain better solutions. The experimental design to test the µ-TAFS algorithm measures performance by carrying out 100 different independent runs. In each run, the algorithm is executed on the corresponding data set and returns the set of all those feature subsets reaching the best found performance (maximum relevance, in this case). The subset that offers the lowest mutual information (MI) among its elements, i.e., the least redundancy, is taken as the subset delivered by this run.

The µ-TAFS parameters are set as follows: ε = 0.01, T_0 = 0.1 and T_min = 0.0001; these are standard settings and are kept constant for all the problems [23]. The cooling function was chosen to be geometric, α(T) = 0.9 T, following recommendations in the literature [41].

Table 3. µ-TAFS running performance. Time indicates the running time (in minutes) over the 100 executions; J_eval is the number of evaluations of J; size is the average size of the final solutions and its standard error

  Data set         | Time   | J_eval     | size
  Colon Tumor      | 6.41   | 503,901    | 6.93 ± 0.06
  Leukemia         | 6.51   | 506,489    | 3.36 ± 0.06
  Lung Cancer      | 7.45   | 560,972    | 2.58 ± 0.04
  Prostate Cancer  | 98.74  | 7,119,800  | 9.85 ± 0.05
  Breast Cancer    | 136.93 | 10,943,628 | 9.62 ± 0.03

Fig. 4. µ-TAFS search processes (one panel per data set: Colon Tumor, Leukemia, Lung Cancer, Prostate Cancer and Breast Cancer; y-axis: J; x-axis: Index of Temperature). The x-axis is the iteration counter for the outer loop of the algorithm

5 Experimental Results

5.1 µ-TAFS Performance Results

The evolution of µ-TAFS from a high-temperature state to a frozen point is depicted in Fig. 4. Highly unstable readings, i.e., the high-temperature condition, are observed at the initial stages for each of the data sets. As soon as the algorithm becomes more relaxed due to Eq. (1), worse solutions are avoided. The frozen condition is observed at the final stages of each execution, where J values consecutively reach the maximum possible value (J = 1) in all cases.

The running performance of µ-TAFS is summarized in Table 3. The results show that µ-TAFS yields subsets of considerably low size and also low variability. Remarkable readings correspond to Leukemia and Lung Cancer. It is conjectured that such sizes respond to the nature of the information-theoretic model on discretized data sets, in the sense that only a few genes significantly contribute to increasing the index of relevance given by Eq. (8). On the one hand, working with con-