Anytime Subgroup Discovery in Numerical Domains with Guarantees

Aimene Belfodil∗,1,2, Adnene Belfodil∗,1, and Mehdi Kaytoue1,3

1 Univ Lyon, INSA Lyon, CNRS, LIRIS UMR 5205, F-69621, LYON, France
2 Mobile Devices Ingénierie, 100 Avenue Stalingrad, 94800, Villejuif, France
3 Infologic, 99 avenue de Lyon, 26500 Bourg-Lès-Valence, France
[email protected]
∗ Both authors contributed equally to this work.

Abstract. Subgroup discovery is the task of discovering patterns that accurately discriminate a class label from the others. Existing approaches can uncover such patterns either through an exhaustive or an approximate exploration of the pattern search space. However, an exhaustive exploration is generally infeasible, whereas approximate approaches do not provide guarantees bounding the error of the best pattern quality nor the exploration progression ("How far are we from an exhaustive search?"). We design here an algorithm for mining numerical data with three key properties w.r.t. the state of the art: (i) it progressively yields interval patterns whose quality improves over time; (ii) it can be interrupted anytime and always gives a guarantee bounding the error on the top pattern quality; and (iii) it always bounds a distance to the exhaustive exploration. After reporting experiments showing the effectiveness of our method, we discuss its generalization to other kinds of patterns.

Keywords: Subgroup discovery, Anytime algorithms, Discretization

1 Introduction

We address the problem of discovering patterns that accurately discriminate one class label from the others in a numerical dataset. Subgroup discovery (SD) [27] is a well-established pattern mining framework which strives to find data regions uncovering such interesting patterns. When it comes to numerical attributes, a pattern is generally a conjunction of restrictions over the attributes, e.g., the pattern $50 \leq \text{age} < 70 \wedge \text{smoke per day} \geq 3$ fosters lung cancer incidence. To look for such patterns (namely interval patterns), various approaches are usually implemented. Common techniques perform a discretization transforming the numerical attributes to categorical ones in a pre-processing phase before using the wide spectrum of existing mining techniques [2,20,22,3]. This leads, however, to a loss of information even if an exhaustive enumeration is performed on the transformed data [2]. Other approaches explore the whole search space of all restrictions either exhaustively [18,14,6] or heuristically [23,5]. While an exhaustive enumeration is generally infeasible in large data, the various state-of-the-art algorithms that heuristically explore the search space provide no provable guarantee on how they approximate the top quality patterns and on how far they are from an exhaustive search. Recent techniques set up a third and elegant paradigm, that is direct sampling approaches [3,4,13]. Algorithms falling under this category are non-enumerative methods which directly sample solutions from the pattern space. They simulate a distribution which rewards high quality patterns with respect to some interestingness measure. While [3,4] propose a direct two-step sampling procedure dedicated to categorical/boolean datasets, the authors of [13] devise an interesting framework which adds a third step to handle the specificity of numerical data. The proposed algorithm addresses the discovery of dense neighborhood patterns by defining a new density metric.
Nevertheless, it does not consider the discovery of discriminant numerical patterns in labeled numerical datasets. Direct sampling approaches abandon the completeness property and generate only approximate results. In contrast, anytime pattern mining algorithms [5,16] are enumerative methods which exhibit the anytime feature [29]: a solution is always available, its quality improves gradually over time, and it converges to an exhaustive search if given enough time, hence ensuring completeness. However, to the best of our knowledge, no existing anytime algorithm in the SD framework makes it possible to ensure guarantees on the discriminative power of the patterns and on the remaining distance to an exhaustive search while taking into account the nature of numerical data.

To achieve this goal, we propose a novel anytime algorithm, RefineAndMine, tailored for discriminant interval pattern discovery in numerical data. It starts by mining interval patterns in a coarse discretization, followed by successive refinements yielding increasingly finer discretizations highlighting potentially new interesting patterns. Eventually, it performs an exhaustive search, if given enough time. Additionally, our method gives two provable guarantees at each refinement. The first evaluates how close the best pattern found so far is to the optimal one in the whole search space. The second measures how diverse the already found patterns are and how well they cover all the interesting regions in the dataset.

The outline is as follows. We recall basic definitions in Sec. 2. Next, we formally define the problem in Sec. 3. Subsequently, we introduce our mining algorithm in Sec. 4 before formulating the guarantees it provides in Sec. 5. We empirically evaluate the efficiency of RefineAndMine in Sec. 6 and discuss its potential improvements in Sec. 7. Additional materials are available in our companion page⁴. For more details and proofs, please refer to the technical report⁵.

2 Preliminaries

Input.
A labeled numerical dataset $(\mathcal{G}, \mathcal{M})$ is given by a finite set (of objects) $\mathcal{G}$ partitioned into two subsets $\mathcal{G}^+$ and $\mathcal{G}^-$ enclosing respectively positive (target) and negative instances; and a sequence of numerical attributes $\mathcal{M} = (m_i)_{1 \leq i \leq p}$ of size $p = |\mathcal{M}|$. Each attribute $m_i$ is a mapping $m_i : \mathcal{G} \to \mathbb{R}$ that associates to each object $g \in \mathcal{G}$ a value $m_i(g) \in \mathbb{R}$. We can also see $\mathcal{M}$ as a mapping $\mathcal{M} : \mathcal{G} \to \mathbb{R}^p, g \mapsto (m_i(g))_{1 \leq i \leq p}$. We denote $m_i[\mathcal{G}] = \{m_i(g) \mid g \in \mathcal{G}\}$ (more generally, for a function $f : E \to F$ and a subset $A \subseteq E$, $f[A] = \{f(e) \mid e \in A\}$). Fig. 1 (left table) presents a 2-dimensional labeled numerical dataset and its representation in the Cartesian plane (filled dots represent positive instances).

⁴ https://github.com/Adnene93/RefineAndMine
⁵ https://goo.gl/NWtXfp

Interval patterns and their extents. When dealing with numerical domains in SD, we generally consider interval patterns for intelligibility [18]. An interval pattern is a conjunction of restrictions over the numerical attributes, i.e. a set of conditions attribute ≷ v with ≷ ∈ {=, ≤, <, ≥, >}. Geometrically, interval patterns are axis-parallel hyper-rectangles. Fig. 1 (center-left) depicts the pattern (non-hatched rectangle) $c_2 = (1 \leq m_1 \leq 4) \wedge (0 \leq m_2 \leq 3) \triangleq [1,4] \times [0,3]$.

Interval patterns are naturally partially ordered thanks to "hyper-rectangle inclusion". We denote the infinite partially ordered set (poset) of all interval patterns by $(\mathcal{D}, \sqsubseteq)$ where $\sqsubseteq$ (same order used in [18]) denotes the dual order $\supseteq$ of hyper-rectangle inclusion. That is, pattern $d_1 \sqsubseteq d_2$ iff $d_1$ encloses $d_2$ ($d_1 \supseteq d_2$). It is worth mentioning that $(\mathcal{D}, \sqsubseteq)$ forms a complete lattice [26]. For a subset $S \subseteq \mathcal{D}$, the join $\bigsqcup S$ (i.e. the smallest upper bound) is given by the rectangle intersection. Dually, the meet $\bigsqcap S$ (i.e. the largest lower bound) is given by the smallest hyper-rectangle enclosing all patterns in $S$. Note that the top (resp.
bottom) pattern in $(\mathcal{D}, \sqsubseteq)$ is given by $\top = \emptyset$ (resp. $\bot = \mathbb{R}^p$). Fig. 1 (right) depicts two patterns (hatched) $e_1 = [1,5] \times (1,4]$ and $e_2 = [0,4) \times [2,6]$, their meet (non hatched) $e_1 \sqcap e_2 = [0,5] \times (1,6]$ and their join (black) $e_1 \sqcup e_2 = [1,4) \times [2,4]$.

A pattern $d \in \mathcal{D}$ is said to cover an object $g \in \mathcal{G}$ iff $\mathcal{M}(g) \in d$. To use the same order $\sqsubseteq$ to define such a relationship, we associate to each $g \in \mathcal{G}$ its corresponding pattern $\delta(g) \in \mathcal{D}$, which is the degenerate hyper-rectangle $\delta(g) = \{\mathcal{M}(g)\} = \times_{i=1}^{p} [m_i(g), m_i(g)]$. The cover relationship becomes $d \sqsubseteq \delta(g)$. The extent of a pattern is the set of objects supporting it. Formally, there is a function $ext : \mathcal{D} \to \wp(\mathcal{G}), d \mapsto \{g \in \mathcal{G} \mid d \sqsubseteq \delta(g)\} = \{g \in \mathcal{G} \mid \mathcal{M}(g) \in d\}$ (where $\wp(\mathcal{G})$ denotes the set of all subsets of $\mathcal{G}$). Note that if $d_1 \sqsubseteq d_2$ then $ext(d_2) \subseteq ext(d_1)$. We also define the positive (resp. negative) extent as follows: $ext^+(d) = ext(d) \cap \mathcal{G}^+$ (resp. $ext^-(d) = ext(d) \cap \mathcal{G}^-$). With the mapping $\delta : \mathcal{G} \to \mathcal{D}$ and the complete lattice $(\mathcal{D}, \sqsubseteq)$, we call the triple $\mathbb{P} = (\mathcal{G}, (\mathcal{D}, \sqsubseteq), \delta)$ the interval pattern structure [18,10].

object  m1  m2  label
g1      1   2   +
g2      1   3   +
g3      2   1   +
g4      3   5   −
g5      2   4   −
g6      2   5   +
g7      3   4   −
g8      4   4   −

Fig. 1: (left to right) (1) a labeled numerical dataset. (2) closed $c_1$ vs non-closed $c_2$ interval patterns. (3) cotp $d_1$ vs non cotp $d_2$. (4) meet and join of two patterns.

Measuring the discriminative power of a pattern. In SD, a quality measure $\phi : \mathcal{D} \to \mathbb{R}$ is usually defined to evaluate to what extent a pattern discriminates the positive instances in $\mathcal{G}^+$ from those in $\mathcal{G}^-$.
Two atomic measures are generally employed to quantify the quality of a pattern $d$: the true positive rate $tpr : d \mapsto |ext^+(d)| / |\mathcal{G}^+|$ and the false positive rate $fpr : d \mapsto |ext^-(d)| / |\mathcal{G}^-|$. Several measures exist in the literature [12,21]. A measure is said to be objective or probability based [12] if it depends solely on the number of co-occurrences and non co-occurrences of the pattern and the target label. In other words, those measures can be defined using only tpr, fpr and potentially other constants (e.g. $|\mathcal{G}|$). Formally, $\exists \phi^* : [0,1]^2 \to \mathbb{R}$ s.t. $\phi(d) = \phi^*(tpr(d), fpr(d))$. Objective measures depend only on the pattern extent. Hence, we use $\phi(ext(d))$ and $\phi(d)$ interchangeably. An objective quality measure $\phi$ is said to be discriminant if its associated measure $\phi^*$ is increasing with tpr (fpr being fixed) and decreasing with fpr (tpr being fixed). For instance, with $\alpha^+ = |\mathcal{G}^+| / |\mathcal{G}|$ and $\alpha^- = |\mathcal{G}^-| / |\mathcal{G}|$ denoting the label prevalences, $wracc^*(tpr, fpr) = \alpha^+ \cdot \alpha^- \cdot (tpr - fpr)$ and $informedness^*(tpr, fpr) = tpr - fpr$ are discriminant measures.

Compressing the set of interesting patterns using closure. Since discriminant quality measures depend only on the extent, closed patterns can be leveraged to reduce the number of resulting patterns [10]. A pattern $d \in \mathcal{D}$ is said to be closed (w.r.t. the pattern structure $\mathbb{P}$) if and only if it is the most restrictive pattern (i.e. the smallest hyper-rectangle) enclosing its extent. Formally, $d = int(ext(d))$ where the mapping $int$ (called intent) is given by: $int : \wp(\mathcal{G}) \to \mathcal{D}, A \mapsto \bigsqcap_{g \in A} \delta(g) = \times_{i=1}^{p} [\min_{g \in A} m_i(g), \max_{g \in A} m_i(g)]$. Fig. 1 (center-left) depicts the closed interval pattern (hatched rectangle) $c_1 = [1,2] \times [1,3]$ which is the closure of $c_2 = [1,4] \times [0,3]$ (non hatched rectangle). Note that since $\mathcal{G}$ is finite, the set of all closed patterns is finite and is given by $int[\wp(\mathcal{G})]$.

A more concise set of patterns using Relevance theory. Fig. 1 (center-right) depicts two interval patterns, the hatched pattern $d_1 = [1,2] \times [1,3]$ and the non-hatched one $d_2 = [1,4] \times [1,4]$.
While both patterns are closed, $d_1$ has better discriminative power than $d_2$ since they both cover exactly the same positive instances $\{g_1, g_2, g_3\}$; yet, $d_2$ covers more negative instances than $d_1$. Relevance theory [11] formalizes this observation and helps us to remove some clearly uninteresting closed patterns. In a nutshell, a closed pattern $d_1 \in \mathcal{D}$ is said to be more relevant than a closed pattern $d_2 \in \mathcal{D}$ iff $ext^+(d_2) \subseteq ext^+(d_1)$ and $ext^-(d_1) \subseteq ext^-(d_2)$. For $\phi$ discriminant, if $d_1$ is more relevant than $d_2$ then $\phi(d_1) \geq \phi(d_2)$. A closed pattern $d$ is said to be relevant iff there is no other closed pattern $c$ that is more relevant than $d$. It follows that if a closed pattern is relevant then it is closed on the positive (cotp for short). An interval pattern is said to be cotp if any smaller interval pattern drops at least one positive instance (i.e. $d = int(ext^+(d))$). Interestingly, $int \circ ext^+$ is a closure operator on $(\mathcal{D}, \sqsubseteq)$. Fig. 1 (center-right) depicts a non cotp pattern $d_2 = [1,4] \times [1,4]$ and its closure on the positive $d_1 = int(ext^+(d_2)) = [1,2] \times [1,3]$, which is relevant. Note that not all cotp patterns are relevant. The set of cotp patterns is given by $int[\wp(\mathcal{G}^+)]$. We call relevant (resp. cotp) extent any set $A \subseteq \mathcal{G}$ s.t. $A = ext(d)$ where $d$ is a relevant (resp. cotp) pattern. The set of relevant extents is denoted by $\mathcal{R}$.

3 Problem Statement

Correct enumeration of relevant extents. First, consider the (simpler) problem of enumerating all relevant extents in $\mathcal{R}$. For a (relevant extents) enumeration algorithm, three properties generally need to hold. An algorithm whose output is the set of solutions $\mathcal{S}$ is said to be (1) complete if $\mathcal{S} \supseteq \mathcal{R}$, (2) sound if $\mathcal{S} \subseteq \mathcal{R}$ and (3) non redundant if each solution in $\mathcal{S}$ is outputted only once. It is said to be correct if the three properties hold. Guyet et al.
[15] proposed a correct algorithm that enumerates relevant extents induced by the interval pattern structure in two steps: (1) start by a DFS complete and non redundant enumeration of all cotp patterns (extents) using the MinIntChange algorithm [18]; (2) post-process the found cotp patterns by removing non relevant ones using the characterization of [11] (this step adds the soundness property to the algorithm).

Problem Statement. Given a discriminant objective quality measure $\phi$, we want to design an anytime enumeration algorithm that: (1) given enough time, outputs all relevant extents in $\mathcal{R}$; (2) when interrupted, provides a guarantee bounding the difference of quality between the top-quality found extent and the top possible quality w.r.t. $\phi$; and (3) outputs a second guarantee ensuring that the resulting patterns are diverse.

Formally, let $\mathcal{S}_i$ be the set of solutions outputted by the anytime algorithm at some step (or instant) $i$ (at $i+1$ we have $\mathcal{S}_i \subseteq \mathcal{S}_{i+1}$). We want that (1) when $i$ is big enough, $\mathcal{S}_i \supseteq \mathcal{R}$ (only completeness is required). For (2) and (3), we define two metrics⁶ to compare the results in $\mathcal{S}_i$ with the ones in $\mathcal{R}$. The first metric, called accuracy (eq. 1), evaluates the difference between the top pattern quality $\phi$ in $\mathcal{S}_i$ and $\mathcal{R}$, while the second metric, called specificity (eq. 2), evaluates how diverse and complete the patterns in $\mathcal{S}_i$ are.

$$accuracy_\phi(\mathcal{S}_i, \mathcal{R}) = \sup_{A \in \mathcal{R}} \phi(A) - \sup_{B \in \mathcal{S}_i} \phi(B) \tag{1}$$

$$specificity(\mathcal{S}_i, \mathcal{R}) = \sup_{A \in \mathcal{R}} \inf_{B \in \mathcal{S}_i} \left( |A \Delta B| / |\mathcal{G}| \right) \tag{2}$$

The idea behind specificity is that each extent $A$ in $\mathcal{R}$ is "approximated" by the most similar extent in $\mathcal{S}_i$; that is, the set $B \in \mathcal{S}_i$ minimizing the metric distance $A, B \mapsto |A \Delta B| / |\mathcal{G}|$ in $\wp(\mathcal{G})$. The specificity⁷ is then the highest possible distance (pessimistic). Note that $specificity(\mathcal{S}_i, \mathcal{R}) = 0$ is equivalent to $\mathcal{S}_i \supseteq \mathcal{R}$. Clearly, the lower these two metrics are, the closer we get to the desired output $\mathcal{R}$.
While $accuracy_\phi$ and $specificity$ can be evaluated when a complete exploration of $\mathcal{R}$ is possible, our aim is to bound the two aforementioned measures independently from $\mathcal{R}$, providing a guarantee. In other words, the anytime algorithm needs to output, in addition to $\mathcal{S}_i$, the two following bounds: (2) $accuracy_\phi(\mathcal{S}_i)$ and (3) $specificity(\mathcal{S}_i)$ s.t. $accuracy_\phi(\mathcal{S}_i, \mathcal{R}) \leq accuracy_\phi(\mathcal{S}_i)$ and $specificity(\mathcal{S}_i, \mathcal{R}) \leq specificity(\mathcal{S}_i)$. These two bounds need to decrease over time, providing better information on $\mathcal{R}$ through $\mathcal{S}_i$.

⁶ The metric names fall under the taxonomy of [29] for anytime algorithms.
⁷ The specificity is actually a directed Hausdorff distance [17] from $\mathcal{R}$ to $\mathcal{S}_i$.

4 Anytime Interval Pattern Mining

Discretizations and pattern space. Our algorithm relies on the enumeration of a chain of discretizations from the coarsest to the finest. A discretization of $\mathbb{R}$ is any partition of $\mathbb{R}$ using intervals. In particular, let $C = \{c_i\}_{1 \leq i \leq |C|} \subseteq \mathbb{R}$ be a finite set with $c_i < c_{i+1}$ for $i \in \{1, ..., |C|-1\}$. Elements of $C$ are called cut points or cuts. We associate to $C$ a finite discretization denoted by $dr(C)$ and given by $dr(C) = \{(-\infty, c_1)\} \cup \{[c_i, c_{i+1}) \mid i \in \{1, ..., |C|-1\}\} \cup \{[c_{|C|}, +\infty)\}$.

Generally speaking, let $p \in \mathbb{N}^*$ and let $C = (C_k)_{1 \leq k \leq p} \in \wp(\mathbb{R})^p$ represent the sets of cut points associated to each dimension $k$ (i.e. $C_k \subseteq \mathbb{R}$ finite $\forall k \in \{1, ..., p\}$). The partition $dr(C)$ of $\mathbb{R}^p$ is given by: $dr(C) = \prod_{k=1}^{p} dr(C_k)$. Fig. 2 depicts two discretizations. Discretizations are ordered using the natural order between partitions⁸. Moreover, cut-points sets are ordered by $\leq$ as follows: $C^1 \leq C^2 \equiv (\forall k \in \{1, ..., p\})\ C^1_k \subseteq C^2_k$ with $C^i = (C^i_k)_{1 \leq k \leq p}$. Clearly, if $C^1 \leq C^2$ then the discretization $dr(C^1)$ is coarser than $dr(C^2)$.

Let $C = (C_k)_{1 \leq k \leq p}$ be the cut-points. Using the elementary hyper-rectangles (i.e.
cells) in the discretization $dr(C)$, one can build a (finite) subset of descriptions $\mathcal{D}_C \subseteq \mathcal{D}$ which is the set of all possible descriptions (hyper-rectangles) that can be built using these cells. Formally: $\mathcal{D}_C = \{\bigsqcap S \mid S \subseteq dr(C)\}$. Note that $\top = \emptyset \in \mathcal{D}_C$ since $\bigsqcap \emptyset = \bigsqcup \mathcal{D} = \top$ by definition. Proposition 1 states that $(\mathcal{D}_C, \sqsubseteq)$ is a complete sub-lattice of $(\mathcal{D}, \sqsubseteq)$.

Proposition 1. $(\mathcal{D}_C, \sqsubseteq)$ is a finite (complete) sub-lattice of $(\mathcal{D}, \sqsubseteq)$, that is: $\forall d_1, d_2 \in \mathcal{D}_C$: $d_1 \sqcup d_2 \in \mathcal{D}_C$ and $d_1 \sqcap d_2 \in \mathcal{D}_C$. Moreover, if $C^1 \leq C^2$ are two cut-points sets, then $(\mathcal{D}_{C^1}, \sqsubseteq)$ is a (complete) sub-lattice of $(\mathcal{D}_{C^2}, \sqsubseteq)$.

Finest discretization for a complete enumeration of relevant extents. There exist cut points $C \in \wp(\mathbb{R})^p$ such that the space $(\mathcal{D}_C, \sqsubseteq)$ holds all relevant extents (i.e. $ext[\mathcal{D}_C] \supseteq \mathcal{R}$). For instance, if we consider $C = (m_k[\mathcal{G}])_{1 \leq k \leq p}$, the description space $(\mathcal{D}_C, \sqsubseteq)$ holds all relevant extents. However, is there a coarser discretization that holds all the relevant extents? The answer is affirmative. One can show that the only interesting cuts are those separating positive and negative instances (called boundary cut-points by [9]). We call such cuts relevant cuts. They are denoted by $C^{rel} = (C^{rel}_k)_{1 \leq k \leq p}$ and we have $ext[\mathcal{D}_{C^{rel}}] \supseteq \mathcal{R}$. Formally, for each dimension $k$, a value $c \in m_k[\mathcal{G}]$ is a relevant cut in $C^{rel}_k$ for attribute $m_k$ iff: ($c \in m_k[\mathcal{G}^+]$ and $prev(c, m_k[\mathcal{G}]) \in m_k[\mathcal{G}^-]$) or ($c \in m_k[\mathcal{G}^-]$ and $prev(c, m_k[\mathcal{G}]) \in m_k[\mathcal{G}^+]$), where $next(c, A) = \inf\{a \in A \mid c < a\}$ (resp. $prev(c, A) = \sup\{a \in A \mid a < c\}$) is the following (resp. preceding) element of $c$ in $A$. Finding the relevant cuts $C^{rel}_k$ has the same complexity as sorting $m_k[\mathcal{G}]$ [9]. In the dataset depicted in Fig. 1, relevant cuts are given by $C^{rel} = (\{2,3,4,5\}, \{4,5\})$. Discretization $dr(C^{rel})$ is depicted in Fig. 2 (center).
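As a minimal sketch of the relevant-cut condition stated above for a single attribute, the following Python function scans the sorted distinct values once and keeps each value whose predecessor carries the opposite label. The function name `relevant_cuts` and the toy values are illustrative assumptions, not taken from the companion code; the sketch transcribes the condition literally as written and models nothing beyond it.

```python
from typing import Iterable, List


def relevant_cuts(pos_values: Iterable[float], neg_values: Iterable[float]) -> List[float]:
    """Return the values c of one attribute that satisfy the stated condition:
    (c is a positive value and prev(c) is a negative value) or
    (c is a negative value and prev(c) is a positive value),
    where prev(c) is the preceding distinct value of the attribute.
    """
    pos, neg = set(pos_values), set(neg_values)
    values = sorted(pos | neg)  # m_k[G], sorted: this sort dominates the cost
    cuts = []
    # Pair every value with its predecessor in the sorted distinct values.
    for prev, c in zip(values, values[1:]):
        if (c in pos and prev in neg) or (c in neg and prev in pos):
            cuts.append(c)
    return cuts


# Hypothetical attribute values, chosen only for illustration.
example = relevant_cuts(pos_values=[1, 1, 2, 6], neg_values=[2, 3, 4])
```

The single pass after sorting is in line with the remark that finding the cuts costs as much as sorting the attribute values.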
⁸ Let $E$ be a set; a partition $P_2$ of $E$ is finer than a partition $P_1$ (or $P_1$ is coarser than $P_2$), and we denote $P_1 \leq P_2$, if any subset in $P_2$ is a subset of a subset in $P_1$.

Fig. 2: (left) Discretization $dr((C_1, C_2))$ in $\mathbb{R}^2$ with $C_1 = \{2,3\}$ and $C_2 = \{4,5\}$ and (right) discretization $dr((C_2))$ in $\mathbb{R}$. Adding a cut point in any $C_k$ will create a finer discretization.

Anytime enumeration of relevant extents. We design an anytime and interruptible algorithm dubbed RefineAndMine. This method, presented in Algorithm 1, relies on the enumeration of a chain of discretizations on the data space, from the coarsest to the finest. It begins by searching relevant cuts in a pre-processing phase (line 2). Then, it builds a coarse discretization (line 3) containing a small set of relevant cut-points. Once the initial discretization is built, cotp patterns are mined thanks to the MinIntChange algorithm (line 4) [18]. Then, as long as the algorithm is not interrupted (or is within the computational budget), we add new cut-points (line 6), building finer discretizations. For each added cut-point (line 8), only new interval patterns are searched for (mined descriptions $d$ are new but their extents $ext(d)$ are not necessarily new), that is, cotp patterns whose left or right bound is cut on the considered attribute attr (i.e. $d.I_{attr} \in \{[cut, a), [cut, +\infty), [a, cut), (-\infty, cut) \mid a \in C^{cur}_{attr}\}$ with $d.I_{attr}$ the $attr^{th}$ interval of $d$). This can be done by a slight modification of the MinIntChange method. RefineAndMine terminates when the set of relevant cuts is exhausted (i.e. $C^{cur} = C^{rel}$), ensuring a complete enumeration of relevant extents $\mathcal{R}$.

The initial discretization (line 3) can be done by various strategies (see [28]). A simple, yet efficient, choice is the equal frequency discretization with a fixed number of cuts. Other strategies can be used, e.g. [9]. Adding new cut-points (line 6) can also be done in various ways.
One strategy is to add a random relevant cut on a random attribute to build the next discretization. Section 5.3 proposes another, more elaborate strategy that heuristically guides RefineAndMine to rapidly find good quality patterns (observed experimentally).

Algorithm 1: RefineAndMine
Input: $(\mathcal{G}, \mathcal{M})$ a numerical dataset with $\{\mathcal{G}^+, \mathcal{G}^-\}$ partition of $\mathcal{G}$
1 procedure RefineAndMine()
2   Compute relevant cuts $C^{rel}$
3   Build an initial set of cut-points $C^{cur} \leq C^{rel}$
4   Mine cotp patterns in $\mathcal{D}_{C^{cur}}$ (and their extents) using MinIntChange
5   while $C^{cur} \neq C^{rel}$ and within computational budget do
6     Choose the next relevant cut $(attr, cut)$ with $cut \in C^{rel}_{attr} \setminus C^{cur}_{attr}$
7     Add the relevant cut $cut$ to $C^{cur}$
8     Mine new cotp patterns (and their extents) in $\mathcal{D}_{C^{cur}}$

5 Anytime Interval Pattern Mining with Guarantees

Algorithm RefineAndMine starts by mining patterns in a coarse discretization. It continues by mining more patterns in increasingly finer discretizations until the search space is totally explored (the final complete lattice being $(\mathcal{D}_{C^{rel}}, \sqsubseteq)$). According to Proposition 1, the description spaces built on discretizations are complete sub-lattices of the total description space. A similar idea involves performing successive enumerations of growing pattern languages (projections) [6]. In our case, it is a successive enumeration of growing complete sub-lattices. For the sake of generality, in the rest of this section $(\mathcal{D}, \sqsubseteq)$ denotes a complete lattice, and for all $i \in \mathbb{N}^*$, $(\mathcal{D}_i, \sqsubseteq)$ denotes complete sub-lattices of $(\mathcal{D}, \sqsubseteq)$ such that $\mathcal{D}_i \subseteq \mathcal{D}_{i+1} \subseteq \mathcal{D}$. For instance, in RefineAndMine, the total complete lattice is $(\mathcal{D}_{C^{rel}}, \sqsubseteq)$ while the $(\mathcal{D}_i, \sqsubseteq)$ are $(\mathcal{D}_{C^{cur}}, \sqsubseteq)$ at each step. Following Sec.
3 notation, the outputted set $\mathcal{S}_i$ at a step $i$ contains the set of all cotp extents associated to $\mathcal{D}_i$. Before giving the formulas of the bounds $accuracy_\phi(\mathcal{S}_i)$ and $specificity(\mathcal{S}_i)$, we give some necessary definitions and underlying properties. At the end of this section, we show how RefineAndMine can be adapted to efficiently compute these two bounds for the case of interval patterns.

Similarly to the interval pattern structure [18], we define in the general case a pattern structure $\mathbb{P} = (\mathcal{G}, (\mathcal{D}, \sqsubseteq), \delta)$ on the complete lattice $(\mathcal{D}, \sqsubseteq)$ where $\mathcal{G}$ is a non empty finite set (partitioned into $\{\mathcal{G}^+, \mathcal{G}^-\}$) and $\delta : \mathcal{G} \to \mathcal{D}$ is a mapping associating to each object its description (recall that in the interval pattern structure, $\delta(g)$ is the degenerate hyper-rectangle representing a single point). The extent $ext$ and intent $int$ operators are then respectively given by $ext : \mathcal{D} \to \wp(\mathcal{G}), d \mapsto \{g \in \mathcal{G} \mid d \sqsubseteq \delta(g)\}$ and $int : \wp(\mathcal{G}) \to \mathcal{D}, A \mapsto \bigsqcap_{g \in A} \delta(g)$ where $\bigsqcap$ represents the meet operator in $(\mathcal{D}, \sqsubseteq)$ [10].

5.1 Approximating descriptions in a complete sub-lattice

Upper and lower approximations of a pattern. We start by approximating each pattern in $\mathcal{D}$ using two patterns in $\mathcal{D}_i$. Consider for instance Fig. 3 where $\mathcal{D}$ is the space of interval patterns in $\mathbb{R}^2$ while $\mathcal{D}_C$ is the space containing only rectangles that can be built over the discretization $dr(C)$ with $C = (\{1,4,6,8\}, \{1,3,5,6\})$. Since the hatched rectangle $d = [3,7] \times [2,5.5] \in \mathcal{D}$ does not belong to $\mathcal{D}_C$, two descriptions in $\mathcal{D}_C$ can be used to encapsulate it. The first one, depicted by a gray rectangle, is called the upper approximation of $d$. It is given by the smallest rectangle in $\mathcal{D}_C$ enclosing $d$.
Dually, the second approximation, represented as a black rectangle and coined the lower approximation of $d$, is given by the greatest rectangle in $\mathcal{D}_C$ enclosed by $d$. These two denominations come from Rough Set Theory [25], where lower and upper approximations together form a rough set and try to capture the undefined rectangle $d \in \mathcal{D} \setminus \mathcal{D}_C$. Definition 1 formalizes these two approximations in the general case.

Definition 1. The upper approximation mapping $\overline{\psi_i}$ and the lower approximation mapping $\underline{\psi_i}$ are the mappings defined as follows:

$$\overline{\psi_i} : \mathcal{D} \to \mathcal{D}_i,\ d \mapsto \bigsqcup \{c \in \mathcal{D}_i \mid c \sqsubseteq d\} \qquad \underline{\psi_i} : \mathcal{D} \to \mathcal{D}_i,\ d \mapsto \bigsqcap \{c \in \mathcal{D}_i \mid d \sqsubseteq c\}$$

Fig. 3: Description $d = [3,7] \times [2,5.5]$ in $\mathcal{D}$ (hatched) and $C = (\{1,4,6,8\}, \{1,3,5,6\})$. The upper approximation of $d$ in $\mathcal{D}_C$ is $\overline{\psi_C}(d) = [1,8) \times [1,6)$ (gray rectangle) while the lower approximation of $d$ is $\underline{\psi_C}(d) = [4,6) \times [3,5)$ (black rectangle).

The existence of these two mappings is ensured by the fact that $(\mathcal{D}_i, \sqsubseteq)$ is a complete sub-lattice of $(\mathcal{D}, \sqsubseteq)$. Theorem 4.1 in [8] provides more properties for the two aforementioned mappings. Proposition 2 restates an important property.

Proposition 2. $\forall d \in \mathcal{D}$: $\overline{\psi_i}(d) \sqsubseteq d \sqsubseteq \underline{\psi_i}(d)$. The terms lower and upper approximation are reversed here to fit the fact that in terms of extent we have $\forall d \in \mathcal{D}$: $ext(\underline{\psi_i}(d)) \subseteq ext(d) \subseteq ext(\overline{\psi_i}(d))$.

A projected pattern structure. Now that we have the upper-approximation mapping $\overline{\psi_i}$, one can associate a new pattern structure $\mathbb{P}_i = (\mathcal{G}, (\mathcal{D}_i, \sqsubseteq), \overline{\psi_i} \circ \delta)$⁹ to the pattern space $(\mathcal{D}_i, \sqsubseteq)$. It is worth mentioning that, while the extent mapping $ext_i$ associated to $\mathbb{P}_i$ is equal to $ext$, the intent $int_i$ of $\mathbb{P}_i$ is given by $int_i : \wp(\mathcal{G}) \to \mathcal{D}_i, A \mapsto \overline{\psi_i}(int(A))$. Note that the set of cotp patterns associated to $\mathbb{P}_i$ is given by $int_i[\wp(\mathcal{G}^+)] = \overline{\psi_i}[int[\wp(\mathcal{G}^+)]]$.
That is, the upper approximation of a cotp pattern in $\mathbb{P}$ is a cotp pattern in $\mathbb{P}_i$.

⁹ $\mathbb{P}_i$ is said to be a projected pattern structure of $\mathbb{P}$ by the projection $\overline{\psi_i}$ [7].

Encapsulating patterns using their upper-approximations. We want to encapsulate any description by knowing only its upper-approximation. Formally, we want some function $f : \mathcal{D}_i \to \mathcal{D}_i$ such that $(\forall d \in \mathcal{D})\ \overline{\psi_i}(d) \sqsubseteq d \sqsubseteq f(\overline{\psi_i}(d))$. Proposition 3 defines such a function $f$ (called core) and states that the core is the tightest (w.r.t. $\sqsubseteq$) possible function $f$.

Proposition 3. The function $core_i$ defined by:

$$core_i : \mathcal{D}_i \to \mathcal{D}_i,\ c \mapsto core_i(c) = \underline{\psi_i}\Big(\bigsqcup \{d \in \mathcal{D} \mid \overline{\psi_i}(d) = c\}\Big)$$

verifies the following property: $\forall d \in \mathcal{D}$: $\overline{\psi_i}(d) \sqsubseteq d \sqsubseteq \underline{\psi_i}(d) \sqsubseteq core_i(\overline{\psi_i}(d))$. Moreover, for $f : \mathcal{D}_i \to \mathcal{D}_i$, $(\forall d \in \mathcal{D})\ d \sqsubseteq f(\overline{\psi_i}(d)) \Leftrightarrow (\forall c \in \mathcal{D}_i)\ core_i(c) \sqsubseteq f(c)$.

Note that, while the definition of the core operator clearly depends on the complete lattice $(\mathcal{D}, \sqsubseteq)$, its computation should be done independently from $(\mathcal{D}, \sqsubseteq)$.

We show here how to compute the core in RefineAndMine. In each step and for cut-points $C = (C_k) \in \wp(\mathbb{R})^p$, the finite lattice $(\mathcal{D}_C, \sqsubseteq)$ is a sub-lattice of the finest finite lattice $(\mathcal{D}_{C^{rel}}, \sqsubseteq)$ (since $C \leq C^{rel}$). Thereby, the core is computed according to the latter as follows: let $d \in \mathcal{D}_C$ with $d.I_k = [a_k, b_k)$ for all $k \in \{1, ..., p\}$. The left (resp. right) bound of $core_C(d).I_k$ for any $k$ is equal to $next(a_k, C_k)$ (resp. $prev(b_k, C_k)$) if $next(a_k, C^{rel}_k) \notin C_k$ (resp. $prev(b_k, C^{rel}_k) \notin C_k$). Otherwise, it is equal to $a_k$ (resp. $b_k$). Consider the step $C^{cur} = (\{2,3\}, \{4,5\})$ in RefineAndMine (its associated discretization is depicted in Fig. 2 (left)) and recall that the relevant cuts set is $C^{rel} = (\{2,3,4,5\}, \{4,5\})$. The core of the bottom pattern $\bot = \mathbb{R}^2$ at this step is $core_{C^{cur}}(\bot) = (-\infty, 3) \times \mathbb{R}$.
Indeed, there are three descriptions in $\mathcal{D}_{C^{rel}}$ whose upper approximation is $\bot$, namely $\bot$, $c_1 = (-\infty, 4) \times \mathbb{R}$ and $c_2 = (-\infty, 5) \times \mathbb{R}$. Their lower approximations are respectively $\bot$, $(-\infty, 3) \times \mathbb{R}$ and $(-\infty, 3) \times \mathbb{R}$. The join (intersection) of these three descriptions is then $core_{C^{cur}}(\bot) = (-\infty, 3) \times (-\infty, +\infty)$. Note that, particularly for interval patterns, the core is monotone, that is $(\forall c, d \in \mathcal{D}_C)\ c \sqsubseteq d \Rightarrow core_C(c) \sqsubseteq core_C(d)$.

5.2 Bounding accuracy and specificity metrics

At the $i^{th}$ step, the outputted extents $\mathcal{S}_i$ contain the set of cotp extents in $\mathbb{P}_i$. Formally, $int_i[\mathcal{S}_i] \supseteq int_i[\wp(\mathcal{G}^+)]$. Theorem 1 and Theorem 2 give respectively the bounds $accuracy_\phi$ and $specificity$.

Theorem 1. Let $\phi : \mathcal{D} \to \mathbb{R}$ be a discriminant objective quality measure. The accuracy metric is bounded by:

$$accuracy_\phi(\mathcal{S}_i) = \sup_{c \in int_i[\mathcal{S}_i]} \Big[\phi^*\big(tpr(c), fpr(core_i(c))\big) - \phi^*\big(tpr(c), fpr(c)\big)\Big]$$

Moreover $accuracy_\phi(\mathcal{S}_{i+1}) \leq accuracy_\phi(\mathcal{S}_i)$.

Theorem 2. The specificity metric is bounded by:

$$specificity(\mathcal{S}_i) = \sup_{c \in int_i[\mathcal{S}_i]} \Big(\big(|ext(c)| - |ext(core_i^+(c))|\big) / (2 \cdot |\mathcal{G}|)\Big)$$

where $core_i^+(c) = int_i(ext^+(core_i(c)))$, that is, $core_i^+(c)$ is the closure on the positive of $core_i(c)$ in $\mathbb{P}_i$. Moreover $specificity(\mathcal{S}_{i+1}) \leq specificity(\mathcal{S}_i)$.

5.3 Computing and updating bounds in RefineAndMine

We show below how the different steps of the method RefineAndMine (see Algorithm 1) should be updated in order to compute the two bounds $accuracy_\phi$ and $specificity$. For the sake of brevity, we explain here a naive approach to provide an overview of the algorithm. Note that here, $core$ (resp. $core^+$) refers to $core_{C^{cur}}$ (resp. $core^+_{C^{cur}}$).

Compute the initial bounds (line 4). As MinIntChange enumerates all cotp patterns $d \in \mathcal{D}_{C^{cur}}$, RefineAndMine stores in a key-value structure (i.e.
map) called BoundPerPosExt the following entries:

$$ext^+(d) : \Big(\phi(d),\ \phi^*\big(tpr(d), fpr(core(d))\big),\ \big(|ext(d)| - |ext(core^+(d))|\big) / (2 \cdot |\mathcal{G}|)\Big)$$

The error bounds $accuracy_\phi$ and $specificity$ are then computed at the end by a single pass on the entries of BoundPerPosExt using Theorems 1 and 2.
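The single pass over BoundPerPosExt can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the map is modelled as a plain `dict` keyed by a frozen set of object identifiers, each value holds the three stored quantities in the order given above, and the function name `bounds_from_entries` and the toy numbers are hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

# Value layout (assumed, mirroring the entry described in the text):
#   (phi(d), phi*(tpr(d), fpr(core(d))), (|ext(d)| - |ext(core+(d))|) / (2*|G|))
Entry = Tuple[float, float, float]


def bounds_from_entries(bound_per_pos_ext: Dict[FrozenSet[str], Entry]) -> Tuple[float, float]:
    """Return (accuracy bound, specificity bound) from one pass over the map.

    Theorem 1: supremum of the gap between the optimistic quality and the
    actual quality of each stored pattern.
    Theorem 2: supremum of the stored specificity terms.
    An empty map yields 0.0 for both bounds (a convention chosen for this sketch).
    """
    entries = list(bound_per_pos_ext.values())
    accuracy_bound = max((optimistic - quality for quality, optimistic, _ in entries), default=0.0)
    specificity_bound = max((term for _, _, term in entries), default=0.0)
    return accuracy_bound, specificity_bound


# Hypothetical entries, for illustration only.
toy_map = {
    frozenset({"g1", "g2"}): (0.30, 0.45, 0.125),
    frozenset({"g3"}): (0.10, 0.15, 0.25),
}
acc, spec = bounds_from_entries(toy_map)
```

Keying the map by the positive extent means two cotp descriptions with the same positive extent share one entry, which is the role the text assigns to the key `ext+(d)`.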