Efficient Mining of Partial Periodic Patterns in Time Series Database

Jiawei Han*
School of Computing Science, Simon Fraser University
[email protected]

Guozhu Dong†
Department of Computer Science and Engineering, Wright State University
[email protected]

Yiwen Yin
School of Computing Science, Simon Fraser University
[email protected]

* Research was supported in part by research grants from the Natural Sciences and Engineering Research Council of Canada and the Networks of Centres of Excellence Program of Canada.
† Part of this work was done while visiting Simon Fraser University during his sabbatical from University of Melbourne, Australia.

Abstract

Partial periodicity search, i.e., search for partial periodic patterns in time-series databases, is an interesting data mining problem. Previous studies on periodicity search mainly consider finding full periodic patterns, where every point in time contributes (precisely or approximately) to the periodicity. However, partial periodicity is very common in practice since it is more likely that only some of the time episodes may exhibit periodic patterns.

We present several algorithms for efficient mining of partial periodic patterns, by exploring some interesting properties related to partial periodicity, such as the Apriori property and the max-subpattern hit set property, and by shared mining of multiple periods. The max-subpattern hit set property is a vital new property which allows us to derive the counts of all frequent patterns from a relatively small subset of patterns existing in the time series. We show that mining partial periodicity needs only two scans over the time series database, even for mining multiple periods. The performance study shows our proposed methods are very efficient in mining long periodic patterns.

Keywords. Periodicity search, partial periodicity, time-series analysis, data mining algorithms.

1 Introduction

Finding periodic patterns in time series databases is an important data mining task with many applications. Many methods have been developed for searching periodicity patterns in large data sets [8]. However, most previous methods on periodicity search are on mining full periodic patterns, where every point in time contributes (precisely or approximately) to the cyclic behavior of the time series. For example, all the days in the year approximately contribute to the season cycle of the year. A useful related type of periodic patterns, called partial periodic patterns, which specify the behavior of the time series at some but not all points in time, have not received enough attention. An example partial periodic pattern may state that Jim reads the Vancouver Sun newspaper from 7:00 to 7:30 every weekday morning but his activities at other times do not have much regularity. Thus, partial periodicity is a looser kind of periodicity than full periodicity, and it exists ubiquitously in the real world. The purpose of the current paper is to fill the gap by considering the efficient mining of partial periodic patterns.

Most methods for finding full periodic patterns are either inapplicable to or prohibitively expensive for the mining of partial periodic patterns, because of the mixture of periodic events and non-periodic events in the same period. For example, FFT (Fast Fourier Transformation) cannot be applied to mining partial periodicity because it treats the time series as an inseparable flow of values. Some periodicity detection methods can detect some partial periodic patterns, but only if the period, and the length and timing of the segment in the partial patterns with specific behavior, are explicitly specified. For the newspaper reading example, we need to explicitly specify details such as "find the regular activities of Jim during the half-hour after 7:00 for the period of 24 hours." A naive adaptation of such methods to our partial periodic pattern mining problem would be prohibitively expensive, requiring their application to a huge number of possible combinations of the three parameters of length, timing, and period.

Besides full periodicity search, there are many recent studies on time series data mining: most concentrate on symbolic patterns, although some consider numerical curve patterns in time series. Agrawal and Srikant [3] developed an Apriori-like technique [2] for mining sequential patterns. Mannila et al. [10] consider frequent episodes in sequences, where episodes are essentially acyclic graphs of events whose edges specify the temporal before-and-after relationship but without timing-interval restrictions. Inter-transaction association rules proposed by Lu et al. [9] are implication rules whose two sides are totally-ordered episodes with timing-interval restrictions (on the events in the episodes and on the two sides). Bettini et al. [5] consider a generalization of inter-transaction association rules: these are essentially rules whose left-hand and right-hand sides are episodes with time-interval restrictions. However, unlike ours, periodicity is not considered in these studies.

Similar to our problem, the mining of cyclic association rules by Özden et al. [12] also considers the mining of some patterns of a range of possible periods.¹ Observe that cyclic association rules are partial periodic patterns with perfect periodicity in the sense that each pattern reoccurs in every cycle, with 100% confidence. The perfectness in periodicity leads to a key idea used in designing efficient cyclic association rule mining algorithms: as soon as it is known that an association rule $R$ does not hold at a particular instant of time, we can infer that $R$ cannot have periods which include this time instant. For example, if the maximum period of interest is $l_{\max}$ and it is discovered that $R$ does not hold in the first $l_{\max}$ time instants, then $R$ cannot have any periods. This idea leads to the useful "cycle-elimination" strategy explored in that paper. Since real life patterns are usually imperfect, our goal is not to mine perfect periodicity and thus "cycle-elimination" based optimization will not be considered here.²

¹ It is important to point out that [12] concentrates on the elimination of candidate itemsets for the association rule mining algorithm, although the cycle-elimination strategy does lead to a small reduction on the number of patterns when we process the time series from left to right.
² Note that a modified strategy, where we stop considering certain patterns as soon as the length of the time series to be processed is not enough to make the confidence higher than the threshold, can be used.

An Apriori-like algorithm has been proposed for mining imperfect partial periodic patterns with a given (single) period in a recent study by two of the current authors [7]. It is an interesting algorithm for mining imperfect partial periodicity. However, with a detailed examination of the data characteristics of partial periodicity, we found that Apriori pruning in mining partial periodicity may not be as effective as in mining association rules.

Our study has revealed the following new characteristics of partial periodic patterns in time series: the Apriori-like property among partial periodic patterns still holds for any fixed period, but it does not hold for patterns between different periods. Furthermore, there is a strong correlation among frequencies of partial patterns.

The main contributions of this paper are as follows. We consider the efficient mining of partial periodic patterns, for a single period as well as for a set of periods. We propose several mining algorithms, by exploring some interesting properties related to partial periodicity such as the Apriori property and the max-subpattern hit set property, and by shared mining of multiple periods. The max-subpattern hit set property is a vital new property which allows us to derive the counts of all frequent patterns from a relatively small subset of patterns mined from the time series. We show that mining partial periodicity needs only two scans over the time series database, even for mining multiple periods. The performance study shows our proposed methods are very efficient. The proposed methods are also robust and can be applied in a variety of cases, including mining multiple-level partial periodicity and mining partial periodicity with perturbation and evolution.

The remainder of the paper is organized as follows. In Section 2, concepts related to partial periodicity are introduced. In Section 3, methods for mining partial periodicity in regard to both single and multiple periods are studied. In Section 4, the implementation of a novel data structure, namely the max-subpattern tree, for facilitating the counting of the hit maximal patterns, and the derivation of the set of frequent patterns from the hit maximal patterns, are presented. In Section 5, a comparison of the performance of the proposed algorithms is reported. We conclude our study in Section 6.

2 Problem Definition

Assume that a sequence of $n$ timestamped datasets have been collected in a database. For each time instant $i$, let $D_i$ be a set of features derived from the dataset collected at the instant. Thus, the time series of features is represented as

\[ S = D_1, D_2, \ldots, D_n. \]

Let $L$ be the underlying set of features. We will also use the "don't care" character *, which can match any single set of features. We define a pattern $s = s_1 \cdots s_p$ as a non-empty sequence over $(2^L - \{\emptyset\}) \cup \{*\}$.³ We will use $|s|$ to denote the length of $s$, and will say that $|s|$ is the period of the pattern $s$. Let the $L$-length of $s = s_1 \cdots s_p$ be the number of $s_i$ which contain letters from $L$. A pattern with $L$-length $i$ is also called an $i$-pattern. Moreover, a subpattern of a pattern $s = s_1 \cdots s_p$ is a pattern $s' = s'_1 \cdots s'_p$ such that $s$ and $s'$ have the same length, and $s'_i \subseteq s_i$ for every position $i$ where $s'_i \neq *$. For example, a pattern of length 5 in which four positions hold letters from $L$ and one position is * has $L$-length 4 (i.e., it is a 4-pattern), and replacing any of its non-* positions by * or by a non-empty subset of its letters yields one of its subpatterns.

³ If $s_i$ is a singleton we will omit the brackets, e.g., we write {a} as a.

The frequency count and confidence of a pattern $s$ in a time series $S = D_1, \ldots, D_n$ are defined as

\[ \mathit{frequency\_count}(s) = \bigl|\{\, i \mid 0 \le i < m, \text{ and the string } s \text{ is true in } D_{i|s|+1} \cdots D_{i|s|+|s|} \,\}\bigr|, \]

and

\[ \mathit{conf}(s) = \frac{\mathit{frequency\_count}(s)}{m}, \]

where $m$ is the maximum number of periods of length $|s|$ contained in the time series (i.e., $m$ is the positive integer such that $m|s| \le n < (m+1)|s|$). Each segment of the form $D_{i|s|+1} \cdots D_{i|s|+|s|}$, where $0 \le i < m$, is called a period segment. We say a pattern $s$ is true in the period segment, or the period segment matches $s$, if, for each position $j$, either $s_j$ is * or all the letters in $s_j$ occur in the set of features in the $j$-th position of the segment. Thus, if $s'$ is a subpattern of $s$, then the set of sequences that can match $s$ is a subset of the set of sequences that can match $s'$.

Example 2.1 For example, a*b is a pattern of period 3; its frequency count in the feature series a{b,c}baebaced is 2; and its confidence is 2/3, where 3 is the maximum number of periods of length 3. The frequency count of a{b,c}* in a{b,c,d}ea{b,c}aabc is also 2.

Similar to mining association rules [2], we say that a pattern is a frequent partial periodic pattern in a time series if its confidence is larger than or equal to a threshold, $\mathit{min\_conf}$. The mining of frequent partial periodic patterns in a time series is to discover, possibly with some restrictions, all the frequent patterns of the series for one period or a range of specified periods. More specifically, the input to mining includes:

• A time series $S$.
• A specified period; or a range of periods specified by two integers $\mathit{low}$ and $\mathit{high}$.
• An integer $m$ indicating that the ratio of the lengths of $S$ and the patterns must be at least $m$. This will ensure that the patterns mined would be of value to the application at hand.

Remark: Sometimes the derivation of the feature series from the original data series is quite involved, and the interaction of the periodic patterns with the derivation of features may lead to improved performance. Hence it is worthwhile to combine the mining of the features from the datasets with the mining of the patterns, as is the case for the mining of cyclic association rules [12].

3 Methods for mining partial periodicity in time series

In this section, we explore methods for mining partial periodicity in a time series, proceeding from mining partial periodicity for a single given period to mining partial periodicity for a specified range of periods (i.e., multiple periods).

3.1 Mining partial periodicity for single period

3.1.1 Single-period apriori method

A popular key idea used in the efficient mining of association rules is the Apriori property discovered in [2]: if one subset of an itemset is not frequent, then the itemset itself cannot be frequent. This allows us to use frequent itemsets of size $i$ as filters for candidate itemsets of size $i+1$.

Interestingly, for each period $p$, the property supporting the Apriori "trick" still holds:

Property 3.1 [Apriori on periodicity] Each subpattern of a frequent pattern of period $p$ is itself a frequent pattern of period $p$.

The proof is based on the fact that patterns are more restrictive than their subpatterns. Suppose $s'$ is a subpattern of a frequent pattern $s$. Then $s'$ is obtained from $s$ by changing some set of letters to a subset or *. Hence $s$ is more restrictive than $s'$ and thus the frequency count of $s'$ is greater than or equal to that of $s$. Thus $s'$ is frequent as well.

An algorithm for mining partial periodic patterns for a given fixed period based on this Apriori "trick" was presented in [7]. We include a simplified version here for the sake of completeness.

Algorithm 3.1 [Single-period Apriori] Find all partial periodic patterns for a given period $p$ satisfying a given confidence threshold $\mathit{min\_conf}$ in time series $S$, based on the Apriori property 3.1.

Method.

1. Find $F_1$, the set of frequent 1-patterns of period $p$, by accumulating the frequency count for each 1-pattern in each whole period segment and selecting among them those whose frequency count is no less than $\mathit{min\_conf} \times m$, where $m$ is the maximum number of periods.

2. Find all frequent $i$-patterns of period $p$, for $i$ from 2 up to $p$, based on the idea of Apriori, and terminate immediately when the candidate frequent $i$-pattern set is empty.
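The two steps of Algorithm 3.1 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of [7]: the function names (`period_segments`, `is_true_in`, `frequency_count`, `single_period_apriori`) are hypothetical, a pattern is represented as a frozenset of (position, feature) letters with every unlisted position read as *, and candidates grow by one letter per level rather than by $L$-length, which is a simplification that still relies only on Property 3.1.

```python
from itertools import combinations


def period_segments(series, p):
    """Cut the feature series into its m = len(series) // p whole period segments."""
    m = len(series) // p
    return [series[i * p:(i + 1) * p] for i in range(m)]


def is_true_in(pattern, segment):
    """Assumed encoding: a pattern is a frozenset of (position, feature) letters;
    positions that are not mentioned behave like the don't-care character '*'."""
    return all(feature in segment[pos] for pos, feature in pattern)


def frequency_count(pattern, series, p):
    """Number of whole period segments in which the pattern is true."""
    return sum(is_true_in(pattern, seg) for seg in period_segments(series, p))


def single_period_apriori(series, p, min_conf):
    """Level-wise search; returns {pattern: frequency count} for frequent patterns."""
    segs = period_segments(series, p)
    threshold = min_conf * len(segs)

    # Step 1: one pass over the segments to count every single letter.
    counts = {}
    for seg in segs:
        for pos, feats in enumerate(seg):
            for f in feats:
                key = frozenset([(pos, f)])
                counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= threshold}
    result = dict(frequent)

    # Step 2: grow candidates by one letter per level, one pass per level.
    size = 1
    while frequent:
        size += 1
        candidates = set()
        for s, t in combinations(frequent, 2):
            u = s | t
            # Apriori filter: every subpattern with one letter fewer must be frequent.
            if len(u) == size and all(
                frozenset(sub) in frequent for sub in combinations(u, size - 1)
            ):
                candidates.add(u)
        counts = dict.fromkeys(candidates, 0)
        for seg in segs:
            for cand in candidates:
                if is_true_in(cand, seg):
                    counts[cand] += 1
        frequent = {s: c for s, c in counts.items() if c >= threshold}
        result.update(frequent)
    return result
```

With the feature series a{b,c}baebaced of Example 2.1 encoded as a list of Python sets and $p = 3$, `frequency_count` gives 2 for the pattern a*b, in agreement with the example.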
For our work on the mining of frequent partial periodic patterns, though, this interaction between feature derivation and pattern mining is not useful for achieving computational advantage, and thus we will assume that we are dealing with the feature time series in our study.

Analysis.

Number of scans over the time series. Step 1 of the algorithm needs to scan the time series $S$ once. Step 2 needs to scan $S$ up to $p - 1$ times in the worst case. Thus the total number of scans is no more than the period $p$.

Space needed. (1) At Step 1, suppose there exist a total of $f_i$ distinct features at positions $i, p+i, \ldots, (m-1)p+i$ in $S$, where $m$ is the number such that $mp \le |S| < (m+1)p$. We need $\sum_{i=1}^{p} f_i$ units of space to hold the counts.⁴ In the worst case, when every feature is distinct in the entire time series $S$, we need $\sum_{i=1}^{|S|} |D_i|$ units of space.⁵ After Step 1, we only need $|F_1|$ units of space to keep $F_1$, the set of frequent 1-patterns in $S$. (2) At Step 2, the maximum number of candidate subpatterns that we may generate is

\[ \binom{|F_1|}{2} + \binom{|F_1|}{3} + \cdots + \binom{|F_1|}{|F_1|} = 2^{|F_1|} - |F_1| - 1. \]

Considering that we still need space to keep the set of frequent 1-patterns, the total amount of space needed is $2^{|F_1|} - 1$ in the worst case in this computation. However, the average case should be much smaller than the worst case, since if every feature is distinct in the time series, then there is no need to find periodic patterns. The existence of any periodicity in the time series will reduce the memory needed.

⁴ The unit of space is the space needed to hold the feature identifier and its associated count, and its size is usually 2-8 bytes, depending on the implementation.
⁵ This is equal to the total space that the time series occupies.

3.1.2 Single-period max-subpattern hit set method

Although the Apriori trick may reduce the search space in partial periodicity mining in a similar way as association rule mining, it is important to note that the data characteristics in the two cases are very different. In mining association rules, the number of frequent $i$-itemsets shrinks quickly as $i$ increases because of the sparsity of frequent $i$-itemsets in a large transaction database. However, in mining partial periodicity, very often the number of frequent $i$-patterns shrinks slowly (when $i > 1$) as $i$ increases. The slow speed of decrease in the number of frequent $i$-patterns is due to a strong correlation between frequencies of patterns and their subpatterns. We now illustrate this point.

Example 3.1 Suppose we have two frequent 1-patterns, a* and *b, such that conf(a*) = 0.9 and conf(*b) = 0.9, in a time series $S$. Then it must be the case that 0.8 ≤ conf(ab) ≤ 0.9, as explained below. Since all period segments that match ab match both a* and *b, conf(ab) ≤ 0.9 holds. To derive the other inequality, let $\bar{a}$ denote the predicate that a letter is not a, and similarly $\bar{b}$. The confidence of $\bar{a}$* in $S$ is at most 0.1, because conf($\bar{a}$*) = 1 − conf(a*). Similarly, conf(*$\bar{b}$) ≤ 0.1. Since conf(ab) ≥ 1 − conf($\bar{a}$*) − conf(*$\bar{b}$), it follows that conf(ab) ≥ 0.8.

The slow reduction of the set of candidate frequent $i$-patterns as $i$ grows makes the Apriori pruning of Algorithm 3.1 less attractive. Is there a better way?

Obviously, the derivation of frequent 1-patterns is still an effective way to dramatically reduce the candidate set to be examined later, because there are usually only a small number of features being frequent at a particular position but there could be a large number of features appearing in the position. This is especially true when the average number of features per position is larger than $1/\mathit{min\_conf}$. Thus our discussion will be focused on how to reduce the search effort after the set of frequent 1-patterns, $F_1$, is found.

Our key idea is based on the notions of max-patterns and hit patterns, defined next.

A candidate (frequent) max-pattern, $C_{\max}$, is the maximal pattern which can be generated from $F_1$, the set of frequent 1-patterns. For example, if the frequent 1-pattern set is {a****, *b***, **c**, ***d*}, the candidate max-pattern is abcd*. Notice that a position in the candidate max-pattern may be allowed to have a disjunction of more than one non-* letter. For example, if the frequent 1-pattern set is {a****, *b1***, *b2***, **c**, ***d*}, the candidate max-pattern is a{b1,b2}cd*.

Let the $L$-length of the candidate max-pattern, $C_{\max}$, be $|C_{\max}|$. A subpattern of $C_{\max}$ is hit in a period segment $S_i$ of $S$ if it is the maximal subpattern of $C_{\max}$ in $S_i$. For example, for $C_{\max}$ = a{b1,b2}cd*, the hit subpattern for a period segment $S_i$ that contains a in its first position and both b1 and b2 in its second position, but neither c in its third position nor d in its fourth, is a{b1,b2}***, because it is true in $S_i$ and none of its superpatterns, a{b1,b2}c**, a{b1,b2}*d*, and a{b1,b2}cd*, is true in $S_i$. The hit set, $H$, of a time series $S$ is the set of all hit subpatterns of $C_{\max}$ in $S$.

The usefulness of hit max-patterns is: we can derive the complete set of partial periodic patterns from the frequency counts of all the hit maximal subpatterns of $C_{\max}$. This will be detailed below.

We would like to give an estimate of the buffer size needed in computation based on the idea of hit patterns. One upper bound of the buffer size is estimated in terms of $m$, the total number of periods in $S$. $|H|$, the size of the hit set in a time series $S$, should be no bigger than $m$, i.e., $|H| \le m$. This is obvious since each period segment can generate at most one hit subpattern, and a hit subpattern may be hit in more than one period segment. The other upper bound of the buffer size is estimated in terms of the maximal number of patterns that can be generated from $F_1$, the set of frequent 1-patterns. Since each hit pattern of $S$ is a subpattern of $C_{\max}$, which is generated from $F_1$, similar to the analysis performed in Algorithm 3.1, the size of the set of subpatterns which can be generated from $F_1$ is

\[ \binom{|F_1|}{1} + \binom{|F_1|}{2} + \cdots + \binom{|F_1|}{|F_1|} = 2^{|F_1|} - 1. \]

Therefore, $|H|$, the size of the hit set in a time series $S$, should be no bigger than $2^{|F_1|} - 1$. Combining both upper bounds, we have

Property 3.2 [The bound of hit set] The size of the hit set is bounded by the formula $|H| \le \min\{m,\; 2^{|F_1|} - 1\}$, where $m$ is the total number of periods in $S$, and $F_1$ is the set of frequent 1-patterns.

Using this formula, we can calculate the bound of the maximal buffer size needed in the processing: given the set of frequent 1-patterns, $F_1$, the maximal (additional) buffer size needed for registering the counts of all the maximal subpatterns of $C_{\max}$ is $\min\{m,\; 2^{|F_1|} - |F_1| - 1\}$.

This property is very useful in practice. For example, if we found 500 frequent 1-patterns when calculating yearly periodic patterns for 100 years, the buffer size needed is at most 100; on the other hand, if we found 8 frequent 1-patterns for calculating weekly periodic patterns for 100 years, the buffer size needed is at most $2^8 - 8 - 1 = 247$. We can always select the smaller one in estimating the maximal buffer size needed in computation.

Before turning to our hit-set based algorithm, we examine the probability distributions of maximal subpatterns of $C_{\max}$.

Heuristic 3.1 [Popularity of longer subpatterns] The probability distribution of the maximal subpatterns of $C_{\max}$ is usually denser for longer subpatterns (i.e., with the $L$-length closer to $|C_{\max}|$) than the shorter ones.

This heuristic can be observed in Example 3.1. From the example, we have 0.8 ≤ prob{max-subpattern of ab is ab} ≤ 0.9, but 0 ≤ prob{max-subpattern of ab is a*} ≤ 0.1. In most cases, the existence of a short max-subpattern indicates the nonexistence of some non-* letter, which reduces the chance for the corresponding non-* letter patterns to reach high confidence. Thus we have the heuristic.

This heuristic implies that the number of nodes in the tree data structure of the next section is usually small. It is also useful for efficient buffer management: in order to reduce the overall cost of access, the longer subpatterns should be arranged to be more easily accessible (such as put in main memory) than the shorter ones.

We now present a main algorithm for mining partial periodic patterns for a given period, which is based on the discussions above.

Algorithm 3.2 [Max-subpattern hit-set] Find all the partial periodic patterns for a given period $p$ in a time series $S$, based on the max-subpattern hit-set, for a given $\mathit{min\_conf}$ threshold.

Method.

1. Scan $S$ once to find $F_1$, the set of frequent 1-patterns of period $p$, using Step 1 of Algorithm 3.1. Form the candidate max-pattern, $C_{\max}$, from $F_1$.

2. Scan $S$ once. During the scan, for each period segment, if its hit set is nonempty, do the following: add the max-subpattern into the hit set buffer (with the associated count initialized to 1) if it is not already there; otherwise, increase the count of the max-subpattern by one. The hit set buffer is implemented in the form of a max-subpattern tree, a novel data structure, to be discussed in Section 4.

3. After the scan, derive the frequent patterns from the hit set. We will discuss how to implement the finding of the counts of the hit patterns and how to use these counts to derive the frequent patterns in Section 4. It turns out that both can be done efficiently.

Analysis.

Number of scans over the time series. The first step of the algorithm needs to scan $S$ once. The second step needs to scan $S$ one more time. Thus the total number of time-series scans is 2, independent of the period $p$.

Space needed. (1) The space needed for Step 1 is the same as in Algorithm 3.1. After Step 1, we need $|F_1|$ units of space to keep $F_1$, the set of frequent 1-patterns in $S$. (2) At the second step, suppose there are $|F_1|$ frequent 1-patterns in $S$. According to Property 3.2, the total space needed for the hit set is at most $\min\{m,\; 2^{|F_1|} - 1\}$, where $m$ is the total number of periods in $S$.

In comparison with Algorithm 3.1, Algorithm 3.2 reduces the total number of scans of the time series from $p$ (the length of the period) to 2, and it also uses much less buffer space in the computation in most cases. This can also be seen from the following observation: suppose the hit subpattern for a period segment is abcd, which is not in the hit set yet. We need only one unit of space to register the string and its count 1. However, for the Apriori technique, the candidate 2-patterns to be generated will be {ab**, a*c*, a**d, *bc*, *b*d, **cd}, the 3-patterns to be generated will be {abc*, ab*d, a*cd, *bcd}, and the 4-patterns will be {abcd}, plus we have to update the count associated with each of them. Thus, it is expected that the max-subpattern hit set method may have better performance in most cases. We will compare the performance of the two algorithms in Section 5.

3.2 Mining partial periodicity with multiple periods

Mining partial periodicity for a given period covers a good set of applications since people often like to mine periodic patterns for natural periods, such as annually, quarterly, monthly, weekly, daily, or hourly. However, certain patterns may appear at some unexpected periods, such as every 11 years, or every 14 hours. It is interesting to provide facilities to mine periodicity for a range of periods.
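As a point of reference for the multi-period variants, the two scans of the single-period max-subpattern hit-set method (Algorithm 3.2) can be sketched as below. This is a hedged illustration rather than the authors' implementation: the function name `max_subpattern_hit_set` is hypothetical, patterns are again encoded as frozensets of (position, feature) letters, the hit set buffer is a plain `Counter` instead of the max-subpattern tree of Section 4, and the final derivation assumes that the count of a subpattern of $C_{\max}$ is the sum of the counts of the hit patterns that are its superpatterns, which follows from the definition of a hit.

```python
from collections import Counter


def max_subpattern_hit_set(series, p, min_conf):
    """Return (frequent patterns with counts, hit set with counts) for period p."""
    m = len(series) // p
    segs = [series[i * p:(i + 1) * p] for i in range(m)]
    threshold = min_conf * m

    # Scan 1: count single letters and form the candidate max-pattern C_max.
    letter_counts = Counter(
        (pos, f) for seg in segs for pos, feats in enumerate(seg) for f in feats
    )
    c_max = frozenset(l for l, c in letter_counts.items() if c >= threshold)

    # Scan 2: the hit of a segment is the maximal subpattern of C_max true in it.
    hits = Counter()
    for seg in segs:
        hit = frozenset((pos, f) for pos, f in c_max if f in seg[pos])
        if hit:
            hits[hit] += 1

    def count(pattern):
        # A subpattern of C_max is true exactly in the segments whose hit contains it.
        return sum(n for hit, n in hits.items() if pattern <= hit)

    # Derivation: level-wise over subpatterns of C_max, without rescanning the series.
    frequent = {}
    level = {frozenset([l]) for l in c_max}
    while level:
        kept = {s: count(s) for s in level}
        kept = {s: n for s, n in kept.items() if n >= threshold}
        frequent.update(kept)
        level = {s | t for s in kept for t in kept if len(s | t) == len(s) + 1}
    return frequent, hits
```

On small inputs the number of distinct hits can be compared directly with the bound $\min\{m,\; 2^{|F_1|} - 1\}$ of Property 3.2; using a dictionary keyed by the hit pattern keeps the sketch short at the cost of a linear pass over the hit set for each derived count.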
To extend partial periodicity mining from one period to multiple periods, one might wish to extend the idea of Apriori to computing partial periodicity among different periods, that is, to use the patterns of small periods as filters for candidate patterns of periods of the form $kp$ for an integer $k > 1$. This will work if all frequent patterns of period $kp$ are frequent patterns of period $p$. Unfortunately, this is not the case. For example, for the time series [...], suppose min_conf = [...]. If we use the frequent patterns of period [...] as candidate partial periodic patterns of period 4, we will miss the partial periodic pattern **cd.

Given that we cannot extend the Apriori "trick" to multiple periods, one obvious way to mine partial periodic patterns for a range of periods is to repeatedly apply the single-period algorithm for each period in the range.

Algorithm 3.3 [Looping over single period computation] Find all the partial periodic patterns for a set of periods in a given range of interest, $p_1, \ldots, p_k$, in the time series $S$, with the given min_conf threshold.

Method.

1. For each period $p_j$ in the range of interest (i.e., $p_1, \ldots, p_k$), apply Algorithm 3.2 ("max-subpattern hit-set") on period $p_j$.

Analysis.

Number of scans over the time series. Since each period takes 2 scans of the time series, the total number of scans of the time series is $2 \times k$.

Space needed. For computing partial periodicity for periods from $p_1$ to $p_k$, the space required is basically the sum of the space for each $p_j$. Notice that the space required for the initial Step 1 computation is still that of a single period in the worst case, since the space once used in the computation for period $p_j$ can be reinitialized and reused for computing other periods. But we need in total $\sum_{j=1}^{k} |F_1(p_j)|$ units of space to keep the different sets of frequent 1-patterns, where $F_1(p_j)$ is the set of frequent 1-patterns in $S$ derived for period $p_j$. Similarly, it takes at most $\sum_{j=1}^{k} \min\{m_j, 2^{|F_1(p_j)|} - 1\}$ units of space to compute all the hit sets, where $m_j$ is the total number of periods of length $p_j$ in $S$.

Algorithm 3.3 provides an iterative method for mining partial periodicity for multiple periods. However, when the number of periods is large, we still need a good number of scans to mine periodicity for multiple periods. An improvement to the above method is to maximally explore the mining of periodicity for multiple periods in the same scan, which leads to the shared mining of periodicity for multiple periods, as illustrated below.

Algorithm 3.4 [Shared mining of multiple periods] Shared mining of all the partial periodic patterns for a set of periods in a given range of interest, $p_1, \ldots, p_k$, in the time series $S$, with the given min_conf threshold.

Method.

1. Scan $S$ once; for all periods $p_j$ in the range of interest, do the same as Step 1 in Algorithm 3.2. That is, for all periods $p_j$ ($j = 1, \ldots, k$), find the set of frequent 1-patterns of period $p_j$, $F_1(p_j)$, using the same Step 1 as in Algorithm 3.1. For each set of frequent 1-patterns of period $p_j$, form the candidate max-pattern, $C_{max}(p_j)$, from $F_1(p_j)$.

2. Scan $S$ once; for all periods $p_j$ in the range of interest, do the same as Step 2 in Algorithm 3.2. This is a similar process, which will not be explained in detail.

Analysis.

Number of scans over the time series. The first step of the algorithm needs to scan $S$ once. The second step needs to scan $S$ one more time. Thus the total number of time-series scans is 2, independent of the period $p$.

Space needed. The total space required in the worst case is the same as in Algorithm 3.3.

Algorithm 3.4 explores shared processing in mining partial periodicity for multiple periods. The advantage of the method is that we only need two scans of the time series for mining partial periodicity for multiple periods. The overhead of the method is that, although it reduces the number of scans to 2, it requires more space in the processing of each scan than the multiple-scan method, because it needs to register the corresponding counts for each period $p_j$ (for $1 \le j \le k$). However, since the shared features will share the space as well (with counts incremented), and there should be many shared features in periodicity search (otherwise, why mine periodicity?), the space required will hardly approach the worst case. Therefore, it should still be an efficient method in many cases for mining partial periodicity with multiple periods.

4 Derivation of all partial patterns

In this section, we examine the implementation considerations of our proposed algorithms. Algorithm 3.1 is an Apriori-like algorithm which can be implemented similarly to other Apriori-like algorithms for mining association rules (e.g., [2]). Algorithm 3.2 forms the basis for all the three remaining algorithms and requires new tricks to achieve efficiency, and thus our discussion is focused on its efficient implementation.
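As one implementation consideration for the shared mining of Algorithm 3.4, its first scan can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the time series is taken to be a list of feature sets, a 1-pattern is represented as an (offset, feature) pair, a pattern counts as frequent when its count is at least min_conf times the number of whole period segments, and the names `shared_first_scan` and `candidate_max_pattern` are hypothetical.

```python
from collections import Counter


def shared_first_scan(series, periods, min_conf):
    """One pass over `series` (a list of feature sets) for all periods at once.

    Returns {p: set of (offset, feature)}: the frequent 1-patterns of period p.
    Assumption: confidence = count / number of whole segments of length p.
    """
    n = len(series)
    segments = {p: n // p for p in periods}      # m_j for each period p_j
    counts = {p: Counter() for p in periods}
    for i, features in enumerate(series):        # the single scan of S
        for p in periods:
            if i < segments[p] * p:              # skip an incomplete last segment
                for f in features:
                    counts[p][(i % p, f)] += 1
    return {
        p: {key for key, c in counts[p].items() if c >= min_conf * segments[p]}
        for p in periods
    }


def candidate_max_pattern(f1, p):
    """Form C_max for period p: the set of frequent features at each offset."""
    cmax = [set() for _ in range(p)]
    for offset, feature in f1:
        cmax[offset].add(feature)
    return cmax


# Toy series chosen for this sketch (not taken from the paper).
series = [{ch} for ch in "abcdabceabcdabce"]
f1 = shared_first_scan(series, periods=[2, 4], min_conf=0.75)
```

On this toy series no 1-pattern of period 2 reaches the threshold, while a, b and c are frequent at offsets 0, 1 and 2 of period 4, which also illustrates why patterns of a small period cannot serve as filters for a larger one.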
Algorithm 3.2 consists of two steps: Step 1, scan the time series once and find the frequent 1-pattern set $F_1$; and Step 2, scan the time series one more time, collect the set of the max-subpatterns hit in $S$, and derive the set of frequent patterns. The implementation of Step 1 is straightforward and has been discussed in the presentation of Algorithm 3.1. However, Step 2 is nontrivial and needs a good data structure to facilitate the storage of the set of max-subpatterns hit in $S$ and the derivation of the set of frequent patterns.

A new data structure, called the max-subpattern tree, is designed to facilitate the registration of the hit count of each max-subpattern and the derivation of the set of frequent patterns, as illustrated in Figure 1. Its design is now outlined.

  a{b1,b2}*d* (10)
    ~a:  *{b1,b2}*d* (0)
      ~b1: *b2*d* (8)
      ~b2: *b1*d* (18)
      ~d:  *{b1,b2}*** (5)
    ~b1: ab2*d* (50)
      ~b2: a**d* (19)
      ~d:  ab2*** (0)
    ~b2: ab1*d* (40)
      ~d:  ab1*** (2)
    ~d:  a{b1,b2}*** (32)

Figure 1. A max-subpattern tree to store the set of max-subpatterns hit in the time series. (Each node is listed under its linked parent with its hit count in parentheses; the link label gives the missing letter. Dashed lines in the original figure mark links from not-linked parents, e.g., from ab1*d* to a**d*.)

The max-subpattern tree takes the candidate max-pattern $C_{max}$ as the root node, where each subpattern of $C_{max}$ with one non-* letter missing is a direct child node of the root. The tree expands recursively, according to the following rules. A node $w$, if containing more than 2 non-* letters, may have a set of children, each of which is a subpattern of $w$ with one more non-* letter missing. Notice that a node containing only 2 non-* letters will not have any children, since every frequent 1-pattern is already in $F_1$. Importantly, we do not create a node if neither the node nor its descendant(s) containing more than 1 non-* letter is hit in $S$.¹

¹ We show such a node, [...], using a dotted box in Figure 1.

Each node has a "count" field (which registers the number of hits of the current node), a parent link (which is nil for the root), and a set of child links; each child link points to a child and is associated with a corresponding missing letter. A link can be nil when the corresponding child has not been hit.

Notice that a non-* letter position of a max-subpattern in a max-subpattern tree may contain a set of letters, which matches the set of letters at the position in a period segment. For example, for $C_{max}$ = a{b1,b2}*d*, the max-subpattern of $C_{max}$ and the period segment a{b1,b2}{c1,c2}de is a{b1,b2}*d*, and the segment will contribute one count to this node.

The update of the max-subpattern tree is performed as follows.

Algorithm 4.1 [Insertion in the max-subpattern tree] Insert a max-subpattern $w$ found during the scan of $S$ into the max-subpattern tree $T$.

Method.

1. Starting from the root of the tree, find the corresponding node by checking the missing non-* letters in order.
For example, for a max-pattern node *b1*d* in a tree with the root $C_{max}$ = a{b1,b2}*d*, there are two letters, a and b2, missing. The node can be found by (1) following the ~a link (marked as "~a" in Figure 1) to *{b1,b2}*d*, and then (2) following the ~b2 link to *b1*d*, as shown in Figure 1.

2. If the node $w$ is found, increase its count by 1. Otherwise, create a new node $w$ (with count 1) and its missing ancestor nodes (only those on the path to $w$, with count 0), if any, and insert it (or them) into the corresponding place(s) of the tree.
For example, if the very first max-subpattern node found in $S$ is *b1*d* for $C_{max}$ = a{b1,b2}*d*, we will create the node *b1*d* (with count 1), after creating two ancestor nodes (with count 0): $w_1$ = a{b1,b2}*d* (which is the root of the tree), and $w_2$ = *{b1,b2}*d* (which is $w_1$'s child, following the ~a link). The node *b1*d* is $w_2$'s child, following the ~b2 link.

Analysis.

Let the total number of non-* letters in $C_{max}$ be $n_{max}$. For a max-subpattern $w$ containing $n_w$ ($n_w > 1$) non-* letters, we need to follow $n_{max} - n_w$ links to find the node and create at most $n_{max} - n_w + 1$ new nodes in the worst case. Therefore, the time complexity of node search and node creation will be less than $n_{max}$. Also, since each insertion of a max-subpattern will create either 0 nodes (when it hits) or fewer than $n_{max}$ nodes, the total number of nodes in the tree is less than $n_{max} \times |H|$, where $|H|$ is the size of the hit set.

In general, to insert a subpattern we need to both locate the position and update the count of the node if the node is found, or otherwise insert one or several new nodes.

Example 4.1 Let Figure 1 be the current max-subpattern tree $T$. To insert a (max) subpattern ab1*** into the tree, we search the tree starting with the root, a{b1,b2}*d*. The first non-* letter missing is b2 and the second non-* letter missing is d. Thus we first follow the ~b2 branch to node ab1*d*, and then follow the ~d branch. Since the node ab1*** is located, its count is incremented by 1.

Before discussing the derivation of the set of frequent patterns, we need to introduce the concept of reachable ancestors. Since the traversal and creation of the children of a node in the max-subpattern tree follow the non-* letter position order, some of the ancestor nodes of a node may not be directly linked to it. For example, in Figure 1, the node a**d* is linked to only one parent, ab2*d*, but not the other, ab1*d* (note: this missing link is marked by a dashed line in the figure).

In general, the set of reachable ancestors of a node $w$ in a max-subpattern tree $T$ is the set of all the nodes in $T$ which are proper superpatterns of $w$. It can be computed as follows: (1) derive a list $w_m$ of missing letters from $w$ based on $C_{max}$, which is roughly the position-wise difference; (2) the set of linked ancestors consists of those patterns whose missing letters form a proper prefix of $w_m$; and (3) the set of not-linked ancestors consists of those patterns whose missing letters form a proper sublist (but not prefix) of $w_m$.
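The insertion procedure of Algorithm 4.1 and the split of reachable ancestors into linked and not-linked ones can be sketched as follows. This is an illustrative sketch, not the authors' code: the names `Node`, `MaxSubpatternTree` and `split_ancestors` are hypothetical, a node is identified by the tuple of letters it is missing from $C_{max}$, the root is created up front rather than on the first insertion, and the rule that nodes with fewer than 2 non-* letters are not stored is left to the caller.

```python
from itertools import combinations


class Node:
    """One max-subpattern, identified by the letters it is missing from C_max."""

    def __init__(self, missing, parent=None):
        self.missing = missing      # tuple of missing letters, in C_max order
        self.count = 0              # hits registered for exactly this pattern
        self.parent = parent        # None for the root
        self.children = {}          # missing letter -> child Node


class MaxSubpatternTree:
    def __init__(self, letters):
        self.letters = list(letters)    # non-* letters of C_max, in position order
        self.root = Node(())            # simplification: root exists from the start

    def insert(self, kept):
        """Register one hit of the max-subpattern that keeps the letters `kept`."""
        node = self.root
        for letter in self.letters:
            if letter in kept:
                continue
            child = node.children.get(letter)
            if child is None:           # missing ancestor or target: created with count 0
                child = Node(node.missing + (letter,), node)
                node.children[letter] = child
            node = child
        node.count += 1                 # only the target node receives the hit
        return node


def split_ancestors(missing):
    """Return (linked, not_linked) ancestors as lists of missing-letter tuples."""
    missing = tuple(missing)
    linked = [missing[:i] for i in range(len(missing))]          # proper prefixes
    not_linked = [c for r in range(len(missing))
                  for c in combinations(missing, r) if c not in linked]
    return linked, not_linked


tree = MaxSubpatternTree(["a", "b1", "b2", "d"])    # C_max = a{b1,b2}*d*
node = tree.insert({"b1", "d"})                      # first hit: *b1*d*
```

After the first insertion the path root, ~a, ~b2 exists with counts 0, 0 and 1, matching the step-2 example of Algorithm 4.1. For a node missing (a, b1, b2), `split_ancestors` yields three linked ancestors (the proper prefixes) and four not-linked ones.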
Example 4.2 We compute the set of reachable ancestors for a node ***d* in a max-subpattern tree with root $C_{max}$ = a{b1,b2}*d*. The list of missing non-* letters is $(a, b_1, b_2)$. Thus, the set of linked ancestors is: (1) the empty list (missing nothing, which is the root); (2) ~a (i.e., missing a, which is the node *{b1,b2}*d*); and (3) ~a~b1 (i.e., missing a, then missing b1, which is the node *b2*d*). The set of not-linked ancestors is: *b1*d* (corresponding to the missing letter pattern ~a~b2), ab1*d* (corresponding to ~b2), a**d* (corresponding to ~b1~b2), and ab2*d* (corresponding to ~b1). In other words, one can follow the links whose mark is not ~d in an ordered way (to avoid visiting the same node more than once) and collect all the non-$w$ nodes reached in $T$.

Essentially there is a tree traversal for each fixed pattern, except that we do not visit a node and its descendants if the node is not an ancestor pattern of our current pattern.

The derivation of the frequent $k$-patterns is performed as follows.

Algorithm 4.2 [Derivation of frequent patterns from max-subpattern tree] The derivation of the frequent $k$-patterns for all $k$, given a max-subpattern tree $T$, by an Apriori-like technique.

Method.

1. The set of frequent 1-patterns $F_1$ is derived in the first scan of Algorithm 3.2.

2. The max-subpattern tree $T$ is derived in the second scan of Algorithm 3.2. The set of frequent $k$-patterns ($k > 1$) is derived as follows.

   for $i := 2$ to $|F_1|$ do {
     • derive candidate patterns with $L$-length $i$ from frequent patterns with $L$-length $(i - 1)$ by "[...]-way join";
     • scan tree $T$ to find frequency counts of these candidate patterns and eliminate the non-frequent ones. Notice that the frequency count of a node is the sum of the count of itself and those of all of its reachable ancestors. If the derived frequent $i$-pattern set is empty, return.
   }

Analysis.

Let the total number of non-* letters in $C_{max}$ be $n_{max}$. As shown in the analysis of Algorithm 4.1, the time complexity for searching a node is less than $n_{max}$. Since there are at most $2^{n_{max}}$ nodes to be generated from the max-pattern tree (including all the missing descendants), and there are at most $|H|$ reachable ancestors in $T$, where $|H|$ is the size of the hit set, the worst-case time complexity for the derivation of all the frequent patterns is $O(n_{max} \times 2^{n_{max}} \times |H|)$, i.e., proportional to [...] and the size of the hit set, but exponential in $n_{max}$ (i.e., proportional to the size of the tree that can be generated by $C_{max}$). Since an infrequent node will reduce the number of candidates to be generated in future rounds, the real processing cost is usually much smaller than the cost in the worst case.

We illustrate how to derive the frequent $k$-patterns for $k > 1$ from the max-subpattern tree $T$.

Example 4.3 Let Figure 1 be the derived max-subpattern tree $T$, and min_conf $\times$ $m$ = 45. We can traverse the max-subpattern tree to find all the frequent $k$-patterns for $k > 1$ as follows. Starting at level 2, we have the following frequent patterns: {*b1*d* (68), *b2*d* (68), *{b1,b2}*** (47), a**d* (119), ab2*** (92), ab1*** (84)}. We show the derivation of *b2*d* (68) here: since the list of missing letters in this node is $(a, b_1)$, its set of reachable ancestors is {root, ~a, ~b1}, and thus its frequency count = 10 + 0 + 50 + 8 (itself) = 68. Since level 2 has no infrequent nodes, we search all the nodes at level 1 and have the following frequent patterns: {ab2*d* (60), ab1*d* (50)}. Since level 1 has an infrequent node, level 0 (the root) has no frequent patterns. Notice that although we only saved one node computation in this case, it will save much more when the tree is large and there are more missing nodes.

From the above example, one can see that there are many frequent $k$-patterns with small $k$ that can be generated from a max-subpattern tree. In practical applications, people may only be interested in the set of maximal frequent patterns instead of all frequent patterns, where a set of maximal frequent patterns is a subset of the frequent pattern set such that every other pattern in the frequent pattern set is a subpattern of an element in the subset. For example, if the set of frequent patterns (for $k > 1$) is {ab**, *bc*, a*c*, abc*}, the set of maximal frequent patterns is {abc*}.
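The level-wise derivation of Example 4.3 and the notion of maximal frequent patterns can be checked with a short Python sketch. It is an illustration under assumptions, not the authors' implementation: the tree is flattened into a dictionary keyed by the set of letters missing from $C_{max}$, the hit counts are this sketch's reading of Figure 1, candidates are generated by the usual Apriori subset check rather than the join named in Algorithm 4.2, and the names `frequency`, `render`, `derive_frequent` and `maximal` are hypothetical.

```python
from itertools import combinations

# C_max = a{b1,b2}*d*, one tuple of letters per position in the period.
CMAX = [("a",), ("b1", "b2"), (), ("d",), ()]
LETTERS = [x for pos in CMAX for x in pos]

# Hit counts as read off Figure 1, keyed by the letters missing from C_max.
HITS = {
    frozenset(): 10,
    frozenset({"a"}): 0, frozenset({"b1"}): 50,
    frozenset({"b2"}): 40, frozenset({"d"}): 32,
    frozenset({"a", "b1"}): 8, frozenset({"a", "b2"}): 18,
    frozenset({"a", "d"}): 5, frozenset({"b1", "b2"}): 19,
    frozenset({"b1", "d"}): 0, frozenset({"b2", "d"}): 2,
}


def frequency(kept):
    """Count of the node itself plus the counts of all its reachable ancestors."""
    missing = frozenset(LETTERS) - kept
    return sum(c for node, c in HITS.items() if node <= missing)


def render(kept):
    """Write a set of kept letters as a pattern string such as 'ab2*d*'."""
    out = []
    for pos in CMAX:
        here = [x for x in pos if x in kept]
        out.append("*" if not here else here[0] if len(here) == 1
                   else "{" + ",".join(here) + "}")
    return "".join(out)


def derive_frequent(threshold):
    """Level-wise derivation: returns {kept letters: frequency count}."""
    frequent = {}
    prev = {frozenset({x}) for x in LETTERS}        # letters of F_1 in C_max
    for size in range(2, len(LETTERS) + 1):
        cands = {frozenset(c) for c in combinations(LETTERS, size)
                 if all(frozenset(s) in prev for s in combinations(c, size - 1))}
        prev = {c for c in cands if frequency(c) >= threshold}
        if not prev:                                # empty level: stop
            break
        frequent.update({c: frequency(c) for c in prev})
    return frequent


def maximal(frequent):
    """Frequent patterns that are not a proper subpattern of another one."""
    return [k for k in frequent if not any(k < other for other in frequent)]


freq = derive_frequent(45)
by_name = {render(k): v for k, v in freq.items()}
```

With threshold 45 the sketch reproduces the counts quoted in Example 4.3, and the maximal frequent patterns it reports are ab2*d*, ab1*d* and *{b1,b2}***.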
If a user is interested in deriving the set of maximal frequent patterns, the MaxMiner algorithm developed by Bayardo [4] is a good candidate. The success of this algorithm stems from generating new candidates by joining frequent itemsets and looking ahead. However, it still requires scanning $S$ up to $p$ (the period) times in the worst case. A mixture of the max-subpattern hit set method and MaxMiner can get rid of this problem and will be more efficient than pure MaxMiner. The details of the new method will be examined in future research.

5 Performance study

In this section we report a performance study which compares the performance of the periodicity mining algorithms proposed in this paper. In particular, we give a performance comparison between the single-period Apriori algorithm (Algorithm 3.1) (or simply Apriori), and the max-subpattern hit-set algorithm (Algorithm 3.2) (or simply hit-set) applied to a single period.

This comparison indicates that there is a significant gain in efficiency by max-subpattern hit-set over Apriori. Since there is more gain when applied to multiple periods by using max-subpattern hit-set, it is clear that max-subpattern hit-set is the winner.

The performance study is conducted on a Pentium 166 machine with 64 megabytes of main memory, running Windows/NT. The program is written in Microsoft/Visual C++.

5.1 Testing Databases

Each test time series is a synthetic time-series database generated using a randomized periodicity data generation algorithm. From a set of features, potentially frequent 1-patterns are composed. The size of the potentially frequent 1-patterns is determined based on a Poisson distribution. These patterns are generated and put into the time series according to an exponential distribution.

  LENGTH            the length of the time series
  $p$               the period
  MAX-PAT-LENGTH    the maximal $L$-length of frequent patterns
  $|F_1|$           the number of frequent 1-patterns

Table 1. Parameters of synthetic time series

The basic parameters used to generate the synthetic databases are listed in Table 1. The parameters LENGTH (the length of the time series) and $p$ (the period) are independently chosen. The parameters MAX-PAT-LENGTH (the maximal $L$-length of frequent patterns) and $|F_1|$ (the number of frequent 1-patterns) are for a fixed $p$, and they are controlled by the choice of some appropriate confidence threshold. We found that other parameters, such as the number of features occurring at a fixed position and the number of features in the time series, do not have much impact on the performance result, and thus they are not considered in the tests.

5.2 Performance comparison of the algorithms

Figure 2 shows there is a significant efficiency gain by max-subpattern hit-set over Apriori. In this figure, the maximal pattern length (the maximal $L$-length of frequent partial periodic patterns) grows from 2 to 10. The other parameters are kept constant: $p = 50$ and $|F_1| = 12$. We run two sets of tests, one with the length of the time series being 100,000 and the other being 500,000. As we can see, the running time of max-subpattern hit-set is almost constant for both cases, while that of Apriori is almost linear. When MAX-PAT-LENGTH is [...], the gain by max-subpattern hit-set over Apriori is about double. We expect this gain will increase for larger MAX-PAT-LENGTH.

Figure 2. Performance gain when MAX-PAT-LENGTH increases: $p = 50$, $|F_1| = 12$. (Plot of running time in seconds, on an axis from 0 to 7000, against Max-Pat-Length from 2 to 10, with curves Apriori 500k, HitSet 500k, Apriori 100k and HitSet 100k.)

It is important to note that the gain shown in Figure 2 is obtained by keeping everything in memory and by considering only one period. In general, this is unlikely to be the case, and max-subpattern hit-set will perform even better than Apriori for the following reasons:

• In general, the time series of features may need to be stored on disk, due to factors such as each $D_i$ possibly containing thousands of features and the length of the time series possibly being longer. When the time series is stored on disk, there would be a large amount of extra disk I/O associated with Apriori, but not with max-subpattern hit-set since it only requires two scans. Even when the time series is not stored on disk, Apriori will need to go over this huge sequence many more times than max-subpattern hit-set. Thus max-subpattern hit-set will be far better than Apriori.

• When there is a range of periods to consider, max-subpattern hit-set can find all frequent patterns in two scans but Apriori will require many more scans, depending on the number of periods and the $L$-length of the maximal frequent patterns. Hence max-subpattern hit-set will again be far better than Apriori.
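The scan-count argument in these two reasons can be turned into a rough estimate. The Apriori side rests on an assumption that is not stated above, namely that a level-wise Apriori run needs about one scan per $L$-length level and is repeated for each of the $k$ periods; the hit-set side uses the two scans established for Algorithm 3.4.

```latex
\text{scans}_{\text{Apriori}} \;\approx\; k \times \text{MAX-PAT-LENGTH},
\qquad
\text{scans}_{\text{hit-set}} \;=\; 2 .
```

Under this assumption, a single period ($k = 1$) with MAX-PAT-LENGTH = 10, the largest setting in Figure 2, would cost on the order of 10 scans for Apriori against 2 for hit-set, and the gap widens linearly with the number of periods.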
6 Conclusions

We have studied efficient methods for mining partial periodicity in time series databases. Partial periodicity, which associates periodic behavior with only a subset of all the time points, is less restrictive than full periodicity and thus covers a broad class of applications.

By exploring several interesting properties related to partial periodicity, including the Apriori property, the max-subpattern hit set property, and shared mining of multiple periods, a set of partial periodicity mining algorithms is proposed, with their relative performance compared. Our study shows that the max-subpattern hit set method, which needs only two scans of the time series database, even for mining multiple periods, offers excellent performance.

Our study has been confined to mining partial periodic patterns in one time series for categorical data with a single level of abstraction. However, the method developed here can be extended for mining multiple-level, multiple-dimensional partial periodicity and for mining partial periodicity with perturbation and evolution.

For mining numerical data, such as stock or power consumption fluctuation, one can examine the distribution of numerical values in the time-series data and discretize them into single- or multiple-level categorical data. For mining multiple-level partial periodicity, one can explore level-shared mining by first mining the periodicity at a high level, and then progressively drilling down with the discovered periodic patterns to see whether they are still periodic at a lower level.

Perturbation may happen from period to period, which may make it difficult to discover partial periodicity in many applications. For mining partial periodicity with perturbation, one method is to slightly enlarge the time slot to be examined. Partial periodic patterns with minor perturbation are likely to be caught in the generalized time slot. Another method is to include the features happening in the time slots surrounding the one being analyzed. We can further employ regression techniques to reduce the noise of perturbation.

There are still many issues regarding partial periodicity mining which deserve further study, such as further exploration of shared mining for mining periodicity with multiple periods, mining periodic association rules based on partial periodicity, and query- and constraint-based mining of partial periodicity [11]. We are studying these problems and implementing our algorithms for mining partial periodicity in a data mining system and will report our progress in the future.

References

[1] R. Agrawal, G. Psaila, E. L. Wimmers, and M. Zait. Querying shapes of histories. In Proc. 21st Int. Conf. Very Large Data Bases, pages 502–514, Zurich, Switzerland, Sept. 1995.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases, pages 487–499, Santiago, Chile, September 1994.
[3] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. 1995 Int. Conf. Data Engineering, pages 3–14, Taipei, Taiwan, March 1995.
[4] R. J. Bayardo. Efficiently mining long patterns from databases. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, pages 85–93, Seattle, Washington, June 1998.
[5] C. Bettini, X. Sean Wang, and S. Jajodia. Mining temporal relationships with multiple granularities in time sequences. Data Engineering Bulletin, 21:32–38, 1998.
[6] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proc. 1995 Int. Conf. Very Large Data Bases, pages 420–431, Zurich, Switzerland, Sept. 1995.
[7] J. Han, W. Gong, and Y. Yin. Mining segment-wise periodic patterns in time-related databases. In Proc. 1998 Int'l Conf. on Knowledge Discovery and Data Mining (KDD'98), New York City, NY, August 1998.
[8] H. J. Loether and D. G. McTavish. Descriptive and Inferential Statistics: An Introduction. Allyn and Bacon, 1993.
[9] H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional inter-transaction association rules. In Proc. 1998 SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'98), pages 12:1–12:7, Seattle, Washington, June 1998.
[10] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering frequent episodes in sequences. In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining, pages 210–215, Montreal, Canada, Aug. 1995.
[11] R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained associations rules. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, pages 13–24, Seattle, Washington, June 1998.
[12] B. Özden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. In Proc. 1998 Int. Conf. Data Engineering (ICDE'98), pages 412–421, Orlando, FL, Feb. 1998.