Fast and Accurate Time Series Classification with WEASEL

Patrick Schäfer
Humboldt University of Berlin, Berlin, Germany
[email protected]

Ulf Leser
Humboldt University of Berlin, Berlin, Germany
[email protected]

arXiv:1701.07681v1 [cs.DS] 26 Jan 2017

ABSTRACT

Time series (TS) occur in many scientific and commercial applications, ranging from earth surveillance to industry automation to smart grids. An important type of TS analysis is classification, which can, for instance, improve energy load forecasting in smart grids by detecting the types of electronic devices based on their energy consumption profiles recorded by automatic sensors. Such sensor-driven applications are very often characterized by (a) very long TS and (b) very large TS datasets needing classification. However, current methods for time series classification (TSC) cannot cope with such data volumes at acceptable accuracy; they are either scalable but offer only inferior classification quality, or they achieve state-of-the-art classification quality but cannot scale to large data volumes.

In this paper, we present WEASEL (Word ExtrAction for time SEries cLassification), a novel TSC method which is both scalable and accurate. Like other state-of-the-art TSC methods, WEASEL transforms time series into feature vectors, using a sliding-window approach, which are then analyzed through a machine learning classifier. The novelty of WEASEL lies in its specific method for deriving features, resulting in a much smaller yet much more discriminative feature set. On the popular UCR benchmark of 85 TS datasets, WEASEL is more accurate than the best current non-ensemble algorithms at orders-of-magnitude lower classification and training times, and it is almost as accurate as ensemble classifiers, whose computational complexity makes them inapplicable even for mid-size datasets. The outstanding robustness of WEASEL is also confirmed by experiments on two real smart grid datasets, where it achieves, out of the box, almost the same accuracy as highly tuned, domain-specific methods.

Keywords
Time Series, Classification, Feature Selection, Bag-of-patterns, Word Co-Occurrences.

Figure 1: Daily power consumption of seven appliances with two samples per class. Bottom to top: dishwasher, microwave oven, digital receiver, coffee-maker, amplifier, lamp, monitor.

1. INTRODUCTION

A (one-dimensional) time series (TS) is a collection of values sequentially ordered in time. TS emerge in many scientific and commercial applications, like weather observations, wind energy forecasting, industry automation, mobility tracking, etc.
One driving force behind their rising importance is the sharply increasing use of sensors for automatic and high-resolution monitoring in domains like smart homes [19], starlight observations [31], machine surveillance [27], or smart grids [40, 17].

Research in TS is diverse and covers topics like storage, compression, clustering, etc.; see [10] for a survey. In this work, we study the problem of time series classification (TSC): Given a concrete TS, the task is to determine to which of a set of predefined classes this TS belongs, the classes typically being characterized by a set of training examples. Research in TSC has a long tradition [2, 10], yet progress was focused on improving classification accuracy and mostly neglected scalability, i.e., the applicability in areas with very many and/or very long TS. However, many of today's sensor-driven applications have to deal with exactly these data, which makes methods that do not scale futile, irrespective of their quality on small datasets. Instead, TSC methods are required that are both very fast and very accurate.

As a concrete example, consider the problem of classifying energy consumption profiles of home devices (a dishwasher, a washing machine, a toaster, etc.). In smart grids, every device produces a unique profile as it consumes energy over time; profiles are unequal between different types of devices, but rather similar for devices of the same type (see Figure 1). The resulting TSC problem is as follows: Given an energy consumption profile (which is a TS), determine the device type based on a set of exemplary profiles per type. For an energy company, such information helps to improve the prediction of future energy consumption [14, 13]. For approaching these kinds of problems, algorithms that are very fast and very accurate are required. Regarding scalability, consider millions of customers, each having dozens of devices, each recording one measurement per second. To improve forecasting, several millions of classifications of time series have to be performed every hour, each considering thousands of measurements. Even when optimizations like TS sampling or adaptive re-classification intervals are used, the number of classifications remains overwhelming and can only be approached with very fast TSC methods. Regarding accuracy, it should be considered that any improvement in prediction accuracy may directly translate into substantial monetary savings. For instance, [40, 17] report that small improvements in accuracy (below 10%) can save millions of dollars per year and company. However, achieving high classification accuracy for home device energy profiles is nontrivial due to different usage rhythms (e.g., where in a dishwasher cycle has the TS been recorded?), differences in the profiles between concrete devices of the same type, and noise within the measurements, for instance because of the usage of cheap sensors.

Figure 2: Classification accuracy and single prediction runtime (log scale) for different TSC methods on the energy consumption dataset PLAID. Runtimes include all preprocessing steps (feature extraction, etc.). Methods are explained in detail in Section 2; the system used for measurements is described in Section 5.

Current TSC methods are not able to deal with such data at sufficient accuracy and speed. Several high-accuracy classifiers, such as Shapelet Transform (ST) [6], have bi-quadratic complexity (power of 4) in the length of the TS; even methods with quadratic classification complexity are infeasible. The current most accurate method (COTE [3]) even is an ensemble of dozens of core classifiers, many of which have a quadratic, cubic or bi-quadratic complexity. On the other hand, fast TSC methods, such as BOSS VS [36] or Fast Shapelets [34], perform much worse in terms of accuracy compared to the state of the art [2]. As a concrete example, consider the (actually rather small) PLAID benchmark dataset [13], consisting of 1074 profiles of 501 measurements each, stemming from 11 different devices. Figure 2 plots classification times (in log scale) versus accuracy for seven state-of-the-art TSC methods and the novel algorithm presented in this paper, WEASEL. Euclidean distance (ED) based methods are the fastest, but their accuracy is far below standard. Dynamic Time Warping methods (DTW, DTW CV) are common baselines and show a moderate runtime of 10 to 100 ms but also low accuracy. Highly accurate classifiers such as ST [6] and BOSS [37] require orders-of-magnitude longer prediction times. For this rather small dataset, the COTE ensemble classifier had not yet terminated training after eight CPU weeks (Linux user time), thus we cannot report its accuracy yet. In summary, the fastest methods for this dataset require around 1 ms per prediction but have an accuracy below 80%; the most accurate methods achieve 85%-88% accuracy but require 80 ms up to 32 sec for each TS.

In this paper, we propose a new TSC method called WEASEL: Word ExtrAction for time SEries cLassification. WEASEL is both very fast and very accurate; for instance, on the dataset shown in Figure 2 it achieves the highest accuracy while being the third-fastest algorithm (requiring only 4 ms per TS). Like several other methods, WEASEL conceptually builds on the so-called bag-of-patterns approach: It moves a sliding window over a TS and extracts discrete features per window, which are subsequently fed into a machine learning classifier. However, the concrete way of constructing and filtering features in WEASEL is completely different from any previous method. First, WEASEL considers differences between classes already during feature discretization instead of relying on fixed, data-independent intervals; this leads to a highly discriminative feature set. Second, WEASEL uses windows of varying lengths and also considers the order of windows instead of treating each fixed-length window as an independent feature; this allows WEASEL to better capture the characteristics of each class. Third, WEASEL applies aggressive statistical feature selection instead of simply using all features for classification; this leads to a much smaller feature space and heavily reduced runtime without impacting accuracy. The resulting feature set is highly discriminative, which allows us to use fast logistic regression instead of more elaborate, but also more runtime-intensive methods.
We performed a series of experiments to assess the impact of (each of) these improvements. First, we evaluated WEASEL on the popular UCR benchmark set of 85 TS collections [43], covering a variety of applications, including motion tracking, ECG signals, chemical spectrograms, and starlight curves. WEASEL outperforms the best core classifiers in terms of accuracy while also being one of the fastest methods; it is almost as accurate as the current overall best method (COTE) but multiple orders of magnitude faster in training and in classification. Second, for the concrete use case of energy load forecasting, we applied WEASEL to two real-life datasets and compared its performance to the other general TSC methods and to algorithms specifically developed and tuned for this problem. WEASEL again outperforms all other TS core classifiers in terms of accuracy while being very fast, and achieves an accuracy on par with the domain-specific methods without any domain adaptation.

The rest of this paper is organized as follows: In Section 2 we present related work. Section 3 briefly recaps bag-of-patterns classifiers and feature discretization using the Fourier transform. In Section 4 we present WEASEL's novel way of feature generation and selection. Section 5 presents evaluation results. The paper concludes with Section 6.

2. RELATED WORK

With time series classification (TSC) we denote the problem of assigning a given TS to one of a predefined set of classes. TSC has applications in many domains; for instance, it is applied to determine the species of a flying insect based on the acoustic profile generated from its wing-beat [30], or for identifying the most popular TV shows from smart meter data [16].

The techniques used for TSC can be broadly categorized into two classes: whole-series-based methods and feature-based methods [22]. Whole-series similarity measures make use of a point-wise comparison of entire TS. These include 1-NN Euclidean Distance (ED) or 1-NN Dynamic Time Warping (DTW) [33], which is commonly used as a baseline in comparisons [23, 2].
Typically, these techniques work well for short but fail for noisy or long TS [37]. Furthermore, DTW has a computational complexity of O(n^2) for TS of length n. Techniques like early pruning of candidate TS with cascading lower bounds can be applied to reduce the effective runtime [33]. Another speed-up technique first clusters the input TS based on the fast ED and later analyzes the clusters using the triangle inequality [28].

In contrast, feature-based classifiers rely on comparing features generated from substructures of TS. The most successful approaches can be grouped as either using shapelets or bag-of-patterns (BOP). Shapelets are defined as TS subsequences that are maximally representative of a class. In [26] a decision tree is built on the distance to a set of shapelets. The Shapelet Transform (ST) [24, 6], which is the most accurate shapelet approach according to a recent evaluation [2], uses the distance to the shapelets as input features for an ensemble of different classification methods. In the Learning Shapelets (LS) approach [15], optimal shapelets are synthetically generated. The drawback of shapelet methods is their high computational complexity, resulting in rather long training and classification times.

Figure 3: Transformation of a TS into the Bag-of-Patterns (BOP) model using overlapping windows (second to top), discretization of windows to words (second from bottom), and word counts (bottom).

The alternative approach within the class of feature-based classifiers is the bag-of-patterns (BOP) model [22]. Such methods break up a TS into a bag of substructures, represent these substructures as discrete features, and finally build a histogram of feature counts as the basis for classification. The first published BOP model (which we abbreviate as BOP-SAX) uses sliding windows of fixed lengths and transforms the measurements in each window into discrete features using Symbolic Aggregate approXimation (SAX) [21]. Classification is implemented as a 1-NN classifier using the Euclidean distance of feature counts as distance measure. SAX-VSM [39] extends BOP-SAX with tf-idf weighting of features and uses the Cosine distance; furthermore, it builds only one feature vector per class instead of one vector per sample, which drastically reduces runtime. Another current BOP algorithm is the TS bag-of-features framework (TSBF) [4], which first extracts windows at random positions with random lengths and next builds a supervised codebook generated from a random forest classifier.
In our prior work, we presented the BOP-based algorithm BOSS (Bag-of-SFA-Symbols) [37], which uses the Symbolic Fourier Approximation (SFA) [38] instead of SAX. In contrast to shapelet-based approaches, BOP-based methods typically have only linear computational complexity for classification.

The most accurate current TSC algorithms are ensembles. These classify a TS by a set of different core classifiers and then aggregate the results using techniques like bagging or majority voting. The Elastic Ensemble (EE PROP) classifier [23] uses 11 whole-series classifiers including DTW CV, DTW, LCSS and ED. The COTE ensemble [3] is based on 35 core TSC methods including EE PROP and ST. If designed properly, ensembles combine the advantages of their core classifiers, which often leads to superior results. However, the price to pay is an excessive runtime requirement for training and for classification, as each core classifier is used independently of all others.

3. TIME SERIES, BOP, AND SFA

The method we introduce in this paper follows the BOP approach and uses truncated Fourier transformations as a first step of feature generation. In this section we present these fundamental techniques, after formally introducing time series and time series classification.

In this work, a time series (TS) T is a sequence of n ∈ N real values, T = (t_1, ..., t_n), t_i ∈ R.¹ As we primarily address time series generated from automatic sensors with a fixed sampling rate, we ignore timestamps. Given a TS T, a window S of length w is a subsequence with w contiguous values starting at offset a in T, i.e., S(a, w) = (t_a, ..., t_{a+w-1}) with 1 ≤ a ≤ n − w + 1. We associate each TS with a class label y ∈ Y from a predefined set Y. Time series classification (TSC) is the task of predicting a class label for a TS whose label is unknown. A TS classifier is a function that is learned from a set of labeled time series (the training data), takes an unlabeled time series as input and outputs a label.

¹ Extensions to multivariate time series are discussed in Section 6.

Algorithms following the BOP model build this classification function by (1) extracting windows from a TS, (2) transforming each window of real values into a discrete-valued word (a sequence of symbols over a fixed alphabet), (3) building a feature vector from word counts, and (4) finally using a classification method from the machine learning repertoire on these feature vectors. Figure 3 illustrates these steps from a raw time series to a BOP model using overlapping windows.
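To make steps (1) to (4) concrete, the following is a minimal Python sketch that builds a BOP histogram for one TS. It is illustrative only: the toy discretize function, the window length, and the use of Python (the methods above are published as Java code [5]) are our assumptions, not the concrete choices of any cited method.

import numpy as np
from collections import Counter

def bop_histogram(ts, w, discretize):
    # steps (1)-(3): slide a window over ts, map each window to a
    # discrete word, and count word occurrences
    bag = Counter()
    for a in range(len(ts) - w + 1):
        window = ts[a:a + w]              # (1) extract window S(a, w)
        word = discretize(window)         # (2) real values -> word
        bag[word] += 1                    # (3) histogram of word counts
    return bag

def toy_discretize(window):
    # toy discretizer: label each of 4 segments by the sign of its mean
    segments = np.array_split(window, 4)
    return ''.join('a' if s.mean() < 0 else 'b' for s in segments)

ts = np.sin(np.linspace(0, 8 * np.pi, 200)) + 0.1 * np.random.randn(200)
bag = bop_histogram(ts, w=32, discretize=toy_discretize)
# (4) the histograms of all training TS are then fed into a
# standard machine learning classifier
print(bag.most_common(3))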
BOP methods differ in the concrete way of transforming a window of real-valued measurements into discrete words (discretization). WEASEL builds upon SFA, which works as follows [38]: (1) Values in each window are normalized to have a standard deviation of 1 to obtain amplitude invariance. (2) Each normalized window of length w is subjected to dimensionality reduction by the use of the truncated Fourier transform, keeping only the first l < w coefficients for further analysis. This step acts as a low-pass filter, as higher-order Fourier coefficients typically represent rapid changes like dropouts or noise. (3) Each coefficient is discretized to a symbol of an alphabet of fixed size c to achieve further robustness against noise. Figure 4 exemplifies this process.

Figure 4: The Symbolic Fourier Approximation (SFA): A time series (left) is approximated using the truncated Fourier transformation (center) and discretized to the word ABDDABBB (right) with the four-letter alphabet ('a' to 'd'). The inverse transform is depicted by an orange area (right), representing the tolerance for all signals that will be mapped to the same word.
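The three SFA steps translate into a short sketch; here the quantization boundaries bins are assumed to be given (in SFA they are learned from the data [38]), and numpy's FFT stands in for whatever Fourier implementation is actually used:

import numpy as np

def sfa_word(window, l, alphabet, bins):
    # (1) z-scale the window to standard deviation 1 (amplitude invariance)
    w = (window - window.mean()) / window.std()
    # (2) truncated Fourier transform: keep only the first l
    # real/imaginary values (low-pass filtering)
    coeffs = np.fft.rfft(w)
    vals = np.empty(2 * len(coeffs))
    vals[0::2], vals[1::2] = coeffs.real, coeffs.imag
    vals = vals[:l]
    # (3) discretize each value with its own bin boundaries
    return ''.join(alphabet[np.searchsorted(bins[i], v)]
                   for i, v in enumerate(vals))

window = np.random.randn(64)
bins = [np.array([-1.0, 0.0, 1.0])] * 8   # 3 boundaries -> c = 4 symbols
print(sfa_word(window, l=8, alphabet='abcd', bins=bins))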
4. WEASEL

In this section, we present our novel TSC method WEASEL (Word ExtrAction for time SEries cLassification). WEASEL specifically addresses the major challenges any TSC method has to cope with when being applied to data from sensor readouts, which can be summarized as follows (using home device classification as an example):

Invariance to noise: TS can be distorted by (ambiance) noise as part of the recording process. In a smart grid, such distortions are created by imprecise sensors, information loss during transmission, stochastic differences in energy consumption, or interference of different consumers connected to the same power line. Identifying TS class-characteristic patterns requires robustness to noise.

Scalability: TS in sensor-based applications are typically recorded with high sampling rates, leading to long TS. Furthermore, smart grid applications typically have to deal with thousands or millions of TS. TSC methods in such areas need to be scalable in the number and length of TS.

Variable lengths and offsets: TS to be classified may have variable lengths, and recordings of to-be-classified intervals can start at any given point in time. In a smart grid, sensors produce continuous measurements, and the partitioning of this essentially infinite stream into classification intervals is independent of the usage of devices. Thus, characteristic patterns may appear anywhere in a TS (or not at all), but typically in the same order.

Unknown characteristic substructures: Feature-based classifiers exploit local substructures within a TS, and thus depend on the identification of recurring, characteristic patterns. However, the position, form, and frequency of these patterns is unknown; many substructures may be irrelevant for classification. For instance, the idle periods of the devices in Figure 1 are essentially identical.

We carefully engineered WEASEL to address these challenges. Our method conceptually builds on the BOP model in BOSS [37], yet uses rather different approaches in many of the individual steps. We will use the terms feature and word interchangeably throughout the text. Compared to previous works in TSC, WEASEL implements the following novel ideas, which will be explained in detail in the following subsections:

1. Discriminative feature generation: WEASEL derives discriminative features based on class characteristics of the concrete dataset. This differs from current BOP methods [21, 38], which apply the same feature generation method independent of the actual dataset, possibly leading to features that are equally frequent in all classes, and thus not discriminative. Specifically, our approach first Fourier transforms each window, next determines discriminative Fourier coefficients using the ANOVA F-test, and finally applies information gain binning for choosing appropriate discretization boundaries. Each step aims at separating TS from different classes.

2. Co-occurring words: The order of substructures (each represented by a word) is lost in the BOP model. To mitigate this effect, WEASEL also considers bi-grams of words as features. Thus, local order is encoded into the model, but as a side effect the feature space is increased drastically.

3. Variable-length windows: Typically, characteristic TS patterns do not all have the same length. Current BOP approaches, however, assume a fixed window length, which leads to ignorance regarding patterns of different lengths. WEASEL removes this restriction by extracting words for multiple window lengths and joining all resulting words in a single feature vector, instead of training separate vectors and selecting (the best) one as in other BOP models. This approach can capture more relevant signals, but again increases the feature space.

4. Feature selection: The wide range of features considered captures more of the characteristic TS patterns but also introduces many irrelevant features. Therefore, WEASEL uses an aggressive Chi-squared test to filter the most relevant features in each class and reduce the feature space without negatively impacting classification accuracy.

Figure 5: WEASEL pipeline: Feature extraction using our novel supervised symbolic representation, the novel bag-of-patterns model, and feature matching using a logistic regression classifier.

WEASEL is composed of the building blocks depicted in Figure 5: our novel supervised symbolic representation for discriminative feature generation and the novel bag-of-patterns model for building a discriminative feature vector. First, WEASEL extracts normalized windows of different lengths from a time series. Next, each window is approximated using the Fourier transform, and those Fourier coefficients are kept that best separate TS from different classes using the ANOVA F-test. The remaining Fourier coefficients are discretized into a word using information gain binning, which also chooses discretization boundaries to best separate the TS classes; more detail is given in Subsection 4.2. Finally, a single bag-of-patterns is built from the words (unigrams) and neighboring words (bigrams). This bag-of-patterns encodes unigrams, bigrams and windows of variable lengths. To filter irrelevant words, the Chi-squared test is applied to this bag-of-patterns (Subsection 4.1). As WEASEL builds a highly discriminative feature vector, a fast linear-time logistic regression classifier is applied, as opposed to more complex, quadratic-time classifiers (Subsection 4.1).
Algorithm 1: Build one bag-of-patterns using a supervised symbolic representation, multiple window lengths, bigrams and the Chi-squared test for feature selection. l is the number of Fourier values to keep.

1  function WEASEL(sample, l)
2    bag = empty BagOfPattern
3
4    // extract windows for each window length
5    for each window length w:
6      allWindows = SLIDING_WINDOWS(sample, w)
7      norm(allWindows)
8
9      for each (prevWindow, window) in allWindows:
10       // BOP computed from unigrams
11       word = quantization.transform(window, l)
12       bag[w ++ word].increaseCount()
13
14       // BOP computed from bigrams
15       prevWord = quantization.transform(prevWindow, l)
16       bag[w ++ prevWord ++ word].increaseCount()
17
18   // feature selection using ChiSquared
19   return CHI_SQUARED_FILTERED(bag)

Figure 6: Discriminative feature vector: Four time series, two from class 'A' and two from class 'B', are shown. Feature vectors contain unigrams and bigrams for the window lengths w of 50 and 75. The discriminative words are highlighted.

Algorithm 1 illustrates WEASEL: sliding windows of length w are extracted (line 6) and windows are normalized (line 7). We empirically set the window lengths to all values in [8, ..., n]. Smaller values are possible, but the feature space can become intractable, and small window lengths are basically meaningless for TS of length > 10^3. Our supervised symbolic transformation is applied to each real-valued sliding window (lines 11, 15). Each word is concatenated with the window length and its occurrence is counted (lines 12, 16). Lines 15-16 illustrate the use of bigrams: the word of the preceding sliding window is concatenated with the current word. Note that all words (unigrams, bigrams, window-length) are joined within a single bag-of-patterns. Finally, irrelevant words are removed from this bag-of-patterns using the Chi-squared test (line 19). The target dimensionality l is learned through cross-validation.
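For readers who prefer executable code, the following is a direct (and deliberately naive) Python transcription of Algorithm 1; supervised_transform stands in for the trained supervised symbolic representation of Subsection 4.2, window normalization is assumed to happen inside it, and the Chi-squared filtering of line 19 is deferred to Subsection 4.1.

from collections import Counter

def weasel_bag(sample, l, window_lengths, supervised_transform):
    # one joint bag over all window lengths; every key is prefixed
    # with its window length, e.g. '50 db' or '75 aa ca'
    bag = Counter()
    for w in window_lengths:
        windows = [sample[a:a + w] for a in range(len(sample) - w + 1)]
        words = [supervised_transform(win, l, w) for win in windows]
        for i, word in enumerate(words):
            bag[f"{w} {word}"] += 1                  # unigrams (lines 10-12)
            if i > 0:                                # bigrams (lines 14-16)
                bag[f"{w} {words[i - 1]} {word}"] += 1
    return bag   # Chi-squared filtering (line 19) is applied afterwards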
BOP-based methods have a number of parameters, which heavily influence their performance. Of particular importance is the window length w. An optimal value for this parameter is typically learned for each new dataset using techniques like cross-validation. This does not only carry the danger of over-fitting (if the training samples are biased compared to the to-be-classified TS), but also leads to substantial training times. In contrast, WEASEL removes the need to set this parameter by constructing one joint high-dimensional feature vector, in which every feature encodes the parameter's value (Algorithm 1, lines 12, 16).

Figure 6 illustrates our use of unigrams, bigrams and variable window lengths. The depicted dataset contains two classes 'A' and 'B' with two samples each. The time series are very similar, and differences between them are difficult to spot, being mostly located between time stamps 80 and 100 to 130. The center (right) column illustrates the features extracted for window length 50 (75). Feature '75 aa ca' (a bigram for length 75) is characteristic for the A class, whereas feature '50 db' (a unigram for length 50) is characteristic for the B class. Thus, we use different window lengths, bigrams, and unigrams to capture subtle differences between TS classes. We show the impact of variable-length windows and bigrams on classification accuracy in Section 5.5.

4.1 Feature Selection and Weighting: Chi-squared Test and Logistic Regression

The dimensionality of the BOP feature space is O(c^l) for word length l and c symbols. It is independent of the number of time series N, as these only affect the frequencies. For common parameters like c = 4, l = 4, n = 256 this results in a sparse vector with 4^4 = 256 dimensions for a TS. WEASEL uses bigrams and O(n) window lengths, thus the dimensionality of the feature space rises to O(c^l · c^l · n). For the previous set of parameters this feature space explodes to 4^8 · 256 = 256^3.

WEASEL uses the Chi-squared (χ²) test to identify the most relevant features in each class, reducing this feature space to a few hundred features prior to training the classifier. This statistical test determines if for any feature the observed frequency within a specific group significantly differs from the expected frequency, assuming the data is nominal. Larger χ²-values imply that a feature occurs more frequently within a specific class. Thus, we keep those features with χ²-values above a threshold. This highlights subtle distinctions between classes. All other features can be considered superfluous and are removed. On average this reduces the size of the feature space by 30-70% to roughly 10^4 to 10^5 features.

Still, with thousands of time series or features, an accurate quadratic-time classifier can take days to weeks to train on medium-sized datasets [36]. For sparse vectors, linear classifiers are among the fastest, and they are known to work well for high-dimensional (sparse) vectors, as in document classification. These linear classifiers predict the label based on a dot product of the input feature vector and a weight vector. The weight vector represents the model trained on labeled training samples. Using a weight vector highlights features that are characteristic for a class label and suppresses irrelevant features. Thus, the classifier aims at finding those features that can be used to determine a class label. Methods to obtain a weight vector include Support Vector Machines [8] or logistic regression [12]. We implemented our classifier using liblinear [11], as it scales linearly with the dimensionality of the feature space [29]. This results in a moderate runtime compared to shapelet or ensemble classifiers, which can be orders of magnitude slower (see Section 5.3).
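As a minimal sketch of this selection-plus-linear-model step, the following uses scikit-learn, whose chi2 scorer and liblinear-backed LogisticRegression mirror the components named above; the toy data and the use of scikit-learn instead of the authors' Java code [42] are assumptions for illustration (the threshold of 2 follows the chi = 2 setting reported in Section 5.1).

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression

def fit_weasel_like(X_counts, y, chi2_threshold=2.0):
    # X_counts: sparse (n_samples x n_features) matrix of word counts
    scores, _ = chi2(X_counts, y)                 # chi^2 statistic per feature
    keep = np.where(scores > chi2_threshold)[0]   # drop non-discriminative words
    clf = LogisticRegression(solver='liblinear')  # sparse-friendly linear model
    clf.fit(X_counts[:, keep], y)
    return keep, clf

# toy bag-of-patterns: 6 samples, 4 word features, 2 classes;
# features 0 and 1 are class-specific, features 2 and 3 are not
X = csr_matrix(np.array([[5, 0, 1, 2], [4, 1, 0, 2], [6, 0, 2, 1],
                         [0, 5, 1, 2], [1, 4, 2, 1], [0, 6, 0, 2]]))
y = np.array([0, 0, 0, 1, 1, 1])
keep, clf = fit_weasel_like(X, y)   # keeps only features 0 and 1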
4.2 Supervised Symbolic Representation

A symbolic representation is needed to transform a real-valued TS window into a word using an alphabet of size c. The problem with SFA [38] is that it (a) filters the high-frequency components of the signal, just like a low-pass filter. But, for instance, the pitch (frequency) of a bird sound is relevant for the species yet lost after low-pass filtering. Furthermore, it (b) does not distinguish between class labels when quantizing values of the Fourier transform. Thus, there is a high likelihood of SFA words occurring in different classes with roughly equal frequencies. For classification, we need discriminative words for each class. Our approach is based on:

1. Discriminative approximation: We introduce feature selection into the approximation step by using the one-way ANOVA F-test: we keep the Fourier values whose distribution best separates the class labels into disjoint groups.

2. Discriminative quantization: We propose the use of information gain [32]. This minimizes the entropy of the class labels for each split, i.e., the majority of values in each partition correspond to the same class label.

Figure 7: Influence of the word model for feature extraction. From left to right: bag-of-patterns for SFA words, our novel discriminative words, and our weighted discriminative words after logistic regression.

In Figure 7 we revisit our sample dataset, this time with a window length of 25. When using SFA words (left), the words are evenly spread over the whole bag-of-patterns for both prototypes. There is no single feature whose absence or presence is characteristic for a class. However, when using our novel discriminative words (center), we observe fewer distinct words, more frequent counts, and the word 'db' is unique within the 'B' class. Thus, any subsequent classifier can separate the classes just by the occurrence of this feature. When training a logistic regression classifier on these words (right), the word 'db' gets boosted and other words are filtered. Note that the counts of the word 'db' differ between the two representations, as it represents other frequency ranges for the SFA and discriminative words. This showcase underlines that not only different window lengths or bigrams (as in Figure 6), but also the symbolic representation helps to generate discriminative feature sets. Our showcase is the Gun-Point dataset [43], which represents the hand movement of actors who aim a gun (prototype A) or point a finger (prototype B) at people.

4.2.1 Discriminative Approximation using the One-Way ANOVA F-test

For approximation, each TS is Fourier transformed first. We aim at finding those real and imaginary Fourier values that best separate between class labels for a set of TS samples, instead of simply taking the first ones. Figure 8 (left) shows the distribution of the Fourier values for the samples from the Gun-Point dataset. The Fourier value that best separates between the classes is imag_3 with the highest F-value of 1.5 (bottom).

Figure 8: On the left: Distribution of Fourier coefficients for the samples from the Gun-Point dataset. The high F-values on imag_3 and real_0 (red text at bottom) should be selected to best separate the samples from class labels 'Gun' and 'Point'. On the right: Zoom in on imag_3. Information gain splitting is applied to find the best bins for the subsequent quantization. High information gain (IG) indicates pure (good) split points.

We chose to use the one-way ANOVA F-test [25] to select the best Fourier coefficients, as it is applicable to continuous variables, as opposed to the Chi-squared test, which is limited to categorical variables. The one-way ANOVA F-test checks the hypothesis that two or more groups have the same normal distribution around the mean. The analysis is based on two estimates for the variance existing within and between groups: the mean square within (MS_W) and the mean square between (MS_B). The F-value is then defined as F = MS_B / MS_W. If there is no difference between the group means, the F-value is close to or below 1. If the groups have different distributions around the mean, MS_B will be larger than MS_W. When used as part of feature selection, we are interested in the largest F-values, equal to large differences between group means.
The F-value is calculated for each real Fourier value real_i ∈ REAL(T) and each imaginary Fourier value imag_i ∈ IMAG(T). We keep those l Fourier values with the largest F-values. In Figure 8 these are real_0 and imag_3 for l = 2, with F-values 0.6 and 1.5.

Assumptions made for the ANOVA F-test:

1. The ANOVA F-test assumes that the data follows a normal distribution with equal variance. The BOP (WEASEL) approach extracts subsequences from z-normalized time series. It has been shown that subsequences extracted from z-normalized time series perfectly mimic a normal distribution [20]. Furthermore, the Fourier transform of a normal distribution

f(x) = 1 / (σ√(2π)) · e^(−x² / (2σ²))

with µ = 0, σ = 1 results in a normal distribution of the Fourier coefficients [7]:

F(t) = ∫ f(x) · e^(−itx) dx = e^(iµt) · e^(−(σt)²/2) = e^(−(σt)²/2)

Thus, the Fourier coefficients follow a symmetrical and unimodal normal distribution with equal variance.

2. The ANOVA F-test assumes that the samples are independently drawn. To guarantee independence, we extract disjoint, i.e. non-overlapping, subsequences to train the quantization intervals. Using disjoint windows for sampling further decreases the likelihood of over-fitting the quantization intervals.
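Putting Subsection 4.2.1 together, per-coefficient F-values can be computed with a standard one-way ANOVA routine; the sketch below uses scipy's f_oneway as an assumed stand-in for the paper's own implementation and ranks real and imaginary Fourier values by their F-value.

import numpy as np
from scipy.stats import f_oneway

def select_fourier_values(windows, labels, l):
    # windows: (n_samples x w) array of z-normalized windows;
    # returns the indices of the l real/imaginary Fourier values with
    # the largest F-values (each class needs at least two samples)
    coeffs = np.fft.rfft(windows, axis=1)
    # interleave as real_0, imag_0, real_1, imag_1, ...
    vals = np.empty((windows.shape[0], 2 * coeffs.shape[1]))
    vals[:, 0::2], vals[:, 1::2] = coeffs.real, coeffs.imag
    groups = [vals[labels == c] for c in np.unique(labels)]
    F, _ = f_oneway(*groups)            # F = MS_B / MS_W per Fourier value
    return np.argsort(F)[::-1][:l]      # keep the l largest F-values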
4.2.2 Discriminative Quantization using Entropy / Information Gain

A quantization step is applied to find, for each selected real or imaginary Fourier value, the best split points, so that in each partition a majority of values correspond to the same class. We use information gain [32] and search for the split with the largest information gain, which represents an increase in purity. Figure 8 (right) illustrates five possible split points for the imag_3 Fourier coefficient on the two labels 'Gun' (orange) and 'Point' (red). The split point with the highest information gain of 0.46 is chosen.

Our quantization is based on binning (bucketing). The value range is partitioned into disjoint intervals, called bins. Each bin is labeled by a symbol. A real value that falls into an interval is represented by its discrete label. Common methods to partition the value range include equi-depth or equi-width bins. These ignore the class label distribution, and splits are solely based on the value distribution. Here we introduce entropy-based binning. This leads to disjoint feature sets. Let Y = {(s_1, y_1), ..., (s_N, y_N)} be a list of value and class label pairs with N unique class labels. The multi-class entropy is then given by:

Ent(Y) = Σ_{(s_i, y_i) ∈ Y} −p_{y_i} · log₂(p_{y_i}),

where p_{y_i} is the relative frequency of label y_i in Y. The entropy for a split point sp, with all labels on the left Y_L = {(s_i, y_i) | s_i ≤ sp, (s_i, y_i) ∈ Y} and all labels on the right Y_R = {(s_i, y_i) | s_i > sp, (s_i, y_i) ∈ Y}, is given by:

Ent(Y, sp) = (|Y_L| / |Y|) · Ent(Y_L) + (|Y_R| / |Y|) · Ent(Y_R)   (1)

The information gain for this split is given by:

Information Gain = Ent(Y) − Ent(Y, sp)   (2)
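The formulas translate directly into code; a minimal sketch, assuming a 1-D array of Fourier values s and aligned class labels y (Ent(Y) is computed here over the distinct labels, the usual reading of the formula above):

import numpy as np

def entropy(y):
    # Ent(Y) = sum over class labels of -p * log2(p)
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(s, y, sp):
    # Eq. (1) and (2): gain of splitting the pairs (s_i, y_i) at sp
    left, right = y[s <= sp], y[s > sp]
    ent_split = (len(left) * entropy(left)
                 + len(right) * entropy(right)) / len(y)
    return entropy(y) - ent_split

def best_split(s, y):
    # the chosen split point is the candidate with maximal gain;
    # splitting after the largest value would leave Y_R empty
    candidates = np.unique(s)[:-1]
    return max(candidates, key=lambda sp: information_gain(s, y, sp))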
Algorithm 2: Entropy-based binning of the real and imaginary Fourier values.

1  function FitBins(dfts, l, c)
2    bins[l][c] // 2d array of bins
3    for i = 1 to l:
4      // order line of class labels sorted by value
5      orderLine = buildOrderLine(dfts, i)
6
7      IGSplit(orderLine, bins[i])
8    return bins
9
10 function IGSplit(orderLine, bins)
11   (sp, Y_L, Y_R) = find split with maximal IG
12   bins.add(sp)
13   if (remaining bins): // recursive splitting
14     IGSplit(Y_L, bins) // left
15     IGSplit(Y_R, bins) // right

Algorithm 2 illustrates entropy-binning for a c-symbol alphabet and word length l. For each of the l real and imaginary Fourier values, an order-line is built (line 5). We then search for the split points that maximize the information gain (lines 7, 11). After choosing the first split point (line 12), any remaining partition Y_L or Y_R that is not pure is recursively split (lines 14-15). The recursion ends once we have found c bins (line 13).

We fix the alphabet size c to 4, as it has been shown in the context of BOP models that using a constant c = 4 is very robust over all TS considered [22, 39, 37].

5. EVALUATION

5.1 Experimental Setup

We mostly evaluated our WEASEL classifier using the full UCR benchmark set of 85 TSC problems [43].² Furthermore, we compared its performance on two real-life datasets from the smart grid domain; results are reported in Section 5.6.

² The UCR archive has recently been extended from 45 to 85 datasets.

Each UCR dataset provides a train and test split, which we use unchanged to make our results comparable to prior publications. We compare WEASEL to the best published TSC methods (following [2]), namely COTE (Ensemble) [3], 1-NN BOSS [37], Learning Shapelets [15], Elastic Ensemble (EE PROP) [23], Time Series Bag of Features (TSBF) [4], Shapelet Transform (ST) [6], and 1-NN DTW with and without a warping window set through cross-validation on the training data (CV) [23]. A recent study [2] reported COTE, ST, BOSS, and EE PROP as the most accurate (in this order).

All experiments ran on a server running Linux with 2x Intel Xeon E5-2630v3 and 64 GB RAM, using Java JDK x64 1.8. We measured runtimes of all methods using the implementations given by the authors [5] wherever possible, resorting to the code by [2] if this was not the case. For 1-NN DTW and 1-NN DTW CV, we make use of the state-of-the-art cascading lower bounds from [33]. Multi-threaded code is available for BOSS and WEASEL, but we have restricted all codes to use a single core to ensure comparability of numbers. Regarding accuracy, we report numbers published by each author [3, 1, 15, 4], complemented by the numbers published by [41] for those datasets where results are missing (due to the growth of the benchmark datasets). All numbers are accuracies on the test split.

For WEASEL we performed 10-fold cross-validation on the training datasets to find the most appropriate value for the SFA word length l ∈ [4, 6, 8]. We kept c = 4 and chi = 2 constant, as varying these values has only a negligible effect on accuracy (data not shown). We used liblinear with default parameters (bias = 1, p = 0.1 and solver L2R_LR_DUAL). To ensure reproducible results, we provide the WEASEL source code and the raw measurement sheets [42].
5.2 Accuracy

Figure 9 shows a critical difference diagram (introduced in [9]) over the average ranks of the different TSC methods. Classifiers with the lowest (best) ranks are to the right. Groups of classifiers that are not significantly different in their rankings are connected by a bar. The critical difference (CD) length, which represents statistically significant differences, is shown above the graph.

Figure 9: Critical difference diagram on average ranks on 85 benchmark datasets (COTE 2.95, WEASEL 3.35, ST 4.22, BOSS 4.64, EE PROP 4.97, LS 5.29, TSBF 5.51, 1-NN DTW CV 6.65, 1-NN DTW 7.43). WEASEL is as accurate as the state of the art.

The 1-NN DTW and 1-NN DTW CV classifiers are commonly used as benchmarks [23]. Both perform significantly worse than all other methods. Shapelet Transform (ST), Learning Shapelets (LS) and BOSS have a similar rank and competitive accuracies. WEASEL has the best (lowest) rank among all core classifiers (DTW, TSBF, LS, BOSS, ST), i.e., it is on average the most accurate core classifier. This confirms our assumption that the WEASEL pipeline matches the requirements for TSC laid out in Section 4.

Ensemble classifiers generally show compelling accuracies at the cost of enormous runtimes. The high accuracy is confirmed in Figure 9, where COTE [3] is the overall best method. The advantage of WEASEL is its much lower runtime, which we address in Section 5.3.

We performed a Wilcoxon signed rank test to assess the differences between WEASEL and COTE, ST, BOSS, and EE. The p-values are 0.0001 for BOSS, 0.017 for ST, 0.0000032 for EE, and 0.57 for COTE. Thus, at a cutoff of p = 0.05, WEASEL is significantly better than BOSS, ST and EE, yet very similar to COTE.
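For reproducibility, such a pairwise test is a one-liner in scipy; the sketch below assumes two aligned vectors of per-dataset accuracies and is not the exact script used for the numbers above:

from scipy.stats import wilcoxon

def significantly_different(acc_a, acc_b, alpha=0.05):
    # acc_a, acc_b: accuracies of two classifiers on the same datasets
    stat, p = wilcoxon(acc_a, acc_b)
    return p < alpha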
5.3 Scalability

Figure 10 plots for all TSC methods the total runtime on the x-axis in log scale vs. the average accuracy on the y-axis for training (top) and prediction (bottom). Runtimes include all preprocessing steps like feature extraction or selection. Because of the high wall-clock time of some classifiers, we limited this experiment to the 45 core UCR datasets, encompassing roughly N = 17,000 train and N = 62,000 test time series. The slowest classifiers took more than 340 CPU days to train (Linux user time).

Figure 10: Average single prediction time (top) and total training time (bottom) in log scale vs. average accuracy. Runtimes include all preprocessing steps (feature extraction, BOP model building, etc.). WEASEL has a similar average accuracy as COTE but is two orders of magnitude faster. A single prediction takes 38 ms on average.

The DTW classifier is the only classifier that does not require training. The DTW CV classifier requires a training step to set a warping window, which significantly reduces the runtime of the prediction step. Training DTW CV took roughly 186 CPU hours until completion. WEASEL and BOSS have similar train times of 16-24 CPU hours and are one to two orders of magnitude faster than the other core classifiers. WEASEL's prediction time is 38 ms on average, one order of magnitude faster than that of BOSS. LS and TSBF have the lowest prediction times but a limited average accuracy [2, 36]. As expected, the two ensemble methods in our comparison, EE PROP and COTE, show by far the longest training and classification times. On the NonInvasiveFatalECGThorax1, NonInvasiveFatalECGThorax2, and StarlightCurves datasets, training each ensemble took more than 120, 120 and 45 CPU days, respectively.

5.4 Accuracy by Datasets and by Domain

In this experiment we found that WEASEL performs well independent of the domain. We studied the individual accuracy of each method on each of the 85 different datasets, and also grouped datasets by domain to see if different methods have domain-dependent strengths or weaknesses. We used the predefined grouping of the benchmark data into four types: synthetic, motion sensors, sensor readings and image outlines. Image outlines result from drawing a line around the shape of an object. Motion recordings can result from video captures or motion sensors. Sensor readings are real-world measurements like spectrograms, power consumption, light sensors, starlight curves or ECG recordings. Synthetic datasets were created by scientists to have certain characteristics. For this experiment, we only consider the non-ensemble classifiers.

Figure 11 shows the accuracies of WEASEL (black line) vs. the six core classifiers (orange area). The orange area shows a high variability depending on the datasets. Overall, the performance of WEASEL is very competitive for almost all datasets. The black line is mostly very close to the upper outline of the orange area, indicating that WEASEL's performance is close to that of its best competitor. In total, WEASEL has 36 wins out of 85 datasets against the group of six core classifiers. On 69 (78) datasets it is within 5% (10%) of the best classifier. The no-free-lunch theorem implies that there is no single classifier that can be best for all kinds of datasets. Table 1 shows the correlation between the classifiers and each of the four dataset types. It gives an idea of when to use which kind of classifier based on dataset type. E.g., when dealing with sensor readings, WEASEL is likely to be the best, with 48.6% wins. Overall, WEASEL has the highest percentage of wins in the groups of sensor readings, synthetic and image outline datasets. Within the group of motion sensors, it performs equally well as LS and ST.

Figure 11: Classification accuracies for WEASEL vs. the best six core classifiers (ST, LS, BOSS, DTW, DTW CV, TSBF) on the 85 datasets, ordered by type. The orange area represents the six core classifiers' accuracies. Red (green) dots indicate where WEASEL wins (evens out) against the other classifiers.

Table 1: Percentage of all first ranks (wins) separated by dataset type: synthetic, motion sensors, sensor readings and image outlines. Rows may not add up to 100% due to shared first ranks.

                 WEASEL  DTW CV  DTW    BOSS   LS     TSBF   ST
Image Outline    39.4%   12.1%   9.1%   27.3%  3.0%   18.2%  21.2%
Motion Sensors   25.0%   8.3%    0.0%   16.7%  25.0%  25.0%  25.0%
Sensor Readings  48.6%   8.6%    8.6%   17.1%  20.0%  14.3%  25.7%
Synthetic        60.0%   0.0%    20.0%  40.0%  0.0%   20.0%  0.0%

The main advantage of WEASEL is that it adapts to variable-length characteristic substructures by calculating discriminative features in combination with noise filtering. Thus, all datasets that are composed of characteristic substructures benefit from the use of WEASEL. This applies to most sensor readings like EEG or ECG signals (CinC_ECG_torso, ECG200, ECG5000, ECGFiveDays, NonInvasiveFatalECG_Thorax1, NonInvasiveFatalECG_Thorax2, TwoLeadECG, ...), but also mass spectrometry (Strawberry, OliveOil, Coffee, Wine, ...) or recordings of insect wing-beats (InsectWingbeatSound). These are typically noisy and have variable-length, characteristic substructures that can appear at arbitrary time stamps [18]. ST also fits this kind of data but, in contrast to WEASEL, is sensitive to noise.

Image outlines represent contours of objects. For example, arrow-heads, leaves or planes are characterized by small differences in the contour of the objects. WEASEL identifies these small differences by the use of feature weighting. In contrast to BOSS, it also adapts to variable-length windows. TSBF does not adapt to the position of a window in the time series. ST and WEASEL adapt to variable-length windows at variable positions, but WEASEL also offers noise reduction, thereby smoothing the contour of an object.

Overall, if you are dealing with noisy data that is characterized by windows of variable lengths at variable positions, possibly containing superfluous data, WEASEL might be the best technique to use.
5.5 Influence of Design Decisions on WEASEL's Accuracy

We look into the impact of three design decisions on the WEASEL classifier:

• The use of a novel supervised symbolic representation that generates discriminative features.

• The novel use of bigrams, which adds order variance to the bag-of-patterns approach.

• The use of multiple window lengths to support variable-length substructures.

We cannot test the impact of the Chi-squared test, as the feature space of WEASEL is not computationally feasible without feature selection (see Section 4.1).

Figure 12: Impact of design decisions on ranks (supervised + bigrams 2.12, unsupervised + bigrams 2.64, supervised + unigrams 2.74, unsupervised + unigrams 2.85, one window length + supervised + bigrams 4.65). The WEASEL (supervised + bigrams) classifier has the lowest rank over all datasets.

Figure 12 shows the average ranks of the WEASEL classifier where each extension is disabled or enabled: (a) "one window length, supervised and bigrams", (b) "unsupervised and unigrams", (c) "unsupervised and bigrams", (d) "supervised and unigrams", and (e) "supervised and bigrams". The single-window approach is least accurate. This underlines that the choice of window lengths is crucial for accuracy. The unsupervised approach with unigrams is equal to the standard bag-of-patterns model. Using a supervised symbolic representation or bigrams alone slightly improves the ranks; both extensions combined improve the ranks significantly.

The plot justifies the design decisions made as part of WEASEL. Each extension of the standard bag-of-patterns model contributes to the classifier's accuracy. Bigrams add order variance, and the supervised symbolic representation produces disjoint feature sets for different classes. Datasets contain characteristic substructures of different lengths, which is addressed by building a bag-of-patterns using all possible window lengths.

5.6 Use Case: Smart Plugs

Appliance load monitoring has become an important tool for energy savings [14, 13]. We tested the performance of different TSC methods on data obtained from intrusive load monitoring (ILM), where energy consumption is separately recorded at every electric device. We used two publicly available datasets, ACS-F1 [14] and PLAID [13]. The PLAID dataset consists of 1074 signatures from 11 appliances. The ACS-F1 dataset contains 200 signatures from 100 appliances, and we used their intersession split. These capture the power consumption of typical appliances including air conditioners, lamps, fridges, hair-dryers, laptops, microwaves, washing machines, bulbs, vacuums, fans, and heaters. Each appliance has a characteristic shape. Some appliances show repetitive substructures while others are distorted by noise. As the recordings capture one day, they are characterized by long idle periods and some high bursts of energy consumption when the appliance is active. When active, appliances show different operational states.

Figure 13 shows the accuracy and runtime of WEASEL compared to the state of the art. COTE did not finish training after eight CPU weeks, thus we cannot report its results yet. ED and DTW do not require training.

WEASEL scores the highest accuracies with 92% and 91.8% on the two datasets. With a prediction time of 10 and 100 ms per TS, it is also fast. Train times of WEASEL are comparable to that of DTW CV and much lower than those of the other high-accuracy classifiers.

On the large PLAID dataset, WEASEL has a significantly lower prediction time than its competitors, while on the small-sized ACS-F1 dataset the prediction time is slightly higher than that of DTW or BOSS. 1-NN classifiers such as BOSS and DTW scale with the size of the train dataset: for larger train datasets they become slower, while for small datasets like ACS-F1 they can be quite fast.

The results show that our approach naturally adapts to appliance load monitoring. These data show how WEASEL automatically adapts to idle and active periods and to short, repetitive characteristic substructures, which were also important in the sensor readings and image outline domains (Section 5.4).

Note that the authors of the ACS-F1 dataset scored 93% [35] using a hidden Markov model and a manual feature set. Unfortunately, their code is not available and the runtime was not reported. Our accuracy is close to theirs, while our approach was not specially adapted for the domain.
6. CONCLUSION AND FUTURE DIRECTIONS

In this work, we have presented WEASEL, a novel TSC method following the bag-of-patterns approach which achieves highly competitive classification accuracies and is very fast, making it applicable in domains with very high runtime and quality constraints. The novelty of WEASEL is its carefully engineered feature space using statistical feature selection, word co-occurrences, and a supervised symbolic representation for generating discriminative words. Thereby, WEASEL assigns high weights to characteristic, variable-length substructures of a TS. In our evaluation on altogether 87 datasets, WEASEL is consistently among the best and fastest methods, and competitors are either at the same level of quality but much slower or equally fast but much worse in accuracy.

In future work, we will explore two directions. First, WEASEL currently only deals with univariate TS, as opposed to multivariate TS recorded from an array of sensors. We are currently experimenting with extensions to WEASEL to also deal with such data; a first approach which simply concatenates the different dimensions into one vector shows promising results, but requires further validation. Second, throughout this work, we assumed fixed sampling rates, which let us omit timestamps from the TS. In future work, we also want to extend WEASEL to adequately deal with TS which have varying sampling rates.

7. REFERENCES

[1] A. Bagnall, L. M. Davis, J. Hills, and J. Lines. Transformation Based Ensembles for Time Series Classification. In Proceedings of the 2012 SIAM International Conference on Data Mining, volume 12, pages 307–318. SIAM, 2012.

[2] A. Bagnall, J. Lines, A. Bostrom, J. Large, and E. Keogh. The Great Time Series Classification Bake Off: An Experimental Evaluation of Recently Proposed Algorithms. Extended Version. Data Mining and Knowledge Discovery, pages 1–55, 2016.

[3] A. Bagnall, J. Lines, J. Hills, and A. Bostrom. Time-Series Classification with COTE: The Collective of Transformation-Based Ensembles. IEEE Transactions on Knowledge and Data Engineering, 27(9):2522–2535, 2015.

[4] M. G. Baydogan, G. Runger, and E. Tuv. A bag-of-features framework to classify time series. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2796–2802, 2013.

[5] BOSS implementation. https://github.com/patrickzib/SFA/, 2016.

[6] A. Bostrom and A. Bagnall. Binary shapelet transform for multiclass time series classification. In International Conference on Big Data Analytics and Knowledge Discovery, pages 257–269. Springer, 2015.

[7] W. Bryc. The normal distribution: characterizations with applications, volume 100. Springer Science & Business Media, 2012.

[8] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[9] J. Demšar. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7:1–30, 2006.