ebook img

The Asymmetric Approximate Anytime Join PDF

12 Pages·2008·1.4 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview The Asymmetric Approximate Anytime Join

The Asymmetric Approximate Anytime Join: A New Primitive with Applications to Data Mining LexiangYe XiaoyueWang DragomirYankov EamonnKeogh UniversityofCalifornia, Riverside {lexiangy,xwang,dyankov,eamonn}@cs.ucr.edu Abstract “John Lennon, 9 October 1940”,andtheothercontains “John W. Lennon, 09-Oct-40”, it is clear that these cor- It has long been noted that many data mining algorithms can respond to the same person, and an algorithm that failed to be built on top of join algorithms. This has lead to a wealth linkthemwouldbeveryundesirable. Incontrast, formany of recent work on efficiently supporting such joins with various dataminingapplicationsofjoinsitisnotreallynecessaryto indexingtechniques. However,therearemanyapplicationswhich findthenearestneighbor,itcansufficetofindanear-enough arecharacterizedbytwospecialconditions,firstlythetwodatasets neighbor. Examplesofusefultasksthatutilizethedetection tobejoinedareofradicallydifferentsizes, asituationwecallan ofnear-enoughneighborsasasubroutineincludeclustering asymmetricjoin. Secondly,thetwodatasetsarenot,andpossibly [6], classification [23], anomaly detection [22] and as we can not be indexed for some reason. In such circumstances the showinSection3.3,historicalmanuscriptannotation. Given time complexity is proportional to the product of the number of this,weshowthatbyframingtheproblemasananytimeal- objects in each of the two datasets, an untenable proposition in gorithm we can extract most of the benefit of the full join mostcases.Inthisworkwemaketwocontributionstomitigatethis algorithminonlyasmallfractionofthetimethatitrequires. situation. We argue that for many applications, an exact solution Anytimealgorithmsarealgorithmsthattradeexecutiontime to the problem is not required, and we show that by framing the forqualityofresults[9]. Inparticular,ananytimealgorithm problemasananytimealgorithmwecanextractmostofthebenefit alwayshasabest-so-faransweravailable,andthequalityof ofajoininasmallfractionofthetimetakenbythefullalgorithm. theanswerimproveswithexecutiontime. Theusermayex- Insituationswheretheexactanswerisrequired,weshowthatwe aminethisansweratanytime,andthenchoosetoterminate can quickly index just the smaller dataset on the fly, and greatly the algorithm, temporarily suspend the algorithm, or allow speed up the exact computation. We motivate and empirically the algorithm to run to completion. Furthermore, we show confirm the utility of our ideas with case studies on problems as that although we are explicitly assuming the data is not in- diverseasbatchclassification,anomalydetectionandannotationof dexedatquerytime,wecanbuildanindexontheflyforthe historicalmanuscripts. smallerdatasetandgreatlyspeeduptheprocess. 1 Introduction. The rest of the paper is organized as follows. The nextsectionoffersmorebackgroundandexplicitlystatesour Many researchers have noted that many data mining algo- assumptions. rithmscanbebuiltontopofanapproximatejoinalgorithm. This has lead to a wealth of recent work on efficiently sup- 1.1 Background and Assumptions. The Similarity Join portingsuchjoinswithvariousindexingtechniques[3],[5], (SJ) combines two sets of complex objects such that the [24],[27].However,wearguethatwhiletheclassicdatabase resultcontainsallpairsofsimilarobjects[3].Itisessentially use of approximate joins for record linkage (entity reso- the classic database join which has been relaxed to allow lution, duplicate detection, record matching, reduplication, linkageoftwoobjectsthatsatisfyasimilaritycriterion. The merge/purge processing database hardening etc.) does re- relatedAllNearestNeighbor(ANN)operationtakesasinput quireafulljoin,manydatamining/informationretrievaluses twosetsofmulti-dimensionaldatapointsandcomputesfor of joins can achieve the same end result with an approxi- eachpointinthefirstsetthenearestneighborinthesecond mate join. Here approximate does not refer to the distance set [5]. Note that this definition allows for points in the measure or rule used to link two objects, but rather to the second set to be unmatched. In this work we introduce the factthatonlyasmallsubsetoftheCartesianproductofthe Asymmetric Approximate Anytime Join (AAAJ) which also two datasets needs to be examined. While the result will allowsobjectsinthesecondsettobeunmatched,however,it not be the same as that of an exhaustive join, it can often differsfromtheaboveinseveralimportantways: be good enough for the task at hand. For example, when performing a classic record linkage, if one dataset contains • Weassumethatthesecondsetismanyordersofmag- nitudelargerthanthefirstset. Insomecasesthesecond setmaybeconsideredeffectivelyinfinite1,forexample, of n Setup Time tshtriesammaiyngbedathtae.setofallimagesontheinternetorsome Quality Solutio Current Solution Interruption Time • The sheer size of the second set means that it cannot be indexed, or can only be “weakly” indexed. For ex- Time S amplewecannotindexthebillionsofhighdimensional images on the WWW, but we can use Google image Figure 1: An abstract illustration of an anytime algorithm. searchtoweaklyorderimagesonsize,dateofcreation Note that the quality of the solution keeps improving up to or most significantly (cf. Section 3.3) keywords sur- timeS,whenthealgorithmisinterruptedbytheuser. roundingthem. • Thevastmajorityofworkinthisareaassumesthedis- outliers, however the data is not indexed due to its high di- tance metric used is the Euclidean distance [3], [5], mensionality and the relative cost and difficulty of building [24], [27]. However, motivated by several real world anindexforadatasetthatmayonlybequeriedafewtimes problems we need to be able to support more general a year. Additional examples include NASA Ames, which measures such as Dynamic Time Warping (DTW), ro- hasarchivedflighttelemetryforonemilliondomesticcom- tationinvariantEuclideandistance,orweightedcombi- mercial flights. Dr. Srivastava, the leader of the Intelligent nationsofindividualmeasuressuchasshape,colorand SystemsDivisionnotesthatlinearscansonthisdatasettake texturesimilarity. twoweeks,butthesizeatdimensionalityofthedatamakes • Given that the second set may be effectively infinite, indexinguntenableevenwithstateofthearttechniques[19]. wemayneedtoabandonanynotionoffindinganexact Giventheabove,wefeelthatourassumptionthatthelarger answer;ratherwehopetofindahighqualityanswer. In ofthedatasetsisnotindexedisareasonableassumptionre- suchcircumstancesweframetheproblemasananytime flectedbymanyrealwordscenarios. algorithm. Themaincontributionofthisworkistoshowthatjoins canbecastasanytimealgorithms. AsillustratedinFigure1 Notethatitiscriticaltothemotivationofthisworkthat anytimealgorithmsarealgorithmsthattradeexecutiontime weassumethatthesecondsetisnot indexed,becausethere for quality of results [9]. In particular, after some small are many excellent techniques for computing all manner of amount of setup-time an anytime algorithm always has a variations of joins when the data is indexed [3], [5], [24], best-so-far answer available, and the quality of the answer [27]. In addition to the reasons noted above, additional improveswithexecutiontime. reasons why the data might not be indexed include the ZilbersteinandRussell[28]giveanumberofdesirable following: propertiesofanytimealgorithms: • The input query could be intermediate results of com- plex database queries (as noted in [27]), or the incre- • Interruptability: After some small amount of setup mentalresultsofadirectedwebcrawl. time, the algorithm can be stopped at anytime and provideananswer. • The high dimensionality of the data we need to con- sider. For example, the five datasets considered in • Monotonicity: the quality of the result is a non- [5]haveanaveragedimensionalityof4.8, thedatasets decreasingfunctionofthecomputationtime. considered in [27] are all two dimensional and even • Measurable quality: the quality of an approximate [24]whichisoptimizedfor“handlinghigh-dimensional resultcanbedetermined. data efficiently” considers at most 64 dimensions. In contrastweneedtoconsiderdatasetswithdimensional- • Diminishing returns: the improvement in solution ityinthethousands,andatleastsomeofthesedatasets qualityislargestattheearlystagesofcomputation,and arenotamiabletodimensionalityreduction. diminishesovertime. At least some anecdotal evidence suggests that many real • Preemptability: the algorithm can be suspended and worlddatasetsareoftennotindexed. ForexampleProtopa- resumedwithminimaloverhead. pas et al. [15] have billions of star light curves (time se- ries measuring the intensity of a star) which they mine for Asweshallsee,wecanframeanapproximateasymmet- ricjoinasanyanytimealgorithmtoachieveallthesegoals. 1WecallthesetofimagesontheWWW“effectivelyinfinite”becauseit Due to their applicability to real world problems, there has isexpandedataratefasterthatthedownloadrateofanyonemachine. beenincreasinginterestinanytimealgorithms. Forexample Algorithm2.1BRUTEFORCEJOIN(A,B) framework. 1: fori←1to|A|do Before considering our approach in the next section, 2: mapping[i].dist←DIST(ai,b1) we will introduce one more idealized strawman that we 3: mapping[i].pointer←1 can compare to. Both flavors of the algorithms discussed 4: for j←2to|B|do above take a single item from one of the two datasets to be 5: d←DIST(ai,bj) joinedandscanitcompletelyagainsttheotherdatasetbefore 6: ifd<mapping[i].dist then consideringthenextitem.Recall,however,thatthedesirable 7: mapping[i].dist←d property of diminishing returns would like us to attempt to 8: mapping[i].pointer← j minimize Q as quickly as possible. For example, assume 9: return mapping that we must scan B in sequential order, but we can choose which objects in A to scan across B, and furthermore we canstartandstopwithdifferentobjectsfromAatanypoint. somerecentworkssuchas[21]and[23]showhowtoframe Supposethatataparticularpointinthealgorithm’sexecution nearest neighbor classification as an anytime algorithm and we could either scan a1 across five items in B to reduce its thattop-kqueriescanalsobecalculatedinananytimeframe- errorfrom10.0to2.5,orwecouldscana2acrosstenitemsin work,and[25]showshowBayesiannetworkstructurecanbe Btoreduceitserrorfrom11.0to1.0.Theformerwouldgive learnedinananytimesetting. us a rate of error reduction of 1.5=(10.0−2.5)/5, while the latter choice would give us a rate of error reduction of 2 TheAsymmetricApproximateAnytimeJoin. 1=(11.0−1.0)/10. In this case, the former choice gives usthefasterrateoferrorreductionandweshouldchooseit. For concreteness of the exposition we start by formalizing Imagine that we do this for every object in A, at every step thenotionoftheAllNearestNeighborquery. ofthealgorithm. Thiswouldgiveusthefastestpossiblerate DEFINITION2.1. GiventwosetsofobjectsAandB,anAll of error reduction for a join. Of course, we cannot actually NearestNeighborquery,denotedasANN query(A,B),finds computethisonthefly,wehavenowayofknowingthebest foreachobjectai∈Aanobjectbj∈Bsuchthatforallother choiceswithoutactuallydoingallthecalculations.However, objectsbk∈B,dist(bj,ai)≤dist(bk,ai). we can compute the best choices offline and imagine that suchanalgorithmexists.Fittingly,wecallsuchanalgorithm Note that in general ANN query(A,B) 6= magic,andcanuseitasanupperboundfortheimprovement ANN query(B,A). We will record the mapping from A wecanmakewithouralgorithms. toBwithadatastructurecalledmapping. Wecandiscover the index of the object in B that a maps to by accessing i 2.1 Augmented Orchard’s Algorithm. While the state- mapping[i].pointer, and we can discover the distance from mentoftheproblemathandexplicitlyprohibitsusfromin- a tothisobjectwithmapping[i].dist. i dexingthelargerdatasetB,nothingpreventsusfromindex- Itisusefulforevaluatinganytimeorapproximatejoins ing the smaller set A. If A is indexed, then we can simply to consider a global measure of how close all the objects sequentiallypullobjectsb fromBandquicklylocatethose in A are to their (current) nearest neighbor. We call this j objectsinAthatarenearertob thantotheircurrentbest-so- Q, the quality of the join and we measure it as: Q = j (cid:229) |A| mapping[i].dist. far. i=1 Whilethereisaplethoraofchoicesforindexingdataset Giventhisnotationwecanshowthebruteforcenested A, there are several special considerations which constrain loop algorithm for the All Nearest Neighbor (ANN) algo- andguideourchoice. Theoverarchingrequirementisgener- rithminAlgorithm2.1 ality,wewanttoapplyAAAJtoverydiversedatasets,some Note that lines 2 to 3 are not part of the classic ANN ofwhichmaybeveryhighdimensional,andsomeofwhich, algorithm. TheysimplymapeverythinginAtothefirstitem forexamplestringsundertheeditdistance,maynotbeami- in B. However, once this step has been completed, we can abletospatialaccessmethods. Withtheseconsiderationsin continuously measure Q as the algorithm progresses, a fact mind we decided to use a slightly modified version of Or- thatwillbeusefulwhenweconsideranytimeversionsofthe chard’salgorithm[14],whichrequiresonlythatthedistance algorithmbelow. measure used be a metric. Orchard’s algorithm is not com- InAlgorithm2.1wehaveAintheouterloopandBinthe monlyusedbecauseitsquadraticspacecomplexityissimply inner loop, a situation we denote as BF AoB (Brute Force, untenable for many applications. However, for most of the AoverB). Wecould,however,reversethissituationtohave practical applications we consider here this is not an issue. B in the outer loop and A in the inner loop. For a batch For example, assume that we can record both the distance algorithmthismakesnodifferencetoeithertheefficiencyor betweentwoobjectsandeachofthevaluesintherealvalued outcomeofthealgorithm. Yet,asweshallsee,itcanmake vectors with the same number of bits. Further we assume a big difference when we cast the algorithm in an anytime a j Algorithm2.2ORCHARDS(A,q) qqbbeesstt__ssoo__ffaarr ddiissttaannccee bboouunndd 1: nn.loc←randomintegerbetween1and|A| 22dd ?? ai 2: nn.dist←DIST(ann.loc,q) 3: index←1 dd qq 4: while P[ann.loc].dist[index]<2×nn.dist and index< a |A|do i dd22 dd11 bj 5: node←P[ann.loc].pointer[index] aann 6: d←DIST(anode,q) 7: ifd<nn.dist then 8: nn.dist←d 9: nn.loc←node 10: index←1 Figure2:(Left)ThetriangularinequalityisusedinOrchard’s 11: else Algorithm to prune the items in A that cannot possibly 12: index←index+1 to be the nearest neighbor of query q. (Right) Similarly 13: return nn the triangular inequality is used in Augmented Orchard’s Algorithm to prune the items in A that are certain to have abetterbest-so-farmatchthanthecurrentqueryq. P[a ].pointerinascendingorderuntiloneofthreethings nn.loc happen. Theendofthelistisreached,orthenextobjecton thelisthasvaluethatismorethantwicethecurrentnn.dist haveafeaturevectorlengthofnperobject. Giventhis, the (line 4). In either circumstance the algorithm terminates. dataset A itself requires O(|A|n) space, and Orchard index The third possibility is that the item in the list is closer to requires O(|A|2) space. Concretely, for the Butterfly exam- thequerythanthecurrenttentativenearestneighbor(line7). ple in Section 3.3 the space overhead amounts to approxi- In that case the algorithm simply jumps to the head of the mately0.04%,andforthelightcurveexampletheoverhead list associated with this new nearest neighbor to the query isapproximately2.8%. BecauseOrchard’salgorithmisnot andcontinuesfromthere(lines8to10). widely known we will briefly review it in the next section. We can see that a great advantage of this algorithm is We refer the interested reader to [14] for a more detailed itssimplicity;itcanbeimplementedinafewdozenlinesof treatment. code. Another important advantage for our purposes is its generality. The function DIST can be any distance metric 2.1.1 AReviewofOrchard’sAlgorithm. Thebasicidea function. In contrast, most algorithms use to index data in of Orchard’s algorithm is to quickly prune non-nearest ordertospeedupjoinsexplicitlyexploitpropertiesthatmay neighbors based on the triangular inequality. In the prepro- not exist in all datasets. For example they may require the cessing stage the distance between each two items in the data to be real valued [5],[27] or may be unable to handle dataset A is calculated. As shown in Figure 2 Left, given mixeddatatypes. aqueryq,ifthedistancebetweenqandanitema indataset Note that this algorithm finds the one nearest neighbor i Aisalreadyknownasd,thenthoseitemsindatasetAwhose to a query q. However, recall that we have a data structure distanceislargerthan2d toa canbepruned. Thedistance calledmappingwhichrecordsthenearestitemtoeachobject i betweentheseitemsandqisguaranteedtobelargerthand in A that we have seen thus far. So we need to adapt the which directly follows the triangular inequality. Therefore, classic Orchard’s algorithm to update not only the nearest noneofthemcanbecomethenearestneighborofq. neighbor to q in A (if appropriate) but all objects a such i Specifically, for every object ai ∈ A, Orchard’s algo- thatDIST(ai,q)<mapping[i].dist. Weconsideramethodto rithm creates a list P[a].pointer which contains pointers adaptthealgorithminthenextsection. i to all other objects in A sorted by distance to a. I.e., the i liststorestheindex, denotedasP[a].pointer[k]), ofthekth 2.1.2 Augmented Orchard’s Algorithm. The results i nearest neighbor to a within dataset A, and the distance from the previous section together with the mapping struc- i P[a].dist[k]betweena andthisneighbor. ture built for dataset A can be utilized into an extended i i This simple array of lists is all that we require for fast schemeforapproximatejoins,whichexhibitstheproperties nearest neighbor search in A. The algorithm, as outlined in ofananytimejoinalgorithmtoo. WetermthisschemeAug- Algorithm2.2,beginsbychoosingsomerandomelementin mentedOrchard’sAlgorithm(seeAlgorithm2.3). Aasthetentativenearestneighbora ,andcalculatingthe Thealgorithmstartswithaninitializationstep(lines1to nn.loc distancenn.dist betweenthequeryqandthatobject(lines1 4)duringwhichthetableofsortedneighborhood listsP[a] i and2). Thereafter, thealgorithminspectstheobjectsinlist iscomputed. AtthispointallelementsinAarealsomapped to the first element in the larger dataset B. To provide for Algorithm2.3AUGMENTORCHARDS(A,B) theinterruptabilitypropertyofanytimealgorithms,weadopt 1: fori←1to|A|do a B-over-A mapping between the elements of the two sets. 2: BuildP[ai] The approximate join proceeds as follows: Dataset B is 3: mapping[i].dist←DIST(ai,b1) sequentiallyscannedfromthesecondelementon,improving 4: mapping[i].pointer←1 the best-so-far match for some of the elements in dataset 5: for j←2to|B|do A. For this purpose, we first invoke the classic Orchard’s 6: nn←Orchards(A,bj) algorithm which finds the nearest neighbor ann.loc to the 7: ifnn.dist<mapping[nn.loc].dist then current query bj and also computes the distance between 8: mapping[nn.loc].dist←nn.dist them(i.e.nn.dist,line6). Ifthequeryimprovesonthebest- 9: mapping[nn.loc].pointer← j so-farmatchofann.loc,thenweupdatetheelement’snearest 10: forindex←1to|A|−1do neighboraswellasitsdistance(lines8and9). 11: node←P[ann.loc].pointer[index] In addition to improving the best match for ann.loc, 12: bound←|nn.dist−P[ann.loc].dist[index]| the query example bj may also turn out to be the best 13: if mapping[node].dist>bound then neighbor observed up to this point for some of the other 14: d←DIST(anode,bj) elements in A. However, we do not need to check the 15: ifd<mapping[node].dist then distances between all ai and the query bj. Instead, we 16: mapping[node].dist←d can use the pre-computed distance list P[ann.loc] and once 17: mapping[node].pointer← j again apply the triangle inequality to prune many of the 18: return mapping distance computations. Concretely, we need to compute the actual distance between b and any a ∈A only if the j i following inequality holds: mapping[i].dist > |nn.dist − DIST(ann.loc,ai)|. Figure 2 Right gives the intuition behind selectingthecentersarerandomselection[4],[26]orselec- this pruning criterion. The “distance bound” represents the tionbasedonsomeheuristic,e.g.maximalremotenessfrom right term of the above inequality. If it is larger or equal anypossibleclustercenter[18].IntheAugmentedOrchard’s to the bound obtained with the best-so-far query to ai, then Algorithm we follow a different approach. Namely, for ev- the new query bj cannot improve the current best match ery query bj we select a different center point ann.loc. This of ai. Note that all terms in the inequality are already ispossibleasintheinitializationstepwehavecomputedall computedasdemonstratedonline12ofthealgorithm,sono pairwisedistancesamongtheelementsinA. Thealgorithm distance computations are performed for elements that fail alsoheavilyreliesonthefactthattheP[a]listsaresortedin i the triangle inequality. Finally, it is important to point out ascendingorderofpairwisedistance. Theassumptionisthat thatana¨ıveimplementationofbothOrchard’salgorithmand b may impact the best-so-far match only for elements that j ourextensionmayattempttocomputethedistancebetween areclosetoit,i.e.itsnearestneighbora andtheclosest nn.loc thequerybj andthesameelementai morethanonce,sowe elementstothatneighborwhicharethefirstelementsinthe keepatemporarystructurethatstoresallelementscomputed P[a ] list. All remaining elements in this list are even- nn.loc thus far and the corresponding distances for each query tually pruned by the triangular inequality on line 13 of the bj. The memory overhead for bookkeeping the structure is algorithm. negligible and at the same time allows us to avoid multiple redundantcomputations. 3 ExperimentsandCaseStudies. Weconclude thissectionbysketchingtheintuitionbe- In this section we consider examples of applications hind the pruning performance of the Augmented Orchard’s of AAAJ for several diverse domains and applica- Algorithm. Choosing a vantage (center) point is a common tions. Note that in every case the experiments are technique in many algorithms that utilize the triangular in- completely reproducible, the datasets may be found at equality, such as [4], [18], [26], etc. In all of these works http://www.cs.ucr.edu/∼lexiangy/AAAJ/Dataset.html. one or more center points are selected and the distance be- tween them and all other elements in the dataset are com- 3.1 A Sanity Check on Random Walk Data. We begin puted. Thesedistancesaresubsequentlyusedtogetherwith with a simple sanity check on synthetic data to normalize thetriangularinequalitytoefficientlyfindthenearestneigh- ourexpectations. Weconstructedadatasetofonethousand borsforincomingqueries. Astowhatpointsconstitutegood randomwalktimeseriestoactasdatabaseA,andonemillion centers, i.e.centerswithgoodpruningabilities, dependson randomwalktimestoactasdatabaseB. Alltimeseriesare severalfactors[18],e.g.datasetdistribution,positionofthe oflength128. centersamongtheotherpoints,aswellasthepositionofthe Figure 3 shows the rate at which the measure Q de- queryinthegivendataspace. Thetwocommonstrategiesin creases as a function of the number of Euclidean distance 0 -2 Table1:Anabridgedtraceofthecurrentclassificationofthe Log(Q/Max(Q)) -4 BF_AoB 30charactersofthestring“SGT PEPPERS LONELY HEARTS -6 BMFa_gBicoA CLUB BAND”vs.thenumberofdistancecalculationsmade -8 AAAJ AAAJ Terminates -10 TTT TTTTTTT TTTTTT TTTTTT TTTT TTTT 30 -12 TTT TTITTTT TTTTTT TTTTTT TTTT TTTT 436 x 108 -14 0 1 2 3 4 5 6 7 8 9 10 TTT TTIITTT TTTTTT TTTTTT TTTT TTTT 437 Number of calls to Euclidean distance function ... ....... ...... ...... .... .... . ... ....... ...... ...... .... .... . ... ....... ...... ...... .... .... . SBT PSPPERS LONEOY REARTF CLUB BANG 21166 Figure3: ThelogofQasafunctionofthenumberofcalls SBT PSPPERS LONEGY REARTF CLUB BANG *22350 totheEuclideandistance. SBT PSPPERS LONEGY REARTF CLUN BANG 23396 ... ....... ...... ...... .... .... . ... ....... ...... ...... .... .... . ... ....... ...... ...... .... .... . SGT PEPPERS LONELY HEARTZ CLUB HAND 182405 comparisons made. Note that this figure does include the SGT PEPPERS LONELY HEARTS CLUB HAND 305794 setup time for AAAJ to build the index, but this time is so smallrelativetotheoveralltimethatitcannotbedetectedin thefigure(ItcanjustbeseeninFigure5,where|B|ismuch smallerrelativeto|A|). algorithms[9]. WecanseethatnotonlydoesAAAJterminate3times The Letter dataset has 26 class labels corresponding to fasterthantheotheralgorithms,buttherateatwhichitmin- the letters from A to Z. The task here is to recognize letters imizestheerrorismuchgreater,especiallyinthebeginning. given16featuresextractedfromimagedata.Thereareatotal This of course is the desirable diminishing returns property of 20,000 objects in the dataset. We decided on a 30-letter foranytimealgorithms. Noteinparticularthattherateofer- familiarphrase,andrandomlyextracted30objectsfromthe rorreductionforAAAJisveryclosetomagic,whichisthe training set to spell that phrase. We ran AAAJ with the 30 optimalalgorithmpossible,giventheassumptionsstatedfor letterscorrespondingtoA, andthe19,970lettersremaining it. inthetrainingsetcorrespondingtoB. Table1showsatrace oftheclassificationresultsasthealgorithmprogresses. 3.2 Anytime Classification of Batched Data. There has After the first 30 distance comparisons (corresponding beenrecentinterestinframingtheclassificationproblemas to lines 1 to 4 of Algorithm 2.3), every letter in our phrase ananytimealgorithm[21],[25].Theintuitionisthatinmany points to the first letter of dataset B, which happens to be caseswemustclassifydatawithoutknowinginadvancehow “T”.Thereafter,wheneveraniteminAisjoinedwithanew much time we have available. The AAAJ algorithm allows object in B, its class label is updated. When the algorithm ustoconsideraninterestingvariationoftheproblemwhich terminates after 305,794 distance calculations, the target to our knowledge has not been addressed before. Suppose phraseisobviousinspiteofonemisspelling(“HAND”instead that instead of been given one instance to classify under of “BAND”). Note, however, that after the AAAJ algorithm unknowncomputationalresourceswearegivenacollection has performed just 7.3% of its calculations, its string “sbt ofunlabeledinstances. Inthiscircumstancewecantrivially psppers lonegy reartf club bang” is close enough to useAAAJasananytimeclassifier. the correct phrase to be autocorrected by Google, or to be Supposewearegivenkobjectstoclassify,andweintend recognized by 9 out of 10 (western educated) professors at to classify them using the One Nearest Neighbor (1NN) UCR. Figure 4 gives some intuition as to why we can cor- algorithmwithtrainingsetB. However,itispossiblethatwe rectlyidentifythephraseafterperformingonlyasmallfrac- maybeinterruptedatanytime(Paper[21]discussesseveral tionofthecalculations. WecanseethatAAAJbehaveslike realworldapplicationdomainswherethismayhappen). In anidealanytimealgorithms,derivingthegreatestchangesin suchascenarioitwouldmakelittlesensetousetheclassic theearlypartofthealgorithmsrun. brute force algorithm in Algorithm 2.1 to classify the data. In this experiment the improvement of AAAJ over This algorithm might perfectly classify the first 20% of the BF BoA is clear but not very large. AAAJ terminates k objects before interruption, but then its overall accuracy, with 433,201 calculations (including the setup time for the includingthedefaultrateonthe80%ofthedataitcouldnot Orchard’s algorithms), but the other algorithms only take get to in time, would be very poor. Such a situation can 598,801. However,thisisbecausethesizeofAwasamere arise in several circumstances, for example robot location 30, asweshallseeinthenextexperimenttheimprovement algorithms are often based on classifying multiple “clues” becomesmoresignificantforlargerdatasets. aboutlocationandcombiningthemintoasinglebestguess, In Figure 5 we see the results of a similar experiment, androbotnavigationalgorithmsaretypicallycastasanytime thistimethedataisstarlightcurves(discussedinmoredetail 1 Q/Max(Q) 1 0.8 BF_AoB 0.8 BF_BoA Magic 0.6 0.6 AAAJ 0.4 0.4 0.2 Accuracy 0.2 0 200,000 400,000 600,000 0 0 200,000 400,000 600,000 Number of calls to Euclidean distance function Number of calls to Euclidean distance function Figure4: (Left)ThenormalizedvalueofQasafunctionof the number of calls to the Euclidean distance. (Right) The accuracy of classification, averaged over the 30 unknown letters,asafunctionofthenumberofcallstotheEuclidean distance. 1 Q/Max(Q) 0.9 0.8 0.8 BF_AoB BF_BoA AAAJ 0.7 Terminates Magic 0.6 AAAJ 0.6 AAAJ 0.4 Terminates 0.5 Accuracy 0.2 0.4 0 2,000,000 4,000,000 6,000,000 0 2,000,000 4,000,000 6,000,000 Figure 6: Page 967 of “The Theatre of Insects; or, Lesser Number of calls to Euclidean distance function Number of calls to Euclidean distance function living Creatures...” [13], published in 1658, showing the “Day”butterfly. Figure5: (Left)ThenormalizedvalueofQasafunctionof the number of calls to the Euclidean distance. (Right) The accuracyofclassification,averagedoverthe1,000starlight datawillnaturallybetext,therewillalsobetensofmillions curves,asafunctionofthenumberofcallstotheEuclidean of pages of images. Consider as an example the lithograph distance. ofthe“Day”butterflyshowninFigure6. The image was published in 1658, therefore predating thebinomialnomenclatureofLinnaeusbyalmostacentury. in Section 3.4), with |A| = 500 and |B| = 8,736. Each time Soamodernreadercannotsimplyfindthebutterflyspecies serieswasoflength1,024. by looking it up on the web or reference book2. However, At this scale it is difficult to see the advantage of the theshapeiswelldefined,andinspiteofbeinginblackand proposedalgorithm. However,considerthis. Afterperform- whitetheauthor’shelpfulnotestellusthatitis“forthemost ing1,000,000distancecalculationstheAAAJalgorithmhas part yellowish, those places and parts excepted which are reduced the normalized value of Q to 0.1640. On the other hereblackedwithinke”. hand, the BF BoA algorithm must perform 4,342,001 dis- Withacombinationofinformationretrievaltechniques tance calculations to reduce the normalized Q to the same andhumaninsightwecanattempttodiscoverthetrueiden- level.Thefollowingobservationisworthnoting. Whileall4 tifyofthespeciesillustratedinthebook.Notethataqueryto algorithms have identical classification accuracy when they Google image search for “Butterfly” returns approximately terminate (by definition), the accuracy of AAAJ is actually 4.73millionimages(on9/12/2007).Alargefractionofthese slightlyhigherwhileitisonly80%completed.Thisisnotan are not images of butterflies, but of butterfly valves, swim- error, it is simply the case that this particular run happened mersdoingthebutterflystroke, greetingcardsetc. Further- tohaveafewmislabelobjectstowardstheendofdatasetB, more,oftheimagesactuallydepictingbutterflies,manyde- andthoseobjectscauseseveralobjectsinAtobemislabeled. pictthemincomplexscenesthatarenotamiabletostate-of- the-art segmentation algorithms. Nevertheless, a surprising 3.3 Historical Manuscript Annotation. Recent initia- fractionoftheimagescanbeautomaticallysegmentedwith tivesliketheMillionBookProjectandGooglePrintLibrary Project have already archived several million books in dig- ital format, and within a few years a significant fraction of 2Thereisa“Daymoth”(Apinacallisto),howeveritlooksnothinglike world’sbookswillbeonline[10]. Whilethemajorityofthe theinsectinquestion. 1.0 Q/Max(Q) BF_AoB 0.8 BF_BoA Magic AAAJ 0.6 0.4 “Day Butterfly” x 105 0.2 0 2 4 6 8 10 12 14 16 Number of calls to Euclidean distance function Figure 8: The normalized value of Q as a function of the numberofcallstotheEuclideandistance. the University of Padova3 [1]. However, the work assumes unlimitedcomputationresourcesandahighdegreeofhuman intervention. Here we attempt to show that AAAJ could allow real time exploration of historical manuscripts. We can simply point to an image and right-click and choose Eastern tiger swallowtail annotation-search. At this point the user can provide some (Papilioglaucus) text clues to seed the search. In this case we assume the word“butterfly”isprovidetothesystem,whichthenissues aGoogleimagesearch. Figure 7: We can compare two shapes by converting them While there are still some image processing and web toone-dimensionalrepresentationsandusinganappropriate crawling issues to be resolved we consider this problem an distancemeasure,inthiscaseDynamicTimeWarping. ideal one to test the utility of AAAJ, so we conducted the followingexperiment. Dataset A consists of 35 drawings of butterflies. They are taken from manuscripts as old as 1783 and as recent as 1968. In every case we know the correct species label be- simple off the shelf tools (we use Matlab’s standard image causeiteitherappearsinthetext(insomecasesinGerman processing tools with default parameters). As illustrated in or French which we had translated by bilingual entomolo- Figure 7, for those images which can be segmented into a gists) or because the entomologist Dr. Agenor Mafra-Neto single shape, we can convert the shapes into a one dimen- was able to unambiguously identify it for us. Data B con- sionalrepresentationandcomparethemtoourimageofin- sist of 35 real photographs of corresponding insects, plus terest[11]. 691otherimagesofbutterflies(includingsomeduplicatesof While the above considers just one image, there are theabove)plus44,215randomshapes. Therandomshapes many online collections of natural history which have hun- come from combining all publicly available shape datasets, dredsorthousandsofsuchimagestoannotate. Forexample, andincludeimagesoffish,leafs,bones,fruit,chickenparts, AlbertusSeba’s18thcenturymasterwork[17]hasmorethan arrowheads,tools,algae,andtrademarklogos. Bothdatasets 500drawingsofbutterfliesandmoths,andthereareatleast arerandomlyshuffled. twenty 17th and 18th century works of similar magnitude WhileweusedDynamicTimeWarpingasthedistance (see[7],page86forapartiallist). measure in Figure 7, it is increasingly understood that for Webelievethatannotatingsuchcollectionsisaperfect large datasets the amount of warping allowed should be applicationofAAAJ.Wehaveacollectionofafewhundred reduced[16],[19]. Forlargeenoughdatasetstheamountof objects,thatwewanttolinktoasubsetofahugedataset(the warpingallowedshouldbezero,whichissimplythespecial images on the internet). An exhaustive join is not tenable, caseofEuclideandistance. nor is it needed. We don’t need to find the exact nearest Figure8showstherateofreductionofQonthebutterfly matchtoeachbutterfly,justonethatisnear-enoughtolikely dataset. bethesameorrelatedspecies. Whilewecanseethattherateoferrorreductionisvery Once we have a near-enough neighbor we can use fast,itisdifficulttogetcontextfortheutilityofAAAJfrom any meta tags or text surrounding the neighbor to annotate our unknown historical images. The basic framework for 3The work at the University of Padova and related projects consider annotationofhistoricalmanuscriptsinthiswayhasbeenthe mostlyreligioustexts,inparticularilluminatedmanuscripts. Weknowof subjectofextensivestudybyseveralgroups,mostnotablyat nootherresearcheffortthatconsidershistoricalscientifictexts. this figure. Consider therefore Figure 9. Here we show in DatasetA 10% Time 100% Time theleftmostcolumntheoriginalhistoricalmanuscripts(note that at this resolution they are very difficult to differentiate from photographs). After 10% of the eventual time had passed we took a snapshot of the current nearest neighbors ofA. Figure9(center)showsthetopeight,asmeasuredby mapping[i].dist. Intherightmostcolumnweshowthefinal result (which is the same for all algorithms). It is difficult Agrias Agrias Agriasbeata sardanapalus sardanapalus to assign a metric to these results, since precision/recall or accuracycannoteasilybemodifiedtogivepartialcreditfor discovering a related species (or a mimic, or convergently evolvedspecies). However,wecanintuitivelyseethatmuch of the utility of conducting this join is captured in the first 10%ofthetime. Troidesamphrysus Troidesamphrysus Troidesmagellanus 3.4 Anytime Anomaly Detection. In certain domains data observations come gradually over time and are subse- quently accumulated in large databases. In some cases the input data rate may be very high or data may come from multiplesources,whichmakesithardtoguaranteethequal- ity of the stored observations. Still we may need to have Papilioulysses Papiliokarna Papilioulysses some automatic way to efficiently detect whether incoming observationsarenormalorrepresentsevereoutliers,withre- spect to the data already in the database. A simple means to achieve this is to apply the nearest neighbor rule and to findoutwhetherthereissomesimilarobservationstoredso Paridesphiloxenus Pachliopterapolyphontes Pachlioptera far.Dependingontheaccumulateddatasetsizeandtheinput polyeuctes rate, processing all incoming observation in online manner withtheabovesimpleproceduremaystillbeintractable. At thesametimeitisoftenundesirableand,aswedemonstrate Papilioantimachus Heliconiusmelpomene Papilioantimachus below, unnecessary to wait for the whole process to finish andthenrunofflinesomedatacleaningprocedure. Whatwe canuseinsteadisananytimemethodthatcomputestheap- proximate matches to small subsets of elements in a batch mode,beforethenextsubsetofobservationhasarrived. Be- low we demonstrate how our AAAJ algorithm can help us Papiliokrishna Papiliokarna Papiliokarna iruana iruana withthis. As a concrete motivating example consider star light curves,alsoknownasvariablestars. TheAmericanAssoci- ationofVariableStarObservershasadatabaseofover10.5 million variable star brightness measurements going back Papiliodemodocus Papiliobromius Graphiumsarpedon over ninety years. Over 400,000 new variable star bright- nessmeasurements areadded tothedatabase every year by over700observersfromallovertheworld[12],[15]. Many oftheobjectsaddedtothedatabasehaveerrors. Thesources of these errors range from human error, to malfunctioning observationalequipment,tofaultsinpunchcardreaders(for Papiliohesperus Papilioblumei Papilioascalaphus attemptstoarchivedecadeoldobservations)[12]. Anobvi- ouswaytocheckforerrorsistojointhenewtentativecollec- tion(A)totheexistingdatabase(B). IfanyobjectsinAare Figure 9: (Left column) Eight sample images from the unusually far from their nearest neighbor in B then we can dataset, a collection of butterfly images from historical singlethemoutforcloserinspection. Theonlyproblemwith manuscripts. (Centercolumn)Thebestmatchingimagesaf- thisideaisthatafulljoinonthe10.5millionobjectdatabase terAAAJhasseen10%ofthedata. (Rightcolumn)Thebest matchingimagesafterAAAJhasseenallofthedata. 6 6 50 4 4 40 Statistics for the nearest neighbor distance of AAAJ mmeeaann+-ssttdd 30 mean 2 2 maximum 20 0 0 10 -2 -2 0 100,000 200,000 300,000 400,000 200 400 600 800 1000 200 400 600 800 1000 (a) (b) 50 6 6 40 4 4 30 2 2 20 10 0 0 -2 -2 0 100,000 200,000 300,000 400,000 Number of calls to Euclidean distance function 200 400 600 800 1000 200 400 600 800 1000 (c) (d) Figure 11: Average nearest neighbor distance with respect to the performed Euclidean distance comparisons. (Top) A Figure10: Variouslightcurvevectorsinthedataset. (a)and iscomposedonlyofnormalelements. (Bottom)Acontains (b)arenormalones. And(c)and(d)areoutliers. oneoutlier. takes much longer than a day. As we shall see, doing this experiments. In the first one dataset A consists of 100 withAAAJallowsustospotpotentiallyanomalousrecords randomlyselectedexamplesamongtheexamplesannotated muchearlier. as normal. For the second experiment, we replace one For this experiment we use a collection of star light of the normal examples in A (selected at random) with a curvesdonatedbyDr.PavlosProtopapasofTimeSeriesCen- knownoutlier. Thisoutlierisnotbelievedtobearecording ter at Harvard’s Initiative for Innovative Computing. The or transcription error, but an unusual astronomical object. dataset that we have obtained contains around 9,000 such Figure 11 shows the improvement in the average nearest light curves from different star classes. Each example is a neighbor distance in dataset A, again as a function of the timeseriesofsize1,024.Figure10showsfourdifferentlight performed Euclidean distance computations. The maximal curveshapesinthedata. Supposethatadomainexpertcon- nearest neighbor distance is also presented in the graphs. sidersasoutliersexampleswhosenearestneighbordistance The top graph corresponds to the experiment where dataset ismorethanapredefinedthresholdoft standarddeviations Aiscomposedonlyofnormalelements,andthebottomone away from the average nearest neighbor distance [15]. For is for the dataset with one outlier. The experiment shows thelightcurvedatawesett =5standarddeviations, which how efficiently the AAAJ algorithm can detect the outliers. capturesmostoftheanomaliesasannotatedbytheexpertas- The average distance drops quickly after performing only a tronomers.Examples(c)and(d)inFigure10,showtwosuch limited number of computations. After that the mean value lightcurves,while(a)and(b)arelessthan5standarddevi- stabilizes. At this point we can interrupt the algorithm and ationsfromtheirnearestneighborsandthereforearetreated computethedeviationforeachelementinA. Theelements asnormal. that fail the expert provided threshold t are likely to fail it Now assume that the light curve observations are hadweperformedafulljointoo. Inthesecondexamplewe recorded in batch mode 100 at a time. We can apply the areablewithhighconfidencetoisolatetheoutlierinonlya AAAJ algorithm to every such setof incoming light curves fewthousanddistancecomputations. (representingdatasetAinthealgorithm)andfinditsapprox- imatematchwithinthedatabase(setB)interruptingthepro- 4 ConclusionsandFutureWork cedure before the next subset of light curves arrives. If the Inthisworkwehavearguedthatformanydataminingprob- maximalapproximatenearestneighbordistanceinAismore lemsanapproximatejoinmaybeasusefulasanexhaustive thanthethresholdof5standarddeviationsfromtheaverage join. Given the difficulty of figuring out exactly “how ap- nearestneighbordistance,thenbeforecombiningAwiththe proximate” is sufficient, we show that we can cast joins as rest of the database we remove from it the outliers that fail anytime algorithms, in order to use as much computational thethreshold. resourcesasareavailable.Wedemonstratedtheutilityofour To demonstrate the above procedure we conduct two ideaswithexperimentsondiversedomainsandproblems.

Description:
asymmetric join. Secondly data mining applications of joins it is not really necessary to . has archived flight telemetry for one million domestic com-.
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.