ebook img

Anytime Measures for Top-k Algorithms - CSE SERVICES PDF

12 Pages·2007·0.47 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Anytime Measures for Top-k Algorithms - CSE SERVICES

Anytime Measures for Top-k Algorithms Benjamin Arai Gautam Das Dimitrios Gunopulos Univ. of California, Riverside Univ. of Texas, Arlington Univ. of California, Riverside [email protected] [email protected] [email protected] Nick Koudas Univ. of Toronto [email protected] ABSTRACT improvesgraduallyascomputationtimeincreases[13]. Although severaltypesofsuchalgorithmshavebeenproposed,interruptible Top-kqueriesonlargemulti-attributedatasetsarefundamentalop- anytimealgorithmsarehighlypopularanduseful.Aninterruptible erations ininformation retrieval andranking applications. Inthis anytimealgorithmisanalgorithmwhoseruntimeisnotdetermined paper,weinitiateresearchontheanytimebehavioroftop-kalgo- inadvancebutatanytimeduringexecutioncanbeinterruptedand rithms. Inparticular,givenspecifictop-kalgorithms(TAandTA- returnaresult. Moreover,interruptiblealgorithmshaveanassoci- Sorted) we are interested in studying their progress toward iden- atedperformanceprofilewhichreturnsresultquality(forsuitably tificationof the correct result at any point during thealgorithms’ definednotionsofquality)asafunctionoftime(relativetoexecu- execution. Weadoptaprobabilisticapproachwhereweseektore- tion) for a problem instance. Such algorithms are valuable since portatanypointofoperationofthealgorithmtheconfidencethat at any point during the execution a user can obtain feedback re- thetop-k resulthasbeenidentified. Suchafunctionalitycanbea gardingtheresultqualityatthatpoint. Ifoneissatisfiedwiththe valuableassetwhenoneisinterestedinreducingtheruntimecost currentfeedbackonemaybringthealgorithmtoahalt.Thus,such oftop-kcomputations. Wepresentathoroughexperimentalevalu- algorithmsprovideagracefultrade-offbetweenresultqualityand ationtovalidateourtechniquesusingbothsyntheticandrealdata responsetime. sets. Inthispaperweinitiateastudyofanytimetop-kalgorithms.We studythebehaviorofcommontop-kalgorithmsatanypointoftheir 1. INTRODUCTION executionandwereasonabouttop-kresultquality.Noticethatthis Top-k queries on large multi-attribute databases are common- notionofanytimetop-kcomputationissignificantlydifferentfrom place. Suchqueriesreportthek highest rankingresultsbasedon thenotionofapproximatetop-k algorithmspreviouslyintroduced similarityscoresofattributevaluesandspecificscoreaggregation in the literature[6, 3]. Such models aim to relax the control pa- functions.Suchqueriesareveryfrequentinamultitudeofapplica- rametersofthecomputation (e.g., distance) whicharedifficultto tionsincluding(a)multimediasimilaritysearch(onimages,audio, translateintoguaranteesperceivedbyauser. Theactualbehavior etc.),(b)preferencequeriesexpressedonattributesofassorteddata ofsuchmodelsremainslargelyempirical. Incontrastwewishto types, (c) Internet searches on scores based on word occurrence monitoratop-kalgorithmatanypointinitsexecutionandreason statisticsanddiversecombiningfunctions,and(d)sensornetwork about result quality. For large data collections such an approach applicationsoverstreamsofsensormeasurements. canbesignificantlybeneficialasonemaydecidetoterminatethe Severalalgorithmshavebeenintroducedinliteraturetoefficiently computationearlyifoneissatisfiedwiththecurrentqualityofthe performtop-kcomputations. AmongthemostsuccessfulistheTA results.Inparticularwemakethefollowingcontributions: algorithmdiscoveredindependentlybyNepalet.al.,[21],Guntzer • We initiate the study of anytime top-k computations. We et.al.,[12]andFaginet.al.,[23].Inthisalgorithmeachvalueofan presentaframework,withinwhichatanypointinqueryexe- attributecanbeaccessedindependentlyviaanindexindescending cutionforsuitabletop-kalgorithms,wecancomputeproba- orderofitsscore. Suchascoreiscomputedwithaspecificquery bilisticestimatesofseveralmeasuresoftop-kresultquality. condition. Numerous algorithms for performing top-k computa- Suchmeasuresincludeconfidenceofhavingthecorrecttop- tionshavebeenproposed[10,8,7,2,19,16,1,15,20]depending kresult,precisionoftheresultsassessedwithrespecttothe onthemodelofdataaccess,stoppingconditions,etc.Themajority correcttop-kresults,rankdistancebetweenthecurrenttop- ofsuchcomputationshowevercanbeexhaustive. Thealgorithms kresultandtheexactresult,aswellasthedifferencebetween cometoastoponlywhenthereisabsolutecertaintythatthecorrect thescoresofthecurrenttop-kresultandtheexactresult. top-kresulthasbeenidentified. An anytime algorithm is an algorithm whose quality of results • We investigate the monotonic properties of these anytime measures for varioustop-k algorithmssuch asTAand TA- Sorted. Weshow thatsuchmeasuresaremonotoneforTA, Permissiontocopywithoutfeeallorpartofthismaterialisgrantedprovided butforasingleinstanceofatop-kcomputationofTA-Sorted, thatthecopiesarenotmadeordistributedfordirectcommercialadvantage, thesemeasurescanbenon-monotonic,thoughinexpectation theVLDBcopyrightnoticeandthetitleofthepublicationanditsdateappear, suchmeasuresaremonotonic. andnoticeisgiventhatcopyingisbypermissionoftheVeryLargeData BaseEndowment. Tocopyotherwise, ortorepublish, topostonservers • WepresentalgorithmicenhancementstoTAandTA-Sorted ortoredistributetolists,requiresafeeand/orspecialpermissionfromthe by which they can provide such anytime guarantees with publisher,ACM. smallruntimeoverheadsduringthecourseoftheirexecution VLDB‘07,September23-28,2007,Vienna,Austria. overlargedatacollections. Copyright2007VLDBEndowment,ACM978-1-59593-649-3/07/09. 2. RELATED WORK However,evenifeachoftheunseentupleshasaverysmallproba- Thethresholdalgorithm(TA)constitutesthestateoftheartfor bilityofbeinginthetop-ktuples,collectivelytheremaybealarge top-k computations[21,12,23]. SeveralvariantsofthebasicTA probability thatwedo not havethecorrect top-k tuples. For e.g. ideas have been considered in various contexts [15, 20, 16]. [1] (assumek =1),ifeachunseentuplehasonly1%chanceofbeing dealswithtop-kproblemsonwebaccessibledatasourceswithlim- thetop-1tuple,andwehave1,000,000unseentuples,theprobabil- itedsortedaccess. Nearestneighbortypeofapproacheshavebeen itythatanyoneofthemisthetruetop-1is1−0.991,000,000 ≈1. consideredinthiscontextaswell[5,4,26,14]. Itisassumedthat Incontrast,ourworkismoregeneralinthatweproposeanytime sorted lists of the data items by each attribute are available, and enhancementstobothTAandTA-Sorted. Moreover,ourmethods TAscanstheselists(performingsortedaccesses)inaninterleaved dependonacarefulconsiderationofthenumberofunseentuples, manner,andcomputestheitemswithtop-kscoresusingmonotone whichisnecessarytogivecorrectprobabilisticguarantees. scorecombiningfunctions.Thealgorithmhastoimmediatelycom- Recentwork[18,24]onprobabilisticrankingofdata,isorthog- putethecompletescoreforeachitemencounteredintheselists.In onaltotheworkpresentedhere.Themodelassumedintheseworks ordertodosohowever,itconductsrandomaccessestoallrelevant isthatofincompletedataandtheprobabilisticframeworkisbased listsandthusitsoverheadmaybehighdependingontheapplica- onpossibleworldssemantics. Incontrastweassumecompletein- tioncontext. Fortherestofthispaperwerefertothisalgorithmas formationwithorwithoutnoiseandweareinterestedinassigning TA. guaranteesonearlystoppingofpopulartop-kalgorithms. Several variants of this basic idea have been proposed. TA- Sorted[23,12]canworkinenvironmentswhererandomaccessis 3. FRAMEWORK notavailable.Itmaintainsworstandbestscoresforitemsbasedon partiallycomputedtotalscores; thealgorithmcomparestheworst 3.1 AnytimeMeasures casescoreofthek-rankeditemwiththebestscoreofallcandidates Ourfocusinthispaperistoupgradetop-kalgorithmssothatthey asastoppingcondition.Inthisalgorithmitemsarealwaysaccessed canexhibitanytimebehavior. Thismeansthatatanypointduring sequentially. Sinceexpensiverandomaccessisavoided,incertain theexecution-i.e.,beforethealgorithmhasterminated-wewishto situationstheperformancemaybemuchbetterthanTA. beableto(a)revealthecurrenttop-kresultscalculatedthusfar,and OptimizationissuesforTAalgorithmshavebeenconsideredas (b)associatea“guarantee”withourcurrentanswers.Forexample, well[19,1,16]. Themainthrusthasbeentoreducethenumberof wemaywishtobeabletogiveprobabilisticguarantees, suchas: randomaccesseswhensourcesvaryinseveralparameters,suchas “Withprobabilityp,thecurrenttop-ktuplesarelikelytobethetrue speed, selectivityetc. Severalstatisticalaidshavebeendeployed, top-k tuples”. Providingsuchprobabilisticguaranteesisthemost suchashistogramsandprobabilisticestimatorsforthenumber of criticalaspectofourapproach, andmuchoftheremainderofthis randomaccesses. paperisdevotedtodevelopingappropriateguaranteemeasuresand AnytimealgorithmshavefoundnumerousapplicationsinAIand efficienttechniquesbywhichsuchmeasurescanbecalculated.Our planning contexts [13, 25]. The quality of results of an anytime goal is to provide a mechanism to continuously recompute these algorithm improves as the computation evolves. At a high level, guaranteesasmoredataisseen. anytimealgorithmscanbecategorizedasbeingeitherinterruptible orcontract. Aninterruptiblealgorithmdoesnothaveasetrunning time and can always be interrupted at any time during execution • Confidence: The algorithms shall be able to determine the returningaresult.Thequalityoftheresultcanbedeterminedviaa probability that thecurrent top-k tuplesareindeed thetrue performanceprofile. Acontractalgorithmhasatimedeadlineasa top-ktuples. parameterandnoassumptionabouttheresultscanbemadebefore • Precision:Thealgorithmsshallbeabletocalculatea(proba- thedeadline. bilistic)lowerboundontheprecisionofthecurrenttop-ktu- Theobald et. al., [17] presented an approach for probabilistic ples-i.e.,thisboundontheprecisionwillholdwithagiven top-k query evaluation. This work is specifically targeted to the probabilityofp(typically,p = 0.95). Theprecisionofthe TA-Sortedalgorithm. Thebasicideais,foranewlyseenitem,to retrievedresultsisdefinedasr/kwhereristhenumberof computetheprobabilitywithwhichitmaybelongtothetop-kre- thecurrenttop-ktuplesthatbelongtothetruetop-ktuples. sult. Ifthatprobabilityisbelowausersuppliedthresholdtheitem isdiscardedfromfurtherconsideration. Thisway,possiblyfewer • Rank Distance: Likewise, the algorithms shall be able to itemsareconsideredduringtop-kqueryevaluation. Moreover,by compute a probabilistic upper bound on the rank distance carefullymaintainingboundsforthescoresofthemostpromising ofthecurrent top-k tuples. Therank-distance isdefinedas (as far as the top-k result is concerned) items that have been en- follows. Let CurRank(t) be the rank of a tuple t in the countered thealgorithmmayprobabilisticallydecidetoterminate currenttop-k,andletTrueRank(t)beitsrankintheentire earlierthantheregularTA-Sorteddeterministiccomputation. Em- databasewhensortedbyscores.Then piricalevaluationpresentedin[17]demonstratedthatthealgorithm performswellinpractice. RankDistance= Theworkof[17]hassomesimilaritytoourwork,howeveritis not an anytime algorithm. It applies only to theTA-Sortedalgo- X |CurRank(t)−TrueRank(t)| rithm,andoffersguaranteesonlyattheendoftheexecution, i.e., t∈CurTopk whenthealgorithmrunsoutofcandidates.Further,sinceitfocuses RankDistanceisrelatedtotheSpearman’sFootrulemeasure onlyoneliminatingcandidatesthatarepartiallyseenbutunlikely forcomparingrankedlists[9]. tobeinthefinaltop-kresult,itisnotdirectlyapplicabletotheTA algorithm. Finally,thealgorithmin[17]onlygivesaprobabilistic • ScoreDistance:Finally,thealgorithmsshallbeabletocom- guarantee thatadiscarded/unseen tupleisnotinthetop-k tuples, puteaprobabilisticupper bound onthedifference between independently of thenumber of unseen tuplesinthedataset. Its thesmallestscoreofthetruetop-ktuplesrelativetothesmall- resultdoesnotchangeifwehave10vs. 1,000,000unseentuples. estscoreofthecurrenttop-ktuples. 3.2 Knowledgeofthe Data Distribution tion3.2,thenPDF(D|D ∈D)istheprobabilitydensityassoci- Tobeabletogiveprobabilisticguaranteeswithouranytimean- atedwitheachspecificdatabaseD. swers, it is critical that we assume some knowledge of the data, Let OneMore(Seend) refer to the space of all possible valid suchasthenumberoftuplesN,aswellasknowledgeofthedistri- prefixes of databases that is defined by extending Seend by one butionalpropertiesofthedata.Suchknowledgecanbeobtainedvia moreiteration. Consideranyspecificextension ofSeend byone popularparametricornon-parametrictechniques(i.e.,histograms). iteration,saySeend+1. Wenotethatapdfoverthisspaceofex- Thesedatadistributionmodelsareassumedtobeeitheravailable tensions,i.e. PDF(Seend+1|Seend+1∈OneMore(Seend)), (e.g., histograms of thedata have been pre-computed, tobe used can be naturally defined. To carry OneMore(Seend) even fur- multiple times for different top-k queries), or can be computed ther,letD(Seend)refertothespaceofallpossiblevalidcomplete ondemand(e.g.,foreachtop-k query, freshhistogramsarecom- databases that can be defined by extending Seend into complete puted).Ourdevelopmentofanytimetop-kalgorithmsdoesnotde- databases,i.e.,afterN −diterations. Thepdfofthesedatabases, pend on theparticular type of distributional knowledge assumed. PDF(D|D∈D(Seend)),canbenaturallydefined. For this reason, we employ a generic probabilistic model of the LetScore(t)bethescoreofatuplet,definedasalinearaddi- datawhichweassumeisknowntous.Wechoosetodosoinorder tivefunctiononthetheindividualattributevaluesintypicaltop-k tokeepthepresentationofourtechniquesgenericandindependent algorithms,suchasScore(t) = w1t[1]+...wMt[M]wherethe ofspecificformsofdatadistributionmodels. weightsarepositiveconstants.LetthekMinScore(Seend)refers Tobemorespecific,letourdatabaseDhaveNtuplesoverMat- tothekthlargestscoreof all tuplesinSeend. Wecan makethe tributesA1,...,AM andletDom1,...,DomM betherespective followingobservation: domainsoftheattributes. Theprobabilitydistributionalmodel of thedatamayeitherbespecified(assumingattributeindependence) OBSERVATION 1. Theminimumscoreofthecurrenttop-k tu- asaproductofknownprobabilitydensityfunctionsgPDFi(x)as- ples increases monotonically as the algorithm progresses on any sociated with each ith attribute (e.g., M single-dimensional his- database. tograms), or as a joint distributional model over the space of all possibletuplesDom1×...×DomM (e.g.,amulti-dimensional kMinScore(Seend)≤kMinScore(Seend+1) histogram).OuractualdatabaseDmaybeassumedtobeaspecific instanceofN tuplesdrawnfromthisdistribution. LetkthScore(D)refertokthlargesttruescoreofalltuplesina specificdatabaseD.ForAnytimeTAletConfidence(Seen )be d 4. ANYTIMETA ALGORITHM definedastheprobabilitythat kMinScore(Seen )=kthScore(D) 4.1 Preliminaries d We begin with a short description of the Threshold Algorithm where D is a random valid extension of Seen into a complete d (TA): The algorithm proceeds in iterations, where in each itera- databasedrawnfromPDF(D|D∈D(Seen )). d tion, the next items in each sorted list are retrieved in parallel. Foreachretrievedtuple-id, theentiretupleisretrievedusingran- THEOREM 1. Foralldatabaseinstancesitholdsthat dom access and its score is computed. The algorithm maintains abounded bufferofsizek inwhichthecurrenttop-k tuples(i.e., Confidence(Seen )≤Confidence(Seen ) d d+1 amongthoseseen)aremaintained.Thealgorithmterminateswhen astopping condition isreached, i.e., whentheminimumscore in Proof: SincethekMinScore(Seen )isincreasingineachitera- the top-k buffer (henceforth referred to as kMinScore is larger d tion,theprobabilityofthekMinScore(Seen )beingequaltothe thanScore(h),whereh=[h1,··· ,hM]isa“hypothetical”tuple kMinScore(D)isalsoalwaysincreasing.2d such that each hi is the last attributevalue read along the sorted orderforAi. 4.2 The Algorithm ConsiderasnapshotofTAafterditerationsforaspecificdatabase TheanytimeversionofTAisshowninAlgorithm1. Thealgo- D. LetSeen bethe“prefix”ofthedatabasethathasbeenseen d rithmproceedslikethestandardTA,selectingattributesinaround- by this algorithm after these d iterations. To be able to estimate robinfashion,andateachstepprocessesthenextvalueinthesorted theanytimemeasures,thealgorithmwillhavetomakesomedistri- listoftheselectedattribute.Inaddition,italsomaintainstheinfor- butional assumptionsabout theremainingportionofthedatabase mationnecessarytocomputeprobabilisticguarantees1. thathasnotyetbeenseen.Intuitively,thealgorithmdeterminesthe Foreachroundofthealgorithmanewvalue< t,t[i] >isread pdfoftheremainderofthedatabasebyconditioningthedatadis- tributionalmodel(discussedinSection3.2)withtheprefixalready along the list Li corresponding to the i-th attribute, i.e., the i-th attributevalueoftuplet.Whenthisitemisread,thealgorithmhas seen,andthencomputesestimatesofeachoftheanytimemeasures to(a)resolveScore(t)(whichisthesumoftheattributesoftandis basedonthisconditionalpdf.Asanexample,assumethatthedata donebyprobingthelistsusingrandomaccess),(b)updatethepdf distributionof D isdefined usingthedistributionsgPDFi along of thei-th attribute(gPDF(i))so that it reflectsthe distribution theithattributeassumingindependenceamongtheattributes,and oftheremainingvaluesofthatattribute, and(c)updatethetop-k leth1,...,hM bethelastvaluesseenalongeachattributerespec- bufferwiththektupleswiththehighestscores.Attheendofeach tively.Thentheithattributeofanyunseentupletintheremainder roundthestatisticsareupdatedandtheconfidenceiscomputed. ofthedatabasewillbearandomvariablet[i]distributedaccording gPDFiconditionedbyt[i]≤hi. 1NotethatunlikethestandardTAalgorithmouralgorithmdoesnot LetPDF(O|O∈O)representstheprobabilitydensityassoci- haveaterminationcondition,sincetheobjectiveistoproduceany- atedwithobjectOthatbelongstoa(possiblyinfinite)setO.Thus timeprobabilisticguarantees.Ouralgorithmcanbeeasilymodified if D refers to the space of all database tables with N tuples that toterminate,forexamplewhentheprobabilisticguaranteescrossa canbegeneratedbytheprobabilisticdatamodeldiscussedinSec- userdefinedthreshold. Algorithm1AnytimeTA Histogram 1 1: topk={dummy1,...,dummyk},Score(dummyi)=0 Histogram 2 Max 2: kMinScore=0//smallestscoreintopkbuffer 3: ford=1toN do 4: foralllistsLi(1≤i≤M)inparalleldo 5: Let<tuple-idt,t[i]>bethed-thiteminLi 6: //ComputeScore(t)usingrandomaccess 7: Score(t)=0 8: forj =1toM do 0 20 40 60 80 100 9: Score(t)+=wjt[j] 10: endfor 11: //UpdatePDFstomodeltheremainingvalues Figure1: An exampleof theresultof themax-convolution of 12: Update-gPDF(gPDFi,t[i]) twodistributions. 13: //Updatetopkbuffer 14: ifScore(t)>kMinScorethen 4.3.1 ComputingConfidence 15: ift6∈topkthen LetSeen(Unseen)refertothesetoftuplesthathavebeenseen 16: Letubethetuplewiththesmallestscoreintopk (unseen) by the algorithm thus far. Clearly |Unseen| = N − 17: topk=topk−{u} |Seen|. To execute the function call ComputeConfidence(), 18: topk=topk∪{t} wehavetoestimateProb(kMinScore > MaxUnseen),where 19: endif kMinScore istheminimum scoreinthetop-k buffer, whilethe 20: kMinScore=min{Score(v)|v ∈ topk} random variable MaxUnseen describes the maximum score of 21: endif all the unseen tuples. The pdf of MaxUnseen can be com- 22: //Computeconfidence puted by firstcomputing thepdf of the scoreof oneUnseen tu- 23: Confidence=ComputeConfidence() ple. Thisinvolvestheconvolutionofthepdfsoftheattributeval- 24: endfor ues: OneUnseenPDF = ∗{gPDFi|1 ≤ i ≤ M}. Then the 25: endfor pdfofMaxUnseen(i.e.,MaxUnseenPDF)canbecomputed by computing the max-convolution over the multi-set containing |Unseen|copiesofOneUnseenPDF: 4.3 Computing AnytimeTA Measures Inthissubsectionwediscussdetailsofhowthevariousanytime MaxUnseenPDF = measuresarecomputed ineachiterationofthealgorithm. Foran ∗ ({OneUnseenPDF,...,OneUnseenPDF}) max unseentuplet,itsscoremaybeviewedasarandomvariable. Let scorePDFt(x)bethepdfofthescoreoft. Inordertocompute As we shall later show in Lemma 4.3 the max-convolution of the anytime measures, we need to compute the pdf of the score identicalpdfscanbeefficientlycomputedinconstanttime. Once of anyunseen tuple, andthepdfof themaximum scoreofall the wehavecomputedMaxUnseenPDF,wecancompute unseentuples.Ifweassumeattributeindependence,thenthescore ofanunseentupleisthesumofM randomvariables. Tocompute Confidence=Prob(kMinScore>MaxUnseen) thepdfofthissumwecomputetheconvolutionofthegPDFi. If joint-distributionsareknown wecanalsoproceed toestimatethe 4.3.2 ComputingOtherAnytimeMeasures pdfofthescore. Weshowbelowhowthisscorecanbeestimated Inthissubsectionweoutlinehow,inadditiontoConfidence,the byconvolutionofpdfsofM independentattributes. anytimemeasuresofPrecision,RankDistance,andScoreDistance canbecomputed. DEFINITION 1. Convolutionoftwodistributions:Assumethat In the case of Precision, we wish to determine (with a given f(x), g(x)are theprobabilitydensityfunctions (pdfs) of thetwo probability p, say 95%) the fraction of the current top-k tuples independent random variables X,Y respectively. The pdf of the that will belong to the true top-k tuples of the database. Let the randomvariableX+Y (thesumofthetworandomvariables)is worst case scores of the current topk tuples be s ,s ,...,s (= theconvolutionofthetwopdfs: 1 2 k kMinScore). LetProbibetheprobabilitythatsiisgreaterthan ∗({f,g})(x)=R0xf(z)g(x−z)dz tMecahxniUqunesseuense.dTfohresceomPpruotbini’gsCcaonnfibdeenccoemapbuotevde,uesxincegptthtehastamwee Thisdefinitioncanbeeasilyextendedtothesumofmorethantwo havetoexecuteitforeachsiratherthatjustforkMinScore.Leti randomvariables. Wealsogiveanotherdefinitionthatallowusto bethelargestintegersuchthatProbi ≥p. Thealgorithmoutputs estimateother aggregates, such asmaxandmin of random vari- i/k asPrecision. NotethatthisisaconservativeboundonPreci- ables. sionbecause weonlyconsider prefixes of the current top-k tobe overlappingwiththetruetop-k,andnotanysubset. DEFINITION 2. Max-convolution of two distributions: As- Inorder tocomputeScoreDistance, our taskistofinda“high sume that f(x),g(x)are thepdfsof tworandom variables X,Y probability”upperboundonthesmallestscoreofthetruetopktu- respectively.Thepdfoftherandomvariablemax(X,Y)(themax- ples.Thus,wewishtofindthesmallestpositivenumberδsuchthat imumofthetwovalues)isthemax-convolutionofthetwopdfs: Prob(kMinScore+δ >MaxUnseen)>pwherepisagiven ∗ ({f,g})(x)=f(x) xg(z)dz+g(x) xf(z)dz probability,suchas95%.OnceweknowthepdfofMaxUnseen, max R0 R0 theanswertothisquestionisstraightforward. Figure1showstheresultofmax-convolutionsovertwogivendis- ComputingRankDistanceismoreinvolved.Themaintaskisto tributions. Themax-convolutiondefinitioncanbeeasilyextended determine,foreachtupletiinthecurrenttopktuples,ahighprob- tomorethantworandomvariables. abilityupperboundforitstruerankinthedatabase(oncewehave these estimates, we can compute a high probability upper bound Proof:TheproofissimilartothatofLemma4.2,exceptthatonce fortheRankDistance). Todetermineanupperboundonthetrue thecumulativepdfofHAhasbeenpre-computed,eachprobability rank of ti, we need to compute how many tuples from Unseen term Prob(Ak ≤ max(A,A,...,A) < Ak+1) reduces to n· havelargerscoresthantiwithhighprobability. Furtherdetailsare HA[k+1]·(Pi≤kHA[i])n−1,whichcanbecomputedinconstant omittedfromthisversionofthepaper. time.2 4.4 Approximating PDFs UsingHistograms 4.5 AnExample Wepresentedourtechniquesthusfarusingagenericprobabilis- ticmodelofdata. Inthissectionwedescribethepracticalrealiza- A1 A2 tion of our methodologies using a widely adopted model for ap- id,val id,val proximatingdatadistributions(i.e.,pdfs),namelyhistograms. For t :0.9 t :0.8 4 5 simplicity of exposition, we adopt equi-width histograms for our t :0.8 t :0.7 2 4 discussion,howeverthedescriptionisapplicabletoanyhistogram t :0.4 t :0.6 3 2 technique.Wenotethathistogramscanapproximatearbitraryfunc- t :0.3 t :0.3 1 1 tionsandthusouruseofhistogramsdoesnotplaceanyrestrictions t :0.2 t :0.2 5 3 orrequireanyassumptionsabouttheunderlyingdistributionsthat arebeingapproximated. Table1:SortedlistsofasampletablewithtwocolumnsA and Thefollowinglemmasdetailtherunningtimeofthebasicoper- 1 A ,tuplest ...t ,andvaluesforeachattributerangingfrom0 ationsofthealgorithm. 2 1 5 to1. LEMMA 4.1. Theconvolutionoftwopdfsthatarerepresented Table1showsthesortedlistsforadatasetwith2attributesand bytwobbuckethistogramscanbecomputedinO(b2)time. 5tuples. Wehaveaqueryforthetop-ktupleswherekisequalto Proof: Considertworandomvariables,A,Binthedomain[0,1] 2,andthescoreforagiventupletiscomputedasalinearadditive functionoftheindividualattributes. with pdfs fA(x),fB(x) respectively. Assume that the two pdfs Throughouttheexample,assumeweuseequi-widthhistograms areapproximatedbytwohistogramswithbbuckets,HAandHB. withatmost2buckets. Atthestart,thebucketsofthehistogram Assume that the bucket boundaries are the same: HA = [0 = A ,...,A = 1]; ifnotwecancreatetwoequivalenthistograms forA1 havecounts (3,2), whilethebuckets ofthehistogram for 1 b with 2b buckets and the same bucket boundaries. Consider the A2havecounts(2,3). Notethateachhistogramcanrepresentthe CartesianproductofthetwohistogramsCA,B =HA×HBwhere correspondinggPDFibynormalizingtorelativecounts. Assumeasnapshotofthealgorithmwherethefirstitemsofeach CA,B[i,j] = HA[i]HB[j](HA[i]istherelativecount associated withbucketi.) WecanapproximatethepdfofA+B withahis- listhasbeenread,andt4andt5havebeenfullyresolvedandloaded togramwith2bbucketsandboundariesg0 =0,g1 =A1,...,gb= into the top-k buffer. Thus t4 and t5 belong to the Seen group. 1,g = 1+A ,...,g = 2. To compute thehistogram we ClearlykMinScore=Score(t5)isthelowestscoreinthetop-k b+1 1 2b buffer. have to compute the probability Prob(g < A + B ≤ g ) k k+1 for the buckets of the new histogram, which may be derived as The remaining tuples t1, t2, and t3 are in the Unseen group. WeneedtoestimateOneUnseenPDF,thepdf ofthescoreofany aPppArlo+xBimma=tegdk+b1yCaA,bB[bl,umck]etThhiisstohgirsatomgrabmy mcaenrgsinugbsneqeiugehnbtloyrinbge Unseen tuple using the gPDFis for attributes A1 and A2. We pairsofbuckets.ThisproceduregivesanO(b2)algorithmforcom- havetofirstupdatethegPDFistomodeltheremainingvaluesfor putingtheconvolutionofthetwopdfs.2 eachattribute. Consequently thebucketsofthehistogramforA1 will now have counts (3,1), while the buckets of the histogram Asacorollary, for nhistograms, wecan performthe convolu- tionsinsequence,withafinalrunningtimeofO(nb2). for A2 will have counts (2,2). Wethen normalize each gPDFi by dividing by the sum for each gPDFi to get (3/4,1/4) and LEMMA 4.2. Themax-convolution of twopdfsthat are repre- (1/2,1/2)respectively.WethencomputeOneUnseenPDFbytak- sentedbytwobbuckethistogramscanbecomputedinO(b)time. ing the convolution of gPDF1 and gPDF2, resulting in counts (3/8,5/8). NextweneedtocomputeMaxUnseenPDF,thepdfof Proof: Thetrickhere istoavoidtheCartesianproduct. Asbe- themaxscoreofallunseentuples.AsshowninLemma4.3,wedo fore,considertworandomvariablesA,Bwithpdfsapproximated thisbytakingthepdfofasingleunseentupleOneUnseenPDFand bytwohistogramsHAandHB eachwithbbucketsandthesame raisingittothepowerofthetotalnumberofunseentuples,which bucket boundaries. We approximate the pdf of max(A,B) with inthiscaseis3. ahistogramwiththesamebucket boundaries. ThenProb(A ≤ Wecanthencomputetheconfidenceofthecurrenttop-kbuffer k max(A,B) < Ak+1) isequal to HA[k +1]·(Pi≤kHB[i])+ bycomparingMaxUnseenPDFwiththecurrenttop-kkMinScore HB[k+1]·(Pi≤kHA[i]). asdescribedinSection4.3.1. IfwefirstcomputethecumulativedistributionsofHAandHB, 4.6 Considering Multidimensional itiseasytoseethetheaboveprobabilitycanbecomputedincon- Distributions stanttime. Sincethecumulativedistributionscanbecomputedin O(b)time,theoveralltimeforthemax-convolutionisO(b).2 The pdf of the score of a tuple for a given query depends on Asacorollary, we cancompute themax-convolution of n his- thejointdistributionof theattributes. Manycommercial systems togramsinO(nb)time. Evenmoreinterestingly,asthefollowing maketheattributevalueindependenceassumption,andkeepstatis- lemmashows, themax-convolutionofnidenticalhistogramscan ticsonlyforindividualattributes.Inoursettingasdescribedabove, becomputedinO(b)time. theindependenceassumptionissimilarlyassumedwhenwecom- putethepdf ofthescore of atupleby takingconvolutions of the LEMMA 4.3. Themax-convolutionofnidenticalPDFs,repre- histogramsofthedifferentattributes. Westressherethatthecom- sentedbyabbuckethistogram,canbecomputedinO(b)time. putation of the score is the only place in our framework this as- sumptionhasbeenmade. Allothercomputations(includingmax- Algorithm2AnytimeTA-Sorted convolutions) that take place in the computation of our any-time 1: topk={dummy1,...,dummyk},MinScore(dummyi)= measures do not make any assumptions on the distributions. Al- 0 though theindependence assumption iscommonly applied andis 2: Partials={}//Partiallyseentuplesnotcurrentlyintopk well validated in practice for in a wide variety of applications, it 3: kMinScore=0//smallestscoreintopkbuffer mayproduce inaccurateresultsinsomecases(whethertheconfi- 4: Assumeforalltuplest,obs(t)={} dence curve using one-dimensional histogram is higher or lower 5: ford=1toN do than the confidence curve using two-dimensional histograms de- 6: forallsortedlistsLi(1≤i≤M)inparalleldo pends onwhether theindependence-based approach isoverlyop- 7: Let<tuple-idt,t[i]>bethed-thiteminLi timisticor pessimistic). For such cases, joint distributionmodels 8: obs(t)=obs(t)∪{i} involvingmultipleattributesmaybenecessary. 9: MinScore(t)=0 Jointdistributionscaneasilybeappliedinourframework. Sup- 10: forj ∈obs(t)do poseforsometupletwehavethreeattributesA,B andC which 11: MinScore(t)+=wjt[j] areunknown. Earlierweshowedthatwecancomputetheconvo- 12: endfor lutionofHA,HB,andHC,butwithmultidimensionalhistograms 13: //UpdatePDFsbyconditioningwithremainingvalues we can now compute the convolution of thescore pdf of A+B 14: Update-gPDF(gPDFi,t[i]) andHC,wherethescorepdfofA+Bmaybedirectlycomputed 15: //Updatetopkbuffer from HA,B, thetwo-dimensional histogram representing the join 16: ifMinScore(t)>kMinScorethen distributionofattributesAandB. 17: ift6∈topkthen Asinthecaseofone-dimensionalhistograms,multidimensional 18: Letubetuplewithsmallestworstcasescoreintopk histograms are computed as a pre-processing step. Methods for 19: Removeufromtopk computing multidimensional histograms have been throughly re- 20: if|obs(u)|<M then searched[11][22]involvingsamplingandotherefficientapproxi- 21: Partials=Partial∪{u} mationtechniques. Intheevaluationsectionofthisworkwecon- 22: endif sider two-dimensional histograms. Since the number of possible 23: topk=topk∪{t} two-dimensionalhistogramsisquadraticinthenumberofattributes, 24: endif wehavetodecidewhichpairstotake. Weusethefollowingsim- 25: kMinScore=min{MinScore(v)|v ∈ topk} ple heuristic: starting withthe set of attributes, we find the most 26: endif correlatedpairofattributes,computeatwo-dimensionalhistogram 27: if|obs(t)|<M andt6∈topkthen ontheseattributes,removethesetwoattributes,andcontinuewith 28: Partials=Partials∪{t} theremainingset. Thisapproachproducesalinearnumberofhis- 29: else tograms,and,sincethereisnooverlapofattributesbetweendiffer- 30: Partials=Partials−{t} ent histograms, greatly simplifies the selection of the histograms 31: endif that have to be used to compute the convolution of a set of at- 32: //Computeconfidence tributes. 33: Confidence=ComputeConfidence() Recall that for one dimensional histograms (histograms cover- 34: endfor ing a single attribute) every time a new item from the sorted list 35: endfor isread, thecorresponding bucket hastobe decreased by one. In the case of multidimensional histograms we similarly decrement thehistogramsasnewitemsareread. Supposewehaveadatabase thenext(sortedbydecreasingmagnitude)valueoftheselectedat- withtwoattributesAandB,twoone-dimensionalequi-widthhis- tribute. ThedifferentiatingfactorbetweenAnytimeTAandAny- togramsHA,HB,aswellasone10×10equi-width2-dimensional timeTA-SortedistheinclusionofPartials. LetPartialsbethe histogramHA,B. Ifthefirsttuplethatiscompletelyresolvedhas setoftuplesthatarepartiallyseen(somebutnotalloftheattributes thevalues(0.3,0.9), wedecrementthebucketsofthehistograms foragiventuplehavebeenresolved),butarenotinthetop-kbuffer. asfollows: HA[3]wouldbedecremented,HB[9]wouldbedecre- Let < t,t[i] > be the next item read by the algorithm along mented, andHA,B[3,9] would be decremented. Thissametech- the sorted list Li corresponding tothe i-th attribute, i.e., the i-th niquefollowsthroughforhigherdimensional histogramsandcan attributevalueoftuplet.Whenthisitemisread,thealgorithmhas beperformedincrementally. to(a)updateMinScore(t)(whichisthesumoftheattributesthat haveseenfort)(b)updatethepdfoftheattributei(gPDFi),and 5. ANYTIMETA-SORTED ALGORITHM (c)updatethetop-kbufferwiththektupleswiththehighestlower- bound scores. After reading t[i], t will either be fully resolved InthissectionwedescribehowtheTA-Sortedalgorithmcanbe (thatis,allattributesofthavebeenseenanditsfinalscorefound) extended to compute online probabilistic guarantees. In addition andputintheSeengroup,orpartiallyresolvedandplacedinthe to Seen and Unseen tuples, TA-Sorted also maintains tuples in Partialsgroup. whichonly someof theattributeshavebeen seen. Thisisacon- sequence of theinabilityof TA-Sortedtoperform random access 5.1 Monotonicity forAnytimeTA-Sorted operations. Consequently, duringtheoperationof TA-Sorted, we Measures inExpectation needtokeepasetoftuplescalledPartialsthatarenotinthetop- LetkthScore(D)refertothekthlargestscoreofalltuplesina k,yetcannotbeeliminatedbecauseweknowonlyalower-bound specificdatabaseD. TheConfidence(Seen )forAnytimeTA- oftheirtruescore. TheTA-Sortedalgorithmmustestimatethepdf d Sortedmaybedefinedastheprobabilitythat ofthemaximumscoresofthePartialsbeforegivinganyproba- bilisticguaranteeontheconfidence. kMinScore(Seen )>(k+1)thScore(D) d LikeTA,theTA-SortedalgorithmasshowninAlgorithm2se- lects attributes in a round-robin fashion, at each step processing where D is a random valid extension of Seen into a complete d database drawn from PDF(D|DinD(Seen )). Because of the PartitioningallvaliddatabaseextensionsDasfollows,weget d useoflower-boundscores,thisdefinitionofconfidenceisactually evenmoreconservativethantheearlierdefinitionofconfidencein Confidence(Seend)= X Section3.1. Seend+1∈OneMore(Seend) Example:Thereexistadatabaseinstancewhere ( X (kMinScore(Seend)= D∈D(Seend+1) Confidence(Seend)>Confidence(Seend+1) kthScore(D)) ·Prob(D|D∈D(Seen ))) d+1 AssumeadatabasewithtwocolumnsA andA ,eachwithdomain 1 2 ·Prob(Seen |Seen ∈OneMore(Seen )) [0.0,1.0]andauniformdistributionmodel. Letthescorefunction d+1 d+1 d beScore(t)= t[1]+t[2]. Letthedatabasehavefourtupleswith tuple-idst1,...,t4,andassumethatthetaskistoreturnthetop-2 FromTheorem1wehave tuples. In the first iteration, assume we encounter t1 = [0.9,?] and kMinScore(Seen )≤kMinScore(Seen ) d d+1 t2=[?,0.9],alongeachofthesortedlists(a?impliesthatthecor- responding attributevalueisunresolved). Afterthisiteration, the foranyextensionSeen .Thustheabovereducesto: d+1 top-2bufferisloadedwitht1andt2,eachwithaworstcasescore of0.9.Sincewehavenotseentheothertwotuples,weassumethat Confidence(Seend)≤ X each isdistributeduniformly in[0.0,0.9]×[0.0,0.9], and hence Seend+1∈OneMore(Seend) theprobabilitythatthecurrentworstcasescoreof0.9islargerthan thescoresofboththeseunseentuplesis(1/2)∗(1/2)=1/4. ( X (kMinScore(Seend+1)=kthScore(D))· Suppose in the next iteration the algorithm encounters t3 = D∈D(Seend+1) [0.8,?]andt4 = [?,0.8]. Afterthisiteration,thetop-2bufferre- Prob(D|D∈D(Seend+1)))· mains unchanged. However, the unresolved attribute of t3 has a Prob(Seen |Seen ∈OneMore(Seen )) d+1 d+1 d probabilityof7/8ofhavingavalueintherange[0.1,0.8],which would enable t3 to have larger score than the current worst case score. Asimilarargumentcanbemadefort4. Thus,theprobabil- Thus, itythatthecurrentworstcasescoreof0.9islargerthanthescoresof boththese(nowpartiallyseen)tuplesdecreasesto(1/8)∗(1/8)= Confidence(Seend)≤ X 1/64. 2 Seend+1∈OneMore(Seend) Similarexamplescanbeconstructedtoshowthattheotherany- Confidence(Seen )· d+1 time measures are non-monotonic for certain database instances. Prob(Seen |Seen ∈OneMore(Seen )) Theseargumentsbringtolightasubtleissue.Theuncertain(prob- d+1 d+1 d abilistic)natureofanytimemeasuresshouldofcoursebeobvious tothereader-i.e.,thatatanypointduringexecution,wecannotbe Thus,Confidence(Seen )≤E[Confidence(Seen )].2 completely certain that wehave discovered the truetop-k tuples, d d+1 andthereforecanonlymakeprobabilisticguaranteesregardingour 5.2 ComputingAnytimeTA-SortedMeasures anytimemeasures.However,whattheexampleshowsisthatasthe Inthissubsectionwediscusshowtheanytimemeasuresarecom- iterationsprogress,wemayhavetorevise,andsometimesevenre- putedineachiterationoftheTA-Sortedalgorithm.Ourfocusison duce,ourprobabilisticguarantees.Wenotethatasimilarargument the Confidence measure; the details of the computation of other willnotsufficeinthecaseofTA,becauseinthatalgorithmatuple anytimemeasuresareomittedduetolackofspace. isneverinapartiallyresolvedstate-itiseithercompletelyseenor completelyunseen. 5.2.1 ComputingConfidence However,althoughtheanytimemeasuresforTA-Sortedarenot Atanyinstanceduringtheexecutionofthealgorithm,consider monotonicforcertaindatabaseinstances,wecanneverthelessshow thesetoftuplesOthers=Partials∪Unseen.LetMaxOthers that the measures are monotonic in expectation over all database betherandomvariablethatdenotesthemaximumscoreofalltu- instances. Wedescribetheresultfortheconfidencemeasure. Sim- plesinOthers.ToexecutethefunctionComputeConfidence(), ilarresultsfortheotheranytimemeasuresarestraightforwardand we have to estimate Prob(kMinScore > MaxOthers). To omittedduetolackofspace. compute thisprobability, we need to firstcompute the pdf of the LetE[Confidence(Seen )]bedefinedastheexpectedvalue d+1 random variable MaxOthers. This can be accomplished if we of Confidence(Seen ), where Seen is randomly drawn d+1 d+1 compute the pdfs of two random variables, MaxPartials and fromPDF(Seen |Seen ∈OneMore(Seen )). d+1 d+1 d MaxUnseenandthencomputethepdfofthemaximumofthese tworandomvariables. THEOREM2 (EXPECTEDMONOTONICITYTHEOREM). ThepdfofMaxUnseen(i.e.,MaxUnseenPDF)canbecom- putedaswasdoneinTA,i.e.,byraisingOneUnseenPDF tothe Confidence(Seen )≤E[Confidence(Seen )] d d+1 powerofthetotalnumberofunseentuples. Tocomputethepdfof MaxPartials,wefirstdefineScorePDFt,thedistributionofthe Proof: Fromthedefinitionofconfidence,weknowthat scoreofapartiallyseentuplet.Thedefinitionissimilartothedefi- nitionofthescorepdfofanunseentuple(i.e.,OneUnseenPDF), Confidence(Seend)= X (kMinScore(Seend)= exceptthattheconvolutionsaretakenonlyoverthepdfsoftheun- D∈D(Seend) resolvedattributesoft,towhichtheaggregateoftheresolvedat- kthScore(D))·Prob(D|D∈D(Seen )) tributevalues(i.e.,MinScore(t))iscombined. d Moreformally,givenarealnumbera,letδa(x)denotethe“delta WenotethattherunningtimeofthisupdateisindependentofN, distribution”wherealltheprobabilitymassisconcentratedataand thetotalnumberoftuplesinthedatabase. is0elsewhere.Then 6. EXPERIMENTAL EVALUATION ScorePDFt =∗({δMinScore(t)}∪{gPDFi|i6∈obs(t)}) Inthissectionwepresentanexperimentalevaluationofourframe- MaxPartialsPDF maynowbedefinedas: work. The implementation of our techniques is in C++ and our evaluationsareperformedonadualAMDOpteron280processor MaxPartialsPDF =∗max({scorePDFt|t∈Partials}) systemwith8GBofmemory. We have conducted series of experiments using synthetic and Thisoperationislinearinthenumberofpartiallyseentuples,and tworeal-worlddatasetsvaryingthedistributionandsize.Thedata soitcanbecomeslowforlargedatasets. InthefollowingSection sets range in size from 4,990 to 1,000,000 rows, and four to ten 5.2.2wepresentanefficientimplementationbyclusteringpartially attributes(wevarythenumberofattributeswhenwereportonper- resolvedtuples. OncewehavecomputedMaxUnseenPDF and formance). Our experiments focus on the comparison of the ac- MaxPartialsPDF,wecancomputeMaxOthersPDFanduse curacy of our estimatedresultswiththeexpected performance of thattocompute theTAandTA-Sortedalgorithms.WealsogeneratedatawithZip- fiandistributionsandconduct similarsetsofexperiments. Dueto Confidence=Prob(kMinScore>MaxOthers) spaceconstraintswedonotincludeillustrationsforcomparingZip- 5.2.2 EfficientlyComputingMaxPartialsPDF fiandistributedscoresbutwebrieflydiscussthehighlightsofthe results. ThestraightforwardwaytocomputeMaxPartialsPDF isto computethemax-convolutionofthescorepdfsofthepartiallyseen 6.1 Real WorldData Sets tuples. Thisoperationislinearinthenumberofpartiallyseentu- Inourexperimentsweusetworeal-worlddatasets.Ourfirstdata ples,andsoitmaybecomeslowforlargedatasets. setisatmosphericdatacollectedfromseveral independent sensor To improve the running time, we cluster the partially seen tu- ples. Consider a subset of the attributes, S, and let PartialsS locationsinWashingtonandOregonbytheDepartmentofAtmo- be the set of tuples that have exactly these S attributes resolved. sphericScienceattheUniversityofWashington.Thesecondisthe That is, PartialsS = {t|obs(t) = S}. Since all the tuples in InternetMovieDatabaseIMDB2. Forthesensordata,25sensorsindependentlyobtainedtempera- PartialsS havethesameattributesunresolved, wecanspeedup turereadingsonanhourlybasisbetweenJune2003andJune2004, thecomputationofthemax-convolutionoftheirscores: for a total of 208 days. For each sensor there is a total of 4,990 ∗max({scorePDFt|t∈PartialsS})= readings. Eachofthereadingstakenfromasensorwerecombined ∗max({∗({δMinScore(t)}∪{gPDFi|i6∈S})|t∈PartialsS}) withreadingsfromothersensorswhichhadtakenareadingduring Then,letusconsidertheworstcasescores(i.e.,MinScore(t))of thesametimeperiod. Thesereadingsweregroupedtomakeindi- the tuples t in PartialsS, and consider an equi-width B-bucket vidual rows based on their time-stamps. Sensor data such as the histogram H with these values (where B may be different from temperature data provided can specifically benefit fromour algo- the b used to denote the number of buckets in the score/attribute rithmduetotheanytimebehavior. Forourexperimentsweusethe histograms). LetU(t)betheupperboundoftherangeofthehis- readingsfromfivetotenrandomlyselectedsensors. togrambucketofH thatMinScore(t)fallsin. Letusreplacethe The IMDB database is composed of more than 860,000 titles worstcasescoreofeachtuplewiththisupperboundofthecorre- anddetailsabouteach. FortheIMDBdataset,weextractedalist spondinghistogrambucket.Wehavethen, totaling 863,049 titles. For each title, we queried the following ∗max({scorePDFt|t∈PartialsS})≤ attributes: budget, grossincome, openingweekend grossincome, ∗max({∗({δU(t)}∪{gPDFi|i6∈S})|t∈PartialsS}) andnumberofkeywordsdescribingthetitle. Thus,anytwotuplesinPartialsthathavethesamesetofresolved Weexperimentedwithseveraldifferenthistogramsizes;wefound attributesandwhoseworstcasescoresmaptothesamebuckethave that theaccuracy didnot improvemuch withhistogramsof more approximately identical score distributions. Since there are 2M than20bucketsforourreal-worldexperiments. possiblesubsetsofattributes,andweuseaB-bucketshistogramfor 6.2 AnytimeMeasures eachsubset,wehaveessentiallypartitionedalltuplesinPartials into at most 2MB clusters. Using Lemma 4.3 for each of these Ourexperimentalevaluationvalidatesourmeasuresonreal-world clusterswecancomputeanupperboundforthepdfoftheirmax- and synthetic data sets. As a baseline we compare our approach imumscore. Wecanthencomputethemax-convolutionofthere- againsttheactualconfidence,TA,andTA-Sortedalgorithms. sulting2MBhistogramstofinallycomputeMaxPartialsPDF. Inthecasewhenthedistributionofscoresisskewed,theconfi- Toefficientlydothiscomputationwehavetomaintainonecounter denceofthealgorithmmaystayrelativelylowforalargeportion foreachofthe2MBhistogrambuckets(whichareinthebeginning of the data set. This is due to a high density of values keeping initializedat0).Everytimeanewvalueisreadin,oneofthetuples thekMinScoreandMaxOtherscloseforalargerportionofthe hasonemoreattributeresolved.Ifthisisanewtuple,weincrement runningtime(i.e.,thereisalow-slopingincreaseintheconfidence, thecorrespondingbucketandaddthistupletothePartialsset.If buteventuallyitreaches100%confidence). Incaseswhenthedis- thetupleisalreadyinPartials,onebucketwillhaveitscounterre- tribution of the data set contains a distinct cluster of K or more ducedbyone. Ifthetupleisstillnotfullyresolved,anotherbucket highscores(row-levelcorrelation)theconfidencequicklyclimbs. willhaveitscounterincreasedbyone. For the IMDB data set, there are few large values with the ma- Using Lemmas 4.1, 4.2 and 4.3, we can state the following jority of the scores being clustered toward the lower end of the lemma: valuerangeforeachattribute. Thisisreasonableconsideringthat thereareonlyafewbigbudgetmoviesandofthesemoviesaneven LEMMA 5.1. AnupperboundforMaxPartialsPDF canbe computedinO(2MBb2)time. 2http://www.imdb.org Confidence: Anytime TA (Data Set: IMDB) % of Correct Results in the Top-k Buffer [Rows = 863,049, Attributes = 4] [Rows = 863,049, Attributes = 4] (IMDB) 1 1 %) 0.95 00 0.8 0.9 1-1 0.6 ect 0.85 ( r e or 0.8 enc 0.4 % C 0.75 nfid 0.2 kk==120000 0.7 kk==210000 o 0.65 C k=300 k=300 0 0.6 100 1000 10000 100000 1e+06 1e+07 100 1000 10000 100000 1e+06 1e+07 Number of items processed Number of items processed Figure2:Inthisexperimentweevaluatetheconfidenceforvaryingk Figure3:Inthisexperimentweshowtheprecision,definedastheper- asthenumberofseentuplesisincreasedfortheIMDBdataset. centageofthecurrenttop-kbufferthatisactuallyinthetop-kresult fortheIMDBdatset. Actual: Anytime TA (Data Set: Synthetic) Tuple Error: Anytime TA (Data Set: Synthetic) [Rows = 100,000, Attributes = 4] [Rows = 100,000, Attributes = 4] 1 0.2 %) Confidence 0.15 00 0.8 Actual %) 0.1 1-1 0.6 or 0.05 e ( Err 0 denc 0.4 ple ( -0.05 nfi 0.2 Tu -0.1 o -0.15 C 0 -0.2 10 100 1000 10000 100000 1e+06 0.80 0.85 0.90 0.95 Number of items processed Confidence Figure4:InthisexperimentwecomparetheactualandAnytimeTAconfidence.Thetwofiguresshowthedifferenceinthenumberofitemsread forvariouslevelsofconfidenceusingthesyntheticdataset. smallersubsetthatgrossalargesumofmoney. Thiscreatesadata runs(i.e.,webuiltanewvectorthatrepresentstheelement-wiseav- setwithasmallnumberofhighscoretuples. Similarlyinthecase erageofthevectorset)creatinganewvectorofrealvalueswhere ofthesensordatathereisrow-levelcorrelationaroundtemperature eachelementofthevectorrepresentstheactualconfidenceforeach spikes withthemajorityof thereadings beinglocatedaround the respectiverun. average temperature for each sensor. As shown in figures 2 and Weevaluatetheaccuracyofreadingsbycomparingthenumber 9, both the IMDB and sensor data sets illustrate how correlation of items read given a user-defined confidence using Anytime-TA ofattributescanquicklycausetheAnytimeTAalgorithmtoclimb withthenumberofitemsretrievedhadtheactualconfidence(de- to100%confidence,thiscanbeaccountedforbythefactthatthe finedabove)beenknown. Wecanestimatetheaccuracyofaread- correlation of data cause the the kMinScore and MaxOthers ing by comparing the number of itemsread for Anytime TA and groupstoquicklydiverge. theactualconfidence. InFigure4weshowntheerrorpercentage forconfidencelevelsof0.80through0.95.Ouralgorithmperforms Accuracy:Ourresultsshowgoodperformanceforbothreal-world wellforvariouslevelsofconfidence.Theresultssuggestthatthere andsyntheticdatasets. Infigures2and3weshowtheconfidence islittlecorrelationbetweentheconfidencelevelandtheaccuracyof andpercentageofcorrectresultsinthetop-kbufferduringtheex- ourresults.FortheexperimentpresentedinFigure4,thenumberof ecutionofthealgorithm.Thesefiguresillustratehowourestimates itemsreadbytheAnytimeTAalgorithmneverdeviatesmorethan coincidewiththenumberofcorrectresultsinthetop-kbuffer. 16% from the number of itemsread for thecorresponding actual Further, in Figure 4 we show that our estimates for the confi- confidence. dence accurately approximates theactual confidence. Inorder to comparetheaccuracyofourestimations,wecomputedtheactual 6.3 Scalability &Performance confidence by running the TA algorithm for 10 independent runs (wegenerated10randomlydistributedsyntheticdatasetsandran Efficiency: Ourresultsshowthatsizablesavingscanbeachieved thealgorithmforeach)buildingavectorforeachrunwhereeach incomparisontotheTAandTA-Sortedalgorithms. Asabaseline elementofthevectorcontainsoneoftwovalues(1=”Top-kfound”, weran TA and TA-Sortedon theIMDB and sensor data sets. In 0=”Top-knotfoundyet”). Wethencomputedtheaverageoverall eachcasewecomputedhowmanytupleswerereadbeforetheTA Accesses: Anytime TA (Data Set: IMDB) Accesses: Anytime TA-Sorted (Data Set: IMDB) [Rows = 863,049, Attributes = 4] [Rows = 863,049, Attributes = 4] 2000 16000 14000 1500 12000 10000 s s e e pl 1000 pl 8000 u u T T 6000 500 4000 2000 0 0 0.80 0.85 0.90 0.95 0.99 (TA) 0.80 0.85 0.90 0.95 0.99 (TA-S) Confidence Confidence Figure 5: Inthisexperiment wecomparethenumberoftuplesre- Figure6: Inthisexperimentwecomparenumberoftuplesretrieved trieved for Anytime TA with various levels of confidence using the for Anytime TA-Sorted with various levels of confidence using the IMDBdatasetwhereK=100. IMDBdatasetwhereK=100.(TA-S)=TA-Sorted Accesses: Anytime TA (Data Set: Sensor) our implementation of the AnytimeTA algorithm. Thisdoes not [Rows = 4,990, Attributes = 10] includethetimethatittakestocomputetheanytimemeasures. In otherwords,thisincludesthetimeittakestorunTAandthetime 6000 ittakestomaintainthegPDFsforeachround. Notethatthistime isdependent upontheusersconfidence bound. Thethirdcolumn 5000 showstheaveragetimeforcomputingtheanytimemeasure(con- s 4000 fidence, rank distance, and so on) every time thiscomputation is e pl 3000 invoked. Thetotalrunningtimeofouralgorithmisthesumofthe u T timeittakestorunAnytimeTA(column2) andthetimeittakes 2000 tocomputetheanytimemeasures(column3)timesthenumberof 1000 timestheanytimecomputationisinvoked. The experimental results in Table 2 suggest that the overhead 0 ofourapproachisrelativelysmallforAnytimeTA.Thereislittle 0.80 0.85 0.90 0.95 0.99 (TA) variation in runtime between the TA and Anytime TA algorithm Confidence (this is attributed to the fact that histograms are not utilized for computationuntilareadingistaken). Varyingthehistogramsize Figure 7: Inthis experiment wecompare thenumber oftuples re- between 5 and 25 buckets make little difference in effecting the runtimeoftheAnytimeTAalgorithm. trievedforAnytimeTAwithvariouslevelsofconfidenceusingthesen- sordatasetwhereK=300. FortheAnytimeTA-SortedalgorithmasshowninTable3there isasizabledifferenceintherunningtimeforTA-SortedandAny- orTA-Sortedstoppingconditionwasreached. Wethencompared timeTA-Sortedalgorithms. Thisisattributedtotheoverhead in- these resultswith our algorithm. As shown in Figure5 Anytime curredfromthemaintenanceofthepartiallyseentuples. Inother TAprovidessizablesavingsoverTA.Weachieveasavingofover words,thisincludesthetimeittakestorunAnytimeTA-Sorted,up- 70%(1,200tuples)foraconfidencelevelof99%usingtheIMDB datethegPDFsandmaintainpartiallyseenclustersforeachround dataset. Similarly,AnytimeTAworkswellforhighdimensional asdefinedinSection5.2.2. Varyingthehistogramsizebetween5 (sensor)datasets.AsshowninFigure7,weachievesavingsofover and25bucketsmakelittledifferenceineffectingtheruntimeofthe 50%(3,000tuples)foraconfidencelevelof99%usingthesensor AnytimeTA-Sortedalgorithm. Overall,theoverheadfor thepar- dataset. SinceTA-Sorteddoesnotallowforrandomaccesses,the tialsremainsafixedcostoverAnytimeTAandincreaseswhenthe numberoftuplesreadisusuallymuchgreaterthanTA(allowingfor sizeofthehistogramsincreases,asexpected. greater savings). Asshown inFigure6wecomparetheAnytime Performance:Weevaluateperformanceintermsofhowmanytu- TA-SortedalgorithmwithTA-Sorted. Inthiscase, forTA-Sorted plesweread,andhowlongittakestorunthealgorithmusingour andaconfidencelevelof99%weachieveanevengreatersavings implementation. We compare Anytime TA with TA. To evaluate ofover95%(14,000tuples). ourapproachweranexperimentsusingasyntheticdatasettotaling Scalability:Toevaluatetheoverheadofourapproachweranscala- 100,000rows,4attributes,andauniformdistributionforeachat- bilityexperimentswithasyntheticdatasettotaling1,000,000rows tribute;weuseahistogramsizeof20todescribethedistribution. and 4 attributes. We used histograms of 5 to 25 buckets to de- In this set of experiments we set K = 1000, but similar results scribe attribute distributions. In this set of experiments we set were obtained for different values. Proper selection of skip size K =1,000,butsimilarresultswereobtainedfordifferentvalues. (i.e. thenumber oftuplessampledbetweenreadings) cangreatly Table2showstheruntimeperformanceoftheAnytimeTAalgo- affecttheruntimeandtotalnumberoftuplessampled.Alargeskip rithm,aswellastheoverheadthatthetechniqueimposesoverthe size ensures that the number of readings is minimal. If the skip TAalgorithm.Inthefirstcolumnwereporttherunningtimeofthe sizeistoolargethenthereisacoarseningoftheconfidencelevels TAalgorithm. Inthesecondcolumnwereporttherunningtimeof between readings, generally causing additional tuples to be read

Description:
Anytime Measures for Top-k Algorithms Benjamin Arai Univ. of California, Riverside [email protected] Gautam Das Univ. of Texas, Arlington [email protected]
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.