Query-driven Frequent Co-occurring Term Extraction over Relational Data using MapReduce

Jianxin Li, Chengfei Liu, Liang Yao (Swinburne University of Technology, Melbourne, Australia)
Jeffrey Xu Yu (Chinese University of Hong Kong, Hong Kong, China)
Rui Zhou (Swinburne University of Technology, Melbourne, Australia)

arXiv:1301.2378v1 [cs.DB] 11 Jan 2013

ABSTRACT
In this paper we study how to efficiently compute frequent co-occurring terms (FCT) in the results of a keyword query in parallel using the popular MapReduce framework. Taking as input a keyword query q and an integer k, an FCT query reports the k terms that are not in q, but appear most frequently in the results of the keyword query q over multiple joined relations. The returned terms of FCT search can be used for query expansion and query refinement in traditional keyword search. Different from FCT search on a single platform, our proposed approach can efficiently answer an FCT query using the MapReduce paradigm without pre-computing the results of the original keyword query, and it runs on a parallel platform. In this work, we output the final FCT search results with two MapReduce jobs: the first extracts the statistical information of the data, and the second calculates the total frequency of each term based on the output of the first job. In both MapReduce jobs, we guarantee the load balance of the mappers and the computational balance of the reducers as much as possible. Analytical and experimental evaluations demonstrate the efficiency and scalability of our proposed approach using TPC-H benchmark datasets of different sizes.

1. INTRODUCTION
Recently, analyzing and querying big data has been attracting more and more research attention, as it can provide valuable information to companies and personal customers. For example, [1] combines a time-oriented data processing system with a MapReduce framework, which allows users to perform analytics using temporal queries - these queries are succinct, scale-out-agnostic, and easy to write. [2] presents data-parallel algorithms for sophisticated statistical techniques, with a focus on density methods, which enables agile design and flexible algorithm development using both SQL and MapReduce interfaces over a variety of storage mechanisms. [3, 4, 5] summarize the state-of-the-art scalable data management systems for traditional and cloud computing infrastructures. [3] highlights update-heavy and analytical workloads. [4] introduces some important application examples of big data in real life. [5] describes six data management research challenges relevant to big data and the cloud. Differently, in this paper our target is to address the problem of query-driven frequent co-occurring term extraction over big data, i.e., computing the frequent co-occurring terms in the results of a given keyword query. The returned frequent co-occurring terms, as informative feedback, can be used to refine the original keyword query before the exact result set of the original query is retrieved.

Since traditional keyword search often assumes that the datasets can be loaded and processed in memory, it is not suitable for keyword search over big data. The challenges come from three points: (1) the ambiguity of a keyword query limits the expressiveness of the search intention and may lead to a large number of uninteresting results, which may easily frustrate users; (2) evaluating a keyword query over big data requires a scalable computational paradigm, for which a parallel platform is desirable; (3) generally, only simple indexes can be built for big data in practice due to the huge space cost. Due to the above challenges, only a few works discuss the problem of keyword search over big data. [6] addresses scalable keyword search on large data streams by pruning the unqualified tuples in a scalable method based on a selection/semi-join strategy [7]. [8] addresses scalable keyword search over relational data by returning part of the results, rather than the whole result set, within a specified short time.

By extracting the frequent co-occurring terms of a given original keyword query, we naturally gain two advantages. On the search engine side, it is more profitable to constrain users to a specific set of results by exploiting the frequent co-occurring terms of the issued keyword queries from big data if the users are also interested in the extracted terms, which can save a lot of computational resources. On the user side, users can easily discover the concepts that are closely associated with the given keyword set by extracting the frequent co-occurring terms from the big data, which helps them easily understand the information of interest in the big data.

However, extracting such Frequent Co-occurring Terms (FCT) of a given keyword query is challenging today, as there is an increasing trend of applications being expected to deal with vast amounts of data that usually do not fit in the main memory of one machine, e.g., the Google N-gram dataset [9] and the GeneBank dataset [10], which contains 100 million records with a total size of 416GB. Applications with such datasets usually make use of clusters of machines and employ parallel algorithms in order to deal with this vast amount of data efficiently. For data-intensive applications, the MapReduce [11] paradigm has recently received a lot of attention for being a scalable, parallel, shared-nothing data-processing platform. The framework is able to scale to thousands of nodes. In this paper, we use MapReduce as the parallel data-processing paradigm for extracting the frequent co-occurring terms with regard to a given keyword query over big data.
We know that each keyword search result of a keyword query in a relational database is a set of tuples, retrieved from a single relation or several joined relations, that together contain all the given keywords of the keyword query. Intuitively, given a keyword query and big data, we could first compute the large number of keyword search results and then calculate the total frequency of each term in the result set. Finally, all the terms can be sorted by their frequencies and the top-k frequent terms can be found. But this straightforward solution may render the feedback of the returned terms meaningless to the users, because the highly time-consuming evaluation of the keyword query over big data may delay the feedback greatly.

To reduce the processing time, we propose a new FCT approach, which avoids direct keyword query evaluation by using the idea of the star algorithm in [12]. Furthermore, our new approach can efficiently explore the statistical information of big data and compute the frequencies of the co-occurring terms using MapReduce.

The main contributions of this paper are summarized as follows.

• We propose a novel MapReduce-based approach to efficiently compute the query-driven frequent co-occurring terms over big data.

• We propose two shuffling strategies to guarantee the load/computational balance of mappers and reducers by considering both uniform data distribution and uneven data distribution.

• We conducted extensive performance studies to demonstrate the scalability of our proposed approach using the TPC-H benchmark dataset.

The remainder of this paper is organized as follows. In Section 2, we introduce the working procedure of the MapReduce framework and an optimized multiway join in MapReduce in more detail. We define the problem of query-driven frequent term extraction (denoted as FCT search) in Section 3. Section 4 first discusses the partition strategies for uniform data distribution and uneven data distribution, and then presents the procedures of computing the frequent co-occurring terms of a query using MapReduce step by step. It lastly proves the correctness and completeness of the MapReduce-based FCT search. We provide the implementation algorithms of our approach in Section 5. Section 6 presents the performance studies. Finally, we discuss related work in Section 7 and conclude in Section 8.

2. PRELIMINARIES

2.1 MapReduce Framework
MapReduce [11] is a popular paradigm for data-intensive parallel computation in shared-nothing clusters. Example applications of the MapReduce paradigm include processing crawled documents, Web request logs, etc. In the open source community, Hadoop [13] is a popular implementation of this paradigm. In MapReduce, data is initially partitioned across the nodes of a cluster and stored in a distributed file system (DFS). Data is represented as (key, value) pairs. The computation is expressed using two functions:

  map (k1, v1) → list(k2, v2);
  reduce (k2, list(v2)) → list(k3, v3).

[Figure 1: Overview of MapReduce - data flows from the DFS through parallel map tasks, is shuffled and merged, and then passes through parallel reduce tasks back into the DFS.]

Figure 1 shows the dataflow in a MapReduce computation. The computation starts with a map phase in which the map functions are applied in parallel on different partitions of the input data. The (key, value) pairs output by each map function are hash-partitioned on the key. For each partition, the pairs are sorted by their key and then sent across the cluster in a shuffle phase. At each receiving node, all the received partitions are merged in sorted order by their key. All the pair values that share a certain key are passed to a single reduce call. The output of each reduce function is written to a distributed file in the DFS. Besides the map and reduce functions, the framework also allows the user to provide a combine function that is executed on the same nodes as the mappers, right after the map functions have finished. This function acts as a local reducer, operating on the local (key, value) pairs, and allows the user to decrease the amount of data sent through the network. The signature of the combine function is:

  combine (k2, list(v2)) → list(k2, list(v2)).

Finally, the framework also allows the user to provide initialization and tear-down functions for each MapReduce function, and to customize the hashing and comparison functions used when partitioning and sorting the keys. From Figure 1 one can notice the similarity between the MapReduce approach and query-processing techniques for parallel DBMSs [14, 15].
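As a concrete illustration of these signatures, the following is a minimal sketch of a map/combine/reduce pipeline that counts term occurrences. The paper itself gives no code; this is written in Python, and the small driver below merely simulates the framework's hash-partition and shuffle phases.

```python
from collections import defaultdict

# map(k1, v1) -> list(k2, v2): emit (term, 1) for every term in a document.
def map_fn(doc_id, text):
    return [(term, 1) for term in text.split()]

# combine(k2, list(v2)) -> list(k2, list(v2)): local reducer on one mapper's output.
def combine_fn(term, counts):
    return (term, [sum(counts)])

# reduce(k2, list(v2)) -> list(k3, v3): global aggregation after the shuffle.
def reduce_fn(term, counts):
    return (term, sum(counts))

def run_job(partitions):
    shuffled = defaultdict(list)
    for partition in partitions:              # map phase: one partition per mapper
        local = defaultdict(list)
        for doc_id, text in partition:
            for k, v in map_fn(doc_id, text):
                local[k].append(v)
        for term, counts in local.items():    # combine before sending over the network
            _, vs = combine_fn(term, counts)
            shuffled[term].extend(vs)         # simulated hash-partition + shuffle
    return dict(reduce_fn(t, vs) for t, vs in shuffled.items())

print(run_job([[("d1", "apple pie apple")], [("d2", "pie")]]))  # -> {'apple': 2, 'pie': 2}
```

Note how the combine step shrinks partition 1's three pairs to two before the shuffle, which is exactly the network saving the combine function is designed for.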
2.2 Lagrangean Multipliers based Multiway Joins in MapReduce
To learn how to optimize map-keys for a multiway join [16], let us begin with a simple example: the cyclic join R(A,B) ⋈ S(B,C) ⋈ T(A,C). Suppose that the target number of map-keys is k. That is, we shall use k Reduce processes to join tuples from the three relations. Each of the three attributes A, B, and C will have a share of the key, which we denote a, b, and c, respectively. We assume there are hash functions that map values of attribute A to a buckets, values of B to b buckets, and values of C to c buckets. We use h as the hash function name, regardless of which attribute's value is being hashed. Note that abc = k.

Consider tuples (x,y) in relation R. Which Reduce processes need to know about this tuple? Recall that each Reduce process is associated with a map-key (u,v,w), where u is a hash value in the range 1 to a, representing a bucket into which A-values are hashed. Similarly, v is a bucket in the range 1 to b representing a B-value, and w is a bucket in the range 1 to c representing a C-value. Tuple (x,y) from R can only be useful to this reducer if h(x) = u and h(y) = v. However, it could be useful to any reducer that has these first two key components, regardless of the value of w. We conclude that (x,y) must be replicated and sent to the c different reducers corresponding to key values (h(x), h(y), w), where 1 ≤ w ≤ c. Similar reasoning tells us that any tuple (y,z) from S must be sent to the a different reducers corresponding to map-keys (u, h(y), h(z)), for 1 ≤ u ≤ a. Finally, a tuple (x,z) from T is sent to the b different reducers corresponding to map-keys (h(x), v, h(z)), for 1 ≤ v ≤ b.

This replication of tuples has a communication cost associated with it. The number of tuples passed from the Map processes to the Reduce processes is rc + sa + tb, where r, s, and t are the numbers of tuples in relations R, S, and T, respectively. Therefore, the optimization problem is to minimize the overall cost:

  Minimize F = rc + sa + tb subject to abc = k

where a, b, and c are the numbers of buckets of the relations, and k is the number of reduce tasks.

The method of Lagrangean multipliers serves us well. That is, we start with the expression rc + sa + tb − λ(abc − k), take derivatives with respect to the three variables a, b, and c, and set the resulting expressions equal to 0. The result is three equations: s = λbc, i.e., sa = λk; t = λac, i.e., tb = λk; r = λab, i.e., rc = λk. If we multiply the left sides of the three equations and set that equal to the product of the right sides, we get rst·k = λ^3·k^3 (remembering that abc on the left equals k). We can now solve for λ = (rst/k^2)^{1/3}. From this, the first equation sa = λk yields a = (krt/s^2)^{1/3}. Similarly, the next two equations yield b = (krs/t^2)^{1/3} and c = (kst/r^2)^{1/3}. When we substitute these values into the original expression to be optimized, rc + sa + tb, we get the minimum amount of communication between the Map and Reduce processes: 3(krst)^{1/3}.

3. PROBLEM DEFINITION
We consider that the database has n tables R1, R2, ..., Rn, referred to as the raw tables. Their referencing relationships are summarized in a schema graph:

Definition 1. (Schema Graph) The schema graph is a directed graph G such that (1) G has n vertices, corresponding to tables R1, ..., Rn, respectively, and (2) G has an edge from vertex Ri to vertex Rj (1 ≤ i ≠ j ≤ n) if and only if Rj has a foreign key referencing a primary key in Ri.

Definition 2. (Joining Network of Tuples) A joining network of tuples jn is a tree of tuples where for each pair of adjacent tuples ti, tj ∈ jn, where ti ∈ Ri, tj ∈ Rj, there is an edge (Ri, Rj) in G and (ti ⋈ tj) ∈ (Ri ⋈ Rj).
Definition 3. (Keyword Query) Given a keyword query, its result is the set of all possible joining networks of tuples that are both: (1) Total - every keyword is contained in at least one tuple of the joining network; (2) Minimal - we cannot remove any tuple from the joining network and still have a total joining network of tuples.

As such, we call such joining networks Minimal Total Joining Networks of Tuples (MTJNTs) of the keywords in the keyword query. Each MTJNT is a result instance of the keyword query. To improve the efficiency of keyword query evaluation, we can group the tuples of each relation based on their contained query keywords.

Definition 4. (Joining Network of Tuple Sets) A joining network of tuple sets Jn is a tree of tuple sets where for each pair of adjacent tuple sets Ri^{Ki}, Rj^{Kj} in Jn, there is an edge (Ri, Rj) in G.

Here, Ri^{Ki} represents the set of tuples of relation Ri where each tuple contains the partial query keyword set Ki, while Rj^{Kj} represents the set of tuples of relation Rj where each tuple contains the partial query keyword set Kj.

Definition 5. (Candidate Network) Given a keyword query, a candidate network C is a joining network of tuple sets, such that there is an instance I of the database that has an MTJNT M ∈ C, and no tuple t ∈ M that maps to a free tuple set F ∈ C contains any keywords.

As the example shown in [17], for a keyword query "Smith, Miller", J = ORDERS^{Smith} ⋈ CUSTOMER^{} ⋈ ORDERS^{} is not a candidate network even though there is an MTJNT that belongs to J, because J is subsumed by ORDERS^{Smith} ⋈ CUSTOMER^{} ⋈ ORDERS^{Miller}. Here, CUSTOMER^{} and ORDERS^{} denote free tuple sets.

Definition 6. (Top-k FCT Retrieval of a Keyword Query) Given a keyword query q, a number Rmax, and an integer k, a frequent co-occurring term (FCT) query returns the k terms with the highest frequencies among all terms that (1) are not in q and (2) are in the results of the keyword query q w.r.t. the maximal number Rmax of joined relations.

For the problem of FCT retrieval, a straightforward solution is to first solve the corresponding keyword query, and then extract the term frequencies. However, this solution would incur expensive cost because it needs to completely evaluate all the joins - the minimal total joining networks (MTJNTs) of the corresponding keyword query. To reduce the computational cost, [12] proposes a star method to efficiently calculate the term frequencies without complete join evaluation. It first obtains all the candidate networks (CNs) of the keyword query by using the CN-generation algorithm in [17]. It then computes the term frequencies for each CN. At last, all the computed term frequencies are summarized into the total term frequencies with regard to the FCT query over the data to be searched.

Let us use h to represent the number of CNs. A CN can be regarded as an algebraic expression which retrieves a set of MTJNTs. We denote the set of MTJNTs resulting from executing a CNi (1 ≤ i ≤ h) as MTJNT(CNi). The MTJNTs of any two CNs do not overlap, i.e., MTJNT(CNi) ∩ MTJNT(CNj) = ∅ for any i ≠ j, where i, j ≤ h. Therefore, the keyword search result set can be defined as follows:

  KSResult(q) = ⋃_{i=1}^{h} MTJNT(CNi).    (1)

Let freq-CN(CNi, w) be the total number of occurrences of term w in all the MTJNTs of MTJNT(CNi), or formally:

  freq-CN(CNi, w) = Σ_{T ∈ MTJNT(CNi)} Count(T, w),    (2)

where Count(T, w) is defined as the number of occurrences of w in a single MTJNT T. Thus, the total frequency freq(q, w) can be calculated as:

  freq(q, w) = Σ_{i=1}^{h} freq-CN(CNi, w).    (3)

According to the above equations, FCT retrieval (freq(q, w)) can be efficiently answered by alternatively calculating the term frequencies (freq-CN(CNi, w)) of each candidate network CNi. Specifically, freq-CN(CNi, w) can be calculated efficiently when CNi is a star candidate network, where one vertex, called the root, connects to all the other vertices, called the leaves.

Although the star method is much better than the straightforward one, it is still expensive to compute FCT retrieval because (1) it needs to scan the data twice: scanning the data to build the statistical information, and scanning the data to compute the frequencies of terms in the data; (2) it is a single-machine approach, which incurs long processing time when the data to be processed is massive; (3) if CNi is not a star candidate network, it has to select some relations to join exactly, which is used to make the star-conversion. In this paper, we study the problem of FCT retrieval in a parallel environment, i.e., the MapReduce framework, which can improve the performance greatly. In the following sections, we mainly focus on the complex computation of FCT retrieval for star candidate networks. For a non-star candidate network, we can evaluate the relations to be selected using the repartition join strategy in MapReduce, which is easy to implement [18].
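Equations (1)-(3) amount to summing per-CN term counts and then ranking. A small sketch in Python (not from the paper; each MTJNT is modelled simply as a bag of terms, and the helper names are ours):

```python
from collections import Counter

# Count(T, w) for all w at once: an MTJNT T is modelled as a list of terms.
def count_terms(mtjnt):
    return Counter(mtjnt)

# freq-CN(CN_i, w): sum Count(T, w) over all MTJNTs produced by one CN.
def freq_cn(mtjnts):
    total = Counter()
    for t in mtjnts:
        total += count_terms(t)
    return total

# freq(q, w) over all h candidate networks, then drop the query keywords
# themselves (they are excluded by Definition 6) and keep the top-k terms.
def top_k_fct(query_terms, cns, k):
    freq = Counter()
    for mtjnts in cns:
        freq += freq_cn(mtjnts)
    for w in query_terms:
        freq.pop(w, None)
    return freq.most_common(k)

cns = [[["smith", "db", "db"]], [["smith", "db", "cloud"]]]
print(top_k_fct({"smith"}, cns, 1))   # -> [('db', 3)]
```

The disjointness of the MTJNT(CNi) sets in Equation (1) is what makes the plain summation in Equation (3) correct: no MTJNT is counted twice.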
4. MAPREDUCE-BASED FCT SEARCH
Different from the star method, we load and process the data in parallel using MapReduce. However, making this change is not trivial, because MapReduce is a shared-nothing parallel data processing paradigm, and the performance of the FCT search depends on whether the data partition strategy is good or not. For each CNi, we can get its relevant term frequencies by two MapReduce jobs.

In this section, we first propose our optimal data partition strategy, by which the data can be distributed over the processing nodes with minimal duplication while guaranteeing that there is no communication cost among the processing nodes during the evaluation of the FCT search. We then describe the procedures of the two MapReduce jobs for aggregating the total term frequencies. At last, we analyze the properties of the MapReduce-based FCT search approach.

4.1 Uniformed Distribution-based Shuffling Strategy
For the star-schema join, we can still utilize the Lagrangean-multipliers-based join strategy to partition the data. For example, given a set of relations with the join R(A,B,C) ⋈ S(A,E) ⋈ T(B,F) ⋈ P(C,G), the cost expression could be

  r + sbc + tac + pab

and the Lagrangean equations are:

  tac + pab = λk,  sbc + pab = λk,  sbc + tac = λk

where k = abc. After making the pairwise comparisons, we get the transformed equations: s/a = t/b = p/c. Thus, the minimum-cost solution has shares for each variable proportional to the size of the dimension table in which it appears. That is to say, the map-keys partition the fact table into k parts, and each part of the fact table gets equal-sized pieces of each dimension table with which it is joined. As a result, we can derive a = (ks^2/tp)^{1/3}, b = (kt^2/sp)^{1/3}, and c = (kp^2/st)^{1/3}.

For instance, if we have 9 PCs as processors, i.e., k = 9, and each dimensional table contains 10,000 tuples, then each dimensional table can be split into 2 partitions, i.e., S0, S1, T0, T1, P0, and P1. Obviously, the fact table, with a maximal size of 10,000^3 (10^12) tuples, can be approximately distributed onto the 8 processors only when the data in the dimensional tables is distributed uniformly, i.e., each of the 8 processors needs to process at most 1.25 × 10^11 joining operations. The processors can be labelled as follows: 000, 001, 010, 011, 100, 101, 110, 111.

However, if one of the dimensional tables contains skewed tuples, then some processors will become much hotter while the rest may be idle. Even if we add more nodes to increase the scalability of the system, this does not solve the skew problem, because all skewed tuples will still be sent to the hot processors. Following the above instance, if only the tuples in the first partition of dimensional table S appear in the fact table R, then the half of the processors with labels 100, 101, 110, 111 will always be idle. That is to say, each processor of 000, 001, 010, 011 has to deal with at most 2.5 × 10^11 joining operations. This case often happens in star-schema join operations, because it is difficult to guarantee that the data in all the dimensional tables is evenly distributed. To address the imbalance of computation, intuitively we can split the data into more bulks (small partitions) that can be distributed as evenly as possible. However, it is not easy to manage the scheduling of a big number of bulks, particularly when the bulk size is small.
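The closed-form shares above are easy to sanity-check numerically. Below is a small sketch (Python; not from the paper) that computes a, b, and c for the star-schema join and verifies by brute force that they minimize the dimension-table replication cost sbc + tac + pab under the constraint abc = k:

```python
from itertools import product

def optimal_shares(s, t, p, k):
    # a = (k s^2/(t p))^(1/3), b = (k t^2/(s p))^(1/3), c = (k p^2/(s t))^(1/3)
    a = (k * s * s / (t * p)) ** (1.0 / 3)
    b = (k * t * t / (s * p)) ** (1.0 / 3)
    c = (k * p * p / (s * t)) ** (1.0 / 3)
    return a, b, c

def comm_cost(s, t, p, a, b, c):
    # replication cost of the dimension tables: s*b*c + t*a*c + p*a*b
    return s * b * c + t * a * c + p * a * b

s, t, p, k = 10_000, 10_000, 10_000, 8
a, b, c = optimal_shares(s, t, p, k)
best = comm_cost(s, t, p, a, b, c)
print(round(a, 3), round(b, 3), round(c, 3))   # equal-sized tables -> equal shares: 2.0 2.0 2.0

# brute-force check over all integer share assignments with a*b*c == k
for ia, ib, ic in product(range(1, k + 1), repeat=3):
    if ia * ib * ic == k:
        assert comm_cost(s, t, p, ia, ib, ic) >= best - 1e-6
```

With three equal-sized dimension tables and k = 8, the optimum is the symmetric split a = b = c = 2, matching the 2 × 2 × 2 partitioning of the example above.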
Forthestar-schemejoin,wecanstillutilizetheLagrangeanMul- ConsiderafactrelationR(A,B,C)andtwodimensionalrelations tipliersbasedJoinstrategytopartitionthedata.Forexample,given asetofrelationshavingR(A,B,C)1S(A,E)1T(B,F)1P(C,G), S(A,E)andT(B,F).AssumeScanbesplittedintothreepartitions: S ,S ,andS ,andTissplittedintofourpartitions:T ,T ,T and thecostexpressioncouldbe 0 1 2 0 1 2 T .Assuch,wehave12reducetasksthatneedtobecomputed.We 3 r+sbc+tac+pab needtoanswer:howtodistributethe12reducetasksintothekre- ducers(e.g.,k=9)? Weshouldberemindedthatsomereducetasks andtheLagrangeanequationsare: doesnot producejoinedresultsoronlygenerateafewduetothe tac+pab=λk,sbc+pab=λk,sbc+tac=λk dataskew.Todothis,weproposeacostmodeltoevaluatethecom- putationalcostofeachreducetask.Anyformulaforestimatingthe wherek=abc.Aftermakingthepair-comparisons,wecangetthe costofajoincouldbeused. Here,wechosethesimpletechnique transformed equations: s/a = t/b = p/c. Thus, theminimum- ofestimatingthat cost solution hasshares foreachvariable proportional tothesize cij = Rij est+ Siest+ Tj est+ Rij 1Si1Tj est ofthedimensiontableinwhichitappears.Thatistosay,themap- where|Rij|estis|an|estimat|eo|fthenu|mberofRtuple|smapped keyspartitionthefacttableintok parts, andeachpartofthefact totheredu|cet|asklabelledasij, Siestisanestimateofthenumber tablegetsequal-sizedpiecesofeachdimensiontablewithwhichit ofStuplesmappedtothereduc|eta|skslabelledasi?(thequestion isjoined.Asaresult,wecanderivea= 3 ks2/tp,b= 3 kt2/sp markmeansthatallpossiblereducetaskswhoselabelstarts with p p andc= p3 kp2/st. i),|Tj|est isanestimateofthenumberofTtuplesmappedtothe Forinstance,ifwehave9PCasprocessors,i.e. k=9,andeach reducetaskslabelledas?j(similarly,thequestionmarkmeansthat dimensionaltablecontains10,000tuples,theneachdimensionalta- allpossible reduce taskswhoselabel ends withj), Rij 1 Si 1 blecanbesplittedinto2partitions,i.e.,S0,S1,T0,T1,P0,andP1. Tj est isanestimateofthenumberoftuplesinRij|1 Si 1 Tj. 
Obviously,thefacttablewiththemaximalsizeof10,0003 (1012) We| compute this estimate of the size of Rij 1 Si 1 Tj by as- canbeapproximatelydistributedintothe8processorsonly when sumping thatthejoinattributevalues ineach joindimension Rij are uniformly distributed. Once this estimate for the cost of the samekeys,whichcancompressthedatatobesent,e.g.,thepairsof joiningof thereduce tasksarecomputed, anytaskscheduling al- s ands canbeaggregatedinto(a ,′′...,k ,...,k ,...2′′)where 1 2 1 1 1 | gorithm can be used to try to balance the computational cost of thesymbol isusedtoseperatetheaggregatedtextinformationand | reducers.Inthiswork,weadoptaheuristicmethodtoschedulethe the local frequency of the tuples (e.g., s and s ) with the same 1 2 reducetasks. Forexample,wehave5reducetaskswiththeiresti- join attributekey (e.g., a ); the rest two pairs can be aggregated 1 matedcoststobecomputedovertworeducers. Thentheywillbe into(a ,′′...,k ,w ,w ,w ,...,k ,w 2′′).Inthisprocedure,the 2 1 1 2 3 1 2 | assignedasshowninFigure2. querykeywords (e.g., k )canbepruned directly, whichdoesnot 1 affectthefollowingcalculation. Butwekeepitintheexamplein ordertomaketheaggregationtobeunderstoodeasily. For the tuples in a fact relation, the map function extracts all the join attributes as the composite key k , and the texts of the 08(cid:13)(cid:13) 07(cid:13)(cid:13) 04(cid:13)(cid:13) 02(cid:13)(cid:13) 51(cid:13)(cid:13) restattributesastheaggregatedvalue v . 2Similarly, thecombine 2 functioncanbeappliedtoaggregatethevaluesofthetupleshaving the same set of join attributes in order to minimize the network traffic between the map and reduce functions. For example, the mapfunctioncanoutputakey-valuepair(a b c ,′′w ,w ,...′′) 1 2 2 1 2 | | for the tuple r in Figure 3(e). 
Similarly, we can transform the 1 )(cid:13)51(cid:13)1(cid:13)(cid:13)((cid:13) (cid:13)1(cid:13) (cid:13)r(cid:13)e(cid:13)cu(cid:13)d(cid:13)e(cid:13)(cid:13)r(cid:13) )(cid:13)01(cid:13)1(cid:13)(cid:13)((cid:13) (cid:13)2(cid:13) (cid:13)r(cid:13)e(cid:13)cu(cid:13)d(cid:13)e(cid:13)(cid:13)r(cid:13) othertuplesintokey-valuepairs. Subsequently,thereducefunctioncomputesthestatisticalinfor- Figure2:ExampleofAssigningReduceTasks mationforeachfacttupleanditscorrespondingdimensiontuples. Firstly, it pulls all the dimension tuples from the DFS files and To efficiently estimate the cost, in this paper we employ Sim- countsthenumberoftuplessharingthesamejoinattributeforeach ple Random Sample to choose the tuples from R, S and T. Al- dimensionrelation,whicharestoredinavector,denotedas num- thoughtherearealsootherclassificsamplingstrategies,e.g.,Strat- array. Andthen, itdeals withthefact tuples oneby one at each ifiedsamplingandSystemanticsampling,theymayintroducemore reducer.Foreachfacttuple,weneedtoprobethenum-arraysofits workload.Thedetailedcomparisonofsamplingstrategiesisoutof corresponding dimension relationsforproducing thecardinalities thescopeofthispaper. ofthejoinattributes, whicharestoredinvectors, denoted asvol- Afterthereducetasksarevirtuallyplacedtothereducersinbal- arrays. Afterallthefacttuplesareprocessedatareducer,itgen- ance, the master node will construct an allocation table to main- eratesonevol-arrayforeachdimensionrelationandonevol-array taintheschedulingrelationshipsbetweenbulksandreducers. For forthefactrelation,whichcanbeusedtocomputethefrequencies example, if the reduce tasks R and R are grouped into the oftermsco-occurredwiththequerykeywordsatMapReduce2nd. 00 01 samereducer(e.g.,#reducer=1)together,thenthereducerwith #reducer =1willpullthetuplesetsR00 andR01 fromthecor- 4.3.2 AllocateDatatoReduceTasksEvenly responding mappers. 
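The assignment in Figure 2 is consistent with a longest-processing-time greedy rule: repeatedly give the most expensive remaining task to the currently lightest reducer. The paper does not spell out its exact heuristic, so the following Python sketch is one plausible reading:

```python
import heapq

def assign_tasks(costs, num_reducers):
    # min-heap of (current_load, reducer_id, assigned_task_costs)
    reducers = [(0, i, []) for i in range(num_reducers)]
    heapq.heapify(reducers)
    for cost in sorted(costs, reverse=True):        # most expensive task first
        load, rid, tasks = heapq.heappop(reducers)  # currently lightest reducer
        heapq.heappush(reducers, (load + cost, rid, tasks + [cost]))
    return sorted(reducers, key=lambda r: r[1])

for load, rid, tasks in [(r[0], r[1], r[2]) for r in assign_tasks([80, 70, 40, 20, 15], 2)]:
    pass
print(assign_tasks([80, 70, 40, 20, 15], 2))  # -> [(115, 0, [80, 20, 15]), (110, 1, [70, 40])]
```

On the costs of Figure 2 this reproduces the loads 115 and 110 shown there; ties on load are broken by reducer id, so the list elements in the heap tuples are never compared.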
4.3 Computing the Statistical Information at MapReduce1st

4.3.1 Brief Procedure of Computing Statistical Information
At MapReduce1st, the map function takes the original tuples as input. For each tuple in a dimensional relation, the function extracts the join attribute as the key k2, and the texts of the rest of the attributes as the value v2. To minimize the network traffic between the map and reduce functions, we use a combine function to aggregate the values with the same key output by the map function into a single value. A number is appended to the aggregated value as the local frequency of the join attribute appearing in the current split. If the join attribute is the primary key of the corresponding relation, then the combine process can be skipped.

Consider the example in Figure 3. To make the load balanced, we assume that each table is split into two subtables of as equal size as possible, e.g., for S^{k1}, the first partition size is ⌈|S^{k1}|/2⌉ = 4 while the second partition size is |S^{k1}| − ⌈|S^{k1}|/2⌉ = 3. The map function can read the tuples s1, s2, s3, s4 and generate the corresponding key-value pairs for the first partition in Figure 3(b). For s1, the output key-value pair is (a1, "...,k1,..."). For s2, the key-value pair is (a1, "k1,..."). Similarly, the remaining two tuples generate key-value pairs taking a2 as the key and the corresponding texts as their values, respectively. With the combine function, we can aggregate the key-value pairs having the same keys, which compresses the data to be sent; e.g., the pairs of s1 and s2 can be aggregated into (a1, "...,k1,...,k1,...|2"), where the symbol | is used to separate the aggregated text information and the local frequency of the tuples (e.g., s1 and s2) with the same join attribute key (e.g., a1); the remaining two pairs can be aggregated into (a2, "...,k1,w1,w2,w3,...,k1,w2|2"). In this procedure, the query keywords (e.g., k1) can be pruned directly, which does not affect the following calculation. But we keep them in the example in order to make the aggregation easy to understand.

For the tuples in a fact relation, the map function extracts all the join attributes as the composite key k2, and the texts of the rest of the attributes as the aggregated value v2. Similarly, the combine function can be applied to aggregate the values of the tuples having the same set of join attributes in order to minimize the network traffic between the map and reduce functions. For example, the map function can output a key-value pair (a1|b2|c2, "w1,w2,...") for the tuple r1 in Figure 3(e). Similarly, we can transform the other tuples into key-value pairs.

Subsequently, the reduce function computes the statistical information for each fact tuple and its corresponding dimension tuples. Firstly, it pulls all the dimension tuples from the DFS files and counts the number of tuples sharing the same join attribute for each dimension relation; these counts are stored in a vector, denoted as num-array. It then deals with the fact tuples one by one at each reducer. For each fact tuple, we need to probe the num-arrays of its corresponding dimension relations to produce the cardinalities of the join attributes, which are stored in vectors denoted as vol-arrays. After all the fact tuples are processed at a reducer, it generates one vol-array for each dimension relation and one vol-array for the fact relation, which can be used to compute the frequencies of the terms co-occurring with the query keywords at MapReduce2nd.

4.3.2 Allocate Data to Reduce Tasks Evenly
To understand the reduce function, we need to answer two questions. The first one is: how can we allocate the data to the reduce tasks as evenly as possible? Let us consider the discussion in Section 4.1 again. If each dimension table is split into two partitions to be processed by reducers in parallel, then it will produce 2^3 reduce tasks, labelled 000, 001, 010, 011, 100, 101, 110, 111. A possible way is to label the distinct join attribute values with different numbers. Then a hash function can be used to shuffle the data. For example, there are four distinct attribute values in column A. We can assign the numbers a1 → 1, a2 → 2, a3 → 3, and a4 → 4. If we still want to split the data into two partitions, then the hash function can be designed as h(ai) = getNum(ai) MOD 2, where getNum(ai) is used to get the number of the attribute value ai in column A. Based on this hash function, we have h(a1) = 1, h(a2) = 0, h(a3) = 1, and h(a4) = 0, respectively. Similarly, we can design hash functions for the join attributes in column B and column C.

According to the designed hash functions, we can distribute the tuples in the fact table and the dimension tables to different reduce tasks. The key-value pair (a1|b2|c2, "w1,w2,...") generated by the tuple r1 will be allocated to the reduce task h(a1)h(b2)h(c2) = 100. At the same time, the key-value pairs with a1 will be distributed to the reduce tasks 100, 101, 110 and 111, respectively. The key-value pairs with b2 will be distributed to the corresponding reduce tasks 000, 001, 100, and 101, respectively. The key-value pairs with c2 will be distributed to the corresponding reduce tasks 000, 010, 100, and 110, respectively. All the data in Figure 3 is allocated as shown in Figure 4, in which we only list the keys of the key-value pairs.
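The labelling scheme above is easy to express directly. The following Python sketch (helper names are ours, not the paper's) reproduces the single task that receives r1's composite key a1|b2|c2 and the replication of each dimension key to a row or column of tasks:

```python
from itertools import product

# assumed numbering of the distinct join attribute values, as in the example
num = {"a1": 1, "a2": 2, "a3": 3, "a4": 4,
       "b1": 1, "b2": 2, "c1": 1, "c2": 2}

def h(attr):
    return num[attr] % 2              # h(ai) = getNum(ai) MOD 2

def fact_task(a, b, c):
    # a fact tuple goes to exactly one reduce task: h(A)h(B)h(C)
    return f"{h(a)}{h(b)}{h(c)}"

def dim_tasks(attr, position):
    # a dimension tuple is replicated to every task whose label matches its
    # own hash bit at `position` (0 for column A, 1 for B, 2 for C)
    return ["".join(bits) for bits in product("01", repeat=3)
            if bits[position] == str(h(attr))]

print(fact_task("a1", "b2", "c2"))    # -> 100
print(dim_tasks("a1", 0))             # -> ['100', '101', '110', '111']
print(dim_tasks("b2", 1))             # -> ['000', '001', '100', '101']
print(dim_tasks("c2", 2))             # -> ['000', '010', '100', '110']
```

This mirrors the replication argument of Section 2.2: a dimension tuple fixes one bit of the task label and is copied to every task agreeing on that bit, while a fact tuple fixes all three bits and is sent to exactly one task.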
Inpractice,boththekeysand }(cid:13)1Sk(cid:13)(cid:13)(cid:13){(cid:13) DI(cid:13)(cid:13) A(cid:13) t(cid:13)xe(cid:13) (cid:13)t(cid:13) }(cid:13)t(cid:13)xe(cid:13)(cid:13)t(cid:13) (cid:13),A(cid:13){(cid:13)(cid:13) s(cid:13) a(cid:13) k(cid:13) (cid:13),(cid:13).(cid:13).(cid:13)..(cid:13)(cid:13).(cid:13).(cid:13) (cid:13),(cid:13) DI(cid:13)(cid:13) B(cid:13) t(cid:13)xe(cid:13) (cid:13)t(cid:13) 1(cid:13) 1(cid:13) 1(cid:13) s(cid:13)2(cid:13) a(cid:13)1(cid:13) .(cid:13).k(cid:13).(cid:13)1(cid:13)(cid:13),(cid:13) t1(cid:13)(cid:13) kb(cid:13)(cid:13)1 (cid:13)(cid:13),(cid:13).(cid:13).(cid:13).(cid:13) 2w(cid:13),(cid:13)(cid:13)w (cid:13)3(cid:13)(cid:13),(cid:13).(cid:13).4(cid:13)(cid:13).(cid:13) f(cid:13) s(cid:13)3(cid:13) ka(cid:13) (cid:13)(cid:13)2,(cid:13)(cid:13).(cid:13).(cid:13).(cid:13) w (cid:13)1(cid:13)(cid:13),(cid:13)w (cid:13)1(cid:13)(cid:13),(cid:13)w (cid:13)2(cid:13)(cid:13),(cid:13) 3(cid:13) t2(cid:13)(cid:13) b(cid:13)1(cid:13) k(cid:13)2(cid:13),(cid:13)ww (cid:13)1(cid:13)(cid:13)(cid:13),(cid:13) .(cid:13)2.(cid:13)(cid:13),(cid:13).(cid:13) }(cid:13)t(cid:13)xe(cid:13)(cid:13)t(cid:13) C(cid:13) ,(cid:13)B(cid:13) (cid:13),A(cid:13){(cid:13)(cid:13) R(cid:13) s(cid:13)4(cid:13) a(cid:13)2(cid:13)k(cid:13) (cid:13),(cid:13).(cid:13).(cid:13).(cid:13) w (cid:13)1(cid:13)(cid:13),(cid:13) 2(cid:13) t3(cid:13)(cid:13) b(cid:13)2(cid:13) k(cid:13) (cid:13),(cid:13).(cid:13).(cid:13).(cid:13) 2(cid:13) s(cid:13)5(cid:13) a(cid:13)2(cid:13)k(cid:13) (cid:13),(cid:13).(cid:13).(cid:13).(cid:13) w (cid:13)1(cid:13)(cid:13),(cid:13) 3(cid:13) t4(cid:13)(cid:13) b(cid:13)3(cid:13)k(cid:13) (cid:13),(cid:13).(cid:13).(cid:13).(cid:13) w (cid:13)2(cid:13)(cid:13),(cid:13) 2(cid:13) }(cid:13)2Tk(cid:13)(cid:13){(cid:13)(cid:13) }P(cid:13)3k(cid:13)(cid:13)(cid:13){(cid:13) s(cid:13)6(cid:13) a(cid:13)3(cid:13) .(cid:13).k(cid:13).(cid:13)1(cid:13)(cid:13),(cid:13) t5(cid:13)(cid:13) bk(cid:13)3(cid:13)(cid:13) 
(cid:13),(cid:13).(cid:13).(cid:13).(cid:13) w (cid:13)2(cid:13)(cid:13),(cid:13)w (cid:13)1(cid:13)(cid:13),(cid:13) 3(cid:13) }(cid:13)t(cid:13)xe(cid:13)(cid:13)t(cid:13) (cid:13),B(cid:13){(cid:13)(cid:13) }(cid:13)t(cid:13)xe(cid:13)(cid:13)t(cid:13) (cid:13),C(cid:13){(cid:13)(cid:13) s(cid:13)7(cid:13) a(cid:13)4(cid:13)k(cid:13) (cid:13),(cid:13).(cid:13).(cid:13).(cid:13) w (cid:13)1(cid:13)(cid:13),(cid:13) 4(cid:13) t6(cid:13)(cid:13) b(cid:13)4(cid:13)k(cid:13) (cid:13),(cid:13).(cid:13).(cid:13).(cid:13) w (cid:13)2(cid:13)(cid:13),(cid:13) 3(cid:13) (a) Star-CN (b) S{k1} (c) T{k2} DI(cid:13)(cid:13) C(cid:13) t(cid:13)xe(cid:13) (cid:13)t(cid:13) p(cid:13)1(cid:13) kc(cid:13)(cid:13)1 (cid:13)(cid:13),(cid:13).(cid:13).(cid:13).(cid:13) 3w(cid:13),(cid:13)(cid:13)w (cid:13)1(cid:13)(cid:13),(cid:13).(cid:13).1(cid:13)(cid:13).(cid:13) DI(cid:13)(cid:13) A(cid:13) B(cid:13) C(cid:13) t(cid:13)xe(cid:13) (cid:13)t(cid:13) p(cid:13)2(cid:13) c(cid:13)1(cid:13) k(cid:13)3(cid:13),(cid:13)ww (cid:13)1(cid:13)(cid:13)(cid:13),(cid:13) .(cid:13)2.(cid:13)(cid:13),(cid:13).(cid:13) r(cid:13)1(cid:13) a(cid:13)1(cid:13) b(cid:13)2(cid:13) c(cid:13)2(cid:13) ww (cid:13)1(cid:13)(cid:13)(cid:13),(cid:13) .(cid:13)2.(cid:13)(cid:13),(cid:13).(cid:13) p(cid:13)3(cid:13) ck(cid:13)2(cid:13)(cid:13) (cid:13),(cid:13).(cid:13).(cid:13).(cid:13) w (cid:13)3(cid:13)(cid:13),(cid:13)w (cid:13)2(cid:13)(cid:13),(cid:13) 4(cid:13) r(cid:13)2(cid:13) a(cid:13)1(cid:13) b(cid:13)4(cid:13) c(cid:13)1(cid:13) .(cid:13).(cid:13).(cid:13) r(cid:13) a(cid:13) b(cid:13) c(cid:13) w (cid:13)(cid:13),(cid:13).(cid:13).(cid:13).(cid:13) 3(cid:13) 5(cid:13) 2(cid:13) 2(cid:13) 2(cid:13) p(cid:13) ck(cid:13)(cid:13) (cid:13),(cid:13).(cid:13).(cid:13).(cid:13) w (cid:13)(cid:13),(cid:13)w (cid:13)(cid:13),(cid:13) 4(cid:13) 2(cid:13) 3(cid:13) 2(cid:13) 3(cid:13) p(cid:13)5(cid:13) c(cid:13)3(cid:13) k(cid:13) (cid:13),(cid:13).(cid:13).(cid:13).(cid:13) 3(cid:13) r(cid:13)4(cid:13) 
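As a concrete illustration of this allocation scheme, the sketch below uses one hash bit per dimension, so three dimensions yield the eight reduce tasks 000-111. This is a toy sketch, not the paper's implementation; the bit values in H are chosen by hand so that the running example works out.

```python
from itertools import product

# Toy 1-bit hash per dimension, chosen to reproduce the running example:
# hA(a1)=1, hB(b2)=0, hC(c2)=0, so the fact tuple r1 = (a1, b2, c2)
# is sent to the single reduce task "100".
H = {"a1": 1, "b2": 0, "c2": 0}

def fact_task(a, b, c):
    """A fact tuple is sent to exactly one reduce task."""
    return f"{H[a]}{H[b]}{H[c]}"

def dimension_tasks(attr, position):
    """A dimension tuple fixes its own bit and is replicated to every
    reduce task obtained by varying the other two bits."""
    tasks = []
    for bits in product("01", repeat=2):
        label = list(bits)
        label.insert(position, str(H[attr]))
        tasks.append("".join(label))
    return sorted(tasks)

print(fact_task("a1", "b2", "c2"))   # -> "100"
print(dimension_tasks("a1", 0))      # a1 is copied to four reduce tasks
print(dimension_tasks("b2", 1))
print(dimension_tasks("c2", 2))
```

Note how each dimension value is replicated to 2^(d-1) of the 2^d reduce tasks, while every fact tuple is shuffled exactly once.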
[Figure 3: A running example for a query {k1, k2, k3}. Panels: (a) the Star-CN over the dimension relations S{k1}, T{k2}, P{k3} and the fact relation Rφ; (b) S{k1}; (c) T{k2}; (d) P{k3}; (e) Rφ. The example tables are not reproduced here.]

In practice, both the keys and the values of the key-value pairs will be allocated together.

[Figure 4: Demonstration of the Data Allocation.]

4.3.3 Allocate Reduce Tasks to Reducers Effectively

The second question is: how can we allocate the reduce tasks to reducers effectively? The naive method is to activate one reducer for each reduce task. But this method is often infeasible in real applications, because we may generate different numbers of virtual reduce tasks for the same computational task while the number of available physical reducers is limited. Another possible method is to allocate the reduce tasks in a round-robin way. For example, the reduce tasks with the labels 000, 010, 100, and 110 can be computed at the first reducer while the rest can be calculated at the second reducer. Although this intends to achieve computation balance, it ignores the data and often leads to unbalanced computation due to data skew. In addition, different reduce tasks may have different computational workloads: the reducers allocated heavy workloads will take more processing time while the ones with light workloads will take less. Especially for the example in Figure 4, the reduce tasks with the labels 000, 010, 110, and 111 cannot make any contribution to the final results because no fact tuples appear in these tasks. Therefore, these useless reduce tasks can be directly pruned without further computation, which can improve the performance a lot.

After pruning the useless reduce tasks, the first reducer only needs to process the reduce task (100) while the second reducer has to deal with the three reduce tasks (001, 011, 101). If we do not consider other information, we can say that the computation over the two reducers is balanced as much as possible because the first reducer needs to process two fact key-value pairs while the second reducer deals with three fact key-value pairs. However, if we take a simple sample over the reduce task (100), we can find that a5 does not exist in the reduce task (100). Therefore, the fact key-value pair with the key a5|b2|c2 can be pruned. At this moment, we can find that the workload ratio of the two reducers is 1 : 3. To address the data skew, we can use the cost model in Section 4.2 to evaluate the cost of each reduce task and group the tasks as evenly as possible. In this case, we can allocate the reduce tasks (100, 101) to the first reducer and the reduce tasks (001, 011) to the second reducer.

4.3.4 Detailed Procedure of Computing Statistical Information

When the reducers pull all the relevant data from the corresponding mappers, it is time to build the local vol-arrays for the dimension and fact relations. The reduce task (001) gets the following key-value pairs: (a2|b4|c1, "...,k3,w2,w3"), (a2, "...,k1,w1,w2,w3; ...,k1,w2"), (a2, "...,k1,w3"), (a4, "...,k1,w4"), (b2, "...,k2"), (b4, "...,k2,w3") and (c1, "...,k3,w1,w1,...; k3,w1,w2,..."). Firstly, we can build the num-arrays for the join attributes A, B and C, respectively.

For the local num-array of S{k1}, we have:

  attribute  num  text
  a2         3    "...,k1,w1,w2,w3; ...,k1,w2; ...,k1,w3"
  a4         1    "...,k1,w4"

For the local num-array of T{k2}, we have:

  attribute  num  text
  b2         1    "...,k2"
  b4         1    "...,k2,w3"

For the local num-array of P{k3}, we have:

  attribute  num  text
  c1         2    "...,k3,w1,w1,...; k3,w1,w2,..."

After we build the num-arrays for the local partitions, we need to process the fact key-value pairs one by one.
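The volume bookkeeping just described can be sketched compactly. The sketch below is an illustration, not the paper's Hadoop code: num holds the num-array entries of the reduce task (001); the volume of a fact key is the product of the nums of its single keys, and each dimension value accumulates the product of the remaining nums.

```python
from math import prod

# num-array entries of reduce task (001) in the running example
num = {"a2": 3, "a4": 1, "b2": 1, "b4": 1, "c1": 2}

def process_fact(fact_key, num, vol):
    """Update the vol-arrays for one fact key such as 'a2|b4|c1'."""
    parts = fact_key.split("|")
    counts = [num[p] for p in parts]
    vol[fact_key] = prod(counts)            # volume of the fact tuple
    for i, p in enumerate(parts):
        # each dimension value gets the product of the other nums
        others = prod(c for j, c in enumerate(counts) if j != i)
        vol[p] = vol.get(p, 0) + others

vol = {}
process_fact("a2|b4|c1", num, vol)
print(vol)   # {'a2|b4|c1': 6, 'a2': 2, 'b4': 6, 'c1': 3}
```

These are exactly the volumes that appear in the vol-array tables of the reduce task (001) in the running example.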
Since the reduce task (001) only includes the fact key-value pair (a2|b4|c1, "...,k3,w2,w3"), the local vol-arrays of the fact and dimension tables can be built as follows.

For the local vol-array of S{k1}, we have:

  attribute  volume  text
  a2         2       "...,k1,w1,w2,w3; ...,k1,w2; ...,k1,w3"

For the local vol-array of T{k2}, we have:

  attribute  volume  text
  b4         6       "...,k2,w3"

For the local vol-array of P{k3}, we have:

  attribute  volume  text
  c1         3       "...,k3,w1,w1,...; k3,w1,w2,..."

For the local vol-array of Rφ, we have:

  attribute  volume  text
  a2|b4|c1   6       "...,k3,w2,w3"

Similarly, we can process the reduce tasks of 011, 100, and 101, respectively. Since the dimension key-value pairs often need to be copied across different reduce tasks according to our adopted scheduling strategy, a by-product of the strategy is that key-value pairs already obtained by a reducer need not be re-pulled. By doing this, we can reduce the communication cost and still guarantee the correctness of the results. For example, the reduce tasks 001 and 011 would be processed together at the first reducer. After we deal with the reduce task 001, the data information of a2, a4 and c1 is already at this reducer. For the reduce task 011, the reducer only needs to pull the necessary key-value pairs (b3, "...,k2,w2; ...,k2,w1,w3") and (a4|b3|c1, "...,w3"). As such, only the local num-array of T{k2} of the reduce task 001 needs to be updated, by adding the key-value pair of b3. For the updated local num-array of T{k2}, we have:

  attribute  num  text
  b2         1    "...,k2"
  b3         2    "...,k2,w2; ...,k2,w1,w3"
  b4         1    "...,k2,w3"

After processing the fact key-value pair (a4|b3|c1, "...,w3"), we can update the local vol-arrays of the fact and dimension tables as follows.

For the local vol-array of S{k1}, we have:

  attribute  volume  text
  a2         2       "...,k1,w1,w2,w3; ...,k1,w2; ...,k1,w3"
  a4         4       "...,k1,w4"

For the local vol-array of T{k2}, we have:

  attribute  volume  text
  b3         2       "...,k2,w2; ...,k2,w1,w3"
  b4         6       "...,k2,w3"

For the local vol-array of P{k3}, we have:

  attribute  volume  text
  c1         5       "...,k3,w1,w1,...; k3,w1,w2,..."

For the local vol-array of Rφ, we have:

  attribute  volume  text
  a2|b4|c1   6       "...,k3,w2,w3"
  a4|b3|c1   4       "...,w3"

Similarly, we can get the vol-arrays at the second reducer as follows.

For the local vol-array of S{k1}, we have:

  attribute  volume  text
  a1         4       "...,k1,...; k1,..."

For the local vol-array of T{k2}, we have:

  attribute  volume  text
  b2         4       "...,k2"
  b4         4       "...,k2,w3"

For the local vol-array of P{k3}, we have:

  attribute  volume  text
  c1         2       "...,k3,w1,w1,...; k3,w1,w2,..."
  c2         2       "...,k3,w2,w4,...; k3,w2,w3,..."

For the local vol-array of Rφ, we have:

  attribute  volume  text
  a1|b2|c2   4       "w1,w2,..."
  a1|b4|c1   4       "..."

4.4 Computing the Term Frequency at MapReduce2nd

At MapReduce2nd, we output the final results, i.e., the frequent co-occurrences with the given keyword query, by utilizing the statistical information in the vol-arrays produced at MapReduce1st.

The map function takes as inputs the intermediate results of MapReduce1st, i.e., the vol-arrays consisting of three parts: {join attribute, volume, text information}. For each record in the vol-arrays, we break the text information into a token set by using any tokenization method and filter the stop words from the generated token set. For each distinct token, we can get its local frequency by counting the times the token appears in the filtered token set. Then we generate the frequency of the token at the mapper by multiplying its local frequency and the volume of the record, which is taken as the value v2. The token is taken as the key k2. At the reducer stage, the reduce function starts to compute the total term frequency. For a certain key, the reduce function pulls all the corresponding records with that key from all the mappers.

Let us take the term w1 as an example to show the procedure of MapReduce2nd. Assume there are two available mappers: the first mapper takes as inputs the output of the first reducer at MapReduce1st and the second mapper takes as inputs the output of the second reducer at MapReduce1st. The first mapper scans each record in the vol-arrays and generates key-value pairs taking a term as the key and its cardinality as the value. For the record a2, it first computes the local frequency of w1 as 1; then it calculates the cardinality by multiplying the local frequency and the volume of the record, e.g., 1 * 2 = 2; lastly it outputs the key-value pair (w1, 2). At the same time, the key-value pairs of the other terms in the record can be output. Similarly, the record b3 outputs (w1, 1 * 2 = 2) and the record c1 outputs (w1, 3 * 5 = 15). The second mapper outputs the key-value pairs (w1, 3 * 2 = 6) by c1 and (w1, 1 * 4 = 4) by a1|b2|c2, respectively.

At the reducer stage, each reducer gets all the key-value pairs of the keys allocated to it and adds up the cardinalities of each term as the total frequency of the term. For w1, its total frequency is 2 + 2 + 15 + 6 + 4 = 29.
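The two phases of MapReduce2nd can be simulated in plain Python. The sketch below is an illustration only: the token lists are hand-built so that the per-record w1 counts match the running example; the other tokens are placeholders.

```python
from collections import Counter

# (volume, tokens) records from the vol-arrays of the two reducers
# at MapReduce1st that mention w1 (stop words already filtered).
records = [
    (2, ["w1", "w2", "w3", "w2"]),        # a2        at reducer 1
    (2, ["w2", "w1", "w3"]),              # b3        at reducer 1
    (5, ["w1", "w1", "w1", "w2"]),        # c1        at reducer 1
    (2, ["w1", "w1", "w1", "w2"]),        # c1        at reducer 2
    (4, ["w1", "w2"]),                    # a1|b2|c2  at reducer 2
]

def map_phase(records):
    """Emit (term, local_frequency * volume) for every record."""
    for volume, tokens in records:
        for term, local_freq in Counter(tokens).items():
            yield term, local_freq * volume

def reduce_phase(pairs):
    """Sum the cardinalities per term to get total term frequencies."""
    total = Counter()
    for term, cardinality in pairs:
        total[term] += cardinality
    return total

freq = reduce_phase(map_phase(records))
print(freq["w1"])   # 2 + 2 + 15 + 6 + 4 = 29
```

The multiplication by the volume is what lets the two jobs avoid materializing the join results of the original keyword query.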
4.5 Properties of MapReduce-based FCT Search

THEOREM 1 (Aggregation Equal Transformation). The aggregation of the num-arrays and the vol-arrays built over independent reduce tasks is equal to those built over the aggregated data information of the independent reduce tasks.

PROOF. For the num-arrays, if a key appears in a reduce task, then all the key-value pairs with the same key must appear in that reduce task. This means that the local frequency (num) of the key equals the global frequency (num) of the key in the original relations. In other words, the num of a key in one reduce task can serve all the reduce tasks that contain the key. Therefore, the aggregation of the num-arrays over independent reduce tasks can be equivalently transformed: we first aggregate the distinct key-value pairs of the independent reduce tasks and then compute the num-arrays over the aggregated data. Because the equivalent transformation holds for the num-arrays, the vol-arrays can also be built by alternatively accessing the aggregated data of the independent reduce tasks.

Based on the equivalent transformation in Theorem 1, the statistical results of one reduce task can be used for another reduce task if both reduce tasks include the same keys. Therefore, two corollaries can be derived as follows.

COROLLARY 1 (Incremental Computation). The num-arrays and the vol-arrays can be incrementally built across reduce tasks.

COROLLARY 2 (Data Filtering). The reducers only need to pull the necessary data information that has not been seen.

Since the derivations are easy to understand, we do not provide their proofs in this paper. According to the above two corollaries, we can further improve the performance of our approach by

• reducing the communication cost, due to the avoidance of pulling repeated data;
• accelerating the computation of the reducers, because the reducers can start to work early on the data that is already available;
• avoiding computation from scratch, by incrementally maintaining the computational results of the reduce tasks that have already been processed.

COROLLARY 3 (Correctness and Completeness). MapReduce-based FCT Search can compute the term frequencies for a keyword query over big data correctly and completely.

PROOF. According to the uniform distribution-based shuffling strategy in Section 4.1 or the uneven distribution-based shuffling strategy in Section 4.2, each fact tuple is sent to exactly one reduce task, i.e., there are no duplicates across different reduce tasks. The fact and dimension partition data at each reduce task can then be used to compute the term frequencies independently. Therefore, the local correctness and completeness of MapReduce-based FCT Search is guaranteed for the partition data of each reduce task. Based on the aggregation equal transformation in Theorem 1, we can conclude the global correctness and completeness of MapReduce-based FCT Search, because the aggregated results of all the reduce tasks can be equally transformed to compute the results over the aggregated data partitions (i.e., the original data).

5. IMPLEMENTATION OF MAPREDUCE-BASED FCT SEARCH

In the above sections, we have introduced the concepts of our MapReduce-based FCT search approach. Now we present its implementation, which includes the functions Map(), Reduce() and getPartition() of MapReduce1st and a brief description of MapReduce2nd.

Algorithm 1 Map(key, a record) at MapReduce1st
1: key_new = getJoinAttribute(key, the record);
2: if Type(key_new) is identified as a dimension key then
3:   indexPos = getJoinAttrPosition(Type(key_new), joinAttrTypeSet[]);
4:   for (i = 1; i <= numDuplicates; i++) do
5:     cPartition = Integer.toX-naryString(i);
6:     cPartition.insertBefore('*', indexPos);
7:     Value_new = getValue(key, the record);
8:     Emit(pair(indexPos, key_new), pair(cPartition, Value_new));
9:   end for
10: else
11:   keyset = getJoinAttribute(key, the record);
12:   priority = Σ_{key ∈ keyset} getJoinAttrPosition(Type(key), joinAttrTypeSet[]);
13:   key = keyset.toString('|');
14:   Value_new = getValue(keyset, the record);
15:   Emit(pair(priority, key), Value_new);
16: end if

Algorithm 1 shows the procedure of Map() at MapReduce1st. When a dimension tuple is read, it first extracts the join attribute as the new key and generates as the value a string combining the contents of the remaining attributes. Then it tags the key with a number, for which we use the position of its corresponding attribute column in the fact relation. By doing this, we can guarantee that at each reducer, the data belonging to the same dimension relation will be collected together. It also tags the value with the partition to which it is to be copied, which is used to implement the multiway-join based data partition, as shown in Line 4 - Line 9. For a fact tuple, the map function tags the key with the sum of the position numbers of its corresponding attribute columns in the fact relation, which guarantees that all the fact tuples arrive after all the dimension tuples at a reducer. Different from processing the dimension tuples, we do not need to tag the values, because each fact tuple will be sent to only one partition, as shown in Line 11 - Line 15.

To adapt to the multiway-join based data partition, Algorithm 2 redesigns the function getPartition() of Hadoop. According to the specified number of reduce tasks, i.e., numReduceTasks, Line 1 calculates the number (denoted as numDimPartitions) of partitions for each dimension relation using the derived equations of Section 4.1, e.g., a = ∛(ks²/tp), b = ∛(kt²/sp) and c = ∛(kp²/st). If the key-value pair comes from a dimension relation, we can compute the local partition number, with regard to the dimension relation, to be allocated by the key, as shown in Line 3. It is then used to compute the global partition number by combining it with the partition numbers of the other dimension relations, as shown in Line 4 - Line 5. Similarly, we can process the key-value pairs coming from the fact relation. Differently, the key is often a composite key that consists of multiple single keys. Therefore, we need to first calculate the local partition number for each single key and then transform the set of local partition numbers into a global partition number, as shown in Line 8 - Line 12.

Algorithm 2 getPartition(key, value, numReduceTasks) at MapReduce1st
1: {Comments: compute the number numDimPartitions of partitions for each dimension relation from numReduceTasks};
2: if key.second() does not contain '|', i.e., a dimension tuple then
3:   numPartition = (key.second().hashCode() & Integer.MAX_VALUE) % numDimPartitions;
4:   cPartition = value.first().replace('*', numPartition);
5:   return cPartition.toDecimal() as the number of the partition;
6: else
7:   new a string str = '';
8:   keys[] = key.second().split('|');
9:   for int i = 0; i < keys.length; i++ do
10:    str += (keys[i].hashCode() & Integer.MAX_VALUE) % numDimPartitions;
11:  end for
12:  return str.toDecimal() as the number of the partition;
13: end if
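The two branches of getPartition() can be sketched in plain Python. This is a simplified illustration, not the Hadoop implementation: hash_local stands in for (hashCode() & Integer.MAX_VALUE) % numDimPartitions, and the template string (e.g., "*01") plays the role of the cPartition tag attached by the mapper.

```python
def hash_local(key, num_dim_partitions):
    # stand-in for (key.hashCode() & Integer.MAX_VALUE) % numDimPartitions
    return hash(key) % num_dim_partitions

def get_partition(key, c_partition_template, num_dim_partitions=2):
    """Return the reduce-task label for a key-value pair."""
    if "|" not in key:                       # dimension tuple
        local = hash_local(key, num_dim_partitions)
        return c_partition_template.replace("*", str(local))
    # fact tuple: composite key, one local partition number per single key
    return "".join(str(hash_local(k, num_dim_partitions))
                   for k in key.split("|"))

# A dimension tuple fills its '*' slot in the tagged template, while a
# fact key like "a1|b2|c2" concatenates three local numbers into one label.
label_dim = get_partition("a1", "*01")
label_fact = get_partition("a1|b2|c2", None)
```

With num_dim_partitions = 2 every label digit is 0 or 1, so the labels match the binary reduce-task names (000-111) used in the running example.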
Algorithm 3 Reduce(key, a record) at MapReduce1st
1: {Comments: the dimension tuples are always processed before fact tuples};
2: if key.second() is identified as a dimension key then
3:   if hashi_value.contains(key.first()) then
4:     hashi_num(key.first()) = hashi_num(key.first()) + 1;
5:   else
6:     hashi_value(key.first()) = value.second();
7:     hashi_num(key.first()) = 1;
8:   end if
9: else
10:  {Comments: all the dimension tuples belonging to this partition have arrived};
11:  keys = key.first().split('|');
12:  num_i = hashi_num(keys[i]);
13:  if every num_i ≠ 0 then
14:    hashf_value(key) = value;
15:    hashf_vol(key) = Π_i num_i;
16:    hashi_vol(keys[i]) += Π_{j≠i} num_j;
17:  end if
18: end if
19: {Comments: generate the inputs for the reducer at MapReduce2nd};
20: for each item x in hashf_vol or hashi_vol do
21:   for each word w in hashf-or-i_value(x) do
22:     Emit(w, hashf-or-i_vol(x));
23:   end for
24: end for

Algorithm 3 can be divided into three stages. At the first stage, in Line 2 - Line 8, it calculates the total number of dimension records with the same join attribute as the key. At the second stage, in Line 11 - Line 17, it calculates the volume of each fact tuple using Π_i num_i and the total volume of each join attribute using Π_{j≠i} num_j, respectively. At the third stage, in Line 20 - Line 24, it generates the intermediate results that will be taken as the inputs of the reducer at MapReduce2nd. Since the tag of the fact keys is always larger than that of the dimension keys, any dimension relation has a higher priority than the fact relation. Therefore, we guarantee that the three stages are processed in a stable sequence.

The rest of the work is similar to the classic word count example using MapReduce. We take as inputs the intermediate results of the reducer at MapReduce1st and calculate the total frequencies of each term. After that, the merge-sort operation is applied to the outputs of the reducers at MapReduce2nd. As such, the top-k frequent words or terms will be found.
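The final top-k selection over the aggregated term frequencies can be sketched with a heap. The frequencies below are toy values except for w1, whose total of 29 comes from the running example.

```python
import heapq

# Aggregated term frequencies output by MapReduce2nd (toy values;
# w1's total of 29 is taken from the running example).
totals = {"w1": 29, "w2": 17, "w3": 11, "w4": 3}

def top_k_terms(totals, k):
    """Return the k most frequent co-occurring terms, best first."""
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

print(top_k_terms(totals, 2))   # [('w1', 29), ('w2', 17)]
```

heapq.nlargest runs in O(n log k), which matters when the distinct-term vocabulary is large but only a small k is reported to the user.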
6. EXPERIMENTS

In this section, we study the performance of our proposed MapReduce-based FCT search approach. All experiments were performed on a 9-machine cluster running Hadoop 1.0.3 [13] on the SwinCloud platform (hadoop.ict.swin.edu.au). One machine served as the Head Node, running CentOS-5 Linux with a 500GB hard disk allocated as DFS storage; the Head Node also serves as the Name Node and the JobTracker. The other 8 machines serve as Worker Nodes; they are general PCs with 1GB RAM and are used for Map and Reduce tasks. Each Worker Node is configured to run one map task and one reduce task concurrently. The distributed file system block size is set to 64MB. Only the Head Node takes the role of storage node for the DFS. All the machines are connected via a Gigabit-Ethernet network.

6.1 Selection of Dataset and Keyword Queries

To test the stability of our approach, we design three types of keyword queries as shown in Figure 5.

[Figure 5: Designed Query Types. Panels: (a) Star-Type; (b) Chain-Type; (c) Mix-Type, over the relations PART, SUPPLIER, LINEITEM, ORDERS and CUSTOMER. The query diagrams are not reproduced here.]

In order to guarantee that the result sets of the generated keyword queries are not empty, we adopt the following steps to generate keyword queries and record their experimental results.
Let us take Type1 in Figure 5(a) as an example:

• run the structured queries and output the results as three bags of texts, e.g., for the star-type, we have: select bag(p*), bag(s*), bag(o*) from part p, supplier s, lineitem li, orders o where p.partkey = li.partkey & s.suppkey = li.suppkey & o.orderkey = li.orderkey;
• choose the terms from different bags as the query keywords;
• for each keyword query type, generate 3 batches of keyword queries where each batch contains 10 random keyword queries.

As TPC-H [19] is the most widely used big data benchmark in MapReduce studies, e.g., [20, 21, 22], we generate a set of TPC-H datasets with different sizes. In order to demonstrate the performance of the multiway-join in MapReduce, we directly link the PART relation and the SUPPLIER relation to the relation LINEITEM, by which the relation LINEITEM is taken as the fact relation while the relations PART, SUPPLIER and ORDERS are considered as the dimension relations. The original TPC-H schema can be seen at [19].

In the following study, the average performance of each batch of keyword queries is used for comparison, e.g., Q1, Q2, and Q3 represent the three batches for Type1; Q4, Q5, and Q6 represent the three batches for Type2; and Q7, Q8, and Q9 represent the three batches for Type3.

To illustrate the advantages of parallel platforms, we first run the FCT search of the query batch Q1 over the 1GB dataset on a single-machine platform. Although the single machine has 4GB RAM and a 500GB hard disk and does not need shuffle operations, it still consumes about 4.5 mins to complete the FCT search of Q1, which is much more expensive than the roughly 1.83 mins of our parallel platform consisting of 8 worker nodes. Therefore, the following experimental studies only focus on the evaluation of our approach on the parallel platform.

6.2 Response Time

Figure 6 - Figure 8 provide the response time of FCT search when we process the selected keyword queries over the TPC-H datasets with different sizes: 1GB, 5GB, 10GB and 20GB, respectively. For processing the query batches Q4 - Q9, we first merge the two relations CUSTOMER and ORDERS, and then run the multiway-based MapReduce join by taking LINEITEM as the fact relation.

From the experimental results, we find that most of the time is spent on the MapReduce1st stage. For example, for the query batch Q1, the MapReduce1st stage consumes 1.25 mins while the MapReduce2nd stage takes 0.6 mins for the 1GB dataset; the first stage consumes 3.25 mins while the second takes 0.8 mins for the 5GB dataset; the first stage consumes 6.1 mins while the second takes 0.81 mins for the 10GB dataset; and the first stage consumes 11.33 mins while the second takes 0.93 mins for the 20GB dataset. For the other query batches, we make the similar observation that the first stage takes a high percentage of the total response time. In addition, from the experimental results, we find that map() in the MapReduce1st stage takes about 0.33 mins to load a block of size 64MB, and the loading balance can be guaranteed by splitting the dataset into multiple blocks.

For instance, consider Q1, the 5GB dataset and 8 mappers: the dataset is split into 68 map tasks and each mapper approximately loads 8 blocks. As such, the map stage may take about 0.33 * 8 = 2.64 mins to finish all the mappers' workloads. At the reduce stage, shuffle() takes a higher time cost than sort() and reduce(), e.g., shuffling spends 1.91 mins while sorting takes 0.1 mins and reduce() takes 0.43 mins for one reducer when processing the 5GB dataset using 8 reducers. Fortunately, the shuffling can be processed in parallel with the map stage. Based on this, the response time can be minimized to max{2.64, 1.91} + 0.1 + 0.43 = 3.17 mins with regard to the 5GB dataset and 8 worker nodes.

From the result analysis, we can see that the total response time is constrained by the maximum of the loading time and the shuffling time. To reduce the loading time, we can add more worker nodes to the cluster. To reduce the shuffling time, we send each fact tuple to one reduce task and copy the required dimension tuples into their corresponding reduce tasks based on our proposed scheduling strategy, which reduces the number of shuffling operations because the fact relation is generally much larger than the dimension relations.

6.3 Reduce Shuffle Size

Figure 9 - Figure 11 show the reduce shuffle space usage when we process the selected keyword queries over the TPC-H datasets with different sizes: 1GB, 5GB, 10GB and 20GB, respectively. From Figure 9, we find that the shuffle space usage approximately takes 20% of the dataset size. From Figure 10, we find that the shuffle space usage approximately takes 10% of the dataset size. From Figure 11, we find that the shuffle space usage approximately takes 17% of the dataset size. This is because for Chain-Type and Mix-Type queries, the ORDERS relation can be reduced by joining with the CUSTOMER relation, which reduces the number of ORDERS tuples involved in the multiway-join of MapReduce. Particularly, for Chain-Type queries, the multiway-join uses two attributes as the composite key, e.g., suppkey and orderkey in Figure 5. In this case, the number of copies for a dimension tuple is much smaller than that of the Star-Type, which takes three attributes as the composite key.

From the experimental results of Q1 over the 10GB dataset, we find that the shuffle space usages of the 8 reducers are unbalanced. For half of the reducers, the individual shuffle space cost is approximately 344MB, in which the number of reduce input records is 10,843,452. For the other four reducers, the individual shuffle space cost is approximately 144MB, in which the number of reduce input records is 4,070,586. From the result analysis, we can see that unbalanced shuffling often happens when we deal with big data, which may affect the total performance greatly. This is also the reason that the shuffling time cost takes a high percentage of the time cost of the reduce stage at MapReduce.

6.4 Verifying Uneven Distribution-based Shuffling Strategy

[Figure 12: Response Time of Processing Uneven Data Distribution (No-Adjust vs. Adjust for Q1, Q4, Q5).]

[Figure 13: Reduce Shuffle Cost of Processing Uneven Data Distribution (No-Adjust vs. Adjust for Q1, Q4, Q5).]

In order to verify the uneven distribution-based shuffling strategy, we modify the 10GB dataset into a dataset with much higher data skew by removing some tuples and repeatedly adding some other tuples. To make a tradeoff between the estimated precision