hMDAP: A Hybrid Framework for Multi-paradigm Data Analytical Processing on Spark Xiaowang Zhang Jiahui Zhang Zhiyong Feng SchoolofComputerScience SchoolofComputerScience SchoolofComputerSoftware andTechnology andTechnology TianjinUniversity,P.R.China TianjinUniversity,P.R.China TianjinUniversity,P.R.China TianjinKeyLaboratoryof TianjinKeyLaboratoryof TianjinKeyLaboratoryof CognitiveComputingand CognitiveComputingand CognitiveComputingand Application Application Application Tianjin300350,P.R.China Tianjin300350,P.R.China Tianjin300350,P.R.China [email protected] 7 [email protected] [email protected] 1 0 2 ABSTRACT toprocessthisdatafortimelyandaccuratedecisions[1]. Forthis n challenge, big data analysis [16] has become a tool to slove the WeproposehMDAP,ahybridframeworkforlarge-scaledataan- a problem. The primary goal of big data analysis is to help com- alytical processing on Spark, to support multi-paradigm process J panies make more informed business decisions by enabling data (incl. OLAP, machine learning, and graph analysis etc.) in dis- 6 scientists,predictivemodelersandotheranalyticsprofessionalsto tributedenvironments. Theframeworkfeaturesathree-layerdata 1 analyzelargevolumesoftransactiondata, aswellasotherforms processmoduleandabusinessprocessmodulewhichcontrolsthe ofdatathatmaybeuntappedbyconventionalbusinessintelligence former.WewilldemonstratethestrengthofhMDAPbyusingtraf- ] programs[16]. Recently,manytechniqueshavebeensuccessfully B ficscenariosinarealworld. developpedforprovidingbigdataanalysisinvariousapplications. D Forexample,OracleBigdata[14]buildsonHadoop[10]through Keywords . OracleDirectConnectorconnectingHadoopandOracledatabases. s Dataanalyticalprocessing;OLAP;multi-paradigm;Spark SQLServer2012[17]providestheextensionserviceofOLAPand c [ businessintelligenceonHadooptosupportbigdataanalysis.IBM 1. INTRODUCTION SmartCloud provides a Hadoop-based analytical software InfoS- 1 phereBigInsights[20]whichcanconnectwithIBMDB2. How- v Data analysis has become a useful technique to organize, pro- ever,thoseexistingtechniquesofbigdataanalysisaremostlybased 2 cess, and analyze large amounts of data in order to obtain use- onOLAPwhichisnoteffectivetoprocessdatainvariousmodels 8 fulknowledgeeffectivelysuchashiddenpatterns, implicitcorre- (e.g., semi-structure [16]), they do not always bring highly accu- 1 lations, futuretrends, customerpreferences, valuablebusinessin- rateanalysisduetothevarietyandvariabilityofbigdatainacom- 4 formation etc [3]. OLAP (online analytical processing) [6], as a plicatedapplication–forexample,thereal-timedataontheperfor- 0 key technology to provide rapid access to data (mostly relational mance of traffic applications or of mobile applications. Besides, . data) for analysis via multidimensional structures, enables users 1 howtoprocessbigdataanalysisefficientlyisalwaysanimportant (e.g.,analysts,managers,executivesetc.)togainusefulknowledge 0 problemwhenthescaleofbigdatagrowsexponentially[5]. from data in a fast, consistent, interactive accessing way. There 7 In this demonstration, we propose a hybrid framework for big aremanypopularenterprisedatabasemanagementsystemsforsup- 1 dataanalysisonApacheSpark[12](ahigh-performancecomput- v: preonrtticnogmOpLuAtinPg.Feonrgienxeamfoprloen,lOinreacalneaOlyLtiAcaPl[p8r,o1ce8s]siisngO.rIaBclMe’scocumr-- ing architecture) which builds on HDFS of Hadoop. The frame- i pany based on the DB2 database proposes the IBM DB2 OLAP workfeaturesathree-layerdataprocessmoduleandabusinesspro- X cessmodulewhichcontrolstheformer.Withinthisframework,we Server[4,2]whichcananalyzetherelationaldatabasequicklyand r cansupportmulti-paradigmdataprocess(i.e.,atechnicalconnec- directly. Microsoft also provides SQL Server Analytic Services a tivitybetweenvariousdisparateprocess[21])inordertoimprove (SSAS)[19,13]supportingforOLAPtoanalyzeinformation,ta- theaccuracyofanalysis,wherevariousbigdataanalysistechniques bles,andfilesscatteredacrossmultipledatabases. (incl. OLAP,machinelearning,andgraphanalysisetc.) areinter- The characteristics of big data is not confined to only volume operatedtoprocesstheanalysisofvariousapplicationsofbigdata andvelocity;itisalsoreferredbythevariety,variabilityandcom- (incl. data cube [9], intelligent prediction, and complex network plexityofthedata[11,7]. Duetothevolume,varietyandvelocity etc.) respectively. Moreover, our proposed framework built on atwhichthedatagrows,itisextremelydifficultfororganisations Spark can process large-scale data efficiently. Finally, we imple- ment hMDAP and demonstrate the strength of hMDAP by using Permissiontomakedigitalorhardcopiesofallorpartofthisworkforpersonalor trafficscenariosinarealworld. classroomuseisgrantedwithoutfeeprovidedthatcopiesarenotmadeordistributed forprofitorcommercialadvantageandthatcopiesbearthisnoticeandthefullcita- tiononthefirstpage. Copyrightsforcomponentsofthisworkownedbyothersthan ACMmustbehonored.Abstractingwithcreditispermitted.Tocopyotherwise,orre- 2. ARCHITECTURE publish,topostonserversortoredistributetolists,requirespriorspecificpermission and/[email protected]. In Figure 1, we depict the architecture of our framework con- XXXXXX sistedoffourparts:thestoragemanagement,theresourceschedul- ing,thequeryanalysisandthebusinessprocess. Inthefollowing (cid:13)c 2017ACM.ISBN978-1-4503-2138-9. DOI:XX.XXX/XXXX sections,wewillintroduceeachpartindetail. Figure3:Resourcescheduling. Themoduleofthequeryandanalysisislocatedonthetopofthe framework. Itisnotonlytheentrancetoprovideservices,butalso provides the standard syntax and semantic specification of multi- paradigmdataanalyticalprocessing.Atpresent,HiveQLissimilar tothestandardSQL,whichisorientedtotheclassicOLAPtask, anddoesnotdealwiththequerylanguagebasedonMLanalysis and graph data analysis. On the basis of not changing the exist- ingquerylanguagesyntaxstandard,wedevelopamultiparadigm forlargedatafusionanalysisquerylanguageexpandedofmachine learning(ML)andgraphanalysis. Our big data analysis and processing of the query language is Figure1:ThehMDAParchitecture. based on the improvement of the fusion of SQL and HiveQL in multi-paradigm.Firstofall,weanalyzethesupportofHiveQLand 2.1 Storagemanagement SQLrespectivelyandcounttheamountofoperationswhichcanbe In Figure 2, there are two parts, the physical storage and the supportedbythetraditionalrelationalalgebramodel. Onthebasis logicalstorage. Therapidgrowthofdatamakesthephysicalstor- oftherelationalalgebramodel,weaddothernecessaryoperatorsto age of data from single source storage to distributed storage. In constructanextensionofthealgebraiclanguagemodel,whichcan ordertosolvethestorageofmulti-sourcedata,weadopttheexist- fullysupporttheoperationofHiveQLandstandardSQL.Forthe ingdistributedfilesystem. Inourframework,itisHDFS(Hadoop operatorwithhighercomplexity,itissplitintosmallersuboperator DistributedFileSystem[10]). or used other methods to optimize it. For the ML analysis, we Besides,itproductsmanytypesofdataduetothedifferentneeds countthecommonlyusedanalyticalprocessingmethods, suchas ofapplications,suchastables,texts,RCFile(thefiletypeofHive) classificationandclustering,anddefinetheabstractinterfacesfor andsequencedata.Inordertousethesedifferenttypesofdata,we commonMLanalysisprocessingmethods. Forthegraphanalysis compose the abstract relational views by designing the metadata processing,wealsocountthecommonlyusedanalyticalprocessing withsemanticstoconvertdatatypestotherelationaldatawecan methods,suchastheshortestpathalgorithm,anddefinetheabstract handle. interfacesforthem. In this module, the framework also relates to the implementa- tion of the OLAP on the relational database and ML and graph dataprocessingtasksonthedistributedframework.Thetraditional relationaldatabasequeryoptimizationmethodisnolongerappli- cable to this situation. According to the different characteristics ofrelationalstoragemanagementqueryengineanddistributedfile systemofcomputingengine,wesummarizethequeryinformation andoptimizetheperformance.Firstly,weinvestigatethestatistical indexsystemusedintraditionaldatabaseandanalyzetheinterac- tionbetweeneachindexandtheindexinthesystem.Then,foreach indexintheindexsystemofstatisticalinformation,wedesigneffi- cientandaccuratesamplingmethodstocalculatethecostmodelin queryoptimization. Accordingtotheabovestatistics,wecanalso Figure2:Storagemanagement. designastorageandmaintenanceprogramswhichiseasytoupdate andmanage. Andwemayusethecostmodelinthetraditionalre- 2.2 Resourcescheduling lationaldatabasetodesignanewcostmodelwhichcanreflectthe In our framework, the development is based on Spark and the querycostofthemixeddata. moduleoftheresourceschedulingisassignedtoSpark. TheFig- Figure 4 displays the query analysis. The main architectural ure 3 depicts the resource scheduling in our framework. We use componentsofthequeryanalysisareQueryandDataAnalysisPro- MySQL[?] toqueryoverrelationaldatabase. ThepartofMLlib cessTools(DAPTools).Inthefirstpart,wecanquerybySQLorthe isSparkmachinelearninglibrary. Wecallthefunctionsintheli- functionuserdefinedasspecifiedformat. TheDAPtoolscontain brarytocompute.GraphXisthegraphquerymoduleofSpark.We classicalOLAP,DAPonmachinelearningandDAPongraph. userittoquerygraphsanditprovidesapossibilitytotransformthe 2.4 Businessprocess differentdataformatstographtoquery. Ourframeworkprovidesananalysismethodforthelargescale 2.3 Queryanalysis data analysis process. But in the face of complex business pro- Figure4:Queryanalysis. cessesindifferentfields, weneedthedomainknowledgeandac- cordingtothedomainknowledge,wecandesignthemulti-paradigm fusionofanalysistask. Wecandrawlessonsfromthemethodof servicecompositioninserviceorientedarchitecturedesign. In this module, we need to do two things: developing a multi paradigmfusionanalysisprocessorchestrationlanguagesyntaxand thecomplexbusinessprocessschedulingmethod. Inthefirstpart, we need to analyze the patterns and characteristics of service or- chestrationlanguageinserviceorientedarchitecturedesignandde- Figure5:Businessprocess. signanabstractmodeloftheexecutableprocess. Onthebasisof theabstractmodel, wesummarizethebasicactivitiesofcomplex business process analysis. Finally, we define the grammar of the businessprocess. Inthesemantic, weneedtoresearchandanal- ysis the meanings of basic business activities and define the start point,end point and the basic command. In the second part, we needtostudyandanalyzecomplexbusinessprocessesinpractical applications.Then,webuildcomplexbusinessprocessmodelsand refinethewaytoexchangemessagesinpublicbusinessprocesses. After that, we need to control the interaction of each part of the resourcesthroughtheinteractionsequenceofmessages,achieving areasonablecallforeachresourceservice. Westillneedtoinves- tigatetheapplicabilityofexistingobject-orienteddesignpatterns. For the analysis of complex business process integration model, wedesigndatabusinessprocesses.Werefinethedesignpatternsin complexbusinessprocessesbasedontheadvantagesandprinciples ofexistingdesignpatterns. Figure6:QueryinterfaceofhMDAP. Intherealworld,thebusinessprocessmodeliscomplexandit takesalotoftimetoanalyze.TheFigure5illustratesthedetailsof train and calculate and the parameters of the training of the ma- thebusinessprocessinourframework. chinelearningalsocomesfromtheconfigurationfiles. Finally,the Theuserneedstowritetheconfigurationfilesbeforeheorshe frameworkmakesajoinoftheresultsoftwoparts. submits the query. The format of configuration files are shown 3. DEMONSTRATION in Section 3. When the user submits a query to the framework, thequeryandanalysismoduleintheframeworkstartstoparsethe In this section, we present the interface of hMDAP based in user’squery. Thismoduleparsesqueriesaccordingtopredefined Javascript,whichcommunicatewiththeserviceinJava. Weshow semantics,suchasXML(ExtensibleMarkupLanguage).Themod- thescreenshotofhMDAPinFigure6andtheconfigurationfilewe ule transforms the user’s query to two parts, the query over rela- mentionedaboveinFigure7. tional databases and the query in machine learning. We default Theinterfaceiscomposedasfollows: thattheuser’squeryincludingthequeryoverrelationaldatabases • ConfigurationofMachineLearning: itisatexttoinputthe andthemoduledetermineswhetherornottocarryoutthequery pathoftheconfigurationfileofthemachinelearningalgo- in machine learning. We think that when the result of the query rithm,suchasparameters. overrelationaldatabasesisnull,theframeworkbeginstoqueryin • ConfigurationofRelationDatabase: itisatexttoinputthe machinelearning. Aftertheanalysismodule,theframeworkuses pathoftheconfigurationfileoftherelationaldatabases,such thequeryoverrelationaldatabasesandtheinformationaboutthe astheusername. databaseswhichisreadfromtheconfigurationfilestoquerythere- • Results:itisatexttodisplaytheresultsofthebackground. lationaldatabases.Then,theframeworkrunsthequeryinmachine • Run:itisabuttontostarttheprogramandwhentheprogram learning.Theinputofmachinelearningistheresultofqueryingby runsover,theresultsareshownintheResults. relationaldatabaseswhichthequerystatementisstoredinthecon- • Save: itisabuttontosavethecontextfromResultsintext figurationfiles. Andtheparametersofthemachinelearningalgo- fileandatthesametime,emptyallthetextboxcontents. rithmisalsostoredintheconfigurationfiles.Whentheframework • Cancel:canceltherunningofthisprogramandemptyallthe getstheinformationofthemachinelearningalgorithm,itstartsto textboxcontents. http://bigdataanalyticsnews.com/ importance-of-big-data-analytics-for-business-growth/ [2] BaragoinC.,BercianosJ.,KomelJ.,RobinsonG.,SawaR., andSchuinderE.(2001).DB2OLAPservertheoryand practices.InternationalTechnicalSupportOrganization. [3] BersonA.andJ.SmithS.(1997).Datawarehousing,data mining,andOLAP.McGraw-Hill. [4] BontempoC.andZagelowG.(1998).TheIBMdata warehousearchitecture.Commun.ACM,1998,41(9):38-48. [5] CampbellP.(editor).(2008).Bigdata:scienceinthe petabyteera.Nature,455:1âA˘S¸136. Figure7:Theconfigurationfileofthemachinelearning. [6] ChaudhuriS.andDayalU.(1997).Anoverviewofdata warehousingandOLAPtechnology.SIGMODRecord, Andthedetailsoftheconfigurationfileisasfollows: • configuration:itisthebeginningoftheconfigurationfile. 26(1):65–74. • input: itisthetrainingdatasetofthemachinelearningalgo- [7] ChenH.,ChiangR.H.L.,andStoreyV.C.(2012).Business rithm. intelligenceandanalytics:FrombigdatatobigImpact.MIS • database: itindicatesthattheinputdatasetcomesfromthe quarterl,36(4):1165–1188. relationaldatabaseasfollowinginformation [8] DodgeG.andGormanT.(1998).Oracledatawarehousing. • url, user, password: they are the parameters to connect to JohnWiley&Sons,Inc. therelationaldatabase,thelocationofthedatabase,theuser [9] GrayJ.,ChaudhuriS.,BosworthA.,LaymanA.,Reichart nameandthepasswordoftheuser. D.,VenkatraoM.,PellowF.,andPiraheshH.(1997).Data • sql:itisthestatementtoquerytherelationaldatabase. cube:Arelationalaggregationoperatorgeneralizing • parameter:thecontentsunderthislabelaretheparametersof group-by,cross-tab,andsub-totals.DataMin.Knowl. themachinelearningalgorithmsexcepttheinputparameter. Discov.,1(1):29-53. • value: aseriesoftheselabelsarethevaluesoftheparame- [10] Hadoop.(2015).http://hadoop.apache.org/ ters. [11] Manyika,J.,Chui,M.,Brown,B.,Bughin,J.,Dobbs,R., • algorithm: it is the name of algorithm. For example, the Roxburgh,C.,andH.Byers,A.(2011).Bigdata:Thenext value of algorithm is KMeans and our framework runs the frontierforinnovation,competition,andproductivity. algorithm named KMeans which is defined in our library. McKinseyGlobalInstitute. User can customize the algorithm and give the location of [12] MengX.,K.BradleyJ.,YavuzB.,R.SparksE., thealgorithminthislabel. VenkataramanS.,LiuD.,FreemanJ.,B.TsaiD.,AmdeM., Beforerunningtheinterface, theusershouldwritetwoconfig- OwenS.,XinD.,XinR.,M.FranklinM.,ZadehZ.,Zaharia uration files, the configuration of machine learning algorithms as M.,andTalwalkarA.(2016).MLlib:Machinelearningin Figure7andtheconfigurationofrelationaldatabasesthatthecon- ApacheSpark.J.Mach.Learn.Res.,17:1–7. textsarethepartsof<database>inFigure7.Whentheuserwrites [13] MicrosoftAnalysisservices.(2016).SQLserver2016and twofiles,heorsheshouldwritethepathsofthefilesinthetextson later. theinterface. Then,clickthebuttonRun. Iftheuserwantstosave https://msdn.microsoft.com/en-us/library/bb522607.aspx the results, he or she clicks the button of Save. If the user don’t [14] PlunkettT.,MacdonaldB.,NelsonB.,HornickM.,SunH., needtheresults,heorsheclicksthebuttonofCancel. MohiuddinK.,HardingD.,MishraG.,StackowiakR.,Laker, 4. CONCLUSION K.,andSegleauD.(2013).Oraclebigdatahandbook. Inthisdemonstration,weproposedhMDAP,ahybridframework McGraw-HillOsborneMedia forlarge-scaledataanalyticalprocessingtosupportmulti-paradigm [15] P.ShethA.,KochutK.,A.MillerJ.,WorahD.,DasS.,Lin process on Spark. The multi-paradigm processing mechanism of C.,PalaniswamiD.,LynchJ.,andShevchenkoI.(1996). hMDAPcanprovidetheinteroperabilityofdataanalyticalprocess Supportingstate-wideimmunisationtrackingusing techniquestoprocessdatawhichmightbenoteffectivelyhandled multi-paradigmworkflowtechnology.In:Proc.ofVLDB’96, ifweonlyapplysingledataanalyticalprocesstechnique. Onthe pp.263–273. other hand, hMDAP takes advantage of the high-performance of [16] RouseW.(2012).Whatisbigdataanalytics? Sparkinprocessinglarge-scaledata.WebelievethathMDAPpro- TechTarget.comhttp://searchbusinessanalytics.techtarget. videsanewapproachtobigdataanalysisinamulti-paradigmway. com/definition/big-data-analytics Acknowledgments [17] SarkarD.(2013).MicrosoftSQLserver2012withHadoop. PacktPublishing ThisworkissupportedbytheprogramsoftheKeyTechnologyRe- [18] SchraderM.andVlamisD.(2009).OracleEssbase&Oracle searchandDevelopmentProgramofTianjin(16YFZCGX00210), OLAP.PeterGbolagadeAkintunde. the National Key Research and Development Program of China (2016YFB1000603), the National Natural Science Foundation of [19] SpoffordG.(2001).MDXsolutions:withMicrosoftSQL China(NSFC)(61672377), andtheOpenProjectofKeyLabora- serveranalysisservices.Wiley. tory of Computer Network and Information Integration, Ministry [20] ZikopoulosP.(2011).Understandingbigdata:Analyticsfor ofEducation(K93-9-2016-05). XiaowangZhangissupportedby enterpriseclassHadoopandstreamingdata.McGraw-Hill TianjinThousandYoungTalentsProgram. OsborneMedia 5. REFERENCES [21] zurMuehlenM.andRosemannM.(2004).Multi-paradigm processmanagement.In:Proc.ofCAiSE’04,pp.169–175. [1] AhmedH.(2015).Importanceofbigdataanalyticsfor businessgrowth.:BIGDataAnalyticsNews