OverFlow: Multi-Site Aware Big Data Management for Scientific Workflows on Clouds

Radu Tudoran, Member, IEEE, Alexandru Costan, Member, IEEE, and Gabriel Antoniu, Member, IEEE

R. Tudoran is with Inria / ENS Rennes, France. E-mail: [email protected]
A. Costan is with IRISA / INSA Rennes, France. E-mail: [email protected]
G. Antoniu is with Inria Bretagne Atlantique, Rennes, France. E-mail: [email protected]

Abstract—The global deployment of cloud datacenters is enabling large scale scientific workflows to improve performance and deliver fast responses. This unprecedented geographical distribution of the computation is doubled by an increase in the scale of the data handled by such applications, bringing new challenges related to the efficient data management across sites. High throughput, low latencies or cost-related trade-offs are just a few concerns for both cloud providers and users when it comes to handling data across datacenters. Existing solutions are limited to cloud-provided storage, which offers low performance based on rigid cost schemes. In turn, workflow engines need to improvise substitutes, achieving performance at the cost of complex system configurations, maintenance overheads, reduced reliability and reusability. In this paper, we introduce OverFlow, a uniform data management system for scientific workflows running across geographically distributed sites, aiming to reap economic benefits from this geo-diversity. Our solution is environment-aware, as it monitors and models the global cloud infrastructure, offering high and predictable data handling performance for transfer cost and time, within and across sites. OverFlow proposes a set of pluggable services, grouped in a data scientist cloud kit. They provide the applications with the possibility to monitor the underlying infrastructure, to exploit smart data compression, deduplication and geo-replication, to evaluate data management costs, to set a tradeoff between money and time, and optimize the transfer strategy accordingly. The system was validated on the Microsoft Azure cloud across its 6 EU and US datacenters. The experiments were conducted on hundreds of nodes using synthetic benchmarks and real-life bio-informatics applications (A-Brain, BLAST). The results show that our system is able to model accurately the cloud performance and to leverage this for efficient data dissemination, being able to reduce the monetary costs and transfer time by up to 3 times.

Index Terms—Big Data, scientific workflows, cloud computing, geographically distributed, data management

1 INTRODUCTION

With their globally distributed datacenters, cloud infrastructures enable the rapid development of large scale applications. Examples of such applications running as cloud services across sites range from office collaborative tools (Microsoft Office 365, Google Drive), search engines (Bing, Google), global stock market analysis tools to entertainment services (e.g., sport events broadcasting, massively parallel games, news mining) and scientific workflows [1]. Most of these applications are deployed on multiple sites to leverage proximity to users through content delivery networks. Besides serving the local client requests, these services need to maintain a global coherence for mining queries, maintenance or monitoring operations, that require large data movements. To enable this Big Data processing, cloud providers have set up multiple datacenters at different geographical locations. In this context, sharing, disseminating and analyzing the data sets results in frequent large-scale data movements across widely distributed sites.
The targeted applications are compute intensive, for which moving the processing close to the data is rather expensive (e.g., genome mapping, physics simulations), or simply need large-scale end-to-end data movements (e.g., organizations operating several datacenters and running regular backup and replication between sites, applications collecting data from remote sensors, etc.). In all cases, the cost savings (mainly computation-related) should offset the significant inter-site distance (network costs). Studies show that the inter-datacenter traffic is expected to triple in the following years [2], [3]. Yet, the existing cloud data management services typically lack mechanisms for dynamically coordinating transfers among different datacenters in order to achieve reasonable QoS levels and optimize the cost-performance trade-off. Being able to effectively use the underlying storage and network resources has thus become critical for wide-area data movements as well as for federated cloud settings.

This geographical distribution of computation becomes increasingly important for scientific discovery. In fact, many Big Data scientific workloads nowadays enable the partitioning of their input data. This allows most of the processing to be performed independently on the data partitions across different sites and then to aggregate the results in a final phase. In some of the largest scenarios, the data sets are already partitioned for storage across multiple sites, which simplifies the task of preparing and launching a geographically distributed processing. Among the notable examples we recall the 40 PB/year of data being generated by the CERN LHC.
The volume surpasses the capacity of a single site or a single institution to store or process it, requiring an infrastructure that spans multiple sites. This was the case for the Higgs boson discovery, for which the processing was extended to the Google cloud infrastructure [4]. Accelerating the process of understanding data by partitioning the computation across sites has proven effective also in other areas, such as solving bio-informatics problems [5]. Such workloads typically involve a huge number of statistical tests for asserting potentially significant regions of interest (e.g., links between brain regions and genes). This processing has proven to benefit greatly from a distribution across sites. Besides the need for additional compute resources, applications have to comply with several cloud providers' requirements, which force them to be deployed on geographically distributed sites. For instance, in the Azure cloud, there is a limit of 300 cores allocated to a user within a datacenter, for load balancing purposes; any application requiring more compute power will eventually be distributed across several sites.

More generally, a large class of such scientific applications can be expressed as workflows, by describing the relationship between individual computational tasks and their input and output data in a declarative way. Unlike tightly-coupled applications (e.g., MPI-based) communicating directly via the network, workflow tasks exchange data through files. Currently, workflow data handling in clouds is achieved either using application-specific overlays that map the output of one task to the input of another in a pipeline fashion or, more recently, leveraging the MapReduce programming model, which clearly does not fit every scientific application.
When deploying a large scale workflow across multiple datacenters, the geographically distributed computation faces a bottleneck from the data transfers, which incur high costs and significant latencies. Without appropriate design and management, these geo-diverse networks can raise the cost of executing scientific applications.

In this paper, we tackle these problems by trying to understand to what extent the intra- and inter-datacenter transfers can impact the total makespan of cloud workflows. We first examine the challenges of single-site data management by proposing an approach for efficient sharing and transfer of input/output files between nodes. We advocate storing data on the compute nodes and transferring files between them directly, in order to exploit data locality and to avoid the overhead of interacting with a shared file system. Under these circumstances, we propose a file management service that enables high throughput through self-adaptive selection among multiple transfer strategies (e.g., FTP-based, BitTorrent-based, etc.). Next, we focus on the more general case of large-scale data dissemination across geographically distributed sites. The key idea is to accurately and robustly predict I/O and transfer performance in a dynamic cloud environment in order to judiciously decide how to perform transfer optimizations over federated datacenters. The proposed monitoring service dynamically updates the performance models to reflect changing workloads and varying network-device conditions resulting from multi-tenancy. This knowledge is further leveraged to predict the best combination of protocol and transfer parameters (e.g., multi-routes, flow count, multicast enhancement, replication degree) to maximize throughput or minimize costs, according to user policies. To validate our approach, we have implemented the above system in OverFlow, as part of the Azure Cloud, so that applications can use it based on a Software as a Service approach.

Our contribution, which extends previous work introduced in [6], [7], can be summarised as follows:
• We implement OverFlow, a fully-automated single- and multi-site software system for scientific workflow data management (Section 3.2);
• We introduce an approach that optimizes workflow data transfers on clouds by means of adaptive switching between several intra-site file transfer protocols using context information (Section 3.3);
• We build a multi-route transfer approach across intermediate nodes of multiple datacenters, which aggregates bandwidth for efficient inter-site transfers (Section 3.4);
• We show how OverFlow can be used to support large scale workflows through an extensive set of pluggable services that estimate and optimize costs, provide insights on the environment performance and enable smart data compression, deduplication and geo-replication (Section 4);
• We experimentally evaluate the benefits of OverFlow on hundreds of cores of the Microsoft Azure cloud in different contexts: synthetic benchmarks and real-life scientific scenarios (Section 5).

2 CONTEXT AND BACKGROUND

One important phase that contributes to the total makespan (and the consequent cost) of a scientific workflow executed on clouds is the transfer of its tasks and files to the execution nodes [8], [9]. Several file transfers are required to start the task execution and to retrieve the intermediate and output data. With file sizes increasing to the PB level, data handling becomes a bottleneck. The elastic model of the clouds has the potential to alleviate this through an improved resource utilization. In this section, we show that this feature has to be complemented by a wider insight about the environment and the application, on one hand, and a two-way communication between the workflow engine and the data management system, on the other hand.

2.1 Terminology

We define the terms used throughout this paper.

The cloud site / datacenter: is the largest building block of the cloud. It contains a broad number of compute nodes, which provide the computation power available for renting. A cloud, particularly a public one, has several sites, geographically distributed on the globe. The datacenter is organized in a hierarchy of switches, which interconnect the compute nodes and the datacenter itself with the outside world, through the Tier 1 ISPs. Multiple output endpoints (i.e., through Tier 2 switches) are used to connect the datacenter to the Internet [10].
The cloud deployment: refers to the VMs that host a user application, forming a virtual space. The VMs are placed on different compute nodes in separate network partitions in order to ensure fault tolerance and availability for the service. Typically, a load balancer distributes all external requests among the VMs. The topology of the VMs in the datacenter is unknown to the user. Several other aspects are invisible (or transparent) to users due to virtualization, e.g., the mapping of the IPs to physical nodes, the vicinity of VMs, the load introduced by the neighbouring VMs, if any, on the physical node, etc. A deployment is limited to a single site and, depending on commercial constraints and priorities, it can be limited in size. Therefore a multi-site application will rely on multiple deployments.

The multi-site cloud: is made up of several geographically distributed datacenters. An application that has multiple running instances in several deployments across multiple cloud datacenters is referred to as a multi-site cloud application. Our focus in this paper is on such applications. Although applications could be deployed across sites belonging to different cloud vendors (i.e., cloud federations), these are out of the scope of this work. We plan to address such issues in the context of hybrid clouds in future work.

2.2 Workflow data management

Requirements for cloud-based workflows. In order to support data-intensive workflows, a cloud-based solution needs to: 1) adapt the workflows to the cloud environment and exploit its specificities (e.g., running in the user's virtualized space, commodity compute nodes, ephemeral local storage); 2) optimize data transfers to provide a reasonable time to solution; 3) manage data so that it can be efficiently placed and accessed during the execution.

Data management challenges. Data transfers are affected by the instability and heterogeneity of the cloud network. There are numerous options, some providing additional security guarantees (e.g., TLS), others designed for a high collaborative throughput (e.g., BitTorrent). Data locality aims to minimize data movements and to improve end-application performance and scalability. Addressing data and computational problems separately typically achieves the contrary, leading to scalability problems for tomorrow's exascale datasets and millions of nodes, and yielding significant underutilization of the resources. Metadata management plays an important role as, typically, the data relevant for a certain task can be stored in multiple locations. Logical catalogs are one approach to provide information about data item locations.

Programming challenges. So far, MapReduce has been the "de-facto" cloud computing model, complemented by a number of variations of languages for task specification [11]. They provide some data flow support, but all require a shift of the application logic into the MapReduce model. Workflow semantics go beyond the map-shuffle-reduce-merge operations, and deal with data placement, sharing, inter-site data transfers etc. Independently of the programming model of choice, they need to address the following issues: support large-scale parallelism to maximize throughput under high concurrency, enable data partitioning to reduce latencies, and handle the mapping from task I/O data to cloud logical structures to efficiently exploit its resources.

Targeted workflow semantics. In order to address these challenges we studied 27 real-life applications [12] from several domains (bio-informatics, business, simulations etc.) and identified a set of core characteristics:
• The common data patterns are: broadcast, pipeline, gather, reduce, scatter.
• Workflows are composed of batch jobs with well-defined data passing schemes. The workflow engines execute the jobs (e.g., take the executable to a VM; bring the needed data files and libraries; run the job and retrieve the final result) and perform the required data passing between jobs.
• Inputs and outputs are files, usually written once.
• Tasks and inputs/outputs are uniquely identified.

Our focus is on how to efficiently handle and transfer workflow data between the VMs in a cloud deployment. We argue that keeping data in the local disks of the VMs is a good option considering that, for such workflows, most of the data files are usually temporary: they must exist only to be passed from the job that produced them to the one that will further process them.
With our approach, the files from the virtual disk of each compute node are made available to all other nodes within the deployment. Caching the data where it is produced and transferring it directly where it is needed reduces the time for data manipulation and minimizes the workflow makespan.

2.3 The need for site-aware file management for cloud-based workflows

As we move to the world of Big Data, single-site processing becomes insufficient: large scale scientific workflows can no longer be accommodated within a single datacenter. Moreover, these workflows typically collect data from several sources, and even the data acquired from a single source is distributed on multiple sites for availability and reliability reasons. Clouds let users control remote resources. In conjunction with a reliable networking environment, we can now use geographically distributed resources: dynamically provisioned distributed domains are built in this way over multiple sites. Several advantages arise from computations running on such multi-site configurations: resilience to failures, distribution across partitions (e.g., moving computation close to data or vice versa), elastic scaling to support usage bursts, etc.

However, in order to exploit the advantages of multi-site executions, users currently have to set up their own tools to move data between deployments, through direct endpoint-to-endpoint communication (e.g., GridFTP, scp, etc.). This baseline option is relatively simple to set in place, using the public endpoint provided for each deployment. The major drawback in this scenario is the low bandwidth between sites, which limits drastically the throughput that can be achieved.
Clearly, this paradigm shift towards multi-site workflow deployments calls for appropriate data management tools that build on a consistent, global view of the entire distributed datacenter environment. This is precisely the goal targeted by OverFlow.

3 ADAPTIVE FILE MANAGEMENT ACROSS CLOUD SITES WITH OVERFLOW

In this section we show how the main design ideas behind the architecture of OverFlow are leveraged to support fast data movements both within a single site and across multiple datacenters.

3.1 Core design principles

Our proposal relies on the following ideas:

• Exploiting the network parallelism. Building on the observations that a workflow typically runs on multiple VMs and that communication between datacenters follows different physical routes, we rely on multi-route transfer strategies. Such schemes exploit the intra-site low-latency bandwidth to copy data to intermediate nodes within the source deployment (site). Next, this data is forwarded towards the destination across multiple routes, aggregating additional bandwidth between sites.

• Modeling the cloud performance. The complexity of the datacenters' architecture, topology and network infrastructure makes simplistic approaches for dealing with transfer performance (e.g., exploiting system parallelism) less appealing. In a virtualized environment such techniques are at odds with the goal of reducing costs through efficient resource utilization. Accurate performance models are then needed, leveraging the online observations of the cloud behavior. Our goal is to monitor the virtualized environment and to predict performance metrics (e.g., transfer time, costs). As such, we argue for a model that provides enough accuracy for automating the distributed data management tasks.

• Exploiting the data locality. The cumulative storage capacity of the VMs leased in one's deployment easily reaches the order of TBs. Although tasks store their input and output files on the local disks, most of the storage remains unused. Meanwhile, workflows typically use remote cloud storage (e.g., Amazon S3, Azure Blobs) for sharing data [8]. This is costly and highly inefficient, due to high latencies, especially for temporary files that don't require persistent storage. Instead, we propose aggregating parts of the virtual disks in a shared common pool, managed in a distributed fashion, in order to optimize data sharing.

• Cost effectiveness. As expected, the cost closely follows performance. Different transfer plans of the same data may result in significantly different costs. In this paper we ask the question: given the cloud's interconnect offerings, how can an application use them in a way that strikes the right balance between cost and performance?

• No modification of the cloud middleware. Data processing in public clouds is done at user level, which restricts the application permissions to the virtualized space. Our solution is suitable for both public and private clouds, as no additional privileges are required.
3.2 Architecture

The conceptual scheme of the layered architecture of OverFlow is presented in Figure 1. The system is built to support at any level a seamless integration of new, user-defined modules, transfer methods and services. To achieve this extensibility, we opted for the Managed Extensibility Framework (http://msdn.microsoft.com/en-us/library/dd460648.aspx), which allows the creation of lightweight extensible applications, by discovering and loading at runtime new specialized services with no prior configuration.

Fig. 1. The extendible, server-based architecture of the OverFlow System.

We designed the layered architecture of OverFlow starting from the observation that Big Data applications require more functionality than the existing put/get primitives. Therefore, each layer is designed to offer a simple API, on top of which the layer above builds new functionality. The bottom layer provides the default "cloudified" API for communication. The middle (management) layer builds on it a pattern-aware, high performance transfer service (see Sections 3.3, 3.4). The top (server) layer exposes a set of functionalities as services (see Section 4). The services leverage information such as data placement, performance estimation for specific operations or cost of data management, which are made available by the middle layer. This information is delivered to users/applications, in order to plan and to optimize costs and performance while gaining awareness of the cloud environment.

The interaction of the OverFlow system with workflow management systems is done based on its public API. For example, we have integrated our solution with the Microsoft Generic Worker [12] by replacing its default Azure Blobs data management backend with OverFlow. We did this by simply mapping the I/O calls of the workflow to our API, with OverFlow leveraging the data access pattern awareness, as further detailed in Sections 5.6.1, 5.6.2. The next step is to leverage OverFlow for multiple (and ideally generic) workflow engines (e.g., Chiron [13]), across multiple sites. We are currently working jointly with Microsoft in the context of the Z-CloudFlow [14] project, on using OverFlow and its numerous data management services to process highly distributed workflows orchestrated by multiple engines.
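For illustration purposes only, the following Python sketch shows what "mapping the I/O calls of the workflow to our API" can look like from a workflow engine's perspective. The class and method names (OverFlowClient, MetadataRegistry, upload, download) are hypothetical stand-ins, not the actual OverFlow or Generic Worker interfaces; they anticipate the Metadata Registry and Transfer Manager components detailed in Section 3.3.

# Hypothetical facade showing how a workflow engine's read/write calls can be
# redirected from a remote blob store to OverFlow-style components: an upload
# merely advertises a file's location, a download resolves it and copies the
# file directly from the node that holds it.
import os, shutil, socket, tempfile

class MetadataRegistry:
    """In-memory stand-in for the distributed file-id -> location registry."""
    def __init__(self):
        self.records = {}
    def put(self, file_id, location):
        self.records.setdefault(file_id, []).append(location)
    def get(self, file_id):
        return self.records[file_id][0]

class OverFlowClient:
    def __init__(self, registry):
        self.registry, self.host = registry, socket.gethostname()
    def upload(self, file_id, path):
        # Advertise only: execution time does not depend on the data size.
        self.registry.put(file_id, {"host": self.host, "path": path})
    def download(self, file_id, dest):
        location = self.registry.get(file_id)
        shutil.copy(location["path"], dest)   # stands in for the real intra-site transfer
        self.registry.put(file_id, {"host": self.host, "path": dest})

if __name__ == "__main__":
    work = tempfile.mkdtemp()
    producer_out = os.path.join(work, "task_a_out.dat")
    with open(producer_out, "w") as f:
        f.write("intermediate data")
    client = OverFlowClient(MetadataRegistry())
    client.upload("task_a_out", producer_out)                            # producing task
    client.download("task_a_out", os.path.join(work, "task_b_in.dat"))   # consuming task
    with open(os.path.join(work, "task_b_in.dat")) as f:
        print(f.read())

In this scheme, an upload is metadata-only, so producing tasks return immediately, while consuming tasks pull data directly from the node that produced it.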
3.3 Intra-site transfers via protocol switching

In a first step, we focus on intra-site communication, leveraging the observation that workflow tasks usually produce temporary files that exist only to be passed from the job that produced them to the one further processing them. The simplified schema of our approach to manage such files is depicted in Figure 2.

Fig. 2. Architecture of the adaptive protocol-switching file management system [6]. Operations for transferring files between VMs: upload (1), download (2, 3, 4).

File sharing between tasks is achieved by advertising file locations and transferring the file towards the destination, without intermediately storing it in a shared repository. We introduce 3 components that enable file sharing across compute instances:

The Metadata Registry holds the locations of files in VMs. It uses an in-memory distributed hash-table to hold key-value pairs: file ids (e.g., name, user, sharing group etc.) and locations (the information required by the transfer module to retrieve the file). Several implementation alternatives are available: in-memory databases, Azure Tables, Azure Caching. We chose the latter as our preliminary evaluations showed that Azure Caching delivers better performance than Azure Tables (10 times faster for small items) and has a low CPU consumption footprint (unlike a database).

The Transfer Manager performs the transfers between the nodes by uploading and downloading files. The upload operation (arrow 1 in Figure 2) consists simply in advertising the file location, which is done by creating a record in the Metadata Registry. This implies that the execution time does not depend on the data size. The download operation first retrieves the location information about the data from the registry (arrow 2 in Figure 2), then contacts the VM holding the targeted file (arrow 3), transfers it and finally updates the metadata (arrow 4). With these operations, the number of reads and writes needed to move a file between tasks (i.e. nodes) is reduced.

Multiple options are available for performing the actual transfer. Our proposal is to integrate several of them and dynamically switch the protocol based on the context. Essentially, the system is composed of user-deployed and default-provided transfer modules and their service counterparts, deployed on each compute instance. Building on the extensibility properties, users can deploy custom transfer modules. The only requirement is to provide an evaluation function for scoring the context. The score is computed by weighting a set of parameters, e.g., number or size of files, replica count, resource load, data format, etc. These weights reflect the relevance of the transfer solution for each specific parameter. Currently, we provide 3 protocols among which the system adaptively switches:

• In-Memory: targets small file transfers or deployments which have large spare memory. Transferring data directly in memory boosts the performance especially for scatter and gather/reduce access patterns. We used Azure Caching to implement this module, by building a shared memory across the VMs.

• FTP: is used for large files that need to be transferred directly between machines. This solution is mostly suited for pipeline and gather/reduce data access patterns. To implement it we started from an open-source implementation [15], adjusted to fit the cloud specificities. For example, the authentication is removed, and data is transferred in configurable-sized chunks (1 MB by default) for higher throughput.

• BitTorrent: is adapted for broadcast/multicast access patterns, to leverage the extra replicas in collaborative transfers. Hence, for scenarios involving a replication degree above a user-configurable threshold, the system uses BitTorrent. We rely on the MonoTorrent [16] library, again customised for our needs: we increased the 16 KB default packet size to 1 MB, which translates into a throughput increase of up to 5 times.

The Replication Agent is an auxiliary component to the sharing functionality, ensuring fault tolerance across the nodes. The service runs as a background process within each VM, providing in-site replication functionality alongside the geographical replication service described in Section 4.2. In order to decrease its intrusiveness, transfers are performed during idle bandwidth periods. The agents communicate and coordinate via a message-passing mechanism that we built on top of the Azure Queuing system. As a future extension, we plan to schedule the replica placement in agreement with the workflow engine (i.e. the workflow semantics).
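The protocol-switching mechanism described in this section can be summarized by the sketch below: each transfer module exposes an evaluation function that scores the current context, and the highest-scoring protocol is selected. The concrete weights, thresholds and the linear scoring used here are illustrative assumptions; only the candidate protocols and the context parameters follow the text.

# Illustrative sketch of context-based protocol switching (Section 3.3). The
# scoring weights and the scoring functions themselves are assumptions made
# for illustration, not the values used by OverFlow.

CONTEXT = {                      # example context for one transfer
    "file_size_mb": 512,
    "replica_count": 4,
    "spare_memory_mb": 2048,
    "broadcast": True,
}

def score_in_memory(ctx):
    # Favored for small files or deployments with large spare memory.
    return (1.0 if ctx["file_size_mb"] <= 64 else 0.0) + \
           (0.5 if ctx["spare_memory_mb"] > 4 * ctx["file_size_mb"] else 0.0)

def score_ftp(ctx):
    # Favored for large files moved directly between two machines.
    return (1.0 if ctx["file_size_mb"] > 64 else 0.0) + \
           (0.5 if not ctx["broadcast"] else 0.0)

def score_bittorrent(ctx):
    # Favored for broadcast/multicast with a replication degree above a threshold.
    replica_threshold = 3        # user-configurable in the real system
    return (1.0 if ctx["broadcast"] else 0.0) + \
           (1.0 if ctx["replica_count"] >= replica_threshold else 0.0)

def pick_protocol(ctx, scorers):
    # Each pluggable module contributes its own evaluation function; the
    # highest-scoring protocol is used for the transfer.
    return max(scorers, key=lambda name: scorers[name](ctx))

if __name__ == "__main__":
    scorers = {"in-memory": score_in_memory, "ftp": score_ftp,
               "bittorrent": score_bittorrent}
    print(pick_protocol(CONTEXT, scorers))   # -> 'bittorrent' for this context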
3.4 Inter-site data transfers via multi-routes

In a second step, we move to the more complicated case of inter-site data transfers. Sending large amounts of data between 2 datacenters can rapidly saturate the small interconnecting bandwidth. Moreover, due to the high latency between sites, switching the transfer protocol as for intra-site communication is not enough. To make things worse, our empirical observations showed that the direct connections between the datacenters are not always the fastest ones. This is due to the different ISP grids that connect the datacenters (the interconnecting network is not the property of the cloud provider). Considering that many Big Data applications are executed on multiple nodes across several sites, an interesting option is to use these nodes and sites as intermediate hops between source and destination.

Fig. 3. The multi-route transfer schema based on intermediate cloud nodes for inter-site transfers.

The proposed multi-route approach that enables such transfers is shown in Figure 3. Data to be transferred is first replicated within the same site (green links in Figure 3) and it is then forwarded to the destination using nodes from other intermediate datacenters (red arrows in the Figure). Instances of the inter-site transfer module are deployed on the nodes across the datacenters and used to route data packages from one node to another, forwarding data towards the destination. In this way, the transfers leverage multiple paths, aggregating extra bandwidth between sites. This approach builds on the fact that virtual routes of different nodes are mapped to different physical paths, which cross distinct links and switches (e.g., datacenters are interconnected with the ISP grids by multiple layer 2 switches [10]). Therefore, by replicating data within the site, which is up to 10 times faster than the inter-site transfers, we are able to increase the performance when moving data across sites. OverFlow also supports other optimizations: data fragmentation and re-composition using chunks of variable sizes, hashing, acknowledgement for avoiding data losses or packets duplication. One might consider the acknowledgement-based mechanism redundant at application level, as similar functionality is provided by the underlying TCP protocol. We argue that this can be used to efficiently handle and recover from possible cloud nodes failures.

Our key idea for selecting the paths is to consider the cloud as a 2-layer graph. At the top layer, a vertex represents a datacenter. Considering the small number of datacenters in a public cloud (i.e., less than 20), any computation on this graph (e.g., determine the shortest path or second shortest path) is done very fast. On the second layer, each vertex corresponds to a VM in a datacenter. The number of such nodes depends on application deployments and can be scaled dynamically, in line with the elasticity principle of the clouds. These nodes are used for the fast local replication with the purpose of transferring data in parallel streams between the sites. OverFlow selects the best path, direct or across several sites, and maximizes its throughput by adding nodes to it. Nodes are added on the bottom layer of the graph to increase the inter-site throughput, giving the edges based on which the paths are selected on the top layer of the graph. When the nodes allocated on a path fail to bring performance gains, OverFlow switches to new paths, converging towards the optimal topology.

Algorithm 1 implements this approach. The first step is to select the shortest path (i.e., the one with the highest throughput) between the source and the destination datacenters. Then, building on the elasticity principle of the cloud, we try to add nodes to this path, within any of the datacenters that form this shortest path. More nodes add more bandwidth, translating into an increased throughput along the path. However, as more nodes are added, the additional throughput brought by them will become smaller (e.g., due to network interferences and bottlenecks). To address this issue, we consider also the next best path (computed at lines 7-8). Having these two paths, we can compare at all times the gain of adding a node to the current shortest path versus adding a new path (line 12). The tradeoff between the cost and performance can be controlled by users through the budget parameter. This specifies how much users are willing to pay in order to achieve higher performance. Our solution then increases the number of intermediate nodes in order to reduce the transfer time as long as the budget allows it.

Algorithm 1 The multi-path selection across sites
 1: procedure MULTIDATACENTERHOPSEND
 2:   Nodes2Use = Model.GetNodes(budget)
 3:   while SendData < TotalData do
 4:     MonitorAgent.GetLinksEstimation()
 5:     Path = ShortestPath(infrastructure)
 6:     while UsedNodes < Nodes2Use do
 7:       deployments.RemovePath(Path)
 8:       NextPath = ShortestPath(deployments)
 9:       UsedNodes += Path.NrOfNodes()
          ▷ Get the datacenter with minimal throughput
10:       Node2Add = Path.GetMinThr()
11:       while UsedNodes < Nodes2Use &
12:             Node2Add.Thr >= NextPath.NormalizedThr do
13:         Path.UpdateLink(Node2Add)
14:         Node2Add = Path.GetMinThr()
15:       end while
16:       TransferSchema.AddPath(Path)
17:       Path = NextPath
18:     end while
19:   end while
20: end procedure
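As a complement to the pseudocode, the following simplified Python rendering captures the core decision of Algorithm 1 (lines 10-17): keep adding nodes to the current best path while its weakest link still outperforms the normalized throughput of the next best path, otherwise switch paths. The data structures, throughput figures and per-node gain factor are invented for illustration and do not reproduce the actual implementation.

# Simplified re-implementation of the decision rule in Algorithm 1: strengthen
# the current best inter-site path with extra nodes as long as its weakest
# datacenter still beats the next best path's normalized throughput.

def normalized_throughput(path):
    # A multi-hop path is limited by its slowest link, normalized by the
    # number of nodes it consumes.
    return min(path["links"].values()) / max(1, path["nodes"])

def build_transfer_schema(paths, nodes_budget):
    schema, used = [], 0
    paths = sorted(paths, key=normalized_throughput, reverse=True)
    while paths and used < nodes_budget:
        path, rest = paths[0], paths[1:]
        used += path["nodes"]
        while used < nodes_budget and rest and \
                min(path["links"].values()) >= normalized_throughput(rest[0]):
            weakest = min(path["links"], key=path["links"].get)
            path["links"][weakest] *= 1.5      # assumed gain from one extra node
            path["nodes"] += 1
            used += 1
        schema.append(path)
        paths = rest
    return schema

if __name__ == "__main__":
    candidate_paths = [
        {"name": "EU-North -> US-East",            "nodes": 2, "links": {"hop1": 40.0}},
        {"name": "EU-North -> EU-West -> US-East", "nodes": 3, "links": {"hop1": 90.0, "hop2": 35.0}},
    ]
    for p in build_transfer_schema(candidate_paths, nodes_budget=6):
        print(p["name"], "nodes used:", p["nodes"])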
4 LEVERAGING OVERFLOW AS A KEY TO SUPPORT GEOGRAPHICALLY DISTRIBUTED APPLICATIONS

We introduce a set of pluggable services that we designed on top of OverFlow, exploiting the system's knowledge about the global distributed environment to deliver a set of scientific functionalities needed by many large-scale workflows. The modular architecture allows to easily deploy new, custom plugins.

4.1 Cost estimation engine

One of the key challenges that users face today is estimating the cost of operations in the cloud. The typical pricing estimations are done simply by multiplying the timespan or the size of the data with the advertised costs. However, such a naive cost estimation fails to capture any data management optimizations or access pattern semantics, nor can it be used to give price hints about the data flows of applications. To address this issue, we designed a cost estimation service, based on a cost model for the management of workflow data across sites.

We start by defining the transfer time (Tt) as the ratio between the data size and the throughput, i.e., Tt = Size / Thr. In order to integrate the multi-route transfer scheme, we express the throughput according to the contribution of each of the N parallel routes. Additionally, we need to account for the overhead of replicating the data within the site compared with the inter-site communication, denoted LRO – local replication overhead. The idea is that the multi-route approach is able to aggregate throughput from additional routes as long as the total time to locally replicate data is less than the time to remotely transfer it. Empirically, we observed that local replication is about 10 times faster than inter-site communication, which can be used as a baseline value for LRO. Hence, we define the total throughput from the parallel streams as:

Thr = \sum_{i=1}^{N,\,N<LRO} thr_{link} \times \frac{LRO - (i-1)}{LRO}    (1)

With these formulas we can now express the cost of transferring data according to the flows of data within a workflow. The cost of a geographical transfer is split into the cost charged by the cloud provider for outbound data (outbound_{Cost}), as usually inbound data is free, and the cost derived from leasing the VMs during the transfer. This model can be applied also to local, in-site replication by ignoring the outbound cost component in the formula. The service determines automatically whether the source and the destination belong to the same datacenter based on their IPs. According to the workflow data dissemination pattern, we have the following costs:

• Pipeline – defines the transfers between a source and a destination as:

Cost_{Pipeline} = N \times Tt \times VM_{Cost} + outbound_{Cost} \times Size    (2)

• Multicast/Broadcast – defines the transfer of data towards several destination nodes (D):

Cost_{Multicast} = D \times (N \times Tt \times VM_{Cost} + outbound_{Cost} \times Size)    (3)

• Reduce/Gather – defines the cost of assembling into a node files or pieces of data from several other nodes (S):

Cost_{Gather} = \sum_{i=1}^{S} (N \times Tt_i \times VM_{Cost} + outbound_{Cost} \times Size_i)    (4)

One can observe that pipeline is a sub-case of multicast with one destination, and both these patterns are sub-cases of the gather with only one source. Therefore, the service can identify the data flow pattern through the number of sources and destination parameters. This observation allows us to gather the estimation of the cost in a single API function: EstimateCost(Sizes[], S, D, N). Hence, applications and workflow engines can obtain the cost estimation for any data access pattern, simply by providing the size of the data, the data flow type and the degree of parallelism. The other parameters, such as the links' throughput and the LRO factor, are obtained directly from the other services of OverFlow.
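The cost model translates directly into code. The sketch below implements Equations (1)-(4) behind a function mirroring the EstimateCost(Sizes[], S, D, N) signature; the VM price, outbound price and per-route throughput used here are placeholder values rather than actual Azure tariffs or monitored measurements.

# Sketch of the cost estimation engine (Equations 1-4). Prices and link
# throughput are placeholders; in the real system they come from the cloud
# provider's tariffs and the monitoring service.

LRO = 10            # local replication ~10x faster than inter-site (baseline in the text)

def aggregated_throughput(thr_link, n_routes):
    # Equation (1): each extra route contributes a diminishing share.
    return sum(thr_link * (LRO - (i - 1)) / LRO
               for i in range(1, min(n_routes, LRO - 1) + 1))

def estimate_cost(sizes, sources, destinations, n_routes,
                  thr_link=50.0,            # MB/s per route (placeholder)
                  vm_cost=0.12 / 3600.0,    # $/s of leased VM time (placeholder)
                  outbound_cost=0.09):      # $/GB of outbound traffic (placeholder)
    # 'sources' is implied by len(sizes); kept only to mirror the published signature.
    thr = aggregated_throughput(thr_link, n_routes)
    total = 0.0
    for size_mb in sizes:                   # one term per source (Equation 4)
        tt = size_mb / thr                  # transfer time, Tt = Size / Thr
        total += n_routes * tt * vm_cost + outbound_cost * size_mb / 1024.0
    return total * destinations            # D destinations (Equations 2-3 as sub-cases)

if __name__ == "__main__":
    # Pipeline: 1 source, 1 destination; Broadcast: 1 source, 4 destinations.
    print("pipeline :", round(estimate_cost([2048], sources=1, destinations=1, n_routes=4), 4))
    print("broadcast:", round(estimate_cost([2048], sources=1, destinations=4, n_routes=4), 4))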
This model captures the correlation between performance (time or throughput) and cost (money) and can be used to adjust the tradeoff between them, from the point of view of the number of resources or the data size. While urgent transfers can be sped up using additional routes at the expense of some extra money, less urgent ones can use a minimum set of resources at a reduced outbound cost. Moreover, leveraging this cost model in the geographical replication/transfer service, one can set a maximum cost for a transfer, based on which our system is able to infer the amount of resources to use. Although the network or end-system performance can drop, OverFlow rapidly detects and adapts to the new reality to satisfy the budget constraint.

4.2 Geographical replication

Turning geo-diversity into geo-redundancy requires the data or the state of applications to be distributed across sites. Data movements are time and resource consuming and it is inefficient for applications to suspend their main computation in order to perform such operations. Therefore, we introduce a plugin that releases applications and users from this task by providing Geo-Replication-as-a-Service on top of OverFlow. Applications simply indicate the data to be moved and the destination via an API function call, i.e., Replicate(Data, Destination). Then, the service performs the geographical replication via multi-path transfers, while the application continues uninterrupted.

Replicating data opens the possibilities for different optimization strategies. By leveraging the previously introduced service for estimating the cost, the geo-replication service is able to optimize the operation for cost or execution time. To this purpose, applications are provided with an optional parameter when calling the function (i.e., Policy). By varying the value of this parameter between 0 and 1, applications will indicate a higher weight for cost (i.e., a value of 0) or for time (i.e., a value of 1), which in turn will determine the amount of resources to use for replicating the data. This is done by querying the cost estimation service for the minimum and maximum times, the respective cost predictions, and then using the Policy parameter as a slider to select between them.

4.3 Smart compression

Data compression can have many flavors and declinations. One can use it to define the process of encoding the data based on a dictionary containing common sequences (e.g., zip) or based on previous sequences (e.g., mpeg), to search for already existing blocks of data (i.e., deduplication) or even to replace the data with the program which generated it. In the context of inter-site transfers, reducing the size of data is interesting as it can potentially reduce the cost (i.e., the price paid for outbound traffic) or the time to perform the transfer. Nevertheless, as compression techniques are typically time consuming, a smart tradeoff is needed to balance the time invested to reduce the data and the actual gains obtained when performing the transfer.

Our goal is to provide applications with a service that helps them evaluate these potential gains. First, the service leverages deduplication. Applications call the CheckDeduplication(Data, DestinationSite) function to verify in the Metadata Registry of the destination site if (similar) data already exist. The verification is done based on the unique ID or the hash of the data. If the data exist, the transfer is replaced by the address of the data at destination. This brings the maximum gains, both time and money wise, among all "compression" techniques. However, if the data are not already present at the destination site, their size can still potentially be reduced by applying compression algorithms. Whether to spend time and resources to apply such an algorithm, and the selection of the algorithm itself, are decisions that we leave to users, who know the application semantics.
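A minimal sketch of this deduplication-first logic is given below. The registry interface, the use of SHA-256 hashing and the send callback are assumptions made for illustration; only the intent of CheckDeduplication(Data, DestinationSite) follows the text.

# Minimal sketch of the deduplication-first step of the smart compression
# service (Section 4.3): if the destination already holds the data, the
# transfer is replaced by the remote address.
import hashlib

class DestinationRegistry:
    """Stand-in for the Metadata Registry at the destination site."""
    def __init__(self):
        self.by_hash = {}
    def lookup(self, digest):
        return self.by_hash.get(digest)      # -> remote address or None
    def register(self, digest, address):
        self.by_hash[digest] = address

def check_deduplication(data, destination_registry):
    digest = hashlib.sha256(data).hexdigest()
    return destination_registry.lookup(digest)

def transfer(data, destination_registry, send):
    existing = check_deduplication(data, destination_registry)
    if existing is not None:
        return existing                      # transfer replaced by the remote address
    address = send(data)                     # fall back to an actual (possibly compressed) transfer
    destination_registry.register(hashlib.sha256(data).hexdigest(), address)
    return address

if __name__ == "__main__":
    registry, sent = DestinationRegistry(), []
    fake_send = lambda d: sent.append(d) or "us-east:/data/%d" % len(sent)
    a = transfer(b"genome-chunk-42", registry, fake_send)   # real transfer
    b = transfer(b"genome-chunk-42", registry, fake_send)   # deduplicated
    print(a, b, "transfers performed:", len(sent))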
Second, the service supports user informed compression-related decisions, that is, compression-time or compression-cost gain estimation. To this purpose we extend the previous estimation model to include the compression operation. The updated execution-time model is given by Equation 5, while the cost is depicted in Equation 6.

TtC = \frac{Size}{Speed_{Compression}} + \frac{Size}{Factor_{Compression} \times Thr}    (5)

Cost_C = \sum_{i=1}^{S} \left( N \times TtC_i \times VM_{Cost} + outbound_{Cost} \times \frac{Size_i}{Factor_{Compression}} \right) \times D    (6)

Users can access these values by overloading the functions provided by the Cost Estimation Service with the Speed_{Compression} and Factor_{Compression} parameters. Obtaining the speed with which a compression algorithm encodes data is trivial, as it only requires to measure the time taken to encode a piece of data. On the other hand, the compression rate obtained on the data is specific to each scenario, but it can be approximated by users based on their knowledge about the application semantics, by querying the workflow engine or from previous executions. Hence, applications can query in real-time the service to determine whether by compressing the data the transfer time or the cost are reduced and to orchestrate tasks and data accordingly.
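Equations (5)-(6) can be wrapped in a small helper that tells an application whether compressing before the transfer pays off in time; the monetary variant follows analogously from Equation (6). The numeric inputs in the example are placeholders that a user would supply (e.g., from previous executions).

# Helper sketch around Equation (5): compare the plain transfer time with the
# compress-then-transfer time for user-supplied compression speed and ratio.
# All numeric inputs below are placeholder values.

def transfer_time(size_mb, thr):
    return size_mb / thr                                   # Tt = Size / Thr

def compressed_transfer_time(size_mb, thr, speed_compression, factor_compression):
    # Equation (5): time to encode + time to ship the reduced volume.
    return size_mb / speed_compression + size_mb / (factor_compression * thr)

def compression_pays_off(size_mb, thr, speed_compression, factor_compression):
    return compressed_transfer_time(size_mb, thr,
                                    speed_compression, factor_compression) \
           < transfer_time(size_mb, thr)

if __name__ == "__main__":
    size, thr = 4096.0, 30.0    # MB, MB/s over the inter-site routes (placeholders)
    # e.g., a fast dictionary coder: 200 MB/s, 2.5x reduction (approximated by the user)
    print(compression_pays_off(size, thr, speed_compression=200.0, factor_compression=2.5))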
4.4 Monitoring as a Service

Monitoring the environment is particularly important in large-scale systems, be it for assessing the resource performance or simply for supervising the applications. We provide a Monitor as a Service module, which on the one hand tracks the performance of various metrics and on the other hand offers performance estimates based on the collected samples. To obtain such estimates, we designed a generic performance model, which aggregates all information on: the delivered performance (i.e., what efficiency one should expect) and the instability of the environment (i.e., how likely it is to change). Our goal is to make accurate estimations but at the same time to remain generic with our model, regardless of the tracked metrics or the environment variability. To this end, we propose to dynamically weight the trust given to each sample based on the environment behaviour deduced so far. The resulting inferred trust level is then used to update the view on the environment performance. This allows to transparently integrate samples for any metric while reasoning with all available knowledge.

The monitoring samples about the environment are collected at configurable time intervals, in order to keep the service non-intrusive. They are then integrated with the parameter trace about the environment (h - the history - gives the fixed number of previous samples that the model will consider representative). Based on this, the average performance (µ - Equation 7) and the corresponding variability (σ - Equation 8) are estimated at each moment i. These estimations are updated based on the weights (w) given to each new measured sample.

\mu_i = \frac{(h-1) * \mu_{i-1} + (1-w) * \mu_{i-1} + w * S}{h}    (7)

\sigma_i = \sqrt{\gamma_i - \mu_i^2}    (8)

\gamma_i = \frac{(h-1) * \gamma_{i-1} + w * \gamma_{i-1} + (1-w) * S^2}{h}    (9)

where S is the value of the new sample and γ (Equation 9) is an internal parameter. Equation 9 is obtained by rewriting the standard variability formula (i.e., \sigma_i = \sqrt{\frac{1}{N} \sum_{j=1}^{N} (x_j - \mu_i)^2}) in terms of the previous value at moment i−1 and the value of the current sample. This rewriting allows to save the memory that would otherwise be needed to store the last h samples. The corresponding weights are selected based on the following principles:

• A high standard deviation will favor accepting new samples (even the ones farther from the average), indicating an environment likely to change;
• Samples far from the average are potential outliers and weighted less (it can be a temporal glitch);
• Less frequent samples are weighted higher, as they are more valuable.

These observations are captured in Equation 10. The formula combines the Gaussian distribution with a time reference component, which we defined based on the frequency of the sample - tf - within the time reference interval T. The values range from 0 - no trust - to 1 - full trust.

w = \frac{e^{-\frac{(S-\mu)^2}{2\sigma^2}} + \left(1 - \frac{tf}{T}\right)}{2}    (10)

The metrics considered by the service are: available bandwidth, I/O throughput, CPU load, memory status, inter-site links and VM routes. The available bandwidth between the nodes and the datacenters is measured using the Iperf software [17]; the throughput and interconnecting VM routes are computed based on time measurements of random data transfers; the CPU and memory performance is evaluated using a benchmark that we implemented based on the Kabsch algorithm [18]. New metrics can be easily defined and integrated as pluggable monitoring modules. The very simplicity of the model allows it to be general, but at the expense of becoming less accurate for some precise, application-dependent settings. It was our design choice to trade accuracy for generality.
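The update rules (7)-(10) amount to a few lines of state per tracked metric. The sketch below is a direct transcription of these equations; the history length, time reference and sample stream are illustrative values, and the small clamps guard against numerical corner cases not discussed in the text.

# Transcription of the incremental performance model (Equations 7-10).
# History length, time reference and the sample stream are illustrative.
import math

class MetricModel:
    def __init__(self, h=20, time_ref=60.0, initial=1.0):
        self.h, self.T = h, time_ref
        self.mu, self.gamma = initial, initial ** 2

    def weight(self, sample, sample_freq):
        sigma2 = max(self.gamma - self.mu ** 2, 1e-9)      # guard against zero variance
        gaussian = math.exp(-((sample - self.mu) ** 2) / (2.0 * sigma2))
        return (gaussian + (1.0 - sample_freq / self.T)) / 2.0            # Equation (10)

    def update(self, sample, sample_freq):
        w = self.weight(sample, sample_freq)
        self.mu = ((self.h - 1) * self.mu + (1 - w) * self.mu + w * sample) / self.h            # Eq. (7)
        self.gamma = ((self.h - 1) * self.gamma + w * self.gamma + (1 - w) * sample ** 2) / self.h  # Eq. (9)
        return self.mu, math.sqrt(max(self.gamma - self.mu ** 2, 0.0))                          # Eq. (8)

if __name__ == "__main__":
    model = MetricModel(initial=100.0)          # e.g., inter-site throughput in Mbps
    for s in [100, 98, 105, 60, 102, 99]:       # 60 is a transient glitch, weighted less
        mu, sigma = model.update(s, sample_freq=5.0)
    print(round(mu, 2), round(sigma, 2))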
5 EVALUATION

In this section we analyze the impact of the overhead of the data services on the transfer performance, in an effort to understand to what extent they are able to customize the management of data. We also provide an analysis that helps users understand the costs involved by certain performance levels. The metric used as a reference is the data transfer time, complemented for each experiment with the relevant "cost" metric for each service (e.g., size, money, nodes). We focus on both synthetic benchmarks (Sections 5.2, 5.3, 5.4) and real-life workflows (Sections 5.5, 5.6).

5.1 Experimental setup

The experiments are performed on the Microsoft Azure cloud, at the PaaS level, using the EU (North, West) and US (North-Central, West, South and East) datacenters. The system is run on Small (1 CPU, 1.75 GB Memory and 100 Mbps) and Medium (2 CPU, 3.5 GB Memory and 200 Mbps) VM instances, deploying tens of VMs per site, reaching a total number of 120 nodes and 220 cores in the global system.

The experimental methodology considers 3 metrics: execution time, I/O throughput and cost. The execution time is measured using the timer functions provided by the .Net Diagnostic library. The throughput is determined at destination as a ratio between the amount of data sent and the operation time. Finally, the cost is computed based on Azure prices and policies [19]. Each value reported in the charts is the average of tens of measurements performed at different daily moments.