Computers,EnvironmentandUrbanSystemsxxx(2014)xxx–xxx ContentslistsavailableatScienceDirect Computers, Environment and Urban Systems journal homepage: www.elsevier.com/locate/compenvurbsys MERRA Analytic Services: Meeting the Big Data challenges of climate science through cloud-enabled Climate Analytics-as-a-Service John L. Schnasea,⇑, Daniel Q. Duffyb, Glenn S. Tamkina, Denis Nadeaua, John H. Thompsonb, Cristina M. Griegc, Mark A. McInerneya, William P. Webstera aOfficeofComputationalandInformationSciencesandTechnology,NASAGoddardSpaceFlightCenter,Greenbelt,MD20771,UnitedStates bNASACenterforClimateSimulation,NASAGoddardSpaceFlightCenter,Greenbelt,MD20771,UnitedStates cDepartmentofComputationalDataSciences,GeorgeMasonUniversity,Fairfax,VA22030,UnitedStates a r t i c l e i n f o a b s t r a c t Articlehistory: ClimatescienceisaBigDatadomainthatisexperiencingunprecedentedgrowth.Inoureffortstoaddress Availableonlinexxxx theBigDatachallengesofclimatescience,wearemovingtowardanotionofClimateAnalytics-as-a-Ser- vice(CAaaS).Wefocusonanalytics,becauseitistheknowledgegainedfromourinteractionswithBigData Keywords: thatultimatelyproducesocietalbenefits.WefocusonCAaaSbecausewebelieveitprovidesausefulway MapReduce ofthinkingabouttheproblem:aspecializationoftheconceptofbusinessprocess-as-a-service,whichisan Hadoop evolvingextensionofIaaS,PaaS,andSaaSenabledbyCloudComputing.Withinthisframework,Cloud Dataanalytics Computingplaysanimportantrole;however,weseeitasonlyoneelementinaconstellationofcapabil- Dataservices itiesthatareessentialtodeliveringclimateanalyticsasaservice.Theseelementsareessentialbecausein CloudComputing theaggregatetheyleadtogenerativity,acapacityforself-assemblythatwefeelisthekeytosolvingmany Generativity iRODS oftheBigDatachallengesinthisdomain.MERRAAnalyticServices(MERRA/AS)isanexampleofcloud- MERRA enabledCAaaSbuiltonthisprinciple.MERRA/ASenablesMapReduceanalyticsoverNASA’sModern-Era ESGF RetrospectiveAnalysisforResearchandApplications(MERRA)datacollection.TheMERRAreanalysisinte- BAER gratesobservationaldatawithnumericalmodelstoproduceaglobaltemporallyandspatiallyconsistent synthesisof26keyclimatevariables.Itrepresentsatypeofdataproductthatisofgrowingimportanceto scientistsdoingclimatechangeresearchandawiderangeofdecisionsupportapplications.MERRA/AS bringstogetherthefollowinggenerativeelementsinafull,end-to-enddemonstrationofCAaaScapabili- ties:(1)high-performance,dataproximalanalytics,(2)scalabledatamanagement,(3)softwareappliance virtualization,(4)adaptiveanalytics,and(5)adomain-harmonizedAPI.TheeffectivenessofMERRA/AS hasbeendemonstratedinseveralapplications.Inourexperience,CloudComputinglowersthebarriers andrisktoorganizationalchange,fostersinnovationandexperimentation,facilitatestechnologytransfer, andprovidestheagilityrequiredtomeetourcustomers’increasingandchangingneeds.CloudComputing isprovidinganewtierinthedataservicesstackthathelpsconnectearthbound,enterprise-leveldataand computationalresourcestonewcustomersandnewmobility-drivenapplicationsandmodesofwork.For climatescience,CloudComputing’scapacitytoengagecommunitiesintheconstructionofnewcapabili- tiesisperhapsthemostimportantlinkbetweenCloudComputingandBigData. PublishedbyElsevierLtd. 1.Introduction cannotbemoved:instead,analyticaloperationsneedtomigrateto wherethedatareside;complexanalysesoverlargerepositoriesre- The term ‘‘Big Data’’ is used to describe data sets that are too quireshigh-performancecomputing;largeamountsofinformation large and complex to be worked with using commonly-available increases the importance of metadata, provenance management, tools(Snijders,Matzat,&Reips,2012).Climatesciencerepresents and discovery; migrating codes and analytic products within a a Big Data domain that is experiencing unprecedented growth growing network of storage and computational resources creates (Edwards,2010).NASA’sclimatechangerepositoriesalonearepro- aneedforfastnetworks,intermediation,andresourcebalancing; jectedtogrowto350petabytesby2013(Skytland,2012).Someof and, importantly, the ability to respond quickly to customer the major Big Data challenges facing climate science are easy to demandsfornewandoftenunanticipatedusesforclimatedatare- understand:largerepositoriesmeanthatthedatasetsthemselves quiresgreater agility in building and deploying applications. It is usefultosituatetheBigDatachallengesoftheclimatedomainin ⇑ this larger context, because doing so helps us understand where Correspondingauthor.Tel.:+13012864351. innovationcanyieldimprovements. E-mailaddress:[email protected](J.L.Schnase). 0198-9715/$-seefrontmatterPublishedbyElsevierLtd. http://dx.doi.org/10.1016/j.compenvurbsys.2013.12.003 Pleasecitethisarticleinpressas:Schnase,J.L.,etal.MERRAAnalyticServices:MeetingtheBigDatachallengesofclimatesciencethroughcloud-enabled ClimateAnalytics-as-a-Service.Computers,EnvironmentandUrbanSystems(2014),http://dx.doi.org/10.1016/j.compenvurbsys.2013.12.003 2 J.L.Schnaseetal./Computers,EnvironmentandUrbanSystemsxxx(2014)xxx–xxx CloudComputingisoneofseveraltechnologiesofteninvokedas (4) How accessible it is to those ready and able to build on it: asolutiontoBigDatachallenges.However,thetechnicaldefinition Accessibility makes it easier to obtain the technology and of‘‘CloudComputing’’issovariouslyinterpretedthatthetermhas the information necessary to achieve mastery. The more becomejargonized(Mell&Grace,2011).ThatCloudComputingis accessible,themoregenerative. both ubiquitous and ambiguous points to the need to examine (5) How transferable any changes are to others, including non- carefullyhowCloudComputingenables. experts: Transferability reflects how easily changes in the technologycanbeconveyedtoothers. 1.1.ClimateAnalytics-as-a-Service(CAaaS) Amajordeficiencyinanyonefactorgreatlyreducesoverallgen- InoureffortstoaddresstheBigDatachallengesofclimatesci- erativity.Conversely,themorethesefivequalitiesaremaximized, ence, we are moving toward a notion of Climate Analytics-as-a- theeasieritisforasystemtowelcomecontributionsfromoutsid- Service(CAaaS).Wefocusonanalytics,becauseitistheknowledge ersaswellasinsiders.Ingeneral,generativetoolsaremorebasic gainedfromourinteractionswithBigDatathatultimatelyproduce andlessspecializedforaccomplishingaparticularpurpose. societalbenefits.WefocusonCAaaSbecausewebelieveitprovides Intheremainderofthispaper,weillustratehowwearetrans- ausefulwayofthinkingabouttheproblem:aspecializationofthe latingtheseconceptsintoreality:wedescribethecontextinwhich concept of business process-as-a-service, which is an evolving we are working; the technology foundations important to us, extension of IaaS, PaaS, and SaaS enabled by Cloud Computing. includingourdefinitionandrationaleforthegenerativeelements Withinthisframework,CloudComputingplaysanimportantrole; we feel are crucial; a specific project, MERRA Analytic Services, however,weseeitasonlyoneelementinaconstellationofcapa- and applications that demonstrate these capabilities in action; bilities that are essential to delivering climate analytics as a ser- waysthatCloudComputingarecontributingtotheeffort;and,fi- vice. These elements are essential because in the aggregate they nally,ourplansforthefuture. lead to generativity – a capacity for self-assembly that we feel is thekeytosolvingmanyoftheBigDatachallengesinthisdomain. 2.Background–TheNASACenterforClimateSimulationand climatescienceasaBigDatadomain 1.2.Generativetechnologies OurunderstandingoftheEarth’sprocessesisbasedonacombi- Generativityreferstoasystem’scapacitytoproduceunantici- nationofobservationaldatarecordsandmathematicalmodels.The patedchangethroughunfilteredcontributionsfrombroadandvar- sizeofNASA’sspace-basedobservationaldatasetsisgrowingdra- iedaudiences(Zittrain,2008).Theconcepthighlightsaspectsofan maticallyasnewmissionscomeonline.However,apotentiallybig- innovationorprocessthatenableanautocatalyticfeeding-forward gerdatachallengeisposedbytheworkofclimatescientists,whose thatcanhelpmakegrowth,furtherinnovation,andsuccesspossi- modelsareregularlyproducingdatasetsofhundredsofterabytes ble.Generativityconnectsinputsfromdiversepeopleandgroups, ormore(Edwards,2010;Webster,2013). who may or may not be working in concert, with emergent and TheNASACenterforClimateSimulation(NCCS)providesstate- unanticipated outputs. How much the system facilitates partici- of-the-artsupercomputinganddataservicesspecificallydesigned pantcontributionisafunctionofbothtechnologicaldesignandso- for weather and climate research (NCCS, 2013). The NCCS main- cial behavior (Baker & Bowker, 2007). A system’s generativity tainsadvanceddatacapabilitiesandfacilitiesthatallowresearch- describes not only its objective characteristics, but also the ways ers within and beyond NASA to create and access the enormous thesystemrelatestoitsusersandthewaysusersrelatetoonean- volumeofdatageneratedbyweatherandclimatemodels.Tackling other.Inturn,theserelationshipsreflecthowmuchtheusersiden- the problems of data intensive science is an inherent part of the tifyascontributorsorparticipants,ratherthanasmereconsumers. NCCSmission. The Internet itself, modern operating systems, Apple’s iTunes, Therearetwomajorchallengesposedbythedataintensivenat- Twitter, Facebook, and the emerging infrastructure for mobile ureofclimatescience.Thereistheneedtoprovidecompletelife- application development are examples of generative systems. In cycle management of large-scale scientific repositories. This both cases, the design elements contributing to their generative capabilityisthefoundationuponwhichavarietyofdataservices potentialareeasytosee.TheInternet’sframersmadesimplicitya can be provided, from supporting active research to large-scale corevalue,definingintheprocesstheclassicend-to-endargument data federation, data publication and distribution, and archival thatmostfeaturesinanetworkshouldbeimplementedatitscom- storage(Berman,2008).Wethinkofthisaspectofourmissionas puterendpointsratherthanbythenetworkitself,whichappropri- climatedataservices. atelyimplementsonlythosefunctionsthatareuniversallyuseful The other data intensive challenge has to do with how these (Saltzer, Reed, & Clark, 1984). To do otherwise might have tilted large datasets are used: data analytics – the capacity to perform thegenericnetworktowardspecificusesandlimiteditspotential usefulscientificanalysesoverenormousquantitiesofdatainrea- for growth. (Consider, for example, the proprietary, non-genera- sonableamountsoftime.Inmanyrespectsthisisthebiggestchal- tive,andnowdefunctCompuServenetwork.). lenge; without effective means for transforming large scientific Zittrain(2008)identifiesfivepropertiesofgenerativesystems: datacollectionsintomeaningfulscientificknowledge,ourmission fails. It is against this backdrop that the NCCS began looking at (1) Howextensivelyasystemortechnologyleveragesasetofpos- CAaaS as a potential element in our technological and organiza- sibletasks:Leveragemakesadifficultjobeasier,and,ingen- tionalresponsetochangingdemands. eral, the more a system can do, the more capable it is of producingchange. (2) How well it can be adapted to a range of tasks: Adaptability 3.Technologyfoundations–Towardagenerativeecologyfor enablesnew,unintended,andinnovativeusesofatechnol- ClimateAnalytics-as-a-Service ogy.Itbroadensthetechnology’suse. (3) How easily new contributors can master it: Ease of Mastery We believe there are five essential technology elements that reflects how easy it is for broad audiences to understand contributetobuildingagenerativecontextforClimateAnalytics- howtoadoptandadaptit.Themoreusefulatechnologyis as-a-Service: high-performance, data-proximal analytics;integra- bothtotheneophyteandtheexpert,themoregenerativeitis. tivedatamanagement;softwareappliancevirtualization;adaptive Pleasecitethisarticleinpressas:Schnase,J.L.,etal.MERRAAnalyticServices:MeetingtheBigDatachallengesofclimatesciencethroughcloud-enabled ClimateAnalytics-as-a-Service.Computers,EnvironmentandUrbanSystems(2014),http://dx.doi.org/10.1016/j.compenvurbsys.2013.12.003 J.L.Schnaseetal./Computers,EnvironmentandUrbanSystemsxxx(2014)xxx–xxx 3 analytics; and domain-harmonized APIs. In this section, we de- serversthatprovideaccesstostorageresources.Aresourceserver scribe what we mean by these terms and demonstrate how we (iRES)canprovideaccesstomorethanonestorageresource,and areimplementingtheconcept. the system can support any numberof clients at a time. A client canconnecttoanyserveronthegridandrequestaccesstodigital 3.1.High-performance,data-proximalanalytics(MapReduce) objectsfromthesystem.Therequestisparsedusingthecontextual andsysteminformationstoredintheiCATcatalog,andaphysical Clearly,atitscore,CAaaSmustbringtogetherdatastorageand objectisidentifiedandtransferredtotheclient.Therequestcanbe high-performance computing in order to perform analyses over intermsoflogicalobjectnames,oraconditionalquerybasedon data where the data reside. MapReduce is of particular interest descriptive and system metadata attributes. iRODS is a peer-to- tous,becauseitprovidesanapproachtohigh-performanceanalyt- peer server system; hence, requests can be made to any server, icsthatisprovingtobeusefultomanydataintensiveproblemsin whichinturnactsonbehalfoftheclientfortransferringthefile. climate research (Dean & Ghemawat, 2008; Duffy et al., 2011, Thefinalfiletransfertakestheshortestnetworkpathintermsof 2012; Tamkin, 2013). As typically implemented, MapReduce en- numberofhops. ables distributed computing on large data sets using high-end AnimportantaspectofiRODSisitsbuilt-inruleframework.As computers. It is an analysis paradigm that combines distributed part of each resource server, a distributed rule engine is imple- storageand retrievalwithdistributed,parallelcomputation,allo- mentedthatprovidesextensibilityandcustomizabilitybyencod- cating to the data repository analytical operations that yield re- ing server-side operations (including the main access APIs) into duced outputs to applications and interfaces that may reside sequencesofmicroservices.Thesequenceofmicroservicesiscon- elsewhere. Since MapReduce implements repositories as storage trolledbyuser-oradministrator-definedEvent:Condition:Action- clusters, data set size and system scalability are limited only by set:Recovery-set rules similar to those found in active databases. thenumberofnodesintheclusters.WhileMapReducehasproven The rules can be viewed as defining pipelines or workflows. An effectiveforlargerepositoriesoftextualdata,itsuseindatainten- ingestionoraccessprocesscanbeencodedasaruletoprovidecus- sivescienceapplicationshasbeenlimited(Bucketal.,2011),be- tomizedfunctionality.Rulesalsocanbedefinedbyusersandexe- causemanyscientificdatasetsareinherentlycomplex,havehigh cutedinteractively.Hence,changestoaparticularprocessorpolicy dimensionality,andusebinaryformats. can easily be constructed by the user, then tested and deployed MapReduce distributes computations across large data sets without the aid of system administrators or application develop- usingalargenumberofcomputers(nodes).Ina‘‘map’’operation ers.Theuseralsocandefineconditionswhenarulegetstriggered a head node takes the input, partitions it into smaller sub-prob- thus controlling the application of different rules (or processing lems,anddistributesthemtodatanodes.Adatanodemaydothis pipelines)basedoncurrenteventsandoperatingconditions. againinturn,leadingtoamulti-leveltreestructure.Thedatanode The building blocks for the iRODS rules are microservices – processesthesmallerproblem,andpassestheanswerbacktoare- small,well-definedproceduresorfunctionsthatperformacertain ducernodetoperformthereductionoperation.Ina‘‘reduce’’step, task. For example, one can use a rule that stipulates that when thereducernodethencollectstheanswerstoallthesub-problems accessing a data object from a particular collection, additional andcombinestheminsomewaytoformtheoutput–theanswer authorizationchecksneedtobemade.Theseauthorizationchecks totheproblemitwasoriginallytryingtosolve.Borrowingfromthe canbeencodedasasetofmicroserviceswithdifferenttriggersthat LISPfamilyoffunctionalprogramminglanguages,themapandre- canfirebasedoncurrentoperatingconditions.Inthisway,onecan ducefunctionsofMapReducearebothdefinedasdatastructured controlaccesstosensitivedatabasedonrulesandcanescalateor in<key,value>pairs(HDFS,2013). reduceauthorizationlevelsdynamicallyasthesituationwarrants. Apart from iRES servers and an iCAT server, iRODS also has two 3.2.Scalabledatamanagement(iRODS) other servers: iSEC for scheduling and executing queued rules, and iXMS for providing a message-passing framework between The core data management infrastructure for CAaaS must en- microservices. ablecollectionsscalability,richmetadatamanagement,andfeder- ateddiscoveryandaccess(Agrawal,Das,&Abbadi,2011).Forus, iRODSplaysacentralrole.TheIntegratedRule-OrientedDataSys- 3.3.Softwareappliancevirtualization(vCDS) tems,oriRODS,isanopensourcedatagridsoftwaresystembeing developedbytheDataIntensiveCyberEnvironments(DICE)group Virtualizationis centralto CloudComputingand importantto andtheRenaissanceComputingInstitute(RENCI)attheUniversity us in the way it enables agile development and deployment. We ofNorthCarolinaatChapelHill(iRODS,2013).Itisdescribedbyits have attempted to make virtualization even more convenient by creatorsaspeer-to-peerdatagridmiddlewarethatprovidesafacil- taking a software appliance approach to building a core element ityforcollection-building,managing,querying,accessing,andpre- ofourtechnologycluster,theVirtualClimateDataServer(vCDS). servingdatainadistributeddatagridframework.Akeyfeatureof AvCDSisaniRODS-baseddataserverspecializedtotheneedsof iRODSisitscapacitytoapplypolicy-basedcontrolwhenperform- aparticularclimatedata-centricapplication. ingthesefunctions. The basic configuration of an iRODS data server consists of a iRODSappealstousforseveralreasons.Ittargetslargereposi- specificversionofiRODSinstalledonaparticularoperatingsystem tories,largedataobjects,digitalpreservation,andintegratedcom- runningon particularhardware.Moving toward the vCDSvirtual plexprocessing,makingitoneofthemorepromisingtechnologies appliance model has been a two-step process in which we (1) for grid-centric data services for scientific applications. We also encapsulatetheoperatingsystemandiRODSasavirtualmachine likethefactthatitsdevelopmentculturehashistoricrootsindig- image, then (2) specialize that image with functionality required ital libraries, persistent archives, and real-time data systems re- formanagingclimatedata.Ourapproachtospecializationhasbeen search, having received support from the National Science to build general-purpose scientific ‘‘kits,’’ such as those that can Foundation(NSF)andNationalArchivesandRecordsAdministra- externalizeintothevCDSiCATtheinternalmetadatastoredinNet- tion(NARA). CDF,HDF,andGeoTIFfiles.Thesekitssitintheverticalstackabove TheiRODSdatagridsystemconsistsofseveralcomponents.It iRODSandbelowapplication-specifickits,suchasthosethatmight has a metadata catalog server, called the iCAT, which provides beneededtohandlethespecialdatamanagementrequirementsof metadataandabstractionservices.Therecanbemultipleresource aparticularcollection(Fig.1). Pleasecitethisarticleinpressas:Schnase,J.L.,etal.MERRAAnalyticServices:MeetingtheBigDatachallengesofclimatesciencethroughcloud-enabled ClimateAnalytics-as-a-Service.Computers,EnvironmentandUrbanSystems(2014),http://dx.doi.org/10.1016/j.compenvurbsys.2013.12.003 4 J.L.Schnaseetal./Computers,EnvironmentandUrbanSystemsxxx(2014)xxx–xxx Fig.1. SpecializationofaniRODSserverthroughappliancevirtualizationandtheadditionofdomain-andapplication-specific‘‘kits.’’ Our initial focus has been building a vCDS to manage NetCDF array of cloud services that are flexible, adaptable, scalable, and data.Additionaldetailsaboutthesecomponentsareprovidedbe- stageableto‘‘bricksandmortar’’facilitiesasneeded.Wecanprovi- low and in Schnase, Webster, Parnell, and Duffy (2011), Schnase, sioncapabilitiesintoanyresourceclass,migrateimagesfromone Tamkin, et al. (2011), Schnase, Duffy, et al. (2011), Schnase et al. resourceclasstoanother,andusetheiRODSfederationmechanism (2012),butinsummary,thecoreelementsincludethefollowing: toassemblevirtualcollectionsthatcrossresourceclasses.Thisap- proachprovidesanagileentrypointfornewcustomerswithdata- Application-specificmicroservices–Basicarchiveoperations,par- centric requirements and enables virtualization-as-a-service ticularlythemechanismsrequiredtoingestOpenArchiveInfor- (VaaS), software-as-a-service (SaaS), and platform-as-a-service mation System (OAIS)-compliant Submission Information (PaaS),and,asshownbelow,laysthefoundationforhigher-order Package (SIP) metadata for IPCC NetCDF objects (more about offerings,suchasCAaaS(Fig2). IPCCandOAISbelow). Application-specific metadata – OAIS-compliant constitutive 3.4.Adaptiveanalytics(CanonicalOps) (application-independent) Representation Information (RI) and Preservation Description Information (PDI) metadata for Data intensive analysis workflows bridge between a largely NetCDFobjects. unstructuredmassofarchivedscientificdataandthehighlystruc- Application-specificrules–NetCDFtriggersandworkflows. tured,tailored,reduced,andrefinedanalyticproductsthatareused AspecificreleaseofiRODS–Inthecurrentversionweareusing byindividualscientistsandformthebasisofintellectualworkin iRODS 2.5 that has been augmented with what we refer to as thedomain.Ingeneral,theinitialstepsofananalysis,thoseoper- Administrative Extensions (AE) that log object-level actions ationsthatfirstinteractwithadatarepository,tendtobethemost withintheserver. general, while data manipulations closer to the client tend to be Aspecificoperatingsystem–Inourcase,SuSELinuxEnterprise themostspecializedtotheindividual,tothedomain,ortothesci- Server(SLES)11SP1. encequestionunderstudy.Theamountofdatabeingoperatedon also tends to be larger on the repository-side of the workflow, Collectively,werefertothefunctionalityassociatedwithvCDS smallertowardtheclient-sideendproducts. asathevCDSV1.0‘‘productsuite.’’Takentogether,theseelements Thisstratificationcanbeexploitedinordertooptimizeefficien- enableanapproachtoscientificcollectionsmanagementinwhich ciesalongtheworkflowchain.MapReduce,for example,seeks to virtualization is a driving concept. It supports access to a tiered improve efficiencies of the near-archive operations that initiate Pleasecitethisarticleinpressas:Schnase,J.L.,etal.MERRAAnalyticServices:MeetingtheBigDatachallengesofclimatesciencethroughcloud-enabled ClimateAnalytics-as-a-Service.Computers,EnvironmentandUrbanSystems(2014),http://dx.doi.org/10.1016/j.compenvurbsys.2013.12.003 J.L.Schnaseetal./Computers,EnvironmentandUrbanSystemsxxx(2014)xxx–xxx 5 Fig.2. vCDScloudprovisioningandmigrationpaths. workflows.Inourworksofar,wehavefocusedonbuildingasmall is helping search the Sloan Digital Sky Survey for patterns and setofcanonicalnear-archive,early-stageanalyticaloperationsthat observationsofpotentialscientificvalue(Christian,Lintott,Smith, representacommonstartingpointinmanyanalysisworkflowsin Fortson,&Bamford,2012;Szalayetal.,2000;Young,2010).Webe- manydomains.Forexample,average,variance,max,min,sum,and lievethatthistypeofsocialnetworkingcanplayanimportantrole countoperationsofthegeneralform: inthefutureofclimateanalytics.Theapproachwearetakingsets thestageforthecommunityconstructionofnewcapabilitiesthat result avgðvar;ðt ;t Þ;ððx ;y ;z Þ;ðx ;y ;z ÞÞÞ; 0 1 0 0 0 1 1 1 are adapted to the socially expressed requirements of those who thatreturn,inthisexample,theaveragevalueofavariablewhen usethesystem. givenitsname,atemporalextent,andaspatialextent.Becauseof their widespread use, we refer to these simple operations as 3.5.Domain-harmonizedAPIs(CDSAPI) ‘‘canonical ops’’ with which more complex analytic expressions canbebuilt.Theyprovideatemplateforusersastheybegintheir Inordertoknitthesecapabilitiestogetheranddelivertheminto exploration of MapReduce analytics and are useful in their own practical use, we are building the Climate Data Services (CDS) rightasstepsinlargeranalyses.Wetendtothinkofthemasatype application programming interface (API). APIs specify how soft- ofassemblylanguageinstructionforclimatedataanalysis. ware components interact with each other; they can take many Thegoalistodeploythecanonicalopswithinaframeworkthat forms,butthe goal for allAPIs isto make iteasier toimplement isabletocapturetheirpatternsofuseandenablemorecomplex the abstract capabilitiesof a system. In buildingthe CDS API, we analysestobeassembledandincorporatedbackintothesystem. aretryingtoprovideforclimatescienceauniformsemantictreat- The notion of engaging the broader community to deal with Big mentofthecombinedfunctionalitiesoflarge-scaledatamanage- Datachallengeshasbeenusedsuccessfullyinothersettings,per- mentanddata-proximalanalytics.Indoingso,wearecombining hapsmostnotablywithGalazyZoo,wherealargeusercommunity concepts from the Open Archive Information Systems (OAIS) Pleasecitethisarticleinpressas:Schnase,J.L.,etal.MERRAAnalyticServices:MeetingtheBigDatachallengesofclimatesciencethroughcloud-enabled ClimateAnalytics-as-a-Service.Computers,EnvironmentandUrbanSystems(2014),http://dx.doi.org/10.1016/j.compenvurbsys.2013.12.003 6 J.L.Schnaseetal./Computers,EnvironmentandUrbanSystemsxxx(2014)xxx–xxx referencemodel,object-orientedprogrammingAPIs,andWeb2.0 uponwhichwehavebuiltaCDSCommandLineInterpreter(CLI) resource-orientedAPIs. thatsupportsinteractivesessions.Inaddition,Pythonscriptsand TheOpenArchiveInformationSystem(OAIS)referencemodel, fullPythonapplicationsalsocanuse methodsimportedfromthe definedbytheConsultativeCommitteeonSpaceDataSystems,ad- API. The resulting client stack can be distributed as a software dressesafullrangeofarchivalinformationpreservationfunctions package or used to build a cloud service (SaaS) or distributable including ingest, archival storage, data management, access, and cloudimage(PaaS)(Fig.4). dissemination–fullinformationlifecyclemanagement.Italsoad- This approach to API design focuses on the specific analytic dresses the migration of digital information to new media and requirements of climate science and marries the language and forms,thedatamodelsusedtorepresenttheinformation,therole abstractionsofcollectionsmanagementwiththoseofhigh-perfor- ofsoftwareininformationpreservation,andtheexchangeofdigi- manceanalytics.Doingsoreflectsattheapplicationlevelthecon- talinformationamongarchives.OAISprovidesexamplesandsome fluence of storage and computation that is driving Big Data ‘‘best practice’’ recommendations and defines a minimal set of architecturesofthefuture.Itistooearlytotell,butwehopethat responsibilities for an archive to be called an OAIS (OAIS, 2013). this ‘‘harmonization’’ will make CAaaS more accessible to our OAISidentifiesbothinternalandexternalinterfacestothearchive users. functions,anditidentifiesanumberofhigh-levelservicesatthese interfaces (Fig. 3). Thesehigh-levelservices provide avocabulary thatwehaveadoptedfortheClimateDataServicesReferenceMod- 4.MERRAAnalyticServices–Acasestudyincloud-enabled elandassociatedLibraryandAPI. ClimateAnalytics-as-a-Service TheCDSReferenceModelisalogicalspecificationthatpresents asingleabstractdataandanalyticservicesmodeltocallingappli- MERRA Analytic Services (MERRA/AS) pull these elements to- cations.Asshownbelow,theCDSReferenceModelcanbeimple- getherinanend-to-enddemonstrationofCAaaScapabilities.MER- mentedusingvarioustechnologies; inallcases,however,actions RA/AS enables MapReduce analytics over NASA’s Modern-Era arebasedonthefollowingsixprimitives: Retrospective Analysis for Research and Applications (MERRA) Ingest–Submit/registeraSubmissionInformationPackage. Query – Retrieve data from a pre-determined service request (synchronous). Order – Request data from a pre-determined service request (asynchronous). Download–RetrieveaDisseminationInformationPackage. Status–Trackprogressofserviceactivity. Execute–Initiateaservice-definableextension. WithinthisOAIS-inspiredframework,wearecreatingaPython- based CDS Library that contains methods that support the basic primitives(ingest,query,order,etc.)aswellas extendedutilities thatcombinetheseprimitivesintoautomatedmulti-stepcanonical ops(avg,max,min,etc.).TheLibrarysitsatopaRESTfulWebSer- Fig.4. ClimateDataServicesclientstackbuiltonthecapabilitiesenabledbythe vicesClientthatencapsulatesinboundandoutboundinteractions CDSReferenceModel,WebServicesClient,Library,andAPI. with various climate data services. These provide the foundation Fig.3. BasicOAISinteractionsthatformthebasisoftheClimateDataServicesReferenceModel,Library,andAPI. Pleasecitethisarticleinpressas:Schnase,J.L.,etal.MERRAAnalyticServices:MeetingtheBigDatachallengesofclimatesciencethroughcloud-enabled ClimateAnalytics-as-a-Service.Computers,EnvironmentandUrbanSystems(2014),http://dx.doi.org/10.1016/j.compenvurbsys.2013.12.003 J.L.Schnaseetal./Computers,EnvironmentandUrbanSystemsxxx(2014)xxx–xxx 7 data. As we describe below, the MERRA collection is a reanalysis responsible for executing mapper and reducer codes in parallel datasetthatisofparticularinteresttoabroadcommunityofusers. overthenodesthatcomposetheHDFS. Insimpleterms,ourvisionforMERRA/ASisthatitallowsMER- MERRAdatafilesarecreatedfromtheGoddardEarthObserving RAdatatobestoredinaHadoopDistributedFileSystem(HDFS)on Systemversion5(GEOS-5)modelandarestoredinHDF-EOSand aMERRA/AScluster.FunctionalityisexposedthroughtheCDSAPI. NetCDF formats (MERRA, 2013). Each file contains a single grid TheAPIexposuresenableabasicsetofoperationsthatcanbeused withmultiple2Dand3Dvariables.Alldataarestoredonalongi- to build arbitrarily complex workflows and assembled into more tude–latitudegridwithaverticaldimensionapplicableforall3D complex operations (which can be folded back into the API and variables.TheGEOS-5MERRAproductsaredividedinto25collec- MERRA/ASserviceasfurtherextensions).Thecomplexitiesofthe tions:18standardproducts,7chemistryproducts.Thecollections underlying (Java) mapper and reducer codes for the basic opera- comprisemonthlymeansfilesanddailyfilesat6-hintervalsrun- tionsareencapsulatedandabstractedawayfromtheuser,making ning from 1979 to 2012. MERRA data are typically packaged as thesecommonopseasiertouse. multi-dimensionalbinarydatawithinaself-describingNetCDFfile AnimportantadjuncttotheMERRA/ASserviceisapersistence format. Hierarchical metadata in the NetCDF header contain the service, also exposed through the CDS API, that allows users to representation information that allows NetCDF software to work store, download, annotate, and otherwise manage Java codes or with the data. It also contains arbitrary preservation description CDSscriptsthatimplementthemapandreducefunctionsoftheir and policy information that can be used to bring the data into analyses.Thepersistenceservicehasthecapacitytoexecutethese use-specificcompliance. codes on the MERRA cluster, capture the resulting output, and Totalsizeofthenative,compressedNetCDFMERRAcollection managetheoutputastheuserwishesundercontrolofthepersis- in a standard filesystem is approximately 80TB. Native MERRA tenceservice.Thecodesetsessentiallybecomerealizableobjects– filesaresequencedandingestedintotheHadoopclusterintripli- theirlogicalrepresentationsareusedinserver-sideprocessesthat cated640MBblocks.TotalsizeoftheMERRA/ASHDFSrepository causetheiranalyticalresultstoberealizeduponrequest. isapproximately480TB.TheMERRA/ASHDFSisrunningona36- Inthissectionwedescribethesecomponentsingreaterdetail, node Dell cluster that has 576 Intel 2.6GHz SandyBridge cores, demonstrate the use of MERRA/AS in example applications, and 1300TBofrawstorage,1250GBofRAM,anda11.7TFtheoretical show the role that Cloud Computing is playing in these efforts. peak compute capacity. Nodes communicate through a Fourteen We begin with background information about the MERRA DataRate(FDR)InfinibandnetworkhavingpeakTCP/IPspeedsin collection. excessof20Gbps. 4.3.TheMERRA/ASserver 4.1.ReanalysesandtheModernEraRetrospective-Analysisfor ResearchandApplications(MERRA) The functional requirements of the MERRA/AS server derive from the basic organization of a Hadoop MapReduce system. As The MERRA reanalysis integrates observational data with introduced above, a MapReduce Program has two components: numerical models to produce a global temporally and spatially onethatimplementsamapper,andanotherthatimplementsare- consistent synthesis of 26 key climate variables (Rieneckeret al., ducer.Themappertransformseachelementofaninputlisttoan 2011).Spatialresolutionis1/2(cid:2)latitude(cid:2)2/3(cid:2)longitude(cid:2)72ver- outputelement;thereduceraggregatestheseoutputelementsinto tical levels extending through the stratosphere. Temporal resolu- a single result, which has the effect of turning a large volume of tion is 6-h for three-dimensional, full spatial resolution, data into a smaller summary of itself. A third component, called extending from 1979 to present, nearly the entire satellite era. the Driver, initializes the Program on the Job Tracker node, in- MERRA data are typically made available to the general public structs the Hadoop engine to execute the mapping and reducing throughtheNASAEarthObservingSystemDistributedInformation codes on a set of input files, and controls where the output files System(EOSDIS, 2013). A subset of the data is made availableto areplaced. the climate research community through the Earth System Grid In MapReduce, every value has a key associated with it. Keys Federation (ESGF), the research community’s data publication identify related values. The mapping and reducing functions re- infrastructure(ESGF,2013). ceive not just values, but <key, value> pairs. The output of each We are focusing on the MERRA collection because there is an of these functions is the same: both a key and a value must be increasing demand for reanalysis data products by an expanding emittedtothenextlistinthedataflow.Thesefilteringandcom- community of consumers, including local governments, federal biningfunctions,executedinparalleloverdistributeddata,collec- agencies, and private-sector customers. Reanalysis data are used tively accomplish analytical operations of varying complexity inmodelsanddecisionsupportsystemsrelatingtodisasters,eco- withintheMapReduceparadigm. logicalforecasting,healthandairquality,waterresources,agricul- MostoftheworkofbuildingtheMERRA/ASserverinvolvescre- ture,climateenergy,oceans,andweather.Currently,MERRAdata atingutilitiestomoveMERRAdataintoandoutoftheHDFSand are generally moved to client applications for analysis and use. writing the MapReduce programs that implement MERRA/AS’s Convenientaccesstostorage-sideanalyticscouldsignificantlyim- canonicalops.EachfileofnativeNetCDFbinarydataisconverted provetheusefulnessofthisimportantcollection. into separate sequence files – flat files consisting of binary <key, value>pairsthatcanbeoperatedonbymapperandreducerfunc- 4.2.TheMERRA/ASHDFSrepository tions.Thesesequencefilesareblock-compressedinHDFSandpro- vide direct serialization of several arbitrary binary data types. The Apache Hadoop software library is the classic framework Duringsequencing,thedataispartitionedbytime,sothateachre- for MapReduce distributed analytics (HDFS, 2013). We are using cordinthesequencefilecontainsthetimestampandnameofthe Cloudera, the 100% open source, enterprise-ready distribution of parameter(e.g.temperature)asthecompositekeyandthevalueof Apache Hadoop. Cloudera is integrated with configuration and theparameter(whichcouldhave1–3spatialdimensions). administration tools and related open source packages, such as In operation, map processes filter each sequence file to cap- Hue,Oozi,Zookeeper,andImpala(Cloudera,2013).Therearemany ture<key, value>pairs that match the variable and time span of waystoconfigureaHadoopcluster,butitsbasicarchitecturecon- interest; reduce processes perform calculations based on input sists of an HDFS file system and a MapReduce engine, which is parameters (time, extents, etc.) and create new subset sequence Pleasecitethisarticleinpressas:Schnase,J.L.,etal.MERRAAnalyticServices:MeetingtheBigDatachallengesofclimatesciencethroughcloud-enabled ClimateAnalytics-as-a-Service.Computers,EnvironmentandUrbanSystems(2014),http://dx.doi.org/10.1016/j.compenvurbsys.2013.12.003 8 J.L.Schnaseetal./Computers,EnvironmentandUrbanSystemsxxx(2014)xxx–xxx files.TheresultingsequencefilesarethentransformedtoNetCDF prioritization of target areas for reseeding and long-term ecosys- formatinthede-sequencingprocess(Fig.5). temrecoverymonitoring. ThecanonicaloperationsthatimplementMERRA/AS’saverage, RECOVER is a new decision support system (DSS) that will be variance,max, min,sum, and countcalculationsare JavaMapRe- incorporated into a long-standing post-fire decision process, the duce programs that are ultimately exposed as simple references NationalBurnedAreaEmergencyResponse(BAER)program.After to CDS Library methods or as web services endpoints. There is a a major wildfire, law requires that the federal land management substantialcodeecosystembehindtheseapparentlysimpleopera- agencies certify a comprehensive plan for public safety, burned tions,nearly6000linesofJavacodebeingoffloadedfromtheuser area stabilization, resource protection, and site recovery. These totheMERRA/ASservice. BAERplansareacrucialpartofournationalresponsetowildfire disasters.Theplansareduewithin14daysofcontainmentofama- jor wildfire and become the guiding document for managing the 4.4.MERRA/ASinuse–RESTfulwebservices activitiesandbudgetsforallsubsequentremediationefforts.There arefewinstancesin thefederalgovernmentwhereplansofsuch Our initial exposure for client applications that wish to con- wide-rangingscopeareassembledonsuchshortnoticeandtrans- sume MERRA/AS results is the MERRA/AS Web Service. We are latedintoactionmorequickly(BAER,2013). using a Representational State Transfer (REST)-style architecture, BAER plans are largely developed on-site by multi-agency which is the predominant web API design model. REST provides teams of specialists that include natural resource managers and scalabilityofcomponentinteractions,accommodatesintermediar- scientistswithexpertiseinthesalientdisciplines.Thenatureand ieslikefirewallsandproxieswithouttheneedtochangeinterfaces, settingoftheirworkcreateaneedfordecisionsupporttoolsthat andallowsindependentdeploymentofcomponentswhereimple- allow sound decision-making and land management planning to mentations can change without the need to change interfaces. A takeplacequicklyforaspecificregionofinterest.Remotesensing concrete implement of a REST Web Service uses HTTP methods imageryisoftenusedtocomplementfield-basedassessmentsand explicitly, is stateless, exposes directly structure-like URIs, and toprovidelandscapeorregionalscalemonitoring.Severalindica- transfersXML,JavScriptObjectNotation(JSON),orboth. tors derived from satellite imagery can be used to characterize Our RESTservicehasbeen builtusing theCDS API and makes bothfireseverityandintensityaswellasvegetationrecoveryfol- callstotheCDSLibrary;itsendpointsemanticsadheretothecon- lowingfire.Forseveralanalyses,historicalecosystemconditionsof ventionswehaveestablishedintheCDSAPI.Aspecificcalltothe thetypecapturedinMERRA’svariablesareofimportance. MERRA/AS service to find the average temperature over a given Criticalsite-specificinformationthatcouldotherwiseimprove periodoftime,geospatialextent,andspanofaltitudeslookssome- outcomes does not become part of the decision-making process thinglikethis: unlessitisimmediatelyavailabletotheBAERteams.RECOVERis acontext-aware,site-specificDSSthatbringstogetherinasingle http://skyportal.gsfc.nasa.gov/cds/mas/order.php?GetAverage application the information necessary for BAER team post-fire ByVariable_TimeRange_SpatialExtent_VerticalExtent&variable_ rehabilitationdecision-making.Inatypicalscenario-of-use,aRE- list=T&operation=avg&start_date=201101&end_date=201112& COVERinstanceiscreatedautomaticallyinresponsetoafiredetec- avg_period=12&min_lon=125&min_lat=24&max_lon=-66&max tionevent.UsingtherapidresourceallocationcapabilitiesofCloud _lat=50&start_level=1&end_level=42. Computing, Earth observational data, MERRA data, and derived decision products are automatically collected and refreshed Asimpleform-basedinterfacehasbeenprovidedtoenablebeta throughouttheburnsothatwhenthefireisdeclaredundercon- testingoftheMERRA/ASWebServicesendpoints(Fig.6). trol, BAER teams have at hand a complete and ready-to-use RE- COVERdatasetthatiscustomizedforthetargetwildfire. 4.5.MERRA/ASinuse–TheRECOVERwildlandfiredecisionsupport The system itself comprises a RECOVER Server and RECOVER system Clients. The RECOVER Server is a tailored vCDS deployed in the Amazon Elastic Compute Cloud (EC2). When provided a wildfire Amoreinterestingmachine-to-machineuseofMERRA/ASWeb nameandgeospatialextent,theRECOVERServeraggregatesdata Services is demonstrated in the RECOVER project. In RECOVER, from a suite of web services, does the necessary transformations whichstandsforRehabilitationCapabilityConvergenceforEcosys- andreprojectionsrequiredforthedatatobeusedbyRECOVERCli- temRecovery,NASAisworkingwiththeDepartmentofInterior’s ents, and, in turn, exposes the tailored collection through a Web BureauofLandManagement(BLM)andtheNationalInteragency MapServicerunningintheServer.RECOVERcallsonMERRA/AS’s FireCenter(NIFC)toaddressestwocriticalrequirementsinpost- WebService,therebyprovidinganeasyintegrationofthishereto- fire decision-making for savanna ecosystems: identification and foreseldomusedresourceintotheBAERprocess(Fig.7). Fig.5. Sequencing/de-sequencingoperationsperformedbyMERRA/ASutilitiesfortheMERRAHDFSrepository. Pleasecitethisarticleinpressas:Schnase,J.L.,etal.MERRAAnalyticServices:MeetingtheBigDatachallengesofclimatesciencethroughcloud-enabled ClimateAnalytics-as-a-Service.Computers,EnvironmentandUrbanSystems(2014),http://dx.doi.org/10.1016/j.compenvurbsys.2013.12.003 J.L.Schnaseetal./Computers,EnvironmentandUrbanSystemsxxx(2014)xxx–xxx 9 Fig.6. MERRA/AS’scanonicalMapReduceoperationsareexposedthroughRESTfulwebservicesandasimpleformsinterface. RECOVER’s feasibility has been demonstrated in Idaho during This ontological alignment – moving from the semantic frame of the2013wildfireseason.Overthenext3years,RECOVERwillbe referencedefinedbytheproducersoftheMERRAdatatothatused deployed into operational use in the Great Basin states of Idaho, byESGF–isoftenamixedprocessofautomaticandmanualcon- Utah,andNevadawhereEC2’sauto-scalingandelasticloadbalanc- versionandcontributessignificantlytothedatapreparationover- ingcapabilitieswillbeparticularlybeneficial. headofsupportingtheIPCCproject. Tofacilitatethisprocess,wehavedeployedvCDSintheAmazon cloudandhaveusedthesystemtodeliverasubsetofNASA’sMER- 4.6.MERRA/ASinuse–EarthSystemGridFederation(ESGF) RAproductstotheESGFserver,alsorunningintheAmazoncloud. publication vCDS-managedobjectsareexposedtoESGFthroughFUSE(Filesys- teminUserSpace),whichpresentsaPOSIX-compliantfilesystem NASAscientistscontributeclimatedata,includingMERRAprod- abstraction to applications such as the ESGF server that require ucts, to the Intergovernmental Panel on Climate Change, which such an interface (Fig. 8). vCDS will ultimately provide a persis- representsateamofnearly1000expertsworkingthroughoutthe tence service that hosts the MapReduce codes that extract ESGF worldonissuesofclimatechange(IPCC,2013).Thisresearchcom- products from MERRA/AS. vCDS will also provide a place in the munity uses the Earth System Grid Federation (ESGF) as the pri- stackforotherutilities,suchasthoserequiredforontologyalign- marymechanismforpublishingIPCCdataaswellastheancillary mentandmetadatamanagement. observational and reanalysis products used in model/model and observation/modeldatainter-comparison,animportantaspectof climate change research (Edwards, 2010). ESGF functions like a peer-to-peercontentdistributionnetworkinwhichgeographically 4.7.MERRA/ASinuse–TheWeimethodforprogrammeddata distributed collections can be accessed by the climate research assemblyandcapabilityenhancement community through a certificate authority mechanism (CA). PublishedESGFdata,regardlessofsource,conformstothecommu- Wei, Dirmeyer, Wisser, Bosilovich, and Mocko (2013), in a nity-defined Climate Model Inter-comparison Program (CMIP5) hydrological study that focuses on the contribution of irrigation Data Reference Syntax and Controlled Vocabularies standard toprecipitation,providesanexcellentexampleofthewayMERRA (Tayloretal.,2012).ThetrustrelationshipsetupbytheCAmech- dataareusedininvestigationsofthistype.WeareusingtheWei anism essentially creates a virtual organization of producers and experiment,andotherslikeit,todeveloptheMERRA/ASAPI,Com- consumersofESGFproducts. mand Interpreter, scripting, and programming capabilities. The Institutions, such as the NCCS, that host ESGF servers have goalatthisearlystageistodemonstrateandevaluatetheability responsibility for correctly formatting and registering their data of CAaaS and the CDS Client Stack to simplify the work of data contributions.PreparingMERRAdataforESGFpublicationrequires assembly and the community construction of enhanced CAaaS reformattinginordertomakeitcomplywiththeCMIP5standard. functionality. Pleasecitethisarticleinpressas:Schnase,J.L.,etal.MERRAAnalyticServices:MeetingtheBigDatachallengesofclimatesciencethroughcloud-enabled ClimateAnalytics-as-a-Service.Computers,EnvironmentandUrbanSystems(2014),http://dx.doi.org/10.1016/j.compenvurbsys.2013.12.003 10 J.L.Schnaseetal./Computers,EnvironmentandUrbanSystemsxxx(2014)xxx–xxx Fig.7. RECOVERServer(a)andClient(b)interfaces. Simplyput,theWeiteamusedMERRAdatatostudyfourinten- used a quasi-isentropic back-trajectory (QIBT) method to track sivelyirrigatedregions–northernIndia/Pakistan,theNorthChina water vapor for precipitation events backward in time assuming Plain,theCaliforniaCentralValley,andtheNileValley.Thestudy precipitated water is drawn from the atmospheric column along Pleasecitethisarticleinpressas:Schnase,J.L.,etal.MERRAAnalyticServices:MeetingtheBigDatachallengesofclimatesciencethroughcloud-enabled ClimateAnalytics-as-a-Service.Computers,EnvironmentandUrbanSystems(2014),http://dx.doi.org/10.1016/j.compenvurbsys.2013.12.003