Big Data Software
Spring 2017
Bloomington, Indiana

Editor: Gregor von Laszewski
Department of Intelligent Systems Engineering
Indiana University
[email protected]

Contents

1 S17-ER-1001 Berkeley DB, Saber Sheybani
2 S17-IO-3000 Apache Ranger, Avadhoot Agasti
3 S17-IO-3005 Amazon Kinesis, Abhishek Gupta
4 S17-IO-3008 Google Cloud DNS, Vishwanath Kodre
5 S17-IO-3010 Robot Operating System (ROS), Matthew Lawson
6 S17-IO-3011 Apache Crunch, Scott McClary
7 S17-IO-3012 Apache MRQL - MapReduce Query Language, Mark McCombe
8 S17-IO-3013 Lightning Memory-Mapped Database (LMDB), Leonard Mwangi
9 S17-IO-3014 SciDB: An Array Database, Piyush Rai
10 S17-IO-3015 Cassandra, Sabyasachi Roy Choudhury
11 S17-IO-3016 Apache Derby, Ribka Rufael
12 S17-IO-3017 Facebook Tao, Nandita Sathe
13 S17-IO-3019 InCommon, Michael Smith
14 S17-IO-3020 Hadoop YARN, Milind Suryawanshi and Gregor von Laszewski
15 S17-IO-3021 Apache Tez - Application Data processing Framework, Abhijt Thakre
16 S17-IO-3022 Deployment Model of Juju, Sunanda Unni
17 S17-IO-3023 AWS Lambda, Karthick Venkatesan
18 S17-IO-3024 Not Submitted, Ashok Vuppada
19 S17-IR-2001 HUBzero: A Platform For Scientific Collaboration, Niteesh Kumar Akurati
20 S17-IR-2002 Apache Flink: Stream and Batch Processing, Jimmy Ardiansyah
21 S17-IR-2004 Jelastic, Ajit Balaga
22 S17-IR-2006 An Overview of Apache Spark, Snehal Chemburkar and Rahul Raghatate
23 S17-IR-2008 An overview of Apache THRIFT and its architecture, Karthik Anbazhagan
24 S17-IR-2011 Hyper-V, Anurag Kumar Jain
25 S17-IR-2012 Retainable Evaluator Execution Framework, Pratik Jain
26 S17-IR-2013 A brief introduction to OpenCV, Sahiti Korrapati
27 S17-IR-2014 An Overview of Pivotal Web Services, Harshit Krishnakumar
28 S17-IR-2016 An Overview of Apache Avro, Author Missing
29 S17-IR-2017 An Overview of Pivotal HD/HAWQ and its Applications, Author Missing
30 S17-IR-2018 An overview of Cisco Intelligent Automation for Cloud, Bhavesh Reddy Merugureddy
31 S17-IR-2019 KeystoneML, Vasanth Methkupalli
32 S17-IR-2021 Amazon Elastic Beanstalk, Shree Govind Mishra
33 S17-IR-2022 ASKALON, Abhishek Naik
34 S17-IR-2024 Memcached, Ronak Parekh and Gregor von Laszewski
35 S17-IR-2026 Naiad, Rahul Raghatate and Snehal Chemburkar
36 S17-IR-2027 Dryad: Distributed Execution Engine, Shahidhya Ramachandran
37 S17-IR-2028 A Report on Apache Apex, Srikanth Ramanam
38 S17-IR-2029 Apache Mahout, Naveenkumar Ramaraju
39 S17-IR-2030 Neo4J, Sowmya Ravi
40 S17-IR-2031 OpenStack Nova: Compute Service of OpenStack Cloud, Kumar Satyam
41 S17-IR-2034 Heroku, Yatin Sharma
42 S17-IR-2035 D3, Piyush Shinde
43 S17-IR-2036 An overview of the open source log management tool - Graylog, Rahul Singh
44 S17-IR-2037 Jupyter Notebook vs Apache Zeppelin - A comparative study, Sriram Sitharaman
45 S17-IR-2038 Introduction to Terraform, Sushmita Sivaprasad
46 S17-IR-2041 Google BigQuery - A data warehouse for large-scale data analytics, Sagar Vora
47 S17-IR-2044 Hive, Diksha Yadav

Review Article Spring 2017 - I524

Berkeley DB

SABER SHEYBANI 1,*
1 School of Informatics and Computing, Bloomington, IN 47408, U.S.A.
* Corresponding authors: [email protected]

Paper 2, April 30, 2017

Berkeley DB is a family of open source, NoSQL key-value database libraries. It provides a simple function-call API for data access and management over a number of programming languages, including C, C++, Java, Perl, Tcl, Python, and PHP. Berkeley DB is embedded because it links directly into the application and runs in the same address space as the application. As a result, no inter-process communication, either over the network or between processes on the same machine, is required for database operations. It is also extremely portable and scalable: it can manage databases up to 256 terabytes in size. For data management, Berkeley DB offers advanced services, such as concurrency for many users, ACID transactions, and recovery.
Berkeley DB is used in a wide variety of products and a large number of projects, including gateways from Cisco, Web applications at Amazon.com and open-source projects such as Apache and Linux.

© 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: NoSQL, embedded database, Oracle, open source database management system

https://github.com/cloudmesh/sp17-i524/tree/master/paper2/S17-ER-1001/report.pdf

1. INTRODUCTION

Data management has always been a fundamental issue in programming. Since the 1960s, countless database management systems have been developed to fulfil different sorts of demands. The question for every user is which system best fits the requirements of the application.

Database management systems can be categorized, based on their data models, into a number of groups: Hierarchical Databases, Network Databases, Relational Databases, Object-based Databases, and Semistructured Databases.

Hierarchical databases use the oldest type of data model, which is a tree-like structure. The records are connected to each other with hard-coded links.

In a Network model, records are also connected with links, but there is no hierarchy. Instead, the structure is graph-like and all of the nodes can connect to each other.

In Relational databases, there are no physical links, but the data is structured in tables (relations). Each row represents a record and each column represents an attribute. The tables are connected through common attributes, which makes querying much easier than in the two former models. For this reason, database management systems using relational models are the most widely used ones.

Object-based data models extend concepts of object-oriented programming into database systems, in order to provide persistent storage of objects and other database capabilities for object-oriented programming.

Semistructured databases, which include NoSQL databases, are the type of database model that enables storage of heterogeneous data by allowing records with different attributes. This, however, is achieved at the cost of the database system no longer knowing the data types. The data in this case must be self-describing, meaning that the description (schema) of the data must be contained in the data itself. XML (Extensible Markup Language) schema language is a widely used language for providing schemas for these database systems [1].

Berkeley DB fits into the last category, as a NoSQL database system. The records are stored as key-value pairs and a few logical operations can be executed on them, namely: insertion, deletion, finding a record by its key, and updating an already found record. "Berkeley DB never operates on the value part of a record. Values are simply payload, to be stored with keys and reliably delivered back to the application on demand." There is no notion of schema and no support for SQL queries. "The application must understand the keys and values that it uses. On the other hand, there is literally no limit to the data types that can be stored in a Berkeley DB database. The application never needs to convert its own program data into the data types that Berkeley DB supports. Berkeley DB is able to operate on any data type the application uses, no matter how complex" [2].

2. ARCHITECTURE

Berkeley DB's architecture can be explained by five major subsystems:

Access Methods: Providing general-purpose support for creating and accessing database files.
Memory Pool: The general-purpose shared memory buffer pool. Multiple processes and threads within processes share access to databases using this subsystem.
Transaction: Implementing the transaction model, realizing ACID properties.
Locking: The general-purpose lock manager for processes.
Logging: The write-ahead logging that supports the Berkeley DB transaction model.

Fig. 1. Berkeley DB Subsystems [3]

Figure 1 displays a diagram of the Berkeley DB library architecture. The arrows are calls that invoke the destination. Each subsystem can also be used independently of the other ones, but this usage is not common.

3. SERVICES AND OTHER FEATURES

The two fundamental services that every database management system provides are data access and data management. Data access services include the low-level operations on the records, which were already mentioned in the introduction for the case of Berkeley DB. In terms of storage structure, Berkeley DB supports hash tables, Btrees, simple record-number-based storage, and persistent queues [4].

Data management services are the higher-level services (and features), such as concurrency, that ensure specific qualities for the operation of the system. These services include allowing simultaneous access to the records by multiple users (concurrency), changing multiple records at the same time (transaction), and complete recovery of the data from crashes (recovery) [4].

For concurrency, Berkeley DB is able to handle low-level services such as locking and shared buffer management transparently, while multiple processes and threads use the data.

For recovery, every application can ask Berkeley DB for recovery at startup time.

An ACID transaction ensures the following properties at the end of its operation [5]: Atomicity (either all or none of the records change), Consistency (the system goes from one valid state to another), Isolation (concurrent execution of multiple transactions yields the same result as their sequential execution), Durability (the result remains steady, even in case of a crash of the system). Berkeley DB "libraries provide strict ACID transaction semantics, by default. However, applications are allowed to relax the isolation guarantees the database system makes" [4].

Berkeley DB runs in the same address space as the application. As a result, there is no need for communication between processes and threads. On the other hand, as an embedded database management system, it does not provide a standalone server. However, server applications can be built over Berkeley DB, and many examples of Lightweight Directory Access Protocol (LDAP) servers have been built using it [2].

The database library for Berkeley DB consumes less than 300 kilobytes of text space on common architectures. That makes it a feasible solution for embedded systems with small capacities. Nonetheless, it can manage databases of up to 256 terabytes.

3.1. Supported Operating Systems and Languages

Berkeley DB supports nearly all modern operating systems. They include Windows, Linux, Mac OS X, Android, iPhone, Solaris, BSD, HP-UX, AIX, and RTOS such as VxWorks and QNX. The supported programming languages include "C, C++, Java, C#, Perl, Python, PHP, Tcl, Ruby and many others" [6].

3.2. Required Infrastructure

As infrastructure, Berkeley DB requires "underlying IEEE/ANSI Std 1003.1 (POSIX) system calls and can be ported easily to new architectures by adding stub routines to connect the native system interfaces to the Berkeley DB POSIX-style system calls" [7].

4. PRODUCTS AND LICENSING

The products include three implementations in C, C++, and Java (Oracle Berkeley DB, Oracle Berkeley XML, and Oracle Berkeley JE, respectively) [8].

Berkeley DB is an open source library and is free for use and redistribution in other open source products. The distribution includes complete source code for all three implementations and their supporting utilities, as well as complete documentation in HTML format [7].

For redistribution in commercial products, Sleepycat Software licenses four products, with prices ranging from US$900 to US$13,800 per processor [9] as of March 2017. The products, in order of ascending price and capabilities, are: Berkeley DB Data Store, Berkeley DB Concurrent Data Store, Berkeley DB Transactional Data Store, and Berkeley DB High Availability. The Sleepycat software also includes prebuilt libraries and binaries as part of support services, which are not provided in the free distribution. There is no additional license payment for embedded usage within the Oracle Retail Predictive Application Server (RPAS).

5. USE CASES

A notable number of open source and commercial products in different areas of technology use Berkeley DB. Open source use cases include Linux, UNIX, BSD, Apache, Solaris, MySQL, Sendmail, OpenLDAP, and MemcacheDB.

Proprietary applications "include directory servers from Sun and Hitachi; messaging servers from Openwave and LogicaCMG; switches, routers and gateways from Cisco, Motorola, Lucent, and Alcatel; storage products from EMC and HP; security products from RSA Security and Symantec; and Web applications at Amazon.com, LinkedIn and AOL" [6].

6. ADVANTAGES AND LIMITATIONS

Berkeley DB has two advantages over relational and object-oriented database systems when it comes to embedded applications. One is that it runs in the same address space as the application and thus does not require any inter-process communication, which can have a high cost in embedded applications. The other is the simplicity of its interface for operations, which does not require query language parsing. These two features, along with its small size, make Berkeley DB lightweight enough for many applications where there are tight constraints on resources.

However, with simplicity comes the lack of SQL features. If the user of the application needs to perform complicated searches (potentially using SQL queries), the programmer would need to write the code for those cases. In general, Berkeley DB is aimed at providing fast, reliable, transaction-protected record storage in a minimalist way [10].

7. EDUCATIONAL MATERIAL

As was mentioned in the Products section, the free distribution comes with complete documentation in HTML format. The documentation has two parts: a reference manual in UNIX style for programmers, and a reference guide which can serve as a tutorial [11]. In addition to that, Berkeley DB Tutorial and Reference Guide, Version 4.1.24 [7] and The Berkeley DB Book [12] are useful resources for learning more about Berkeley DB and getting started with it.

8. CONCLUSION

Berkeley DB is a minimal, lightweight database management system, focused on providing performance, especially in embedded systems. It offers a small, simple set of data access services, and a rich, powerful set of data management services. It is freely available for use by non-commercial distributions and has been successfully used in many projects.

REFERENCES

[1] I. Limited, Introduction to Database Systems:. Pearson, 2010. [Online]. Available: https://books.google.com/books?id=-YY-BAAAQBAJ
[2] "What Berkeley DB is not," Web Page, accessed: 2017-4-6. [Online]. Available: https://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/intro/dbisnot.html
[3] "The big picture," Web Page, accessed: 2017-4-6. [Online]. Available: https://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/arch/bigpic.html
[4] "What is Berkeley DB?" Web Page, accessed: 2017-4-6. [Online]. Available: https://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/intro/dbis.html
[5] T. Haerder and A. Reuter, "Principles of transaction-oriented database recovery," ACM Computing Surveys (CSUR), vol. 15, no. 4, pp. 287–317, 1983.
[6] "Oracle Berkeley database products," Web Page, accessed: 2017-4-6. [Online]. Available: http://www.oracle.com/technetwork/products/berkeleydb/learnmore/berkeley-db-family-datasheet-132751.pdf
[7] "Berkeley DB tutorial and reference guide, version 4.1.24," Web Page, accessed: 2017-4-6. [Online]. Available: https://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/reftoc.html
[8] "Oracle Berkeley DB 12c," Web Page, accessed: 2017-4-6. [Online]. Available: http://www.oracle.com/technetwork/database/database-technologies/berkeleydb/%20overview/index.html
[9] "Oracle technology global price list," Web Page, accessed: 2017-4-6. [Online]. Available: http://www.oracle.com/us/corporate/pricing/technology-price-list-070617.pdf
[10] "Do you need Berkeley DB?" Web Page, accessed: 2017-4-6. [Online]. Available: https://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/intro/need.html
[11] M. A. Olson, K. Bostic, and M. I. Seltzer, "Berkeley DB," in USENIX Annual Technical Conference, FREENIX Track, 1999, pp. 183–191.
[12] H. Yadava, The Berkeley DB Book, ser. Books for Professionals by Professionals. Apress, 2007. [Online]. Available: https://books.google.com/books?id=2wEkW7pQ0KwC

AUTHOR BIOGRAPHY

Saber Sheybani received his B.S. (Electrical Engineering - Minor in Control Engineering) from University of Tehran. He is currently a PhD student of Intelligent Systems Engineering - Neuroengineering at Indiana University Bloomington.

Apache Ranger

AVADHOOT AGASTI 1,*,+
1 School of Informatics and Computing, Bloomington, IN 47408, U.S.A.
* Corresponding authors: [email protected]
+ HID-SL-IO-3000

Paper 2, April 30, 2017

Apache Hadoop provides various data storage, data access and data processing services. Apache Ranger is part of the Hadoop ecosystem. Apache Ranger provides the capability to perform security administration tasks for storage, access and processing of data in Hadoop. Using Ranger, a Hadoop administrator can perform security administration tasks through a central user interface or RESTful web services. The Hadoop administrator can define policies which enable users or user groups to perform specific actions using Hadoop components and tools. Ranger provides role-based access control for datasets on Hadoop at column and row level. Ranger also provides centralized auditing of user access and security-related administrative actions.

© 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Apache Ranger, LDAP, Active Directory, Apache Knox, Apache Atlas, Apache Hive, Apache Hadoop, Yarn, Apache HBase, Apache Storm, Apache Kafka, Data Lake, Apache Sentry, Hive Server2, Java

https://github.com/cloudmesh/sp17-i524/raw/master/paper2/S17-IO-3000/report.pdf

1. INTRODUCTION
Apache Ranger is an open source software project designed to provide centralized security services to various components of Apache Hadoop. Apache Hadoop provides various mechanisms to store, process and access data. Each Apache tool has its own security mechanism. This increases administrative overhead and is also error prone. Apache Ranger fills this gap by providing a central security and auditing mechanism for various Hadoop components. Using Ranger, a Hadoop administrator can perform security administration tasks through a central user interface or RESTful web services. The administrator can define policies which enable users or user groups to perform specific actions using Hadoop components and tools. Ranger provides role-based access control for datasets on Hadoop at column and row level. Ranger also provides centralized auditing of user access and security-related administrative actions.

2. ARCHITECTURE OVERVIEW

[1] describes the important components of Ranger as explained below:

2.1. Ranger Admin Portal

The Ranger admin portal is the main interaction point for the user. A user can define policies using the Ranger admin portal. These policies are stored in a policy database. The policies are polled by the various plugins. The admin portal also collects the audit data from the plugins and stores it in HDFS or in a relational database.

2.2. Ranger Plugins

Plugins are Java programs which are invoked as part of the cluster component. For example, the ranger-hive plugin is embedded as part of Hive Server2. The plugins cache the policies, intercept each user request and evaluate it against the policies. Plugins also collect the audit data for that specific component and send it to the admin portal.

2.3. User group sync

While Ranger provides an authorization or access control mechanism, it needs to know the users and the groups. Ranger integrates with the Unix user management system, LDAP or Active Directory to fetch the user and group information. The user group sync component is responsible for this integration.

3. HADOOP COMPONENTS SUPPORTED BY RANGER

Ranger supports auditing and authorization for the following Hadoop components [2].

3.1. Apache Hadoop and HDFS

Apache Ranger provides a plugin for Hadoop, which helps in enforcing data access policies. The HDFS plugin works with the name node to check whether a user's access request to a file on HDFS is valid.

3.2. Apache Hive

Apache Hive provides an SQL interface on top of the data stored in HDFS. Apache Hive supports two types of authorization: storage based authorization and SQL standard authorization. Ranger provides a centralized authorization interface for Hive, which provides granular access control at table and column level. Ranger's Hive plugin is part of Hive Server2.

3.3. Apache HBase

Apache HBase is a NoSQL database implemented on top of Hadoop and HDFS. Ranger provides a coprocessor plugin for HBase, which performs authorization checks and audit log collection.

3.4. Apache Storm

Ranger provides a plugin to the Nimbus server which helps in performing security authorization on Apache Storm.

3.5. Apache Knox

Apache Knox provides service level authorization for users and groups. Ranger provides a plugin for Knox through which the administration of policies can be supported. The audit over Knox data enables the user to perform detailed analysis of who accessed Knox and when.

3.6. Apache Solr

Solr provides free text search capabilities on top of Hadoop. Ranger is useful to protect Solr collections from unauthorized usage.

3.7. Apache Kafka

Ranger can manage access control on Kafka topics. Policies can be implemented to control which users can write to a Kafka topic and which users can read from a Kafka topic.

3.8. Yarn

Yarn is the resource management layer for Hadoop. Administrators can set up queues in Yarn and then allocate users and resources on a per-queue basis. Policies can be defined in Ranger to define who can write to the various Yarn queues.

4. IMPORTANT FEATURES OF RANGER

The blog article [3] explains two important features of Apache Ranger.

4.1. Dynamic Column Masking

Dynamic data masking at column level is an important feature of Apache Ranger. Using this feature, the administrator can set up a data masking policy. The data masking makes sure that only authorized users can see the actual data, while other users see the masked data. Since the masked data is format preserving, those users can continue their work without getting access to the actual sensitive data. For example, application developers can use masked data to develop the application, whereas when the application is actually deployed, it will show actual data to the authorized user. Similarly, a security administrator may choose to mask a credit card number when it is displayed to a service agent.

4.2. Row Level Filtering

Data authorization is typically required at column level as well as at row level. For example, in an organization which is geographically distributed over many locations, the security administrator may want to give a specific user access to the data from a specific location only. As another example, a hospital data security administrator may want to allow doctors to see only their own patients. Using Ranger, such row level access control can be specified and implemented.

5. HADOOP DISTRIBUTION SUPPORT

Ranger can be deployed on top of Apache Hadoop. [4] provides detailed steps for building and deploying Ranger on top of Apache Hadoop.

The Hortonworks distribution of Hadoop (HDP) supports Ranger deployment using Ambari. [5] provides installation, deployment and configuration steps for Ranger as part of an HDP deployment.

The Cloudera Hadoop distribution (CDH) does not support Ranger. According to [6], Ranger is not recommended on CDH and Apache Sentry should instead be used as the central security and audit tool on top of CDH.

6. USE CASES

Apache Ranger provides a centralized security framework which can be useful in many use cases, as explained below.

6.1. Data Lake

[7] explains that storing many types of data in the same repository is one of the most important features of a data lake. With multiple datasets, the ownership, security and access control of the data become a primary concern. Using Apache Ranger, the security administrator can define fine-grained control over data access.

6.2. Multi-tenant Deployment of Hadoop

Hadoop provides the ability to store and process data from multiple tenants. The security framework provided by Apache Ranger can be utilized to protect the data and resources from unauthorized access.

7. APACHE RANGER AND APACHE SENTRY

According to [8], Apache Sentry and Apache Ranger have many features in common. Apache Sentry ([9]) provides role-based authorization to data and metadata stored in Hadoop.

8. EDUCATIONAL MATERIAL

[10] provides tutorials on topics like A) Security resources B) Auditing C) Securing HDFS, Hive and HBase with Knox and Ranger D) Using Apache Atlas' tag based policies with Ranger. [11] provides step by step guidance on getting the latest code base of Apache Ranger, building it and deploying it.

9. LICENSING

Apache Ranger is available under the Apache 2.0 License.

10. CONCLUSION

Apache Ranger is useful to Hadoop security administrators since it enables granular authorization and access control. It also provides a central security framework for different data storage and access mechanisms like Hive, HBase and Storm. Apache Ranger also provides an audit mechanism. With Apache Ranger, security can be enhanced for complex Hadoop use cases like a data lake.
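
As an illustration only, the plugin behaviour from Section 2.2 (evaluate a request against cached policies and record an audit entry) and the two features from Section 4 (column masking and row level filtering) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the Policy class, the mask and query functions and the group names are hypothetical and are not taken from the Ranger code base or its REST API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple

Row = Dict[str, str]


@dataclass
class Policy:
    """Hypothetical, simplified stand-in for a policy on one table."""
    table: str
    allowed_groups: Set[str]
    # Column name -> groups that may see the unmasked value (assumption).
    unmasked_groups: Dict[str, Set[str]] = field(default_factory=dict)
    # Optional row predicate: (user, row) -> True if the row is visible.
    row_filter: Optional[Callable[[str, Row], bool]] = None


def mask(value: str) -> str:
    """Format-preserving mask: same length, only the last four characters kept."""
    return "X" * max(len(value) - 4, 0) + value[-4:]


def query(policy: Policy, user: str, groups: Set[str],
          rows: List[Row], audit: List[Tuple[str, str, str]]) -> List[Row]:
    """Check access, write an audit entry, then filter rows and mask columns."""
    allowed = bool(groups & policy.allowed_groups)
    audit.append((user, policy.table, "ALLOWED" if allowed else "DENIED"))
    if not allowed:
        raise PermissionError(f"{user} may not read {policy.table}")

    result: List[Row] = []
    for row in rows:
        # Row level filtering: skip rows the predicate rejects.
        if policy.row_filter is not None and not policy.row_filter(user, row):
            continue
        visible: Row = {}
        for column, value in row.items():
            unmask = policy.unmasked_groups.get(column)
            # Column masking: mask protected columns for everyone else.
            if unmask is not None and not (groups & unmask):
                visible[column] = mask(value)
            else:
                visible[column] = value
        result.append(visible)
    return result


# Doctors see only their own patients; only the billing group sees card numbers.
patients_policy = Policy(
    table="patients",
    allowed_groups={"doctors", "billing"},
    unmasked_groups={"card": {"billing"}},
    row_filter=lambda user, row: row["doctor"] == user,
)
patients = [
    {"patient": "p1", "doctor": "alice", "card": "4111111111111111"},
    {"patient": "p2", "doctor": "bob", "card": "5500000000000004"},
]
audit_log: List[Tuple[str, str, str]] = []
print(query(patients_policy, "alice", {"doctors"}, patients, audit_log))
print(audit_log)
```

The sketch keeps the policy decision, the masking and the audit entry in one function so that the order of the steps is visible; a real deployment would define the policies in the admin portal and leave enforcement to the component plugin.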