
Big Data Software PDF

158 Pages·2017·9.6 MB·English


Big Data Software
Spring 2017
Bloomington, Indiana

Editor: Gregor von Laszewski
Department of Intelligent Systems Engineering
Indiana University
[email protected]

Contents

1 S17-ER-1001 Berkeley DB, Saber Sheybani
2 S17-IO-3000 Apache Ranger, Avadhoot Agasti
3 S17-IO-3005 Amazon Kinesis, Abhishek Gupta
4 S17-IO-3008 Google Cloud DNS, Vishwanath Kodre
5 S17-IO-3010 Robot Operating System (ROS), Matthew Lawson
6 S17-IO-3011 Apache Crunch, Scott McClary
7 S17-IO-3012 Apache MRQL - MapReduce Query Language, Mark McCombe
8 S17-IO-3013 Lightning Memory-Mapped Database (LMDB), Leonard Mwangi
9 S17-IO-3014 SciDB: An Array Database, Piyush Rai
10 S17-IO-3015 Cassandra, Sabyasachi Roy Choudhury
11 S17-IO-3016 Apache Derby, Ribka Rufael
12 S17-IO-3017 Facebook Tao, Nandita Sathe
13 S17-IO-3019 InCommon, Michael Smith
14 S17-IO-3020 Hadoop YARN, Milind Suryawanshi, Gregor von Laszewski
15 S17-IO-3021 Apache Tez - Application Data processing Framework, Abhijt Thakre
16 S17-IO-3022 Deployment Model of Juju, Sunanda Unni
17 S17-IO-3023 AWS Lambda, Karthick Venkatesan
18 S17-IO-3024 Not Submitted, Ashok Vuppada
19 S17-IR-2001 HUBzero: A Platform For Scientific Collaboration, Niteesh Kumar Akurati
20 S17-IR-2002 Apache Flink: Stream and Batch Processing, Jimmy Ardiansyah
21 S17-IR-2004 Jelastic, Ajit Balaga, S17-IR-2004
22 S17-IR-2006 An Overview of Apache Spark, Snehal Chemburkar, Rahul Raghatate
23 S17-IR-2008 An overview of Apache THRIFT and its architecture, Karthik Anbazhagan
24 S17-IR-2011 Hyper-V, Anurag Kumar Jain
25 S17-IR-2012 Retainable Evaluator Execution Framework, Pratik Jain
26 S17-IR-2013 A brief introduction to OpenCV, Sahiti Korrapati
27 S17-IR-2014 An Overview of Pivotal Web Services, Harshit Krishnakumar
28 S17-IR-2016 An Overview of Apache Avro, Author Missing
29 S17-IR-2017 An Overview of Pivotal HD/HAWQ and its Applications, Author Missing
30 S17-IR-2018 An overview of Cisco Intelligent Automation for Cloud, Bhavesh Reddy Merugureddy
31 S17-IR-2019 KeystoneML, Vasanth Methkupalli
32 S17-IR-2021 Amazon Elastic Beanstalk, Shree Govind Mishra
33 S17-IR-2022 ASKALON, Abhishek Naik
34 S17-IR-2024 Memcached, Ronak Parekh, Gregor von Laszewski
35 S17-IR-2026 Naiad, Rahul Raghatate, Snehal Chemburkar
36 S17-IR-2027 Dryad : Distributed Execution Engine, Shahidhya Ramachandran
37 S17-IR-2028 A Report on Apache Apex, Srikanth Ramanam
38 S17-IR-2029 Apache Mahout, Naveenkumar Ramaraju
39 S17-IR-2030 Neo4J, Sowmya Ravi
40 S17-IR-2031 OpenStack Nova: Compute Service of OpenStack Cloud, Kumar Satyam
41 S17-IR-2034 Heroku, Yatin Sharma
42 S17-IR-2035 D3, Piyush Shinde
43 S17-IR-2036 An overview of the open source log management tool - Graylog, Rahul Singh
44 S17-IR-2037 Jupyter Notebook vs Apache Zeppelin - A comparative study, Sriram Sitharaman
45 S17-IR-2038 Introduction to Terraform, Sushmita Sivaprasad
46 S17-IR-2041 Google BigQuery - A data warehouse for large-scale data analytics, Sagar Vora
47 S17-IR-2044 Hive, Diksha Yadav

Review Article, Spring 2017 - I524

Berkeley DB

SABER SHEYBANI1,*
1 School of Informatics and Computing, Bloomington, IN 47408, U.S.A.
* Corresponding authors: [email protected]
Paper 2, April 30, 2017

Berkeley DB is a family of open source, NoSQL key-value database libraries. It provides a simple function-call API for data access and management for a number of programming languages, including C, C++, Java, Perl, Tcl, Python, and PHP. Berkeley DB is embedded: it links directly into the application and runs in the same address space as the application. As a result, no inter-process communication, either over the network or between processes on the same machine, is required for database operations. It is also extremely portable and scalable; it can manage databases up to 256 terabytes in size. For data management, Berkeley DB offers advanced services, such as concurrency for many users, ACID transactions, and recovery.
Berkeley DB is used in a wide variety of products and a large number of projects, including gateways from Cisco, Web applications at Amazon.com and open-source projects such as Apache and Linux.

© 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: NoSQL, embedded database, Oracle, open source database management system

https://github.com/cloudmesh/sp17-i524/tree/master/paper2/S17-ER-1001/report.pdf

1. INTRODUCTION

Data management has always been a fundamental issue in programming. Since the 1960s, countless database management systems have been developed to fulfil different sorts of demands. The question for every user is choosing the system that best fits the requirements of their application.

Database management systems can be categorized, based on their data models, into a number of groups: Hierarchical Databases, Network Databases, Relational Databases, Object-based Databases, and Semistructured Databases.

Hierarchical databases use the oldest type of data model, which is a tree-like structure. The records are connected to each other with hard-coded links.

In a Network model, records are also connected with links, but there is no hierarchy. Instead, the structure is graph-like and all of the nodes can connect to each other.

In Relational databases, there are no physical links; instead, the data is structured in tables (relations). Each row represents a record and each column represents an attribute. The tables are connected through common attributes, which makes querying much easier than in the two former models. For this reason, database management systems using relational models are the most widely used ones.

Object-based data models extend concepts of object-oriented programming into database systems, in order to provide persistent storage of objects and other database capabilities for object-oriented programming.

Semistructured databases, which include NoSQL databases, are the type of database model that enables storage of heterogeneous data by allowing records with different attributes. This, however, is achieved by sacrificing the database system's knowledge of the data types. The data in this case must be self-describing, meaning that the description (schema) of the data must be contained in the data itself. XML (Extensible Markup Language) schema language is a widely used language for providing schemas for these database systems [1].

Berkeley DB fits into the last category, as a NoSQL database system. The records are stored as key-value pairs and a few logical operations can be executed on them, namely: insertion, deletion, finding a record by its key, and updating an already found record. "Berkeley DB never operates on the value part of a record. Values are simply payload, to be stored with keys and reliably delivered back to the application on demand." There is no notion of schema and no support for SQL queries. "The application must understand the keys and values that it uses. On the other hand, there is literally no limit to the data types that can be stored in a Berkeley DB database. The application never needs to convert its own program data into the data types that Berkeley DB supports. Berkeley DB is able to operate on any data type the application uses, no matter how complex" [2].

2. ARCHITECTURE

Berkeley DB's architecture can be explained by five major subsystems:

Access Methods: Providing general-purpose support for creating and accessing database files.
Memory Pool: The general-purpose shared memory buffer pool. Multiple processes and threads within processes share access to databases using this subsystem.
Transaction: Implementing the transaction model, realizing ACID properties.
Locking: The general-purpose lock manager for processes.
Logging: The write-ahead logging that supports the Berkeley DB transaction model.

Fig. 1. Berkeley DB Subsystems [3]

Figure 1 displays a diagram of the Berkeley DB library architecture. The arrows are calls that invoke the destination. Each subsystem can also be used independently of the other ones, but this usage is not common.

3. SERVICES AND OTHER FEATURES

The two fundamental services that every database management system provides are data access and data management services. Data access services include the low-level operations on the records, which were already mentioned in the introduction for the case of Berkeley DB. In terms of storage structure, Berkeley DB supports hash tables, Btrees, simple record-number-based storage, and persistent queues [4].

Data management services are the higher-level services (and features), such as concurrency, that ensure specific qualities for the operation of the system. These services include allowing simultaneous access to the records by multiple users (concurrency), changing multiple records at the same time (transaction), and complete recovery of the data from crashes (recovery) [4].

For concurrency, Berkeley DB is able to handle low-level services such as locking and shared buffer management transparently, while multiple processes and threads use the data.

For recovery, every application can ask Berkeley DB for recovery at startup time.

An ACID transaction ensures the following properties at the end of its operation [5]: Atomicity (either all or none of the records change), Consistency (the system goes from one valid state to another), Isolation (concurrent execution of multiple transactions yields the same result as their sequential execution), Durability (the result remains steady, even in case of a crash of the system). Berkeley DB "libraries provide strict ACID transaction semantics, by default. However, applications are allowed to relax the isolation guarantees the database system makes" [4].

Berkeley DB runs in the same address space as the application. As a result, there is no need for communication between processes and threads. On the other hand, as an embedded database management system, it does not provide a standalone server. However, server applications can be built over Berkeley DB, and many examples of Lightweight Directory Access Protocol (LDAP) servers have been built using it [2].

The database library for Berkeley DB consumes less than 300 kilobytes of text space on common architectures. That makes it a feasible solution for embedded systems with small capacities. Nonetheless, it can manage databases of up to 256 terabytes.

3.1. Supported Operating Systems and Languages

Berkeley DB supports nearly all modern operating systems. They include Windows, Linux, Mac OS X, Android, iPhone, Solaris, BSD, HP-UX, AIX, and real-time operating systems (RTOS) such as VxWorks and QNX. The supported programming languages include "C, C++, Java, C#, Perl, Python, PHP, Tcl, Ruby and many others" [6].

3.2. Required Infrastructure

As infrastructure, Berkeley DB requires "underlying IEEE/ANSI Std 1003.1 (POSIX) system calls and can be ported easily to new architectures by adding stub routines to connect the native system interfaces to the Berkeley DB POSIX-style system calls" [7].

4. PRODUCTS AND LICENSING

The products include three implementations in C, C++, and Java (Oracle Berkeley DB, Oracle Berkeley XML, and Oracle Berkeley JE, respectively) [8].

Berkeley DB is an open source library and is free for use and redistribution in other open source products. The distribution includes complete source code for all three implementations and their supporting utilities, as well as complete documentation in HTML format [7].

For redistribution in commercial products, Sleepycat Software licenses four products, with prices ranging from US$900 to US$13,800 per processor [9] as of March 2017. The products, in order of ascending price and capabilities, are: Berkeley DB Data Store, Berkeley DB Concurrent Data Store, Berkeley DB Transactional Data Store, and Berkeley DB High Availability. The Sleepycat software also includes prebuilt libraries and binaries as part of its support services, which are not provided in the free distribution. There is no additional license payment for embedded usage within the Oracle Retail Predictive Application Server (RPAS).

5. USE CASES

A notable number of open source and commercial products in different areas of technology use Berkeley DB. Open source use cases include Linux, UNIX, BSD, Apache, Solaris, MySQL, Sendmail, OpenLDAP, and MemcacheDB.

Proprietary applications "include directory servers from Sun and Hitachi; messaging servers from Openwave and LogicaCMG; switches, routers and gateways from Cisco, Motorola, Lucent, and Alcatel; storage products from EMC and HP; security products from RSA Security and Symantec; and Web applications at Amazon.com, LinkedIn and AOL" [6].

6. ADVANTAGES AND LIMITATIONS

Berkeley DB has two advantages over relational and object-oriented database systems when it comes to embedded applications. One is that it runs in the same address space as the application and thus does not require any inter-process communication, which can have a high cost in embedded applications. The other is the simplicity of its interface for operations, which does not require query language parsing. These two features, along with its small size, make Berkeley DB lightweight enough for many applications where there are tight constraints on resources.

However, with simplicity comes the lack of SQL features. If the user of the application needs to perform complicated searches (potentially using SQL queries), the programmer would need to write the code for those cases. In general, Berkeley DB is aimed at providing fast, reliable, transaction-protected record storage in a minimalist way [10].

7. EDUCATIONAL MATERIAL

As was mentioned in the Products section, the free distribution comes with complete documentation in HTML format. The documentation has two parts: a reference manual in UNIX style for programmers, and a reference guide which can serve as a tutorial [11]. In addition to that, Berkeley DB Tutorial and Reference Guide, Version 4.1.24 [7] and The Berkeley DB Book [12] are useful resources for learning more about Berkeley DB and getting started with it.

8. CONCLUSION

Berkeley DB is a minimal, lightweight database management system, focused on providing performance, especially in embedded systems. It offers a small, simple set of data access services, and a rich, powerful set of data management services. It is freely available for use by non-commercial distributions and has been successfully used in many projects.

REFERENCES

[1] I. Limited, Introduction to Database Systems. Pearson, 2010. [Online]. Available: https://books.google.com/books?id=-YY-BAAAQBAJ
[2] "What Berkeley DB is not," Web Page, accessed: 2017-4-6. [Online]. Available: https://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/intro/dbisnot.html
[3] "The big picture," Web Page, accessed: 2017-4-6. [Online]. Available: https://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/arch/bigpic.html
[4] "What is Berkeley DB?" Web Page, accessed: 2017-4-6. [Online]. Available: https://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/intro/dbis.html
[5] T. Haerder and A. Reuter, "Principles of transaction-oriented database recovery," ACM Computing Surveys (CSUR), vol. 15, no. 4, pp. 287–317, 1983.
[6] "Oracle Berkeley database products," Web Page, accessed: 2017-4-6. [Online]. Available: http://www.oracle.com/technetwork/products/berkeleydb/learnmore/berkeley-db-family-datasheet-132751.pdf
[7] "Berkeley DB tutorial and reference guide, version 4.1.24," Web Page, accessed: 2017-4-6. [Online]. Available: https://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/reftoc.html
[8] "Oracle Berkeley DB 12c," Web Page, accessed: 2017-4-6. [Online]. Available: http://www.oracle.com/technetwork/database/database-technologies/berkeleydb/%20overview/index.html
[9] "Oracle technology global price list," Web Page, accessed: 2017-4-6. [Online]. Available: http://www.oracle.com/us/corporate/pricing/technology-price-list-070617.pdf
[10] "Do you need Berkeley DB?" Web Page, accessed: 2017-4-6. [Online]. Available: https://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/intro/need.html
[11] M. A. Olson, K. Bostic, and M. I. Seltzer, "Berkeley DB," in USENIX Annual Technical Conference, FREENIX Track, 1999, pp. 183–191.
[12] H. Yadava, The Berkeley DB Book, ser. Books for Professionals by Professionals. Apress, 2007. [Online]. Available: https://books.google.com/books?id=2wEkW7pQ0KwC

AUTHOR BIOGRAPHY

Saber Sheybani received his B.S. (Electrical Engineering - Minor in Control Engineering) from University of Tehran. He is currently a PhD student of Intelligent Systems Engineering - Neuroengineering at Indiana University Bloomington.

Review Article, Spring 2017 - I524

Apache Ranger

AVADHOOT AGASTI1,*,+
1 School of Informatics and Computing, Bloomington, IN 47408, U.S.A.
* Corresponding authors: [email protected]
+ HID-SL-IO-3000
Paper 2, April 30, 2017

Apache Hadoop provides various data storage, data access and data processing services. Apache Ranger is part of the Hadoop ecosystem. Apache Ranger provides the capability to perform security administration tasks for storage, access and processing of data in Hadoop. Using Ranger, a Hadoop administrator can perform security administration tasks using a central user interface or RESTful web services. The Hadoop administrator can define policies which enable users or user groups to perform specific actions using Hadoop components and tools. Ranger provides role-based access control for datasets on Hadoop at column and row level. Ranger also provides centralized auditing of user access and security-related administrative actions.

© 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Apache Ranger, LDAP, Active Directory, Apache Knox, Apache Atlas, Apache Hive, Apache Hadoop, Yarn, Apache HBase, Apache Storm, Apache Kafka, Data Lake, Apache Sentry, HiveServer2, Java

https://github.com/cloudmesh/sp17-i524/raw/master/paper2/S17-IO-3000/report.pdf

1. INTRODUCTION

Apache Ranger is an open source software project designed to provide centralized security services to the various components of Apache Hadoop. Apache Hadoop provides various mechanisms to store, process and access data. Each Apache tool has its own security mechanism, which increases administrative overhead and is error prone. Apache Ranger fills this gap by providing a central security and auditing mechanism for the various Hadoop components. Using Ranger, a Hadoop administrator can perform security administration tasks using a central user interface or RESTful web services. The administrator can define policies which enable users or user groups to perform specific actions using Hadoop components and tools. Ranger provides role-based access control for datasets on Hadoop at column and row level. Ranger also provides centralized auditing of user access and security-related administrative actions.

2. ARCHITECTURE OVERVIEW

Reference [1] describes the important components of Ranger, as explained below.

2.1. Ranger Admin Portal

The Ranger admin portal is the main interaction point for the user. A user can define policies using the Ranger admin portal. These policies are stored in a policy database. The policies are polled by the various plugins. The admin portal also collects the audit data from the plugins and stores it in HDFS or in a relational database.

2.2. Ranger Plugins

Plugins are Java programs which are invoked as part of the cluster component. For example, the ranger-hive plugin is embedded as part of HiveServer2. The plugins cache the policies, intercept each user request and evaluate it against the policies. Plugins also collect the audit data for that specific component and send it to the admin portal.

2.3. User Group Sync

While Ranger provides the authorization or access control mechanism, it needs to know the users and the groups. Ranger integrates with the Unix user management system, LDAP or Active Directory to fetch the user and group information. The user group sync component is responsible for this integration.

3. HADOOP COMPONENTS SUPPORTED BY RANGER

Ranger supports auditing and authorization for the following Hadoop components [2].

3.1. Apache Hadoop and HDFS

Apache Ranger provides a plugin for Hadoop, which helps in enforcing data access policies. The HDFS plugin works with the namenode to check whether the user's access request to a file on HDFS is valid or not.

3.2. Apache Hive

Apache Hive provides a SQL interface on top of the data stored in HDFS. Apache Hive supports two types of authorization: storage-based authorization and SQL standard authorization. Ranger provides a centralized authorization interface for Hive, which provides granular access control at table and column level. Ranger's Hive plugin is part of HiveServer2.

3.3. Apache HBase

Apache HBase is a NoSQL database implemented on top of Hadoop and HDFS. Ranger provides a coprocessor plugin for HBase, which performs authorization checks and audit log collection.

3.4. Apache Storm

Ranger provides a plugin to the Nimbus server which helps in performing security authorization on Apache Storm.

3.5. Apache Knox

Apache Knox provides service-level authorization for users and groups. Ranger provides a plugin for Knox through which the administration of policies can be supported. Auditing of Knox data enables the user to perform detailed analysis of who accessed Knox and when.

3.6. Apache Solr

Solr provides free text search capabilities on top of Hadoop. Ranger is useful for protecting Solr collections from unauthorized usage.

3.7. Apache Kafka

Ranger can manage access control on Kafka topics. Policies can be implemented to control which users can write to a Kafka topic and which users can read from a Kafka topic.

3.8. Yarn

Yarn is the resource management layer for Hadoop. Administrators can set up queues in Yarn and then allocate users and resources on a per-queue basis. Policies can be defined in Ranger to define who can write to the various Yarn queues.

4. IMPORTANT FEATURES OF RANGER

The blog article [3] explains the two important features of Apache Ranger.

4.1. Dynamic Column Masking

Dynamic data masking at column level is an important feature of Apache Ranger. Using this feature, the administrator can set up a data masking policy. Data masking makes sure that only authorized users can see the actual data, while other users see the masked data. Since the masked data is format preserving, those users can continue their work without getting access to the actual sensitive data. For example, application developers can use masked data to develop the application, whereas when the application is actually deployed, it will show the actual data to the authorized user. Similarly, a security administrator may choose to mask a credit card number when it is displayed to a service agent.

4.2. Row Level Filtering

Data authorization is typically required at column level as well as at row level. For example, in an organization which is geographically distributed across many locations, the security administrator may want to give a specific user access to the data from a specific location. In another example, a hospital data security administrator may want to allow each doctor to see only his or her own patients. Using Ranger, such row-level access control can be specified and implemented.

5. HADOOP DISTRIBUTION SUPPORT

Ranger can be deployed on top of Apache Hadoop. Reference [4] provides detailed steps for building and deploying Ranger on top of Apache Hadoop.

The Hortonworks distribution of Hadoop (HDP) supports Ranger deployment using Ambari. Reference [5] provides installation, deployment and configuration steps for Ranger as part of an HDP deployment.

The Cloudera Hadoop Distribution (CDH) does not support Ranger. According to [6], Ranger is not recommended on CDH; instead, Apache Sentry should be used as the central security and audit tool on top of CDH.

6. USE CASES

Apache Ranger provides a centralized security framework which can be useful in many use cases, as explained below.

6.1. Data Lake

Reference [7] explains that storing many types of data in the same repository is one of the most important features of a data lake. With multiple datasets, the ownership, security and access control of the data become the primary concern. Using Apache Ranger, the security administrator can define fine-grained control over data access.

6.2. Multi-tenant Deployment of Hadoop

Hadoop provides the ability to store and process data from multiple tenants. The security framework provided by Apache Ranger can be utilized to protect the data and resources from unauthorized access.

7. APACHE RANGER AND APACHE SENTRY

According to [8], Apache Sentry and Apache Ranger have many features in common. Apache Sentry [9] provides role-based authorization to data and metadata stored in Hadoop.

8. EDUCATIONAL MATERIAL

Reference [10] provides tutorials on topics such as A) security resources, B) auditing, C) securing HDFS, Hive and HBase with Knox and Ranger, and D) using Apache Atlas' tag-based policies with Ranger. Reference [11] provides step-by-step guidance on getting the latest code base of Apache Ranger, building it and deploying it.

9. LICENSING

Apache Ranger is available under the Apache 2.0 License.

10. CONCLUSION

Apache Ranger is useful to Hadoop security administrators since it enables granular authorization and access control. It also provides a central security framework for different data storage and access mechanisms such as Hive, HBase and Storm. Apache Ranger also provides an audit mechanism. With Apache Ranger, security can be enhanced for complex Hadoop use cases such as a data lake.
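
The plugin behaviour described in Section 2.2 of the Apache Ranger article (cache the policies, intercept a request, evaluate it, record audit data) can be illustrated with a small sketch. This is not Ranger's Java plugin API: the `Policy` and `Plugin` classes, the resource naming scheme, and the deny-by-default rule are all assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy record: who may do what on which resource."""
    resource: str                    # e.g. "hive:sales.orders" (made-up naming)
    actions: frozenset               # e.g. frozenset({"select"})
    users: frozenset = frozenset()
    groups: frozenset = frozenset()


@dataclass
class Plugin:
    """Toy stand-in for a component plugin holding a local policy cache."""
    component: str
    cache: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)

    def refresh(self, policy_store):
        # Poll the central store and keep a local copy, so requests can be
        # checked without contacting the admin portal each time.
        self.cache = list(policy_store)

    def is_allowed(self, user, groups, resource, action):
        # Assumption: a request is denied unless some cached policy grants it.
        allowed = any(
            p.resource == resource
            and action in p.actions
            and (user in p.users or p.groups & set(groups))
            for p in self.cache
        )
        # Record every decision so it could later be sent to the portal.
        self.audit_log.append(
            {"component": self.component, "user": user,
             "resource": resource, "action": action, "allowed": allowed}
        )
        return allowed
```

A plugin created with `Plugin("hive")` and refreshed from a list of `Policy` objects answers `is_allowed(...)` from its cache alone, which mirrors the reason the article gives for caching: the check runs inside the component that receives the request.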

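Sections 4.1 and 4.2 of the Apache Ranger article describe format-preserving column masking and row-level filtering. The sketch below shows the two ideas on plain Python dictionaries. It makes no claim about how Ranger implements them: the function names, the choice to keep the last four digits visible, and the sample column names are hypothetical.

```python
def mask_card_number(number, visible=4, mask_char="X"):
    """Mask all but the last `visible` digits, leaving separators in place."""
    total_digits = sum(ch.isdigit() for ch in number)
    to_mask = max(total_digits - visible, 0)
    out = []
    for ch in number:
        if ch.isdigit() and to_mask > 0:
            out.append(mask_char)   # hide this digit
            to_mask -= 1
        else:
            out.append(ch)          # keep separators and the trailing digits
    return "".join(out)


def apply_masking(rows, column, mask):
    """Column masking: return copies of `rows` with `column` passed through `mask`."""
    return [{**row, column: mask(row[column])} for row in rows]


def filter_rows(rows, column, allowed_value):
    """Row-level filter: keep only rows whose `column` equals `allowed_value`."""
    return [row for row in rows if row.get(column) == allowed_value]
```

Because the masked value keeps its length and separators, code that only depends on the shape of the field keeps working, which is the point of the format-preserving property mentioned in the article. Filtering on a `doctor` column with `filter_rows` corresponds to the hospital example, where each doctor sees only his or her own patients.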