ebook img

Apache Hive Guide PDF

96 Pages·2017·2.43 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Apache Hive Guide

Apache Hive Guide ImportantNotice ©2010-2021Cloudera,Inc.Allrightsreserved. Cloudera,theClouderalogo,andanyotherproductor servicenamesorsloganscontainedinthisdocumentaretrademarksofClouderaand itssuppliersorlicensors,andmaynotbecopied,imitatedorused,inwholeorinpart, withoutthepriorwrittenpermissionofClouderaortheapplicabletrademarkholder.If thisdocumentationincludescode,includingbutnotlimitedto,codeexamples,Cloudera makesthisavailabletoyouunderthetermsoftheApacheLicense,Version2.0,including anyrequirednotices.AcopyoftheApacheLicenseVersion2.0,includinganynotices, isincludedherein.AcopyoftheApacheLicenseVersion2.0canalsobefoundhere: https://opensource.org/licenses/Apache-2.0 HadoopandtheHadoopelephantlogoaretrademarksoftheApacheSoftware Foundation.Allothertrademarks,registeredtrademarks,productnamesandcompany namesorlogosmentionedinthisdocumentarethepropertyoftheirrespectiveowners. Referencetoanyproducts,services,processesorotherinformation,bytradename, trademark,manufacturer,supplierorotherwisedoesnotconstituteorimply endorsement,sponsorshiporrecommendationthereofbyus. Complyingwithallapplicablecopyrightlawsistheresponsibilityoftheuser.Without limitingtherightsundercopyright,nopartofthisdocumentmaybereproduced,stored inorintroducedintoaretrievalsystem,ortransmittedinanyformorbyanymeans (electronic,mechanical,photocopying,recording,orotherwise),orforanypurpose, withouttheexpresswrittenpermissionofCloudera. Clouderamayhavepatents,patentapplications,trademarks,copyrights,orother intellectualpropertyrightscoveringsubjectmatterinthisdocument.Exceptasexpressly providedinanywrittenlicenseagreementfromCloudera,thefurnishingofthisdocument doesnotgiveyouanylicensetothesepatents,trademarkscopyrights,orother intellectualproperty.ForinformationaboutpatentscoveringClouderaproducts,see http://tiny.cloudera.com/patents. Theinformationinthisdocumentissubjecttochangewithoutnotice.Clouderashall notbeliableforanydamagesresultingfromtechnicalerrorsoromissionswhichmay bepresentinthisdocument,orfromuseofthisdocument. Cloudera,Inc. 395PageMillRoad PaloAlto,CA94306 [email protected] US:1-888-789-1488 Intl:1-650-362-0488 www.cloudera.com ReleaseInformation Version:ClouderaEnterprise5.8.x Date:February3,2021 Table of Contents Using Apache Hive with CDH....................................................................................7 UseCasesforHive................................................................................................................................................7 Installation............................................................................................................................................................8 Upgrading.............................................................................................................................................................8 Configuration.......................................................................................................................................................8 The Metastore Database......................................................................................................................................8 HiveServer2..........................................................................................................................................................9 Hive on Spark.......................................................................................................................................................9 Hive and HBase....................................................................................................................................................9 Transaction(ACID)SupportinHive......................................................................................................................9 Managing Hive...................................................................................................................................................10 Ingesting Data with Hive....................................................................................................................................10 Tuning.................................................................................................................................................................10 HighAvailability..................................................................................................................................................11 Replication..........................................................................................................................................................11 Security..............................................................................................................................................................11 Troubleshooting.................................................................................................................................................12 Upstream Information for Hive..........................................................................................................................12 Hive Installation.....................................................................................................13 AboutHive..........................................................................................................................................................13 HiveServer2..........................................................................................................................................................................13 Upgrading Hive...................................................................................................................................................13 ChecklisttoHelpEnsureSmoothUpgrades.........................................................................................................................13 UpgradingHivefromaLowerVersionofCDH5..................................................................................................................14 Installing Hive.....................................................................................................................................................17 HeapSizeandGarbageCollectionforHiveComponents.....................................................................................................17 Configuration for WebHCat.................................................................................................................................................19 ConfiguringtheHiveMetastore.........................................................................................................................19 Metastore Deployment Modes............................................................................................................................................19 Supported Metastore Databases.........................................................................................................................................21 MetastoreMemoryRequirements.......................................................................................................................................22 Configuringthe Metastore Database..................................................................................................................................22 Configuring HiveServer2....................................................................................................................................32 HiveServer2 Memory Requirements....................................................................................................................................32 TableLockManager(Required)...........................................................................................................................................32 hive.zookeeper.client.port.....................................................................................................................................33 JDBC driver...........................................................................................................................................................................33 Authentication.....................................................................................................................................................................33 RunningHiveServer2andHiveServerConcurrently.............................................................................................................33 Starting the Metastore.......................................................................................................................................34 File System Permissions.....................................................................................................................................34 Starting,Stopping,andUsingHiveServer2.........................................................................................................35 UsingtheBeelineCLI............................................................................................................................................................35 StartingHiveServer1andtheHiveConsole........................................................................................................36 Using Hive with HBase.......................................................................................................................................36 UsingtheHiveSchemaTool...............................................................................................................................37 Schema Version Verification................................................................................................................................................37 Using schematool................................................................................................................................................................37 InstallingtheHiveJDBCDriveronClients..........................................................................................................39 Setting HADOOP_MAPRED_HOME....................................................................................................................40 ConfiguringtheMetastoretoUseHDFSHighAvailability..................................................................................40 Viewing the Hive Documentation......................................................................................................................40 Managing Hive.......................................................................................................41 Hive Roles...........................................................................................................................................................41 Hive Execution Engines......................................................................................................................................41 ManagingHiveUsingClouderaManager...........................................................................................................42 Running Hive on Spark.......................................................................................................................................43 Configuring Hive on Spark...................................................................................................................................................43 TroubleshootingHiveonSpark............................................................................................................................................43 HiveServer2 Web UI...........................................................................................................................................45 AccessingtheHiveServer2WebUI.......................................................................................................................................45 HiveServer2WebUIConfiguration......................................................................................................................................45 HiveTableStatistics............................................................................................................................................46 ManagingUser-DefinedFunctions(UDFs)withHiveServer2.............................................................................46 User-DefinedFunctions(UDFs)withHiveServer2UsingClouderaManager.......................................................................46 User-DefinedFunctions(UDFs)withHiveServer2UsingtheCommandLine.......................................................................48 Tuning Hive............................................................................................................50 HeapSizeandGarbageCollectionforHiveComponents...................................................................................50 MemoryRecommendations.................................................................................................................................................50 ConfiguringHeapSizeandGarbageCollection...................................................................................................................51 HiveServer2PerformanceTuningandTroubleshooting.....................................................................................52 SymptomsDisplayedWhenHiveServer2HeapMemoryisFull............................................................................................52 HiveServer2PerformanceBestPractices.............................................................................................................................55 Tuning Hive on Spark..............................................................................................59 YARN Configuration............................................................................................................................................59 SparkConfiguration............................................................................................................................................59 Hive Configuration.............................................................................................................................................61 Pre-warmingYARNContainers.............................................................................................................................................62 Hive Metastore High Availability............................................................................63 EnablingHiveMetastoreHighAvailabilityUsingClouderaManager.................................................................63 EnablingHiveMetastoreHighAvailabilityUsingtheCommandLine................................................................63 Configuring HiveServer2 High Availability in CDH...................................................66 EnablingHiveServer2HighAvailabilityUsingClouderaManager......................................................................66 ConfiguringHiveServer2toLoadBalanceBehindaProxyonUnmanagedClusters..........................................66 UnmanagedClusterswithKerberosEnabled.......................................................................................................................67 UnmanagedClustersWITHOUTKerberos............................................................................................................................68 Hive Replication.....................................................................................................71 NetworkLatencyandReplication.......................................................................................................................71 HostSelectionforHiveReplication....................................................................................................................71 HiveTablesandDDLCommands........................................................................................................................71 PerformanceandScalabilityLimitations............................................................................................................72 HiveReplicationinDynamicEnvironments........................................................................................................72 Configuring Replication of Hive Data.................................................................................................................72 ViewingReplicationSchedules...........................................................................................................................75 Enabling,Disabling,orDeletingAReplicationSchedule......................................................................................................77 Viewing Replication History...............................................................................................................................77 Hive Authentication...............................................................................................80 HiveServer2 Security Configuration...................................................................................................................80 EnablingKerberosAuthenticationforHiveServer2..............................................................................................................80 UsingLDAPUsername/PasswordAuthenticationwithHiveServer2....................................................................................81 ConfiguringLDAPSAuthenticationwithHiveServer2..........................................................................................................83 Pluggable Authentication....................................................................................................................................................83 TrustedDelegationwithHiveServer2...................................................................................................................................84 HiveServer2Impersonation..................................................................................................................................................84 SecuringtheHiveMetastore................................................................................................................................................85 DisablingtheHiveSecurityConfiguration...........................................................................................................................85 HiveMetastoreServerSecurityConfiguration...................................................................................................86 UsingHivetoRunQueriesonaSecureHBaseServer........................................................................................87 Configuring Encrypted Communication Between HiveServer2 and Client Drivers....88 ConfiguringEncryptedClient/ServerCommunicationUsingTLS/SSL................................................................88 UsingClouderaManager.....................................................................................................................................................88 UsingtheCommandLine.....................................................................................................................................................89 ConfiguringEncryptedClient/ServerCommunicationUsingSASLQOP.............................................................89 Hive SQL Syntax for Use with Sentry.......................................................................91 Column-level Authorization...............................................................................................................................91 CREATEROLEStatement.....................................................................................................................................92 DROP ROLE Statement.......................................................................................................................................92 GRANT ROLE Statement.....................................................................................................................................93 REVOKEROLEStatement....................................................................................................................................93 GRANT<Privilege>Statement............................................................................................................................93 GRANT<Privilege>ONURIs(HDFSandS3A).....................................................................................................93 REVOKE <Privilege> Statement..........................................................................................................................94 GRANT<Privilege>...WITHGRANTOPTION......................................................................................................94 SETROLEStatement...........................................................................................................................................95 SHOW Statement...............................................................................................................................................95 Example:UsingGrant/RevokeStatementstoMatchanExistingPolicyFile......................................................96 Troubleshooting Hive.............................................................................................98 HiveServer2PerformanceTuningandTroubleshooting.....................................................................................98 SymptomsDisplayedWhenHiveServer2HeapMemoryisFull............................................................................................98 HiveServer2PerformanceBestPractices...........................................................................................................................101 Appendix: Apache License, Version 2.0.................................................................105 UsingApacheHivewithCDH Using Apache Hive with CDH Hivedatawarehousesoftwareenablesreading,writing,andmanaginglargedatasetsindistributedstorage.Usingthe Hivequerylanguage(HiveQL),whichisverysimilartoSQL,queriesareconvertedintoaseriesofjobsthatexecuteon aHadoopclusterthroughMapReduceorApacheSpark. UserscanrunbatchprocessingworkloadswithHivewhilealsoanalyzingthesamedataforinteractiveSQLor machine-learningworkloadsusingtoolslikeApacheImpalaorApacheSpark—allwithinasingleplatform. AspartofCDH,Hivealsobenefitsfrom: • UnifiedresourcemanagementprovidedbyYARN • SimplifieddeploymentandadministrationprovidedbyClouderaManager • SharedsecurityandgovernancetomeetcompliancerequirementsprovidedbyApacheSentryandCloudera Navigator Continuereading: • UseCasesforHive • Installation • Upgrading • Configuration • TheMetastoreDatabase • HiveServer2 • HiveonSpark • HiveandHBase • Transaction(ACID)SupportinHive • ManagingHive • IngestingDatawithHive • Tuning • HighAvailability • Replication • Security • Troubleshooting • UpstreamInformationforHive Use Cases for Hive BecauseHiveisapetabyte-scaledatawarehousesystembuiltontheHadoopplatform,itisagoodchoicefor environmentsexperiencingphenomenalgrowthindatavolume.TheunderlyingMapReduceinterfacewithHDFSis hardtoprogramdirectly,butHiveprovidesanSQLinterface,makingitpossibletouseexistingprogrammingskillsto performdatapreparation. HiveonMapReduceorSparkisbest-suitedforbatchdatapreparationorETL: • YoumustrunscheduledbatchjobswithverylargeETLsortswithjoinstopreparedataforHadoop.Mostdata servedtoBIusersinImpalaispreparedbyETLdevelopersusingHive. • Yourundatatransferorconversionjobsthattakemanyhours.WithHive,ifaproblemoccurspartwaythrough suchajob,itrecoversandcontinues. • Youreceiveorprovidedataindiverseformats,wheretheHiveSerDesandvarietyofUDFsmakeitconvenientto ingestandconvertthedata.Typically,thefinalstageoftheETLprocesswithHivemightbetoahigh-performance, widelysupportedformatsuchasParquet. ApacheHiveGuide|7 UsingApacheHivewithCDH Installation OnaclustermanagedbyClouderaManager,HivecomesalongwiththebaseCDHinstallationanddoesnotneedto beinstalledseparately.WithClouderaManager,youcanenableordisabletheHiveservice,buttheHivecomponent alwaysremainspresentonthecluster. Onanunmanagedcluster,youcaninstallHivemanually,usingpackagesortarballswiththeappropriatecommandfor youroperatingsystem. InstalltheappropriateHivepackagesusingtheappropriatecommandforyourdistribution. OS Command RHEL-compatible $ sudo yum install <pkg1> <pkg2> ... SLES $ sudo zypper install <pkg1> <pkg2> ... UbuntuorDebian $ sudo apt-get install <pkg1> <pkg2> ... Thepackagesare: • hive–basepackagethatprovidesthecompletelanguageandruntime • hive-metastore–providesscriptsforrunningthemetastoreasastandaloneservice(optional) • hive-server2–providesscriptsforrunningHiveServer2 • hive-hbase-optional;installthispackageifyouwanttouseHivewithHBase. SeeInstallingHiveonpage17fordetailsaboutinstallingandconfiguringHivecomponents. ToaccesstheHiveserverwithJDBCclients,suchasBeeline,installtheJDBCdriverforHiveServer2thatisdefinedin org.apache.hive.jdbc.HiveDriver. SeeInstallingtheHiveJDBCDriveronClientsonpage39fordetailsaboutinstallingtheJDBCdriversandtheconnection URLstousetoconnecttoHiveServer2fromHiveclients. Upgrading UpgradeHiveonallthehostsonwhichitisrunningincludingbothserversandclients. SeeUpgradingHiveonpage13fordetailsaboutdeprecatedversions,upgradingbestpractices,andinformationabout upgradingtheHivemetastoreschema. Configuration Hiveoffersanumberofconfigurationsettingsrelatedtoperformance,filelayoutandhandling,andoptionstocontrol SQLsemantics.Dependingonyourclustersizeandworkloads,configureHiveServer2memory,tablelockingbehavior, andauthenticationforconnections.SeeConfiguringHiveServer2onpage32fordetailsaboutrequiredconfiguration changesthatyoumustperform. TheHivemetastoreservice,whichstoresthemetadataforHivetablesandpartitions,mustalsobeconfigured.See ConfiguringtheHiveMetastoreonpage19fordetailsaboutdeploymentmodes,informationaboutsupported metastoredatabases,andspecificconfigurationsforMySQL,PostgreSQL,andOracle. The Metastore Database ThemetastoredatabaseisanimportantaspectoftheHiveinfrastructure.Itisaseparatedatabase,relyingona traditionalRDBMSsuchasMySQLorPostgreSQL,thatholdsmetadataaboutHivedatabases,tables,columns,partitions, andHadoop-specificinformationsuchastheunderlyingdatafilesandHDFSblocklocations. 8|ApacheHiveGuide UsingApacheHivewithCDH Themetastoredatabaseissharedbyothercomponents.Forexample,thesametablescanbeinsertedinto,queried, altered,andsoonbybothHiveandImpala.Althoughyoumightseereferencestothe“Hivemetastore”,beaware thatthemetastoredatabaseisusedbroadlyacrosstheHadoopecosystem,evenincaseswhereyouarenotusingHive itself. Themetastoredatabaseisrelativelycompact,withfast-changingdata.Backup,replication,andotherkindsof managementoperationsaffectthisdatabase.SeeConfiguringtheHiveMetastoreonpage19fordetailsabout configuringtheHivemetastore. ClouderarecommendsthatyoudeploytheHivemetastore,whichstoresthemetadataforHivetablesandpartitions, in“remotemode.”InthismodethemetastoreservicerunsinitsownJVMprocessandotherservices,suchas HiveServer2,HCatalog,andApacheImpalacommunicatewiththemetastoreusingtheThriftnetworkAPI. SeeStartingtheMetastoreonpage34fordetailsaboutstartingtheHivemetastoreservice. HiveServer2 HiveServer2isaserverinterfacethatenablesremoteclientstosubmitqueriestoHiveandretrievetheresults.It replacesHiveServer1,whichhasbeendeprecatedandwillberemovedinafuturereleaseofCDH.HiveServer2supports multi-clientconcurrency,capacityplanningcontrols,Sentryauthorization,Kerberosauthentication,LDAP,SSL,and providesbettersupportforJDBCandODBCclients. HiveServer2isacontainerfortheHiveexecutionengine.Foreachclientconnection,itcreatesanewexecutioncontext thatservesHiveSQLrequestsfromtheclient.ItsupportsJDBCclients,suchastheBeelineCLI,andODBCclients.Clients connecttoHiveServer2throughtheThriftAPI-basedHiveservice. SeeConfiguringHiveServer2onpage32fordetailsonconfiguringHiveServer2andseeStarting,Stopping,andUsing HiveServer2onpage35fordetailsonstarting/stoppingtheHiveServer2serviceandinformationaboutusingthe BeelineCLItoconnecttoHiveServer2.FordetailsaboutmanagingHiveServer2withitsnativewebuserinterface(UI), seeHiveServer2WebUIonpage45. Hive on Spark HivetraditionallyusesMapReducebehindthescenestoparallelizethework,andperformthelow-levelstepsof processingaSQLstatementsuchassortingandfiltering.HivecanalsouseSparkastheunderlyingcomputationand parallelizationengine.SeeRunningHiveonSparkonpage43fordetailsaboutconfiguringHivetouseSparkasits executionengineandseeTuningHiveonSparkonpage59fordetailsabouttuningHiveonSpark. Hive and HBase ApacheHBaseisaNoSQLdatabasethatsupportsreal-timeread/writeaccesstolargedatasetsinHDFS.SeeUsingHive withHBaseonpage36fordetailsaboutconfiguringHivetouseHBase.ForinformationaboutrunningHivequeries onasecureHBaseserver,seeUsingHivetoRunQueriesonaSecureHBaseServeronpage87. Transaction (ACID) Support in Hive HiveinCDHdoesnotsupporttransactions(HIVE-5317).Currently,transactionsupportinHiveisanexperimental featurethatonlyworkswiththeORCfileformat.ClouderarecommendsusingtheParquetfileformat,whichworks acrossmanytools.MergeupdatesinHivetablesusingexistingfunctionality,includingstatementssuchasINSERT, INSERT OVERWRITE,andCREATE TABLE AS SELECT. ApacheHiveGuide|9 UsingApacheHivewithCDH Managing Hive ClouderarecommendsusingClouderaManagertomanageHiveservices,whicharecalledmanageddeployments.If yoursisnotamanageddeployment,configureHiveServer2WebUItomanageHiveservices. UsingClouderaManagertoManageHive ClouderaManagerusestheHivemetastore,HiveServer2,andtheWebHCatrolestomanagetheHiveserviceacross yourcluster.UsingClouderaManager,youcanconfiguretheHivemetastore,theexecutionengine(eitherMapReduce orSpark),andmanageHiveServer2. SeeManagingHiveUsingClouderaManageronpage42 UsingHiveServer2WebUItoManageHive TheHiveServer2webUIprovidesaccesstoHiveconfigurationsettings,locallogs,metrics,andinformationaboutactive sessionsandqueries.TheHiveServer2webUIisenabledinnewlycreatedclustersrunningCDH5.7andhigher,and thoseusingKerberosareconfiguredforSPNEGO.ClustersupgradedfromapreviousCDHversionmustbeconfigured toenablethewebUI;seeHiveServer2WebUIConfigurationonpage45. Ingesting Data with Hive Hivecaningestdataintoseveraldifferentfileformats,suchasParquet,Avro,TEXTFILE,orRCFile.Ifyouaresettingup adatapipelinewhereApacheImpalaisinvolvedonthequeryside,useParquet.Ifacustomfileformatisrequired, youcanextendtheHiveSerDes.SeetheApacheHivewikiforinformationabouttheHiveSerDesandhowtowrite yourownforHive. Important: Theconfigurationpropertyserialization.null.formatissetinHiveandImpalaenginesas SerDesortablepropertiestospecifyhowtoserialize/deserializeNULLvaluesintoastorageformat. Thisconfigurationoptionissuitablefortextfileformatsonly.Ifusedwithbinarystorageformatssuch asRCFileorParquet,theoptioncausescompatibility,complexityandefficiencyissues. ColumnandTableStatisticsforQueryOptimization StatisticsforHivecanbenumbersofrowsoftablesorpartitionsandthehistogramsofinterestingcolumns.Statistics areusedbythecostfunctionsofthequeryoptimizertogeneratequeryplansforthepurposeofqueryoptimization. SeeHiveTableStatisticsonpage46fordetailsaboutcollectingstatisticsforHive. Tuning TuningHiveconsistsofconfiguringnumerousHiveparametersforbetterperformanceandscalability.Themost importantamongthesesettingsisconfiguringsufficientmemoryforHiveServer2andtheHivemetastore.Thisincludes allocatingmemoryforheapsizebaseduponthenumberofconcurrentconnectionsthataretypicalforyourdeployment. Configuringgarbagecollectionlimitsandkeepingthenumberoftablepartitionsbelowrecommendedlimitsarealso importantwhentuningHiveperformance.SeeTuningHiveonpage50fordetailsaboutrecommendedlimitsandbest practices.IfyouareusingSparkasyourexecutionengine,seeTuningHiveonSparkonpage59. 10|ApacheHiveGuide

Description:
Apache Hive Guide .. Upgrading Hive from CDH 4 to CDH 5. Upgrading Hive from a Lower Version of CDH 5.
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.