Apache Hive Guide ImportantNotice ©2010-2021Cloudera,Inc.Allrightsreserved. Cloudera,theClouderalogo,andanyotherproductor servicenamesorsloganscontainedinthisdocumentaretrademarksofClouderaand itssuppliersorlicensors,andmaynotbecopied,imitatedorused,inwholeorinpart, withoutthepriorwrittenpermissionofClouderaortheapplicabletrademarkholder.If thisdocumentationincludescode,includingbutnotlimitedto,codeexamples,Cloudera makesthisavailabletoyouunderthetermsoftheApacheLicense,Version2.0,including anyrequirednotices.AcopyoftheApacheLicenseVersion2.0,includinganynotices, isincludedherein.AcopyoftheApacheLicenseVersion2.0canalsobefoundhere: https://opensource.org/licenses/Apache-2.0 HadoopandtheHadoopelephantlogoaretrademarksoftheApacheSoftware Foundation.Allothertrademarks,registeredtrademarks,productnamesandcompany namesorlogosmentionedinthisdocumentarethepropertyoftheirrespectiveowners. Referencetoanyproducts,services,processesorotherinformation,bytradename, trademark,manufacturer,supplierorotherwisedoesnotconstituteorimply endorsement,sponsorshiporrecommendationthereofbyus. Complyingwithallapplicablecopyrightlawsistheresponsibilityoftheuser.Without limitingtherightsundercopyright,nopartofthisdocumentmaybereproduced,stored inorintroducedintoaretrievalsystem,ortransmittedinanyformorbyanymeans (electronic,mechanical,photocopying,recording,orotherwise),orforanypurpose, withouttheexpresswrittenpermissionofCloudera. Clouderamayhavepatents,patentapplications,trademarks,copyrights,orother intellectualpropertyrightscoveringsubjectmatterinthisdocument.Exceptasexpressly providedinanywrittenlicenseagreementfromCloudera,thefurnishingofthisdocument doesnotgiveyouanylicensetothesepatents,trademarkscopyrights,orother intellectualproperty.ForinformationaboutpatentscoveringClouderaproducts,see http://tiny.cloudera.com/patents. Theinformationinthisdocumentissubjecttochangewithoutnotice.Clouderashall notbeliableforanydamagesresultingfromtechnicalerrorsoromissionswhichmay bepresentinthisdocument,orfromuseofthisdocument. Cloudera,Inc. 395PageMillRoad PaloAlto,CA94306 [email protected] US:1-888-789-1488 Intl:1-650-362-0488 www.cloudera.com ReleaseInformation Version:ClouderaEnterprise5.8.x Date:February3,2021 Table of Contents Using Apache Hive with CDH....................................................................................7 UseCasesforHive................................................................................................................................................7 Installation............................................................................................................................................................8 Upgrading.............................................................................................................................................................8 Configuration.......................................................................................................................................................8 The Metastore Database......................................................................................................................................8 HiveServer2..........................................................................................................................................................9 Hive on Spark.......................................................................................................................................................9 Hive and HBase....................................................................................................................................................9 Transaction(ACID)SupportinHive......................................................................................................................9 Managing Hive...................................................................................................................................................10 Ingesting Data with Hive....................................................................................................................................10 Tuning.................................................................................................................................................................10 HighAvailability..................................................................................................................................................11 Replication..........................................................................................................................................................11 Security..............................................................................................................................................................11 Troubleshooting.................................................................................................................................................12 Upstream Information for Hive..........................................................................................................................12 Hive Installation.....................................................................................................13 AboutHive..........................................................................................................................................................13 HiveServer2..........................................................................................................................................................................13 Upgrading Hive...................................................................................................................................................13 ChecklisttoHelpEnsureSmoothUpgrades.........................................................................................................................13 UpgradingHivefromaLowerVersionofCDH5..................................................................................................................14 Installing Hive.....................................................................................................................................................17 HeapSizeandGarbageCollectionforHiveComponents.....................................................................................................17 Configuration for WebHCat.................................................................................................................................................19 ConfiguringtheHiveMetastore.........................................................................................................................19 Metastore Deployment Modes............................................................................................................................................19 Supported Metastore Databases.........................................................................................................................................21 MetastoreMemoryRequirements.......................................................................................................................................22 Configuringthe Metastore Database..................................................................................................................................22 Configuring HiveServer2....................................................................................................................................32 HiveServer2 Memory Requirements....................................................................................................................................32 TableLockManager(Required)...........................................................................................................................................32 hive.zookeeper.client.port.....................................................................................................................................33 JDBC driver...........................................................................................................................................................................33 Authentication.....................................................................................................................................................................33 RunningHiveServer2andHiveServerConcurrently.............................................................................................................33 Starting the Metastore.......................................................................................................................................34 File System Permissions.....................................................................................................................................34 Starting,Stopping,andUsingHiveServer2.........................................................................................................35 UsingtheBeelineCLI............................................................................................................................................................35 StartingHiveServer1andtheHiveConsole........................................................................................................36 Using Hive with HBase.......................................................................................................................................36 UsingtheHiveSchemaTool...............................................................................................................................37 Schema Version Verification................................................................................................................................................37 Using schematool................................................................................................................................................................37 InstallingtheHiveJDBCDriveronClients..........................................................................................................39 Setting HADOOP_MAPRED_HOME....................................................................................................................40 ConfiguringtheMetastoretoUseHDFSHighAvailability..................................................................................40 Viewing the Hive Documentation......................................................................................................................40 Managing Hive.......................................................................................................41 Hive Roles...........................................................................................................................................................41 Hive Execution Engines......................................................................................................................................41 ManagingHiveUsingClouderaManager...........................................................................................................42 Running Hive on Spark.......................................................................................................................................43 Configuring Hive on Spark...................................................................................................................................................43 TroubleshootingHiveonSpark............................................................................................................................................43 HiveServer2 Web UI...........................................................................................................................................45 AccessingtheHiveServer2WebUI.......................................................................................................................................45 HiveServer2WebUIConfiguration......................................................................................................................................45 HiveTableStatistics............................................................................................................................................46 ManagingUser-DefinedFunctions(UDFs)withHiveServer2.............................................................................46 User-DefinedFunctions(UDFs)withHiveServer2UsingClouderaManager.......................................................................46 User-DefinedFunctions(UDFs)withHiveServer2UsingtheCommandLine.......................................................................48 Tuning Hive............................................................................................................50 HeapSizeandGarbageCollectionforHiveComponents...................................................................................50 MemoryRecommendations.................................................................................................................................................50 ConfiguringHeapSizeandGarbageCollection...................................................................................................................51 HiveServer2PerformanceTuningandTroubleshooting.....................................................................................52 SymptomsDisplayedWhenHiveServer2HeapMemoryisFull............................................................................................52 HiveServer2PerformanceBestPractices.............................................................................................................................55 Tuning Hive on Spark..............................................................................................59 YARN Configuration............................................................................................................................................59 SparkConfiguration............................................................................................................................................59 Hive Configuration.............................................................................................................................................61 Pre-warmingYARNContainers.............................................................................................................................................62 Hive Metastore High Availability............................................................................63 EnablingHiveMetastoreHighAvailabilityUsingClouderaManager.................................................................63 EnablingHiveMetastoreHighAvailabilityUsingtheCommandLine................................................................63 Configuring HiveServer2 High Availability in CDH...................................................66 EnablingHiveServer2HighAvailabilityUsingClouderaManager......................................................................66 ConfiguringHiveServer2toLoadBalanceBehindaProxyonUnmanagedClusters..........................................66 UnmanagedClusterswithKerberosEnabled.......................................................................................................................67 UnmanagedClustersWITHOUTKerberos............................................................................................................................68 Hive Replication.....................................................................................................71 NetworkLatencyandReplication.......................................................................................................................71 HostSelectionforHiveReplication....................................................................................................................71 HiveTablesandDDLCommands........................................................................................................................71 PerformanceandScalabilityLimitations............................................................................................................72 HiveReplicationinDynamicEnvironments........................................................................................................72 Configuring Replication of Hive Data.................................................................................................................72 ViewingReplicationSchedules...........................................................................................................................75 Enabling,Disabling,orDeletingAReplicationSchedule......................................................................................................77 Viewing Replication History...............................................................................................................................77 Hive Authentication...............................................................................................80 HiveServer2 Security Configuration...................................................................................................................80 EnablingKerberosAuthenticationforHiveServer2..............................................................................................................80 UsingLDAPUsername/PasswordAuthenticationwithHiveServer2....................................................................................81 ConfiguringLDAPSAuthenticationwithHiveServer2..........................................................................................................83 Pluggable Authentication....................................................................................................................................................83 TrustedDelegationwithHiveServer2...................................................................................................................................84 HiveServer2Impersonation..................................................................................................................................................84 SecuringtheHiveMetastore................................................................................................................................................85 DisablingtheHiveSecurityConfiguration...........................................................................................................................85 HiveMetastoreServerSecurityConfiguration...................................................................................................86 UsingHivetoRunQueriesonaSecureHBaseServer........................................................................................87 Configuring Encrypted Communication Between HiveServer2 and Client Drivers....88 ConfiguringEncryptedClient/ServerCommunicationUsingTLS/SSL................................................................88 UsingClouderaManager.....................................................................................................................................................88 UsingtheCommandLine.....................................................................................................................................................89 ConfiguringEncryptedClient/ServerCommunicationUsingSASLQOP.............................................................89 Hive SQL Syntax for Use with Sentry.......................................................................91 Column-level Authorization...............................................................................................................................91 CREATEROLEStatement.....................................................................................................................................92 DROP ROLE Statement.......................................................................................................................................92 GRANT ROLE Statement.....................................................................................................................................93 REVOKEROLEStatement....................................................................................................................................93 GRANT<Privilege>Statement............................................................................................................................93 GRANT<Privilege>ONURIs(HDFSandS3A).....................................................................................................93 REVOKE <Privilege> Statement..........................................................................................................................94 GRANT<Privilege>...WITHGRANTOPTION......................................................................................................94 SETROLEStatement...........................................................................................................................................95 SHOW Statement...............................................................................................................................................95 Example:UsingGrant/RevokeStatementstoMatchanExistingPolicyFile......................................................96 Troubleshooting Hive.............................................................................................98 HiveServer2PerformanceTuningandTroubleshooting.....................................................................................98 SymptomsDisplayedWhenHiveServer2HeapMemoryisFull............................................................................................98 HiveServer2PerformanceBestPractices...........................................................................................................................101 Appendix: Apache License, Version 2.0.................................................................105 UsingApacheHivewithCDH Using Apache Hive with CDH Hivedatawarehousesoftwareenablesreading,writing,andmanaginglargedatasetsindistributedstorage.Usingthe Hivequerylanguage(HiveQL),whichisverysimilartoSQL,queriesareconvertedintoaseriesofjobsthatexecuteon aHadoopclusterthroughMapReduceorApacheSpark. UserscanrunbatchprocessingworkloadswithHivewhilealsoanalyzingthesamedataforinteractiveSQLor machine-learningworkloadsusingtoolslikeApacheImpalaorApacheSpark—allwithinasingleplatform. AspartofCDH,Hivealsobenefitsfrom: • UnifiedresourcemanagementprovidedbyYARN • SimplifieddeploymentandadministrationprovidedbyClouderaManager • SharedsecurityandgovernancetomeetcompliancerequirementsprovidedbyApacheSentryandCloudera Navigator Continuereading: • UseCasesforHive • Installation • Upgrading • Configuration • TheMetastoreDatabase • HiveServer2 • HiveonSpark • HiveandHBase • Transaction(ACID)SupportinHive • ManagingHive • IngestingDatawithHive • Tuning • HighAvailability • Replication • Security • Troubleshooting • UpstreamInformationforHive Use Cases for Hive BecauseHiveisapetabyte-scaledatawarehousesystembuiltontheHadoopplatform,itisagoodchoicefor environmentsexperiencingphenomenalgrowthindatavolume.TheunderlyingMapReduceinterfacewithHDFSis hardtoprogramdirectly,butHiveprovidesanSQLinterface,makingitpossibletouseexistingprogrammingskillsto performdatapreparation. HiveonMapReduceorSparkisbest-suitedforbatchdatapreparationorETL: • YoumustrunscheduledbatchjobswithverylargeETLsortswithjoinstopreparedataforHadoop.Mostdata servedtoBIusersinImpalaispreparedbyETLdevelopersusingHive. • Yourundatatransferorconversionjobsthattakemanyhours.WithHive,ifaproblemoccurspartwaythrough suchajob,itrecoversandcontinues. • Youreceiveorprovidedataindiverseformats,wheretheHiveSerDesandvarietyofUDFsmakeitconvenientto ingestandconvertthedata.Typically,thefinalstageoftheETLprocesswithHivemightbetoahigh-performance, widelysupportedformatsuchasParquet. ApacheHiveGuide|7 UsingApacheHivewithCDH Installation OnaclustermanagedbyClouderaManager,HivecomesalongwiththebaseCDHinstallationanddoesnotneedto beinstalledseparately.WithClouderaManager,youcanenableordisabletheHiveservice,buttheHivecomponent alwaysremainspresentonthecluster. Onanunmanagedcluster,youcaninstallHivemanually,usingpackagesortarballswiththeappropriatecommandfor youroperatingsystem. InstalltheappropriateHivepackagesusingtheappropriatecommandforyourdistribution. OS Command RHEL-compatible $ sudo yum install <pkg1> <pkg2> ... SLES $ sudo zypper install <pkg1> <pkg2> ... UbuntuorDebian $ sudo apt-get install <pkg1> <pkg2> ... Thepackagesare: • hive–basepackagethatprovidesthecompletelanguageandruntime • hive-metastore–providesscriptsforrunningthemetastoreasastandaloneservice(optional) • hive-server2–providesscriptsforrunningHiveServer2 • hive-hbase-optional;installthispackageifyouwanttouseHivewithHBase. SeeInstallingHiveonpage17fordetailsaboutinstallingandconfiguringHivecomponents. ToaccesstheHiveserverwithJDBCclients,suchasBeeline,installtheJDBCdriverforHiveServer2thatisdefinedin org.apache.hive.jdbc.HiveDriver. SeeInstallingtheHiveJDBCDriveronClientsonpage39fordetailsaboutinstallingtheJDBCdriversandtheconnection URLstousetoconnecttoHiveServer2fromHiveclients. Upgrading UpgradeHiveonallthehostsonwhichitisrunningincludingbothserversandclients. SeeUpgradingHiveonpage13fordetailsaboutdeprecatedversions,upgradingbestpractices,andinformationabout upgradingtheHivemetastoreschema. Configuration Hiveoffersanumberofconfigurationsettingsrelatedtoperformance,filelayoutandhandling,andoptionstocontrol SQLsemantics.Dependingonyourclustersizeandworkloads,configureHiveServer2memory,tablelockingbehavior, andauthenticationforconnections.SeeConfiguringHiveServer2onpage32fordetailsaboutrequiredconfiguration changesthatyoumustperform. TheHivemetastoreservice,whichstoresthemetadataforHivetablesandpartitions,mustalsobeconfigured.See ConfiguringtheHiveMetastoreonpage19fordetailsaboutdeploymentmodes,informationaboutsupported metastoredatabases,andspecificconfigurationsforMySQL,PostgreSQL,andOracle. The Metastore Database ThemetastoredatabaseisanimportantaspectoftheHiveinfrastructure.Itisaseparatedatabase,relyingona traditionalRDBMSsuchasMySQLorPostgreSQL,thatholdsmetadataaboutHivedatabases,tables,columns,partitions, andHadoop-specificinformationsuchastheunderlyingdatafilesandHDFSblocklocations. 8|ApacheHiveGuide UsingApacheHivewithCDH Themetastoredatabaseissharedbyothercomponents.Forexample,thesametablescanbeinsertedinto,queried, altered,andsoonbybothHiveandImpala.Althoughyoumightseereferencestothe“Hivemetastore”,beaware thatthemetastoredatabaseisusedbroadlyacrosstheHadoopecosystem,evenincaseswhereyouarenotusingHive itself. Themetastoredatabaseisrelativelycompact,withfast-changingdata.Backup,replication,andotherkindsof managementoperationsaffectthisdatabase.SeeConfiguringtheHiveMetastoreonpage19fordetailsabout configuringtheHivemetastore. ClouderarecommendsthatyoudeploytheHivemetastore,whichstoresthemetadataforHivetablesandpartitions, in“remotemode.”InthismodethemetastoreservicerunsinitsownJVMprocessandotherservices,suchas HiveServer2,HCatalog,andApacheImpalacommunicatewiththemetastoreusingtheThriftnetworkAPI. SeeStartingtheMetastoreonpage34fordetailsaboutstartingtheHivemetastoreservice. HiveServer2 HiveServer2isaserverinterfacethatenablesremoteclientstosubmitqueriestoHiveandretrievetheresults.It replacesHiveServer1,whichhasbeendeprecatedandwillberemovedinafuturereleaseofCDH.HiveServer2supports multi-clientconcurrency,capacityplanningcontrols,Sentryauthorization,Kerberosauthentication,LDAP,SSL,and providesbettersupportforJDBCandODBCclients. HiveServer2isacontainerfortheHiveexecutionengine.Foreachclientconnection,itcreatesanewexecutioncontext thatservesHiveSQLrequestsfromtheclient.ItsupportsJDBCclients,suchastheBeelineCLI,andODBCclients.Clients connecttoHiveServer2throughtheThriftAPI-basedHiveservice. SeeConfiguringHiveServer2onpage32fordetailsonconfiguringHiveServer2andseeStarting,Stopping,andUsing HiveServer2onpage35fordetailsonstarting/stoppingtheHiveServer2serviceandinformationaboutusingthe BeelineCLItoconnecttoHiveServer2.FordetailsaboutmanagingHiveServer2withitsnativewebuserinterface(UI), seeHiveServer2WebUIonpage45. Hive on Spark HivetraditionallyusesMapReducebehindthescenestoparallelizethework,andperformthelow-levelstepsof processingaSQLstatementsuchassortingandfiltering.HivecanalsouseSparkastheunderlyingcomputationand parallelizationengine.SeeRunningHiveonSparkonpage43fordetailsaboutconfiguringHivetouseSparkasits executionengineandseeTuningHiveonSparkonpage59fordetailsabouttuningHiveonSpark. Hive and HBase ApacheHBaseisaNoSQLdatabasethatsupportsreal-timeread/writeaccesstolargedatasetsinHDFS.SeeUsingHive withHBaseonpage36fordetailsaboutconfiguringHivetouseHBase.ForinformationaboutrunningHivequeries onasecureHBaseserver,seeUsingHivetoRunQueriesonaSecureHBaseServeronpage87. Transaction (ACID) Support in Hive HiveinCDHdoesnotsupporttransactions(HIVE-5317).Currently,transactionsupportinHiveisanexperimental featurethatonlyworkswiththeORCfileformat.ClouderarecommendsusingtheParquetfileformat,whichworks acrossmanytools.MergeupdatesinHivetablesusingexistingfunctionality,includingstatementssuchasINSERT, INSERT OVERWRITE,andCREATE TABLE AS SELECT. ApacheHiveGuide|9 UsingApacheHivewithCDH Managing Hive ClouderarecommendsusingClouderaManagertomanageHiveservices,whicharecalledmanageddeployments.If yoursisnotamanageddeployment,configureHiveServer2WebUItomanageHiveservices. UsingClouderaManagertoManageHive ClouderaManagerusestheHivemetastore,HiveServer2,andtheWebHCatrolestomanagetheHiveserviceacross yourcluster.UsingClouderaManager,youcanconfiguretheHivemetastore,theexecutionengine(eitherMapReduce orSpark),andmanageHiveServer2. SeeManagingHiveUsingClouderaManageronpage42 UsingHiveServer2WebUItoManageHive TheHiveServer2webUIprovidesaccesstoHiveconfigurationsettings,locallogs,metrics,andinformationaboutactive sessionsandqueries.TheHiveServer2webUIisenabledinnewlycreatedclustersrunningCDH5.7andhigher,and thoseusingKerberosareconfiguredforSPNEGO.ClustersupgradedfromapreviousCDHversionmustbeconfigured toenablethewebUI;seeHiveServer2WebUIConfigurationonpage45. Ingesting Data with Hive Hivecaningestdataintoseveraldifferentfileformats,suchasParquet,Avro,TEXTFILE,orRCFile.Ifyouaresettingup adatapipelinewhereApacheImpalaisinvolvedonthequeryside,useParquet.Ifacustomfileformatisrequired, youcanextendtheHiveSerDes.SeetheApacheHivewikiforinformationabouttheHiveSerDesandhowtowrite yourownforHive. Important: Theconfigurationpropertyserialization.null.formatissetinHiveandImpalaenginesas SerDesortablepropertiestospecifyhowtoserialize/deserializeNULLvaluesintoastorageformat. Thisconfigurationoptionissuitablefortextfileformatsonly.Ifusedwithbinarystorageformatssuch asRCFileorParquet,theoptioncausescompatibility,complexityandefficiencyissues. ColumnandTableStatisticsforQueryOptimization StatisticsforHivecanbenumbersofrowsoftablesorpartitionsandthehistogramsofinterestingcolumns.Statistics areusedbythecostfunctionsofthequeryoptimizertogeneratequeryplansforthepurposeofqueryoptimization. SeeHiveTableStatisticsonpage46fordetailsaboutcollectingstatisticsforHive. Tuning TuningHiveconsistsofconfiguringnumerousHiveparametersforbetterperformanceandscalability.Themost importantamongthesesettingsisconfiguringsufficientmemoryforHiveServer2andtheHivemetastore.Thisincludes allocatingmemoryforheapsizebaseduponthenumberofconcurrentconnectionsthataretypicalforyourdeployment. Configuringgarbagecollectionlimitsandkeepingthenumberoftablepartitionsbelowrecommendedlimitsarealso importantwhentuningHiveperformance.SeeTuningHiveonpage50fordetailsaboutrecommendedlimitsandbest practices.IfyouareusingSparkasyourexecutionengine,seeTuningHiveonSparkonpage59. 10|ApacheHiveGuide
Description: