IBM LoadLeveler Version 5 Release 1
Using and Administering
SC23-6792-04

Note
Before using this information and the product it supports, read the information in “Notices” on page 423.

This edition applies to version 5, release 1, modification 0 of IBM LoadLeveler (product numbers 5725-G01, 5641-LL1, 5641-LL3, 5765-L50, and 5765-LLP) and to all subsequent releases and modifications until otherwise indicated in new editions. This edition replaces SC23-6792-03.

© Copyright 1986, 1987, 1988, 1989, 1990, 1991 by the Condor Design Team.
© Copyright IBM Corporation 1986, 2012.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

Figures ... vii

Tables ... ix

About this information ... xi
  Who should use this information ... xi
  Conventions and terminology used in this information ... xi
  Prerequisite and related information ... xii
  How to send your comments ... xiii

Summary of changes ... xv

Part 1. Overview of LoadLeveler concepts and operation ... 1

Chapter 1. What is LoadLeveler? ... 3
  LoadLeveler basics ... 4
  LoadLeveler: A network job management and scheduling system ... 4
  Job definition ... 5
  Machine definition ... 5
  How LoadLeveler schedules jobs ... 7
  How LoadLeveler daemons process jobs ... 8
  The master daemon ... 9
  The Schedd daemon ... 10
  The startd daemon ... 12
  The region manager daemon ... 14
  The resource manager daemon ... 15
  The kbdd daemon ... 15
  The negotiator daemon ... 15
  The LoadLeveler job cycle ... 16
  LoadLeveler job states ... 19
  Consumable resources ... 22
  Consumable resources and Workload Manager ... 23
  Overview of reservations ... 24
  Fair share scheduling overview ... 27

Chapter 2. Getting a quick start using the default configuration ... 29
  What you need to know before you begin ... 29
  Using the default configuration files ... 29
  LoadLeveler for Linux quick start ... 30
  Quick installation ... 30
  Quick configuration ... 31
  Quick verification ... 31
  Post-installation considerations ... 32
  Starting LoadLeveler ... 32
  Directory considerations ... 33

Chapter 3. What operating systems are supported by LoadLeveler? ... 35
  LoadLeveler for AIX and LoadLeveler for Linux compatibility ... 35
  Restrictions for LoadLeveler for Linux ... 36
  Features not supported in LoadLeveler for Linux ... 36
  Restrictions for LoadLeveler for AIX and LoadLeveler for Linux mixed clusters ... 36

Part 2. Configuring and managing the LoadLeveler environment ... 37

Chapter 4. Configuring the LoadLeveler environment ... 39
  The master configuration file ... 40
  Setting the LoadLeveler user ... 40
  Setting the configuration source ... 41
  Overriding the shared memory key ... 41
  File-based configuration ... 42
  Database configuration option ... 43
  Understanding remotely configured nodes ... 43
  Using the configuration editor ... 44
  Modifying configuration data ... 45
  Defining LoadLeveler administrators ... 45
  Defining a LoadLeveler cluster ... 45
  Defining LoadLeveler machine characteristics ... 59
  Defining security mechanisms ... 60
  Defining usage policies for consumable resources ... 65
  Gathering job accounting data ... 65
  Managing job status through control expressions ... 72
  Tracking job processes ... 73
  Querying multiple LoadLeveler clusters ... 74
  Handling switch-table errors ... 75
  Providing additional job-processing controls through installation exits ... 75

Chapter 5. Defining LoadLeveler resources to administer ... 89
  Defining machines ... 89
  Planning considerations for defining machines ... 90
  Machine_group stanza format and keyword summary ... 90
  Machine substanza format and keyword summary ... 91
  Machine stanza format and keyword summary ... 91
  Default values for machine_group and machine stanzas ... 92
  Examples of machine_group and machine stanzas ... 92
  Dynamic adapter discovery ... 93
  LoadLeveler adapter and node status monitoring ... 94
  Defining classes ... 94
  Using limit keywords ... 94
  Allowing users to use a class ... 97
  Class stanza format and keyword summary ... 97
  Examples: Class stanzas ... 98
  Defining user substanzas in class stanzas ... 99
  Examples: Substanzas ... 99
  Defining users ... 102
  User stanza format and keyword summary ... 102
  Examples: User stanzas ... 102
  Defining groups ... 103
  Group stanza format and keyword summary ... 104
  Examples: Group stanzas ... 104
  Defining clusters ... 104
  Cluster stanza format and keyword summary ... 104
  Examples: Cluster stanzas ... 105
  Defining regions ... 106
  Region stanza format and keyword summary ... 106
  Examples: Region stanzas ... 106

Chapter 6. Performing additional administrator tasks ... 109
  Setting up the environment for parallel jobs ... 110
  Scheduling considerations for parallel jobs ... 110
  Steps for reducing job launch overhead for parallel jobs ... 111
  Steps for allowing users to submit interactive POE jobs ... 112
  Setting up a class for parallel jobs ... 112
  Striping when some networks fail ... 113
  Setting up a parallel master node ... 113
  Using the BACKFILL scheduler ... 114
  Tips for using the BACKFILL scheduler ... 116
  Example: BACKFILL scheduling ... 117
  Data staging ... 117
  Configuring LoadLeveler to support data staging ... 118
  Using an external scheduler ... 119
  Replacing the default LoadLeveler scheduling algorithm with an external scheduler ... 120
  Customizing the configuration file to define an external scheduler ... 121
  Example: Retrieving specific information ... 122
  Example: Changing scheduler types ... 122
  Preempting and resuming jobs ... 122
  Overview of preemption ... 123
  Planning to preempt jobs ... 124
  Steps for configuring a scheduler to preempt jobs ... 126
  Configuring LoadLeveler to support reservations ... 127
  Steps for configuring reservations in a LoadLeveler cluster ... 128
  Steps for integrating LoadLeveler with the Workload Manager ... 133
  LoadLeveler support for checkpointing jobs ... 135
  Checkpoint keyword summary ... 136
  Planning considerations for checkpointing jobs ... 136
  Additional planning considerations for checkpointing MetaCluster HPC jobs on AIX ... 138
  Checkpoint and restart limitations ... 138
  Submitting a MetaCluster HPC checkpoint job to LoadLeveler ... 138
  job_1.cmd - A checkpointable job command file ... 138
  Using the llckpt command to checkpoint a job step ... 139
  Restarting a job step from a checkpoint ... 140
  Making periodic checkpoints ... 142
  Using the ckpt_dir and ckpt_subdir keywords ... 143
  Removing old checkpoint files ... 144
  Using the ckpt_execute_dir keyword ... 144
  Initiating a checkpoint using the ll_ckpt() API ... 146
  LoadLeveler scheduling affinity support ... 147
  Configuring LoadLeveler to use scheduling affinity ... 148
  LoadLeveler multicluster support ... 149
  Configuring a LoadLeveler multicluster ... 150
  LoadLeveler Blue Gene support ... 153
  Configuring LoadLeveler Blue Gene support ... 155
  Blue Gene reservation support ... 157
  Blue Gene fair share scheduling support ... 157
  Blue Gene heterogeneous memory support ... 157
  Blue Gene preemption support ... 157
  Using fair share scheduling ... 158
  Fair share scheduling keywords ... 158
  Reconfiguring fair share scheduling keywords ... 161
  Example: three groups share a LoadLeveler cluster ... 161
  Example: two thousand students share a LoadLeveler cluster ... 162
  Querying information about fair share scheduling ... 163
  Resetting fair share scheduling ... 163
  Saving historic data ... 163
  Restoring saved historic data ... 164
  Procedure for recovering a job spool ... 164
  Configuring and using island scheduling ... 165
  Energy aware job support ... 166
  S3 state support ... 166

Part 3. Submitting and managing LoadLeveler jobs ... 169

Chapter 7. Building and submitting jobs ... 171
  Building a job command file ... 171
  Using multiple steps in a job command file ... 172
  Examples: Job command files ... 173
  Editing job command files ... 176
  Defining resources for a job step ... 177
  Submitting jobs requesting data staging ... 177
  Working with coscheduled job steps ... 178
  Submitting coscheduled job steps ... 178
  Determining priority for coscheduled job steps ... 178
  Supporting preemption of coscheduled job steps ... 179
  Coscheduled job steps and commands and APIs ... 179
  Termination of coscheduled steps ... 179
  Using bulk data transfer ... 180
  Preparing a job for checkpoint/restart ... 180
  Preparing a job for preemption ... 183
  Submitting a job command file ... 183
  Job state monitoring ... 184
  Submitting a job using a submit-only machine ... 184
  Working with parallel jobs ... 184
  Step for controlling whether LoadLeveler copies environment variables to all executing nodes ... 185
  Ensuring that parallel jobs in a cluster run on the correct levels of PE and LoadLeveler software ... 185
  Task-assignment considerations ... 186
  Submitting jobs that use striping ... 188
  Running interactive POE jobs ... 193
  Debugging interfaces between POE and LoadLeveler ... 194
  Running MPICH2 ... 194
  Running OpenMPI ... 195
  Running Intel MPI jobs ... 196
  Running embarrassingly parallel jobs ... 196
  Examples: Building parallel job command files ... 197
  Obtaining status of parallel jobs ... 201
  Obtaining allocated host names ... 201
  Building and submitting MPICH2 and serial interactive jobs ... 202
  Working with reservations ... 203
  Types of reservations ... 203
  Understanding the flexible job step ... 203
  Understanding the reservation life cycle ... 205
  Creating new reservations ... 207
  Submitting jobs to run under a reservation ... 210
  Removing bound jobs from the reservation ... 212
  Querying existing reservations ... 213
  Modifying existing reservations ... 213
  Canceling existing reservations ... 215
  Reservations with floating resources ... 215
  Submitting jobs requesting scheduling affinity ... 217
  Submitting and monitoring jobs in a LoadLeveler multicluster ... 218
  Steps for submitting jobs in a LoadLeveler multicluster environment ... 219
  Working with energy aware jobs ... 220
  Submitting and monitoring Blue Gene jobs ... 221

Chapter 8. Managing submitted jobs ... 223
  Querying the status of a job ... 223
  Working with machines ... 223
  Displaying currently available resources ... 224
  Setting and changing the priority of a job ... 224
  Example: How does a job's priority affect dispatching order? ... 225
  Placing and releasing a hold on a job ... 225
  Canceling a job ... 226
  Checkpointing a job ... 226

Chapter 9. Example: Using commands to build, submit, and manage jobs ... 227

Part 4. LoadLeveler interfaces reference ... 229

Chapter 10. Configuration keyword reference ... 231
  Configuration keyword syntax ... 231
  Numerical and alphabetical constants ... 232
  Mathematical operators ... 232
  64-bit support for configuration file keywords and expressions ... 232
  Configuration keyword descriptions ... 233
  User-defined keywords ... 284
  LoadLeveler variables ... 286
  Variables to use for setting dates ... 291
  Variables to use for setting times ... 291

Chapter 11. Administration keyword reference ... 293
  Administration file structure and syntax ... 293
  Stanza characteristics ... 295
  Syntax for limit keywords ... 295
  64-bit support for administration file keywords ... 297
  Administration keyword descriptions ... 298

Chapter 12. Job command file reference ... 333
  Job command file syntax ... 333
  Serial job command file ... 333
  Parallel job command file ... 334
  Syntax for limit keywords ... 334
  64-bit support for job command file keywords ... 334
  Job command file keyword descriptions ... 335
  Job command file variables ... 383
  Run-time environment variables ... 384
  Job command file examples ... 386

Part 5. Appendixes ... 389

Appendix A. Troubleshooting LoadLeveler ... 391
  Frequently asked questions ... 391
  Why won't LoadLeveler start? ... 392
  Why won't my job run? ... 392
  Why won't my parallel job run? ... 395
  Why won't my checkpointed job restart? ... 396
  Why won't my submit-only job run? ... 397
  Why does a job stay in the Pending (or Starting) state? ... 397
  What happens to running jobs when a machine goes down? ... 397
  Why does llstatus indicate that a machine is down when llq indicates a job is running on the machine? ... 398
  Why won't my job run on a cluster with both AIX and Linux machines? ... 399
  Why won't my jobs run that were directed to an idle pool? ... 399
  What happens if the central manager isn't operating? ... 399
  How do I recover resources allocated by a Schedd machine? ... 401
  Why can't I find a core file on Linux? ... 401
  Why am I seeing inconsistencies in my llfs output? ... 402
  Why don't I see my job when I issue the llq command? ... 402
  What happens if errors are found in my configuration or administration file? ... 402
  Why is my flexible reservation not activated? ... 403
  Why was my energy aware job rejected? ... 403
  Other questions ... 403
  Troubleshooting in a multicluster environment ... 405
  How do I determine if I am in a multicluster environment? ... 405
  How do I determine how my multicluster environment is defined and what are the inbound and outbound hosts defined for each cluster? ... 405
  Why is my multicluster environment not enabled? ... 406
  How do I find log messages from my multicluster-defined installation exits? ... 406
  Why won't my remote job be submitted or moved? ... 407
  Why did the CLUSTER_REMOTE_JOB_FILTER not update the job with all of the statements I defined? ... 408
  How do I find my remote job? ... 408
  Why won't my remote job run? ... 408
  Why does llq -X all show no jobs running when there are jobs running? ... 409
  Troubleshooting adapter availability ... 409
  Troubleshooting in a Blue Gene environment ... 409
  Why do all of my Blue Gene jobs fail even though llstatus shows that Blue Gene is present? ... 409
  Why does llstatus show that Blue Gene is absent? ... 409
  Why did my Blue Gene job fail when the job was submitted to a remote cluster? ... 410
  Why does llmkres or llchres return "Insufficient resources to meet the request" for a Blue Gene reservation when resources appear to be available? ... 410
  Helpful hints ... 411
  Scaling considerations ... 411
  Hints for running jobs ... 412
  Hints for using machines ... 414
  History files and Schedd ... 415
  Getting help from IBM ... 416

Appendix B. LoadLeveler port usage ... 417

Accessibility features for LoadLeveler ... 421
  Accessibility features ... 421
  Keyboard navigation ... 421
  IBM and accessibility ... 421

Notices ... 423
  Trademarks ... 425

Glossary ... 427

Index ... 431

Figures

1. Example of a LoadLeveler cluster ... 3
2. LoadLeveler job steps ... 5
3. Multiple roles of machines ... 7
4. High-level job flow ... 16
5. Job is submitted to LoadLeveler ... 17
6. LoadLeveler authorizes the job ... 17
7. LoadLeveler prepares to run the job ... 18
8. LoadLeveler starts the job ... 18
9. LoadLeveler completes the job ... 19
10. How control expressions affect jobs ... 73
11. Multicluster Example ... 105
12. Job command file with multiple steps ... 172
13. Job command file with multiple steps and one executable ... 173
14. Job command file with varying input statements ... 173
15. Using LoadLeveler variables in a job command file ... 175
16. Job command file used as the executable ... 176
17. Striping over multiple networks ... 190
18. Striping over a single network ... 192
19. When the primary central manager is unavailable ... 400
20. Multiple central managers ... 400