Efficient implementation of core-excitation Bethe-Salpeter equation calculations K. Gilmorea,b,∗, John Vinsonc, E.L. Shirleyc, D. Prendergastd, C.D. Pemmarajud, J.J. Kase, F.D. Vilae, J.J. Rehre aEuropean Synchrotron Radiation Facility (ESRF), Grenoble 38000, France bJiangsu KeyLaboratory forCarbon-Based Functional Materials & Devices,Institute of Functional Nano & Soft Materials (FUNSOM), Soochow University,Suzhou, Jiangsu 215123, PR China cNational Institute of Standards and Technology (NIST), Gaithersburg, Maryland 20899, USA dLawrence BerkeleyNational Lab (LBL), Berkeley,California 94720, USA 6 eDepartment of Physics, Universityof Washington, Seattle, Washington 98195, USA 1 0 2 n a Abstract J 8We present an efficient implementation of the Bethe-Salpeter equation (BSE) method for obtaining core-level spectra 2 including x-ray absorption (XAS), x-ray emission (XES), and both resonantand non-resonantinelastic x-ray scattering spectra (N/RIXS). Calculations are based on density functional theory (DFT) electronic structures generated either ] hby abinit or Quantumespresso, both plane-wave basis, pseudopotential codes. This electronic structure is improved pthrough the inclusion of a GW self energy. The projector augmented wave technique is used to evaluate transition - pmatrix elements between core-level and band states. Final two-particle scattering states are obtained with the NIST mcore-levelBSEsolver(NBSE).Wehavepreviouslyreportedthisimplementation,whichwerefertoasocean(Obtaining Core Excitations from Ab initio electronic structure and NBSE) [Phys. Rev. B 83, 115106 (2011)]. Here, we present o cadditionalefficienciesthatenableustoevaluatespectraforsystemstentimeslargerthanpreviouslypossible;containing .up to a few thousand electrons. These improvements include the implementation of optimal basis functions that reduce s cthe cost of the initial DFT calculations, more complete parallelization of the screening calculation and of the action of i sthe BSE Hamiltonian, and various memory reductions. Scaling is demonstrated on supercells of SrTiO3 and example yspectra for the organic light emitting molecule Tris-(8-hydroxyquinoline)aluminum (Alq ) are presented. The ability 3 h to perform large-scale spectral calculations is particularly advantageous for investigating dilute or non-periodic systems p such as doped materials, amorphous systems, or complex nano-structures. [ 1 v 1. Introduction been approached from many directions, but most can be 8 divided by scale into two major categories. Atomic and 2 Core-level spectroscopies provide a quantitative 7 cluster models have sought to include a more exact treat- element- and orbital-specific probe of the local chemical 7 ment of many-body effects by considering a small subsys- 0environment and the atomic and electronic structure temcoupledlooselyorparametricallytothelargersystem. .of materials. For example, X-ray absorption (XAS) 1 At the other end, various single-particle theories are able 0and emission (XES) spectra probe the unoccupied and to treat many hundreds of atoms by approximating the 6occupied densities of states, respectively. The near-edge electron-electron and electron-hole interactions. The util- 1absorptionregion(XANES)issensitivetooxidationstate, ityofcalculatedspectrahingesuponcompromisesbetween : vspin configuration, crystal field, and chemical bonding, theabilitytoaccuratelymodelagivensystemandtheabil- iwhereas the extended region can be used to reconstruct X ity to address systems large enough to be representative the local coordination shells. However, extracting this of experiment: defects, dopants, interfaces, etc. r ainformation requires a reliable interpretation of the measured spectra. Often this is done by comparison with Within a first-principles approach, it is easiest to use reference spectra, but such comparisons are at best quali- theindependent-particleapproximation. Forx-rayabsorp- tative; it is preferable to calculate spectra quantitatively tion, the extended region of materials is very well repro- and predictively. ducedbythereal-spaceGreen’sfunctioncodefeff,which An accurate description of core-level excitations must is widely used over extensive energy ranges [1]. How- take into account both the highly localized nature of the ever, the current implementation loses accuracy at the core hole and the extended condensed system. The prob- edge when non-spherical corrections to the potential are lem of predictive computational x-ray spectroscopy has important. To reproduce the near-edge structure, accu- rate independent quasi-particlemodels for deep-core XAS can be constructed for s-levels [2]. In this case, the x-ray ∗Correspondingauthor at: EuropeanSynchrotron RadiationFa- absorptionintensityisproportionaltotheunoccupiedpro- cility,38043Grenoble,France. Tel.: +33476881727. Email address: [email protected] (K.Gilmore) jected density of states in the presence of a screened core- Preprint submittedto Computer PhysicsCommunications January 29, 2016 hole, weighted by final-state transition matrix elements. and Ni [23–29]. The BSE largely resolves this discrep- This approach has been used with different treatments of ancy, yielding branching ratios in reasonable agreement thecore-holeincludingthe“Slatertransitionstate”model with experiment [30]. However, simpler approaches such of a half-occupied core state [3] or the final state picture as TDDFT can also account for these corrections [31–33]. of a full core hole with [4] or without [5] the correspond- BSE solvers have been implemented in a few core-level ing excited electron. Due to the simplicity of their imple- codes to date [34–36]– as well assome valence levelcodes mentation and modest computational cost, such core-hole [37–40] – but their utility has been limited to a specialist schemes have been implemented in several standard DFT community. In part, this is due to significantly increased distributions [6–12] for near-edge spectra. However, these computational cost. feff and DFT-based core-hole ap- approaches often fail when the hole has non-zero orbital proachesrequirelittle moreeffortthanstandardDFT cal- angular momentum. Extensions to DFT such as time- culations [41], and calculations on systems of hundreds of dependent DFT (TDDFT) have been problematic due to atoms have become routine. BSE calculations are consid- the lackofsufficiently accurateexchange-correlationfunc- erably more intensive and have heretofore typically been tionals for core-excited states. Although recent devel- limited to a few tens of atoms. This added cost is largely opment of more accurate short-range exchange-corrected associated with including GW self-energy corrections to functionals [13, 14] has improved the viability of TDDFT the electronic structure, obtaining the screening response for calculating core-level spectra more advances will be tothecorehole,andactingwiththeBethe-SalpeterHamil- needed to make TDDFT a generally applicable approach. tonianonthe electron-holewavefunctionto obtainthe ex- Theaforementionedapproachesareallinherentlysingle- citation spectrum. However, given the significantly im- particle descriptions. For calculating x-ray spectra this provedaccuracyoftheBSEmethoditisdesirabletomake can be problematic, not only when the core hole has non- this a more widely used technique. This necessitates im- zero angular momentum, but also when many-body or proving its ease of use and reducing the computational multi-electron excitations are important. Many-electron cost. Toward this second objective we report herein sev- wavefunction-based methods constitute one obvious ap- eralefficiencyimprovementsto ourexisting BSEcode[42] proach to this problem. Contrary to single-particle de- thatnowallowBSEcalculationsonsystemsofhundredsof scriptions that typically begin from the perspective of the atoms and significantly reduce the time required for pre- extendedsystem,atomicandclustermodelsstartwiththe viously viable smaller systems. goal of completely describing the local problem. In par- The most time-consuming steps of a BSE calculation ticular, atomic multiplet theory [15] or configuration in- are(1)obtainingtheground-stateelectronicstructure,(2) teraction [16, 17] and exact diagonalization methods [18] correcting the quasiparticle energies by adding a (GW) canaccuratelyreproducecomplicatedLedgesoftransition self-energy, (3) evaluating the screening response to the metalsystems. However,thesetechniquesthattakealocal core-hole, and (4) determining the excitation spectrum of description typically ignoresolid-state effects and are lim- theBSEHamiltonian. OurBSEcalculationsbuildonself- ited to small cluster models, although recent progress at consistent field (SCF) DFT calculations of the ground- incorporating band structure within the multiplet model state charge density and the accompanying Kohn-Sham should be noted [19–21]. potential. We then use non-self-consistent field (NSCF) Further improvementsin DFT-basedapproachesto cal- calculations, i.e., direct calculations that solve the one- culating spectra require a two-particle picture including electron Schr¨odinger equation in the already computed particle-hole interactions, particle and hole self-energies, Kohn-Sham potential, to obtain all desired occupied and and full-potential electronic structure, within the context unoccupied Bloch states. To alleviate the burden of k- of many-body perturbation theory. Specifically, this in- space sampling during the NSCF calculation, and to re- volves solving the Bethe-Salpeter equation (BSE), i.e., duce the plane-wave basis, we implement a k-space inter- a particle-hole Green’s function. The Bethe-Salpeter polation scheme that solves a k-dependent Hamiltonian equation description of absorption includes single-particle overareducedsetofoptimalbasisfunctions[43,44]. This terms that describe the quasi-particle energies of the core isdescribedinsection3.1. Inmostcases,ratherthaneval- hole and the excited photoelectron, together with the in- uating the GW self-energy in the typical random-phase teraction between them. To leading order the interaction approximation (G0WRPA), we instead use a much more consists of two terms: the Coulomb interaction, which in- computationallyefficientapproximationbasedonamulti- cludes adiabatic screening of the core hole, and an un- pole modelfor the loss function. This has been previously screened exchange term. Even this two-particle descrip- described in detail [45], and we will not discuss it further tion of the many-body final state is already a signifi- here. To reduce the time required to calculate the screen- cant improvement for L edges [22]. For example, when ingresponsetothecore-holewetakeadvantageofthefact considering the L edges of the transition metals, the that this screening is highly localized around the excited 2,3 independent-particleapproximationpredictsa2:1branch- atom and partition space accordingly. The electronic re- ing ratio between the intensities of the L and L edges, sponseisevaluatedlocallyandamodeldielectricresponse 3 2 which is in contrast to experiments which exhibit branch- proves adequate for the rest of space [46]. Here we reduce ing ratios ranging from 0.7:1 for Ti to beyond 2:1 for Co the time needed to evaluate the screening response and 2 action of the BSE Hamiltonian by parallelizing these por- Coulomb interaction within the mean-field of the remain- tions of the code. This is discussed in section 3.2 for the ing electrons. This is separated into the attractive direct BSEHamiltonianandsection3.3forthescreening. Insec- term tion 4 we demonstrate the effectiveness of these improve- ments through XAS calculations on a series of supercells V =aˆ+(r,σ)aˆ (r′,σ′)W(r,r′,ω)aˆ (r,σ)aˆ+(r′,σ′), (6) D v c v c of SrTiO . Section 4.1 characterizes the time-scaling of 3 the code with respect to system size. Section 4.2 evalu- whichisscreenedbytheotherelectronsinthesystem,and ates the savings realized by employing the optimal basis the repulsive exchange term functions. The efficacy of parallelization is reported in section 4.3 for the action of the BSE Hamiltonian and in 1 section 4.4 for the screening response. Example XAS and VX =aˆ+v(r,σ)aˆc(r,σ)|r−r′|aˆv(r′,σ′)aˆ+c (r′,σ′), (7) XES spectra of the commonly studied organic molecule Tris-(8-hydroxyquinoline)aluminum (Alq ) are presented 3 which is treated as a bare interaction. The aˆ+ (aˆ ) oper- in section 4.5. We end with a summary of the capabilities v v ofocean andsome generalcomments onits applicability. ator creates (annihilates) an electron in the valence level while aˆ+ (aˆ ) creates (annihilates) an electron in a core c c level. 2. Formalism Ourimplementationofthe GW-BSEmethodisreferred to as ocean (Obtaining Core Excitations from Ab initio The theoretical description of the absorption of a pho- electronic structure and NBSE) and we have previously ton by a material may be expressed in terms of the loss function σ(q,ω)=−Imǫ−1(q,ω). The dielectric response presented it in detail [42]; NBSE refers to the NIST BSE solver. Because solving the Bethe-Salpeter equation at function ǫ depends on the photon energy ω and the mo- mentum transfer q. A formal many-body expression for the level of approximation described above is computa- tionally intensive compared to other methods of calculat- the loss function may be given as ing x-ray spectra our approach makes several reasonable σ(q,ω)∝−1Imh0|Oˆ+G(ω)Oˆ|0i, (1) approximations to improve the efficiency of the calcula- π tion. Specifically,withinaplane-waveapproachtosolving the Kohn-Sham equations, we use pseudopotentials to re- where|0iisthemany-bodyground-statewavefunction,the operator Oˆ describes the interaction between the photon duce the number of electrons and size of the plane-wave basis. The GW self-energy is obtained through the highly field and the system, and G(ω) is the Green’s function efficientmany-poleself-energyapproximationwhenappro- for the many-body excited-state. The form used for the operator Oˆ depends on the physical process being stud- priate [45]. The effortrequiredto obtainthe screening for ied, e.g., eiq·r for non-resonant inelastic x-ray scattering the direct interaction is also reduced by utilizing a hy- brid real-space approach in which the screening response (NRIXS) or the expansion (eˆ·r)+(i/2)(eˆ·r)(q·r)+... is evaluated at the RPA level locally about the absorbing for x-ray absorption (XAS); eˆ being the photon polariza- site, but the long range screening is approximated with a tion vector. Using the Bethe-Salpeter Hamiltonian, the model dielectric function [46]. Green’s function for the excitation can be approximated in a two-particle form as Despite these efficacious strategies, the largest system treated with our previous implementation of ocean was G(ω)=[ω−HBSE]−1 (2) a water cell consisting of 17 molecules [47]. To extend the capabilitiesofoceantotreatlargersystemswehavemade where the Bethe-Salpeter Hamiltonian is typically given several improvements. A calculation with ocean consists by of four stages H =H −H −V +V . (3) BSE e h D X The term for the core-hole 1. DFT 2. Translator 3. Screening 4. BSE H =ǫ +χ−iΓ (4) where stage 2 is a translation layer that allows different h c DFT packages to be used as the foundation for ocean. contains the average core-level energy ǫ , the spin-orbit c The limiting points of the calculation previously were interaction χ, and the core-hole life-time Γ. In most prac- stages1and3,solvingtheKohn-Shamequationsandeval- tical work, the excited electron Hamiltonian uatingthescreeningresponsetothecore-hole. Stage4,the GW actual evaluation of the Bethe-Salpeter Hamiltonian, was H =H +Σ −V (5) e KS xc also a limiting point for systems that required sampling isapproximatedbytheKohn-ShamHamiltonianH with numerous atomic sites. Therefore, our efforts focused on KS a GW self-energy correction ΣGW where the exchange- (i)improvingtheefficiencyoftheDFTcalculationand(ii) correlation energy, V , is subtracted off due to double- parallelizing the evaluation of the screening response and xc counting. The excited electron and hole interact via the the Bethe-Salpeter Hamiltonian. 3 3. Implementation constructed or stored. Rather, at each step in the Hay- dock scheme the BSE Hamiltonian acts on the vector de- 3.1. Optimal Basis Functions termined by the previous iteration, Ourpreviousversionofoceanusedabinit[6–9]asthe ψi+1 =H ψi , (9) DFT solver. ocean may now be alternatively based on α,n,k BSE α,n,k wavefunctions obtainedwith Quantumespresso [10]. For where i gives the ith vector, and convergence of the spec- the purpose of this paper, we report results based on the trum can be achieved in only a few hundred iterations use of Quantumespresso rather than abinit, though ei- (true diagonalization requires a number of steps equal to ther DFT solver may be chosen depending on the prefer- thedimensionofthematrix). Amoreexplicitexplanation ence of the user. is given by Benedict and Shirley in Ref. 49. We divide up The loss function in eqn. 1 is calculated by transform- equation9byseparatingtheBSEHamiltonianintopieces, ing the implied integral over all space into sums over eachofwhichisevaluatedinitsidealormostcompactba- reciprocal-space k-points within the Brillouin zone and sis. This is outlined in the following two sections. The real-space x-points within the unit cell, as is standard in fullvectorψi+1 isthenthesumofcontributionsfromeach calculations of periodic systems. The sum over k-points piece. requires a denser mesh for converging spectra than prop- ertiessuchasthedensity. Additionally,theBSEapproach 3.2.1. Long-range necessitates summing over a large number of unoccupied Only the direct interactionhas a long-rangecomponent states which are also not needed when looking at ground- as the exchange term requires both r and r′ be at the state properties. Thus, a considerable number of Kohn- core-hole site. The screened-coulomb interaction can be Sham states must be constructed. expandedinsphericalharmonicsandseparatedbyangular We reduce the computational cost of generating the momentum l, Blochfunctionsthroughtheuseofoptimalbasisfunctions (OBF) [43]. We have implemented the OBF routines of ǫ−1(r,r′′) ∞ ′ ′′ ′ Prendergast and Louie as a middle-layer in the ocean W(r,r)=Zdr |r′′−r′| = Wl(r,r). (10) X code [44]. The OBFs are a method of k-space interpo- l=0 lation and basis reduction. A fully self-consistent DFT The long-range part includes only the l=0 term, calculation is carried out to converge the density, using only enough bands to cover the occupied states. With ′ ′′ǫ−1(r,r′′) W (r,r)= dr , (11) thisdensityanon-SCFcalculationisperformed,including 0 Z [r ] > r′′,r′ unoccupied bands for the screening and BSE calculations (the density is held constant and the Kohn-Sham eigen- thatgoesas1/rawayfromthecorehole. AllhigherWl are system is solved for all the bands needed for the BSE). treated as short ranged. By integrating out the core-hole This second calculation is used as a basis to create the dependence via the core-hole density ρα OBFs. By using the OBFs we achieve a significant reduc- ǫ−1(r,r′′) tioninthetimespentcalculatingtheBlochfunctionsfora W (r)= dr′ρ (r′) dr′′ , (12) given system. Further details and quantitative results are 0 Z α Z [r>]r′′,r′ presented in section 4.2. the long-range component of the BSE Hamiltonian is a function of the electron spatial coordinate only. 3.2. BSE Hamiltonian We evaluate the actionof W ona vector by firsttrans- 0 In ocean the BSE Hamiltonian acts on a space con- forming to a super-cell space taining a core-level hole with index α and a conduction band electron with indicies n,k. A vector in this space is φα(x,k)= ψα,n,keik·x|un,k(x)i (13) X describedbythecoefficientsψ ,andthe photoelectron n α,n,k wavefunction for a given core index α is easily expanded φα(x,R)≡Fk→R[φα(x,k)], in real space from the conduction-band Bloch functions where R are the lattice vectors and F indicates a Fourier |Φ (x,R)i= ψ eik·(R+x)|u (x)i. (8) transform. The core-hole index α can refer to different α α,n,k n,k X angular momentum and spin lmσ states of the core hole, n,k but the long-rangecomponent of the direct term does not Within OCEAN we are interested primarily in the re- mix spin or angular momentum, and so these are treated sulting x-ray spectrum, rather than obtaining the actual sequentially. The k-space grid used in an ocean calcu- eigenvalues and eigenvectorsof the Bethe-Salpeter Hamil- lation defines a maximum range of the screened core-hole tonian, and so the iterative Haydock technique is used potential. [48, 49]. The advantage of an iterative technique is that The operation ψi+1 = W ψi is laid out in algorithm 1. 0 the complete Hamiltonian does not need to be explicitly Since the direct interaction is diagonal in real-space this 4 Algorithm 1 Computing ψi+1 =W ψi 3.3. Screening 0 1: for each x and α do 2: φ(k)= nψni,k,αeik·xun,k(x) scrIenentehdebyditrheectdiienletecrtraicctiroenspothnesecoofrteh-ehoslyestpemot.enWtiealcails- 3: φ(R)=PFk→R[φ(k)] culate this response within the random phase approxima- 4: φ(R)=W(x,R)×φ(R) tion (RPA) 5: φ(k)=FR→k[φ(R)] 6: ψi+1 +=φ(k)e−ik·xu∗ (x) ′ 7: end fno,kr,α n,k χ0(r,r′,ω)= dω G(r,r′,ω′)G(r′,r,ω′−ω). (15) Z 2πi We evaluate this expressionin real-spacearoundthe core- procedureiseasilyparallelizedbydistributingthex-points hole,and,takingadvantageofthelocalizednatureofnear- among all the processors. The parallel algorithm has the edge core excitations, we limit our full calculation to a additional final step of summing the vector ψi+1 over all sphere with a radius of approximately r =8 a.u., splicing theprocessors. Thislimitsthescaling,butthevectoritself on a model dielectric function for the long-range behavior is notlarge. Boththe numberofx-pointsandthe number [46]. Thecross-overradiusfromRPAtomodelisaconver- of empty bands required for a given energy range scale as gence parameterso that for eachmaterialone may ensure the volume of the system. that this approximation has no discernible effect on the calculated spectra. We use static screening ω = 0 which assumesthattheexcitonbindingenergyissmallcompared 3.2.2. Short-range to the energy scale for changes in the dielectric response, Theshort-rangecomponentsoftheHamiltonianarecal- i.e., smaller than the band gap. culatedbyprojectingtheconduction-bandstatesintoalo- Our real-space grid is 900 points determined from a 36 cal basis aroundthe core hole. These basis functions have point angular grid and 25 uniformly spaced radial points. a well-defined angular momentum around the absorbing We carryout the integraloverenergy in eqn.15 explicitly atomandarereasonablycompletesuchthatneartheatom alongtheimaginaryaxis,andforlargesystemsthebulkof time is spent projecting the wavefunctions onto this grid ϕ (r)| = Aν,l,mR (r)Y (rˆ). (14) n,k r<rc n,k ν,l l,m and constructing the Green’s function X ν,l,m ψ (r)ψ∗ (r′) Yl,m are the usual spherical harmonics and, following the G(r,r′,µ+it)= n,k n,k , (16) µ+it−ǫ ideas of the projector augmented wave method, R are X n,k ν,l n,k taken to be solutions to the isolated atom [50, 51]. This both allows us to capture the correct, oscillatory behav- where µ is the chemical potential. Using the OBFs, we iorofthevalenceandconductionstatesnearthe core-hole determine the Bloch functions on the spherical grid and andgivesusacompactbasisforcalculatingthematrixele- distribute the calculationacrossprocessorsbydividing up mentsofthe short-rangedirectandexchangeinteractions. the spatial coordinates. On modern computer architec- In practice, our core-hole will typically have 1 (s) or 3 (p) turesthisoperationwillbelimitedbymemorybandwidth. angular momentum states, and our conduction electrons To alleviate this we have each processor work on one or willhave4projectorseachforl =s,··· ,f or64lmstates. more small blocks of the Green’s function such that we For an L edge this gives a maximum matrix dimension of can be confident that the block will remain in the cache 4×3×64=768,includingspindegreesoffreedomforboth during the summation over bands. the electron and the hole. This is completely independent Togainanadditionallevelofparallelizationwecanalso of the overallsystem size. distribute the calculation by k-point. We divide the total Thetime-consumingstepsincomputingtheshort-range processors into pools of equal size, and every pool calcu- interactions are projecting into and out of the localized lates Gk. The sum G = kGk is carried out across the basis which requires summing over all of the bands and pools, placing the complePte G on the master pool. Then k-points. The mapping of band states to localized states each processor in the master pool calculates χ0 via equa- is precomputed and stored. We expand the exchange and tion15,performingtheintegralovertheimaginaryenergy local direct by angular momentum and exploit selection axisforitsblocks. Finally,thecompletematrixχ0 iswrit- rules to limit the number of multipole terms. The various ten to disk. In practice, a 2×2×2 shifted k-point grid pieces of the short-range Hamiltonian are distributed to is used for screening calculations, giving 8 unique points differentprocessors. Thesmallnumberofmultipole terms in the Brillouin zone and allowing up to 8 pools. There is and size disparity between them limits the degree of par- some cost to performing the parallel sum of G over pools allelization, but alleviates the need for a communication and inefficiency in evaluating χ0 on only the master pool, step. Each processor adds its results for the long- and howeverneitherofthesecontributesignificantlytotherun- short-range interactions, and then a single synchronizing ning time. summation of ψi+1 is carried out. 5 theDFTportionofthecalculationwasexecutedinparallel Table 1: Timing budget variation with system size. The central overagivennumberofcoreswhilethe remainingstagesof portionofthetableliststhetotalrun-timeinminutesspentoneach stage of the calculation. The DFT portion of the calculation was the calculation used only a single core. We use the Quan- performedinparallelandthetimeshownistherun-timemultiplied tumespresso density-functional theory package to gener- bythenumberofcoresused. Theremainingstageswerecarriedout ate the ground-state electronic structure upon which the onasinglecore. spectralcalculationsarebased. Thesecalculationsareper- Supercell (N) 1 2 3 4 formed with norm-conserving pseudopotentials obtained Rel. System Size 1 8 27 64 from the abinit distribution [6–9] with the exception of DFT 22 832 17161 265226 Ti for which we made a pseudopotential with semi-core Translator 0.08 2.5 43.8 754 states included in the valence configuration. We employ Screening 0.2 2.0 15.4 76.5 the local-density approximation to the exchange correla- BSE 0.5 26.1 250 1915 tion functional and truncate the planewave basis at 50 Ry. Γ-point sampling was used to obtain the ground- state electron density, which was subsequently expanded 4. Results into separate sets of Kohn-Sham orbitals for the evalua- tionofthe screeningandthe Bethe-SalpeterHamiltonian. To demonstrate the scaling performance of ocean we In each case, the number of empty bands utilized equaled consider the Ti L-edge XAS of supercells of SrTiO . Pure 3 thenumberofoccupiedbands. Forthescreeningresponse, SrTiO is a wide band-gap semiconductor and incipient 3 states were constructed at a single k-point while to evalu- ferroelectric that assumes an undistorted cubic structure ate the Bethe-Salpeter Hamiltonian states were expanded at ambient conditions. SrTiO is a foundational ma- 3 on a shifted 2×2×2 grid. The above values were suffi- terial on which a remarkable variety of electronic and cient for convergence of the final spectrum for supercells opto-electronic devices are based. Epitaxially interfacing with N ≥ 2. Since our purpose at present is to study the SrTiO3 withotheroxidecompounds,notablyLaAlO3,can scaling performance of ocean we use these same values yieldinterfacial2-dimensionalelectrongasesthataregood for the N = 1 case (conventional cell) even though this k- conductors [52], superconducting [53], show magnetic or- point sampling is insufficient for convergence at this size. dering[54],orthatareevensimultaneouslysuperconduct- (Convergence at N = 1 can be reached by increasing the ingandmagneticallyordered[55,56]. DopingSrTiO also 3 k-pointsamplingto2×2×2forthedensityandevaluation produces a wide range of useful physical properties. Ad- of the screening, and to 3×3×3 for the BSE.) dressingtheseinterestingsystemsnumericallyrequiresthe Table 1reportsthe timing budgetseparatedbystageof use of large supercells. For instance, a doping level of 3.7 the calculation for each of the four supercells. The DFT % (1/27) necessitates a 3 × 3 × 3 supercell while for a segments of the calculations were run in parallel and the 1.56 % (1/64) doping level a 4×4×4 supercell must be timereportedistheproductoftherun-timeandthenum- constructed. ber of cores used. The remaining stages of the calculation Inthissectionweconsiderrun-timesforcubicsupercells were all run on a single core; only a single Ti site was in- ofpureSrTiO withN={1,2,3,4}repetitionsofthecon- 3 terrogated for each supercell. Extrapolating these times ventional cell in each direction. This corresponds to sys- to all Ti sites in each supercell, we find that the run-time tems with {5,40,135,320}atoms and {32,256,864,2048} required for the three stages after the DFT portion con- valence electrons, respectively. We begin in section 4.1 stitutes approximately 30 % of the total calculation time with benchmark calculations by investigating the single- for the larger supercells. Parallelizationof these segments core time-scaling of the original screening and BSE por- could then reduce their run-time to a small fraction of tions of the calculation. The four supercells are run with the DFT run-time. Table 1 shows that the time spent in equivalent parameters and the usual basis of Kohn-Sham the BSE stage dominates that used in the screening por- orbitals,thatis,withoutemployingtheoptimalbasisfunc- tion of the calculation. Thus, it is essential to implement tions. We next consider in section 4.2 the savings gained an effective parallelization scheme for the BSE stage; this throughthe k-space interpolationschemeof optimalbasis will be demonstrated below. The wavefunction transla- functions. In sections 4.3 and 4.4, we determine the scal- tion step also becomes time consuming for large systems. ing behavior of the parallel implementation of the BSE Atpresent,wehavenotsoughttoimprovetheefficiencyof andscreeningroutines,respectively. Weendinsection4.5 thisprocess,butfutureeffortsmaybedirectedatreducing withabriefdemonstrationofocean bycalculatingthe x- the time required here. rayabsorptionandemissionspectraofanorganicmolecule commonly used in light emitting diode devices. The individual run-time scaling of the screening and BSE stages of the calculation are presented in Figure 1. 4.1. Single Processor Calculations These results demonstrate that both the screening calcu- Before discussing the improvements we have made it is lationandtheevaluationoftheBSEscaleasthenumberof informative to establish the prior baseline performance of statesto the secondpower. Since weuse the same k-point ocean. We performed a series of timing tests for which gridsforallsupercellsthevariationinthenumberofstates 6 25 system size. Therefore the BSE section of the code is ex- pected to scale with the secondpower ofsystemsize, asis Screening BSE (x1/3) confirmed in figure 1. 20 2 1/ e) 4.2. Reduced Basis m Ti 15 n- FortheSrTiO3supercellspresentedinthiswork,Γpoint u R samplingwasusedasaninputtotheOBFschemeandthe e v 10 Bloch functions were interpolated onto 2×2×2 k-point elati grids(Forthesinglecella23 k-pointgridwasinterpolated R ( to a 33 grid). As NSCF DFT calculations scale linearly 5 with the number of k-points this represents a potential 8x speedup (3.4x for the conventional cell). This gain, 0 however, is partially offset by the additional steps needed 1 8 18 27 36 64 to carry out the OBF interpolation. Relative Number of States Figure 1: Run-Time Scaling. The relative run-times for the screening and BSE portions of the calculation are presented versus Table2: TherelativetimerequiredforeachstepingeneratingBloch systemsize. Thescreening(leftaxis,redcircles)scaleswiththesys- functionsandthespeedupwithrespecttothetotalruntimeachieved temsizetothesecondpowerandtheBSE(rightaxis,bluesquares) byusingtheOBFs(Thetotal run-timeincludesstepsnotexplicitly goes asthesystemsizetothefourthpower. listedinthetable). Supercell (N) 1 2 3 4 Rel. System Size 1 8 27 64 comes from the increasing number of bands which grows # Processors 8 32 64 128 proportionally with the supercell size. In this example we SCF 0.48 2.20 6.45 42.5 haveconsideredasinglesiteinthesupercell. Thenumber NSCF 0.06 0.38 7.11 38.6 ofsitesalsogrowwithvolume,leadingtoanoverallscaling OBF 0.18 1.55 12.1 49.7 of volume cubed for both the screening and BSE. Given Total 1.00 4.90 30.3 152 that as the system size increases the BSE stage comes to Speedup 1.27x 1.22x 2.24x 2.46x dominate the overall run-time, it is clearly advantageous to find an effective parallelization scheme for this stage. This is discussed below in section 4.3. InTable2weshowtherelativetimefortheSCF,NSCF, and OBF stages that are needed to generate Bloch func- From eqn. 16 it appears that the screening calculation tionsfortheBSEcalculation. ByassumingthattheNSCF should scale directly with the number of bands as the ra- would necessarily take 8x (3.4x) longer without the OBF dial gridis independent of systemsize. However,for large interpolation we estimate the savings as a percentage of systems much of the time is spent projecting the DFT the total run-time. While the small cells show only a wave functions onto the radial grid from the planewave modest improvement, the 33 and 43 cells complete in less basis. Both the number of planewaves and bands grow than half the time when using the OBF scheme. Gener- with system size leading to this second power behavior. ically, the expected savings will depend most strongly on This indicates that an improved approach to projecting the needed k-point sampling for the system under inves- the Kohn-Shamstates onto the localbasis is anavenue to tigation. Metallic systems require much denser k-point further reducing the time spent in the screening routine. gridsforconvergenceandwillyieldcorrespondinglylarger Per section 3.2, the BSE Hamiltonian is broken into savings. three sections: non-interacting, short-range, and long- range. The non-interacting term is diagonal, and, for a 4.3. BSE Scaling singlesitescalesdirectlywithsystemsize. Theshort-range part,likethescreeningcalculation,reliesonasmall,local- To investigate the scalability of our parallel BSE solver izedbasisthatdoesnotchangewithsystemsize. Alsolike with processor number we focus on the 4×4×4 super- the screening calculation, for large system the projection cell of SrTiO . This system approaches the limits of sin- 3 of the DFT orbitals into this localizated basis to deter- gle node execution due to memory considerations. A sig- mine the coefficients A (eqn. 14) will grow with both the nificant amount of time for each run is spent reading in number of planewaves and bands. While this projection thewavefunctions. Forthisexamplethewavefunctionsre- happens only once at the beginning of the BSE stage, it quire 12 GB of space (3072 conduction bands, 8 k-points, does lead to scaling with the second power of the system and32,768x-points). The time needed to readthese data size. The long-range part of the BSE is the most compu- typically ranged from 40-60 seconds [57]. Currently, the tationally expensive. As shown in algorithm 1, the action wavefunctions are read in by a single MPI task and then of W grows with both the number of x-points and the distributed so this time is relatively constant with pro- 0 number of empty states, both of which scale linearly with cessor count. To give a better picture of the scaling, we 7 subtract out the time required for this step before com- tion,theBSEcodehasverylittle explicitserialexecution. paring runs. If multiple runs are carried out on the same Further work is needed to investigate and alleviate the cell, varying atomic site, edge, or x-ray photon (polariza- bottlenecks preventing linear scaling. tionormomentumtransfer),thewavefunctionsarekeptin memory, and therefore this upfront costwill be amortized 4.4. Core-hole Screening Scaling overallthe calculations. Inthe presentexamplewecalcu- In this section we investigate the timing of calculat- latethespectraforthreepolarizations,using200Haydock ing the valence electron screening of the core-hole. As iterations. stated before, for core-level spectroscopy we are mainly ToassessthescalingoftheBSEsectionweusethemet- interested in the local electronic response. Limiting our ric cost. The costreflects what the user wouldbe charged calculation of the RPA susceptibility to a region of space on a computing facility: the run-time multiplied by the around the absorbing atom makes the calculation much number of processors. A perfectly parallelizedcode would cheaper than traditional planewave approaches without maintain a cost of 1.0 when run across any number of sacrificing accuracy [46]. Within this approach, and for processors. Costslessthan1.0wouldindicatesomesuper- mostothermethodsofcalculatingcore-excitationspectra, scaling behavior, most likely from accidental cache reuse. aseparatescreeningcalculationisrequiredforeachatomic In figure 2we show the relativecostofthe BSEsectionas site. In systems with inequivalent sites due to differences a function of the number of processors. The testbed for in bonding, defects, or vibrational disorder contributions this section consists of identical, dual socket nodes with 6 from each atom must be summed to generate a complete processors (cores) per chip and connected via high-speed spectrum. interconnects. Eachmulti-nodetestwasrunfivetimesand To test the behavior of the screening section we use the the best time for each was used. same test case as for the BSE, the 4 × 4 × 4 supercell The BSE section is a hybrid MPI/OpenMP code so we of SrTiO . The RPA susceptibility is calculated using a 3 compareexecutionwithvariouslevelsofthreading. Ascan singlek-pointand4096bands. Weshowresultsforasingle beseeninfigure2,theuseofthreadsimprovesthespeedof site, 16 sites, and 32 sites (out of the 64 titanium atoms the calculation over pure MPI. This is true even when 12 in our supercell). The number of sites will vary based threads are used, requiring communication across the two on the system being investigated. Impurities may require sockets via OpenMP. We also see very acceptable scaling only a single site, but liquids or other disordered systems with processor count. The benchmark calculation takes a necessitateaveragingovermanysites[47,58]. Asisevident little over 8.5 hours on a single node and single thread, in figure 3, this section of the calculation does not scale while running on 192 cores can be done in approximately beyond a few dozen processors. This is especially true 3.5 minutes for a real-worldspeedup of 114x. when only investigating a single site. While better scaling is desirable, as observed in Table 1 the total time for this 2.2 sectionis quite small. In this particular test the screening 1 2 calculationtookjustunder8minforasinglesitewhilethe 3 2 6 initialDFTcalculationsrequiredapproximately3hourson 12 128 processors. 1.8 4.5. Example Spectra Cost 1.6 Figure 4 presents the Ti L edge of pure SrTiO3 calcu- lated for both the primitive cell (one formula unit) using 1.4 our previous version of ocean [42] and for the 5×5×4 supercell (100 formula units) obtained with the code im- 1.2 provements described herein. This comparison serves to verify that the fidelity of the computational scheme has 1 been preserved through the modifications we have made 12 24 48 96 192 and demonstrates the feasibility of calculating spectra of # of Processors much larger systems than previously possible. Doping Figure 2: BSE Cost. The cost, ratio of actual run-time to the- SrTiO yields a wide range of interesting physical prop- 3 oretical linear scaling, as a function of processor count. Each line erties and we imagine that in future investigations it will represents adifferentnumber ofthreads pernode. OneortwoMPI be fruitful to apply ocean to studies of such systems. tasks per socket (6 or 3 threads) are seen to give the best perfor- mance. Compounds of SrTiO3 doped with transition metals are investigated for use in photocatalysis and as permeable Sublinear scaling, a cost in excess of 1.0, is, in general, membranes, among other uses. Further, there is some ev- theresultofnon-parallelizedsectionsofcode,synchroniza- idence that doping with Mn, Fe and Co may yield dilute tion costs, and memory bandwidth constraints. With the magnetic semiconductors [59–61]. The task of improving exception of the aforementioned wavefunction initializa- the performance of such materials is assisted by a better 8 32 Sites 2 16 Sites Primitive Cell 1 Site 5x5x4 Supercell 1.5 s) e nit m u Ti b. e ar Relativ 1 ensity ( nt I 0.5 0 16 24 32 40 48 56 64 452 454 456 458 460 462 464 466 468 # of Processors Energy (eV) Figure 3: Screening scaling. The relative time for the screening Figure 4: Ti L-edge XAS of SrTiO3. Comparison of the Ti L- sectionasafunctionofthenumberofprocessors. Weshowboththe edgeXASofSrTiO3 calculatedfromtheprimitivecellusingabinit total timerequiredforasinglesite,16sites,and32sitesrelativeto andtheserialversionofoceanandfromthe5×5×4supercellusing thetimeneededtorunthesinglesiteon16processors. Thepersite theOBFsandparallelizedoceancode. time is lower with a larger number of sites due to amortization of i/o and other initialization costs. While the scaling with processor number is poor, we see only very limited detrimental effects from usingmoreprocessors,andthereforethescreeningcalculationisrun usingthesameprocessorcountastheother stages. simulation. Thespectrapresentedinfigure5showtheav- erageproducedfrom10configurationsofaMDsimulation understanding of how the dopant element interacts with performed at 300 K within Quantumespresso. In addi- the host system. It is now possible to model such spectra tion to averaging over the MD configurations, each spec- with a realistic, first-principles approach. trum is an average over each atomic site in the molecule As a second example system we consider the organic for the given element. Thus, these results represent a se- molecule Tris-(8-hydroxyquinoline)aluminum (Alq ). ries of 30 calculations for the N edge, 30 calculations for 3 Alq has remained a leading molecule for the electron theOedge,and270calculationsfortheCedge(keepingin 3 transport and emitting layer in organic light emitting mind that the same ground state DFT calculation can be devices since it was first proposed for this purpose [63]. used for all edges of a given MD configuration). Despite Despite numerous academic and industrial investigations the large sampling required to produce these spectra the of such systems, significant problems remain, particularly calculations are not particularly burdensome. After av- regardingdevicelifetime. TheAlq moleculeissusceptible eraging over all samples and sites, self-energy corrections 3 to decomposition through reaction with water molecules to the electronic structure were incorporated within the [64] or metal atoms from the cathode layer [65, 66]. multi-pole self-energy scheme of Kas et al. [45]. Finally, X-ray spectroscopy is commonly used to further reveal an ad hoc rigid energy shift was applied to each spectrum the interaction of water or metal ions with the Alq toaligntheabsoluteenergieswiththeexperimentalvalues. 3 molecule and the pathway by which the complex decom- poses. However, results are difficult to properly interpret Our calculations are generally in good agreement with because such experimental work is rarely coupled with the measured spectra of Ref. [62]. The primary structure calculated spectra and because the molecule is sensitive oftheXASisreproducedforeachelementwithonlyafew to decomposition due to x-ray beam exposure. In this minor differences. For carbon, the feature near 289 eV section we demonstrate the ability of ocean to produce in our calculation appears in the experiment around 288 reliable XAS and XES spectra for this system. eV while, for nitrogen, the second feature is about 0.5 eV Alq exhibits two isomers, commonly referred to as fa- too low in energy in the calculation. The three peaks of 3 cial and meridional. The meridional isomer is favored en- the oxygen XAS match the experimental spectrum quite ergetically and we consider only this structure. In figure closely. The minor differences could originate in differ- 5 we present the C, N and O K-edge XAS and XES from ences in electronic structure between the gas-phase,as we the meridional Alq molecule, which consists of 52 atoms. consider, and the condensed-phase material probed in ex- 3 We treat the molecule in the gas-phase, though devices periment. Additionally,wepresentlyneglectvibroniccou- typically contain amorphous films of the molecule. Since plingintheexcited-state[67–69]whichcanbeparticularly thermalatomicmotioncannoticeablyimpactspectralfea- important in molecules with light elements [70, 71]. Nev- tures for lighter elements we sum spectra from a series of ertheless,agreementwithexperimentisgenerallyquitefa- configurations generated by a molecular dynamics (MD) vorable. 9 Carbon Nitrogen Oxygen Intensity (arb. units) XES XAS Intensity (arb. units) XES XAS Intensity (arb. units) XES XAS 265 270 275 280 285 290 295 300 375 380 385 390 395 400 405 410 415 515 520 525 530 535 540 545 Energy (eV) Energy (eV) Energy (eV) Figure5: XAS and XES of Alq3. K-edgeXAS(color)andXES(black) ofcarbon(left,red),nitrogen(middle,green)andoxygen(right, blue)forgas-phasemeridionalTris-(8-hydroxyquinoline)aluminum(Alq3). Greycurvesshowtheexperimentaldatareproducedfromreference [62]. 5. Conclusion One should keep in mind that the Bethe-Salpeter equa- tionisoneofseveralapproachestocalculatingx-rayspec- We have implemented a series of improvements that al- tra. While the BSE method holds advantages in being low core-level spectral calculations based on solving the a predictive,first-principles approach,its two-particlefor- Bethe-Salpeter equation to be performed for much larger mulation,incertaincases,isstillacrudeapproximationto systems than previously possible. Whereas ocean was the actual many-body problem. Contrary to this, many- previously limited to systems of a few tens of atoms, here particleapproachessuchasmultipletcalculationsandclus- we have reported calculations on systems as large as a ter models capture many-body physics more completely, 5x5x4 supercell of SrTiO , which consists of 500 atoms thoughtypicallyatthecostofband-structureeffects. The 3 and 3200 valence electrons and would allow for the direct challenge for these local methods is to incorporate the simulation of doping at the 1 % level. This particular electronic structure of the extended system in a scalable calculation required only 12.5 hours on 128 cores. This fashion. Itappearspossible tomakeconsiderableprogress enhanced capability makes spectral calculations on amor- toward this goal by working within a localized basis con- phous and dilute systems feasible. structed from an extended electronic structure [19]. For Wereducedthecostofthenon-selfconsistentfieldDFT the BSE technique, it will be necessaryto incorporatead- calculation through a k-space interpolation scheme based ditional many-particle effects. We expect that a better on a reduced set of optimal basis functions. This yielded description of self-energy effects being made accessible by a speed-up by a factor of 2-2.5 for large systems. Paral- recentcumulant expansiondevelopment will proveadvan- lelizationofthescreeningcalculationandevaluationofthe tageous in this respect [72, 73]. BSE Hamiltonian provided further savings. We find that Theoceansourcecodeisnowavailableforgeneraluse; the parallelizationof the screening response scales only to for details and documentation see [74]. a few dozen processors. However, since the evaluation of the screening at this level of parallelzation has a limited cost compared to the initial DFT calculation this is not 6. Acknowledgments a significant concern at present. The BSE section of the code scales well to a few hundred processors with only a ThisworkissupportedinpartbyDOEGrantDE-FG03- small growth in overhead that appears linear in proces- 97ER45623 (KG, JJK, FDV, JJR). KG was additionally sor count. When both MPI and OpenMP parallelization supported by the National Natural Science Foundation of is used we achieved a speedup of 114x on 192 processors. China (Grant 11375127), the Natural Science Foundation However,whenonlyMPIisusedthereisevidencethatthe ofJiangsuProvence(GrantBK20130280)andtheChinese overheadisgrowingsuperlinearly. Thereremainsroomfor 1000TalentsPlan. Calculationswereconductedwithcom- future improvement to reduce the MPI overhead. puting resources at Lawrence Berkeley National Labora- In addition to x-ray absorptionspectra and x-ray emis- toryandtheNationalEnergyResearchScientificComput- sionspectra,inelasticx-rayscatteringspectramayalsobe ingCenter(NERSC),aDOEOfficeofScienceUserFacility calculated with ocean. We view ocean as uniquely ca- supportedbytheOfficeofScienceofthe U.S.Department pable of evaluating such spectra, particularly at L edges, of Energy under Contract No. DE-AC02-05CH11231. with ab initio accuracy and predictive ability for complex Certainsoftwarepackagesareidentifiedinthispaperto and dynamic systems. We envisionuse ofocean to inter- foster understanding. Such identification does not imply pret data collected on a wide range of systems including recommendationorendorsementbytheNationalInstitute in operando studies of fuel cell materials, photocatalysts, ofStandardsandTechnology,nordoes itimply thatthese gas sensor and energy storage materials, as well as from are necessarily the best available for the purpose. liquid environments. 10