Distributed-Memory Computing With the Langley Aerothermodynamic Upwind Relaxation Algorithm (LAURA)

Christopher J. Riley* and F. McNeil Cheatwood†
NASA Langley Research Center, Hampton, VA 23681

*Research Engineer, Aerothermodynamics Branch, Aero- and Gas-Dynamics Division.
†Research Engineer, Vehicle Analysis Branch, Space Systems and Concepts Division.

Paper presented at the 4th NASA National Symposium on Large-Scale Analysis and Design on High-Performance Computers and Workstations, Oct. 15-17, 1997, Williamsburg, VA.
The Langley Aerothermodynamic Upwind Relaxation Algorithm (LAURA), a Navier-Stokes solver, has been modified for use in a parallel, distributed-memory environment using the Message-Passing Interface (MPI) standard. A standard domain decomposition strategy is used in which the computational domain is divided into subdomains with each subdomain assigned to a processor. Performance is examined on dedicated parallel machines and a network of desktop workstations. The effect of domain decomposition and frequency of boundary updates on performance and convergence is also examined for several realistic configurations and conditions typical of large-scale computational fluid dynamic analysis.
Introduction

The design of an aerospace vehicle for space transportation and exploration requires knowledge of the aerodynamic forces and heating along its trajectory. Experiments (both ground-test and flight) and computational fluid dynamic (CFD) solutions are currently used to provide this information. At high-altitude, high-velocity conditions that are characteristic of atmospheric reentry, CFD contributes significantly to the design because of the ability to duplicate flight conditions and to model high-temperature effects. Unfortunately, CFD solutions of the hypersonic, viscous, reacting-gas flow over a complete vehicle are both CPU-time and memory intensive even on the most powerful supercomputers; hence, the design role of CFD is generally limited to a few solutions along a vehicle's trajectory.

One CFD code that has been used extensively for the computation of hypersonic, viscous, reacting-gas flows over reentry vehicles is the Langley Aerothermodynamic Upwind Relaxation Algorithm (LAURA) [1, 2]. LAURA has been used in the past to provide aerothermodynamic characteristics for a number of aerospace vehicles (e.g. the AFE [3], HL-20 [4], Shuttle Orbiter [5], Mars Pathfinder [6], and SSTO Access to Space [7]) and is currently being used in the design and evaluation of blunt aerobraking configurations used in planetary exploration missions [8, 9] and Reusable Launch Vehicle (RLV) concepts (e.g. the X-33 [10, 11] and X-34 [12] programs). Although the LAURA computer code is continually being updated with new capabilities, it is a mature piece of software with numerous options and utilities that allow the user to tailor the code to a particular application [13].

LAURA was originally developed and tuned for multiprocessor, vector computers with shared memory such as the CRAY C-90. Parallelism using LAURA is achieved through the use of macrotasking where large sections of code are executed in parallel on multiple processors. Because LAURA employs a point-implicit relaxation strategy that is free to use the latest available data from neighboring cells, the solution may evolve without the need to synchronize tasks. This results in a very efficient use of the multitasking capabilities of the supercomputer [14]. But future supercomputing may be performed on clusters of less powerful machines that offer better price/performance than current large-scale vector systems. Parallel computers such as the IBM SP2 consist of large numbers of workstation-class processors with memory distributed among the processors instead of being shared. In addition, improvements in workstation processor and network speed and the availability of message-passing libraries allow networks of desktop workstations (that may sit idle during non-work hours) to be used for practical parallel computations [15]. As a result, many CFD codes are making the transition from serial to parallel computing [16-20]. The current shared-memory, macrotasking version of LAURA requires modification before exploiting these distributed-memory parallel computers and workstation clusters.

Several issues need to be addressed in creating a
distributed-memory version of LAURA: 1) There is the choice of programming paradigm to use. A domain decomposition strategy [17] (which involves dividing the computational domain into subdomains and assigning each to a processor) is a popular approach to massively parallel processing and is chosen due to its similarity to the current macrotasking version. 2) To minimize memory requirements, the current data structure of the macrotasking, shared-memory version is changed since each processor requires storage only for its own subdomain. 3) The choice of message-passing library (which processors use to explicitly exchange information) may impact portability and performance. 4) The frequency of boundary data exchanges between computational subdomains can influence (and may impede) convergence of a solution although the point-implicit nature of LAURA already allows asynchronous relaxation [14]. 5) There are also portability and performance concerns involved in designing a version of LAURA to run on different (cache-based and vector) architectures. 6) Finally, a distributed-memory, message-passing version of LAURA should retain all of the functionality, capabilities, utilities, and ease of use of the current shared-memory version.

This paper describes the modifications to LAURA that permit its use in a parallel, distributed-memory environment using the Message-Passing Interface (MPI) standard [21]. An earlier, elementary version of LAURA for perfect gas flows using the Parallel Virtual Machine (PVM) library [22] provides a guide for the current modifications [23]. Performance of the modified version of LAURA is examined on dedicated parallel machines (e.g. IBM SP2, SGI Origin 2000, SGI multiprocessor) as well as on a network of workstations (e.g. SGI R10000). Also, the effect of domain decomposition and frequency of boundary updates on performance and convergence is examined for several realistic configurations and conditions typical of large-scale CFD analysis.

LAURA

LAURA is a finite-volume, shock-capturing algorithm for the steady-state solution of inviscid or viscous, hypersonic flows on rectangularly ordered, structured grids. The upwind-biased inviscid flux is constructed using Roe's flux-difference splitting [24] and Harten's entropy fix [25] with second-order corrections based on Yee's symmetric total-variation-diminishing (TVD) scheme [26]. Gas chemistry options include perfect gas, equilibrium air, and air in chemical and thermal nonequilibrium. More details of the algorithm can be found in Refs. 1, 2, and 13.

The point-implicit relaxation strategy is obtained by treating the variables at the local cell center L at the advanced iteration level and using the latest available data from neighboring cells. Thus, the governing relaxation equation is

    M_L \delta q_L = r_L    (1)

where M_L is the n x n point-implicit Jacobian, q_L is the vector of conserved variables, r_L is the residual vector, and n is the number of unknown variables. For a perfect gas and equilibrium air, n is equal to 5. For nonequilibrium chemistry, n is equal to 4 plus the number of constituent species. The residual vector r_L and the Jacobian M_L are evaluated using the latest available data. The change in conserved variables, \delta q_L, may be calculated using Gaussian elimination. An LU factorization of the Jacobian can be saved (frozen) over large blocks of iterations (approximately 10 to 50) to reduce computational costs as the solution converges. However, the Jacobian will need to be updated every iteration early in the computation when the solution is changing rapidly.
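To make the per-cell update of Eq. (1) concrete, the sketch below solves the n x n system for a single cell by Gaussian elimination. It is a minimal illustration in C under assumed names and data layout (LAURA itself is written in Fortran, and its production path also reuses a frozen LU factorization of M_L rather than re-eliminating every iteration):

    /* Minimal sketch of the point-implicit update of Eq. (1),
     * M_L dq_L = r_L, for one cell.  Names, layout, and the use of
     * partial pivoting are illustrative assumptions, not LAURA code. */
    #include <math.h>

    #define NVAR 5   /* perfect gas or equilibrium air: n = 5 */

    /* Solve M dq = r by Gaussian elimination with partial pivoting.
     * M and r are overwritten in the process. */
    static void point_implicit_update(double M[NVAR][NVAR],
                                      double r[NVAR], double dq[NVAR])
    {
        for (int k = 0; k < NVAR; k++) {
            int p = k;                        /* find the pivot row */
            for (int i = k + 1; i < NVAR; i++)
                if (fabs(M[i][k]) > fabs(M[p][k])) p = i;
            for (int j = k; j < NVAR; j++) {  /* swap rows k and p */
                double t = M[k][j]; M[k][j] = M[p][j]; M[p][j] = t;
            }
            double t = r[k]; r[k] = r[p]; r[p] = t;
            for (int i = k + 1; i < NVAR; i++) {  /* eliminate below pivot */
                double f = M[i][k] / M[k][k];
                for (int j = k; j < NVAR; j++) M[i][j] -= f * M[k][j];
                r[i] -= f * r[k];
            }
        }
        for (int i = NVAR - 1; i >= 0; i--) {     /* back substitution */
            dq[i] = r[i];
            for (int j = i + 1; j < NVAR; j++) dq[i] -= M[i][j] * dq[j];
            dq[i] /= M[i][i];
        }
    }

Because each cell's system is independent, this update can be applied to the cells of a block in any order, which is what frees the relaxation from synchronization requirements.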
Macrotasking

LAURA utilizes macrotasking by assigning pieces of the computational domain to individual tasks. First, the computational domain is divided into blocks, where a block is defined as a rectangularly ordered array of cells containing all or part of the solution domain. Then each block may be subdivided in the computational sweep direction into one or more partitions. Partitions are then separately assigned to a task (processor). Figure 1 shows a two-dimensional (2D) domain divided into 2 blocks with each block divided into 2 partitions. Thus a task may work on one or more partitions which may be contained in a single block or may overlap several blocks. Each task then gathers and distributes its data to a master copy of the solution which resides in shared memory. With the point-implicit relaxation, there is no need to synchronize tasks, which results in a very efficient parallel
implementation.

Fig. 1 Domain decomposition of macrotasking version.

Message-passing

In the new message-passing version of LAURA, the computational domain is again subdivided into blocks along any of the three (i, j, k) coordinate directions with each block assigned to a processor. As compared to the macrotasking version, this is analogous to defining each block to contain only one partition and assigning each partition to a separate task. The number of blocks is therefore equal to the total number of processors. Due to the distributed memory of the processors, each task requires storage only for its own block plus storage for boundary data from as many as six neighboring blocks (i.e. one for each of the six block faces). Figure 2 shows a 2D domain divided equally into 4 separate blocks. Each processor works only on its own block and pauses at user-specified intervals to exchange boundary data with its neighbors. The boundary data exchange is explicitly handled with send and receive calls from the MPI message-passing library [21]. The MPI library was chosen because it is a standard and because there are multiple implementations that run on workstations as well as dedicated parallel machines [27, 28]. Synchronization of tasks occurs when messages are exchanged, but this exchange is not required for any particular iteration due to the point-implicit relaxation scheme. As in the macrotasking version, tasks (or blocks) of various sizes may accumulate differing numbers of iterations during a run. For blocks of equal size, it may be convenient to synchronize the message exchange at specified iteration intervals.

Fig. 2 Domain decomposition of message-passing version (figure labels: block, communication between blocks, boundary data storage).
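A sketch of the face exchange described above is given below in C. The routine names, buffer organization, and use of MPI_Sendrecv are illustrative assumptions (the paper specifies only that the exchange uses MPI send and receive calls); faces on physical boundaries are represented with MPI_PROC_NULL so the same loop handles interior and boundary blocks:

    /* Sketch: exchange boundary (face) data with up to six neighboring
     * blocks.  Buffers are assumed to be packed/unpacked elsewhere. */
    #include <mpi.h>

    void exchange_faces(double *sendbuf[6], double *recvbuf[6],
                        const int count[6], const int neighbor[6],
                        MPI_Comm comm)
    {
        /* faces ordered -i, +i, -j, +j, -k, +k; a face on a physical
         * boundary has neighbor[f] == MPI_PROC_NULL and is skipped
         * automatically by MPI */
        for (int f = 0; f < 6; f++) {
            int opp = f ^ 1;   /* the opposing face: -i pairs with +i */
            MPI_Sendrecv(sendbuf[f], count[f], MPI_DOUBLE,
                         neighbor[f], f,
                         recvbuf[opp], count[opp], MPI_DOUBLE,
                         neighbor[opp], f,
                         comm, MPI_STATUS_IGNORE);
        }
    }

    /* In the relaxation loop the exchange is triggered only every
     * nexch iterations: */
    void relax_block(int niter, int nexch)
    {
        for (int it = 1; it <= niter; it++) {
            /* ... point-implicit sweep over this block's cells ... */
            if (it % nexch == 0)
                /* pack faces, call exchange_faces(), unpack into the
                 * boundary-data storage for the six faces */;
        }
    }

Using the combined send/receive avoids the deadlock that can occur if all blocks post blocking sends to each other first.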
Results

The performance of the distributed-memory, message-passing version of LAURA is examined in terms of computational speed and convergence. Measuring the elapsed wall clock time of the code on different machines estimates the communication overhead and message-passing efficiency of the code. The communication overhead associated with exchanging boundary data between nodes depends on the parallel machine, the size of the problem, and the frequency of exchanges. The frequency of data exchanges may be decreased if necessary to reduce the communication penalty, but this may adversely affect convergence. Therefore, the impact of boundary data exchange frequency on convergence is determined for several realistic vehicles and flow conditions.

Computational Speed

Timing estimates using the message-passing version of LAURA are presented for an IBM SP2, an SGI Origin 2000, an SGI multiprocessor machine, and a network of SGI R10000 workstations. The single-node performance of LAURA on a cache-based (as opposed to vector) architecture is not addressed. Viscous, perfect gas computations are performed on the forebody of an X-33 [10, 11] configuration with a grid size of 64 x 56 x 64. The computational domain is split along each of the coordinate directions (depending on the number of nodes) into blocks of equal size. The individual block sizes are shown in Table 1.

Table 1 Block sizes for timing study.

    Nodes    Block
      2      32 x 56 x 64
      4      32 x 28 x 64
      8      32 x 28 x 32
     16      16 x 28 x 32
     32      16 x 14 x 32
     64      16 x 14 x 16
    128       8 x 14 x 16

Because the blocks are equal in size, boundary data exchanges are synchronized at a specified iteration interval for convenience. Each run begins with a partially converged solution and is run for 200 iterations with second order accuracy. Two values (1 and 20) are used for nexch, the number of iterations between boundary data exchanges, to estimate the communication overhead on each machine. The number of iterations that the Jacobian is held fixed, njcobian, is equal to 20 and represents a typical value for solutions that are partially converged.
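The equal-size decomposition in Table 1 is consistent with halving one grid dimension per doubling of the node count, cycling through the i, j, and k directions in turn. The short sketch below reproduces the table under that assumption (the actual splitting utility may differ):

    /* Sketch: split a 64 x 56 x 64 grid into 2..128 equal blocks by
     * halving the i, j, and k dimensions in round-robin order. */
    #include <stdio.h>

    int main(void)
    {
        int dim[3] = {64, 56, 64};   /* X-33 forebody grid */
        int d = 0;                   /* direction to halve next */
        for (int nodes = 2; nodes <= 128; nodes *= 2) {
            dim[d] /= 2;             /* assumes even divisibility */
            d = (d + 1) % 3;
            printf("%4d nodes: %2d x %2d x %2d\n",
                   nodes, dim[0], dim[1], dim[2]);
        }
        return 0;
    }

Its output matches the block sizes listed in Table 1, from 32 x 56 x 64 on 2 nodes down to 8 x 14 x 16 on 128 nodes.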
Four different architectures are used to obtain timing estimates. The first is a 160-node IBM SP2 located at the Numerical Aerospace Simulation (NAS) Facility at NASA Ames using IBM's implementation of MPI. The second is a 64-processor (R10000) SGI Origin 2000 also located at NAS using SGI's version of MPI. The third is a 12-processor (R10000) SGI machine operating in a multiuser environment, and the fourth is a network of SGI R10000 workstations connected by
Ethernet. Both of these SGI machines use the MPICH implementation of MPI [27]. On all architectures, the MPI-defined timer, MPI_WTIME, is used to measure elapsed wall clock time for the main algorithm only. The time to read and write restart files and to perform pre- and post-processing is not measured although it may account for a significant fraction of the total time. Compiler options include '-O3 -qarch=pwr2' on the IBM SP2 and '-O2 -n32 -mips4' on the SGI machines. No effort is made to optimize the single-node performance of LAURA on these cache-based architectures.
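The timing measurement described above might be structured as follows; the barrier and the max-reduction used to report a single elapsed time per run are assumptions for illustration (only the use of MPI_WTIME is specified in the text):

    /* Sketch: time the main relaxation loop with the MPI wall-clock
     * timer, excluding restart-file I/O and pre-/post-processing. */
    #include <mpi.h>

    double timed_relaxation(MPI_Comm comm)
    {
        MPI_Barrier(comm);               /* start the ranks together */
        double t0 = MPI_Wtime();

        /* ... 200 iterations of the point-implicit algorithm ... */

        double local = MPI_Wtime() - t0; /* this rank's elapsed time */
        double elapsed;
        MPI_Allreduce(&local, &elapsed, 1, MPI_DOUBLE, MPI_MAX, comm);
        return elapsed;                  /* slowest rank's time */
    }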
Fig. 3 Elapsed wall clock time on IBM SP2.
Fig. 4 Elapsed wall clock time on SGI Origin 2000.
Fig. 5 Elapsed wall clock time on SGI multiprocessor.
Fig. 6 Elapsed wall clock time on network of SGI R10000 workstations.

Figures 3-6 display the elapsed wall clock times on the various machines. A time based on the single-node time and assuming a linear speedup equal to the number of nodes is shown for comparison. The measured times are less than the comparison time for most of the cases as a result of the smaller blocks on each node making better use of the cache. This increase in cache performance offsets the communication penalty. Improving the single-node performance of LAURA on these cache-based architectures would reduce the single-node times and give a more accurate measure of the communication overhead. Nevertheless, the speedup of the code on all machines is good. As anticipated, the relative message-passing performance on the dedicated machines (IBM SP2, SGI Origin 2000, SGI multiprocessor) is better than on the network of SGI workstations. Also, the performance with data exchanged every 20 iterations is noticeably better on the network of workstations than with data exchanged every iteration. However, there is little influence of nexch on elapsed time on the dedicated machines, which indicates that the communication overhead is very low. The degradation in performance of the 8-processor runs on the SGI multiprocessor is due to the load on the machine from other users and is not a result of the communication overhead. Of course, the times (and message-passing efficiency) measured will vary depending on machine and problem size. Also shown in Fig. 3 is the elapsed time from a multitasking run with the original version of LAURA on a CRAY C-90 using 9 CPU's. This
shows that performance comparable to current vector supercomputers may be obtained on dedicated parallel machines (albeit with more processors) using this distributed-memory version of LAURA.

Convergence

The effect of problem size, gas chemistry, and boundary data exchange frequency on convergence is examined for four realistic geometries: the X-33 [10, 11] and X-34 [12] RLV concepts, the X-33 forebody, and the Stardust sample return capsule forebody [8]. All four geometries are shown in Fig. 7. A viscous (thin-layer Navier-Stokes), perfect gas solution is computed over the X-33 and X-33 forebody configurations. The convergence of an inviscid, perfect gas solution is examined using the X-34 vehicle. Nonequilibrium air chemistry effects on the convergence and performance of the distributed-memory version of LAURA are determined from a viscous, 11-species air calculation over the Stardust capsule. For all geometries, the vehicle is defined by the k = 1 surface, and the outer boundary of the volume grid is defined by k = kmax.

Fig. 7 Vehicle geometries: a) X-33, b) X-34, c) X-33 forebody, d) Stardust capsule.

Each viscous solution is computed with the same sequence of parameters for consistency and is started with all flow-field variables initially set to their freestream values. Slightly different values are used for the inviscid solutions due to low densities on the leeside of the vehicle causing some instability when switching from first to second order accuracy. Methods to speed convergence, such as computing on a coarse grid before proceeding to the fine grid and converging blocks sequentially beginning at the nose (i.e. block marching [5]), are not used. The relevant LAURA parameters are shown in Tables 2 and 3.

Table 2 LAURA parameters - viscous.

    Iterations    Order    njcobian
    0-100         1        1
    101-300       1        2
    301-500       2        10
    >500          2        20

Table 3 LAURA parameters - inviscid.

    Iterations    Order    njcobian
    0-100         1        1
    101-300       1        2
    301-900       1        10
    901-1100      2        10
    >1100         2        20

Two values of nexch are used (except for the run involving the complete X-33 configuration). A baseline solution is generated with nexch equal to 1. Updating the boundary data every iteration should mimic the communication between blocks in the shared-memory version of LAURA. A second computation is made with nexch equal to njcobian since acceptable values for both parameters depend on transients in the flow. Solutions that are changing rapidly should update the Jacobian and exchange boundary data frequently, while partially converged solutions may be able to freeze the Jacobian and lag the boundary data for a number of iterations. A simple strategy is to link the two parameters. Convergence is measured by the L2 norm defined by

    L_2 = \frac{1}{C_N} \left[ \sum_{L=1}^{N} \frac{r_L \cdot r_L}{\rho_L^2} \right]^{1/2}    (2)

where C_N is the Courant number, N is the total number of cells, r_L is the residual vector, and \rho_L is the local density. All solutions are generated on the IBM SP2.
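In the distributed version, the sum in Eq. (2) spans cells owned by different blocks, so the per-block partial sums must be combined into a global norm. The sketch below assumes an MPI_Allreduce for this step (the reduction mechanism actually used by LAURA is not described here):

    /* Sketch: global L2 norm of Eq. (2).  Each rank sums the squared,
     * density-scaled residual over its own cells; the array layout is
     * an illustrative assumption. */
    #include <math.h>
    #include <mpi.h>

    double l2_norm(const double *r, const double *rho, int ncell,
                   int nvar, double courant, MPI_Comm comm)
    {
        double local = 0.0, global;
        for (int c = 0; c < ncell; c++) {
            double rdotr = 0.0;          /* r_L . r_L for this cell */
            for (int v = 0; v < nvar; v++) {
                double rv = r[c * nvar + v];
                rdotr += rv * rv;
            }
            local += rdotr / (rho[c] * rho[c]);
        }
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        return sqrt(global) / courant;   /* (1/C_N) [sum]^(1/2) */
    }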
X-33

The viscous, perfect gas flow field is computed over the X-33 RLV configuration (without the wake) to demonstrate the ability of the new message-passing version of LAURA to handle large-scale problems in a reasonable amount of time. The freestream Mach number is 9.2, the angle of attack is 18.1 deg, and the altitude is 48.3 km. The grid size is 192 x 168 x 64 and is divided into 64 blocks of 48 x 42 x 16.

Figure 8 shows the L2 convergence as a function of number of iterations. The elapsed wall clock time on the SP2 is 12.7 hr. Only the baseline case (nexch = 1) was computed due to resource limitations. The effect of nexch on convergence will be examined in greater detail on the nose region of this vehicle. The stall in convergence after 10000 iterations is due to a limit cycle in the convergence at the trailing edge of the tip of the canted fin. Iterations are continued past
this point to converge the boundary layer and surface heating.

Fig. 8 Convergence history of viscous, perfect gas flow field over X-33 vehicle.

X-33 forebody

The effects of boundary data exchange frequency and block splitting on convergence are evaluated for the nose section of the X-33. This is the same configuration used to obtain the timing estimates, and freestream conditions correspond to the complete X-33 vehicle case. The 64 x 56 x 64 grid is first divided in the i-, j-, and k-directions into 16 blocks comprised of 32 x 28 x 16 cells each. Two cases, nexch = 1 and nexch = njcobian, are run using this blocking. Another case is computed with the grid divided in the i- and j-directions only, resulting in blocks of 16 x 14 x 64 cells. Next, the asynchronous relaxation capabilities of LAURA are tested by reblocking a partially converged restart file in the k-direction to cluster work (and iterations) in the boundary layer. Each block has i x j dimensions of 32 x 28, but the k dimension is split into 8, 8, 16, and 32 cells. Blocks near the wall contain 32 x 28 x 8 cells, while blocks near the outer boundary have 32 x 28 x 32 cells. Thus, the smaller blocks accumulate more iterations than the larger outer blocks in a given amount of time and should converge faster.
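Since each task relaxes its own block continuously, the number of iterations a block accumulates in a fixed wall-clock interval scales roughly inversely with its cell count. Under that assumption (which ignores communication and cache effects), the k-splits above give the wall blocks a four-to-one iteration advantage over the outermost blocks:

    /* Sketch: relative iteration rates for the 8/8/16/32 k-splits,
     * assuming cost per iteration is proportional to cell count. */
    #include <stdio.h>

    int main(void)
    {
        const int ni = 32, nj = 28;
        const int ksplit[4] = {8, 8, 16, 32};  /* wall to outer edge */
        for (int b = 0; b < 4; b++) {
            double rel = 32.0 / ksplit[b];     /* vs. the outer block */
            printf("block %d (%2d x %2d x %2d): %.1fx iterations\n",
                   b, ni, nj, ksplit[b], rel);
        }
        return 0;
    }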
Fig. 9 Convergence histories of viscous, perfect gas flow field over X-33 forebody: a) convergence as function of number of iterations; b) convergence as function of time.

Figure 9 shows the convergence history for this flow field. For viscous solutions, convergence is typically divided into two stages. First, the inviscid shock layer develops, and then the majority of the iterations are spent converging the boundary layer (and surface heating). Lagging the boundary data appears to have more of an impact on the early convergence of the inviscid features of the flow and less of an impact on the boundary-layer convergence. This effect is much larger when the blocks are split in the k-direction across the shock layer. The communication delay affects the developing shock wave as it crosses the block boundaries in the k-direction.

Figure 9(b) shows the convergence as a function of wall clock time. Because of the low communication overhead on the IBM SP2, the time saved by making fewer boundary data exchanges is small. As seen from the timing data, this would not necessarily be true on a network of workstations where the decrease in communication overhead might offset any increase in number of iterations. Also shown are LAURA's asynchronous relaxation capabilities. After 1 hr (and 3500 iterations), the outer inviscid layer is partially converged. Restructuring the block structure at this point by splitting the k dimension into 8, 8, 16, and 32 cells allows the boundary layer to accumulate more iterations and accelerates convergence. The result is a 15 percent decrease in wall clock time compared to the
baseline (nexch = 1) case. A similar strategy would also have accelerated the convergence of the baseline case.

X-34

The effect of boundary data exchange frequency and block splitting on convergence of inviscid, perfect gas flows is investigated for the X-34 configuration (minus the body flap and vertical tail). Inviscid solutions are useful in predicting aerodynamic characteristics for vehicle design and may be coupled with a boundary-layer technique to predict surface heat transfer as well. The freestream Mach number is 6.32, the angle of attack is 23 deg, and the altitude is 36 km. The grid is 120 x 152 x 32 and is first divided into 32 blocks of 30 x 38 x 16 cells. The grid is also split in the i- and j-directions into blocks of 30 x 19 x 32 cells to check the effect of block structure on convergence. The convergence histories are shown in Figure 10. The aerodynamics (not shown) of the vehicle are converged at 4000 iterations. The spike in convergence at 900 iterations is caused by the switch from first to second order accuracy. With the grid split in all directions, the baseline solution (nexch = 1) reaches an L2 norm of 10^-3 at 3300 iterations while the solution with boundary data lagged takes 3640 iterations. The solution with the grid split in the i- and j-directions requires 3530 iterations. As shown in Fig. 10(b), there is a corresponding difference in run times to reach that convergence level because the savings from fewer boundary data exchanges are small on the SP2. Nevertheless, the effect of lagging the boundary data on convergence is minimal.

Fig. 10 Convergence histories of inviscid, perfect gas flow field over X-34 vehicle: a) convergence as function of number of iterations; b) convergence as function of time.

Stardust

The convergence of a nonequilibrium air (11 species, two temperature), viscous computation is examined for the forebody of the Stardust capsule. The freestream Mach number is 17 and the angle of attack is 10 deg. The grid is 56 x 32 x 60 and is divided into 32 blocks of 7 x 8 x 60 cells each. There are no splits in the k-direction. Figure 11 shows the convergence as a function of iterations and elapsed wall clock time. Because of the larger number of flow-field variables, considerably more data must be exchanged between blocks for nonequilibrium flows. Even on a dedicated parallel machine such as the IBM SP2, the communication penalty for this particular case has a significant impact on the elapsed time. The baseline case reaches an L2 norm of 10^-4 at 6900 iterations compared to 7500 iterations for the nexch = njcobian solution. However, the savings in communication time allow the nexch = njcobian solution to converge 1 hr faster than the baseline case.

Fig. 11 Convergence histories of viscous, nonequilibrium flow field over Stardust capsule: a) convergence as function of number of iterations; b) convergence as function of time.
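The extra communication for nonequilibrium flows can be quantified from the earlier count of unknowns: n = 5 for a perfect gas versus n = 4 plus the number of species for nonequilibrium chemistry, so an 11-species calculation exchanges roughly three times the data per face. A sketch, assuming 8-byte reals, a single layer of exchanged cells, and an i-face of a 7 x 8 x 60 Stardust block (the two-temperature model may carry additional energy variables beyond this count):

    /* Sketch: per-face message volume versus gas model. */
    #include <stdio.h>

    int main(void)
    {
        const int nj = 8, nk = 60;   /* i-face of a 7 x 8 x 60 block */
        const int n_pg  = 5;         /* perfect gas unknowns */
        const int n_neq = 4 + 11;    /* 11-species nonequilibrium */

        printf("perfect gas:    %zu bytes/face\n",
               (size_t)nj * nk * n_pg  * sizeof(double));
        printf("nonequilibrium: %zu bytes/face\n",
               (size_t)nj * nk * n_neq * sizeof(double));
        return 0;
    }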
Conclusions

The shared-memory, multitasking version of the CFD code LAURA has been successfully modified to take advantage of distributed-memory parallel machines. A standard domain decomposition strategy yields good speedup on dedicated parallel systems, but the single-node performance of LAURA on cache-based architectures requires further study. The point-implicit relaxation strategy in LAURA is well-suited for parallel computing and allows the communication overhead to be minimized (if necessary) by reducing the frequency of boundary data exchanges. The communication overhead is greatest on the network of workstations and for nonequilibrium flows due to more data passing between nodes. Lagging the boundary data between blocks appears to affect the development of the inviscid shock layer more than the convergence of the boundary layer. Its largest effect occurs when
the grid is split in the direction normal to the vehicle surface. However, restructuring the blocks to cluster work and iterations in the boundary layer improves overall convergence once the inviscid features of the flow have developed. These results demonstrate the ability of the new message-passing version of LAURA to effectively use distributed-memory parallel systems for realistic configurations. As a result, the effectiveness of LAURA as an aerospace design tool is enhanced by its new parallel computing capabilities. In fact, this new version of LAURA is currently being applied to the evaluation of vehicles used in planetary exploration missions and the X-33 program.

Acknowledgements

The authors wish to acknowledge Peter Gnoffo of the Aerothermodynamics Branch at NASA LaRC for his assistance with the inner workings of LAURA and Jerry Mall of Computer Sciences Corporation for his help in profiling LAURA on the IBM SP2.

References

1. Gnoffo, P. A., "An Upwind-Biased, Point-Implicit Relaxation Algorithm for Viscous, Compressible Perfect-Gas Flows," NASA TP-2953, Feb. 1990.
2. Gnoffo, P. A., "Upwind-Biased, Point-Implicit Relaxation Strategies for Viscous, Hypersonic Flows," AIAA Paper 89-1972, Jun. 1989.
3. Gnoffo, P. A., "Code Calibration Program in Support of the Aeroassist Flight Experiment," Journal of Spacecraft and Rockets, Vol. 27, No. 2, 1990, pp. 131-142.
4. Weilmuenster, K. J. and Greene, F. A., "HL-20 Computational Fluid Dynamics Analysis," Journal of Spacecraft and Rockets, Vol. 30, No. 5, 1993, pp. 558-566.
5. Gnoffo, P. A., Weilmuenster, K. J., and Alter, S. J., "Multiblock Analysis for Shuttle Orbiter Re-Entry Heating From Mach 24 to Mach 12," Journal of Spacecraft and Rockets, Vol. 31, No. 3, 1994, pp. 367-377.
6. Mitcheltree, R. A. and Gnoffo, P. A., "Wake Flow About the Mars Pathfinder Entry Vehicle," Journal of Spacecraft and Rockets, Vol. 32, No. 5, 1994, pp. 771-776.
7. Weilmuenster, K. J., Gnoffo, P. A., Greene, F. A., Riley, C. J., Hamilton, H. H., and Alter, S. J., "Hypersonic Aerodynamic Characteristics of a Proposed Single-Stage-to-Orbit Vehicle," Journal of Spacecraft and Rockets, Vol. 33, No. 4, 1995, pp. 463-469.
8. Mitcheltree, R. A., Wilmoth, R. G., Cheatwood, F. M., Brauckmann, G. J., and Greene, F. A., "Aerodynamics of Stardust Sample Return Capsule," AIAA Paper 97-2304, Jun. 1997.
9. Mitcheltree, R. A., Moss, J. N., Cheatwood, F. M., Greene, F. A., and Braun, R. D., "Aerodynamics of the Mars Microprobe Entry Vehicles," AIAA Paper 97-3658, Aug. 1997.
10. Cook, S. A., "X-33 Reusable Launch Vehicle Structural Technologies," AIAA Paper 96-4573, Nov. 1996.
11. Gnoffo, P. A., Weilmuenster, K. J., Hamilton, H. H., Olynick, D. R., and Venkatapathy, E., "Computational Aerothermodynamic Design Issues for Hypersonic Vehicles," AIAA Paper 97-2473, Jun. 1997.
12. Levine, J., "NASA X-34 Program," Meeting Papers on Disc A9710806, AIAA, Nov. 1996.
13. Cheatwood, F. M. and Gnoffo, P. A., "User's Manual for the Langley Aerothermodynamic Upwind Relaxation Algorithm (LAURA)," NASA TM-4674, Apr. 1996.
14. Gnoffo, P. A., "Asynchronous, Macrotasked Relaxation Strategies for the Solution of Viscous, Hypersonic Flows," AIAA Paper 91-1579, Jun. 1991.
15. Jayasimha, D. N., Hayder, M. E., and Pillay, S. K., "An Evaluation of Architectural Platforms for Parallel Navier-Stokes Computations," NASA CR-198308, Mar. 1996.
16. Venkatakrishnan, V., "Parallel Implicit Unstructured Grid Euler Solvers," AIAA Paper 94-0759, Jan. 1994.
17. Wong, C. C., Blottner, F. G., Payne, J. L., and Soetrisno, M., "A Domain Decomposition Study of Massively Parallel Computing in Compressible Gas Dynamics," AIAA Paper 95-0572, Jan. 1995.
18. Borrelli, S., Schettino, A., and Schiano, P., "Hypersonic Nonequilibrium Parallel Multiblock Navier-Stokes Solver," Journal of Spacecraft and Rockets, Vol. 33, No. 5, 1996, pp. 748-750.
19. Domel, N. D., "Research in Parallel Algorithms and Software for Computational Aerosciences," NAS 96-004, Apr. 1996.
20. Van der Wijngaart, R. F. and Yarrow, M., "RANS-MP: A Portable Parallel Navier-Stokes Solver," NAS 97-004, Feb. 1997.
21. Message Passing Interface Forum, "MPI: A Message-Passing Interface Standard," Computer Science Dept. Technical Report CS-94-230, University of Tennessee, Knoxville, TN, 1994.
22. Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V., "PVM 3.0 User's Guide and Reference Manual," Tech. rep., Feb. 1993.
23. Balasubramanian, R., "Modification of Program LAURA to Execute in PVM Environment," Spectrex Report 95.10.01, Oct. 1995.
24. Roe, P. L., "Approximate Riemann Solvers, Parameter Vectors, and Difference Schemes," Journal of Computational Physics, Vol. 43, No. 2, 1981, pp. 357-372.
25. Harten, A., "High Resolution Schemes for Hyperbolic Conservation Laws," Journal of Computational Physics, Vol. 49, No. 3, 1983, pp. 357-393.
26. Yee, H. C., "On Symmetric and Upwind TVD Schemes," NASA TM-86842, Sep. 1985.
27. Gropp, W. and Lusk, E., "User's Guide for mpich, a Portable Implementation of MPI," Tech. Rep. ANL/MCS-TM-ANL-96/6, Argonne National Laboratory, 1996.
28. Burns, G., Daoud, R., and Vaigl, J., "LAM: An Open Cluster Environment for MPI," Tech. rep., Ohio Supercomputing Center, May 1994.