Performance and Power Co-Design of Exascale Systems and Applications

Adolfy Hoisie
Work with Kevin Barker, Darren Kerbyson, Abhinav Vishnu
Performance and Architecture Lab (PAL), Pacific Northwest National Laboratory
5th Parallel Tools Workshop, Dresden, September 27, 2011

Outline
– Static performance modeling
– Dynamic modeling
– Modeling for Exascale
– Tentative conclusions

The fallacy of simple metrics: efficiency

Example 1: Efficiency of applications

             Solver Flops (% of total)   Total Flops    Mflop/s   % of Peak   Time (s)
  Original   64%                         29.8 x 10^9    448.8     5.6%        66.351
  Optimized  25%                         8.2 x 10^9     257.7     3.2%        31.905

The optimized solver achieves a lower Mflop/s rate and a lower fraction of peak, yet runs more than twice as fast: efficiency alone is a misleading metric.

Example 2: Efficiency of systems
– Code A on Machine X (500 MFLOPS peak per CPU, 2 FLOPS per CP):
  » Time = 522 sec; MFLOPS = 26.1 (5.2% of peak)
– Code A on Machine Y (3600 MFLOPS peak per CPU, 4 FLOPS per CP):
  » Time = 91.1 sec; MFLOPS = 113.0 (3.1% of peak)
Machine Y runs the code 5.7x faster while achieving a lower percentage of peak.

Rough taxonomy of modeling
– Simulation
  » Greatest architectural flexibility, but impractical for real applications
– Trace-driven experiments
  » Results often lack generality
– Quasi-analytical modeling
  » Can tackle full applications on full machines
  » Uses a set of input knobs
  » Tool-neutral
– Benchmarking
  » Limited to the current implementation of the code
  » Limited to currently available architectures
  » Difficult to distinguish real performance from machine idiosyncrasies

Attributes of a Performance Model
– Encapsulates application behavior
  » Abstracts the application into communication and computation components
  » Focuses on first-order effects, ignoring distracting details
– Separates performance concerns
  » Inherent properties of the application structure (e.g., data dependencies)
  » System performance characteristics (e.g., MPI latency)
[Diagram: code + problem configuration feed an application model; system + system configuration feed a system model; together they yield a performance prediction that can be compared against actual execution.]

A Performance Modeling Process Flow
– Identify application characteristics: data structures, decomposition, parallel activities, frequency of use, memory usage
– Construct (or refine) the model
– Determine system performance parameters: run benchmarks and micro-benchmarks on the system(s); use specifications and future (promised) performance
– Validate: run the code on the system and compare measured with modeled performance
– Once the model can be trusted, use it to verify current application performance, test new configurations (HW and/or SW), and propose future systems

Partial list of modeled systems & codes

Machines: ASCI Q, ASCI BlueMountain, ASCI White, ASCI Red, CRAY T3E, Earth Simulator, Itanium-2 cluster, BlueGene/L, BlueGene/P, CRAY X-1, ASC Red Storm, ASC Purple, IBM PERCS, IBM Blue Waters, Clearspeed accelerators, SiCortex SC5832, Roadrunner, Jaguar, ...

Codes: SWEEP3D, SAGE, TYCHO, Partisn, LBMHD, HYCOM, MCNP, POP, KRAK, RF-CTH, CICE, S3D, VPIC, GTC, ...

Modeling in action as a co-design process – IBM PERCS
– IBM: simulated application run-time (1 PE, 1 chip) from the PERCS simulator
– PNNL: large-scale performance model driven by cores per chip, network topology, latency, bandwidth, contention, ...
– Output: performance predictions feeding back into the system design
– Modeling was used to explore and guide the design of PERCS with an application suite (HPCS phases 1 & 2)
– The design feedback loop was exercised with increasing speed
– Numerous configurations and options were explored

Topology comparison through co-design
Example: 2,048 PE job (256-node system, 64-way)
– FC: fully-connected, 1 hop
– OCS: 1 hop or 2 hops
– 2D, 3D meshes
– FT: fat-tree
– OCS-D: OCS-Dynamic
Codes compared: HYCOM, LBMHD, RF-CTH2, KRAK, SAGE, Sweep3D, POP
[Chart: runtime ratio vs. best network (1.0-2.0) for each topology and each code; assumes best hardware latency of 50 ns and 4 GB/s links.]
The chart shows the performance of each network relative to the best-performing network.
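To make the topology comparison concrete, below is a minimal quasi-analytical sketch in the spirit of the slide: per-iteration time is modeled as compute plus per-message communication cost. This is not the PAL model; the hop counts, message sizes, message counts, and compute time are illustrative assumptions, and only the 50 ns best hardware latency and 4 GB/s links come from the slide.

  # Minimal quasi-analytical sketch of the topology comparison above.
  # Hop counts, message sizes/counts, and compute time are assumptions.

  TOPOLOGY_HOPS = {        # assumed average switch hops per message
      "FC": 1,             # fully-connected: 1 hop
      "OCS": 2,            # OCS: 1 or 2 hops; assume the worst case
      "2D mesh": 16,       # ~sqrt(256 nodes) average distance (rough)
      "3D mesh": 5,        # ~256^(1/3) average distance (rough)
      "FT": 4,             # fat-tree: up and down through two levels
  }

  LATENCY = 50e-9          # best hardware latency per hop (s), from slide
  BANDWIDTH = 4e9          # link bandwidth (bytes/s), from slide

  def step_time(compute_s, msg_bytes, n_msgs, topology):
      # Predicted time per iteration: compute plus, per message,
      # hops * latency + size / bandwidth.
      per_msg = TOPOLOGY_HOPS[topology] * LATENCY + msg_bytes / BANDWIDTH
      return compute_s + n_msgs * per_msg

  # Runtime ratio of each topology to the best one, as in the chart.
  times = {t: step_time(50e-6, 1024, 100, t) for t in TOPOLOGY_HOPS}
  best = min(times.values())
  for topo, t in sorted(times.items(), key=lambda kv: kv[1]):
      print(f"{topo:8s} runtime ratio vs. best = {t / best:.2f}")

Even a toy model of this form lets latency, bandwidth, and hop-count knobs be swept across candidate topologies before any hardware exists, which is the essence of the co-design loop described above.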
Modeling as a co-design tool

Where is the time being spent?
– ~63% Compute on Cell
– ~20% Latency (Cell <-> AMD)
– ~5% Bandwidth (Cell <-> AMD)
– ~8% Latency (Infiniband)
– ~3% Bandwidth (Infiniband)
[Chart: stacked time breakdown (0-100%) versus node count (in CUs, powers of two from 1 to 128), with components Compute_Block (Cell), Compute_Pipe (Cell), AMD <-> Cell (Latency), AMD <-> Cell (Bandwidth), Inter-node (Latency), Inter-node (Bandwidth).]
– The pipeline is unavoidable
– Latency dominates communication (Cell <-> AMD is the major component)
– Uses 'probable' HW parameters
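Below is a minimal sketch of the kind of time-breakdown analysis shown above, for a hybrid Cell + AMD + InfiniBand node in the style of Roadrunner. The parameter values are illustrative assumptions chosen to reproduce the slide's approximate 63/20/5/8/3 split; they are not the actual model inputs, and only the component names come from the slide.

  # Minimal sketch of a per-component time-breakdown model.
  # All default values are assumptions tuned to match the slide's split.

  def breakdown(t_compute=630e-6,      # Cell compute per step (assumed)
                transfers=40,          # AMD <-> Cell transfers per step
                t_cell_lat=5e-6,       # AMD <-> Cell latency (assumed)
                cell_bytes=32e3,       # bytes per AMD <-> Cell transfer
                cell_bw=25.6e9,        # AMD <-> Cell bandwidth (bytes/s)
                messages=40,           # InfiniBand messages per step
                t_ib_lat=2e-6,         # InfiniBand latency (assumed)
                ib_bytes=2e3,          # bytes per InfiniBand message
                ib_bw=2e9):            # InfiniBand bandwidth (bytes/s)
      # Each component's share of the total modeled time, in percent.
      parts = {
          "Compute (Cell)":          t_compute,
          "AMD <-> Cell latency":    transfers * t_cell_lat,
          "AMD <-> Cell bandwidth":  transfers * cell_bytes / cell_bw,
          "Inter-node latency":      messages * t_ib_lat,
          "Inter-node bandwidth":    messages * ib_bytes / ib_bw,
      }
      total = sum(parts.values())
      return {name: 100.0 * t / total for name, t in parts.items()}

  for name, pct in breakdown().items():
      print(f"{name:24s} {pct:5.1f}%")

Decomposing predicted time this way is what lets the model answer "where is the time being spent?" for hardware that can still be changed: if Cell <-> AMD latency dominates, that interface, not raw compute, is the design lever.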