Cluster performance: how to get the most out of Abel
Ole W. Saastad, Dr.Scient
USIT / UAV / FI
April 18th 2013

Introduction
• Architecture: x86-64 and NVIDIA
• Compilers
• MPI
• Interconnect
• Storage
• Batch queue system

Installed compute hardware
• 630 Supermicro nodes
• Two-socket Intel E5-2670, 2.6 GHz, octa core
• 64 GiB memory
• FDR InfiniBand

Universitetets senter for informasjonsteknologi

Compute nodes - performance
• CPU performance
  – 332 Gflops/s theoretical
  – 318 Gflops/s practical – HPL (top500)
• Memory bandwidth
  – 63 GiB/s practical (STREAM)
• Memory latency
  – 115 nanoseconds (random access)

Node performance
High Performance Linpack performance (top500 test):

T/V               N    NB  P  Q     Time     Gflops
--------------------------------------------------------------------------------
WR11R2R4      87500   180  4  4  1404.46  3.180e+02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0033152 ...... PASSED
================================================================================

[olews@login-0-0 hpl]$ ./xhpl-max.sh HPL.single.node.log
No clock freq given, setting it to 2.6 GHz
High perf. linpack results:
Params      size  block  nxm     time  Tflops  %peak
WR11R2R4   87500    180  4x4  1404.46   0.318   95.6
WR11R2R4   87500    200  4x4  1408.52   0.317   95.3
WR11R2R4   85000    200  4x4  1301.79   0.315   94.5

Installed compute hardware, GPU
• 16 Supermicro nodes with GPUs
• Two sockets, two GPUs
• Intel E5-2670, 2.2 GHz, quad core
• 64 GiB memory
• FDR InfiniBand
• Tesla K20Xm: 6 GiB memory, 2688 SP cores, 896 DP cores

Node performance, all hardware
High Performance Linpack performance (top500 test):

T/V               N     NB  P  Q    Time     Gflops
--------------------------------------------------------------------------------
WR10L2L2      85000   1280  1  2  223.20  1.844e+03
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0033152 ...... PASSED
================================================================================

[olews@login-0-0 hpl]$ ./xhpl-max.sh HPL.single.node.log
High perf. linpack results:
Params      size  block  nxm    time  Tflops  %peak
WR10L2L2   85000   1280  1x2  223.20   1.844   66.5
WR10L2L2   85000   1024  1x2  224.62   1.823   65.7
WR10L2L2   85000   1408  1x2  232.41   1.762   63.5

Node performance, two K20Xm
DGEMM performance GPU vs.
CPU (Tesla K20X vs. Intel Sandy Bridge).

[Figure: SGEMM and DGEMM performance, CUDA BLAS vs. MKL BLAS, plotted against total matrix footprint in MiB. Single precision (32 bit): the GPU reaches about 2.6 Tflops/s. Double precision (64 bit): the GPU reaches about 1 Tflops/s.]

InfiniBand Basics
• Ping-pong latency, key for performance
  – Intra rack: 0.95 microseconds
  – Inter rack: 1.40 microseconds
• Ping-pong bandwidth
  – 6.14 GiB/s
• All numbers measured under full production using OpenMPI

InfiniBand Basics
• TCP/IP over InfiniBand – IPoIB on all nodes
• Both GbE and IB interfaces, named eth0 and ib0
• Example
  – Node compute-x-y has two interfaces
  – cx-y is the eth0 interface
  – ib-x-y is the ib0 interface

scp gamessplus.tar.bz2 compute-9-1.local:/tmp/
gamessplus.tar.bz2                  100%  722MB  55.5MB/s  00:13
scp gamessplus.tar.bz2 ib-9-1:/tmp/
gamessplus.tar.bz2                  100%  722MB  90.2MB/s  00:08
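The CPU peak figures quoted on the "Compute nodes - performance" slide can be reproduced from the node specification. A minimal sketch, assuming Sandy Bridge's 8 double-precision flops per core per cycle with AVX (that per-cycle factor is an assumption, not stated on the slides):

```python
# Theoretical DP peak of one Abel compute node: dual-socket E5-2670,
# 8 cores per socket at 2.6 GHz.  The 8 flops/cycle/core factor (AVX on
# Sandy Bridge) is an assumption; it does not appear on the slides.
sockets, cores, ghz, flops_per_cycle = 2, 8, 2.6, 8
peak_gflops = sockets * cores * ghz * flops_per_cycle
print(f"Theoretical peak: {peak_gflops:.1f} Gflops/s")  # 332.8, quoted as 332

# HPL efficiency from the measured 318 Gflops/s (slide table: %peak 95.6)
hpl_gflops = 318.0
print(f"HPL efficiency:   {100 * hpl_gflops / peak_gflops:.1f} %")  # 95.6
```

The same arithmetic explains why the GPU nodes show a much lower %peak in their HPL table: accelerator runs are measured against the combined CPU+GPU theoretical peak, which is far harder to approach than a CPU-only peak.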