Collective communication | 2.5D algorithms | Modelling exascale

Improving communication performance in dense linear algebra via topology-aware collectives

Edgar Solomonik∗, Abhinav Bhatele†, and James Demmel∗
∗ University of California, Berkeley
† Lawrence Livermore National Laboratory
Supercomputing, November 2011

Edgar Solomonik, Mapping dense linear algebra

Outline
- Collective communication
  - Rectangular collectives
- 2.5D algorithms
  - 2.5D Matrix Multiplication
  - 2.5D LU factorization
- Modelling exascale
  - Multicast performance
  - MM and LU performance

Performance of multicast (BG/P vs Cray)
[Figure: 1 MB multicast on BG/P, Cray XT5, and Cray XE6. x-axis: #nodes (8 to 4096); y-axis: Bandwidth (MB/sec) (128 to 8192); series: BG/P, XE6, XT5.]

Why the performance discrepancy in multicasts?
- Cray machines use binomial multicasts
  - Form spanning tree from a list of nodes
  - Route copies of message down each branch
  - Network contention degrades utilization on a 3D torus
- BG/P uses rectangular multicasts
  - Require network topology to be a k-ary n-cube
  - Form 2n edge-disjoint spanning trees
  - Route in different dimensional order
  - Use both directions of bidirectional network

2D rectangular multicast trees
[Figure: 2D 4x4 torus with root; spanning trees 1, 2, 3, and 4; all 4 trees combined.]
[Watts and Van De Geijn 95]

Another look at that first plot
How much better are rectangular algorithms on P = 4096 nodes?
[Figure: 1 MB multicast on BG/P, Cray XT5, and Cray XE6. x-axis: #nodes (8 to 4096); y-axis: Bandwidth (MB/sec) (128 to 8192); series: BG/P, XE6, XT5.]
- Binomial collectives on XE6
  - 1/30th of link bandwidth
- Rectangular collectives on BG/P
  - 4X the link bandwidth
- 120X improvement in efficiency!

How can we apply this?
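
The binomial multicast described on the slides above, and the 120X figure, can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `binomial_multicast_schedule` and `efficiency_ratio` are hypothetical, ranks are abstract integers with no torus mapping, and routing and contention are not modelled. It shows only that a binomial tree over P nodes needs ceil(log2 P) rounds, with the root sending a full copy in every round, and that 4X link bandwidth against 1/30th of link bandwidth gives the stated ratio.

```python
from fractions import Fraction


def binomial_multicast_schedule(num_nodes, root=0):
    """Return the rounds of a binomial-tree multicast.

    Each round is a list of (sender, receiver) rank pairs. Ranks are
    abstract integers; no network topology is assumed here.
    """
    rounds = []
    have = 1  # number of relative ranks that already hold the message
    while have < num_nodes:
        sends = []
        for rel in range(have):
            dst = rel + have
            if dst < num_nodes:
                # Shift relative ranks so that the tree is rooted at `root`.
                sends.append(((rel + root) % num_nodes,
                              (dst + root) % num_nodes))
        rounds.append(sends)
        have *= 2  # the set of holders doubles every round
    return rounds


def efficiency_ratio(rectangular_fraction, binomial_fraction):
    """Ratio of two achieved bandwidths, each given as a fraction of link bandwidth."""
    return Fraction(rectangular_fraction) / Fraction(binomial_fraction)


if __name__ == "__main__":
    for round_number, sends in enumerate(binomial_multicast_schedule(8)):
        print(round_number, sends)
    # Figures quoted on the slide: 4X link bandwidth vs 1/30th of link bandwidth.
    print(efficiency_ratio(4, Fraction(1, 30)))
```

For P = 4096 the schedule has 12 rounds. The rectangular scheme instead pipelines the message over 2n edge-disjoint trees, which this sketch does not attempt to construct.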
Matrix multiplication
[Figure: blocks of matrices A and B.]

2D matrix multiplication
[Figure: 16 CPUs (4x4) with blocks of matrices A and B.]
[Cannon 69], [Van De Geijn and Watts 97]

3D matrix multiplication
[Figure: 64 CPUs (4x4x4) with blocks of matrices A and B; 4 copies of matrices.]
[Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Bernsten 89]

2.5D matrix multiplication
[Figure: 32 CPUs (4x4x2) with blocks of matrices A and B; 2 copies of matrices.]
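
The three processor layouts on the slides above (4x4, 4x4x4, 4x4x2) fit a single pattern, which a short Python sketch can make concrete. It assumes the grid shape sqrt(P/c) x sqrt(P/c) x c for P processors and c copies of the matrices; the slides show only the three instances, so the general formula is an assumption here. The helper names `grid_shape` and `replicated_matmul` are hypothetical. The multiplication is a serial stand-in: each of the c layers sums over its own slice of the inner dimension and the layers are then added together. It does not reproduce the per-layer shifting or broadcast schedule of the actual 2.5D algorithm.

```python
import math


def grid_shape(num_procs, copies):
    """Return (q, q, copies) such that q * q * copies == num_procs."""
    if copies <= 0 or num_procs % copies != 0:
        raise ValueError("copies must divide num_procs")
    q = math.isqrt(num_procs // copies)
    if q * q * copies != num_procs:
        raise ValueError("num_procs / copies must be a perfect square")
    return (q, q, copies)


def partial_product(a, b, inner_indices):
    """Contribution to A*B from the given indices of the inner dimension."""
    rows, cols = len(a), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in inner_indices)
             for j in range(cols)]
            for i in range(rows)]


def replicated_matmul(a, b, copies):
    """Multiply A and B by splitting the inner dimension over `copies` layers.

    Layer l handles inner indices l, l + copies, l + 2*copies, ...
    The final sum plays the role of a reduction across layers.
    """
    inner = len(b)
    partials = [partial_product(a, b, range(layer, inner, copies))
                for layer in range(copies)]
    rows, cols = len(a), len(b[0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)]
            for i in range(rows)]


if __name__ == "__main__":
    print(grid_shape(16, 1))   # 2D layout from the slides
    print(grid_shape(64, 4))   # 3D layout from the slides
    print(grid_shape(32, 2))   # 2.5D layout from the slides
    print(replicated_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2))
```

Setting `copies` to 1 gives the 2D case and setting it to the cube root of P gives the 3D case; values in between correspond to the 2.5D layouts.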