ebook img

Improving communication performance in dense linear algebra via topology-aware collectives PDF

36 Pages·2011·2.84 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Improving communication performance in dense linear algebra via topology-aware collectives

Collectivecommunication 2.5Dalgorithms Modellingexascale Improving communication performance in dense linear algebra via topology-aware collectives Edgar Solomonik∗, Abhinav Bhatele†, and James Demmel∗ ∗ UniversityofCalifornia,Berkeley † LawrenceLivermoreNationalLaboratory Supercomputing, November 2011 EdgarSolomonik Mappingdenselinearalgebra 1/29 Collectivecommunication 2.5Dalgorithms Modellingexascale Outline Collective communication Rectangular collectives 2.5D algorithms 2.5D Matrix Multiplication 2.5D LU factorization Modelling exascale Multicast performance MM and LU performance EdgarSolomonik Mappingdenselinearalgebra 2/29 Collectivecommunication 2.5Dalgorithms Rectangularcollectives Modellingexascale Performance of multicast (BG/P vs Cray) 1 MB multicast on BG/P, Cray XT5, and Cray XE6 8192 BG/P XE6 4096 XT5 ) c e s 2048 B/ M ( h 1024 dt wi d 512 n a B 256 128 8 64 512 4096 #nodes EdgarSolomonik Mappingdenselinearalgebra 3/29 Collectivecommunication 2.5Dalgorithms Rectangularcollectives Modellingexascale Why the performance discrepancy in multicasts? (cid:73) Cray machines use binomial multicasts (cid:73) Form spanning tree from a list of nodes (cid:73) Route copies of message down each branch (cid:73) Network contention degrades utilization on a 3D torus (cid:73) BG/P uses rectangular multicasts (cid:73) Require network topology to be a k-ary n-cube (cid:73) Form 2n edge-disjoint spanning trees (cid:73) Route in different dimensional order (cid:73) Use both directions of bidirectional network EdgarSolomonik Mappingdenselinearalgebra 4/29 Collectivecommunication 2.5Dalgorithms Rectangularcollectives Modellingexascale 2D rectangular multicasts trees 2D 4X4 Torus Spanning tree 1 Spanning tree 2 root Spanning tree 3 Spanning tree 4 All 4 trees combined [Watts and Van De Geijn 95] EdgarSolomonik Mappingdenselinearalgebra 5/29 Collectivecommunication 2.5Dalgorithms Rectangularcollectives Modellingexascale Another look at that first plot How much better are rectangular algorithms on P = 4096 nodes? 1 MB multicast on BG/P, Cray XT5, and Cray XE6 (cid:73) Binomial collectives on XE6 8192 BG/P (cid:73) 1/30th of link 4096 XXET65 bandwidth ec) B/s 2048 (cid:73) Rectangular collectives on h (M 1024 BG/P widt nd 512 (cid:73) 4X the link bandwidth Ba 256 (cid:73) 120X improvement in 128 efficiency! 8 64 512 4096 #nodes How can we apply this? EdgarSolomonik Mappingdenselinearalgebra 6/29 Collectivecommunication 2.5DMatrixMultiplication 2.5Dalgorithms 2.5DLUfactorization Modellingexascale Matrix multiplication B A B A B A B A EdgarSolomonik Mappingdenselinearalgebra 7/29 Collectivecommunication 2.5DMatrixMultiplication 2.5Dalgorithms 2.5DLUfactorization Modellingexascale 2D matrix multiplication B A 16 CPUs (4x4) B CPU CPU CPU CPU CPU CPU CPU CPU A CPU CPU CPU CPU CPU CPU CPU CPU B A B A [Cannon 69], [Van De Geijn and Watts 97] EdgarSolomonik Mappingdenselinearalgebra 8/29 Collectivecommunication 2.5DMatrixMultiplication 2.5Dalgorithms 2.5DLUfactorization Modellingexascale 3D matrix multiplication 64 CPUs (4x4x4) CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU B CPU CPU CPU CPU A CPU CPU CPU CPU CPU CPU CPU CPU B CPU CPU CPU CPU CPU CPU CPU CPU A CPU CPU CPU CPU B CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU A B CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU A 4 copies of matrices [Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Bernsten 89] EdgarSolomonik Mappingdenselinearalgebra 9/29 Collectivecommunication 2.5DMatrixMultiplication 2.5Dalgorithms 2.5DLUfactorization Modellingexascale 2.5D matrix multiplication B 32 CPUs (4x4x2) A CPU CPU CPU CPU CPU CPU CPU CPU B CPU CPU CPU CPU CPU CPU CPU CPU A CPU CPU CPU CPU B CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU A B 2 copies of matrices A EdgarSolomonik Mappingdenselinearalgebra 10/29

Description:
Collective communication. 2.5D algorithms Improving communication performance in dense . [Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Bernsten 89] . The setup overhead is overlapped with the path overhead.
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.