Batch solution of scalar and block-tridiagonal equations on many-core accelerators

Endre László (1,2)
1 University of Oxford, Oxford e-Research Centre, Oxford, United Kingdom
2 Pázmány Péter Catholic University, Faculty of Information Technology, Budapest, Hungary

23 October 2013
Many-Core Seminar Series, OeRC, University of Oxford


Overview
• ADI (Alternating Direction Implicit) Finite Difference scheme
  – Solving the heat equation in 3D
  – Practical application
  – Tridiagonal systems of equations
• Solving tridiagonal systems
  – Motivation for an efficient solver
  – Memory access patterns of the tridiagonal solver
  – Algorithms: Thomas/TDMA, CR/PCR, RD/PP
• Thomas (TDMA) algorithm
  – Optimization on GPUs:
    1. shared memory – transposition
    2. register shuffle – transposition
  – Optimization on AVX and Xeon Phi
• CR/PCR (Parallel Cyclic Reduction)
  – Optimization on GPUs:
    1. CR in shared memory – Mike Giles's CUDA course practical code
    2. PCR with register shuffle – Jeremy Appleyard's code based on Mike Giles's BS solver
• Benchmark for batch scalar solvers
• Conclusions for batch scalar solvers
• Block-tridiagonal solver based on the Thomas algorithm
  – Optimizations
  – Benchmarks


ADI
• Classical FD scheme
  – Cheaper than Crank-Nicolson
  – Relies on operator factoring
  – O(Δt², Δx²), i.e. second-order accurate in both time and space
  – Unconditionally stable if the parameters are chosen correctly: heat conductance is positive
  – Originally introduced by Peaceman and Rachford¹
  – Many variants exist
• Assumptions:
  – Computational (spatial) domain is >2D – 3D in the current case
  – "Cubeish" domain: roughly equal-sized dimensions
    • In CFD and financial applications this is the real-world case
• The upcoming discussion of tridiagonal solvers is in the context of the ADI method

1 D. W. Peaceman and H. H. Rachford, Jr., "The numerical solution of parabolic and elliptic differential equations," Journal of the Society for Industrial and Applied Mathematics, vol. 3, no. 1, pp. 28–41, 1955.


Solving the 3D Poisson equation with ADI
Poisson equation:
Crank-Nicolson discretization:
Approximate factorization with O(Δx²) error:


Solving the factored form
Factored form (see the sketch below):
Kernel "preproc": preprocessing
Kernel "trid_x": tridiagonal systems along X
Kernel "trid_y": tridiagonal systems along Y
Kernel "trid_z": tridiagonal systems along Z + add contribution
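As the factored form is only named on the slide, the following is a hedged sketch of a standard ADI approximate factorization chosen to be consistent with the kernel decomposition above; the coefficient $\lambda$ and the second-difference operators $\delta_x^2, \delta_y^2, \delta_z^2$ are notation assumed here, and the exact expression on the original slide may differ.

$$\Bigl(1 - \tfrac{1}{2}\lambda\delta_x^2\Bigr)\Bigl(1 - \tfrac{1}{2}\lambda\delta_y^2\Bigr)\Bigl(1 - \tfrac{1}{2}\lambda\delta_z^2\Bigr)\,\Delta u \;=\; \lambda\bigl(\delta_x^2 + \delta_y^2 + \delta_z^2\bigr)\,u^{n}, \qquad u^{n+1} = u^{n} + \Delta u$$

Each factor on the left is tridiagonal in a single direction, which is what the trid_x, trid_y and trid_z kernels invert in turn, and the final update $u^{n+1} = u^n + \Delta u$ is the "add contribution" step of trid_z; without the factorization this Δ-form update is the Crank-Nicolson step, so the product form only adds higher-order splitting error.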
The resulting tridiagonal systems


Solving tridiagonal systems
• Motivation for an improved algorithm:
  – Poor performance of the cuSPARSE library
  – Lack of support for long-strided access patterns
• Assumptions:
  – Large number of systems – enough to saturate the GPU
  – Relatively large systems
  – Each system has its own coefficients and RHS
  – Typically O(256 × 256) systems of length O(256), i.e. an O(256 × 256 × 256) cube
  – No pivoting is done: we target problems that are diagonally dominant


Solving the heat equation
• Finite difference solution
• The Crank-Nicolson method is expensive in >2D:
  – Sparse matrix
  – Large matrix bandwidth -> unstructured access pattern
• ADI method:
  – Decompose the system into 1D problems by splitting
  – Solve 3 tridiagonal systems -> structured access patterns
  – Solve each tridiagonal system with Thomas, CR (Cyclic Reduction), PCR (Parallel CR) or RD (Recursive Doubling)
• ADI splitting of one time step ($\delta_x^2, \delta_y^2, \delta_z^2$ are second-difference operators, $\lambda$ the time-step coefficient):
  x-dim: $(1 - \tfrac{1}{2}\lambda\delta_x^2)\, u^{(1)} = \lambda(\delta_x^2 + \delta_y^2 + \delta_z^2)\, u^n_{ijk}$
  y-dim: $(1 - \tfrac{1}{2}\lambda\delta_y^2)\, u^{(2)} = u^{(1)}$
  z-dim: $(1 - \tfrac{1}{2}\lambda\delta_z^2)\, \Delta u = u^{(2)}, \qquad u^{n+1}_{ijk} = u^n_{ijk} + \Delta u_{ijk}$


Thomas algorithm
• Gaussian elimination
  – Sequential algorithm
• 2 passes:
  – Forward: eliminate the lower diagonal
  – Backward: back-substitute to solve the system
• O(N) computational complexity
  – 2N steps
  – 8N FLOPs in total
• 6N storage requirement
  – 3N coefficients and N RHS
  – 2N buffer memory requirement
• 9N data traffic complexity
• Diagonal dominance is required for stability! (Pivoting?)
• Experiments show that for small systems (100s of unknowns) the Thomas algorithm is more efficient than CR with shared memory


Thomas algorithm
Forward pass:
  $i = 0$:            $c^*_0 = c_0/b_0, \quad d^*_0 = d_0/b_0$
  $i = 1,\dots,N-1$:  $c^*_i = c_i/(b_i - a_i c^*_{i-1}), \quad d^*_i = (d_i - a_i d^*_{i-1})/(b_i - a_i c^*_{i-1})$
Backward pass:
  $i = N-1$:          $u_{N-1} = d^*_{N-1}$
  $i = N-2,\dots,0$:  $u_i = d^*_i - c^*_i u_{i+1}$
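To make the two passes concrete in the batch setting of this talk, below is a minimal CUDA sketch (not the author's optimized code) in which each thread applies the Thomas algorithm to one system, with systems laid out contiguously along X; the kernel name thomas_x, the argument names and the launch configuration are illustrative assumptions.

// Minimal sketch (illustrative only): batch Thomas (TDMA) solver with one
// thread per tridiagonal system, systems stored contiguously along X.
// a, b, c are the sub-, main and super-diagonals, d is the RHS and is
// overwritten with the solution; c and d double as scratch buffers for the
// modified coefficients c* and d*. No pivoting: diagonal dominance is assumed,
// as stated on the slides.
__global__ void thomas_x(const float* __restrict__ a,
                         const float* __restrict__ b,
                         float* __restrict__ c,
                         float* __restrict__ d,
                         int N, int numSystems)
{
    int sys = blockIdx.x * blockDim.x + threadIdx.x;
    if (sys >= numSystems) return;

    int base = sys * N;            // first element of this thread's system

    // Forward pass: eliminate the lower diagonal.
    c[base] = c[base] / b[base];
    d[base] = d[base] / b[base];
    for (int i = 1; i < N; i++) {
        int k = base + i;
        float r = 1.0f / (b[k] - a[k] * c[k - 1]);
        c[k] = r * c[k];
        d[k] = r * (d[k] - a[k] * d[k - 1]);
    }

    // Backward pass: back-substitution, leaving the solution in d.
    for (int i = N - 2; i >= 0; i--) {
        int k = base + i;
        d[k] = d[k] - c[k] * d[k + 1];
    }
}

// Example launch for a 256^3 cube solved along X (256*256 systems of size 256):
//   thomas_x<<<(256*256 + 127)/128, 128>>>(a, b, c, d, 256, 256*256);

Note that with this layout neighbouring threads access addresses N elements apart, so the global memory accesses are uncoalesced; this is exactly the problem that the shared-memory and register-shuffle transposition optimizations listed in the overview are designed to address.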
Description: