Parallel Computing I SS 2014 - Numerical linear algebra II
Parallel Computing
Numerical linear algebra II
Thorsten Grahs, 23.06.2014

Overview
Sparse matrices
Example – PageRank
Efficient storage formats
Parallel matrix-matrix multiplication
Virtual topologies in MPI

Matrices are extremely important in HPC
While it is certainly not the case that high performance computing is all about computing with matrices, matrix operations are key to many important HPC applications. Many important applications can be reduced to operations on matrices, including (but not limited to)
1. searching and sorting
2. numerical simulation of physical processes
3. optimization
The list of the top 500 supercomputers in the world (http://www.top500.org) is determined by a benchmark program that performs matrix operations.

Dense vs. sparse matrices
A matrix A ∈ R^{m×n} is dense if all or most of its entries are nonzero. Storing a dense matrix (sometimes called a full matrix) requires storing all m × n elements of the matrix. Usually an array data structure is used to store a dense matrix.
A matrix is sparse if most of its entries are zero. Here we expect the number of zeros to far exceed the number of nonzeros. It is often most efficient to store only the nonzero entries of a sparse matrix, but this requires that location information also be stored. Arrays and lists are most commonly used to store sparse matrices.

Dense matrix examples
Find a matrix to represent this complete graph (figure not reproduced in the transcript) if the ij entry contains the weight of the edge connecting the node corresponding to row i with the node corresponding to column j. Use the value 0 if a connection is missing.

Sparse matrix examples
Find a matrix to represent this graph (figure not reproduced in the transcript) if the ij entry contains the weight of the edge connecting the node corresponding to row i with the node corresponding to column j. As before, use the value 0 if a connection is missing.

Example | PageRank
Google's search algorithm works with PageRank.
For each web page a weight (the PageRank) is computed.
The weight is higher the more pages link to this page.
The weight is even higher when the linking pages themselves have high weights.
Thus the weight PR_i of page i is computed from the weights PR_j of the pages j that link to it.
If page j links to c_j web pages, its weight PR_j is distributed over these linked pages.

Example | PageRank
Illustration from Wikipedia: http://de.wikipedia.org/wiki/PageRank

Example | PageRank
This gives rise to the recursive formula

    PR_i = (1 - d)/n + d * Σ_{j → i} PR_j / c_j

where n is the total number of pages, d is a damping factor between 0 and 1, and the sum runs over all pages j that link to page i. This can be written as a linear system

    M · PR = ((1 - d)/n) · 1,   with   M_{ij} = δ_{ij} - d · L_{ij},

where L is a sparse matrix with

    L_{ij} = 1/c_j   if page j links to page i,
    L_{ij} = 0       otherwise.

Example | PageRank
The system can also be viewed as the eigenvalue problem

    P^T · PR = PR

with the so-called Google matrix P. For d < 1 the solution of the system is unique (theorem of Perron-Frobenius). The system is solved with an iterative algorithm.
(Figure: BiCGstab vs. Reverse GS for a matrix with n = 2 million.)
Reference: G. M. Del Corso, A. Gullì, F. Romani: Comparison of Krylov subspace methods on the PageRank problem, J. Comput. Appl. Math., Vol. 210(1-2), 2007.
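To make the recursion above concrete, here is a small power-iteration sketch in C. It is not code from the slides: the toy link graph, the damping factor 0.85 and the convergence tolerance are illustrative assumptions, and dangling pages (pages without outgoing links) are ignored for simplicity.

/* pagerank_sketch.c -- illustrative only, not code from the slides.
 * Power iteration for PR_i = (1-d)/n + d * sum_{j->i} PR_j / c_j
 * on a small hand-coded link graph stored as an edge list. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* assumed toy graph: edge src[k] -> dst[k] */
    enum { N = 4, E = 6 };
    int src[E] = {0, 0, 1, 2, 2, 3};
    int dst[E] = {1, 2, 2, 0, 3, 0};

    int c[N] = {0};                      /* c_j: number of outgoing links of page j */
    for (int k = 0; k < E; k++) c[src[k]]++;

    double d = 0.85;                     /* damping factor */
    double pr[N], pr_new[N];
    for (int i = 0; i < N; i++) pr[i] = 1.0 / N;   /* start with uniform weights */

    for (int it = 0; it < 100; it++) {
        for (int i = 0; i < N; i++) pr_new[i] = (1.0 - d) / N;
        for (int k = 0; k < E; k++)      /* each link j->i contributes d*PR_j/c_j */
            pr_new[dst[k]] += d * pr[src[k]] / c[src[k]];

        double diff = 0.0;
        for (int i = 0; i < N; i++) {
            diff += fabs(pr_new[i] - pr[i]);
            pr[i] = pr_new[i];
        }
        if (diff < 1e-12) break;         /* converged */
    }

    for (int i = 0; i < N; i++)
        printf("PR_%d = %.4f\n", i, pr[i]);
    return 0;
}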
Storing in two-dimensional arrays
Dense or banded matrix
It is natural to use a 2D array to store a dense or banded matrix. Unfortunately, there are a couple of significant issues that complicate this seemingly simple approach:
Row-major vs. column-major storage patterns are language dependent.
It is not possible to dynamically allocate two-dimensional arrays in C and C++ (at least not without pointer storage and manipulation overhead).

Row-major storage
Languages like C and C++ use what is often called a row-major storage pattern for 2D matrices. In C and C++, the last index in a multidimensional array indexes contiguous memory locations. Thus a[i][j] and a[i][j+1] are adjacent in memory.
Example: the stride between adjacent elements in the same row is 1, while the stride between adjacent elements in the same column is the row length (5 in this example).

Column-major storage
In Fortran, 2D matrices are stored in column-major form. The first index in a multidimensional array indexes contiguous memory locations. Thus a(i,j) and a(i+1,j) are adjacent in memory.
Example: the stride between adjacent elements in the same row is the column length (2 in this example), while the stride between adjacent elements in the same column is 1.
Notice that if C is used to read a matrix stored by Fortran (or vice versa), the transposed matrix will be read.

Significance of array ordering
There are two main reasons why HPC programmers need to be aware of this issue:
Memory access patterns can have a dramatic impact on performance, especially on modern systems with a complicated memory hierarchy. The two code segments shown below access the same elements of an array, but the order of accesses is different.
Many important numerical libraries (e.g. LAPACK) are written in Fortran. To use them from C or C++ the programmer must often work with a transposed matrix.
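The code segments referred to above are not reproduced in the transcript. A minimal sketch of the two access orders in C (array sizes are arbitrary assumptions) could look like this:

/* loop_order_sketch.c -- illustrative sketch; the slide's original code
 * segments are not contained in the transcript.
 * Both loop nests touch every element of the same row-major C array,
 * but with very different memory access patterns. */
#include <stdio.h>

#define ROWS 1024
#define COLS 1024

static double a[ROWS][COLS];

int main(void)
{
    /* Row-wise traversal: the inner loop walks contiguous memory (stride 1). */
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            a[i][j] = i + j;

    /* Column-wise traversal: the inner loop jumps a whole row ahead each time
     * (stride COLS), which is typically much slower once the matrix no longer
     * fits into cache. */
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            a[i][j] = i + j;

    printf("%g\n", a[ROWS - 1][COLS - 1]);  /* prevent the loops from being optimized away */
    return 0;
}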
Efficient storage formats for sparse matrices
Obviously, there is no need to store the zeros of a sparse matrix. There are several ways to store a sparse matrix efficiently.
Common sparse storage formats:
Coordinate
Diagonal
Compressed Sparse Row (CSR)
ELLPACK
Hybrid

Storage format COO
Coordinate (COO): stores, for every non-zero entry, its row index, its column index and its value in three arrays.

Storage format CSR
Compressed Sparse Row (CSR): stores values and column indices row by row, together with one pointer per row marking where that row starts.

Storage format DIA
Diagonal (DIA): stores the values along a fixed set of diagonals together with their offsets; efficient only for banded/structured matrices.

Storage format ELL
ELLPACK (ELL): pads every row to the same maximal number of non-zeros and stores values and column indices in two dense arrays.

Storage format HYB
Hybrid (HYB): ELL for the regular part plus COO for rows with exceptionally many non-zeros, ELL+COO (CUSP only).

Use of formats
(Figure not reproduced in the transcript.)

Formats comparison | structured
Structured matrix, bytes per non-zero entry:
  Format   float    double
  COO      12.00    16.00
  CSR       8.45    12.45
  DIA       4.05     8.10
  ELL       8.11    12.16
  HYB       8.11    12.16

Formats comparison | unstructured
Unstructured matrix, bytes per non-zero entry:
  Format   float    double
  COO      12.00    16.00
  CSR       8.62    12.62
  DIA     164.11   328.22
  ELL      11.06    16.60
  HYB       9.00    13.44

Formats comparison | random
Random matrix, bytes per non-zero entry:
  Format   float    double
  COO      12.00    16.00
  CSR       8.42    12.42
  DIA      76.83   153.65
  ELL      14.20    21.29
  HYB       9.60    14.20

Formats comparison | power-law
Power-law matrix, bytes per non-zero entry:
  Format   float    double
  COO      12.00    16.00
  CSR       8.74    12.73
  DIA     118.83   237.66
  ELL      49.88    74.82
  HYB      13.50    19.46

Matrix-matrix multiplication
Series of inner product (dot product) operations.
Iterative row-oriented algorithm.

Block-matrix computation
Replace scalar multiplication with matrix multiplication.
Replace scalar addition with matrix addition.

Block-matrix computation
Recurse until the blocks are small enough.

First parallel algorithm
Partitioning:
Divide the matrices into rows.
Each primitive task has the corresponding rows of the three matrices.
Communication:
Each task must eventually see every row of B.
Organize the tasks into a ring.
Agglomeration and mapping:
Fixed number of tasks, each requiring the same amount of computation.
Regular communication among tasks.
Strategy: assign each process a contiguous group of rows.

Communication of B
(Figure sequence: in every iteration each process passes its rows of B one step around the ring, so that eventually every process has seen all of B; a code sketch follows below.)
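The "Communication of B" figures are not contained in the transcript. The following is a minimal MPI sketch of the ring step they illustrate; it is an assumption, not code from the slides: matrix size, initialization and the naive local multiply are placeholders.

/* ring_shift_sketch.c -- illustrative sketch, not code from the slides.
 * Row-wise parallel matrix multiplication: each process owns n/p rows of
 * A, B and C.  In p iterations the local rows of B travel around a ring of
 * processes, so that every process eventually sees every row of B. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const int n = 8 * p;              /* assumed global size, divisible by p */
    const int rows = n / p;           /* rows owned by each process */

    double *a = malloc((size_t)rows * n * sizeof *a);   /* local rows of A */
    double *b = malloc((size_t)rows * n * sizeof *b);   /* travelling rows of B */
    double *c = calloc((size_t)rows * n, sizeof *c);    /* local rows of C */
    for (int i = 0; i < rows * n; i++) { a[i] = 1.0; b[i] = 1.0; }

    int left  = (rank - 1 + p) % p;   /* ring neighbours */
    int right = (rank + 1) % p;

    for (int step = 0; step < p; step++) {
        /* the rows of B currently held were originally owned by this process */
        int owner = (rank + step) % p;

        /* C(local,:) += A(local, owner block) * B(owner block, :) */
        for (int i = 0; i < rows; i++)
            for (int k = 0; k < rows; k++)
                for (int j = 0; j < n; j++)
                    c[i * n + j] += a[i * n + owner * rows + k] * b[k * n + j];

        /* pass the rows of B one step around the ring */
        MPI_Status status;
        MPI_Sendrecv_replace(b, rows * n, MPI_DOUBLE,
                             left, 0, right, 0, MPI_COMM_WORLD, &status);
    }

    if (rank == 0)                    /* with all-ones input every C entry equals n */
        printf("c[0] = %g (expected %d)\n", c[0], n);

    free(a); free(b); free(c);
    MPI_Finalize();
    return 0;
}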
Complexity
The algorithm has p iterations. During each iteration a process multiplies an (n/p) × (n/p) block of A by an (n/p) × n block of B, i.e. O(n^3/p^2) multiplications per process and iteration.
Total computation time: O(n^3/p).
Each process ends up passing (p - 1) · n^2/p = O(n^2) elements of B.
Weakness:
Each process must access every element of matrix B.
The ratio of computation to communication is poor: only 2n/p.

Cannon's algorithm
Compute C = A × B with A, B ∈ R^{N×N}.
The matrices are distributed block-wise on a 2D grid of √P × √P processors; C is stored in the same manner.
Processor (i,j) computes

    C_{i,j} = Σ_k A_{i,k} · B_{k,j}

i.e. it needs block-row i of A and block-column j of B.
Prepare with an alignment of the blocks, then iterate over
a rotation phase
a computation phase

Simple vs. Cannon's Algorithm
(Figure not reproduced in the transcript.)

Alignment
Matrix A: The block-rows of matrix A are shifted to the left. This is done for each row until the diagonal blocks are placed in the first column.
Matrix B: The block-columns of matrix B are shifted upwards. This is done for each column until the diagonal blocks are placed in the first row.
After this step processor (i,j) holds the blocks
A_{i,(j+i) mod √P}   (row i has been shifted i times to the left)
B_{(j+i) mod √P, j}  (column j has been shifted j times upwards)

Alignment phase
(Figures: block layout before and after the alignment.)

Computation & Rotation
Phase I – Computation: After the alignment, every processor holds two corresponding blocks. In the computation phase the matrix-matrix multiplication is carried out block-wise.
Phase II – Rotation: The blocks are shifted one position further to the left resp. upwards.

Example
Starting point: the blocks are distributed to the processors, P = 9. Processor (i,j) holds block (i,j).

Example
Rotation: the alignment has taken place.

Example | Iteration
(Figures not reproduced in the transcript.)

Complexity analysis | Cannon's
The algorithm has √p iterations.
During each iteration a process multiplies two (n/√p) × (n/√p) blocks: O(n^3/p^{3/2}).
Total computation time: O(n^3/p).
During each iteration a process sends and receives two blocks of size (n/√p) × (n/√p).
Communication complexity: O(n^2/√p).
The row-wise algorithm needed p iterations with communication complexity O(n^2).

MPI-Code | Cannon's alg.
/* Create grid and get coordinates */
int dims[2], periods[2] = {1, 1};
int mycoords[2];
dims[0] = sqrt(num_procs);
dims[1] = num_procs / dims[0];
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm_2d);
MPI_Comm_rank(comm_2d, &my2drank);
MPI_Cart_coords(comm_2d, my2drank, 2, mycoords);

/* Local blocks of the matrices */
double *a, *b, *c;
/* Load a, b and c corresponding to the coordinates */
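The loop in the Cannon code below calls a local block multiply dgemm(nlocal, a, b, c) that is not shown in the transcript. A naive version might look like this; only the name, the call signature and the comment "c = c + a * b" are taken from the slide, the body and the row-major block layout are assumptions:

/* Assumed implementation of the local block product used in the loop below:
 * c = c + a * b for row-major nlocal x nlocal blocks (not from the slides). */
void dgemm(int nlocal, double *a, double *b, double *c)
{
    for (int i = 0; i < nlocal; i++)
        for (int k = 0; k < nlocal; k++)
            for (int j = 0; j < nlocal; j++)
                c[i * nlocal + j] += a[i * nlocal + k] * b[k * nlocal + j];
}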
MPI-Code | Cannon's alg.
/* Matrix rotation A */
MPI_Cart_shift(comm_2d, 0, -mycoords[0], &shiftsource, &shiftdest);
MPI_Sendrecv_replace(a, nlocal*nlocal, MPI_DOUBLE, shiftdest, 77,
                     shiftsource, 77, comm_2d, &status);

/* Matrix rotation B */
MPI_Cart_shift(comm_2d, 1, -mycoords[1], &shiftsource, &shiftdest);
MPI_Sendrecv_replace(b, nlocal*nlocal, MPI_DOUBLE, shiftdest, 77,
                     shiftsource, 77, comm_2d, &status);

MPI-Code | Cannon's alg.
/* Find left and upper neighbour */
MPI_Cart_shift(comm_2d, 0, -1, &rightrank, &leftrank);
MPI_Cart_shift(comm_2d, 1, -1, &downrank, &uprank);

for (i = 0; i < dims[0]; i++) {
    dgemm(nlocal, a, b, c);   /* c = c + a * b */

    /* Matrix A is shifted to the left */
    MPI_Sendrecv_replace(a, nlocal*nlocal, MPI_DOUBLE, leftrank, 77,
                         rightrank, 77, comm_2d, &status);

    /* Matrix B is shifted upwards */
    MPI_Sendrecv_replace(b, nlocal*nlocal, MPI_DOUBLE, uprank, 77,
                         downrank, 77, comm_2d, &status);
}
/* A and B can now be shifted back to their original placement */

Virtual topologies in MPI
MPI allows one to associate virtual topologies with a communicator, for instance
a Cartesian topology
a graph topology
Creating a topology produces a new communicator.
MPI provides "mapping functions": they compute process ranks based on the topology naming scheme.

Cartesian topology
Each process is connected to its neighbours in a virtual grid.
Boundaries can be cyclic.
Processes can be identified by Cartesian coordinates.

Creating a Cartesian topology
int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims,
                    int *periods, int reorder, MPI_Comm *comm_cart)
comm_old     existing communicator
ndims        number of dimensions
dims         integer array giving the number of processes in each dimension
periods      logical array indicating whether a dimension is cyclic (TRUE ⇒ cyclic boundary conditions)
reorder      logical; FALSE ⇒ ranks preserved, TRUE ⇒ possible rank reordering
comm_cart    new Cartesian communicator

Cartesian example
MPI_Comm vu;
int dim[2], period[2], reorder;

dim[0] = 4; dim[1] = 3;
period[0] = TRUE; period[1] = FALSE;
reorder = TRUE;

MPI_Cart_create(MPI_COMM_WORLD, 2, dim, period, reorder, &vu);

Mapping coordinates to ranks
int MPI_Cart_rank(MPI_Comm comm, int *coords, int *rank)
Mapping ranks to coordinates
int MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int *coords)

Cartesian example
#include <mpi.h>
#include <stdio.h>
#define TRUE  1
#define FALSE 0

/* Run with 12 processes */
int main(int argc, char *argv[])
{
    int rank;
    MPI_Comm vu;
    int dim[2], period[2], reorder, coord[2], id;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    dim[0] = 4; dim[1] = 3;
    period[0] = TRUE; period[1] = FALSE;
    reorder = TRUE;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dim, period, reorder, &vu);

    if (rank == 5) {
        MPI_Cart_coords(vu, rank, 2, coord);
        printf("P:%d My coordinates are %d %d\n", rank, coord[0], coord[1]);
    }
    if (rank == 0) {
        coord[0] = 3; coord[1] = 1;
        MPI_Cart_rank(vu, coord, &id);
        printf("Proc. at position (%d, %d) has rank %d\n", coord[0], coord[1], id);
    }
    MPI_Finalize();
}

Neighbouring ranks output
Proc. at position (3, 1) has rank 10
P:5 My coordinates are 1 2
Computing ranks of neighbouring processors
int MPI_Cart_shift(MPI_Comm comm, int direction, int disp,
                   int *rank_source, int *rank_dest)
MPI_Cart_shift does not actually shift data: it returns the correct ranks for a shift, which can then be used in subsequent communication calls.

Neighbouring ranks
Arguments:
direction    dimension in which the shift should be made
disp         length of the shift in processor coordinates (+ or -)
rank_source  rank from which the calling process receives a message during the shift
rank_dest    rank to which the calling process sends a message during the shift
If we shift off the edge of the topology (in a non-periodic dimension), MPI_PROC_NULL is returned.
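As a closing illustration, the following sketch (not code from the slides; it reuses the 4 × 3 grid from the example above as an assumption) calls MPI_Cart_shift and shows MPI_PROC_NULL appearing at the non-periodic boundary:

/* cart_shift_sketch.c -- illustrative sketch, not code from the slides.
 * Run with 12 processes.  Each process asks for its neighbours along
 * dimension 1 of the 4 x 3 grid used above; since that dimension is not
 * periodic, processes on the boundary receive MPI_PROC_NULL. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int dim[2]    = {4, 3};
    int period[2] = {1, 0};          /* dimension 0 cyclic, dimension 1 not */
    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dim, period, 1, &grid);

    if (grid != MPI_COMM_NULL) {     /* processes outside the grid get MPI_COMM_NULL */
        int rank, source, dest;
        MPI_Comm_rank(grid, &rank);

        /* neighbours for a shift by +1 along the non-periodic dimension 1 */
        MPI_Cart_shift(grid, 1, 1, &source, &dest);

        if (dest == MPI_PROC_NULL)
            printf("P:%d is at the boundary: no neighbour in the +1 direction of dim 1\n", rank);
        else
            printf("P:%d sends to rank %d during a +1 shift\n", rank, dest);

        MPI_Comm_free(&grid);
    }
    MPI_Finalize();
    return 0;
}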