NumLibs4HPC
Parallel Computing – Numerical Libraries for HPC
Thorsten Grahs, 13.07.2015

Table of contents
- Introduction
- Dense matrix linear algebra packages: BLAS, LAPACK, ATLAS
- Matrix-matrix multiplication: rowwise matrix-matrix multiplication, recursive block-oriented matrix-matrix multiplication
- PETSc

Numerical libraries
Standard "textbook" algorithms are often not the most efficient implementations, especially when implemented from scratch. They may also not be the most "correct" in the sense of keeping errors as small as possible. Since it is so important to "get it right" and "get it right fast", it is nearly always advisable to use numerical libraries of proven code to handle most linear algebra tasks. The focus here is on existing libraries: there are many, and one should always look for appropriate libraries that can help as part of tackling any programming problem.

Key operations on vectors
- copy one vector to another
- swap two vectors
- scale a vector by a constant
- add a multiple of one vector to another
- inner product
- 1-norm of a vector
- 2-norm of a vector
- find the maximal entry in a vector

Key operations between matrices and vectors
- general matrix-vector multiplication
- sparse matrix-vector multiplication
- rank-1 update (adds a scalar multiple of xy^T to a matrix)

Key operations between pairs of matrices
- general matrix-matrix multiplication
- symmetric matrix-matrix multiplication
- sparse matrix-matrix multiplication

Overview
BLAS – Basic Linear Algebra Subprograms
  The BLAS is a collection of low-level matrix and vector arithmetic operations.
LAPACK – Linear Algebra PACKage
  LAPACK is a collection of higher-level linear algebra operations such as "solve a linear system". LAPACK is built on top of the BLAS.
ATLAS – Automatically Tuned Linear Algebra Software
  ATLAS is a portable, reasonably good implementation of the BLAS interfaces. It also implements a few of the most commonly used LAPACK operations.

Basic Linear Algebra Subprograms (BLAS)
The BLAS library contains routines that provide standard building blocks for performing basic vector and matrix operations:
- Level 1 BLAS routines perform scalar, vector and vector-vector operations,
- Level 2 BLAS routines perform matrix-vector operations, and
- Level 3 BLAS routines perform matrix-matrix operations.
BLAS is efficient, portable and widely available, so it quite often serves as a building block for the development of high-quality linear algebra software.
BLAS homepage: http://www.netlib.org/blas/

BLAS description
The name of each BLAS routine (usually) begins with a character that indicates the type of data it operates on:
  S  real single precision
  D  real double precision
  C  complex single precision
  Z  complex double precision
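As a hedged illustration of this naming scheme, the short sketch below (not part of the slides) calls the double-precision routines DAXPY and DDOT through the CBLAS C interface that is introduced later in these slides; the vectors and the scale factor are arbitrary example values.

  #include <stdio.h>
  #include <cblas.h>   /* CBLAS, the C interface to the BLAS (see below) */

  int main(void)
  {
      double x[] = { 1.0, 2.0, 3.0 };
      double y[] = { 4.0, 5.0, 6.0 };

      /* DAXPY: y <- alpha*x + y, double precision (prefix D) */
      cblas_daxpy(3, 2.0, x, 1, y, 1);

      /* DDOT: inner product of x and y, double precision (prefix D) */
      double d = cblas_ddot(3, x, 1, y, 1);

      printf("dot = %f\n", d);   /* 1*6 + 2*9 + 3*12 = 60 */
      return 0;                  /* compile e.g. with: gcc ex.c -lcblas -latlas */
  }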
The routines are grouped into three levels:
- Level 1 BLAS handles operations that are O(n) data and O(n) work.
- Level 2 BLAS handles operations that are O(n^2) data and O(n^2) work.
- Level 3 BLAS handles operations that are O(n^2) data and O(n^3) work.

Level 1 BLAS
Handles operations that are O(n) data and O(n) work.
BLAS 1 sub-programs:
  xCOPY   copy one vector to another
  xSWAP   swap two vectors
  xSCAL   scale a vector by a constant
  xAXPY   add a multiple of one vector to another
  xDOT    inner product
  xASUM   1-norm of a vector
  xNRM2   2-norm of a vector
  IxAMAX  find the maximal entry in a vector
Naming convention: BLAS functions returning an integer have names starting with I; the data-type indicator moves to the second position.

Level 1 BLAS description
BLAS 1 development (1973-1977): a standard library of 15 basic vector operations.
BLAS 1 example: the routine DAXPY computes αx + y in double precision.
Goals:
- increasing degree of abstraction (w.r.t. programming)
- portable interface, machine-oriented implementation
- robust algorithms

Level 2 BLAS
Handles operations that are O(n^2) data and O(n^2) work, i.e. matrix-vector computations such as αAx + βy.
BLAS 2 sub-programs:
  xGEMV   general matrix-vector multiplication
  xGER    general rank-1 update
  xSYR2   symmetric rank-2 update
  xTRSV   solve a triangular system of equations
A detailed description of BLAS 2 can be found at http://www.netlib.org/blas/blas2-paper.ps

Level 2 BLAS description
BLAS 2 development (1984-1986): a standard library of 25 basic matrix-vector operations, supporting different matrix and data formats.
BLAS 2 example: the routine DGEMV computes the double-precision GEneral Matrix-Vector product Ax.
Goals:
- optimization of the (inefficient) BLAS 1 routines
- adaptation to the new vector machines

Level 3 BLAS
Handles operations that are O(n^2) data and O(n^3) work, i.e. matrix-matrix computations such as αAB + βC.
BLAS 3 sub-programs:
  xGEMM   general matrix-matrix multiplication
  xSYMM   symmetric matrix-matrix multiplication
  xSYRK   symmetric rank-k update
  xSYR2K  symmetric rank-2k update
A detailed description of BLAS 3 can be found at http://www.netlib.org/blas/blas3-paper.ps

Level 3 BLAS description
BLAS 3 development (1987-1988): a standard library of 9 basic matrix-matrix operations, supporting different matrix and data formats.
BLAS 3 example: the routine DGEMM computes the double-precision GEneral Matrix-Matrix product AB.
Goal: efficient memory/cache usage (optimization).

CBLAS – the BLAS for C/C++
A version of the BLAS has been written with routines that can be called directly from C/C++. The CBLAS routines have the same names as their Fortran counterparts, except that all sub-program and function names are lower-case and are prepended with cblas_. The CBLAS matrix routines also take an additional parameter to indicate whether the matrix is stored in row-major or column-major order.

Calling Fortran BLAS from C/C++
Even with the matrix-transpose issue, it is fairly easy to call the Fortran BLAS from C or C++:
1. Find out the calling sequence for the routine. Man pages for BLAS routines can be looked up on-line; see the BLAS quick reference at http://www.netlib.org/blas/blasqr.pdf or http://www.netlib.org/lapack/lug/node145.html
2. Create a prototype in your C/C++ code for the function. The function name should match the Fortran function name in lower-case and have an appended underscore.
3. Optionally create an interface function to make it easier to handle the differences between how C and Fortran treat arguments.

Function call: DGEMM() from C
DGEMM() performs the operation C ← α op(A) op(B) + βC, where op(A) can be either A or A^T. The call is
  DGEMM(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
  TRANSA, TRANSB   CHAR*1            'T' (or 'C') for transposed, 'N' for not transposed
  M                INTEGER           number of rows in C and A
  N                INTEGER           number of columns in C and B
  K                INTEGER           number of columns in A and rows in B
  ALPHA, BETA      DOUBLE PRECISION  scale factors for AB and C
  LDA, LDB, LDC    INTEGER           leading dimensions of A, B and C

BLAS implementations
  refblas   reference implementation from netlib (C, Fortran)
  ACML      AMD Core Math Library, for AMD Athlon and Opteron
  ESSL      IBM's Engineering and Scientific Subroutine Library, for PowerPC architectures
  HP MLIB   HP's Math Library, for Itanium, PA-RISC, x86 and Opteron under HP-UX and Linux
  XBLAS     Extra Precise Basic Linear Algebra Subroutines
  CUBLAS    Nvidia's implementation of the BLAS for CUDA (GPGPUs)

Parallel BLAS
PBLAS – distributed-memory version of the Level 1, 2 and 3 BLAS
- Performs message passing with an interface similar to the BLAS.
- Provides distributed data layouts: matrices and vectors can have block or block-cyclic distribution.
PBLACS – Basic Linear Algebra Communication Subprograms
- A message-passing library designed for linear algebra.
- The computational model consists of a one- or two-dimensional process grid distribution.
- Each process stores pieces of the matrices and vectors.
- Permits a process to be a member of several overlapping or disjoint process grids, each one labeled by a context (MPI calls this context a communicator).

LAPACK
The Linear Algebra PACKage:
- Development started in 1989, building algorithms for solving linear systems, least-squares problems and eigenvalue problems.
- Successor to both LINPACK and EISPACK (linear algebra packages for the vector machines of the 1970s).
- LINPACK and EISPACK make use of the Level 1 BLAS.
- LAPACK largely replaces LINPACK and EISPACK but takes advantage of the Level 2 and Level 3 BLAS for more efficient operation on computers with hierarchical memory and on shared-memory multiprocessors.
- Latest release: LAPACK 3.5.0, updated November 16, 2013.

LAPACK facts
One of the LINPACK authors, Cleve Moler, went on to write an interactive, user-friendly front end to these libraries called MATLAB.
LAPACK is written in Fortran 90, so calling it from C/C++ poses the same challenges as discussed for the BLAS.
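As a hedged illustration of such a call (not taken from the slides), the sketch below invokes the LAPACK driver routine DGESV, which solves a general linear system AX = B, directly through its Fortran interface; the trailing underscore in dgesv_ assumes the common gfortran name-mangling convention, and the matrix values are arbitrary example data.

  #include <stdio.h>

  /* Prototype for the Fortran LAPACK routine DGESV(N, NRHS, A, LDA, IPIV, B, LDB, INFO).
     All arguments are passed by address, matching Fortran's pass-by-reference.
     Link e.g. with -llapack -lblas. */
  extern void dgesv_(const int *n, const int *nrhs, double *a, const int *lda,
                     int *ipiv, double *b, const int *ldb, int *info);

  int main(void)
  {
      /* Column-major storage, as Fortran expects: A = [3 1; 1 2] */
      double a[] = { 3.0, 1.0,     /* first column  */
                     1.0, 2.0 };   /* second column */
      double b[] = { 9.0, 8.0 };   /* right-hand side, overwritten by the solution */
      int n = 2, nrhs = 1, lda = 2, ldb = 2, ipiv[2], info;

      dgesv_(&n, &nrhs, a, &lda, ipiv, b, &ldb, &info);

      if (info == 0)
          printf("x = (%f, %f)\n", b[0], b[1]);   /* expected solution: (2, 3) */
      return 0;
  }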
A version called CLAPACK exists that can be called more easily from C/C++, but it still has the column-major matrix access issue.
LAPACK homepage: http://www.netlib.org/lapack
LAPACK user guide: http://www.netlib.org/lapack/lug/

LAPACK functionalities
- General: general (GE), banded (GB), pair (GG), tridiagonal (GT)
- Symmetric: general (SY), banded (SB), packed (SP), tridiagonal (ST)
- Hermitian: general (HE), banded (HB), packed (HP)
- Positive definite (PO), packed (PP), tridiagonal (PT)
- Orthogonal (OR), orthogonal packed (OP)
- Unitary (UN), unitary packed (UP)
- Hessenberg (HS), Hessenberg pair (HG)
- Triangular (TR), packed (TP), banded (TB), pair (TG), bidiagonal (BD)

ScaLAPACK
Scalable Linear Algebra PACKage:
- High-performance linear algebra routines designed for parallel distributed-memory machines (PDMM).
- Driven by the Univ. of Tennessee, Univ. of California Berkeley, Univ. of Colorado Denver and NAG Ltd.
Jack Dongarra is the man behind EISPACK, LINPACK, BLAS, LAPACK, ScaLAPACK, Netlib, PVM and MPI.

ScaLAPACK properties
- The library includes a subset of LAPACK routines redesigned for distributed-memory computing.
- Uses explicit message passing for interprocessor communication.
- Designed for heterogeneous computing and portable to any computer that supports MPI or PVM.
- Matrices are laid out in a two-dimensional block-cyclic decomposition.
- ScaLAPACK is based on PBLAS and PBLACS.
ScaLAPACK homepage: http://www.netlib.org/scalapack/index.htm

ATLAS
Automatically Tuned Linear Algebra Software consists of the BLAS and some routines from LAPACK. The key to getting high performance from the BLAS (and hence from LAPACK) is using BLAS routines that are tuned to each particular machine's architecture and compiler. Vendors of HPC equipment typically supply a BLAS library that has been optimized for their machines. The ATLAS project was created to provide similar support for HPC equipment built from commodity hardware (e.g. Beowulf clusters).
ATLAS homepage: http://math-atlas.sourceforge.net/

ATLAS performance example
[Graph: GFLOPS vs. matrix dimension for both the GSL (GNU Scientific Library) BLAS and the ATLAS BLAS supplied with Ubuntu 10.04, running on a Lenovo T410 laptop.] We see close to a 6-fold increase in performance for the larger matrix sizes.

Using the BLAS library from ATLAS
Why? Because it contains a parallel, automatically optimized BLAS library and also provides a nice C interface to the BLAS. Keep in mind that the BLAS are Fortran routines. There are two ways of calling them:
- calling Fortran from C
- using CBLAS

Calling Fortran from C

  #include <stdio.h>

  int main()
  {
      int i, n, one;
      double coefficient;
      double x[] = { 1, 1, 1 };

      coefficient = 4.323;
      one = 1;
      n = 3;

      dscal_(&n, &coefficient, &x[0], &one);

      for (i = 0; i < 3; ++i)
          printf("%f\n", x[i]);

      return 0;
  }

The Fortran compiler appends an underscore to subroutine names (functions in Fortran). Since the BLAS are written in Fortran, to call the subroutine dscal() from C one must write dscal_().
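The slides do not show a prototype for dscal_; a plausible declaration, assuming the usual convention of one trailing underscore and Fortran's pass-by-reference arguments, would be:

  /* Hypothetical C prototype for the Fortran BLAS routine DSCAL(N, DA, DX, INCX):
     every argument is passed by address, matching Fortran's calling convention. */
  extern void dscal_(const int *n, const double *alpha, double *x, const int *incx);

With such a declaration in place, the call dscal_(&n, &coefficient, &x[0], &one) in the listing above compiles without relying on an implicit declaration.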
C passes arguments by value, whereas Fortran passes arguments by reference. So to call a Fortran subroutine from C one must pass the memory addresses of the arguments.

  [user@gauss]$ gcc blasTest.c -o blasTest -L /usr/lib64/atlas -lf77blas
  [user@gauss]$ ./blasTest
  4.323000
  4.323000
  4.323000

Using the C interface CBLAS

  #include <stdio.h>
  #include <cblas.h>

  int main()
  {
      int i;
      double x[] = { 1, 1, 1 };

      cblas_dscal(3, 4.323, x, 1);

      for (i = 0; i < 3; ++i)
          printf("%f\n", x[i]);

      return 0;
  }

Here we use the cblas header file. The main difference is that we use a "cblas_" prefix instead of an "_" suffix, which allows us to pass the arguments by value rather than by reference. To compile this we link against the cblas library instead of the blas library.

  [user@gauss]$ gcc blasTest.c -o blasTest -L /usr/lib64/atlas -lcblas -latlas
  [user@gauss]$ ./blasTest
  4.323000
  4.323000
  4.323000

Using the C interface CBLAS from C++

  #include <cstdlib>
  #include <vector>
  #include <iostream>

  extern "C"
  {
  #include <cblas.h>
  }

  using namespace std;

  int main()
  {
      size_t i;
      vector<double> x(3, 1);

      cblas_dscal(x.size(), 4.323, &x[0], 1);

      for (i = 0; i < x.size(); ++i)
          cout << x[i] << endl;

      return 0;
  }

The advantage of the vector class over a C array is that one no longer needs an extra variable to store the length of the array.

Example: matrix-matrix multiplication
Rowwise matrix-matrix multiplication: the rowwise matrix-matrix product C = AB is computed with the pseudo-code below.

  for (i = 0; i < N; i++) {
      for (j = 0; j < N; j++) {
          c[i][j] = 0.0;
          for (k = 0; k < N; k++) {
              c[i][j] = c[i][j] + a[i][k] * b[k][j];
          }
      }
  }

Notice that both A and C are accessed by rows in the innermost loop, while B is accessed by columns.

[Graph: GFLOPS vs. matrix dimension for the rowwise matrix-matrix product.] The performance drop-off suggests that all of B fits in the processor's cache when N = 300 but not when N = 400. A 300 x 300 matrix of double-precision floating-point values consumes approximately 700 KB; a 400 x 400 matrix needs at least 1250 KB. The cache size of the machine used is 2048 KB.

Key idea: minimize cache misses
If we partition A and B into appropriately sized blocks

  A = | A00  A01 |        B = | B00  B01 |
      | A10  A11 |,           | B10  B11 |,

then the product C = AB can be computed as

  C = | C00  C01 |  =  | A00 B00 + A01 B10    A00 B01 + A01 B11 |
      | C10  C11 |     | A10 B00 + A11 B10    A10 B01 + A11 B11 |,

where each of the matrix-matrix products involves matrices of roughly 1/4 the size of the full matrices.

Recursion
We can exploit this idea quite easily with a recursive algorithm:

  matrix function C = mult(A, B)
      if (size(B) > threshold) then
          C00 = mult(A00, B00) + mult(A01, B10)
          C01 = mult(A00, B01) + mult(A01, B11)
          C10 = mult(A10, B00) + mult(A11, B10)
          C11 = mult(A10, B01) + mult(A11, B11)
      else
          C = AB
      endif
      return C
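The pseudo-code above leaves the block handling implicit. A minimal C sketch of the same idea is given below; the function name mm_recursive, the THRESHOLD value and the assumptions that the matrices are stored row-major with leading dimension ld, that the dimension is a power of two and that C is zero-initialized by the caller are illustrative choices, not part of the original slides.

  #define THRESHOLD 64   /* illustrative switch-over block size, to be tuned */

  /* Recursive block product C += A*B for an n-by-n block stored row-major
     with leading dimension ld. Usage: mm_recursive(N, N, a, b, c) with c
     pre-zeroed. */
  static void mm_recursive(int n, int ld,
                           const double *A, const double *B, double *C)
  {
      if (n <= THRESHOLD) {
          /* Plain triple loop on a block small enough to stay in cache. */
          for (int i = 0; i < n; i++)
              for (int k = 0; k < n; k++)
                  for (int j = 0; j < n; j++)
                      C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
      } else {
          int h = n / 2;   /* split into the four sub-blocks of the slide */
          const double *A00 = A,        *A01 = A + h,
                       *A10 = A + h*ld, *A11 = A + h*ld + h;
          const double *B00 = B,        *B01 = B + h,
                       *B10 = B + h*ld, *B11 = B + h*ld + h;
          double       *C00 = C,        *C01 = C + h,
                       *C10 = C + h*ld, *C11 = C + h*ld + h;

          mm_recursive(h, ld, A00, B00, C00);  mm_recursive(h, ld, A01, B10, C00);
          mm_recursive(h, ld, A00, B01, C01);  mm_recursive(h, ld, A01, B11, C01);
          mm_recursive(h, ld, A10, B00, C10);  mm_recursive(h, ld, A11, B10, C10);
          mm_recursive(h, ld, A10, B01, C11);  mm_recursive(h, ld, A11, B11, C11);
      }
  }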
Recursive block-oriented matrix-matrix product
[Graph: GFLOPS vs. matrix dimension for both the rowwise product and the recursive block-oriented matrix-matrix product.]
- Performance is improved across the board, including for smaller matrix sizes.
- Performance is relatively flat: no degradation as the matrix sizes increase.

Comparison with BLAS
Finally, we compare the results just shown with similar results obtained using the BLAS, i.e.
- the default BLAS supplied with Ubuntu 12.04, and
- the ATLAS BLAS.
The ATLAS BLAS performance is far superior to both the standard BLAS and the recursive block routine.

PETSc
Portable, Extensible Toolkit for Scientific computation: a suite of data structures and routines for the scalable (parallel) solution of scientific applications modelled by partial differential equations.
Supported interfaces/hardware:
- MPI
- shared-memory pthreads
- NVIDIA GPUs
- hybrid MPI/shared-memory-pthreads and MPI/GPU parallelism

What is PETSc?
- A freely available and supported research code, available via http://www.mcs.anl.gov/petsc
- Free for everyone, including industrial users
- Hyperlinked documentation and manual pages for all routines
- Many tutorial-style examples
- Support via email
- Usable from Fortran 77/90, C and C++
- Portable to any parallel system supporting MPI, including
  - tightly coupled systems: Cray T3E, SGI Origin, IBM SP, HP 9000, Sun Enterprise
  - loosely coupled systems, e.g. networks of workstations (Compaq, HP, IBM, SGI, Sun; PCs running Linux or Windows)

PETSc facts
- Developed by Argonne National Laboratory, USA
- Aim: tools for the scalable (parallel) solution of scientific applications modelled by partial differential equations
- Languages: C (main language), C++, Fortran, Python
- Built on BLAS, LAPACK and MPI
- Latest version: PETSc 3.6, released June 9, 2015
- Homepage: http://www.mcs.anl.gov/petsc/

PETSc's role
"Developing parallel, nontrivial PDE solvers that deliver high performance is still difficult and requires months (or even years) of concentrated effort. PETSc is a toolkit that can ease these difficulties and reduce the development time, but it is not a black-box PDE solver, nor a silver bullet."
Barry Smith, Argonne National Laboratory, main developer of PETSc

PETSc features
- Parallel vectors
- Parallel matrices
- Scalable parallel preconditioners
- Krylov subspace methods
- Parallel Newton-based nonlinear solvers
- Parallel timestepping (ODE) solvers
- Support for Nvidia GPU cards
- Complete documentation
- Automatic profiling of floating-point and memory usage
- Consistent user interface
- Portable to UNIX and Windows
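To give a flavour of the PETSc programming model described above, here is a minimal, hedged sketch (not taken from the slides) that creates a distributed vector, fills it and prints its 2-norm; it assumes a standard PETSc installation and uses only core routines of the Vec API (PetscInitialize, VecCreate, VecSetSizes, VecSet, VecNorm).

  #include <petscvec.h>

  /* Minimal PETSc sketch: create a distributed vector, fill it and print
     its 2-norm. Build against an existing PETSc installation, e.g. via
     the PETSc makefile rules; error checking (CHKERRQ) is omitted here. */
  int main(int argc, char **argv)
  {
      Vec       x;
      PetscReal norm;
      PetscInt  n = 100;                 /* global vector length (illustrative) */

      PetscInitialize(&argc, &argv, NULL, NULL);

      VecCreate(PETSC_COMM_WORLD, &x);
      VecSetSizes(x, PETSC_DECIDE, n);   /* let PETSc choose the local sizes */
      VecSetFromOptions(x);
      VecSet(x, 1.0);                    /* x_i = 1 for all i */

      VecNorm(x, NORM_2, &norm);         /* ||x||_2 = sqrt(n) = 10 */
      PetscPrintf(PETSC_COMM_WORLD, "2-norm of x: %g\n", (double)norm);

      VecDestroy(&x);
      PetscFinalize();
      return 0;
  }

The same source runs unchanged on one or many MPI processes (e.g. via mpiexec -n 4 ./example), which is precisely the point of PETSc's distributed Vec object.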