NumLibs4HPC
Parallel Computing
Numerical libraries for HPC
Thorsten Grahs, 13.07.2015
Table of contents
Introduction
Dense Matrix Linear Algebra packages
BLAS
LAPACK
ATLAS
Matrix-Matrix multiplication
Rowwise matrix-matrix multiplication
Recursive block-oriented matrix-matrix multiplication
PETSc
Numerical libraries
Standard “textbook” algorithms are often not the most efficient implementations, especially when coded by hand
In addition, they may not be the most “correct” in the sense of keeping errors as small as possible
Since it’s so important to “get it right” and “get it right fast”, it is nearly always advisable to use numerical libraries of proven code to handle most linear algebra tasks
Focus on existing libraries: there are many, and one should always look for appropriate available libraries that can help as part of tackling any programming problem
Key operations
... on vectors
copy one vector to another
swap two vectors
scale a vector by a constant
add a multiple of one vector to another
inner product
1-norm of a vector
2-norm of a vector
find maximal entry in a vector
Key operations
... between matrices and vectors
general matrix-vector multiplication
sparse matrix-vector multiplication
rank-1 update (adds a scalar multiple of xyᵀ to a matrix)
... between pairs of matrices
general matrix-matrix multiplication
symmetric matrix-matrix multiplication
sparse matrix-matrix multiplication
Overview
BLAS – Basic Linear Algebra Sub-programs
The BLAS is a collection of low-level matrix and vector
arithmetic operations.
LAPACK– Linear Algebra PACKage
LAPACK is essentially a collection of higher-level linear
algebra operations, e.g. “solve a linear system”.
LAPACK is built on top of the BLAS.
ATLAS– Automatically Tuned Linear Algebra Software
ATLAS is a portable, reasonably efficient implementation of
the BLAS interfaces.
It also implements a few of the most commonly used
LAPACK operations.
Basic Linear Algebra Sub-programs
BLAS
The BLAS library contains routines that provide standard
building blocks for performing basic vector and matrix
operations:
Level 1 BLAS routines
perform scalar, vector and vector-vector operations,
Level 2 BLAS routines
perform matrix-vector operations, and
Level 3 BLAS routines
perform matrix-matrix operations.
BLAS is efficient, portable, and widely available
⇒ it quite often serves as a building block for the
development of high-quality linear algebra software.
BLAS homepage http://www.netlib.org/blas/
BLAS description
The name of each BLAS routine (usually) begins with a
character that indicates the type of data it operates on:
S: Real single precision
D: Real double precision
C: Complex single precision
Z: Complex double precision
Level 1 BLAS
Handle operations that are O(n) data and O(n) work
Level 2 BLAS
Handle operations that are O(n²) data and O(n²) work
Level 3 BLAS
Handle operations that are O(n²) data and O(n³) work
Level 1 BLAS
Handle operations that are O(n) data and O(n) work
BLAS 1 sub-programs
xCOPY: copy one vector to another
xSWAP: swap two vectors
xSCAL: scale a vector by a constant
xAXPY: add a multiple of one vector to another
xDOT: inner product
xASUM: 1-norm of a vector
xNRM2: 2-norm of a vector
IxAMAX: find maximal entry in a vector
Naming convention: BLAS functions returning an integer have names
starting with I; the datatype indicator then moves to the second position.
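As a small illustration of this convention (a sketch using the C interface CBLAS, which is introduced later in these slides; the vector values are arbitrary):

#include <stdio.h>
#include <cblas.h>

int main(void)
{
    double x[] = { 3.0, -7.5, 2.0 };

    /* IDAMAX returns an index, so the name starts with I and the
       datatype letter D moves to the second position.
       Note: the CBLAS variant returns a 0-based index. */
    int imax = (int) cblas_idamax(3, x, 1);

    printf("largest |x[i]| at index %d\n", imax);
    return 0;
}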
Level 1 BLAS description
BLAS 1 development (1973-1977)
Standard library for 15 basic vector operations
BLAS 1 example
Routine DAXPY
Double precision
computes y ← αx + y
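A corresponding sketch of a DAXPY call through the C interface CBLAS (introduced later in these slides); the scalar and the vectors here are arbitrary:

#include <stdio.h>
#include <cblas.h>

int main(void)
{
    double x[] = { 1.0, 2.0, 3.0 };
    double y[] = { 10.0, 10.0, 10.0 };

    /* y <- 2.5*x + y */
    cblas_daxpy(3, 2.5, x, 1, y, 1);

    for (int i = 0; i < 3; ++i)
        printf("%f\n", y[i]);
    return 0;
}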
Goals
Increasing degree of abstraction (w.r.t. programming)
Portable interface, machine-oriented implementation
Robust algorithms
Level 2 BLAS
Handle operations that are O(n²) data and O(n²) work
Matrix-vector computation
y ← αAx + βy
BLAS 2 sub-programs
xGEMV: general matrix-vector multiplication
xGER: general rank-1 update
xSYR2: symmetric rank-2 update
xTRSV: solve a triangular system of equations
A detailed description of BLAS 2 can be found at
http://www.netlib.org/blas/blas2-paper.ps
Level 2 BLAS description
BLAS 2 development (1984-1986)
Standard library for 25 basic matrix-vector operations
Support different matrix and data formats
BLAS 2 example
Routine DGEMV
Double precision
GEneral Matrix
Vector product Ax
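A sketch of a DGEMV call through the CBLAS interface (the matrix, vectors and scalars are arbitrary; row-major storage is requested explicitly, see the CBLAS slide later on):

#include <stdio.h>
#include <cblas.h>

int main(void)
{
    /* 2x3 matrix A stored row-major */
    double A[] = { 1.0, 2.0, 3.0,
                   4.0, 5.0, 6.0 };
    double x[] = { 1.0, 1.0, 1.0 };
    double y[] = { 0.0, 0.0 };

    /* y <- 1.0*A*x + 0.0*y */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, 2, 3,
                1.0, A, 3, x, 1, 0.0, y, 1);

    printf("%f %f\n", y[0], y[1]);
    return 0;
}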
Goals
Overcome the inefficiency of building matrix-vector operations from BLAS 1 routines
Adaptation to new vector machines
Level 3 BLAS
Handle operations that are O(n²) data and O(n³) work
Matrix-matrix computation
C ← αAB + βC
BLAS 3 sub-programs
xGEMM: general matrix-matrix multiplication
xSYMM: symmetric matrix-matrix multiplication
xSYRK: symmetric rank-k update
xSYR2K: symmetric rank-2k update
A detailed description of BLAS 3 can be found at
http://www.netlib.org/blas/blas3-paper.ps
Level 3 BLAS description
BLAS 3 development (1987-1988)
Standard library for 9 basic matrix-matrix operations
Support different matrix and data formats
BLAS 3 example
Routine DGEMM
Double precision
GEneral Matrix
Matrix-Matrix product AB
Goals
Efficient memory/cache usage (optimization)
CBLAS | The BLAS for C/C++
A version of the BLAS has been written with routines that
can be called directly from C/C++.
The CBLAS routines have the same name as their
FORTRAN counterparts except all sub-program and
function names are lower-case and are prepended with
cblas_.
The CBLAS matrix routines also take an additional
parameter to indicate if the matrix is stored in row-major or
column-major order.
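For example, a DGEMM call through CBLAS might look like this (a sketch; the matrix values are arbitrary, and the first argument tells CBLAS that the arrays are stored row-major, as is natural in C):

#include <stdio.h>
#include <cblas.h>

int main(void)
{
    /* row-major 2x2 matrices */
    double A[] = { 1.0, 2.0,
                   3.0, 4.0 };
    double B[] = { 5.0, 6.0,
                   7.0, 8.0 };
    double C[] = { 0.0, 0.0,
                   0.0, 0.0 };

    /* C <- 1.0*A*B + 0.0*C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

    for (int i = 0; i < 4; ++i)
        printf("%f\n", C[i]);
    return 0;
}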
Calling Fortran BLAS from C/C++
Even with the row-major/column-major storage (matrix transpose) issue,
it is fairly easy to call the Fortran BLAS from C or C++:
1. Find out the calling sequence for the routine. Man pages
for BLAS routines can be looked up on-line. Study the
BLAS quick reference
http://www.netlib.org/blas/blasqr.pdf or
http://www.netlib.org/lapack/lug/node145.html
2. Create a prototype in your C/C++ code for the function.
The function name should match the Fortran function
name in lower-case and have an appended underscore.
3. Optionally create an interface function to make handling
the differences between how C and Fortran treat
arguments easier.
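For step 2, the prototype for, e.g., the DSCAL routine used a few slides below might look like this (a sketch; since Fortran passes everything by reference, every argument becomes a pointer):

/* Fortran DSCAL: x <- alpha*x
   lower-case name, trailing underscore, pointer arguments throughout */
extern void dscal_(const int *n, const double *alpha,
                   double *x, const int *incx);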
Function call: DGEMM() from C
DGEMM() performs the operation
C ← αop(A)op(B) + βC
where op(A) can either be A or Aᵀ. The call is
DGEMM(TRANSA,TRANSB,M,N,K,ALPHA,A,LDA,B,LDB,BETA,C,LDC)
TRANSA, TRANSB: (CHAR*1) ’T’ (or ’C’) for transposed, ’N’ for not transposed
M: INTEGER number of rows of op(A) and of C
N: INTEGER number of columns of op(B) and of C
K: INTEGER number of columns of op(A) and rows of op(B)
ALPHA, BETA: DOUBLE PRECISION scale factors for op(A)op(B) and for C
LDA, LDB, LDC: INTEGER leading dimensions of A, B, and C
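As a hedged sketch (not from the original slides) of what such a call looks like in C: the arrays are laid out column by column because they are handed straight to Fortran, and the single-character options are passed as C strings, which works with the usual Fortran calling conventions on Linux.

#include <stdio.h>

/* assumed prototype following the lower-case-plus-underscore convention */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

int main(void)
{
    /* 2x2 matrices stored column by column (Fortran order) */
    double A[] = { 1.0, 3.0,    /* first column  */
                   2.0, 4.0 };  /* second column */
    double B[] = { 5.0, 7.0,
                   6.0, 8.0 };
    double C[] = { 0.0, 0.0,
                   0.0, 0.0 };
    int m = 2, n = 2, k = 2, lda = 2, ldb = 2, ldc = 2;
    double alpha = 1.0, beta = 0.0;

    /* C <- alpha*A*B + beta*C */
    dgemm_("N", "N", &m, &n, &k, &alpha, A, &lda, B, &ldb, &beta, C, &ldc);

    /* print C column by column */
    for (int i = 0; i < 4; ++i)
        printf("%f\n", C[i]);

    return 0;
}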
BLAS implementations
refblas
Reference implementation from netlib (C, Fortran)
ACML
AMD Core Math Library, for AMD Athlon and Opteron
ESSL
IBM's Engineering and Scientific Subroutine Library, for PowerPC architectures
HP MLIB
HP's math library, for Itanium, PA-RISC, x86 and Opteron under HP-UX and Linux.
XBLAS
Extra Precise Basic Linear Algebra Subroutines
CUBLAS
Nvidia implementation of BLAS for CUDA (GPGPUs)
Parallel BLAS
PBLAS
Distributed memory version of the Level 1,2 and 3 BLAS
Performs message-passing with similar interface to BLAS
Provides distributed data layouts. Matrices and vectors
can have block distribution or block-cyclic distribution.
PBLACS Basic Linear Algebra Communication Subprograms
Message-passing library designed for linear algebra.
The computational model consists of a one- or
two-dimensional process grid distribution
Each process stores pieces of the matrices and vectors
Permits a process to be a member of several overlapping
or disjoint process grids, each one labeled by a context
MPI calls this context a communicator
LAPACK
The Linear Algebra PACKage :
Development started in 1989, building algorithms for solving
linear systems
least-squares solutions
eigenvalue problems
Successor to both LINPACK and EISPACK
(linear algebra packages for vector machines from the 1970s)
LINPACK and EISPACK make use of the level 1 BLAS
LAPACK largely replaces LINPACK and EISPACK but
takes advantage of level 2 and level 3 BLAS for more
efficient operation on computers with hierarchical memory
and shared memory multiprocessors.
Latest release: LAPACK version 3.5.0
Updated: November 16, 2013
LAPACK facts
One of the LINPACK authors, Cleve Moler, went on to
write an interactive, user-friendly front end to these
libraries called MATLAB.
LAPACK is written in FORTRAN 90, so calling from C/C++
has the same challenges as discussed for the BLAS.
A version called CLAPACK exists that can be more easily
called from C/C++ but still has the column-major matrix
access issue.
LAPACK homepage
http://www.netlib.org/lapack
LAPACK user guide
http://www.netlib.org/lapack/lug/
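As an illustration (a sketch, not from the original slides): solving a small linear system with LAPACK's DGESV from C, using the same lower-case name plus trailing underscore and column-major storage as for the Fortran BLAS:

#include <stdio.h>

/* LAPACK DGESV: solves A*X = B via LU factorization with pivoting */
extern void dgesv_(const int *n, const int *nrhs, double *a, const int *lda,
                   int *ipiv, double *b, const int *ldb, int *info);

int main(void)
{
    /* column-major 2x2 matrix A and right-hand side b */
    double A[] = { 2.0, 1.0,    /* first column  */
                   1.0, 3.0 };  /* second column */
    double b[] = { 3.0, 5.0 };
    int n = 2, nrhs = 1, lda = 2, ldb = 2, ipiv[2], info;

    dgesv_(&n, &nrhs, A, &lda, ipiv, b, &ldb, &info);

    if (info == 0)
        printf("x = (%f, %f)\n", b[0], b[1]);   /* solution overwrites b */
    else
        printf("dgesv_ failed, info = %d\n", info);
    return 0;
}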
LAPACK functionalities
General: general (GE), banded (GB), pair (GG), tridiagonal (GT)
Symmetric: general (SY), banded (SB), packed (SP), tridiagonal (ST)
Hermitian: general (HE), banded (HB), packed (HP)
Positive definite (PO), packed (PP), tridiagonal (PT)
Orthogonal (OR), orthogonal packed (OP)
Unitary (UN), unitary packed (UP)
Hessenberg (HS), Hessenberg pair (HG)
Triangular (TR), packed (TP), banded (TB), pair (TG),
Bidiagonal (BD)
ScaLAPACK
Scalable Linear Algebra PACKage
High-Performance Linear Algebra routines
Designed for Parallel Distributed Memory Machines
(PDMM)
Driven by
Univ. Tennessee, Univ. Berkeley, Univ. Colorado Denver
NAG Ltd.
Jack Dongarra
The man behind
EISPACK, LINPACK, BLAS, LAPACK, ScaLAPACK, Netlib,
PVM, MPI
ScaLAPACK properties
Library includes a subset of LAPACK routines redesigned
for distributed memory computing
Uses explicit message passing for interprocessor
communication.
Designed for heterogeneous computing and portable to any
computer that supports MPI or PVM.
Matrices are laid out in a two-dimensional block-cyclic
decomposition.
ScaLAPACK is based on PBLAS and PBLACS
ScaLAPACK homepage
http://www.netlib.org/scalapack/index.htm
ATLAS
Automatically Tuned Linear Algebra Software
consists of the BLAS and some routines from LAPACK
The key to getting high performance from the BLAS (and
hence from LAPACK) is using BLAS routines that are
tuned to each particular machine’s architecture and
compiler.
Vendors of HPC equipment typically supply a BLAS library
that has been optimized for their machines.
The ATLAS project was created to provide similar support
for HPC equipment built from commodity hardware (e.g.
Beowulf clusters).
Atlas homepage
http://math-atlas.sourceforge.net/
ATLAS performance example
The graph below shows GFLOPS vs matrix dimension for
both the GSL (GNU Scientific Library) BLAS and the ATLAS
BLAS supplied with Ubuntu 10.04 and running on a Lenovo
T410 laptop. We see close to a 6-fold increase in
performance for the larger matrix sizes.
Using BLAS library from ATLAS
Why?
Because it contains a parallel, automatically optimized
BLAS library.
It also provides a nice C interface to BLAS.
Keep in mind that the BLAS routines are written in Fortran.
There are two ways of calling them:
Calling Fortran from C
Using CBLAS
Calling Fortran from C
#include <stdio.h>

/* prototype for the Fortran BLAS routine DSCAL (note the trailing underscore) */
void dscal_(const int *n, const double *alpha, double *x, const int *incx);

int main()
{
  int i, n, one;
  double coefficient;
  double x[] = { 1, 1, 1 };

  coefficient = 4.323;
  one = 1;
  n = 3;

  /* x <- coefficient * x */
  dscal_(&n, &coefficient, &x[0], &one);

  for (i = 0; i < 3; ++i)
    printf("%f\n", x[i]);

  return 0;
}
Calling Fortran from C
The Fortran compiler appends an underscore to subroutine
names (the Fortran counterpart of functions).
Since BLAS is written in Fortran, to call the subroutine
dscal() from C one must write dscal_().
C passes arguments by value whereas Fortran passes
arguments by reference.
So to call a Fortran subroutine from C one must pass the
memory addresses of the arguments.
[user@gauss]$ gcc blasTest.c -o blasTest -L /usr/lib64/atlas -lf77blas
[user@gauss]$ ./blasTest
4.323000
4.323000
4.323000
Using the C interface CBLAS
#include <stdio.h>
#include <cblas.h>

int main ()
{
  int i;
  double x[] = { 1, 1, 1 };

  cblas_dscal(3, 4.323, x, 1);

  for (i = 0; i < 3; ++i)
    printf ("%f\n", x[i]);

  return 0;
}
Using the C interface CBLAS
Here we use the cblas.h header file.
Thus the main difference here is that we use a “cblas_”
prefix instead of an “_” suffix.
That allows us to pass the arguments by value rather than
by reference.
To compile this we link against the cblas library instead of
the blas library.
[user@gauss]$ gcc blasTest.c -o blasTest -L /usr/lib64/atlas -lcblas -latlas
[user@gauss]$ ./blasTest
4.323000
4.323000
4.323000
Using the C interface CBLAS
#include <cstdlib>
#include <vector>
#include <iostream>

extern "C"
{
#include <cblas.h>
}

using namespace std;

int main ()
{
  int i;
  vector<double> x(3, 1);

  cblas_dscal (x.size(), 4.323, &x[0], 1);

  for (i = 0; i < x.size(); ++i)
    cout << x[i] << endl;

  return 0;
}
The advantage of the vector class over a C array is that
one no longer needs an extra variable to store the length
of the array.
Example | Matrix-matrix multiplication
Rowwise matrix-matrix multiplication
The rowwise matrix-matrix product for C = AB is computed
with the pseudo-code below.
for ( i = 0; i < N; i++ ) {
  for ( j = 0; j < N; j++ ) {
    c[i][j] = 0.0;
    for ( k = 0; k < N; k++ ) {
      c[i][j] = c[i][j] + a[i][k]*b[k][j];
    }
  }
}
Notice that both A and C are accessed by rows in the
innermost loop while B is accessed by columns.
Example | Matrix-matrix multiplication
The graph shows GFLOPS vs matrix dimension for a
rowwise matrix-matrix product.
The performance drop-off suggests that all of B can fit in
the processor’s cache when N=300 but not when N=400.
A 300 × 300 matrix using double precision floating point values consumes approximately 700 KB.
A 400 × 400 matrix needs at least 1250 KB.
The cache size of the machine used is 2048 KB.
Key idea: minimize cache misses
If we partition A and B into appropriately sized blocks
A = [ A00  A01 ]      B = [ B00  B01 ]
    [ A10  A11 ]          [ B10  B11 ]

then the product C = AB can be computed as

C = [ C00  C01 ] = [ A00 B00 + A01 B10   A00 B01 + A01 B11 ]
    [ C10  C11 ]   [ A10 B00 + A11 B10   A10 B01 + A11 B11 ]

where each of the matrix-matrix products involves matrices of
roughly 1/4 the size of the full matrices.
Recursion
We can exploit this idea quite easily with a recursive
algorithm.
matrix function C = mult(A, B)
  if (size(B) > threshold) then
    C00 = mult(A00, B00) + mult(A01, B10)
    C01 = mult(A00, B01) + mult(A01, B11)
    C10 = mult(A10, B00) + mult(A11, B10)
    C11 = mult(A10, B01) + mult(A11, B11)
  else
    C = AB
  endif
  return C
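A possible C realization of this recursion (a sketch under simplifying assumptions: square matrices, n a power of two, row-major storage with leading dimension ld, and C zero-initialized by the caller; the threshold value is arbitrary):

#define THRESHOLD 64   /* below this size, fall back to the plain triple loop */

/* C += A*B for n-by-n blocks stored row-major with leading dimension ld */
void mult(int n, int ld, const double *A, const double *B, double *C)
{
    if (n <= THRESHOLD) {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
    } else {
        int h = n / 2;   /* split each matrix into four h-by-h blocks */
        const double *A00 = A,        *A01 = A + h,
                     *A10 = A + h*ld, *A11 = A + h*ld + h;
        const double *B00 = B,        *B01 = B + h,
                     *B10 = B + h*ld, *B11 = B + h*ld + h;
        double       *C00 = C,        *C01 = C + h,
                     *C10 = C + h*ld, *C11 = C + h*ld + h;

        mult(h, ld, A00, B00, C00);  mult(h, ld, A01, B10, C00);  /* C00 */
        mult(h, ld, A00, B01, C01);  mult(h, ld, A01, B11, C01);  /* C01 */
        mult(h, ld, A10, B00, C10);  mult(h, ld, A11, B10, C10);  /* C10 */
        mult(h, ld, A10, B01, C11);  mult(h, ld, A11, B11, C11);  /* C11 */
    }
}

At the top level one would call, e.g., mult(n, n, a, b, c) with c zeroed beforehand; the recursion keeps the working set of the innermost loops small enough to stay in cache.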
Recursive block-oriented matrix-matrix product
The graph below shows GFLOPS vs matrix dimension for
both a rowwise product and a recursive block-oriented
matrix-matrix product.
Performance is improved
across the board,
including for smaller
matrix sizes
Performance is relatively
flat - no degradation as
matrix sizes increase.
Comparison with BLAS
Finally, we compare the results just shown with similar
results using BLAS, i.e.
– Default BLAS supplied
with Ubuntu 12.04
– ATLAS BLAS
Obviously, the ATLAS
BLAS performance is far
superior to either
– the standard BLAS or
– the recursive block
routine
PETSc
Portable, Extensible Toolkit for Scientific computation
A suite of data structures and routines for the scalable
(parallel) solution of scientific applications modelled by partial
differential equations.
Supported interfaces/hardware
MPI,
shared memory pthreads
NVIDIA GPUs
hybrid MPI-shared memory pthreads
MPI-GPU parallelism.
What is PETSc
A freely available and supported research code
Available via http://www.mcs.anl.gov/petsc
Free for everyone, including industrial users
Hyperlinked documentation and manual pages for all
routines
Many tutorial-style examples
Support via email: petsc-maint@mcs.anl.gov
Usable from Fortran 77/90, C, and C++
Portable to any parallel system supporting MPI, including
Tightly coupled systems:
Cray T3E, SGI Origin, IBM SP, HP 9000, Sun Enterprise
Loosely coupled systems, e.g., networks of workstations
Compaq, HP, IBM, SGI, Sun PCs running Linux or Windows
PETSc | facts
Developed by
Argonne National Laboratory USA
Aim
Tools for the scalable (parallel) solution of scientific
applications modelled by partial differential equations.
Languages
C (main language), C++, Fortran, Python
Built on
BLAS, LAPACK, MPI
Latest version
PETSc 3.6; released June 9, 2015.
Homepage
http://www.mcs.anl.gov/petsc/
PETSc’s Role
Developing parallel, nontrivial PDE solvers that deliver
high performance is still difficult and requires months (or
even years) of concentrated effort.
PETSc is a toolkit that can ease these difficulties and
reduce the development time, but it is not a black-box
PDE solver, nor a silver bullet.
Barry Smith, Argonne National Laboratory
Main developer of PETSc
PETSc | features
Parallel vectors
Parallel matrices
Scalable parallel preconditioners
Krylov subspace methods
Parallel Newton-based nonlinear solvers
Parallel timestepping (ODE) solvers
Support for Nvidia GPU cards
Complete documentation
Automatic profiling of floating point and memory usage
Consistent user interface
Portable to UNIX and Windows
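To give a flavour of the programming interface, here is a minimal sketch (not from the slides; it follows the usual PETSc vector tutorial pattern, with error checking omitted for brevity): it creates a parallel vector, fills it with ones and prints its 2-norm.

#include <petscvec.h>

int main(int argc, char **argv)
{
    Vec       x;
    PetscReal norm;
    PetscInt  n = 100;

    PetscInitialize(&argc, &argv, NULL, NULL);

    /* parallel vector of global size n; PETSc decides the local sizes */
    VecCreate(PETSC_COMM_WORLD, &x);
    VecSetSizes(x, PETSC_DECIDE, n);
    VecSetFromOptions(x);

    VecSet(x, 1.0);               /* set every entry to 1      */
    VecNorm(x, NORM_2, &norm);    /* parallel 2-norm reduction */

    PetscPrintf(PETSC_COMM_WORLD, "||x||_2 = %g\n", (double)norm);

    VecDestroy(&x);
    PetscFinalize();
    return 0;
}

Compiled against PETSc and launched under MPI (e.g. mpiexec -n 4 ./vecTest, with vecTest a hypothetical executable name), the same source runs unchanged on one or many processes.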