SPIKE algorithm

From Wikipedia, the free encyclopedia
Jump to navigation Jump to search

The SPIKE algorithm is a hybrid parallel solver for banded linear systems developed by Eric Polizzi and Ahmed Sameh[1]^ [2]

Overview

[edit | edit source]

The SPIKE algorithm deals with a linear system AX = F, where A is a banded n×n matrix of bandwidth much less than n, and F is an n×s matrix containing s right-hand sides. It is divided into a preprocessing stage and a postprocessing stage.

Preprocessing stage

[edit | edit source]

In the preprocessing stage, the linear system AX = F is partitioned into a block tridiagonal form

[𝑨1𝑩1𝑪2𝑨2𝑩2𝑪p1𝑨p1𝑩p1𝑪p𝑨p][𝑿1𝑿2𝑿p1𝑿p]=[𝑭1𝑭2𝑭p1𝑭p].

Assume, for the time being, that the diagonal blocks Aj (j = 1,...,p with p ≥ 2) are nonsingular. Define a block diagonal matrix

D = diag(A1,...,Ap),

then D is also nonsingular. Left-multiplying D−1 to both sides of the system gives

[𝑰𝑽1𝑾2𝑰𝑽2𝑾p1𝑰𝑽p1𝑾p𝑰][𝑿1𝑿2𝑿p1𝑿p]=[𝑮1𝑮2𝑮p1𝑮p],

which is to be solved in the postprocessing stage. Left-multiplication by D−1 is equivalent to solving p systems of the form

Aj[Vj Wj Gj] = [Bj Cj Fj]

(omitting W1 and C1 for j=1, and Vp and Bp for j=p), which can be carried out in parallel.

Due to the banded nature of A, only a few leftmost columns of each Vj and a few rightmost columns of each Wj can be nonzero. These columns are called the spikes.

Postprocessing stage

[edit | edit source]

Without loss of generality, assume that each spike contains exactly m columns (m is much less than n) (pad the spike with columns of zeroes if necessary). Partition the spikes in all Vj and Wj into

[𝑽j(t)𝑽j𝑽j(b)] and [𝑾j(t)𝑾j𝑾j(b)]

where V (t)
j
 
, V (b)
j
 
, W (t)
j
 
and W (b)
j
 
are of dimensions m×m. Partition similarly all Xj and Gj into

[𝑿j(t)𝑿j𝑿j(b)] and [𝑮j(t)𝑮j𝑮j(b)].

Notice that the system produced by the preprocessing stage can be reduced to a block pentadiagonal system of much smaller size (recall that m is much less than n)

[𝑰m0𝑽1(t)0𝑰m𝑽1(b)00𝑾2(t)𝑰m0𝑽2(t)𝑾2(b)0𝑰m𝑽2(b)00𝑾p1(t)𝑰m0𝑽p1(t)𝑾p1(b)0𝑰m𝑽p1(b)00𝑾p(t)𝑰m0𝑾p(b)0𝑰m][𝑿1(t)𝑿1(b)𝑿2(t)𝑿2(b)𝑿p1(t)𝑿p1(b)𝑿p(t)𝑿p(b)]=[𝑮1(t)𝑮1(b)𝑮2(t)𝑮2(b)𝑮p1(t)𝑮p1(b)𝑮p(t)𝑮p(b)],

which we call the reduced system and denote by S̃X̃ = .

Once all X (t)
j
 
and X (b)
j
 
are found, all Xj can be recovered with perfect parallelism via

{𝑿1=𝑮1𝑽1𝑿2(t),𝑿j=𝑮j𝑽j𝑿j+1(t)𝑾j𝑿j1(b),j=2,,p1,𝑿p=𝑮p𝑾p𝑿p1(b).

SPIKE as a polyalgorithmic banded linear system solver

[edit | edit source]

Despite being logically divided into two stages, computationally, the SPIKE algorithm comprises three stages:

  1. factorizing the diagonal blocks,
  2. computing the spikes,
  3. solving the reduced system.

Each of these stages can be accomplished in several ways, allowing a multitude of variants. Two notable variants are the recursive SPIKE algorithm for non-diagonally-dominant cases and the truncated SPIKE algorithm for diagonally-dominant cases. Depending on the variant, a system can be solved either exactly or approximately. In the latter case, SPIKE is used as a preconditioner for iterative schemes like Krylov subspace methods and iterative refinement.

Recursive SPIKE

[edit | edit source]

Preprocessing stage

[edit | edit source]

The first step of the preprocessing stage is to factorize the diagonal blocks Aj. For numerical stability, one can use LAPACK's XGBTRF routines to LU factorize them with partial pivoting. Alternatively, one can also factorize them without partial pivoting but with a "diagonal boosting" strategy. The latter method tackles the issue of singular diagonal blocks.

In concrete terms, the diagonal boosting strategy is as follows. Let 0ε denote a configurable "machine zero". In each step of LU factorization, we require that the pivot satisfy the condition

|pivot| > 0εA1.

If the pivot does not satisfy the condition, it is then boosted by

pivot={pivot+ϵ𝑨j1if pivot0,pivotϵ𝑨j1if pivot<0

where ε is a positive parameter depending on the machine's unit roundoff, and the factorization continues with the boosted pivot. This can be achieved by modified versions of ScaLAPACK's XDBTRF routines. After the diagonal blocks are factorized, the spikes are computed and passed on to the postprocessing stage.

Postprocessing stage

[edit | edit source]
The two-partition case
[edit | edit source]

In the two-partition case, i.e., when p = 2, the reduced system S̃X̃ = has the form

[𝑰m0𝑽1(t)0𝑰m𝑽1(b)00𝑾2(t)𝑰m0𝑾2(b)0𝑰m][𝑿1(t)𝑿1(b)𝑿2(t)𝑿2(b)]=[𝑮1(t)𝑮1(b)𝑮2(t)𝑮2(b)].

An even smaller system can be extracted from the center:

[𝑰m𝑽1(b)𝑾2(t)𝑰m][𝑿1(b)𝑿2(t)]=[𝑮1(b)𝑮2(t)],

which can be solved using the block LU factorization

[𝑰m𝑽1(b)𝑾2(t)𝑰m]=[𝑰m𝑾2(t)𝑰m][𝑰m𝑽1(b)𝑰m𝑾2(t)𝑽1(b)].

Once X (b)
1
 
and X (t)
2
 
are found, X (t)
1
 
and X (b)
2
 
can be computed via

X (t)
1
 
= G (t)
1
 
V (t)
1
 
X (t)
2
 
,
X (b)
2
 
= G (b)
2
 
W (b)
2
 
X (b)
1
 
.
The multiple-partition case
[edit | edit source]

Assume that p is a power of two, i.e., p = 2d. Consider a block diagonal matrix

1 = diag( [1]
1
 
,..., [1]
p/2
 
)

where

𝑫~k[1]=[𝑰m0𝑽2k1(t)0𝑰m𝑽2k1(b)00𝑾2k(t)𝑰m0𝑾2k(b)0𝑰m]

for k = 1,...,p/2. Notice that 1 essentially consists of diagonal blocks of order 4m extracted from . Now we factorize as

= 12.

The new matrix 2 has the form

[𝑰3m0𝑽1[2](t)0𝑰m𝑽1[2](b)00𝑾2[2](t)𝑰m0𝑽2[2](t)𝑾2[2](b)0𝑰3m𝑽2[2](b)00𝑾p/21[2](t)𝑰3m0𝑽p/21[2](t)𝑾p/21[2](b)0𝑰m𝑽p/21[2](b)00𝑾p/2[2](t)𝑰m0𝑾p/2[2](b)0𝑰3m].

Its structure is very similar to that of 2, only differing in the number of spikes and their height (their width stays the same at m). Thus, a similar factorization step can be performed on 2 to produce

2 = 23

and

= 123.

Such factorization steps can be performed recursively. After d − 1 steps, we obtain the factorization

= 1d−1d,

where d has only two spikes. The reduced system will then be solved via

=  −1
d
 
 −1
d−1
 
 −1
1
 
.

The block LU factorization technique in the two-partition case can be used to handle the solving steps involving 1, ..., d−1 and d for they essentially solve multiple independent systems of generalized two-partition forms.

Generalization to cases where p is not a power of two is almost trivial.

Truncated SPIKE

[edit | edit source]

When A is diagonally-dominant, in the reduced system

[𝑰m0𝑽1(t)0𝑰m𝑽1(b)00𝑾2(t)𝑰m0𝑽2(t)𝑾2(b)0𝑰m𝑽2(b)00𝑾p1(t)𝑰m0𝑽p1(t)𝑾p1(b)0𝑰m𝑽p1(b)00𝑾p(t)𝑰m0𝑾p(b)0𝑰m][𝑿1(t)𝑿1(b)𝑿2(t)𝑿2(b)𝑿p1(t)𝑿p1(b)𝑿p(t)𝑿p(b)]=[𝑮1(t)𝑮1(b)𝑮2(t)𝑮2(b)𝑮p1(t)𝑮p1(b)𝑮p(t)𝑮p(b)],

the blocks V (t)
j
 
and W (b)
j
 
are often negligible. With them omitted, the reduced system becomes block diagonal

[𝑰m𝑰m𝑽1(b)𝑾2(t)𝑰m𝑰m𝑽2(b)𝑾p1(t)𝑰m𝑰m𝑽p1(b)𝑾p(t)𝑰m𝑰m][𝑿1(t)𝑿1(b)𝑿2(t)𝑿2(b)𝑿p1(t)𝑿p1(b)𝑿p(t)𝑿p(b)]=[𝑮1(t)𝑮1(b)𝑮2(t)𝑮2(b)𝑮p1(t)𝑮p1(b)𝑮p(t)𝑮p(b)]

and can be easily solved in parallel [3].

The truncated SPIKE algorithm can be wrapped inside some outer iterative scheme (e.g., BiCGSTAB or iterative refinement) to improve the accuracy of the solution.

SPIKE for tridiagonal systems

[edit | edit source]

The first SPIKE partitioning and algorithm was presented in [4] and was designed as the means to improve the stability properties of a parallel Givens rotations-based solver for tridiagonal systems. A version of the algorithm, termed g-Spike, that is based on serial Givens rotations applied independently on each block was designed for the NVIDIA GPU [5]. A SPIKE-based algorithm for the GPU that is based on a special block diagonal pivoting strategy is described in [6].

SPIKE as a preconditioner

[edit | edit source]

The SPIKE algorithm can also function as a preconditioner for iterative methods for solving linear systems. To solve a linear system Ax = b using a SPIKE-preconditioned iterative solver, one extracts center bands from A to form a banded preconditioner M and solves linear systems involving M in each iteration with the SPIKE algorithm.

In order for the preconditioner to be effective, row and/or column permutation is usually necessary to move "heavy" elements of A close to the diagonal so that they are covered by the preconditioner. This can be accomplished by computing the weighted spectral reordering of A.

The SPIKE algorithm can be generalized by not restricting the preconditioner to be strictly banded. In particular, the diagonal block in each partition can be a general matrix and thus handled by a direct general linear system solver rather than a banded solver. This enhances the preconditioner, and hence allows better chance of convergence and reduces the number of iterations.

Implementations

[edit | edit source]

Intel offers an implementation of the SPIKE algorithm under the name Intel Adaptive Spike-Based Solver [7]. Tridiagonal solvers have also been developed for the NVIDIA GPU [8][9] and the Xeon Phi co-processors. The method in [10] is the basis for a tridiagonal solver in the cuSPARSE library.[1] The Givens rotations based solver was also implemented for the GPU and the Intel Xeon Phi.[2]

References

[edit | edit source]
  1. ^ NVIDIA, Accessed October 28, 2014. CUDA Toolkit Documentation v. 6.5: cuSPARSE, http://docs.nvidia.com/cuda/cusparse.
  2. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  1. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  2. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  3. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  4. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  5. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  6. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  7. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  8. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).

Further reading

[edit | edit source]
  • Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).