Identifying which parts of the program can be executed in the same time.
Parts of code that have some computational dependency have to be executed sequentially.
Easy example:
for (i=0; i<n; i++) a[i] = 0;
More complex one:
for (i=0; i<n-1; i++) a[i] = a[i+1];
In this loop each new iteration depends on the previous one, so we can't do them in parallel. However, it can be decomposed in two loops, each of them having independent iterations.
for (i=0; i<n-1; i++) temp = a[i+1]; // temp is local to each thread for (i=0; i<n-1; i++) a[i] = temp;
Bernstein's Conditions
Bernstein (1966) determined sufficient conditions for processes or threads to be executed simultaneously.
Two processes P1 and P2 can be executed in the same time if the output of one does not intersect the input or output of the other.
Example: computing the sum of the elements of an array.
A task is defined as adding an element of the array to a local sum.
The pool of tasks is handled implicitly by a global index. There is no master thread, only workers.
Each partial sum can be computed independently. The only dependence is in the manipulation of the global index.
At the end we need to add all the local sums together.
// global variables
int sum=0, a[], size, global_index;
Worker() {
int local_sum = 0, local_index;
do {
lock(&mutex);
local_index = global_index;
global_index++;
unlock(&mutex);
if (local_index < size)
local_sum += a[local_index];
} while (local_index < size);
lock(&mutex);
sum += local_sum;
unlock(&mutex);
}
Both methods can be used in the process of solving a system of equations of the form Ax = b, where A is a matrix of n x n and x and b are vectors of size n.
Successive elimination of variables can be used to solve the system Ax = b very fast when the matrix A is triangular. This method is called Gaussian elimination.
The LU decomposition factors the matrix A as a product of upper and lower triangular matrices (U and L, respectively). The Cholesky is a particular case of LU where U is the transposed of L.
The Cholesky Decomposition
Given a symmetric positive definite matrix A, the Cholesky decomposition is a lower triangular matrix L such that A = L LT.
A matrix A is called positive definite if xTAx > 0 for every nonzero vector x.
It's also called computing the square root of the matrix A.
One of the fastest decompositions available, a special case of the LU decomposition.
Method
We start by computing for every column
Then the rest of the elements:
|
We notice that in the previous formula, computing an element depends on the diagonal element on the same column and on elements before it on the same line.
The best way to compute this is having the threads advance in the matrix parallel to the second diagonal.
Compute a Cell of L
float Compute_cell(i, j) {
x = a[i, j];
for (k=0; k<i && k<j; k++)
x -= a[i,k]*a[j,k];
if (i != j)
x = x/a[i,i];
else
x = sqrt(x);
return x;
}
Sequential Algorithm
Cholesky() {
for (i=0; i<n; i++)
for (j=i; j<n; j++)
a[i,j] = Compute_cell(i, j);
}
Parallel Algorithm
Cholesy_Worker(thread_nr) {
col = id;
for (d=0; d<2*n-1; d++) {
if (d >= 2*col && d < n + col) {
row = d - col;
a[row,col]=Compute_cell(row, col);
}
Barrier(thread_nr);
}
}
LU Decomposition
The LU decomposition factors the matrix A as a product of upper and lower triangular matrices (U and L, respectively).
Then the system can be solved the following way:
Sequential Algorithm
The elements on the diagonal of L can be chosen to be 1. This way we can store both L and U in the matrix A itself.
We don't need the previous rows in the computation. L[i,i] = 1.
Sequential Algorithm
for j = 1 to n: for i = 1 to j:
for i = j+1 to n:
Implementation of the Cholesky decomposition using MPI with two
examples of matrices that can be used as input.:
cholesky_master.cc
cholesky_slave.cc
common.cc
main.cc
cholesky_master.h
cholesky_slave.h
common.h
Makefile
Matrix.txt
Matrix2.txt