B424 Dependency Analysis

B424 Parallel and Distributed Programming

Dependency Analysis

Identifying which parts of the program can be executed in the same time.

Parts of code that have some computational dependency have to be executed sequentially.

Easy example:

for (i=0; i<n; i++)
  a[i] = 0;

More complex one:

for (i=0; i<n-1; i++)
  a[i] = a[i+1];

In this loop each new iteration depends on the previous one, so we can't do them in parallel. However, it can be decomposed in two loops, each of them having independent iterations.

for (i=0; i<n-1; i++)
  temp = a[i+1]; // temp is local to each thread
for (i=0; i<n-1; i++)
  a[i] = temp;

Bernstein's Conditions

Bernstein (1966) determined sufficient conditions for processes or threads to be executed simultaneously.

I_i - the set of memory locations read by process P_i.
O_i - the set of memory locations altered by process P_i.

Two processes P₁ and P₂ can be executed in the same time if the output of one does not intersect the input or output of the other.

Multi-Threaded Pool of Tasks

Example: computing the sum of the elements of an array.

A task is defined as adding an element of the array to a local sum.

The pool of tasks is handled implicitly by a global index. There is no master thread, only workers.

Each partial sum can be computed independently. The only dependence is in the manipulation of the global index.

At the end we need to add all the local sums together.

// global variables
int sum=0, a[], size, global_index;

Worker() {
  int local_sum = 0, local_index;
  do {
    lock(&mutex);
    local_index = global_index;
    global_index++;
    unlock(&mutex);
    if (local_index < size)
      local_sum += a[local_index];
  } while (local_index < size);
  lock(&mutex);
  sum += local_sum;
  unlock(&mutex);
}

LU and Cholesky Decompositions

Both methods can be used in the process of solving a system of equations of the form Ax = b, where A is a matrix of n x n and x and b are vectors of size n.

Successive elimination of variables can be used to solve the system Ax = b very fast when the matrix A is triangular. This method is called Gaussian elimination.

The LU decomposition factors the matrix A as a product of upper and lower triangular matrices (U and L, respectively). The Cholesky is a particular case of LU where U is the transposed of L.

The Cholesky Decomposition

Given a symmetric positive definite matrix A, the Cholesky decomposition is a lower triangular matrix L such that A = L L^T.

A matrix A is called positive definite if x^TAx > 0 for every nonzero vector x.

It's also called computing the square root of the matrix A.

One of the fastest decompositions available, a special case of the LU decomposition.

Method

We start by computing for every column

Then the rest of the elements:

We notice that in the previous formula, computing an element depends on the diagonal element on the same column and on elements before it on the same line.

The best way to compute this is having the threads advance in the matrix parallel to the second diagonal.

Compute a Cell of L

float Compute_cell(i, j) {
  x = a[i, j];
  for (k=0; k<i && k<j; k++)
    x -= a[i,k]*a[j,k];
  if (i != j)
    x = x/a[i,i];
  else
    x = sqrt(x);
  return x;
}

Sequential Algorithm

Cholesky() {
  for (i=0; i<n; i++)
    for (j=i; j<n; j++)
      a[i,j] = Compute_cell(i, j);
}

Parallel Algorithm

Cholesy_Worker(thread_nr) {
  col = id;
  for (d=0; d<2*n-1; d++) {
    if (d >= 2*col && d < n + col) {
      row = d - col;
      a[row,col]=Compute_cell(row, col);
    }
    Barrier(thread_nr);
  }
}

LU Decomposition

The LU decomposition factors the matrix A as a product of upper and lower triangular matrices (U and L, respectively).

Then the system can be solved the following way:

Solve Ly = b, where y is a vector,
Solve Ux = y, where x is the solution to the initial equation.

Sequential Algorithm

The elements on the diagonal of L can be chosen to be 1. This way we can store both L and U in the matrix A itself.

We don't need the previous rows in the computation. L[i,i] = 1.

Sequential Algorithm

for j = 1 to n:
  for i = 1 to j:

  for i = j+1 to n:

Implementation of the Cholesky decomposition using MPI with two examples of matrices that can be used as input.:
cholesky_master.cc
cholesky_slave.cc
common.cc
main.cc
cholesky_master.h
cholesky_slave.h
common.h
Makefile
Matrix.txt
Matrix2.txt