Basic Parallel Transformations used by OCA Recommendations

 

Below we describe the various transformations that are used by OCA when recommending to parallelize a given loop. As seen in the introduction, these transformations are referred to by the comments that are attached to the statements in the loop body.

 

Contents

Scalar Privatization. 2

Array Copy of initial values. 3

Handling break and return statements in OMP loops: 4

Handling loops with unknown boundaries prior to loop invocation: 8

Loop distribution with Scalar Expansion. 9

OMP Scalar Reduction and Parallel Recurrence. 13

OCA recommendations based on Helper Threads solutions  17

 

 

 

Scalar Privatization

 

For loops that contain external scalar variables declared outside the loop body and used (both read from and written to) only inside the loop (or written to before being used again after the loop completes), it is possible to declare them as private using the omp private() directive, provided they are overwritten before being used in every iteration.

The private() directive directs the OpenMP runtime to create a private copy of the variable in each iteration.

 

For example, the following simple loop contains the external scalar variable x and the loop index i - both defined outside the loop body and are only used temporarily in the loop – thus preventing a straightforward parallelization:

 

double x;

double A[100], B[100], C[100];

int i;

 

for (i=0; i<100; i++)

      {

        x = A[i] + B[i];

        C[i] = x + 1/x;

      }

      

 

If we know that the value of x after the loop execution is not used anymore we can simply use the private directive of OpenMP as follows:

 

      double x;

double A[100], B[100], C[100];

int i;

 

      #pragma omp parallel for private (i,x)

      for (i=0; i<100; i++)

      {

 

            x = A[i] + B[i];

            C[i] = x + 1/x;

      }    

 

In most cases, even if the external scalar that is written to in the loop is then read after the loop completes, it is still possible to privatize it, provided we use the “lastprivate” directive of OpenMP.

 

Array Copy of initial values

 

For loops that write into array elements which may conflict when being parallelized, it is sometimes possible to eliminate the conflicts by defining an auxiliary array that is initialized to the initial values of the original array.

This is possible if each array element that is written to, accesses only initial values.

 

For example, the following loop cannot be parallelized as is since it will create conflicts between iteration i that reads from array[i+C] and iteration i+C that tries to write into it simultaneously:

 

int array[100];  

. . .

for (int i=0; i<100; i++)

      {

            array[i] += array[i+C];

      }

 

 

To resolve this type of read/write conflicts a new array called init_array[] is defined, initialized in a separate loop and used to read from instead of the original array as follows:

 

 int init_array[100];

 

       #pragma omp parallel for private(i)

       for (int i=0; i<100; i++)

       {

            init_array[i] = array[i];

       }

 

      #pragma omp parallel for private(i)

      for (int i=0; i<100; i++)

      {

            array[i] = init_array[i] + init_array[i+C];

      }

 

 

 

 

Handling break and return statements in OMP loops:

 

OMP does not support break or return statements in parallel loops, thus causing a compilation error. To enable them still in parallel loops requires some changes to the loop.

 

For break statements we differentiate between two types: “weak break” vs. “strong break”. In the case of a weak break, the iterations that proceed the breaking statement can still execute without causing a functional error. In such a case the break statement is merely regarded for performance reasons to avoid redundant computation.

However, after a strong break, no iterations are allowed as it can cause a functional error and may generate wrong output.

A return statement usually equals a “strong” break statement.

 

See below a pseudo sequential loop containing a “weak” break as the correctness is not affected if the loop continues even after the break has occurred since it searches for any element of the a array that contains a zero:

 

 

       for (i=0; i < N; i++)

       {                                

            if (a[i] == 0) {

               found_zero_occurence = true; 

         break;  // found a zero occurrence in a[]  

      }     

       }

 

 

Unfortunately, the corresponding an OMP loop is not allowed to contain a break statement:   

 

 

   #pragma omp parallel for private(i)        

   for (i=0; i < N; i++)

   {                                

      if (a[i] == 0) {

       found_zero_occurence = true;   

       break; // ß this break statement produces compilation error

}

   }

 

 

To resolve the above case of “weak” break statement in the OMP loop it is possible to “emulate” it by adding a condition to the loop boundary as follows:

 

 

    #define P                (omp_get_num_threads())

    #define CHUNK_I          (k * N / P)

    #define CHUNK_I_PLUS_1   ((k+1)* N / P)

 

   

    int k;

    int icond = -1;

    #pragma omp parallel for private(k)

    for (k=0; k < P; k++)

    {

      int i;

 

        for(i = CHUNK_I; i < CHUNK_I_PLUS_1 && icond < 0; i++){

           

             if (a[i] == 0) {

 found_zero_occurence = true;

               icond = i;

            }

        }

     }

 

 

However, finding the first zero occurrence in an array is more complicated as we need to make sure that each parallel chunk is fully executed before being able to determine which is the minimal index holding the zero occurrence.

See the proposed solution below for the first zero occurrence problem.

Note that we define a constant “CHUNK_SIZE” which is predefined according to the optimal chunk size that is given to each omp thread which gives the best speedup:

    #define MAX_P            (omp_get_max_threads())

    #define P                (omp_get_num_threads())

    #define CHUNK_SIZE       4

 

    int k;   

    int icond = N;

 

    for (k=0; k < N; k += P*CHUNK_SIZE)

    {

      int i;           

      icond = N;

 

       #pragma omp parallel for private(i) reduction(min:icond)

       for(i = k; i < k+P*CHUNK_SIZE; i++){

           if (arr[i] == 0) {

             icond = i;

           }  

       }

 

      if (icond < N)

        break;

    }

 

 

Note also that in some cases we need to guard parts of the loop body as they could compute illegal results. See the example below where the statement “b[i] /= a[i];should be wrapped by a check that a[i] is not zero in the above solution of “strong” break in OMP loop:

 

for (i=0; i < N; i++)

      {      

                          

            if (a[i] == 0) {

         break;   // “strong” break

      }

 

      b[i] /= a[i]; // this loop part is not allowed after break

      }

 

      

 

Handling loops with unknown boundaries prior to loop invocation:

 

 

Loops which use a termination condition expression that includes a variable (that is not the loop iterator) which is modified in the loop body, cannot be parallelized via the OMP pragma.

For example, while loops are hard to parallelize for this reason.

Consider the following example:

 while (a[i] != 0) {calc(a[i]); i++;} // undetermined terminating condition

 

To resolve this one must use an external sequential for loop that iterates over the chunks and an inner OMP parallel loop that iterates inside the chunk along with corresponding break statements as follows:

 

   #define N (omp_get_num_threads() * CONST) // CONST is provided by the user

   boolean is_break = false;

 

   while (a[i] != 0) {

 

for (j=0;; j += N) {    // non terminating condition here and on given fixed value N of step size.

      is_break = false;

 

#pragma omp parallel for private(i)

for (i=j; i < j+N; i++) {

 

  if (a[i] != 0) {

    is_break = true;    // strong break OMP emulation needed

    break; 

  }

 

  calc(a[i]);

}

 

if (is_break)

   break;

}

   }

 

Loop distribution with Scalar Expansion

 

Loops with cross-iteration dependencies resulting between different arrays may be resolved by splitting the array into multiple separate loops.

 

For example, the following loop contains cross-iteration dependencies resulting from cross-array references of different arrays:

for (i=1; i < n; i++) {

      X[i] = Y[i];

      Y[i] = X[i-1];

      Z[i] = Y[i-1];

}

 

 

The above loop can be distributed to three separate loops – each is parallelizable, as follows:

 

#pragma omp parallel for private(i)

for (i=1; i < n; i++) {

      X[i] = Y[i];

}

 

#pragma omp parallel for private(i)

for (i=1; i < n; i++) {

      Y[i] = X[i-1];

}

 

#pragma omp parallel for private(i)

for (i=1; i < n; i++) {

      Z[i] = Y[i-1];

}

 

 

After splitting a loop, all scalar variables modified by one loop and then read from by a consecutive loop, requires us to apply scalar expansion into an array for them as shown in the example below:

 

int arr[N];

int i;

int x = 0;

       

for (i=1; i < N; i++)   

{

     x = A[i] + B[i];           

     C[i] = A[i-1];

     if (x != 0)           

        D[i] = isAboveThreshold(C[i-1]) + x + 1/x;

}

 

The above loop is distributed into 2 loops and the x scalar is expanded into an array as follows:

      double x = 0;

double A[100], B[100], C[100];

 

double expanded_x[100];

 

      #pragma omp parallel for private(i,x)

      for (int i=1; i<100; i++)

      {

expanded_x[i] = A[i] + B[i];

C[i] = A[i-1];

      }

 

      #pragma omp parallel for private(i)

      for (int i=1; i<100; i++)

      {

if(expanded_x[i] != 0)

              D[i] = isAboveThreshold(C[i-1] + expanded_x[i] + 1/expanded_x[i]);

      }

 

Note however, that scalar expansion may sometimes require us to use more than one auxiliary array in the case where the external scalar may not be updated in every iteration of the loop. Consider the following example:

 

        int arr[N];

 int i,j;

        int x = 0;

       

        for (i=0; i < N; i++)

        {

                arr[i] = i;

                if (i < N/2)

                  x = i;

                arr[x] = i;

        }

        return x;

 

 

In the above example, the x scalar variable is stopped being updated at the N/2 iteration. Consequently, after the loop completes, only the last updated value of the x scalar needs to be returned.

For this general case we need to use another auxiliary array called x_iteration_update[] and make sure to set it to the iteration number every time we update the expanded_x[] element.

int arr[N];

int x = 0;

 

// scalar expansion for x:

//

int expanded_x[N];

int x_updated_iteration[N];

 

// init auxiliary arrays for x:

      #pragma omp parallel for private(i)         

      for (i=0; i < N; i++)

      {

            expanded_x[i] = x;

            x_updated_iteration[i] = 0;

      }

     

      #pragma omp parallel for private(i)

      for (i=0; i < N; i++)

      {

           

            arr[i] = i;

 

            if (i < N/2) {

              expanded_x[i] = i;

              x_updated_iteration[i] = i;

            }

           

            arr[expanded_x[i]] = i;

      }  

 

      // find highest iteration number where expanded_x was modified:

//

      int max_val = 0;

 

      #pragma omp parallel for private(i) reduction (max:max_val)

      for (i=0; i < N; i++)

      {    

            if (x_updated_iteration[i] > max_val)

       {

                 max_val = x_updated_iteration[i];   

       }

      }

      x = expanded_x[max_val];

OMP Scalar Reduction and Parallel Recurrence

 

The OMP scalar reduction uses a scalar representation to extract the parallelism when backward cross iterations exist, provided the binary operation used to update the scalar variable is associative.

 

Consider the following code that searches for the maximal value in an array:

 

    double arr[N];

    double max_val=0.0;

    int i;

   

    for( i=0; i<N; i++)

    {

        if(arr[i] > max_val)

        {

            max_val = arr[i];  

        }

    }

  

    printf("\nmax_val = %f", max_val);

 

The OpenMP provides a special “reduction” directive to parallelize external scalar variables that are updated using an associative operation. In this case the Max operation is associative and thus we can parallelize the code using an OMP reduction directive as follows:

 

     double arr[N];

    double max_val=0.0;

    int i;

 

    #pragma omp parallel for reduction(max : max_val)

    for( i=0;i<N; i++)

    {

        if(arr[i] > max_val)

        {

            max_val = arr[i];  

        }

    }

  

    printf("\nmax_val = %f", max_val);

 

However, in many cases, the associative operation is not applied on a scalar variable but rather on elements of an array. In such a case the OMP reduction directive will not work as we need to compute all the prefix values of the operation and not just the final value.

The following loop contains a simple recurrence on the plus operation as follows:

 

      for(i=1; i < N; i++) {

             a[i] += a[i-1];

      }

 

 

 

To compute all the prefixes of the plus operation on the a array we can use parallel solutions to the Prefix Sum Problem. Below is an efficient implementation of the problem:

   

 

       int i,j,k,n;

     

      int P = omp_get_max_threads();

      int NdivP = N/P;

 

    //each chunk is computed sequentially:

    #pragma omp parallel for private(i,k)

    for(i=0; i<N; i += NdivP)

    {

       for(k=i+1;k<i+NdivP;k++)

                a[k] += a[k-1];   

    }

     

    //sequential run over P chunks to sum representative elements from each chunk:

    for(i=2*NdivP-1; i<N; i += NdivP)

    {

      a[i] += a[i-NdivP];   

    }

 

     // update each element in parallel based on representatives from previous chunk:

     #pragma omp parallel for private(i,k)

     for(i=NdivP-1; i<N; i+=NdivP)

     {

       for(k=i+1;k<i+NdivP;k++) {

            a[k] += a[i];

         }

     }   

    

 

The parallel prefix implementation above can also be extended in some cases to handle recurrent loops that contain simple conditional statements. For example consider the following recurrent loop containing a simple conditional statement (shown in red):

 

for(i=1; i < N; i++) {

            if (B[i])

             a[i] += a[i-1];

      }

 

Despite the additional condition, the parallel prefix sum solution is still viable by using an auxiliary array next[]  that would hold the next element in the modified recurrence array a[] to be added as follows:

 

    int i,j,k,n;

    int next[N]; //auxiliary array that holds next element index to read from based on the condition.

 

    int P = omp_get_max_threads();

    int NdivP = N/P;

 

      //init the auxiliary next[] array:

      next[0] = -1; // -1 represents an invalid element index

      #pragma omp parallel for private (i)

      for(i=1; i < N; i++) {

            next[i] = -1;

            if (B[i])

             next[i] = i-1;

      }

 

    //each chunk is computed sequentially:

    #pragma omp parallel for private(i,k)

    for(i=0; i<N; i += NdivP)

    {

       for(k=i+1;k<i+NdivP;k++)

          if (B[i]) {

            a[k] += a[k-1];   

 

            if (next[k] >= 0)

              next[k] = next[next[k]];   

         }

    }

     

    //sequential run over P chunks to sum representative elements from each chunk:

    for(i=2*NdivP-1; i<N; i += NdivP)

    {

           if (B[i])

             if (next[i] >= 0)

               a[i] += a[next[i]];   

    }

 

    // update each element in parallel based on psum

    #pragma omp parallel for private(i,k)

     for(i=NdivP-1; i<N; i+=NdivP)

     {

       for(k=i+1;k<i+NdivP;k++) {

           if (B[i])

             if (next[k] >= 0)

               a[k] += a[next[i]];

         }

     }

         

OCA recommendations based on Helper Threads solutions

 

Consider below a more complicated case for parallelization of a while loop that iterates over a linked list:

 

while (node) {

      res = work(node);   // heavy weight computation

      write (res);

      node = node->next;  // light weight computation

}

 

Parallelizing such while loops which contain heavy cross-iteration dependencies requires us to apply Scalar expansion transformation on the node and res scalars along with the transformations that handles non-fixed termination condition. We then apply Loop Distribution transformation that separates the node calculation from res calculation and finally a Weak break transformation.

See the resulted parallelized version (before applying the “weak break” transformation):

#define N (omp_get_num_threads() * CONST)  // CONST is provided by the user

      

while (node) {

 

 for ((j=0;; j += N) {

 

      node_t *expanded_node[N];

 

#pragma omp parallel for private(i)

for (i=0; i < N; i++) {

            expanded_node[i] = NULL;

}

 

//sequential:

for ((i=0; i < N; i++) { // sequential run ahead with light

                         //  weight computation

 

     if (i > 0 && expanded_node[i-1] {   

         expanded_node[i] = expanded_node[i-1]->next;

    }

 

           if (expanded_node[i] == NULL)

 break;

}}

 

#pragma omp parallel for private(i)

for (i=j; i < j+N; i++) {

 

  if (expanded_node[i] == NULL)

    break;  // weak break;

 

  if (expanded_node[i]) {

                     res[i] = work(expanded_node[i]);

          write (res[i]);

        }

    }

    node = expanded_node[N-1];

}