Basic Parallel Transformations used by OCA Recommendations
Below we describe the various transformations that are used by OCA when recommending how to parallelize a given loop. As seen in the introduction, these transformations are referred to by the comments that are attached to the statements in the loop body.
Contents
Handling break and return statements in OMP loops
Handling loops with unknown boundaries prior to loop invocation
Loop distribution with Scalar Expansion
OMP Scalar Reduction and Parallel Recurrence
OCA recommendations based on Helper Threads solutions
For loops that contain external scalar variables declared outside the loop body and used (both read from and written to) only inside the loop (or written to before being used again after the loop completes), it is possible to declare them as private using the OMP private() directive, provided they are overwritten before being used in every iteration.
The private() directive instructs the OpenMP runtime to create a private copy of the variable for each thread.
For example, the following simple loop contains the external scalar variable x and the loop index i, both defined outside the loop body and used only temporarily inside the loop, which prevents a straightforward parallelization:
double x;
double A[100], B[100], C[100];
int i;
for (i=0; i<100; i++) {
    x = A[i] + B[i];
    C[i] = x + 1/x;
}
If we know that the value of x is not used anymore after the loop completes, we can simply use the private directive of OpenMP as follows:
double x;
double A[100], B[100], C[100];
int i;
#pragma omp parallel for private(i,x)
for (i=0; i<100; i++) {
    x = A[i] + B[i];
    C[i] = x + 1/x;
}
In most cases, even if the external scalar that is written to in the loop is then read after the loop completes, it is still possible to privatize it, provided we use the “lastprivate” directive of OpenMP.
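As a minimal sketch reusing the example above (assuming x is read after the loop), the lastprivate(x) clause copies the value of x from the sequentially last iteration back to the shared variable once the loop finishes:
double x;
double A[100], B[100], C[100];
int i;
// lastprivate(x): each thread works on a private x, and after the loop the
// shared x holds the value computed in the last iteration (i == 99).
#pragma omp parallel for private(i) lastprivate(x)
for (i=0; i<100; i++) {
    x = A[i] + B[i];
    C[i] = x + 1/x;
}
// x can now be read here just as in the sequential version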
For loops that write into array elements that may conflict when parallelized, it is sometimes possible to eliminate the conflicts by defining an auxiliary array that is initialized to the initial values of the original array.
This is possible if each array element that is written to depends only on initial values.
For example, the following loop cannot be parallelized as is, since it would create conflicts between iteration i, which reads from array[i+C], and iteration i+C, which tries to write into it simultaneously:
int array[100];
. . .
for (int i=0; i<100; i++) {
    array[i] += array[i+C];
}
To resolve this type of read/write conflict, a new array called init_array[] is defined, initialized in a separate loop, and read from instead of the original array, as follows:
int init_array[100];
int i;
#pragma omp parallel for private(i)
for (i=0; i<100; i++) {
    init_array[i] = array[i];
}
#pragma omp parallel for private(i)
for (i=0; i<100; i++) {
    array[i] = init_array[i] + init_array[i+C];
}
OMP does not support break or return statements in parallel loops; using them causes a compilation error. Enabling them in parallel loops therefore requires some changes to the loop.
For break statements we differentiate between two types: “weak break” vs. “strong break”. In the case of a weak break, the iterations that follow the breaking statement can still execute without causing a functional error. In such a case the break statement exists merely for performance reasons, to avoid redundant computation.
However, after a strong break, no further iterations are allowed, as they can cause a functional error and may generate wrong output.
A return statement is usually equivalent to a “strong” break statement.
See below a pseudo sequential loop containing a “weak” break: correctness is not affected if the loop continues even after the break has occurred, since it only searches for any element of the a array that contains a zero:
for (i=0; i < N; i++)
{
    if (a[i] == 0) {
        found_zero_occurence = true;
        break; // found a zero occurrence in a[]
    }
}
Unfortunately, the corresponding OMP loop is not allowed to contain a break statement:
#pragma omp parallel for private(i)
for (i=0; i < N; i++)
{
    if (a[i] == 0) {
        found_zero_occurence = true;
        break; // <-- this break statement produces a compilation error
    }
}
To resolve the above case of a “weak” break statement in the OMP loop, it is possible to “emulate” it by adding a condition to the loop boundary as follows:
#define P (omp_get_num_threads())
#define CHUNK_I (k * N / P)
#define CHUNK_I_PLUS_1 ((k+1) * N / P)
int k;
int icond = -1;
#pragma omp parallel for private(k)
for (k=0; k < P; k++) {
    int i;
    for (i = CHUNK_I; i < CHUNK_I_PLUS_1 && icond < 0; i++) {
        if (a[i] == 0) {
            found_zero_occurence = true;
            icond = i;
        }
    }
}
However, finding the first zero occurrence in an array is more complicated, as we need to make sure that each parallel chunk is fully executed before we can determine the minimal index holding a zero occurrence.
See the proposed solution below for the first zero occurrence problem.
Note that we define a constant “CHUNK_SIZE”, which is predefined as the chunk size given to each OMP thread that yields the best speedup:
#define MAX_P (omp_get_max_threads())
#define P (omp_get_num_threads())
#define CHUNK_SIZE 4
int k;
int icond = N;
for (k=0; k < N; k += P*CHUNK_SIZE) {
    int i;
    icond = N;
    #pragma omp parallel for private(i) reduction(min:icond)
    for (i = k; i < k+P*CHUNK_SIZE; i++) {
        if (arr[i] == 0) {
            icond = i;
        }
    }
    if (icond < N)
        break;
}
Note also that in some cases we need to guard parts of the loop body, as they could compute illegal results. See the example below, where the statement “b[i] /= a[i];” should be wrapped by a check that a[i] is not zero when applying the above “strong” break solution to the OMP loop:
for (i=0; i < N; i++) {
    if (a[i] == 0) {
        break; // “strong” break
    }
    b[i] /= a[i]; // this loop part is not allowed to execute after the break
}
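As a minimal sketch (reusing the chunked pattern and the icond reduction variable from the first zero occurrence solution above), the division is guarded so it can never execute when a[i] is zero; a complete “strong” break emulation would also have to avoid or discard any other side effects computed past the break index within a chunk:
#define P (omp_get_num_threads())
#define CHUNK_SIZE 4
int k;
int icond = N;
for (k=0; k < N; k += P*CHUNK_SIZE) {
    int i;
    icond = N;
    #pragma omp parallel for private(i) reduction(min:icond)
    for (i = k; i < k+P*CHUNK_SIZE; i++) {
        if (a[i] == 0) {
            icond = i;       // remember the emulated "strong" break index
        } else {
            b[i] /= a[i];    // guarded: only executes when a[i] != 0
        }
    }
    if (icond < N)
        break;               // no later chunk is started past the break index
}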
Loops whose termination condition includes a variable (other than the loop iterator) that is modified in the loop body cannot be parallelized via the OMP pragma.
For example, while loops are hard to parallelize for this reason.
Consider the following example:
while (a[i] != 0) { // undetermined termination condition
    calc(a[i]);
    i++;
}
To resolve this, one must use an external sequential for loop that iterates over the chunks and an inner OMP parallel loop that iterates inside each chunk, along with the corresponding break statements, as follows:
#define N (omp_get_num_threads() * CONST) // CONST is provided by the user
int i, j;
bool is_break = false;
for (j=0;; j += N) { // non-terminating condition here, with a fixed step size of N
    is_break = false;
    #pragma omp parallel for private(i)
    for (i=j; i < j+N; i++) {
        if (a[i] == 0) {
            is_break = true; // "strong" break OMP emulation is still needed here
            break;
        }
        calc(a[i]);
    }
    if (is_break)
        break;
}
Loops with cross-iteration dependencies between different arrays may be resolved by splitting the loop into multiple separate loops.
For example, the following loop contains cross-iteration dependencies resulting from cross-array references between different arrays:
for (i=1; i < n; i++) {
    X[i] = Y[i];
    Y[i] = X[i-1];
    Z[i] = Y[i-1];
}
The above loop can be distributed into three separate loops, each of which is parallelizable, as follows:
#pragma omp parallel for private(i)
for (i=1; i < n; i++) {
    X[i] = Y[i];
}
#pragma omp parallel for private(i)
for (i=1; i < n; i++) {
    Y[i] = X[i-1];
}
#pragma omp parallel for private(i)
for (i=1; i < n; i++) {
    Z[i] = Y[i-1];
}
After splitting a loop, every scalar variable that is modified by one loop and then read by a consecutive loop requires us to apply scalar expansion into an array, as shown in the example below:
int i;
int x = 0;
for (i=1; i < N; i++) {
    x = A[i] + B[i];
    C[i] = A[i-1];
    if (x != 0)
        D[i] = isAboveThreshold(C[i-1]) + x + 1/x;
}
The above loop is distributed into two loops and the x scalar is expanded into an array as follows:
double A[100], B[100], C[100], D[100];
double expanded_x[100];
int i;
#pragma omp parallel for private(i)
for (i=1; i<100; i++) {
    expanded_x[i] = A[i] + B[i];
    C[i] = A[i-1];
}
#pragma omp parallel for private(i)
for (i=1; i<100; i++) {
    if (expanded_x[i] != 0)
        D[i] = isAboveThreshold(C[i-1]) + expanded_x[i] + 1/expanded_x[i];
}
Note, however, that scalar expansion may sometimes require us to use more than one auxiliary array, in the case where the external scalar may not be updated in every iteration of the loop. Consider the following example:
int arr[N];
int i;
int x = 0;
for (i=0; i < N; i++) {
    arr[i] = i;
    if (i < N/2)
        x = i;
    arr[x] = i;
}
return x;
In the above example, the x scalar variable stops being updated at iteration N/2. Consequently, after the loop completes, only the last updated value of the x scalar needs to be returned.
For this general case we need to use another auxiliary array called x_updated_iteration[] and make sure to set it to the iteration number every time we update the expanded_x[] element.
int arr[N];
int i;
int x = 0;
// scalar expansion for x:
int expanded_x[N];
int x_updated_iteration[N];
// init auxiliary arrays for x:
#pragma omp parallel for private(i)
for (i=0; i < N; i++) {
    expanded_x[i] = x;
    x_updated_iteration[i] = 0;
}
#pragma omp parallel for private(i)
for (i=0; i < N; i++) {
    arr[i] = i;
    if (i < N/2) {
        expanded_x[i] = i;
        x_updated_iteration[i] = i;
    }
    arr[expanded_x[i]] = i;
}
// find the highest iteration number where expanded_x was modified:
int max_val = 0;
#pragma omp parallel for private(i) reduction(max:max_val)
for (i=0; i < N; i++) {
    if (x_updated_iteration[i] > max_val) {
        max_val = x_updated_iteration[i];
    }
}
x = expanded_x[max_val];
OMP scalar reduction uses a scalar representation to extract parallelism when backward cross-iteration dependencies exist, provided the binary operation used to update the scalar variable is associative.
Consider the following code that searches for the maximal value in an array:
double arr[N];
double max_val = 0.0;
int i;
for (i=0; i<N; i++) {
    if (arr[i] > max_val) {
        max_val = arr[i];
    }
}
printf("\nmax_val = %f", max_val);
OpenMP provides a special “reduction” directive to parallelize external scalar variables that are updated using an associative operation. In this case the max operation is associative, and thus we can parallelize the code using an OMP reduction directive as follows:
double arr[N];
double max_val = 0.0;
int i;
#pragma omp parallel for reduction(max : max_val)
for (i=0; i<N; i++) {
    if (arr[i] > max_val) {
        max_val = arr[i];
    }
}
printf("\nmax_val = %f", max_val);
However, in many cases the associative operation is not applied to a scalar variable but rather to elements of an array. In such a case the OMP reduction directive will not work, as we need to compute all the prefix values of the operation and not just the final value.
The following loop contains a simple recurrence on the plus operation:
for (i=1; i < N; i++) {
    a[i] += a[i-1];
}
To compute all the prefixes of the plus operation on the a array, we can use parallel solutions to the Prefix Sum Problem. Below is an efficient implementation:
int i, k;
int P = omp_get_max_threads();
int NdivP = N/P;
// each chunk is computed sequentially:
#pragma omp parallel for private(i,k)
for (i=0; i<N; i += NdivP) {
    for (k=i+1; k<i+NdivP; k++)
        a[k] += a[k-1];
}
// sequential run over P chunks to sum representative elements from each chunk:
for (i=2*NdivP-1; i<N; i += NdivP) {
    a[i] += a[i-NdivP];
}
// update each element in parallel based on the representatives from the previous chunk:
#pragma omp parallel for private(i,k)
for (i=NdivP-1; i<N-1; i+=NdivP) {
    for (k=i+1; k<i+NdivP; k++) {
        a[k] += a[i];
    }
}
The parallel prefix implementation above can also be extended in some cases to handle recurrent loops that contain simple conditional statements. For example, consider the following recurrent loop containing a simple conditional statement:
for (i=1; i < N; i++) {
    if (B[i])
        a[i] += a[i-1];
}
Despite the additional condition, the parallel prefix sum solution is still viable by using an auxiliary array next[] that holds the index of the next element in the modified recurrence array a[] to be added, as follows:
int i, k;
int next[N]; // auxiliary array that holds the next element index to read from, based on the condition.
int P = omp_get_max_threads();
int NdivP = N/P;
// init the auxiliary next[] array:
next[0] = -1; // -1 represents an invalid element index
#pragma omp parallel for private(i)
for (i=1; i < N; i++) {
    next[i] = -1;
    if (B[i])
        next[i] = i-1;
}
// each chunk is computed sequentially:
#pragma omp parallel for private(i,k)
for (i=0; i<N; i += NdivP) {
    for (k=i+1; k<i+NdivP; k++)
        if (B[k]) {
            a[k] += a[k-1];
            if (next[k] >= 0)
                next[k] = next[next[k]];
        }
}
// sequential run over P chunks to sum representative elements from each chunk:
for (i=2*NdivP-1; i<N; i += NdivP) {
    if (B[i])
        if (next[i] >= 0)
            a[i] += a[next[i]];
}
// update each element in parallel based on the representatives from the previous chunk:
#pragma omp parallel for private(i,k)
for (i=NdivP-1; i<N-1; i+=NdivP) {
    for (k=i+1; k<i+NdivP; k++) {
        if (B[k])
            if (next[k] >= 0)
                a[k] += a[next[k]];
    }
}
Consider below a more complicated case for parallelization of a while loop that iterates over a linked list:
while (node) {
    res = work(node);  // heavy weight computation
    write(res);
    node = node->next; // light weight computation
}
Parallelizing such while loops, which contain heavy cross-iteration dependencies, requires us to apply the Scalar Expansion transformation to the node and res scalars, along with the transformation that handles a non-fixed termination condition. We then apply the Loop Distribution transformation, which separates the node calculation from the res calculation, and finally a Weak Break transformation.
See the resulting parallelized version (before applying the “weak break” transformation):
#define N (omp_get_num_threads() * CONST) // CONST is provided by the user
int i;
while (node) {
    node_t *expanded_node[N];
    #pragma omp parallel for private(i)
    for (i=0; i < N; i++) {
        expanded_node[i] = NULL;
    }
    // sequential run ahead with the light weight computation:
    expanded_node[0] = node;
    for (i=1; i < N; i++) {
        if (expanded_node[i-1]) {
            expanded_node[i] = expanded_node[i-1]->next;
        }
        if (expanded_node[i] == NULL) break;
    }
    #pragma omp parallel for private(i)
    for (i=0; i < N; i++) {
        if (expanded_node[i] == NULL) break; // weak break
        if (expanded_node[i]) {
            res[i] = work(expanded_node[i]); // heavy weight computation in parallel
            write(res[i]);
        }
    }
    node = expanded_node[N-1] ? expanded_node[N-1]->next : NULL;
}
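One possible way to apply the final “weak break” step here (a sketch, not necessarily the exact form OCA recommends) is simply to drop the break and rely on the existing NULL guard, since iterations past the end of the current batch perform no work and therefore cannot cause a functional error:
// Sketch: the "weak break" removed; iterations whose expanded_node[i] is
// NULL simply do nothing, which matches the weak break semantics (only
// redundant iterations are executed, never incorrect ones).
#pragma omp parallel for private(i)
for (i=0; i < N; i++) {
    if (expanded_node[i]) {
        res[i] = work(expanded_node[i]); // heavy weight computation in parallel
        write(res[i]);
    }
}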