Using OpenMP to produce an optimized, parallel aplication
OpenMP is a directive driven, shared memory implementation for parallel programming. Directives are added to the code in question to "guide" the compiler in producing parallelized code. A shared memory parallel program communicates between processes via memory common to all processes and thus can be used only with as many processors as have access to this common pool of memory.
A study of the process of converting a serial application to a OpenMP parallel application is given below.
The first step in optimizing a program is to locate the section(s) of code taking the most time. This process is referred to as "profiling". Several packages exist that can produce the desired information. One of them is Vampir
A Vampir instrumented version of the code to be optimized in this example produced the following output:
The profile run took a total of 42.178 seconds, 41.699 spent in the routine "fcn". Clearly, optimization efforts should concentrate on this routine. Below is the content of fcn:
subroutine fcn(npar,grad,fval,xval,iflag,futil)
implicit double precision (a-h,o-z)
dimension grad(*),xval(*)
c
common/prob/nch
integer nch
c
fval=0.
c
do i=1,nch-1
do j=i+1,nch
u=sin(xval(2*i-1))*cos(xval(2*i))
$ *sin(xval(2*j-1))*cos(xval(2*j))
$ +sin(xval(2*i-1))*sin(xval(2*i))
$ *sin(xval(2*j-1))*sin(xval(2*j))
$ +cos(xval(2*i-1))*cos(xval(2*j-1))
d=2.*sin(acos(u)/2.)
fval=fval+1./d**2
enddo
enddo
c
return
end
The innermost loop (indexed by "j") contains function calls that depend only on the index "i". Moving these evaluations outside the inner loop will improve efficiency. The following version executes in 28.94 seconds, 69% of the original time.
subroutine fcn(npar,grad,fval,xval,iflag,futil)
implicit double precision (a-h,o-z)
dimension grad(*),xval(*)
c
common/prob/nch
integer nch
c
fval=0.
c
do i=1,nch-1
a=sin(xval(2*i-1))*cos(xval(2*i))
b=sin(xval(2*i-1))*sin(xval(2*i))
c=cos(xval(2*i-1))
do j=i+1,nch
u=a*sin(xval(2*j-1))*cos(xval(2*j))
$ +b*sin(xval(2*j-1))*sin(xval(2*j))
$ +c*cos(xval(2*j-1))
d=2.*sin(acos(u)/2.)
fval=fval+1./d**2
enddo
enddo
c
return
end
The next thing to investigate is the effect of compiler choice and optimization level. When changing compilers or optimization level it is important to verify the correctness of the output. Overly aggressive optimization can produce incorrect results. The table below gives the execution time resulting from several compilers and optimization levels. In all cases, the results were identical.
| Compiler | Optimization Flag | Execute time (sec) |
| f77 | none | 28.94 |
| f77 | -O3 | 23.38 |
| ifort | none | 20.10 |
| ifort | -O3 | 8.20 |
| ifort | -O3 (no profiling) | 5.69 |
In this example, the vendor provided compiler is clearly superior to GCC using the default flags. Vendor provided compilers are optimized for the particular architecture of the machine. The GCC compiler suite has many flags that invoke optimization particular to the specific architecture being used. The -mtune and -march flags should be investigated if you are constrained to use the GCC compilers.
Profiling adds overhead to the program. Final results should not include the profiling code since this is how the program will be used in production. This is reflected in the last two rows of the table.
The last step in optimization is parallelization. Some code parallelizes well, some not at all. For this example, the innermost loop consists of many evaluations of the same function with different arguments. Of particular importance is that each pass through the loop is independent of any other pass through the loop. This makes OpenMP an good candidate for parallelization. The following version of the code executes in 2.67 seconds on an 8 processor machine.
...
c executed once in the main routine:
call OMP_SET_NUM_THREADS(8)
...
subroutine fcn(npar,grad,fval,xval,iflag,futil)
implicit double precision (a-h,o-z)
dimension grad(*),xval(*)
c
common/prob/nch
integer nch
c
fval=0.
c
!$OMP PARALLEL DO PRIVATE(a,b,c,u,d)
!$OMP PARALLEL DO REDUCTION(+:fval)
do i=1,nch-1
a=sin(xval(2*i-1))*cos(xval(2*i))
b=sin(xval(2*i-1))*sin(xval(2*i))
c=cos(xval(2*i-1))
do j=i+1,nch
u=a*sin(xval(2*j-1))*cos(xval(2*j))
$ +b*sin(xval(2*j-1))*sin(xval(2*j))
$ +c*cos(xval(2*j-1))
d=2.*sin(acos(u)/2.)
fval=fval+1./d**2
enddo
enddo
!$OMP END PARALLEL DO
c
return
end
Several excellent tutorials on the use of OpenMP exist.
The lines beginning "!$OMP ..." are OpenMP directives that guide an OpenMP enabled compiler in parallelization. These directives were taken verbatim from the LLNL tutorial excercises.
Finally, the double loop of "fcn" can be modified to a single loop and that loop parallellized:
implicit double precision (a-h,o-z)
dimension grad(*),xval(*)
c
common/prob/nch
integer nch
common/linar/nt,m(5000),n(5000)
integer nt,m,n
c
if(m(1).ne.1) call slin
fval=0.
c
mold=0
!$OMP PARALLEL DO PRIVATE(i,a,b,c,mold,u,d)
!$OMP PARALLEL DO REDUCTION(+:fval)
do k=1,nt
if(m(k).ne.mold) then
i=m(k)
a=sin(xval(2*i-1))*cos(xval(2*i))
b=sin(xval(2*i-1))*sin(xval(2*i))
c=cos(xval(2*i-1))
mold=m(k)
endif
u=a*sin(xval(2*n(k)-1))*cos(xval(2*n(k)))
$ +b*sin(xval(2*n(k)-1))*sin(xval(2*n(k)))
$ +c*cos(xval(2*n(k)-1))
d=2.*sin(acos(u)/2.)
fval=fval+1./d**2
enddo
!$OMP END PARALLEL DO
c
return
end
subroutine slin
common/prob/nch
integer nch
common/linar/nt,m(5000),n(5000)
integer nt,m,n
k=1
do i=1,nch-1
do j=i+1,nch
m(k)=i
n(k)=j
k=k+1
enddo
enddo
nt=k-1
return
end
This version executed in 1.84 seconds, nearly 23 times faster than the original.




