Using OpenMP to produce an optimized, parallel aplication

OpenMP is a directive driven, shared memory implementation for parallel programming. Directives are added to the code in question to "guide" the compiler in producing parallelized code. A shared memory parallel program communicates between processes via memory common to all processes and thus can be used only with as many processors as have access to this common pool of memory.

A study of the process of converting a serial application to a OpenMP parallel application is given below.


The first step in optimizing a program is to locate the section(s) of code taking the most time. This process is referred to as "profiling". Several packages exist that can produce the desired information. One of them is Vampir

A Vampir instrumented version of the code to be optimized in this example produced the following output:

The profile run took a total of 42.178 seconds, 41.699 spent in the routine "fcn". Clearly, optimization efforts should concentrate on this routine. Below is the content of fcn:

      subroutine fcn(npar,grad,fval,xval,iflag,futil)
      implicit double precision (a-h,o-z)
      dimension grad(*),xval(*)
c
      common/prob/nch
      integer          nch
c
      fval=0.
c
      do i=1,nch-1
         do j=i+1,nch
            u=sin(xval(2*i-1))*cos(xval(2*i))
     $       *sin(xval(2*j-1))*cos(xval(2*j))
     $       +sin(xval(2*i-1))*sin(xval(2*i))
     $       *sin(xval(2*j-1))*sin(xval(2*j))
     $       +cos(xval(2*i-1))*cos(xval(2*j-1))
            d=2.*sin(acos(u)/2.)
            fval=fval+1./d**2
         enddo
      enddo
c
      return
      end

The innermost loop (indexed by "j") contains function calls that depend only on the index "i". Moving these evaluations outside the inner loop will improve efficiency. The following version executes in 28.94 seconds, 69% of the original time.


      subroutine fcn(npar,grad,fval,xval,iflag,futil)
      implicit double precision (a-h,o-z)
      dimension grad(*),xval(*)
c
      common/prob/nch
      integer          nch
c
      fval=0.
c
      do i=1,nch-1
         a=sin(xval(2*i-1))*cos(xval(2*i))
         b=sin(xval(2*i-1))*sin(xval(2*i))
         c=cos(xval(2*i-1))
         do j=i+1,nch
            u=a*sin(xval(2*j-1))*cos(xval(2*j))
     $       +b*sin(xval(2*j-1))*sin(xval(2*j))
     $       +c*cos(xval(2*j-1))
            d=2.*sin(acos(u)/2.)
            fval=fval+1./d**2
         enddo
      enddo
c
      return
      end

The next thing to investigate is the effect of compiler choice and optimization level. When changing compilers or optimization level it is important to verify the correctness of the output. Overly aggressive optimization can produce incorrect results. The table below gives the execution time resulting from several compilers and optimization levels. In all cases, the results were identical.

CompilerOptimization FlagExecute time (sec)
f77 none 28.94
f77 -O3 23.38
ifort none 20.10
ifort -O3 8.20
ifort -O3 (no profiling) 5.69

In this example, the vendor provided compiler is clearly superior to GCC using the default flags. Vendor provided compilers are optimized for the particular architecture of the machine. The GCC compiler suite has many flags that invoke optimization particular to the specific architecture being used. The -mtune and -march flags should be investigated if you are constrained to use the GCC compilers.

Profiling adds overhead to the program. Final results should not include the profiling code since this is how the program will be used in production. This is reflected in the last two rows of the table.

The last step in optimization is parallelization. Some code parallelizes well, some not at all. For this example, the innermost loop consists of many evaluations of the same function with different arguments. Of particular importance is that each pass through the loop is independent of any other pass through the loop. This makes OpenMP an good candidate for parallelization. The following version of the code executes in 2.67 seconds on an 8 processor machine.

...
c executed once in the main routine:
      call OMP_SET_NUM_THREADS(8)
...
      subroutine fcn(npar,grad,fval,xval,iflag,futil)
      implicit double precision (a-h,o-z)
      dimension grad(*),xval(*)
c
      common/prob/nch
      integer          nch
c
      fval=0.
c
!$OMP PARALLEL DO PRIVATE(a,b,c,u,d)
!$OMP PARALLEL DO REDUCTION(+:fval)
      do i=1,nch-1
         a=sin(xval(2*i-1))*cos(xval(2*i))
         b=sin(xval(2*i-1))*sin(xval(2*i))
         c=cos(xval(2*i-1))
         do j=i+1,nch
            u=a*sin(xval(2*j-1))*cos(xval(2*j))
     $       +b*sin(xval(2*j-1))*sin(xval(2*j))
     $       +c*cos(xval(2*j-1))
            d=2.*sin(acos(u)/2.)
            fval=fval+1./d**2
         enddo
      enddo
!$OMP END PARALLEL DO
c
      return
      end

Several excellent tutorials on the use of OpenMP exist.

The lines beginning "!$OMP ..." are OpenMP directives that guide an OpenMP enabled compiler in parallelization. These directives were taken verbatim from the LLNL tutorial excercises.

Finally, the double loop of "fcn" can be modified to a single loop and that loop parallellized:

      implicit double precision (a-h,o-z)
      dimension grad(*),xval(*)
c
      common/prob/nch
      integer          nch
      common/linar/nt,m(5000),n(5000)
      integer nt,m,n
c
      if(m(1).ne.1) call slin

      fval=0.
c
      mold=0
!$OMP PARALLEL DO PRIVATE(i,a,b,c,mold,u,d)
!$OMP PARALLEL DO REDUCTION(+:fval)
      do k=1,nt
         if(m(k).ne.mold) then
            i=m(k)
            a=sin(xval(2*i-1))*cos(xval(2*i))
            b=sin(xval(2*i-1))*sin(xval(2*i))
            c=cos(xval(2*i-1))
            mold=m(k)
         endif
         u=a*sin(xval(2*n(k)-1))*cos(xval(2*n(k)))
     $        +b*sin(xval(2*n(k)-1))*sin(xval(2*n(k)))
     $        +c*cos(xval(2*n(k)-1))
         d=2.*sin(acos(u)/2.)
         fval=fval+1./d**2
      enddo
!$OMP END PARALLEL DO
c
      return
      end

      subroutine slin
      common/prob/nch
      integer          nch
      common/linar/nt,m(5000),n(5000)
      integer nt,m,n
      k=1
      do i=1,nch-1
         do j=i+1,nch
            m(k)=i
            n(k)=j
            k=k+1
         enddo
      enddo
      nt=k-1
      return
      end

This version executed in 1.84 seconds, nearly 23 times faster than the original.