|
Previous |
Section 5: Analyzing CPU-bound code.
Return | Introduction and Table of Contents
6.0 Optimizing CPU-bound Code
This section describes techniques for optimizing CPU-bound code. To determine which of these techniques to use for CPU-bound code, see Section 5, which describes analyzing CPU-bound code. In some cases, the compiler does these optimizations automatically.
Each example in this section includes one of the following icons, which represent the approximate amount of speedup that can be gained by applying the optimization technique to a piece of code.
The speedup times shown in this section were obtained by running the code samples on a specific CRAY J90 system running the UNICOS 9.0 operating system in the 3.0 Programming Environment. These times were obtained in a multiuser environment in competition with other unknown processes for system resources (such as memory ports, I/O devices, and so on). In a multiuser environment, no two runs of the same executable code will produce the exact same results. Measurement was made on time spent in the specific code block shown, not on the code as a whole.
6.1 Optimizing scalar code
If the code you have identified as using the most CPU time is a scalar code segment that contains no loops, the scalar optimization techniques described in the following sections might help improve the performance of the code on which you are working.
6.1.1 Inlining subroutines and functions
If your code spends time making frequent subroutine calls, consider inlining the called routines. Inlining replaces calls to user-defined functions with the code representing the function. Inlining eliminates the overhead of preserving the context of the calling routine and retrieving it upon return. It is especially effective for short routines that are called many times. The inline factor displayed in a flowview report for a program can help you determine which subroutines should be inlined. For more information, see the flowview(1) man page.
Inlining can be done manually or automatically. You can inline routines by copying them directly into the calling routine and changing dummy argument names to the actual argument names. Manual inlining requires some additional effort to associate parameters and avoid local variable conflicts.
If the subprogram call is the only vectorization inhibitor within a loop, inlining is also an effective technique for converting a scalar loop into a vector loop.
- CF90 inlining
The CF90 compiler offers automatic inline expansion with the -O inline command-line option. Example:
f90 -Oinline2 code.f90
Figure 15 is an example of manually inlining CF90 code. The speedup in this example comes from the elimination of the call overhead of CELSIUS, but more importantly, the loop is allowed to vectorize in the inlined code. Extremely high speedups are possible for cases such as this. Automatic inlining has a similar affect on code.
Figure 15. CF90 manual inlining example
| Original code: |
|
Inlined code: |
REAL FUNCTION CELCIUS(F)
CELCIUS=(F -32.0)*5.0/9.0
END
REAL SUBROUTINE ORIGINAL(N)
REAL A(1000), B(1000)
COMMON /DATA/A,B
! EXTERNAL SUBROUTINE CALL
DO I=1,N
A(I) = CELCIUS(B(I))
END DO
END
|
|
REAL SUBROUTINE MODIFIED(N)
REAL A(1000), B(1000)
COMMON /DATA/A,B
! EXTERNAL SUBROUTINE INLINE
DO I=1,N
A(I) = (B(I) -32.0) + 5.0/9.0
END DO
END
|
- C++ inlining
The C++ compiler offers automatic inline expansion through the inline keyword, the -h inline command-line option, or through the #pragma _CRI inline compiler directive. The -h report=I option writes inline expansion information to the stderr file. Example:
CC -h inline3 code.CC
6.1.2 Converting to faster integer types
If the code contains integer arithmetic, use the fastest integer representation that is capable of representing the values that the code needs. The following sections describe Fortran and C++ conversion methods.
To change to a different integer size with the CF90 compiler, use the KIND parameter and the SELECTED_INT_KIND function in your source code to specify the maximum number of digits the integers will need. You can also use either the -i command-line option or the !DIR$ INTEGERn compiler directive.
There are three integer sizes available for Fortran codes. Table 1, defines their relative speed within scalar and vector loops.
Table 1. Relative speeds of Fortran integer sizes
within scalar and vector loops
| Available integer sizes |
Scalar loops |
Vector loops |
| kind=(1 to 4)or int*4 |
Fastest |
Fastest |
| kind=(default size, default options) |
Slower |
Fastest |
| kind=8or int*8 or (default size, -O nofastint) |
Slowest |
Slowest |
6.1.3 Converting double precision to single precision arithmetic
Cray PVP systems provide double-precision floating-point arithmetic through software library calls. If the code contains double-precision arithmetic, use the smallest precision that satisfies the needs of the code.
Because Cray PVP systems provide a single-precision type that contains the precision of most other vendor system's double-precision type, the code might be able to use Cray single-precision arithmetic, which is much faster than Cray double-precision arithmetic. Figure 16 is a CF90 example of converting double-precision arithmetic to single precision.
Figure 16. CF90 example of converting double- to single-precision arithmetic
| Original code: |
REAL SUBROUTINE ORIGINAL (N)
INTEGER, PARAMETER :: K28=SELECTED_REAL_KIND(28)
REAL (KIND=K28), DIMENSION(1000) :: DA,DB,DC
REAL (KIND=K28) :: DD
!DOUBLE PRECISION
DO I = 1,N
DA(I) = DB(I) * DC(I) + DD
END DO
END
|
| Modified code: |
REAL SUBROUTINE MODIFIED (N)
INTEGER, PARAMETER :: K13=SELECTED_REAL_KIND(13)
REAL (KIND=K13), DIMENSION(1000) :: A,B,C
REAL (KIND=K13) :: D
!SINGLE PRECISION
DO I = 1,N
A(I) = B(I) * C(I) + D
END DO
END
|
On Cray PVP systems, single-precision arithmetic is precise to 14 decimal digits and double-precision arithmetic is precise to 29 decimal digits. With the CF90 compiler, you can use the -dp command-line option to convert to single-precision arithmetic. Without changing the source code, this will disable all double-precision arithmetic and use only single-precision arithmetic.
6.2 Optimizing vector code
If the code you have identified as using significant CPU time is a loop, vector optimization techniques can help speed up the code. If the loop is not already vectorized, use the techniques outlined in Section 6.2.1, to attempt vectorization.
If the loop is already vectorized, you can moderately improve its performance with vectorization techniques outlined in Section 6.2.2.
By default, both the CF90 and C++ compilers will try to apply the techniques described in the following sections to your code. However, the code on which you are working might be too complex for the compiler to apply these techniques automatically.
6.2.1 Converting a scalar loop to a vector loop
If the code you have identified as using the most CPU time is a scalar loop, it can approach the maximum available speed of Cray PVP systems only by using the vector registers. Try to vectorize the code by using one of the following techniques:
- Inlining, or expanding a subprogram inline, into the loop (see Section 6.1.1).
- Pushing the loop into a called subprogram (see Section 6.2.1.1).
- Segmenting or splitting the loop into two or more loops, of which only one loop contains scalar code (see Section 6.2.1.2). If the scalar portion of the loop is a subprogram call (for example, a library routine call or I/O statement), this technique can reduce the performance degradation of the subprogram call.
- Converting a scalar recurrence to a vector array (see Section 6.2.1.3).
- Avoiding ambiguous pointer references (see Section 6.2.1.4).
6.2.1.1 Pushing a loop into a subprogram
When a subprogram call appears in a loop, it cannot be vectorized. The loop might make frequent calls to a short subprogram that is a candidate for inlining (see Section 6.1.1). An alternative to subprogram inlining is loop pushing, in which the loop is pushed into the called subprogram. This is effective when the subprogram is too large or too complex to inline.
With this technique, the calling program still contains the subprogram call and consequently does not vectorize. However, the subprogram effectively does all the work, and the loop within the subprogram might vectorize.
This technique has two main benefits:
- The work within the loop might vectorize.
- The modified code has less call overhead. In the original code, each iteration of the (nonvectorizing) loop makes an external call. In the modified code, there is only one call (to a routine that does all the work).
This technique is not quite as fast as inlining, because the code still incurs overhead from a single function call in the modified code. Otherwise, the speedups are rather large.
Figure 17 is a CF90 example of loop pushing.
Figure 17. CF90 loop pushing example
| Original code: |
Modified code: |
REAL FUNCTION CELSIUS1(F)
REAL F
CELSIUS1 = (F- 32.0) * 5.0/9.0
END FUNCTION CELSIUS1
REAL SUBROUTINE ORIGINAL(N)
REAL A(1000), B(1000)
COMMON /DATA/A,B
!EXTERNAL SUBROUTINE CALL
DO I=1,N
A(I) = CELSIUS1(B(I))
END DO
END
|
SUBROUTINE CELSIUS2(N, C, F)
REAL C(N), F(N)
DO I=1,N
C(I) = (F(I)-32) * 5.0/9.0
END DO
END SUBROUTINE CELSIUS2
REAL SUBROUTINE MODIFIED(N)
REAL A(1000), B(1000)
COMMON /DATA/A,B
!LOOP PUSHED INTO SUBROUTINE
CALL CELSIUS2(N, A ,B)
END
|
6.2.1.2 Segmenting or splitting a loop (partial vectorization)
Some scalar loops contain a recurrence or other problem that precludes vectorization. One way to minimize the effects of scalar code within a loop is to isolate the scalar portion into a separate loop. This creates two loops: one that vectorizes, and one that does not. This technique is called loop splitting or, when the compiler does this, partial vectorization.
A common target for this technique is a loop containing a recurrence, subroutine call, or an I/O request. When one of these is encountered within a critical loop and it cannot be removed, you can still improve performance by minimizing the impact of the call or statement. Many loops that contain code with a subprogram call or I/O statement also contain code that is unrelated to that call or statement and that code is, in itself, vectorizable.
Figure 18 shows how to segment a loop for the CF90 compiler.
Figure 18. CF90 loop segmenting example
| Original code: |
Modified code: |
REAL SUBROUTINE ORIGINAL(N)
REAL A(1000), B(1000), C(1000)
EXTERNAL YO
T=1./3.
DO I=1,N
A(I) = ABS(YO(C(I)))
B(I) = A(I) * T * SQRT(C(I))
A(I) = SIN(ALOG(C(I)))
END DO
END
|
REAL SUBROUTINE MODIFIED(N)
REAL A(1000), B(1000), C(1000)
EXTERNAL YO
T=1./3.
DO I=1,N
A(I) = ABS(YO(C(I)))
END DO
DO I=1,N
B(I) = A(I)* T * SQRT(C(I))
A(I) = SIN(ALOG(C(I)))
END DO
END
|
6.2.1.3 Converting a scalar recurrence to a vector array
A recurrence is a data dependency between loop iterations. This occurs when an expression in one loop iteration requires a value defined in a previous iteration. Recurrences normally cannot be vectorized because the reordering of operations yields incorrect results. One way to eliminate a scalar recurrence in a loop is to store the computed scalar values in a temporary array and index them in a different order. This might mean converting a simple variable into a one-dimensional array, or a one-dimensional array into a two-dimensional array.
This technique is based on restructuring nested loops. For this technique to work you must make the scalar variable into a vector array based on the outer loop index. Thus, when the loops are switched to index the arrays in a different order, the new array is a vector array based on the inner loop index. Because the scalar-turned-vector saves all of the intermediate results, answers should still be correct.
This technique uses more memory because of the added dimension. In multidimensional cases, the increased memory usage can be significant.
Figure 19 demonstrates how to convert a scalar recurrence to vectorize a loop for the CF90 compiler.
Figure 19. Converting CF90 scalar recurrences to vector arrays
| Original code: |
Modified code: |
REAL SUBROUTINE ORIGINAL(N)
REAL A(1000), B(1000)
COMMON /DATA/A,C,BB
!SCALAR RECURRENCE
M = 1000
DO J=1,M
S = B
DO I= 1,N
S= S* C(I)
A(I)= A(I) + S
END DO
END DO
END
|
REAL SUBROUTINE MODIFIED(N)
REAL A(1000), B(1000), S(1000)
COMMON /DATA/A,C,BB
!TEMPORARY VECTOR
M = 1000
DO I=1,M
S(I) = B
END DO
DO I= 1,N
DO J=1,M
S(J)= S(J)* C(I)
A(I)= A(I) + S(J)
END DO
END DO
END
|
6.2.1.4 Avoiding ambiguous pointer references
Fortran and C++ pointers within loops can create hidden dependencies and complex aliasing. The compilers must be conservative with pointer references that cause potential dependencies in loops, and many such loops will not vectorize.
By limiting the scope of the pointers within loops, and thus creating restricted pointers, you can convert scalar loops to vector loops. For the CF90 compiler, this technique affects the pointer attribute only and does not affect the Cray Research implementation of Fortran pointer type, which is provided by the CF90 compiler.
In Figure 20, the modified subroutine uses a local pointer reference for the memory write, ensuring no dependency with the pointer memory read operations, thereby allowing the loop to vectorize.
Figure 20. CF90 example of avoiding ambiguous pointer references
| Original code: |
Inlined code: |
SUBROUTINE ORIGINAL(PA,PB,PC,D,N)
REAL, POINTER, DIMENSION(1)::PA,PB,PC
DO I=1,N
PA(I)=PB(I)*PC(I) + D
END DO
END
|
SUBROUTINE MODIFIED (A,B,C,D,N)
REAL, TARGET, DIMENSION(1000):: A,B,C
DO I=1,N
A(I)=B(I)*C(I) + D
END DO
END
|
6.2.2 Increasing performance of vector loops
If the code you have identified as using significant CPU time is a vector loop, it is using the fastest hardware available on Cray PVP systems. However, you still might be able to moderately improve its performance because not all vector loops run at peak speeds.
The techniques described in this section allow the vector loop to keep the vector registers and functional units busy instead of waiting on slower components of the computer system (such as central memory). The techniques enlarge the vector length to provide higher rates of operations per second than can shorter vector lengths. They also improve the computational intensity by providing more operations between memory references.
The following techniques can improve the performance of vector loops:
6.2.2.1 Fusing vector loops
Programmers often write two or more independent, vectorizing loops with the same iteration count. By combining two of these loops into one vector loop, or loop fusing, the overhead of loop control code for one of the loops can be eliminated. If the code on which you are working contains such independent, vectorizing loops, both the C++ and CF90 compilers will automatically fuse the loops.
This technique is especially useful with Fortran 90 array syntax. Each array syntax line acts like a separate DO loop. If several same-size array assignments are together, the compiler will optimize them by converting them to a fused DO loop.
Figure 21 demonstrates fusing vector loops for the CF90 compilers.
Figure 21. CF90 loop fusing example
| Original code: |
Inlined code: |
REAL SUBROUTINE ORIGINAL(N)
!SEPARATE VECTORIZING LOOPS
DO I=I,N
A(I) = E(I) +SQRT(F(I))
ENDDO
DO I=I,N-1
B(I) = E(I) *SQRT(F(I))
ENDDO
DO I=I,N-2
C(I) = E(I) -SQRT(F(I))
ENDDO
DO I=I,N-3
D(I) = E(I) .eq.SQRT(F(I))
ENDDO
END
|
REAL SUBROUTINE MODIFIED(N)
DO I=I,N-3
A(I) = E(I) +SQRT(F(I))
B(I) = E(I) *SQRT(F(I))
C(I) = E(I) -SQRT(F(I))
D(I) = E(I) .eq. SQRT(F(I))
ENDDO
C(N-2)= E(N-2) +SQRT(F(N-2))
DO I=N-2, N-1
B(I) = E(I) * SQRT(F(I))
ENDDO
DO I=N-2,N
A(I) = E(I) +SQRT(F(I))
ENDDO
END
|
6.2.2.2 Loop unrolling and unwinding
The technique loop unrolling is an optimization technique in which the statements within a loop are replicated while reducing the trips through the loop at the same time. This technique can increase performance by giving the compiler more instructions to schedule while reducing the number of branches. It can be effective for both scalar and vector loops. For vector loops, the CF90 and C++ compilers typically unroll the loop twice (two copies of the work), and for scalar loops, the compilers unroll up to 32 times. For some loops, it may be beneficial to unroll loops beyond what the compilers do. This may be done at the source level, but Cray Research recommends you use the unroll directives.
A loop unwinding is totally unrolling a loop so effectively that it is no longer a loop. Figure 22, shows an example of this. Loop unwinding as shown in this example can allow the next outer loop level to vectorize, although the compilers attempt to do this automatically.
Figure 22. Loop unwinding example
| Original code: |
Inlined code: |
DO J = 1,500
DO I= 1, 4
A(J,I) = B(J,I) + C(J,I)
END DO
END DO
|
DO J = 1,500
A(J,1) = B(J,1) + C(J,1)
A(J,2) = B(J,2) + C(J,2)
A(J,3) = B(J,3) + C(J,3)
A(J,4) = B(J,4) + C(J,4)
END DO
END
|
6.2.3 Avoiding memory contention
Central memory addresses on Cray PVP systems are interleaved among memory banks and other layers of the memory hierarchy (one or more of bank, section, subsection, group, quadrant, or pseudobank, depending on the system). This interleaved architecture allows vector CPUs to access memory by cycling through the hardware at each layer of the memory architecture. Cycling through each layer hides latency, allowing load and store operations to move at vector pipeline speeds.
Memory conflicts occur when one reference to memory tries to use the same memory layer hardware as another reference to memory, while the second reference is in progress. Conflicts at the subsection and bank levels can have serious performance implications on Cray PVP systems, because they defeat the latency-hiding effect of interleaved memory.
The most common source of memory conflict is vectorizing on the least rapidly changing dimension of an array (last in Fortran 90, first in C++) when the most rapidly changing dimension (first in Fortran 90, last in C++) is declared to be a multiple of a power of 2, such as 16, 32, or 64. Another possible source of memory conflict for both Fortran and C++ is arrays of structures in which the size of the structure is 16, 32, and so on. This is more difficult to avoid, but should be considered if performance is less than expected.
You can avoid this source of memory conflict by applying the following techniques to a loop or array:
- Extending the most rapidly changing dimension of the array to an odd number (extending the structure size in an array of structures)
- Transposing the indices of the array declaration and all subsequent references (change an array of N structures, each containing M elements, into a structure containing MN-element arrays)
The following C++ and Fortran examples illustrate inefficient access to memory:
C++ example:
float a[100][32];
for (i=0; i < 32;i++)
for (j=0;j < 100;j++)
a[j][i] = b[j][i] * 2.0;
Fortran example:
real a(32,100),b(32,100)
do i = 1, 32
do j = 1,100
a(i,j) = b(i,j) * 2.0
end do
end do
Order of access for C++ is a[0][0], a[1][0], a[2][0], and so on; access is down the columns of the array. For Fortran, the order of access is a(1,1), a(1,2), a(1,3), and so on; access is across the rows. Either case will cause a stride of 32 through memory, resulting in serious memory conflict.
The following examples show two ways to modify the code to improve performance. The following C++ and Fortran versions have been changed to obtain stride-1 memory access by interchanging the loop nests. The C++ version now references across the rows and the Fortran version, down the columns.
C++ example:
float a[100][32];
for (j=0;j < 100;j++)
for (i=0;i < 32;i++)
a(i,j) = b(i,j) * 2.0
Fortran example:
real a(32,100),b(32,100)
do j = 1,100
do i = 1,32
a[j][i] = b[j][i] * 2.0;
end do
end do
Another method is to change the array declarations from 32 to 33, as in the following examples. This changes the stride through memory from a bad stride (32) to a good stride (33). However, as shown in the previous example, stride-1 is best.
C++ example:
float a[100][33];
for (i=0;i < 32;i++)
for (j=0;j < 100;j++)
a[j][i] = b[j][i] * 2.0;
Fortran example:
real a(33,100),b(33,100)
do i = 1,32
do j = 1,100
a(i,j) = b(i,j) * 2.0
end do
end do
6.3 Using library routines
If your program contains sections of code that perform common mathematical functions, you might be able to substitute calls to library routines for these functions. Using library routines can greatly improve the performance of your code.
After using the techniques described in this guide to maximize single CPU performance of your code, you might want to further reduce its elapsed (wall-clock) time by using parallel processing, which is also called multitasking. Parallel processing is the technique of splitting code into subtasks to be run in parallel on multiple CPUs of a Cray PVP system. Although reducing overall elapsed time, effective use of multitasking will add moderately to the user CPU time of the code and to its memory size.
Parallel processing techniques will be discussed in a separate guide.
|