Introduction Evaluation Memory-Bound Code I/O Bound Code CPU-Bound Code

Next | Section 6: Optimizing CPU-bound code.
Return | Introduction and Table of Contents

5.0 Analyzing CPU-bound Code

In Section 1.3, you determined that your code is CPU bound and that CPU optimization techniques might improve the performance of the code. Before you optimize code for CPU performance, you must determine which routines and which sections of code in those routines are using the most user CPU time. After you determine which routines are using the most time, you can begin evaluating ways to optimize the code to run more efficiently. CPU optimization incorporates the following general steps:

  1. Analyze the code to determine where it spends most of its user CPU time (see Section 5.2).

  2. Determine whether you can make the code run faster (see Section 5.3).

  3. Apply an optimization technique (see Section 6).

  4. Check your answers (see Section 5.4).

  5. Return to the initial analysis phase to see if you are satisfied with the performance improvement brought about by applying a technique (see Section 1).

This section provides help in determining where your code spends most of its CPU time and whether you can make it run faster. Section 6, provides optimization techniques. Figure 12 displays a flowchart of the steps described in these chapters. This flowchart corresponds to the CPU-bound path in Figure 1, and to the CPU optimization box in Figure 2.

Figure 12. Analyzing and determining optimization techniques for CPU-bound code

Figure: Evaluate Code

5.1 Understanding CPU optimization

CPU optimization is based on scalar and vector processing. A scalar is a single value that can be manipulated by the scalar hardware registers. Scalar processing consists of performing logical, arithmetic, or memory operations on scalar registers one at a time. A vector is a series of values on which instructions operate. A vector can be an array, an array of structures, or any subset of an array (such as a row, column, or diagonal). When arithmetic, logical, or memory operations are applied to vectors, they use the hardware vector registers, and this is referred to as vector processing.

Vector processing is almost always much faster than scalar processing because a single machine-level instruction can operate on a series of operands within a vector register. This is a form of low-level parallelism that allows many operations to run concurrently within a single CPU in a Cray PVP system. A vectorized loop in a program is processed with hardware vector registers and typically runs about 10 times faster on a CRAY Y-MP E system than when executed with scalar registers. Therefore, most CPU optimization techniques focus on using as much vectorization as possible within the code.

5.2 Identifying the code segments that consume the most user CPU time

Profiling your code (see the following section) allows you to determine which code segments consume the most user CPU time. The additional tools described in Section 5.2.2, might also be helpful to you. After analyzing the code, go to Section 5.3, to determine why the code runs slowly.

5.2.1 Profiling code

Using the prof(1) utility and prof library together allow you to see a profile of the code. This profile indicates how much time the code spends executing different segments within routines. The prof utility uses run- time statistical sampling to determine where the code is likely to be executing instructions. At a predetermined time interval, the prof utility samples the CPU program counter (PC) register and matches it to a code block (machine instructions) within the process in memory. In this way, it is able to identify blocks of code that are frequently found in the PC register.

The code blocks, or buckets, are evenly sized, predetermined address intervals. You can modify both the sampling rate (time interval) and the size of the address buckets to trade off between finer granularity sampling or lower overhead. Both a larger time interval and a larger address bucket offer lower overhead. After execution, the prof utility collates the statistics for the final report.

Note: Because prof samples the program counter, it produces probabilities, not exact results. The longer you run prof , the more likely you are to get a representative sampling. The profiled program should have an execution time that is greater than 20 seconds to ensure a representative sampling.

Use the profview(1) utility to examine data generated by the prof command, either through reports or through interactive displays. The prof -x command prints output in a raw form that can be post-processed by the profview command. If a symbol table was generated during compilation, profview also reports the statistics according to program labels.

 

Procedure 11: Using the prof utility to check code segment timing

  Use the following procedure to produce a dynamic profile of the location at which the code uses the most CPU time. Code examples follow the procedure.

  1. Compile the code with symbol tables and load it with the prof library.

  2. Execute the code.

  3. Profile the code.

  4. View the sampling data from prof.

    The following examples show code that performs this procedure. When you enter the profview(1) command as shown in these examples, you will be using profview interactively with the X Window System interface.

    Fortran 90 example:

    f90 -G1 -l prof yourcode.f90 
    ./a.out 
    prof -x a.out  your.raw.file 
    profview your.raw.file
    

    C++ example:

    CC -Gp -lprof yourcode.C 
    ./a.out 
    prof -x a.out > your.raw.file 
    profview your.raw.file
    

    The profview command displays a line-mode (ASCII) interface or a graphic user interface (GUI) for interactive use, and also can be used in a batch (noninteractive) environment to simply generate reports.

     

 

 

Procedure 12: Interpreting profview output

  To interpret profview output, use the following procedure:

  1. Examine the pie slice graph presented by profview (see the example in Figure 13). The graph depicts the routines that dominate CPU time. The largest pie slice represents the routine in which the code consumes the most CPU time. Routines are placed in descending order of size in counterclockwise direction.

    Figure 13. profview display

    missing

     

  2. Select a pie slice from the initial pie chart display. When you select the pie slice, the display changes to show statistics within that routine, as shown in Figure 14.

    Figure 14. profview routine-level display

    profview routine-level display

     

  3. Find the code segments within subroutines that consume the most CPU time. These code segments are clearly marked with asterisks on the right-hand side of the statistics table. The largest clusters of asterisks represent the code segments consuming the majority of execution time in the selected routine. Each segment corresponds to a symbol in the form of a statement label or a line number, and each asterisk represents 4% of the execution time.

    The statistics table begins with an entry sequence (indicated by E.entry _sequence) or a preamble code (indicated by P.preamble_code). User- or compiler-generated statement labels are indicated by S.statement_label, while line-number locations are indicated with L. line_number. The example in Figure 14 shows an entry sequence and line numbers. In most cases, a line or statement label with a high percentage of the run time represents the top of a frequently executed vector loop or a statement within a frequently executed scalar loop.

    Note: The prof utility samples the entire code, including intrinsic functions and system libraries that are not shown by flowview(1) (see Section 6.2). Also, profiling provides a finer granularity of sampling within the code by showing the probability of executing instructions inside specific loops.

    For additional information, from the Reports menu, you can select the Select Report Desired:-> Summary option to display the routines that consumed the most CPU time, and within those routines, the code segments (usually loops) that are most frequently executed. These code segments are using the most CPU time.

  4. You have now identified the location at which the code is spending a significant portion of its CPU time. To begin determining whether you can make that code segment run faster, go to Section 5.3.

 

 

5.2.2 Other analysis tools and compiler command-line options

The tools and command-line options described in the following sections might help you analyze CPU-bound code.

5.2.2.1 Flowtrace

The Flowtrace compiler option and flow libraries together provide you with a flowtracing capability that provides run-time, subroutine level timing. flowtrace monitors execution time spent in subprograms, counts the number of calls to each routine, and constructs a dynamic calling tree. When a program is compiled with flowtrace and then is run, the flowtrace information is written to a file called flow.data. Use the flowview(1) command to see the results.

The flowview command displays a line-mode (ASCII) interface or a graphical user interface (GUI) for interactive use, and can also be used in a batch (noninteractive) environment to simply generate reports.

Flowtracing produces an overhead that is directly proportional to the number of subprogram calls within the program. For each subprogram call, two extra calls (to the flowtrace library routines) are issued.

5.2.2.2 Perftrace

The Perftrace feature combines flowtracing with the libperf libraries to provide a capability that times subroutines and functions based on statistics gathered from the Hardware Performance Monitor (HPM) device. Perftrace provides the same kind of statistics as are generated by the hpm(1) command, such as the monitoring of scalar activity, hold-issue conditions, memory use, and vectorization. Perftrace statistics are detailed for individual program units within the program. Perftrace is not available on the CRAY EL series systems.

The perfview(1) command displays information from raw data generated by the Perftrace library, the hpm user command, or the hpmall(8) administrator command. The perfview command displays a line-mode (ASCII) interface or a graphical user interface (GUI) for interactive use, and can also be used in a batch (noninteractive) environment to simply generate reports.

5.2.2.3 Jumptrace

Jumptrace provides exact timings of subroutines and functions during program execution. The jt(1) command gathers timing statistics on code during execution. The jumpview(1) command displays Jumptrace data generated by the jt(1) command.

The Jumptrace feature lets you determine the exact timing of every executed code block in the program. This feature can be contrasted with the flowtrace and Profiling features. The flowtrace feature times routines by using less accurate operating system timer measurements. The Profiling feature samples and records program execution addresses.

The jumpview command displays a line-mode (ASCII) interface or a graphical user interface (GUI) for interactive use, and can also be used in a batch (noninteractive) environment to simply generate reports.

5.2.2.4 CF90 listing

The CF90 compiler can provide a set of static information on the code through its listing tool. The compiler generates a compiler information file (CIF), which is then translated by the cflist(1) tool to produce a CF90 listing.

Generate this listing by using the following command line. The output appears in a file called code.list.

f90 -r2 code.f90

The following example is a portion of a loopmark listing that shows vectorization and parallelization information about the code. Loops marked with a V in the left-hand margin were vectorized by the CF90 compiler.


1                          subroutine vectest(m,n,a,b,c,d,e,g,p,q,s,t) 
2                          real, dimension(n) :: a,p,q 
3                          real, dimension(m,n) :: b,c,d,e,g 
4                          real :: s,t 
5                          integer i,j,l,m,n 
6 
7      P-- v -------- do l=1,n 
8      P    v                 c(m,l)=e(m,l)+p(l) 
9      P    1v -------    do k=1,m 
10     P    1v                  d(k,n)=d(k,n)/g(k,n) 
11     P    12v ------       do i=1,n 
12     P    12v                    a(i)=a(i)+i 
13     P    123v -----          do j=1,m 
14     P    123v                       s=s+t**.5 
15     P   123v                       b(j,i)=q(j)-p(j)+c(m,l) 
16     P   123v ---->          enddo 
17     P   12v ----->       enddo 
18     P   1v ------>    enddo 
19     P-- v ------->  enddo 
20 
21                    return 
22                    end

f90 Compiler - 8 messages: 

1) f90-6217,Vector> A loop starting at line 7
            was partially and conditionally vectorized. 
2) f90-8135,Scalar> Loop starting at line 7 
            was unrolled 2 times. 
3) f90-6403,Tasking> A loop starting at line 7 
            was tasked. 
4) f90-8135,Scalar> Loop starting at line 7 
             was unrolled 2 times. 
5) f90-6209,Vector> A loop starting at line 9 
            was partially vectorized. 
6) f90-6209,Vector> A loop starting at line 11 
            was partially vectorized. 
7) f90-6204,Vector> A loop starting at line 13 
            was vectorized. 
8) f90-6001,Scalar> An exponentiation was replaced 
            by optimization. This may cause numerical 
            differences. 

Note: Enter "explain cf90-6217" (for example) to 
      get a more detailed explanation of f90 Vector 
      message 6217 

5.3 Determining whether you can make code run faster

Use the procedures in this section to determine why identified code sections use the most CPU time and to determine which optimization techniques to apply. If the code is identified by the profview utility as using the most CPU time in a loop, use the techniques described in Procedure 13 to determine if the loop is vectorized. If the most time is not spent in a loop, go to Procedure 14.

 

Procedure 13: Checking for vectorization

 

  1. Determine whether the loop is vectorized (a vector loop) by examining the CF90 loopmark listing (see Section 5.2.2.4) .

    If the loop is vectorized, go to Procedure 14. If not, go to the next step.

  2. If a scalar loop is not vectorized, determine the reason from the following list:
    • Vectorization was turned off by a compiler directive or a command-line option.

    • The expressions inside the loop are sufficiently large or complex.

    • The loop contains a vectorization inhibitor such as one of the following:
      • A call to a function or subroutine (see Procedure 16)
      • A data dependency on results from a previous loop iteration (see Procedure 16)
      • Unstructured program branches into the loop or backward to a location outside the loop (see Procedure 14)
      • An I/O statement
      • Partial-word data manipulation such as characters or bit field references

    Note: If the inhibitor is an I/O statement or a partial word reference, the only remedy is to remove the inhibitor or isolate it in a separate loop (assuming that the remaining code will vectorize). With characters and bitwise operations, it will be necessary to modify your code if you move the inhibitors out of the loop.

 

 

 

Procedure 14: Answering general questions

 

  1. Look at the source code identified by the profview utility as code consuming the most CPU time (see Procedure 12). If it is a scalar code segment containing no loops, it is not using vector register hardware, and the scalar optimization techniques described in this step might help speed up the code. Even if your code spends most of its time in vector loops, answering the following questions and using the corresponding techniques might improve its performance.
    • Does the code contain areas with frequent subroutine calls? If so, consider applying the inlining technique (see Section 6.1.1).

    • Does the code use double-precision (128-bit) or long integer (64-bit) arithmetic? If so, and mathematics will allow, switch to single-precision (64-bit) and 46-bit integer arithmetic (see Section 6.1.2, and Section 6.1.3).

    • Does the code contain frequent conditional expressions and branching? (This code construct will map to the hpm, perfview, or jumpview reports as having a high rate of instruction buffer fetching as the code branches to locations outside the buffer.) If so, consider the following approaches:

      • Use the align compiler directives (!dir$ align for Fortran 90 or #pragma _CRI align for C++ ), which are designed to force a block of code to begin at an instruction buffer boundary. If the code segment fits into the buffer, this directive eliminates the fetching problem.

      • Restructure the code to reduce frequent backward branching and subroutine or function calls. Use the Fortran 90 CASE or the C++ switch language construct, and try to have a top-down flow in the code. You can then reposition the scalar conditional blocks to execute the most frequent true cases first.

    • Does the code contain frequent Fortran 90 calls to external subroutines passing array sections as arguments? If so, avoid excessive subroutine call overhead in Fortran 90 by using the procedure INTERFACE specification (see Section 6.1.4).

  2. If the code is using a common mathematical algorithm such as linear algebra, fast Fourier transform, sort/search, you might be able to take advantage of an existing scientific library routine. Typically, library routines are optimized beyond what the compilers can do for the code, and they justify the overhead of a call. If you think the code will benefit from a library routine, see Section 6.3.

  3. Go to Section 5.4.

 

 

 

Procedure 15: Checking vectorized loops

 

Because the code segment is a vector loop (and not a scalar loop), use one of the following techniques to increase the performance inside the loop:

  • Combine adjacent loops into a single loop (fusion) (see Section 6.2.2.1).

  • Remove an outer loop by replicating the loop body (unwinding) (see Section 6.2.2.2).

  • Reduce trips through the loop by replicating the loop body (unrolling) (see Section 6.2.2.2).

  • Avoid memory contention. This problem will appear as a full vector loop with low MFLOP rates (see Section 6.2.3).

 

 

 

Procedure 16: Vectorizing loops that contain data dependencies or calls to routines

 

If the loop contains data dependencies or calls to functions or subroutines, attempt to vectorize the loop by asking the following questions and using the corresponding techniques:

  1. Does the loop contain function or subroutine calls? If so, use one of the following techniques to optimize the code:
    • Inline the calls (see Section 6.1.1).
    • Push the loop into a subprogram (see Section 6.2.1.1).
    • Segment the loop (see Section 6.2.1.2).
  2. Does the loop contain a dependency? A primary inhibitor of vectorization is a data dependency, which is caused by applying vectorization to a nonvectorizable construct. A loop with results from one loop pass that are needed in future passes of the same loop has a data dependency conflict (also known as a data dependence) and might not be completely optimized.

    If the code contains a dependency, use one of the following techniques:

    • Segment the loop (see Section 6.2.1.2).
    • Convert scalar recurrences to vector arrays (see Section 6.2.1.3).
    • Avoid ambiguous pointer aliasing (see Section 6.2.1.4).
    • Choose a new algorithm that vectorizes.

 

 

5.4 Checking answers

After applying any optimization technique, check your answers. Some optimization techniques alter the order of operations in the code and might change the outcome. If the answers are not the same as those you received before applying the optimization technique one of the following things is occurring:

  • In applying the technique, you intended to change the code, but might have inadvertently changed the algorithm. If this is the case, revert to the code as it was written before applying the optimization technique. You might be able to either apply the technique differently so that it does not change the algorithm, or apply a different technique.

  • The technique you applied might have exposed a numeric sensitivity in the code. If so, either revert to the code as it was written before applying the optimization technique, or try to remove the numeric instability

Return to the initial analysis phase to see if you are satisfied with the performance improvement brought about by applying a technique (see Section 1).

    Next | Section 6: Optimizing CPU-bound code.
Previous | Section 4: Optimizing I/O-bound code
  Return | Introduction and Table of Contents

Contact webmaster@asc.edu with questions or comments regarding this page.
Last updated Sept. 30, 1999 -- (c)1999 Alabama Supercomputer Authority