
Next | Section 2: Analyzing and optimizing memory-bound code.
Return | Introduction and Table of Contents

1.0 Evaluating Code

The first step in optimizing a program is evaluating its overall performance. This allows you to decide where to focus optimization efforts. When you compile and execute a program on a Cray PVP system, it usually will be dominated by one of the following general activities:

  • Memory management
  • Input/output (I/O) processing
  • CPU computation

The flowchart in Figure 1 shows a comprehensive view of the single-CPU optimization process described in this document. This is an iterative method that requires you to determine when to stop optimizing the code.

Figure 1. Optimization Overview


 

The 'Evaluate code' box in Figure 1 is expanded in Figure 2 to summarize the recommended initial analysis.

Figure 2. Evaluate Code


1.1 Memory Bound Programs

When the compiled program begins execution, it becomes a UNICOS user process in memory. The process resides in memory during execution and interacts with the operating system. Its size might grow or shrink during execution; this growing and shrinking is referred to as memory management. If the process spends excessive elapsed time managing memory, it is considered a memory-bound process.

A general rule for the size of a code's process in memory is 'smaller is better'. The advantages of a smaller process are reductions in swap frequency, wait-time, and time-to-load. Although there are tradeoffs described later in this manual that provide exceptions to this rule, for now you need only to identify whether your process is memory bound.

The ja utility reports job or session-related accounting information provided by the job accounting daemon. It can provide you with a report that contains an overall view of memory usage, I/O, and system overhead. This report can help you determine whether your code is memory bound, I/O bound, or CPU bound. Figure 3 shows the steps to take to obtain a ja report, and a sample ja report.

To find out how much user memory is available on your Cray PVP system, use the sysconf, target or sar -M commands, or ask your system administrator. The sysconf(1) command reports a number for user memory (USRMEM) in the SOFTWARE portion of its report, as shown in the sample portion of a sysconf command output in Figure 4.

Procedure 1: Determining whether the code is memory bound

 

To determine whether the code is memory bound, perform the following steps:

  1. To obtain a report that shows an overall view of memory usage, I/O, and system overhead, run the job accounting utility, ja -clth (single-tasked), with the code, as shown in Figure 3. (To save an extra step, you can also run ja(1) concurrently with hpm(1), the results of which you will use later.)

  2. Find the Memory HiWater column of the ja report. Perform the following steps to determine if the memory high-water mark for the process is a significant fraction of available user memory.

    Note: A significant fraction of available user memory on your system is a subjective measure based on the chances of the code finding room in memory as it competes with other users' jobs. Whether the code finds room in memory depends on how many jobs are competing for memory, the job size, the priority of the process, memory latency, and other factors.

    1. Multiply the number in the Memory HiWater column of the ja report by 512. This gives you the maximum size of your process as measured in Cray words.
    2. Determine the amount of available user memory and the number of available processors on your system by using the UNICOS sysconf(1) command (see Figure 4). Use the USRMEM value in the SOFTWARE report of the sysconf command as the amount of available user memory for your system. Use the NCPU value in the HARDWARE report of the sysconf command as the number of available processors.
    3. Although you might be able to execute a process as large as the size stated in USRMEM of the sysconf report, you are probably sharing your Cray PVP system with other users. Also, your Cray PVP system most likely has multiple CPUs. Therefore, use the following rules of thumb to determine whether the memory high-water mark for the process is a significant fraction of available user memory.
      • If the Cray PVP system on which the code is running has 4 or fewer CPUs, divide the maximum size of your process by user memory (USRMEM on the sysconf output). If the result is equal to or greater than 0.333, your process will probably benefit from optimization of memory management. See Section 2 on optimizing memory-bound code. If the result is less than 0.333, go to the next step.
      • If the Cray PVP system on which the code is running has more than 4 CPUs, divide the user memory (USRMEM on the sysconf output) by the number of processors (NCPU on the sysconf output). Compare this number to the maximum size of your process. If the maximum size of your process is bigger than this number, your process will probably benefit from optimization of memory management. See Section 2 on optimizing memory-bound code. If the maximum size of your process is not bigger than this number, go to the next step.

  3. If the number in the Sys CPU Seconds column in the ja report for the code is greater than 10% of the number in the User CPU Seconds column, see Section 2 on optimizing memory-bound code. Inefficient memory management is one cause of excessive elapsed time. If the number in the Sys CPU Seconds column is not greater than 10% of the number in the User CPU Seconds column, your code is not memory bound. Go to Section 1.2.
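
The rules of thumb in Procedure 1 can be sketched as a small Python function. This is only an illustrative sketch: the function and parameter names are hypothetical, the values must be read by hand from the ja and sysconf reports, and it assumes that USRMEM is expressed in Cray words so that it can be compared directly with the process size.

```python
WORDS_PER_CLICK = 512  # Step 2.1: Memory HiWater x 512 gives Cray words


def is_memory_bound(hiwater, usrmem_words, ncpu, sys_cpu_sec, user_cpu_sec):
    """Apply the Procedure 1 rules of thumb (hypothetical helper).

    hiwater       -- Memory HiWater column of the ja report
    usrmem_words  -- USRMEM from sysconf (assumed to be in Cray words)
    ncpu          -- NCPU from sysconf
    sys_cpu_sec   -- Sys CPU Seconds column of the ja report
    user_cpu_sec  -- User CPU Seconds column of the ja report
    """
    max_words = hiwater * WORDS_PER_CLICK

    if ncpu <= 4:
        # 4 or fewer CPUs: compare the process size with 1/3 of user memory.
        significant = max_words / usrmem_words >= 0.333
    else:
        # More than 4 CPUs: compare with the per-processor share of memory.
        significant = max_words > usrmem_words / ncpu

    if significant:
        return True

    # Step 3: system time above 10% of user time also points to Section 2.
    return sys_cpu_sec > 0.10 * user_cpu_sec
```

A result of True suggests continuing with Section 2; a result of False suggests continuing with Section 1.2. The note in Step 2 still applies: what counts as a significant fraction of memory is subjective, so treat the result as a first indication only.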

 

Figure 3. Sample ja report


Figure 4. Sample sysconf command output


1.2 I/O Bound Programs

If a program spends most of its elapsed time performing I/O, it is considered I/O bound. I/O optimization can offer a significant savings in elapsed time. If the design of a program requires large amounts of I/O, you should optimize for I/O performance. However, if the code runs 24 hours and performs only 1 hour of I/O, there are other areas of optimization that will probably have a greater impact on overall code performance.

Procedure 2: Determining whether the code is I/O bound

 

  1. Run the job accounting utility, ja -clth (single-tasked), with the code to get a report for an overall view of memory usage, I/O, and system overhead. You can use the same report you created in Procedure 1 without running the code again.
  2. If the sum of the number in the I/O Wait Sec Lck column plus the number in the I/O Wait Sec Unlck column of the ja report is close to or greater than 50% of the number in the User CPU Seconds column, the code is probably I/O bound, and you can use Section 1.2.1 to further assess a potential problem. If not, and you have already determined that the code is not memory bound, go to Section 1.3.
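
The comparison in Step 2 can be sketched as follows. The function name and the threshold parameter are hypothetical; because "close to" 50% is a judgment call, the threshold is left adjustable rather than fixed.

```python
def is_io_bound(wait_lck_sec, wait_unlck_sec, user_cpu_sec, threshold=0.50):
    """Compare total I/O wait time with User CPU Seconds (hypothetical helper).

    wait_lck_sec    -- I/O Wait Sec Lck column of the ja report
    wait_unlck_sec  -- I/O Wait Sec Unlck column of the ja report
    user_cpu_sec    -- User CPU Seconds column of the ja report
    threshold       -- fraction of user CPU time treated as the cutoff;
                       lower it slightly to capture "close to 50%"
    """
    io_wait = wait_lck_sec + wait_unlck_sec
    return io_wait >= threshold * user_cpu_sec
```

For example, a code with 30 seconds of locked wait, 25 seconds of unlocked wait, and 100 User CPU Seconds has 55 seconds of I/O wait, which exceeds the 50-second cutoff, so the function returns True.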

 

1.2.1 Narrowing the focus for I/O bound code

If you suspect the code is I/O bound, you can also use the ja report to determine the dominant type of I/O, the efficiency of the code's requests, and an estimate of the effective transfer rate to the I/O device. This information will help you choose an I/O optimization technique that offers you the best opportunity for performance improvement.

If the I/O wait time for the code is prominent (as you determined in Procedure 2, Step 2) or if you think it should be smaller, use the following information from the ja report for the code to analyze and improve performance. This information will be useful when you continue analyzing your code by using the techniques found in Section 3: Analyzing I/O-bound code.

  • For unformatted disk I/O, if the number in the I/O Wait Sec Unlck column is equal to or larger than the number in the I/O Wait Sec Lck column, the code is making significant use of the system cache. In most cases, avoiding the system cache will improve the overall performance of the code.
  • Perform the following arithmetic with information from the ja report:

    (Kwords Xferred ÷ Log I/O Request ÷ 1000)

    The result is the average size of the code's I/O requests (in Mwords per request). Increasing this average reduces the frequency of the I/O requests in the code, which improves the overall I/O performance.

  • Perform the following arithmetic with information from the ja report:

    (Kwords Xferred × 8 ÷ 1000) ÷ (I/O Wait Sec Lck + I/O Wait Sec Unlck)

    The result is an estimate of the effective transfer rate of the code's I/O (in Mbytes per second). If this rate is significantly lower than the sustained rate of the program's I/O device, there is an inefficiency somewhere in the I/O process. You will want to either raise the effective transfer rate closer to the speed of the I/O device or use a faster device.

    Use the information you have obtained from the ja report, and go to Section 4 to continue evaluating the I/O-bound code.
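
The arithmetic in the list above can be written out as a short Python sketch. The function names are hypothetical; the inputs are the Kwords Xferred, Log I/O Request, I/O Wait Sec Lck, and I/O Wait Sec Unlck values read from the ja report.

```python
def average_request_mwords(kwords_xferred, log_io_requests):
    """Average size of an I/O request, in Mwords per request."""
    return kwords_xferred / log_io_requests / 1000


def effective_rate_mbytes_per_sec(kwords_xferred, wait_lck_sec, wait_unlck_sec):
    """Estimated effective transfer rate, in Mbytes per second."""
    mbytes = kwords_xferred * 8 / 1000  # 8 bytes per Cray word
    return mbytes / (wait_lck_sec + wait_unlck_sec)


def uses_system_cache(wait_lck_sec, wait_unlck_sec):
    """For unformatted disk I/O: unlocked wait >= locked wait suggests
    significant use of the system cache."""
    return wait_unlck_sec >= wait_lck_sec
```

As a worked example with made-up numbers: 500,000 Kwords transferred in 1,000 logical requests gives an average of 0.5 Mwords per request, and with 30 seconds of locked wait plus 10 seconds of unlocked wait the estimated rate is 4,000 Mbytes ÷ 40 seconds, or 100 Mbytes per second.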

1.3 CPU Bound Programs

If the code is neither memory bound nor I/O bound, it must be CPU-bound code; that is, it spends most of its elapsed time performing CPU calculations. For all Cray PVP systems except the CRAY EL series, an hpm(1) report on the code will tell you how effectively the code is using vector registers, instruction buffers, and memory ports.

Procedure 3: Determining CPU performance statistics

 

The hpm(1) command allows a Cray PVP system user to access the hardware performance monitor (HPM) and obtain overall program timing information. For complete information on the hpm utility, see the hpm(1) man page.

On Cray PVP systems other than the CRAY EL series, use the following procedure to access the code's CPU performance statistics and to determine whether the code is optimized for single-CPU speed:

  1. Produce an hpm report by first compiling the program, then issuing the hpm command with the program name as shown in the following example:
    f90 yourcode.f90 or CC yourcode.C 
    
    hpm ./a.out

    Figure 5 shows a sample hpm group 0 report from a CRAY Y-MP E system.

    Figure 5. Sample hpm report


  2. Determine whether the code is dominated by scalar or vector operations. To do this, use the values shown in the hpm report to perform the following steps. If the code is dominated by scalar operations, see Section 6 on analyzing the CPU-bound code. If the code is dominated by vector operations, go to Step 3.
    1. Find the millions of instructions per second (MIPS) for the code by looking at the Million inst/sec (MIPS) row of the hpm report.
    2. Find the floating point operations per second (FLOPS) for the code by looking at the Floating ops/CPU second row of the hpm report.
    3. Use Table 1 to determine whether the MIPS and FLOPS for the code are low or high, based on the hardware expectations. The code is not likely to achieve peak FLOPS or MIPS rates, but some codes are capable of performing at substantial fractions of peak speed.

      Low FLOPS rates range from single-digit rates up to 20% of the peak rate for that system. High FLOPS rates are generally anything above 66% of the peak rate for that system.

      Low MIPS rates are at or near the bottom of the range for that system. High MIPS rates are generally anything near 50% of peak MIPS for that system.

      Table 1. Single-CPU hardware expectations

      Cray Series   T90        C90        Y-MP E     J90
      Peak FLOPS    2000M      1000M      333M       200M
      MIPS Range    30 to 500  20 to 250  20 to 100  20 to 80

    4. Use Table 2 to determine whether the code is dominated by scalar or vector operations.

      Table 2. Determining whether the code is dominated by scalar or vector operations

      FLOPS           MIPS    Determination
      High or Medium  Low     Code is most likely dominated by vector operations. The code is performing well. Go to Step 3, which might offer clues to modest performance gains.
      Medium          Medium  Code is most likely a mix of scalar and vector operations. CPU optimization might help improve its CPU performance. See Section 6 on analyzing the CPU-bound code. Step 3 and Step 4 might also offer clues to performance gains.
      Medium or Low   High    Code is most likely dominated by scalar operations (is not vector code). CPU optimization will help improve its CPU performance. See Section 6 on analyzing the CPU-bound code. Step 3 and Step 4 might also offer clues to performance gains.
      Low             Medium  Code is dominated by scalar operations. CPU optimization will help improve its CPU performance. See Section 6 on analyzing the CPU-bound code.
      Low             Low     Code is performing poorly whether it is scalar or vector. It has a CPU performance problem. CPU optimization will help improve its CPU performance. See Section 6 on analyzing the CPU-bound code.

  3. Determine if the instruction buffer fetches per second (as shown by the Inst.buffer fetches/sec row of the hpm report) are close to or greater than 0.1 million per second. If yes, see the following note and Section 6 on analyzing the CPU-bound code. If no, go to the next step.

    Note: Excessive instruction buffer fetches tend to slow down the code. If the code has a high rate (approaching 0.1 million per second), the code may have excessive jumping (as with go to or if-then-else constructs) or excessive calls to subprograms. CPU optimization probably will help the overall performance.

  4. Determine the computational intensity ratio of the code. Use the values in the hpm report, as follows:

    1. Divide the number in the Floating ops/CPU second row by the number in the CPU mem.references/sec row. This is the computational intensity of the code.

      The computational intensity ratio is the ratio of the floating-point operation rate to the memory access rate. This ratio should reflect the floating-point operations in the code (for example, a=b+c performs 1 floating-point operation and 3 memory references, for a computational intensity of 0.33). A ratio less than 1/3 indicates that the code makes excessive use of the CPU memory ports while the remainder of the processor idles. This is usually caused by memory-to-memory traffic (a=b), highly scalar code, or a hidden performance problem.

    2. If the computational intensity ratio of the code is less than 1/3, the program is not making effective use of the Cray PVP system CPU. See Section 6 on analyzing the CPU-bound code. If the computational intensity ratio of the code is 1/3 or more and you have already determined that the code is not memory or I/O bound, the code is probably optimized for single-CPU performance.
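
Steps 2 through 4 of Procedure 3 can be sketched in Python as shown below. This is a rough illustration, not part of the hpm utility: all names are hypothetical, the peak rates come from Table 1, and treating FLOPS rates between 20% and 66% of peak as "medium" is an assumption. The MIPS level is passed in by the reader, because "near the bottom of the range" and "near 50% of peak" are judgment calls.

```python
# Peak single-CPU FLOPS rates from Table 1, in millions per second.
PEAK_MFLOPS = {"T90": 2000, "C90": 1000, "Y-MP E": 333, "J90": 200}

# Table 2 as a lookup keyed by (FLOPS level, MIPS level).
TABLE_2 = {
    ("high", "low"): "vector",
    ("medium", "low"): "vector",
    ("medium", "medium"): "mixed",
    ("medium", "high"): "scalar",
    ("low", "high"): "scalar",
    ("low", "medium"): "scalar",
    ("low", "low"): "poor",
}


def flops_level(mflops, system):
    """Classify the Floating ops/CPU second value (given in millions)."""
    fraction = mflops / PEAK_MFLOPS[system]
    if fraction <= 0.20:
        return "low"
    if fraction > 0.66:
        return "high"
    return "medium"  # assumption: everything between the two thresholds


def determination(flops, mips):
    """Look up Table 2; some combinations do not appear in the table."""
    return TABLE_2.get((flops, mips), "not covered by Table 2")


def excessive_buffer_fetches(fetches_millions_per_sec, threshold=0.1):
    """Step 3: instruction buffer fetches at or above 0.1 million per second."""
    return fetches_millions_per_sec >= threshold


def computational_intensity(flops_per_sec, mem_refs_per_sec):
    """Step 4: floating-point operation rate divided by memory reference rate."""
    return flops_per_sec / mem_refs_per_sec


def uses_cpu_effectively(flops_per_sec, mem_refs_per_sec):
    """A ratio below 1/3 means the CPU is not being used effectively."""
    return computational_intensity(flops_per_sec, mem_refs_per_sec) >= 1 / 3
```

For instance, a code running at 150 million floating-point operations per second on a C90 is at 15% of peak, which this sketch classifies as low; combined with a low MIPS rate, Table 2 indicates a code that is performing poorly whether it is scalar or vector.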

 


Contact webmaster@asc.edu with questions or comments regarding this page.
Last updated Sept. 30, 1999 -- (c)1999 Alabama Supercomputer Authority