1.0 Evaluating Code
The first step in optimizing a program is evaluating its overall
performance. This allows you to decide where to focus optimization efforts.
When you compile and execute a program on a Cray PVP system, it usually will
be dominated by one of the following general activities:
- Memory management
- Input/output (I/O) processing
- CPU computation
The flowchart in Figure 1 shows a comprehensive view of the single-CPU
optimization process described in this document. The method is iterative,
and you must decide when to stop optimizing the code.
Figure 1. Optimization Overview
The 'Evaluate code' box in Figure 1 is expanded in Figure 2 to summarize the recommended initial analysis.
Figure 2. Evaluate Code
1.1 Memory Bound Programs
When the compiled program begins execution, it becomes a UNICOS user
process in memory. The process resides in memory during execution and
interacts with the operating system. Its size might grow or shrink during
execution. This growing and shrinking is referred to as memory management.
If the process spends excessive elapsed time managing memory, it is
considered a memory-bound process.
A general rule for the size of a code's process in memory is 'smaller
is better'. The advantages of a smaller process are reductions in swap
frequency, wait-time, and time-to-load. Although there are tradeoffs
described later in this manual that provide exceptions to this rule, for now
you need only to identify whether your process is memory bound.
The ja utility reports job or session-related accounting
information provided by the job accounting daemon. It can provide you with a
report that contains an overall view of memory usage, I/O, and system
overhead. This report can help you determine whether your code is memory
bound, I/O bound, or CPU bound. Figure 3 shows the steps to take to obtain a
ja report, and a sample ja report.
To find out how much user memory is available on your Cray PVP system, use
the sysconf, target, or sar -M commands, or ask your system administrator. The
sysconf(1) command reports a number for user memory (USRMEM) in the
SOFTWARE portion of its report, as shown in the sample sysconf command
output in Figure 4.
Procedure 1: Determining whether the code is memory bound
To determine whether the code is memory bound, perform the following steps:
- To obtain a report that shows an overall view of memory usage, I/O,
and system overhead, run the job accounting utility, ja -clth
(single-tasked), with the code, as shown in the following example. (To save an
extra step, you can also run ja(1) concurrently with hpm(1), the results of
which you will use later.)
- Find the Memory HiWater column of the ja report.
Perform the following steps to determine if the memory high-water mark for
the process is a significant fraction of available user memory.
Note: A significant fraction of available user memory on your system is a
subjective measure based on the chances of the code finding room in memory as
it competes with other users' jobs. Whether the code finds room in memory
depends on how many jobs are competing for memory, the job size, the priority
of the process, memory latency, and other factors.
- Multiply the number in the Memory HiWater column of the
ja report by 512. This gives you the maximum size of your
process as measured in Cray words.
- Determine the amount of available user memory and the number of
available processors on your system by using the UNICOS
sysconf(1) command (see Figure 4). Use the USRMEM value
in the SOFTWARE report of the sysconf command as the amount
of available user memory for your system. Use the NCPU value in the
HARDWARE report of the sysconf command as the number of
available processors.
- Although you might be able to execute a process as large as the
size stated in USRMEM of the sysconf report, you are probably
sharing your Cray PVP system with other users. Also, your Cray PVP
system most likely has multiple CPUs. Therefore, use the following
rules of thumb to determine whether the memory high-water mark for the
process is a significant fraction of available user memory.
- If the Cray PVP system on which the code is running has 4
CPUs or fewer, divide the maximum size of your process by user
memory (USRMEM on the sysconf output). If the result is equal to
or greater than 0.333, your process will probably benefit from
optimization of memory management. See Section 2 on optimizing
memory-bound code. If the result is less than 0.333, go to the
next step.
- If the Cray PVP system on which the code is running has more
than 4 CPUs, divide the user memory (USRMEM on the sysconf
output) by the number of processors (NCPU on the sysconf output).
Compare this number to the maximum size of your process. If the
maximum size of your process is bigger than this number, your
process will probably benefit from optimization of memory
management. See Section 2 on optimizing memory-bound code. If
the maximum size of your process is not bigger than this number,
go to the next step.
- If the number in the Sys CPU Seconds column in the ja
report for the code is greater than 10% of the number in the User
CPU Seconds column, see Section 2 on optimizing memory-bound code.
Inefficient memory management is one cause of excessive elapsed time.
If the number in the Sys CPU Seconds column is not greater than 10% of the
number in the User CPU Seconds column, your code is not memory bound. Go to
Section 1.2.
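The arithmetic in Procedure 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not part of any Cray tool: the function and parameter names are hypothetical, the values are assumed to be copied by hand from the ja and sysconf reports, and USRMEM is assumed to be expressed in the same unit (Cray words) as the computed process size.

```python
# Illustrative sketch of the Procedure 1 rules of thumb.
# All names are hypothetical; inputs are read by hand from ja and sysconf.

HIWATER_MULTIPLIER = 512  # Procedure 1: Memory HiWater * 512 = size in Cray words


def is_memory_bound(hiwater, usrmem_words, ncpu, sys_cpu_sec, user_cpu_sec):
    """Return True if the rules of thumb suggest the code is memory bound."""
    max_words = hiwater * HIWATER_MULTIPLIER

    if ncpu <= 4:
        # 4 CPUs or fewer: process size is at least 0.333 of user memory.
        large_process = max_words / usrmem_words >= 0.333
    else:
        # More than 4 CPUs: process size exceeds user memory per processor.
        large_process = max_words > usrmem_words / ncpu

    if large_process:
        return True

    # Final check: system CPU time greater than 10% of user CPU time.
    return sys_cpu_sec > 0.10 * user_cpu_sec
```

For example, with made-up numbers, is_memory_bound(10000, 16_000_000, 4, 1.0, 100.0) evaluates a 5,120,000-word process against 16,000,000 words of user memory (a ratio of 0.32) and a system time of 1% of user time, so neither rule fires.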
Figure 3. Sample ja report
Figure 4. Sample sysconf command output
1.2 I/O Bound Programs
If a program spends most of its elapsed time performing I/O, it is
considered I/O bound. I/O optimization can offer a significant savings in
elapsed time. If the design of a program requires large amounts of I/O, you
should optimize for I/O performance. However, if the code runs 24 hours and
performs only 1 hour of I/O, there are other areas of optimization that will
probably have a greater impact on overall code performance.
Procedure 2: Determining whether the code is I/O bound
- Run the job accounting utility, ja -clth (single-tasked), with
the code to get a report for an overall view of memory usage, I/O, and system
overhead. You can use the same report you created in Procedure 1 without
running the code again.
- If the sum of the number in the I/O Wait Sec Lck column plus the
number in the I/O Wait Sec Unlck column of the ja report
is close to or greater than 50% of the number in the User CPU
Seconds column, the code is probably I/O bound, and you can use Section
1.2.1 to further assess a potential problem. If not, and you have already
determined that the code is not memory bound, go to Section 1.3.
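The comparison in Procedure 2 can be sketched as follows. The function names are hypothetical, and the inputs are assumed to be typed in from the ja report. The phrase "close to" is a judgment call that the procedure does not quantify, so the sketch exposes the 50% figure as an adjustable threshold rather than hard-coding a margin.

```python
# Illustrative sketch of the Procedure 2 check; names are hypothetical.


def io_wait_fraction(io_wait_lck, io_wait_unlck, user_cpu_sec):
    """Total I/O wait time as a fraction of User CPU Seconds."""
    return (io_wait_lck + io_wait_unlck) / user_cpu_sec


def is_probably_io_bound(io_wait_lck, io_wait_unlck, user_cpu_sec, threshold=0.5):
    """True if I/O wait reaches the threshold (50% by default)."""
    return io_wait_fraction(io_wait_lck, io_wait_unlck, user_cpu_sec) >= threshold
```

Lowering the threshold slightly (for example, to 0.45) is one way to treat values that are "close to" 50% as I/O bound; the right margin depends on the code being analyzed.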
If you suspect the code is I/O bound, you can also use the ja report to determine the dominant type of I/O, the
efficiency of the code's requests, and an estimate of the effective transfer rate to the I/O device. This information
will help you choose an I/O optimization technique that offers you the best opportunity for performance
improvement.
If the I/O wait time for the code is prominent (as you determined in Procedure 2, Step 2) or if you think it
should be smaller, use the following information from the ja report for the code to analyze and improve
performance. This information will be useful when you continue analyzing your code by using techniques found in Section 3: Analyzing I/O-bound code.
- For unformatted disk I/O, if the number in the I/O Wait Sec Unlck column is equal to or larger
than the number in the I/O Wait Sec Lck column, the code is making significant use of the system
cache. In most cases, avoiding the system cache will improve the overall performance of the code.
Perform the following arithmetic with information from the ja report:
(Kwords Xferred ÷ Log I/O Request ÷ 1000)
The result is the average size of the code's I/O requests (in Mwords per request). Increasing this average reduces
the frequency of the I/O requests in the code, which improves the overall I/O performance.
Perform the following arithmetic with information from the ja report:
(Kwords Xferred * 8 ÷ 1000) ÷ (I/O Wait Sec Lck + I/O Wait Sec Unlck)
The result is the average transfer rate of the code's I/O device (in Mbytes per second). If this transfer rate is
significantly lower than the sustained rate of the program's I/O device, there is an inefficiency somewhere in the I/O
process. You will want to either raise the transfer speeds to be closer to the speed of the I/O device or use a faster
device.
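The two calculations above can be written out as a short Python sketch. The function names are hypothetical, and the arguments are assumed to be the Kwords Xferred, Log I/O Request, I/O Wait Sec Lck, and I/O Wait Sec Unlck values copied from the ja report. The factor of 8 converts words to bytes, and each division by 1000 converts kilo-units to mega-units.

```python
# Illustrative sketch of the two ja-report calculations; names are hypothetical.


def avg_request_mwords(kwords_xferred, log_io_requests):
    """Average I/O request size in Mwords per request."""
    return kwords_xferred / log_io_requests / 1000


def transfer_rate_mbytes_per_sec(kwords_xferred, io_wait_lck, io_wait_unlck):
    """Average effective transfer rate in Mbytes per second."""
    mbytes = kwords_xferred * 8 / 1000  # Kwords -> Kbytes -> Mbytes
    return mbytes / (io_wait_lck + io_wait_unlck)
```

With made-up values of 500,000 Kwords transferred in 2,500 logical requests and 50 seconds of total I/O wait, the average request is 0.2 Mwords and the effective rate is 80 Mbytes per second. The second figure is the one to compare against the sustained rate of the I/O device.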
Use the information you have obtained from the ja report, and go to Section 4 to continue evaluating the I/O-bound code.
1.3 CPU Bound Programs
If the code is neither memory bound nor I/O bound, it is CPU-bound code;
that is, it spends most of its elapsed time performing CPU calculations. For
all Cray PVP systems except the CRAY EL series, an hpm(1) report on the code
will tell you how effectively the code is using vector registers, instruction
buffers, and memory ports.
Procedure 3: Determining CPU performance statistics
The hpm(1) command allows a Cray PVP system user to access the hardware performance monitor
(HPM) and obtain overall program timing information. For complete information on the hpm utility, see
the hpm(1) man page.
On Cray PVP systems other than the CRAY EL series, use the following
procedure to access the code's CPU performance statistics and to determine
whether the code is optimized for single-CPU speed:
- Produce an hpm report by first compiling the program, then issuing the hpm command
with the program name as shown in the following example:
f90 yourcode.f90    (for C++ code, use CC yourcode.C instead)
hpm ./a.out
Figure 5 shows a sample hpm group 0 report from a CRAY Y-MP E system.
Figure 5. Sample hpm report
- Determine whether the code is dominated by scalar or vector operations. To do this, use the values shown in the
hpm report to perform the following steps. If the code is dominated by scalar operations, see Section 6 on
analyzing the CPU-bound code. If the code is dominated by vector operations, go to Step 3.
- Find the millions of instructions per second (MIPS) for the code by looking at the
Million inst/sec (MIPS) row of the hpm report.
- Find the floating point operations per second (FLOPS) for the code by looking at the
Floating ops/CPU second row of the hpm report.
- Use Table 1 to determine whether the MIPS and FLOPS for the code are low or high, based
on the hardware expectations. The code is not likely to achieve peak FLOPS or MIPS rates, but
some codes are capable of performing at substantial fractions of peak speed.
Low FLOPS are any rate from single digits up to 20% of the peak rate for that system. High FLOPS
are generally anything above 66% of the peak rate for that system. Low MIPS rates are at or near the
bottom of the range for that system. High MIPS rates are generally anything near 50% of peak
MIPS for that system.
Table 1. Single-CPU hardware expectations
| Cray Series | T90 | C90 | Y-MP E | J90 |
| Peak FLOPS | 2000M | 1000M | 333M | 200M |
| MIPS Range | 30 to 500 | 20 to 250 | 20 to 100 | 20 to 80 |
- Use Table 2 to determine whether the code is dominated by scalar or vector operations.
Table 2. Determining whether the code is dominated by scalar or vector operations
| FLOPS | MIPS | Determination |
| High or Medium | Low | Code is most likely dominated by vector operations. The code is performing well. Go to Step 3, which might offer clues to modest performance gains. |
| Medium | Medium | Code is most likely a mix of scalar and vector operations. CPU optimization might help improve its CPU performance. See Section 6 on analyzing the CPU-bound code. Step 3 and Step 4 might also offer clues to performance gains. |
| Medium or Low | High | Code is most likely dominated by scalar operations (is not vector code). CPU optimization will help improve its CPU performance. See Section 6 on analyzing the CPU-bound code. Step 3 and Step 4 might also offer clues to performance gains. |
| Low | Medium | Code is dominated by scalar operations. CPU optimization will help improve its CPU performance. See Section 6 on analyzing the CPU-bound code. |
| Low | Low | Code is performing poorly whether it is scalar or vector. It has a CPU performance problem. CPU optimization will help improve its CPU performance. See Section 6 on analyzing the CPU-bound code. |
- Determine whether the instruction buffer fetches per second (as shown by the
Inst.buffer fetches/sec row of the hpm report) are close to or greater than 0.1 million per second. If yes,
see the following note and Section 6 on analyzing the CPU-bound code. If no, go to the next step.
Note: Excessive instruction buffer fetches tend to slow down the code. If the code has a high rate (approaching
0.1 million per second), the code may have excessive jumping (as with go to or if-then-else constructs) or excessive
calls to subprograms. CPU optimization probably will help the overall performance.
- Determine the computational intensity ratio of the code. Use the values in the hpm report, as
follows:
- Divide the number in the Floating ops/CPU second row by the number in the
CPU mem.references/sec row. This is the computational intensity of the code.
The computational intensity ratio is the ratio of the floating-point operation rate to the memory access rate. This
ratio should reflect the floating-point operations in the code. For example, a=b+c performs 1 floating-point
operation and 3 memory references (two loads and one store), for a computational intensity of 0.33. A ratio
less than 1/3 indicates excessive use of the CPU memory ports while the remainder of the processor idles. This is
usually caused by memory-to-memory traffic (a=b), highly scalar code, or a hidden performance problem.
If the computational intensity ratio of the code is less than 1/3, the program is not making effective use of
the Cray PVP system CPU. See Section 6 on analyzing the CPU-bound code. If the computational intensity ratio of
the code is 1/3 or more and you have already determined that the code is not memory or I/O bound, the code is
probably optimized for single-CPU performance.
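The numeric checks in Procedure 3 can be sketched in Python as shown below. The names are hypothetical, and the inputs are assumed to be read by hand from the hpm report. The peak FLOPS values come from Table 1. Treating everything between the 20% and 66% marks as "medium" is an assumption of this sketch, because the text defines only the low and high bands. MIPS ratings are left out because the text describes them only qualitatively.

```python
# Illustrative sketch of the Procedure 3 checks; names are hypothetical.

# Peak single-CPU FLOPS from Table 1.
PEAK_FLOPS = {"T90": 2000e6, "C90": 1000e6, "Y-MP E": 333e6, "J90": 200e6}


def rate_flops(flops_per_sec, series):
    """Rate a measured FLOPS value as 'low', 'medium', or 'high'."""
    fraction = flops_per_sec / PEAK_FLOPS[series]
    if fraction <= 0.20:
        return "low"
    if fraction > 0.66:
        return "high"
    return "medium"  # assumption: the band between low and high


def excessive_buffer_fetches(fetches_per_sec, limit=0.1e6):
    """True if instruction buffer fetches reach 0.1 million per second."""
    return fetches_per_sec >= limit


def computational_intensity(flops_per_sec, mem_refs_per_sec):
    """Floating-point operation rate divided by memory reference rate."""
    return flops_per_sec / mem_refs_per_sec


def uses_cpu_effectively(flops_per_sec, mem_refs_per_sec):
    """True if the computational intensity is 1/3 or more."""
    return computational_intensity(flops_per_sec, mem_refs_per_sec) >= 1 / 3
```

For the a=b+c example, computational_intensity(1, 3) is about 0.33, which sits exactly at the 1/3 boundary and therefore counts as effective use. A pure copy such as a=b performs no floating-point operations and yields an intensity of 0.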