2.0 Optimizing Memory-bound Code
Cray PVP systems contain fixed-size real memory, and they have no virtual memory
capabilities. Large memory codes must fit within available user memory on these systems.
User processes that dynamically expand and shrink during execution are managing memory,
and they must compete with other processes for real memory space. Processes that spend
excessive time managing memory are said to be memory bound.
Inefficient memory management within a process can increase its elapsed time, and this
can affect other user processes in the system. If the process is large compared to the
amount of available user memory, small performance problems can become worse. When a
process attempts to dynamically change its size, it becomes a candidate for being swapped
out of memory to an external device (such as a disk), where it stays until the operating
system can find enough contiguous memory to swap it back in. Also, the code might make
system memory calls without your knowledge. These situations can greatly increase elapsed time.
Optimizing memory-bound code reduces elapsed time and system CPU time, and may have a
side effect of reducing user CPU time.
Procedure 4: Steps for memory optimization
Your initial analysis, done in Procedure 1, has indicated that the code could be memory
bound. Continue with memory optimization by following these basic steps:
1. Use the procstat and procview utilities to generate a memory
report for your user process. With that report, identify whether the code is memory
bound; that is, determine whether it spends large amounts of system CPU time to process
memory requests or to wait for memory to become available.
2. If your process is memory bound, continue to the next step. If not, go to Step 5.
3. Evaluate dynamic memory alternatives and apply a specific memory management
optimization technique based on your evaluation.
4. Check your answers. If the answers are not the same as those you received
before applying the optimization technique, one of the following things is occurring:
   - In applying the technique, you intended to change the code, but
   might have inadvertently changed the algorithm. If this is the case, revert to the code
   as it was written before applying the optimization technique. You might be able to either
   apply the technique differently so that it does not change the algorithm, or apply a
   different technique.
   - The technique you applied might have exposed a numeric sensitivity
   in the code. If so, either revert to the code as it was written before applying the
   optimization technique, or try to remove the numeric instability.
5. Obtain new elapsed time data. If you are satisfied with the new elapsed
time of your code, return to the initial analysis phase to examine other optimization
opportunities. If your code continues to be memory bound, return to Step 3.
Figure 6 displays a flowchart of the steps described in this section. This flowchart
corresponds to the memory-bound path displayed in Figure 1 and the memory management box
in Figure 2.
Figure 6. Optimizing memory-bound code
2.1 Understanding memory management
Two basic types of dynamic memory management are available on Cray PVP systems running
the UNICOS operating system:
- Dynamic memory managed by the system heap routines
- Expandable dynamic common blocks
You cannot use both methods within the same code.
When code is compiled and loaded, it is translated into Cray machine instructions
(object code), logically linked together with library routines, and packaged in an
executable file that is, by default, named a.out. When you execute
a.out, it is swapped into memory and becomes a user process. Most UNIX systems
have at least the following three distinct areas defined in memory for each user process:
- User area
- Stack area
- Heap
The user area is where the object code (text area) and external and static
variables (data area) reside. The stack area is commonly used to temporarily store
context information for each routine that calls another subprogram. The heap is
a dynamic area used for all other memory needs (except Fortran COMMON): dynamic
variables, I/O buffers, flexible file input/output (FFIO) user cache, and so on.
The heap and the stack areas are allowed to grow dynamically, but the user area
remains fixed. On a virtual memory system, both the heap and the stack area of a single
process can grow independently with a virtual hole between them.
Cray PVP systems use only real memory, and the UNICOS operating system implements
dynamic allocation of heap and stack space differently from virtual systems. A UNICOS
user process has a dynamically expandable heap, with stack space wholly contained within
the heap. The stack space is managed without benefit of direct hardware support, and both
heap and stack space must appear to grow and shrink independently.
Initial memory allocation for the heap and the stack space is established at the load
step by the segldr(1) utility (called by the f90(1), CC(1),
and cc(1) commands). With SEGLDR directives, you can specify initial sizes for
both heap and stack, as well as their respective increment size.
The heap is dynamic and can be increased or decreased from an executing code by using
calls to the library. Routines (including Cray libraries) must request space from the
heap directly through calls to the UNICOS system.
Dynamic memory management is inherently expensive to a user process because it
requires service from the operating system through system calls. An expansion of the heap
might require the process to be relocated in memory. If there is no remaining space large
enough in memory, the UNICOS operating system will swap the requesting process to a
secondary device (such as a disk) until enough memory becomes available. This adds
elapsed time and system CPU time.
The UNICOS operating system provides a second method to manage memory for Fortran 90
codes, the dynamic common block. To use this method, you must specify only one dynamic
common block (which might be blank common) for the SEGLDR loader to place at the high end
of the process memory space.
This technique requires your heap to be a fixed size. Heap expansion is not allowed
because the dynamic common area is stored directly after the heap. Therefore, the initial
size of the heap must be large enough to handle all requests for heap space. Generally,
an initial heap size between 5,000 and 10,000 words is adequate.
Fortran codes that use this method typically overindex an array within a dynamic
common block, which requires careful tracking by the programmer to avoid an operand-range
error. You can expand and contract the dynamic common block by using the SBREAK library
routine. SBREAK expands the field length of the user process to provide more memory, and
it also releases memory when it receives a negative argument.
All subroutines within the same code have access to its dynamic common block at any
time during program execution. Its contents cannot be initialized at load or compile
time.
2.2 Identifying large amounts of memory wait time or system CPU time
The procstat(1) and procview(1) utilities can provide accurate
memory information about your code. The procstat utility gathers process-level
memory statistics, such as elapsed time, number of calls to memory processor, number of
memory declines, and total time to complete memory requests. The procview
utility allows you to view the statistics.
To create a procview report, execute the procstat utility with the
name of the program to be analyzed listed as an argument. The procstat output
also can be written in raw format to a file that is then processed by procview.
The procview command provides a line-mode (ASCII) interface or a graphical user
interface (GUI) for interactive use, and can also be used in a batch (noninteractive)
environment to generate reports.
Procedure 5: Creating a procview report
Use the following procedure to view a report of complete information for your code:
- Compile and run the code.
- Run the procstat utility on the code.
- Use procview to view the report.
The following examples show how to perform this procedure. When you enter
procview as shown in these examples, you will be using procview
interactively with the X Window System interface.
CF90 example:
f90 prog.f90
procstat -R ProcstatRawFile a.out
procview ProcstatRawFile
C++ example:
CC prog.C
procstat -R ProcstatRawFile a.out
procview ProcstatRawFile
Procedure 6: Interpreting the procview report
Use the following procedure to determine whether the code spends large amounts of system
CPU time processing memory requests:
- Within procview, create a procstat Process Report sorted by
maximum memory used, by selecting the following:
Reports => Processes => Maximum Memory Used (Long Format)
Figure 7 shows an example of the long format of the Maximum Memory Used report.
Figure 7. procstat Maximum Memory Used report
Compare the number in the Elapsed Time row against the number in the
Total Time to Complete Memory Requests row. If the number in the Total Time
to Complete Memory Requests row is a large percent of the number in the Elapsed
Time row, the code is spending excessive time managing memory and is considered a
memory-bound process. Although "large percent" is a subjective measure based on your
requirements, use 10% as a guideline. The example shown in Figure 7 has a percentage of 90%; therefore, the code is clearly memory bound.
If the number in the Total Time to Complete Memory Requests row is not a
large percent of the number in the Elapsed Time row, the code is not memory
bound. Return to the initial analysis phase to further examine the code for I/O and CPU
performance inefficiencies.
- Use the following steps to create a procview data plot. You will use this
information when you evaluate memory management alternatives in Section 2.3.
- Within procview, return to the main window. In the window pane, select the
graphs button, which produces a graphs menu.
- From the graphs menu, choose X/Y plot to produce an X/Y
plot menu.
- From the X/Y plot menu, choose Select X axis to produce a
Select X axis menu.
- From the Select X axis menu, select Wall Clock Time to determine
the statistic printed on the X axis of the graph. This selection returns you to the
X/Y plot menu.
- From the X/Y plot menu, choose Select Y axis to produce a
Select Y axis menu.
- From the Select Y axis menu, select Memory to determine the
statistic printed on the Y axis of the graph.
- Select Generate plot to produce the procview data plot.
Figure 8 shows a sample procview data plot. The procview data plot
shows the code's process memory size as it executes from start to finish. This plot will
show memory expansion and memory shrinkage during code execution. Each data point on the
plot represents a system call for the UNICOS operating system to complete a memory
request.
Figure 8. Sample procview data plot
- Evaluate and implement memory management alternatives by using the techniques shown
in Section 2.3.
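The comparison made in Step 1 of Procedure 6 can be sketched in a few lines of C++. This is only an illustration of the arithmetic: the function names are hypothetical, the two inputs are the Elapsed Time and Total Time to Complete Memory Requests values read by hand from the report, and the 10% default is the guideline given above rather than a fixed rule.

```cpp
#include <cassert>

// Fraction of elapsed time spent completing memory requests.
// Both arguments must be in the same unit (for example, seconds).
double memory_time_fraction(double elapsed_time, double memory_request_time) {
    assert(elapsed_time > 0.0);
    return memory_request_time / elapsed_time;
}

// Applies the 10% guideline from the text. The threshold is a parameter
// because "large percent" depends on your own requirements.
bool is_memory_bound(double elapsed_time, double memory_request_time,
                     double threshold = 0.10) {
    return memory_time_fraction(elapsed_time, memory_request_time) >= threshold;
}
```

For example, `is_memory_bound(100.0, 90.0)` corresponds to the 90% case described for Figure 7 and reports the code as memory bound.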
2.3 Evaluating dynamic memory alternatives and applying a technique
If your code is memory bound, it will probably exhibit one of the symptoms listed in
the following sections. Each section lists a symptom followed by a recommended technique
to reduce the elapsed time caused by memory requests. Evaluate the behavior of your code
to see if it matches one or more of these symptoms, and select one of the corresponding
techniques. After applying any optimization technique, return to Procedure 4, Step 4, to check your answers and examine the new elapsed time for the code.
Check the number in the Number of Calls to Memory Processor row in the
procstat Process Report that you created in Procedure 6, Step 1. Does the code
have a large number of system calls? If possible, reduce the number of system calls for
memory within the source code.
Check both the number in the Number of Calls to Memory Processor row in the
procstat Process Report and the pattern shown in the procview data
plot. Does the code make many system requests to expand or contract memory in small
increments? If so, the procview data plot will have a pattern similar to the
following figure:
If the procview data plot has such a shape, apply one of the following
techniques:
- Reduce the number of memory requests and increase the size of the requests within the
code.
- Ensure that the first system call requests sufficient memory for prolonged usage, and
minimize system calls to shrink the size of the process.
- Initialize a larger heap, as described in Section 2.4.
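The first two techniques can be illustrated at the source level with a small C++ sketch. It is an analogy, not a measurement of UNICOS behavior: it uses std::vector (an assumption, since the original does not name a container) and counts how often the storage has to grow. Whether a given growth step leads to a system call depends on the state of the heap at that moment. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how many times the vector's storage grows while n values are
// appended. If reserve_first is true, the full capacity is requested once,
// up front, instead of being obtained in many small increments.
std::size_t count_growth_events(std::size_t n, bool reserve_first) {
    assert(n > 0);
    std::vector<double> data;
    if (reserve_first) {
        data.reserve(n);  // one large request sized for prolonged usage
    }
    std::size_t growth_events = 0;
    std::size_t capacity = data.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        data.push_back(static_cast<double>(i));
        if (data.capacity() != capacity) {  // storage was enlarged
            ++growth_events;
            capacity = data.capacity();
        }
    }
    return growth_events;
}
```

With the up-front request, the count is zero; without it, the storage grows repeatedly as the data accumulates, which is the staircase pattern this symptom describes.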
Do you see frequent up and down movement in the procview data plot combined with
a large number of calls to the memory processor in the procstat Process Report?
If so, the procview data plot will have a pattern similar to the following
figure:
This condition is caused by any of the following situations:
- Alternate requests and releases of memory from the heap. Apply one of the following
techniques:
- Within your source code, attempt to reuse existing heap space instead of releasing
it. Use the Fortran 90 ALLOCATE and DEALLOCATE statements, the C++ new and delete
operators, or the malloc and free library functions; none of these requires the
compiler to generate a system call.
- In Fortran code, avoid using an SBREAK statement with a negative argument. In C++
code, avoid calling the sbreak function with a negative argument.
- Initialize a larger heap as described in Section 2.4.
- Frequent stack overflows and underflows (or stack thrashing). You can check for this
condition by using the Fortran STOP statement or the C++ stkstat(3) library routine to
produce a report of stack overflows. To avoid stack overflows, use a SEGLDR directive
to increase the value in the Initial stack size row of Figure 9 to the maximum stack
size displayed in the STOP statement output. To create the Load Map Program Statistics
report, see Section 2.4.2.
The following STOP statement output shows that the program experiences a large number
of stack overflows:
f90 where.f90
./a.out < INPUT_FILE
STOP executed at line 261 in Fortran routine 'CLACIER'
CP: 34.435s, Wallclock; 198.094s, 2.2% of 8-CPU Machine
HWM mem: 236775, HWM stack: 10048, Stack overflows: 750000
When the initial stack size is set to match the stack high-water mark (shown by the HWM
stack value), stack overflows drop to zero, and CPU time improves from 34.4 seconds to
13.5 seconds:
f90 -Wl"-S10048" where.f90
./a.out < INPUT_FILE
STOP executed at line 261 in Fortran routine 'CLACIER'
CP: 13.543s, Wallclock; 58.227s, 2.9% of 8-CPU Machine
HWM mem: 209338, HWM stack: 10048, Stack overflows: 0
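Reading the HWM stack value off the STOP output and turning it into a loader option can be sketched as follows. This is a minimal sketch that assumes the statistics line has exactly the layout shown in the sample output above; the helper name is hypothetical, and the -S option form is taken from the f90 -Wl"-S10048" example.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: parses a line such as
//   "HWM mem: 236775, HWM stack: 10048, Stack overflows: 750000"
// and returns "-S<HWM stack>" when overflows were reported.
// Returns an empty string if the line does not match the assumed layout
// or if no stack overflows occurred.
std::string suggest_stack_option(const std::string& stats_line) {
    long hwm_mem = 0;
    long hwm_stack = 0;
    long overflows = 0;
    const int fields = std::sscanf(
        stats_line.c_str(),
        " HWM mem: %ld, HWM stack: %ld, Stack overflows: %ld",
        &hwm_mem, &hwm_stack, &overflows);
    if (fields != 3 || overflows <= 0 || hwm_stack <= 0) {
        return std::string();
    }
    const std::string option = "-S" + std::to_string(hwm_stack);
    assert(option.size() > 2);  // always "-S" followed by at least one digit
    return option;
}
```

Applied to the statistics line from the first run above, the helper yields the same -S10048 value that is passed to the loader in the second run.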
Does the procview data plot for the code show a temporary memory expansion
(allocation) of a significant duration? If so, the procview data plot will have
a shape similar to the following figure:
When the process expands in such a manner during execution, it runs the risk of being
swapped to disk, costing excessive elapsed time. Use a SEGLDR directive (see Section 2.4) to initialize the process at its largest size to avoid a swap.
Examine the source code. Does the process release heap blocks in a different order than
they were allocated? This can cause memory fragmentation for the process. Reorder the
code so that heap blocks are de-allocated in the reverse order of their allocation (the
last block allocated should be the first de-allocated).
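One way to make the last-allocated, first-released discipline hard to violate in C++ is to keep the blocks on a stack-like container. The class below is a hypothetical sketch, not part of any Cray library; it assumes blocks of double-precision values and a C++14 compiler for std::make_unique.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical helper that owns heap blocks and only ever releases the
// most recently allocated one, so blocks are freed in reverse order.
class LifoBlocks {
public:
    // Allocates a block of `words` doubles and returns a pointer to it.
    double* allocate(std::size_t words) {
        assert(words > 0);
        blocks_.push_back(std::make_unique<double[]>(words));
        return blocks_.back().get();
    }

    // Releases the most recently allocated block.
    void release_last() {
        assert(!blocks_.empty());
        blocks_.pop_back();
    }

    std::size_t live_blocks() const { return blocks_.size(); }

    // Releases any remaining blocks, newest first.
    ~LifoBlocks() {
        while (!blocks_.empty()) {
            blocks_.pop_back();
        }
    }

private:
    std::vector<std::unique_ptr<double[]>> blocks_;
};
```

Because the only release operation removes the newest block, code written against this interface cannot free an older block while a newer one is still live.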
2.4 Memory initialization
When the code must obtain more memory during execution, it is more efficient to use a few
large requests than to use many small requests. One way to do this is to directly modify
source code to issue fewer, larger requests. Another way is to initialize the heap with a
SEGLDR directive. The most effective way to minimize the elapsed time due to memory
management is to start with enough memory in the first place.
At the load step, SEGLDR, which is usually called by both the f90(1) and CC(1) commands,
establishes initial memory allocation for both the heap and the stack space. You can
specify any of these parameters with a command-line option for segldr,
CC, or f90, as in the following examples.
Example 1: The following SEGLDR command specifies an initial heap size of
150,000 Cray words and a heap increment of 75,000 Cray words for the object file, file.o:
segldr -H150000+75000 file.o
Example 2: The following Cray C++ compiler command specifies an initial stack
size of 100,000 Cray words and stack increment of 50,000 Cray words for the source file,
file.C:
CC -dSTACK=100000+50000 file.C
Example 3: The following CF90 compiler command names a dynamic common block
(DYNAM), specifies an initial heap size of 10,000 Cray words, and establishes a zero heap
increment size for the source file, code.f90. The heap increment is set to zero to force
the fixed heap size that a dynamic common block requires.
f90 -Wl"-H10000+0;DYNAMIC=DYNAM" code.f90
An optimal heap size for your process is a size that is large enough to prevent system
calls for memory, but not so large that it contains unused memory.
Procedure 7: Determining optimal heap size
To determine an optimal initial heap size for your SEGLDR directive, perform the
following steps:
- Generate a ja report or retrieve the ja report you used in your initial analysis in
Procedure 1.
- Create a Load Map Program Statistics report for the code by using SEGLDR directives.
Example: The sample load map in Figure 9 was generated by using the following f90 command:
f90 -Wl"-H10000+9000 -S6000+5000 -Mmap.fil,stat" code.f90
This command line specifies the following:
- Load map statistics in a file called map.fil
- Initial heap size of 10 Kwords
- Heap increment of 9 Kwords
- Initial stack size of 6 Kwords
- Stack increment size of 5 Kwords
Figure 9 shows a snapshot of map.fil with some added notations.
- Look at the number in the Memory HiWater column of the ja report you created for the
code (Procedure 1). This number is reported in blocks, and a block is equal to 512
(decimal) Cray memory words. Multiply this figure by 512 to determine the maximum
process memory size for the code in Cray words.
Example: Assume the Memory HiWater figure from the ja report and the load map in
Figure 9 are from the same code, bigio. This user process grew to 8,128 blocks. Perform
the following arithmetic to obtain decimal words for Memory HiWater:
8,128 x 512 = 4,161,536 words.
- Look at the number listed in the Base address of managed memory/stack row of the Load
Map Program Statistics report (Figure 9). This number is an octal representation of the
user area size measured in Cray words. Convert this number to decimal to determine the
user area size for the code in decimal format.
Example: The Base address of managed memory/stack figure is 376,372 octal.
This is equivalent to 130,298 decimal.
- Subtract the user area size (total from the preceding step) from the maximum process
memory size (total from Step 3). The result should be a good estimate for the minimum
heap size required to avoid system calls for more memory.
Example: 4,161,536 - 130,298 = 4,031,238
This represents the largest heap size for the process. Note that this might change for
a different dataset, and it will expand with larger library I/O buffer and user cache sizes.
- Use the result from the preceding step in a SEGLDR directive to set the initial
heap size.
Example: To avoid system requests for memory, reload the object file with the
following segldr command:
segldr -H4035000+10000 bigio.o
Figure 9. Load map statistics on a CRAY Y-MP E series system
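The arithmetic in Procedure 7 can be checked with a short C++ sketch. The function and constant names are hypothetical. The inputs are the Memory HiWater value in blocks from the ja report and the octal Base address of managed memory/stack value from the load map, both read by hand from the reports.

```cpp
#include <cassert>
#include <string>

// One ja memory block is 512 (decimal) Cray words, as stated in the procedure.
constexpr long kWordsPerBlock = 512;

// Estimates the minimum initial heap size, in Cray words, needed to avoid
// system calls for more memory:
//   (Memory HiWater in blocks * 512) - (user area size, given in octal)
long min_heap_words(long hiwater_blocks, const std::string& base_address_octal) {
    const long process_words = hiwater_blocks * kWordsPerBlock;
    const long user_area_words = std::stol(base_address_octal, nullptr, 8);
    assert(process_words >= user_area_words);
    return process_words - user_area_words;
}
```

With the bigio values from the examples (8,128 blocks and a base address of 376372 octal), `min_heap_words(8128, "376372")` gives 4,031,238 words, which the final segldr command rounds up to 4,035,000.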