
4.0 Optimizing I/O-bound Code

This chapter describes specific techniques for optimizing I/O-bound code. To determine which of these techniques apply to your code, see Section 3.

4.1 Optimizing formatted I/O

Formatted I/O is the slowest I/O and is useful only when the files must be viewed by people or transferred to systems other than Cray Research systems. However, if you are transferring the data to a system other than a Cray Research system, you can easily send the unformatted (binary) version instead of the formatted ASCII version by using the Cray foreign file conversion facility provided by the Flexible File I/O (FFIO) library. Use the techniques described in the following sections to optimize code that contains formatted I/O.

4.1.1 Changing to unformatted I/O

If possible, change formatted I/O to unformatted I/O by using one of the following methods:
  • In Fortran code, remove references to the FORMAT statement label and modify the Fortran OPEN statement to include FORM='UNFORMATTED'.

    Example:

    
    OPEN (10,FORM='UNFORMATTED')
    
  • For C++ codes, cout << and cin >> are the formatted write and read operators of the iostream classes. Also, the scanf(3) and printf(3) function calls (including fscanf, sscanf, fprintf, and sprintf) convert data to or from human-readable ASCII. Convert these calls to unformatted I/O functions such as fread and fwrite instead. You can access FFIO by using the ffread(3) and ffwrite(3) functions in your code in conjunction with the UNICOS assign(1) command.
  • To access the I/O layers provided by the FFIO libraries, use the -F command-line option with the UNICOS assign command. This provides access to the automated foreign file conversion.
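
As a minimal sketch of the conversion described above, the following C++ functions write and read an array of values in binary form with fread and fwrite instead of fprintf and fscanf. The function names and the use of plain stdio (rather than ffread and ffwrite) are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: write n doubles to `path` in binary (unformatted) form.
// One fwrite call moves the whole array; no ASCII conversion takes place.
// Returns the number of elements written.
std::size_t write_unformatted(const char *path, const double *data, std::size_t n) {
    std::FILE *fp = std::fopen(path, "wb");
    if (fp == nullptr) return 0;
    std::size_t written = std::fwrite(data, sizeof(double), n, fp);
    std::fclose(fp);
    return written;
}

// Hypothetical helper: read up to n doubles from `path` in binary form.
// Returns the number of elements read.
std::size_t read_unformatted(const char *path, double *data, std::size_t n) {
    std::FILE *fp = std::fopen(path, "rb");
    if (fp == nullptr) return 0;
    std::size_t got = std::fread(data, sizeof(double), n, fp);
    std::fclose(fp);
    return got;
}
```

Because the values are never converted to text, they are read back bit for bit, which a formatted round trip does not guarantee unless enough digits are printed.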

4.1.2 Reducing the amount of formatted I/O

If you cannot change formatted to unformatted I/O, reduce the quantity of formatted I/O. Show only small samples of the data to the human viewer by using the following techniques:
  • Change the code to show final results instead of many intermediate results.
  • Change the code to show a checksum instead of the data itself.
  • If the program sends data to another computer system (or printer), revise the program so that only the final version of the data is formatted.
  • If you need to view the data, consider shipping it unformatted to a graphics postprocessor.
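
The checksum technique in the list above can be sketched as follows. The rotate-and-XOR checksum and both function names are assumptions chosen for brevity; any checksum that suits the application would serve equally well.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t), "sketch assumes 64-bit doubles");

// Hypothetical helper: fold the bit patterns of n doubles into one 64-bit value.
// This is a simple rotate-and-XOR checksum, not a cryptographic hash.
std::uint64_t checksum(const double *data, std::size_t n) {
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::uint64_t bits;
        std::memcpy(&bits, &data[i], sizeof bits);   // reinterpret safely
        sum = ((sum << 1) | (sum >> 63)) ^ bits;     // rotate left by 1, then XOR
    }
    return sum;
}

// Print one formatted line for the whole array instead of n formatted values.
void report_checksum(std::FILE *out, const double *data, std::size_t n) {
    std::fprintf(out, "checksum of %zu values: %016llx\n", n,
                 static_cast<unsigned long long>(checksum(data, n)));
}
```

A single formatted line per array is usually enough to confirm that two runs produced the same data.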

4.1.3 Increasing formatted I/O efficiency for Fortran programs

Use the methods in the following sections to increase formatted I/O efficiency for Fortran programs.

4.1.3.1 Minimizing the number of data items in the I/O list

With the CF90 compiler you can increase formatted I/O efficiency by minimizing the number of data items in the I/O list. Consider the following example:
DIMENSION X(20), Y(10), Z(5,30) 
WRITE (6,101) Y, (X(I), I=1,20), Z(M,J)

With vectorization turned off, this WRITE statement represents 22 data items. In this case, the WRITE operation would require 22 calls to the library routines that drive the WRITE statement. When vectorization is turned on, the compiler treats each innermost implicit DO loop as a single data item, so that the preceding WRITE statement requires only 3 calls.

If you rewrite the statement as follows, the I/O list always contains 3 data items and therefore requires only 3 calls, even if all optimization is turned off:

WRITE (6,101) Y, X, Z(M,J)

4.1.3.2 Using a single READ, WRITE, or PRINT statement

To increase formatted I/O efficiency for Fortran programs, read or write as much data as possible with a single READ, WRITE, or PRINT statement. Consider the following example:
    DO J = 1, M
    DO I = 1, N 
    WRITE (42, 100) X(I,J) 
100 FORMAT (E25.15) 
    ENDDO 
    ENDDO

It is more efficient to write the entire array with a single WRITE statement, as follows:

    WRITE (42, 100) ((X(I,J),I=1,N),J=1,M) 
100 FORMAT (E25.15)

The following statement is even more efficient:

    WRITE (42, 100) X 
100 FORMAT (E25.15)

Each of these three code fragments produces exactly the same output; however, the latter two examples are about twice as fast as the first. Also, the latter two examples are equivalent only if the implied DO loops write out the entire array, in order, and without omitting any items. You can use the format to control how much data is written per record.

4.1.3.3 Using longer records

To increase formatted I/O efficiency for Fortran programs, use longer records if possible. Because a certain amount of processing work is necessary to read or write each record, it is better to write fewer long records, rather than more short records. Consider the following example:
    WRITE (42, 100) X 
100 FORMAT (E25.15)

If you change it as follows, the resulting file will have 80% fewer records and, more importantly, the program will execute faster:


    WRITE (42, 101) X 
101 FORMAT (5E25.15)

Be careful to ensure that the resulting file does not contain records that are too long for the intended application. For example, certain text editors and utilities cannot process lines that are longer than a predetermined limit. Generally, lines that are not longer than 128 characters are safe to use in most applications.

4.1.3.4 Using repeated edit descriptors

To increase formatted I/O efficiency for Fortran programs, use repeated edit descriptors whenever possible. For integers that fit in 4 digits (that is, less than 10000 and greater than -1000), avoid the following format:
200 FORMAT (16(X,I4))

Instead, use a format of the following form:

201 FORMAT (16I5)

4.1.3.5 Using data edit descriptors that are the same width as the character data

To increase formatted I/O efficiency for Fortran programs, when reading and writing character data, use data edit descriptors that are the same width as the character data. For CHARACTER*n variables, the optimal data edit descriptor is A (or An). For Hollerith data in integer variables, the optimal data edit descriptor is A8 (or R8).

4.1.4 Increasing formatted I/O efficiency for C++ programs

Calling a function increases overhead. To decrease overhead and increase formatted I/O efficiency for C++ programs, combine multiple calls to I/O functions into fewer calls. Consider the following example:
for (i=0; i < N;) {
    fprintf(o1,"%d ",a[i]);
    ++i;
    if (i%5 == 0) fprintf(o1,"\n"); 
}

If you change it as follows, the resulting code will make 80% fewer calls to fprintf and, more importantly, the program will execute faster:

/* Assumes N is a multiple of 5. */
for (i=0; i < N; i+=5) {
   fprintf(o2, "%d %d %d %d %d\n",
      a[i], a[i+1], a[i+2], a[i+3], a[i+4]); 
}
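
The combined loop reads past the end of the array when N is not a multiple of 5. The following sketch, which uses a hypothetical function name, keeps one fprintf call per full line and handles any leftover items separately. It returns the number of fprintf calls so that the saving can be checked.

```cpp
#include <cstdio>

// Hypothetical helper: print n integers, five per line.
// Full lines cost one fprintf call each; leftover items cost one call each.
// Returns the number of fprintf calls made.
int print_five_per_line(std::FILE *out, const int *a, int n) {
    int calls = 0;
    int i = 0;
    for (; i + 5 <= n; i += 5) {
        std::fprintf(out, "%d %d %d %d %d\n",
                     a[i], a[i + 1], a[i + 2], a[i + 3], a[i + 4]);
        ++calls;
    }
    for (; i < n; ++i) {            // at most four leftover items
        if (i + 1 < n) {
            std::fprintf(out, "%d ", a[i]);
        } else {
            std::fprintf(out, "%d\n", a[i]);
        }
        ++calls;
    }
    return calls;
}
```

For 10 items this makes 2 calls instead of the 12 made by the item-at-a-time loop.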

4.1.5 Increasing library buffer sizes for formatted I/O requests

For sequential-access formatted I/O files, the buffer size should be set equal to the length of a record or a multiple of that number. Generally, larger is better when buffering sequential access files. To specify the library buffer size for Fortran, use the assign(1) command with the following options:
assign -b sz f:filename

For C++, use the setvbuf(3) library function.
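
A minimal sketch of the setvbuf approach follows. The BufferedFile structure and the function names are assumptions; the sketch supplies its own buffer because the C standard lets a library ignore the requested size when no buffer is passed. The buffer must stay alive, and the structure must not be copied, while the stream is open.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical wrapper: a stream together with the buffer it uses.
struct BufferedFile {
    std::vector<char> buffer;   // must outlive the open stream
    std::FILE *fp = nullptr;
};

// Open `path` with a fully buffered stream whose buffer holds a whole
// number of records. Returns true on success.
bool open_buffered(BufferedFile &f, const char *path, const char *mode,
                   std::size_t record_bytes, std::size_t records_per_buffer) {
    f.buffer.assign(record_bytes * records_per_buffer, 0);
    f.fp = std::fopen(path, mode);
    if (f.fp == nullptr) return false;
    // setvbuf must be called before any other operation on the stream.
    if (std::setvbuf(f.fp, f.buffer.data(), _IOFBF, f.buffer.size()) != 0) {
        std::fclose(f.fp);
        f.fp = nullptr;
        return false;
    }
    return true;
}

// Close the stream before the buffer is destroyed.
void close_buffered(BufferedFile &f) {
    if (f.fp != nullptr) {
        std::fclose(f.fp);
        f.fp = nullptr;
    }
}
```

For example, a record written with FORMAT (5E25.15) occupies 125 characters plus a newline, so a call such as open_buffered(f, "out.txt", "w", 126, 64) sets up a buffer that holds exactly 64 such records.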

4.2 Optimizing large, sequential, unformatted I/O requests

Sequential access indicates that data items in a file have an implicit order. Unless the code issues positioning requests such as fseek(3) or rewind(3), the system always accesses the next record automatically. If the code is issuing sequential, unformatted I/O requests larger than 1 Mword, use the techniques described in the following sections to optimize its I/O.

4.2.1 Changing I/O file format to unbuffered and unblocked

The default I/O file format for sequential unformatted Fortran I/O is COS blocked (assign -s cos f:filename), which means that the I/O request uses the library buffer and bypasses the system cache. Although the COS blocked file format helps fulfill the Fortran standard for sequential, unformatted I/O by marking (or blocking) record positions within a file, it is not the fastest I/O available for large, sequential transfers. COS blocked I/O requires user CPU time to create and insert (or interpret and remove) the control words.

If the code is issuing sequential, unformatted I/O requests larger than 1 Mword (8 Mbyte), change the I/O file format to unbuffered and unblocked by using the -s u option or by specifying the FFIO system or syscall layer, as shown in the following assign(1) command examples:

assign -s u f:filename
assign -F system f:filename
assign -F syscall f:filename

C++ codes can access the FFIO libraries by using the ffread(3) and ffwrite(3) I/O function calls in conjunction with the UNICOS assign command.

Using unbuffered, unblocked I/O file format requires you to construct well-formed I/O requests in the code. These are simply I/O requests that begin and end on disk sector boundaries, usually 512 words (4096 bytes) or a multiple of 512 words. This unit of measurement is also known as a UNICOS block or click. See your system administrator to determine the sector size of the disks you are using.
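
The arithmetic behind a well-formed request can be sketched as follows. The 512-word sector and 8-byte word are the typical values quoted above and should be treated as assumptions for any particular disk; the function names are hypothetical.

```cpp
#include <cstddef>

constexpr std::size_t kWordBytes = 8;      // assumed 64-bit word
constexpr std::size_t kSectorWords = 512;  // assumed sector size; check with your administrator

// A request is well-formed when it begins and ends on a sector boundary,
// that is, when both its starting offset and its length are multiples
// of the sector size.
constexpr bool is_well_formed(std::size_t offset_words, std::size_t length_words,
                              std::size_t sector_words = kSectorWords) {
    return length_words > 0 &&
           offset_words % sector_words == 0 &&
           length_words % sector_words == 0;
}

// Round a request length up to the next multiple of the sector size.
constexpr std::size_t round_up_to_sector(std::size_t words,
                                         std::size_t sector_words = kSectorWords) {
    return (words + sector_words - 1) / sector_words * sector_words;
}
```

For example, a 1,000,000-word array rounds up to 1,000,448 words (1954 sectors), so padding the array to that length keeps every sequential request aligned.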

4.2.2 Converting to asynchronous I/O

Converting to asynchronous I/O is a way to continue I/O activity in parallel with the code's CPU computation. If there are operations in the code that can be executed while the code is waiting for I/O to complete, convert the code to asynchronous I/O. For example, if the code contains any of the following sequences, converting to asynchronous I/O might reduce elapsed time:
  • Repetitive patterns of input, computation on that data, output, then input again
  • I/O that appears in a loop

Most prominent sequential, unformatted I/O requests that consume a majority of the code's elapsed time will benefit from code conversion to asynchronous I/O. You can convert to asynchronous I/O by using the assign(1) command or by modifying your source code.

4.2.2.1 Using the assign command to convert code to asynchronous I/O

The easiest way to convert code to asynchronous I/O is by using an FFIO layer, either cachea or bufa, with the assign(1) command, as follows:
assign -F cachea:bs:nbufs f:filename 
assign -F bufa:bs:nbufs f:filename

The bs argument specifies the size in 512-word blocks of each cache page or buffer. The nbufs argument specifies the number of cache pages or buffers to use. You can tune these arguments to better suit the I/O activities of the code.

If the code requires the use of COS blocked format, you can establish a specialized FFIO layer to provide asynchronous access by using the following UNICOS assign command:

assign -F cos.async f:filename

4.2.2.2 Modifying source code to convert code to asynchronous I/O

You can modify the source code to take better advantage of the asynchronous FFIO layer by breaking up a large I/O request into smaller iterative requests. Within the iterations, perform the necessary computation on that data. An example of this technique is called double-buffering.

With double-buffering, two sets of data (buffers) are active at any given moment for each stream of input or output data. One buffer is active in CPU work, while the other is active in I/O (reading or writing). In a typical double buffer scheme, the I/O and CPU work sets are staggered, as in the following algorithm:

  1. The first set of input data is read.

  2. The second set of input data is read while the CPU works on the first set of input data.

  3. The third set of input data is read while the CPU works on the second set of input data and the first set of data is output.

  4. This sequence continues until all data is read. As the last data set is read, the next-to-last CPU work is in progress, and the third-from-last data set is output.

  5. The CPU works on the last data set and the next-to-last data set is output.

  6. The final data set is output.
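
The staggering in the steps above can be made concrete with a small scheduling sketch. It performs no I/O; it only computes which data set is read, worked on, and written at each step. The structure and function names are hypothetical.

```cpp
#include <vector>

// One step of the staggered pipeline. A value of -1 means "nothing to do".
struct Step {
    int read_set;    // data set being read during this step
    int work_set;    // data set the CPU is working on
    int write_set;   // data set being written
};

// Build the schedule for n data sets, numbered 0 through n-1.
// At step k the code reads set k, works on set k-1, and writes set k-2,
// so the pipeline needs n + 2 steps to drain.
std::vector<Step> double_buffer_schedule(int n) {
    std::vector<Step> steps;
    if (n <= 0) return steps;
    for (int k = 0; k < n + 2; ++k) {
        Step s;
        s.read_set = (k < n) ? k : -1;
        s.work_set = (k >= 1 && k - 1 < n) ? k - 1 : -1;
        s.write_set = (k >= 2) ? k - 2 : -1;
        steps.push_back(s);
    }
    return steps;
}
```

With two input buffers and two output buffers, set k can use buffer k % 2 on each side, because the sets that are active on the same side at the same time always differ by one.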

The following Fortran 90 example implements double-buffering with BUFFERIN and BUFFEROUT, the older alternatives to the cachea and bufa FFIO layers. The first input is the BUFFERIN statement before the DO loop. Inside the loop, each BUFFERIN statement synchronizes the previous BUFFERIN statement, and each BUFFEROUT statement synchronizes the previous BUFFEROUT statement. This is called blocking asynchronous I/O, because each request to the same unit blocks execution until the previous request is complete. The last BUFFERIN statement is synchronized by the call to UNIT in the last iteration (I.EQ.M).


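PROGRAM DBUF 
! N is a multiple of 512 words, so each request is well-formed.
PARAMETER (N=1001472,M=1000) 
REAL A(N,0:1), B(N,0:1) 

! Use unbuffered, unblocked I/O on both units.
CALL ASNUNIT (10,'-s u',IERR) 
CALL ASNUNIT (11,'-s u',IERR) 
IRD=0 
! Start the first read before entering the loop.
BUFFERIN (10,0) (A(1,IRD),A(N,IRD)) 
DO 10 I=1,M 
! IWK is the buffer to work on; IRD is the buffer to read into next.
IWK=IRD 
IRD=MOD(IRD+1,2) 
IF (I.NE.M) BUFFERIN (10,0) (A(1,IRD),A(N,IRD)) 
! On the last pass, wait for the final read to complete.
IF (I.EQ.M) FERR=UNIT(10) 
CALL WORK(A(1,IWK),B(1,IWK)) 
BUFFEROUT (11,0) (B(1,IWK),B(N,IWK)) 
10 CONTINUE 
END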

4.2.2.3 Using effective library buffer sizes for large, sequential, unformatted I/O

For large, sequential, unformatted I/O requests, enlarge the program's library buffer to at least the size of its largest record, if possible. To specify the library buffer size for Fortran, use the assign(1) command with the following options:
assign -b sz f:filename

For C++, use the setvbuf(3) library function.

4.3 Optimizing small, sequential, unformatted I/O requests

If the code is issuing sequential, unformatted I/O requests that are 1 Mword or smaller, use the techniques described in the following sections to optimize I/O.

4.3.1 Using effective library buffer sizes for small, sequential, unformatted I/O requests

For small, sequential, unformatted I/O requests, use an effective library buffer size by ensuring that the library buffer is at least the size of the largest I/O request or a multiple of that size. For Fortran, use the assign -b sz command to specify the library buffer size. For C++, use the setvbuf library function.

4.3.2 Increasing I/O request size and issuing fewer requests

To optimize small, sequential, unformatted I/O requests, increase the size of the I/O requests and issue fewer requests. This helps to reduce the overhead of system and user CPU time and also may allow you to use the optimization techniques that apply to large I/O requests (see Section 4.2). You can use the following techniques to increase the size of the I/O requests:
  • Read or write larger array sections instead of one element at a time.
  • Combine read requests and write requests into a single read request or single write request.
  • Extract I/O from inner loops.
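
The first technique in the list above can be sketched in C++ as follows. Both hypothetical functions put the same bytes in the file, but the second issues one library call instead of n.

```cpp
#include <cstddef>
#include <cstdio>

// Element at a time: n calls to fwrite, each moving 8 bytes.
// Returns the number of elements written.
std::size_t write_elementwise(std::FILE *fp, const double *x, std::size_t n) {
    std::size_t written = 0;
    for (std::size_t i = 0; i < n; ++i) {
        written += std::fwrite(&x[i], sizeof(double), 1, fp);
    }
    return written;
}

// Whole array: a single call to fwrite moves all n elements.
// Returns the number of elements written.
std::size_t write_whole(std::FILE *fp, const double *x, std::size_t n) {
    return std::fwrite(x, sizeof(double), n, fp);
}
```

The same idea applies to reads: replacing a loop of single-element fread calls with one call for the whole array produces identical data with far less call overhead.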

4.3.3 Using the memory-resident (MR) FFIO layer

For small, sequential, unformatted I/O requests, if the file called by the code is heavily reused, the memory-resident (MR) layer in FFIO can improve performance over disk I/O by allowing the first portion of the file to reside in memory. For information on the MR layer, see Section 4.6.

4.4 Optimizing techniques for direct access I/O

Direct access indicates that a program can access records or data at any point in the file. This also can be called nonsequential or random access I/O.

4.4.1 CF90 direct access I/O

The Fortran 90 standard provides two types of access: sequential and direct. Sequential access restricts the program to reading from or writing to the I/O unit with records of any length in sequential order. Direct access divides the file associated with the I/O unit into fixed-length records, and allows the program to read or write records randomly. You can achieve Fortran direct access by opening a file with the ACCESS='DIRECT' specifier on the OPEN statement and specifying the fixed record size with the RECL specifier. All subsequent READ and WRITE statements that reference that file must specify the record number with the REC specifier.

CF90 direct access example

OPEN (22,ACCESS='DIRECT',RECL=8000) 
READ (22,REC=10) (DATA(I),I=1,1000) 
WRITE (22,REC=2) (OUTNUM(J),J=1,150)

4.4.2 C++ direct access I/O

C++ has no separate set of data transfer functions for random access. Instead, C++ programs use the fseek(3) function or the lseek(2) system call to set the position in the file of the next input or output operation. The position is set in bytes, beginning at zero. Thus, C++ programmers are completely responsible for record keeping and indexing.

C++ direct access example

stream = fopen ("file","r+"); 
 
bytes_per_word = 8; 
nwords = 1000; 
 
lrec = bytes_per_word*nwords; 
fseek (stream,9*lrec,SEEK_SET); 
fread (data,bytes_per_word,nwords,stream); 
fseek (stream,1*lrec,SEEK_SET); 
fwrite (outnum,bytes_per_word,150,stream); 
fseek (stream,  lrec - bytes_per_word*150,SEEK_CUR); 
fread (data,bytes_per_word,nwords,stream);
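
The record keeping that the example above does by hand can be wrapped in small helpers. The following sketch uses hypothetical function names and 1-based record numbers, as the Fortran REC specifier does, so that record rec begins at byte offset (rec - 1) * lrec.

```cpp
#include <cstddef>
#include <cstdio>

// Position the stream at the start of 1-based record `rec`,
// where every record is `lrec` bytes long.
bool seek_record(std::FILE *fp, long rec, long lrec) {
    if (rec < 1 || lrec < 1) return false;
    return std::fseek(fp, (rec - 1) * lrec, SEEK_SET) == 0;
}

// Read the first `nbytes` bytes of record `rec` into `buf`.
bool read_record(std::FILE *fp, long rec, long lrec, void *buf, std::size_t nbytes) {
    if (lrec < 1 || nbytes > static_cast<std::size_t>(lrec)) return false;
    return seek_record(fp, rec, lrec) && std::fread(buf, 1, nbytes, fp) == nbytes;
}

// Write `nbytes` bytes from `buf` at the start of record `rec`.
bool write_record(std::FILE *fp, long rec, long lrec, const void *buf, std::size_t nbytes) {
    if (lrec < 1 || nbytes > static_cast<std::size_t>(lrec)) return false;
    return seek_record(fp, rec, lrec) && std::fwrite(buf, 1, nbytes, fp) == nbytes;
}
```

Because each helper seeks before it transfers data, a read may follow a write on the same update stream without an extra positioning call.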

4.4.3 Optimizing techniques for direct access code

If the program is reading or writing files in direct access (as opposed to sequential access) you may be able to improve performance by using the following techniques:

  • Ensure that the files are in binary file format and that they bypass the system cache by using the assign -s bin command.

  • Ensure that the code is not using formatted or COS blocked file formats.

  • Set the library buffer size as close to the length of a record (request) as possible without going under the length. This minimizes unnecessary data transfers, which are not useful for random I/O.

  • For small, random I/O requests, use a smaller library buffer than the default. Limit its size to the record length of the code. This might improve performance by avoiding excessive unused data movement when filling the unused portion of the buffer.

  • For large I/O requests, the library buffer size should be set equal to the length of the fixed-size record (request). To specify the library buffer size for Fortran, use the assign -b sz command. For C++, use the setvbuf(3) library function.

  • If the code makes repeated references to the same place in the data file, a memory-resident (MR) buffer might help if it can include the most frequently used area of data. For information on the MR layer, see Section 4.6.

  • If the code uses word-addressable data, you can transfer the data faster with a binary file format (using the assign -s bin command), which also bypasses the system cache and forces use of the GETWA and PUTWA I/O routines without changing the source code. The GETWA and PUTWA I/O routines are among the fastest types of random-access I/O on Cray PVP systems, but they place the burden of record keeping and indexing on you.

  • Rearrange the data file so that the code can process it sequentially. Sequential I/O is usually faster than direct-access I/O. You might be able to use separate files to accomplish the same effect.

4.5 Optimizing asynchronous I/O requests

In most code, synchronous I/O is used more often than asynchronous I/O (also known as raw I/O). Synchronous I/O indicates that control is returned to the calling program only after all requested data is transferred. The I/O transfer runs serially with respect to the CPU work.

Asynchronous I/O indicates that control is returned to the calling program after the I/O process has started, but before the I/O is completed. The I/O transfer runs in parallel with respect to the CPU work. The user program continues executing at the same time the I/O operation is executing.

If the code is using asynchronous I/O, use the techniques described in the following sections. Some of these methods increase CPU overhead but decrease total elapsed time if there is significant work to do during the I/O transfer.

4.5.1 Using unblocked file format for asynchronous I/O requests

To optimize asynchronous I/O requests, use unblocked file format if the code does not need to backspace, position the file pointer, read partial records, and so on. You can improve asynchronous I/O performance moderately by eliminating the overhead associated with record marking, or blocking. This can be done in several ways, depending on the type of I/O and certain other characteristics.

For example, the following assign statements specify the unblocked file structure:

assign -s unblocked f:filename 
assign -s u f:filename 
assign -s bin f:filename

4.5.2 Avoiding the system cache

For asynchronous I/O, avoid using the system cache by using the assign -s u command. This allows the data to transfer directly between the user process and the actual device without a stopover (with synchronization) in system cache.

4.5.3 Using effective library buffer sizes for asynchronous I/O requests

If the program is using the default I/O file format for sequential unformatted Fortran I/O, which is COS blocked (with the assign -F cos command), to optimize asynchronous I/O requests, ensure that the largest record size is less than or equal to half the library buffer size. COS blocked I/O file format indicates that the I/O request uses the library buffer and bypasses the system cache.

Setting the library buffer size to an even number greater than 63 blocks causes COS blocked files to perform double-buffered asynchronous I/O by dividing the library buffer in half. When the library buffer size is an even number of disk sectors, each half of the buffer is well-formed. Thus, I/O requests for either half-buffer do not need to be rerouted through the system cache.

You can change the buffer size by using the SEGLDR directive SET, as follows:

SET=_def_cos_thrsh:size

You can also change the buffer size by using the assign(1) command to specify a special FFIO layer, as follows:

assign -F cos.async:size f:filename
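
The two sizing rules in this section (the largest record must fit in half the buffer, and the buffer must be an even number of blocks greater than 63) can be combined into one calculation. The following sketch assumes that the size is expressed in 512-word blocks; the function name is hypothetical.

```cpp
#include <cstddef>

const std::size_t kBlockWords = 512;   // assumed block size in words

// Smallest buffer size, in blocks, that is even, greater than 63, and
// large enough that the largest record fits in half of the buffer.
std::size_t cos_async_buffer_blocks(std::size_t largest_record_words) {
    // Blocks needed to hold one record, rounded up.
    std::size_t half = (largest_record_words + kBlockWords - 1) / kBlockWords;
    // Doubling guarantees an even count whose half holds the record.
    std::size_t blocks = 2 * half;
    return blocks < 64 ? 64 : blocks;
}
```

For example, a largest record of 100,000 words needs 196 blocks per half-buffer, so the sketch returns 392 blocks.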

4.5.4 Balancing workload

Device I/O speeds are typically slower than CPU computation speeds by several orders of magnitude. If the code does not perform sufficient computation between I/O requests, it will spend most of its time waiting on I/O and lose the benefit of using asynchronous I/O. Try to balance both the I/O activity and the computation involving its data by moving as much of the CPU work as possible into the code that lies between asynchronous I/O requests.

4.5.5 Minimizing required synchronization

During asynchronous I/O processing, code reaches a synchronization point at which it has to wait for I/O completion before continuing. With an imbalance between CPU and I/O activity, this causes extended I/O wait time and an idle CPU. If this happens frequently, attempt to restructure the code to reduce required synchronization points.

4.5.6 Tuning the FFIO user cache

If you are using asynchronous I/O through the cachea, bufa, or cos.async FFIO layers, you can adjust their sizes by using the UNICOS assign(1) command. For complete information on controlling buffers and cache pages, see the Application Programmer's I/O Guide, publication SG-2168.

4.6 Using an optimal storage device

If you have some flexibility with the storage devices your code uses, ensure that it uses the fastest devices available for the appropriate situations. The following sections describe storage devices and the situations in which they are best used.

4.6.1 Memory-resident files

Use memory-resident files for small requests, heavily reused files, or for large files in which most of the I/O activity occurs at the beginning of the file. The assign(1) command provides an option to declare certain files to be memory resident. This option causes these files to reside within the field length of the user's process; its use can result in very fast access times.

4.6.2 Memory-resident predefined file systems

Large memory systems might have predefined file systems resident in memory. Memory resident file systems provide memory-to-memory speed, which is the fastest I/O available on Cray PVP systems. Your system administrator can tell you which file systems are mounted in memory, and you might have access to create data files in those directories.

4.6.3 SSD

The SSD solid-state storage device is the fastest external I/O device on Cray Research computer systems, although it is an optional device that is not available on all Cray PVP systems. The SSD stores data in memory chips and operates at speeds about as fast as main memory, or 10 to 50 times faster than magnetic disks.

4.6.4 Disk striping

If your file system is composed of partitions on more than one disk, using the disks at the same time can result in performance improvements. This technique is called disk striping. Disk striping can be accomplished through either hardware or software.

4.6.5 Disk arrays

Using disk arrays (for example, DA-60, DA-301, and so on) can be faster than single disk drives such as DD-60, DD-42, DD-301, and so on.

4.6.6 Disks

If possible, use disks only for files that are accessed one or two times or for saved files that are read at a later time. Try to use memory or SSD for most other activity.

4.6.7 Tapes

Consider tape to be a long-term storage device. Tape is both cost-effective and disaster-resistant. Before selecting tape, consider that it has slower access speed and that there is contention for the drive and delays for mounting. However, tape is appropriate for long-term archive storage of very large data files.

4.7 Minimizing system calls

With few exceptions, system calls are required for all physical I/O requests and data movement to or from the library buffer. The following options minimize system calls:

  • Ensure that I/O requests are as large as possible. For example, write whole arrays rather than one row at a time. Group multiple arrays into one write statement.

  • Use larger buffers (or user cache) to capture many I/O requests in the user process space before the I/O library transfers the data out.

  • Use scratch files for intermediate data that you no longer need after the code completes execution. This can eliminate unnecessary data movement and might avoid the device entirely.

  • Scratch files are temporary and are deleted when they are closed. To create a Fortran scratch file, specify STATUS='SCRATCH' on the OPEN statement; to delete an ordinary file when you are finished with it, specify STATUS='DELETE' on the CLOSE statement.

  • Use the MR layer when appropriate (see Section 4.6).

  • Use SDS if available (see Section 4.6).
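
For C++ codes, the standard tmpfile function offers a rough counterpart to a Fortran scratch file: it opens a temporary binary file for update, and the file is removed automatically when it is closed. The following sketch, with a hypothetical function name, stages intermediate data in such a file and reads it back.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: write n doubles to a scratch file, rewind, and read
// them back into `out`. Returns true if both transfers were complete.
bool scratch_round_trip(const double *in, double *out, std::size_t n) {
    std::FILE *scratch = std::tmpfile();   // opened for binary update
    if (scratch == nullptr) return false;
    bool ok = std::fwrite(in, sizeof(double), n, scratch) == n;
    std::rewind(scratch);                  // reposition between write and read
    ok = ok && std::fread(out, sizeof(double), n, scratch) == n;
    std::fclose(scratch);                  // closing deletes the file
    return ok;
}
```

Whether the data ever reaches a physical device depends on the buffer size and on the file system that holds the temporary file.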


Contact webmaster@asc.edu with questions or comments regarding this page.
Last updated Sept. 30, 1999 -- (c)1999 Alabama Supercomputer Authority