4.0 Optimizing I/O-bound Code
This chapter describes specific techniques for optimizing I/O-bound code. To determine which of these techniques to use for I/O-bound code, see Section 3. The following techniques are described in this chapter:
Formatted I/O is the slowest I/O and is useful only when the files must be viewed by
people or transferred to systems other than Cray Research systems. However, if you are
transferring the data to a system other than a Cray Research system, you can easily send
the unformatted (binary) version instead of the formatted ASCII version by using the Cray
foreign file conversion facility provided by the Flexible File I/O (FFIO) library. Use
the techniques described in the following sections to optimize code that contains
formatted I/O.
4.1.2 Reducing the amount of formatted I/O
If you cannot change formatted to unformatted I/O, reduce the quantity of formatted I/O. Show only small samples of the data to the human viewer by using the following techniques:
4.1.3 Increasing formatted I/O efficiency for Fortran programs
Use the methods in the following sections to increase formatted I/O efficiency for Fortran programs.
4.1.3.1 Minimizing the number of data items in the I/O list
With the CF90 compiler you can increase formatted I/O efficiency by minimizing the number of data items in the I/O list. Consider the following example:

      DIMENSION X(20), Y(10), Z(5,30)
      WRITE (6,101) Y, (X(I), I=1,20), Z(M,J)

With vectorization turned off, this WRITE statement represents 22 data items. In this case, the WRITE operation would require 22 calls to the library routines that drive the WRITE statement. When vectorization is turned on, the compiler treats each innermost implicit DO loop as a single data item, so that the preceding WRITE statement requires only 3 calls. If you rewrite the statement as follows, the parameter list always represents 3 calls, even if all optimization is turned off:

      WRITE (6,101) Y, X, Z(M,J)
4.1.3.2 Using a single READ, WRITE, or PRINT statement
To increase formatted I/O efficiency for Fortran programs, read or write as much data as possible with a single READ, WRITE, or PRINT statement. Consider the following example:
It is more efficient to write the entire array with a single WRITE statement, as follows:
The following statement is even more efficient:
Each of these three code fragments produces exactly the same output; however, the latter two examples are about twice as fast as the first. Note that the latter two examples are equivalent only if the implied DO loops write out the entire array, in order, and without omitting any items. You can use the format to control how much data is written per record.
4.1.3.3 Using longer records
To increase formatted I/O efficiency for Fortran programs, use longer records if possible. Because a certain amount of processing work is necessary to read or write each record, it is better to write fewer long records, rather than more short records. Consider the following example:
If you change it as follows, the resulting file will have 80% fewer records and, more importantly, the program will execute faster:

      WRITE (42, 101) X
  101 FORMAT (5E25.15)

Be careful to ensure that the resulting file does not contain records that are too long for the intended application. For example, certain text editors and utilities cannot process lines that are longer than a predetermined limit. Generally, lines that are not longer than 128 characters are safe to use in most applications.
4.1.3.4 Using repeated edit descriptors
To increase formatted I/O efficiency for Fortran programs, use repeated edit descriptors whenever possible. For integers that fit in 4 digits (that is, less than 10000 and greater than -1000), avoid the following format:

  200 FORMAT (16(X,I4))

Instead, use a format of the following form:

  201 FORMAT (16I5)
4.1.3.5 Using data edit descriptors that are the same width as the character data
To increase formatted I/O efficiency for Fortran programs, when reading and writing character data, use data edit descriptors that are the same width as the character data. For CHARACTER*n variables, the optimal data edit descriptor is A (or An). For Hollerith data in integer variables, the optimal data edit descriptor is A8 (or R8).
If you change it as follows, the resulting code will make 80% fewer calls to fprintf and, more importantly, the program will execute faster:
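As a minimal sketch of this kind of change, the two C++ functions below write the same values in the same layout, first with one fprintf call per value and then with one call per line of five values, which is an 80% reduction in calls. The function names, the five-values-per-line grouping, and the %25.15e format are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical baseline: one fprintf call per value, five values per line.
// Returns the number of fprintf calls made.
std::size_t write_one_per_call(std::FILE* out, const double* x, std::size_t n) {
    std::size_t calls = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // End the line after every fifth value.
        std::fprintf(out, "%25.15e%s", x[i], (i % 5 == 4) ? "\n" : "");
        ++calls;
    }
    return calls;
}

// Hypothetical batched version: one fprintf call per line of five values.
// Assumes n is a multiple of 5 to keep the sketch short.
std::size_t write_five_per_call(std::FILE* out, const double* x, std::size_t n) {
    assert(n % 5 == 0);
    std::size_t calls = 0;
    for (std::size_t i = 0; i < n; i += 5) {
        std::fprintf(out, "%25.15e%25.15e%25.15e%25.15e%25.15e\n",
                     x[i], x[i + 1], x[i + 2], x[i + 3], x[i + 4]);
        ++calls;
    }
    return calls;
}
```

For an array of 1000 values, the first function makes 1000 library calls and the second makes 200, while the bytes written are identical.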
4.1.5 Increasing library buffer sizes for formatted I/O requests
For sequential-access formatted I/O files, the buffer size should be set equal to the length of a record or a multiple of that number. Generally, larger is better when buffering sequential access files. To specify the library buffer size for Fortran, use the assign(1) command with the following options:

      assign -b sz

For C++, use the setvbuf(3) library function.
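The following C++ sketch shows one way to apply setvbuf so that the library buffer is a whole multiple of the record length. The function name write_records, its parameters, and the placeholder record contents are hypothetical; only setvbuf and the other standard C library calls are real.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper: writes `count` fixed-length text records to `path`
// through a library buffer that holds exactly `records_per_buffer` records.
bool write_records(const char* path, std::size_t record_len,
                   std::size_t records_per_buffer, std::size_t count) {
    assert(record_len > 0 && records_per_buffer > 0);
    std::FILE* out = std::fopen(path, "w");
    if (!out) return false;

    // Buffer size is a multiple of the record length.
    std::vector<char> buffer(record_len * records_per_buffer);
    // setvbuf must be called after the open and before any other I/O
    // on the stream; _IOFBF requests full buffering.
    bool ok = std::setvbuf(out, buffer.data(), _IOFBF, buffer.size()) == 0;

    // Placeholder record: dots terminated by a newline.
    std::vector<char> record(record_len, '.');
    record.back() = '\n';
    for (std::size_t i = 0; ok && i < count; ++i) {
        ok = std::fwrite(record.data(), 1, record_len, out) == record_len;
    }

    // Close (and flush) the stream while `buffer` is still alive.
    ok = (std::fclose(out) == 0) && ok;
    return ok;
}
```

The buffer is owned by the function and the stream is closed before the buffer is destroyed, because the stream must not outlive a buffer passed to setvbuf.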
Sequential access indicates that data items in a file have an implicit order. Unless
the code issues positioning requests such as fseek(3) or rewind(3), the system always
accesses the next record automatically. If the code is issuing sequential, unformatted
I/O requests larger than 1 Mword, use the techniques described in the following sections
to optimize its I/O.
If the code is issuing sequential, unformatted I/O requests larger than 1 Mword (8 Mbyte), change the I/O file format to unbuffered and unblocked by using the -s u option, or by specifying the FFIO system (or syscall) layer, as shown in the following assign(1) command examples:

      assign -s u f:filename

C++ codes can access the FFIO libraries by using the ffread(3) and ffwrite(3) I/O function calls in conjunction with the UNICOS assign command.
Using the unbuffered, unblocked I/O file format requires you to construct well-formed I/O requests in the code. A well-formed I/O request is one that begins and ends on disk sector boundaries; the sector size is usually 512 words (4096 bytes) or a multiple of 512 words. This unit of measurement is also known as a UNICOS block or click. See your system administrator to determine the sector size of the disks you are using.
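The arithmetic behind a well-formed request can be sketched in C++ as follows. The helper names are hypothetical, and the 512-word default is the usual value quoted above, not a value queried from the system; substitute the sector size your administrator reports.

```cpp
#include <cassert>
#include <cstddef>

// Assumed default sector size in 64-bit words (512 words = 4096 bytes).
const std::size_t kWordsPerSector = 512;

// Rounds a request length (in words) up to the next sector multiple.
std::size_t round_up_to_sector(std::size_t words,
                               std::size_t sector_words = kWordsPerSector) {
    assert(sector_words > 0);
    return ((words + sector_words - 1) / sector_words) * sector_words;
}

// True if a request starting at `offset_words` and spanning `length_words`
// both begins and ends on a sector boundary.
bool is_well_formed(std::size_t offset_words, std::size_t length_words,
                    std::size_t sector_words = kWordsPerSector) {
    assert(sector_words > 0);
    return length_words > 0 &&
           offset_words % sector_words == 0 &&
           length_words % sector_words == 0;
}
```

For example, a 1,000,000-word request is not well-formed with 512-word sectors; rounding it up gives 1,000,448 words (1954 sectors), which is.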
4.2.2 Converting to asynchronous I/O
Converting to asynchronous I/O is a way to continue I/O activity in parallel with the code's CPU computation. If there are operations in the code that can be executed while the code is waiting for I/O to complete, convert the code to asynchronous I/O. For example, if the code contains any of the following sequences, converting to asynchronous I/O might reduce elapsed time:
Most prominent sequential, unformatted I/O requests that consume a majority of the code's elapsed time will benefit from code conversion to asynchronous I/O. You can convert to asynchronous I/O by using the assign(1) command or by modifying your source code.
4.2.2.1 Using the assign command to convert code to asynchronous I/O
The easiest way to convert code to asynchronous I/O is by using an FFIO layer, either cachea or bufa, with the assign(1) command, as follows:

      assign -F cachea:bs:nbufs f:filename
      assign -F bufa:bs:nbufs f:filename

The bs argument specifies the size in 512-word blocks of each cache page or buffer. The nbufs argument specifies the number of cache pages or buffers to use. You can tune these arguments to better suit the I/O activities of the code. If the code requires the use of COS blocked format, you can establish a specialized FFIO layer to provide asynchronous access by using the following UNICOS assign command:

      assign -F cos.async f:filename
4.2.2.2 Modifying source code to convert code to asynchronous I/O
You can modify the source code to take better advantage of the asynchronous FFIO layer by breaking up a large I/O request into smaller iterative requests. Within the iterations, perform the necessary computation on that data. An example of this technique is called double-buffering.
With double-buffering, two sets of data (buffers) are active at any given moment for each stream of input or output data. One buffer is active in CPU work, while the other is active in I/O (reading or writing). In a typical double buffer scheme, the I/O and CPU work sets are staggered, as in the following algorithm:
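The staggered scheme can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not the manual's algorithm listing: it uses std::async from the C++ standard library in place of an FFIO layer, and the names process_double_buffered, read_chunk, write_chunk, and work are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <future>
#include <vector>

// Hypothetical stand-in for the CPU work done on each buffer.
static void work(const std::vector<double>& in, std::size_t n,
                 std::vector<double>* out) {
    for (std::size_t i = 0; i < n; ++i) (*out)[i] = 2.0 * in[i];
}

// Reads up to buf->size() values; returns the number actually read.
static std::size_t read_chunk(std::FILE* in, std::vector<double>* buf) {
    return std::fread(buf->data(), sizeof(double), buf->size(), in);
}

// Writes the first n values of buf; returns the number actually written.
static std::size_t write_chunk(std::FILE* out, const std::vector<double>* buf,
                               std::size_t n) {
    return std::fwrite(buf->data(), sizeof(double), n, out);
}

// Processes `in` to `out` in chunks with two input and two output buffers.
// Returns the total number of values written.
std::size_t process_double_buffered(std::FILE* in, std::FILE* out,
                                    std::size_t chunk) {
    assert(chunk > 0);
    std::vector<std::vector<double>> a(2, std::vector<double>(chunk));
    std::vector<std::vector<double>> b(2, std::vector<double>(chunk));

    int rd = 0;  // index of the buffer currently being read into
    // Start the first read before entering the loop.
    std::future<std::size_t> reading =
        std::async(std::launch::async, read_chunk, in, &a[rd]);
    std::future<std::size_t> writing;  // no write in flight yet
    std::size_t total = 0;

    for (;;) {
        const int wk = rd;                   // buffer to work on next
        const std::size_t n = reading.get(); // wait for its read to finish
        if (n == 0) break;                   // end of input
        rd = 1 - rd;
        // Start reading the other buffer while this one is processed.
        reading = std::async(std::launch::async, read_chunk, in, &a[rd]);
        work(a[wk], n, &b[wk]);
        // Wait for the previous write, then start writing this result.
        if (writing.valid()) total += writing.get();
        writing = std::async(std::launch::async, write_chunk, out, &b[wk], n);
    }
    if (writing.valid()) total += writing.get();
    return total;
}
```

At most one read and one write are in flight at a time, so each buffer is touched either by I/O or by the CPU work, never by both at once.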
The following Fortran 90 example implements double-buffering with BUFFERIN and BUFFEROUT, the older alternatives to the cachea and bufa FFIO layers. The first input is started by the BUFFERIN statement before the DO loop. Inside the loop, each BUFFERIN statement synchronizes the previous BUFFERIN statement, and each BUFFEROUT statement synchronizes the previous BUFFEROUT statement. This is called blocking asynchronous I/O, because each request to the same unit blocks execution until the previous request is complete. The last BUFFERIN statement is synchronized by the call to UNIT in the last iteration (I.EQ.M).

      PROGRAM DBUF
      PARAMETER (N=1001472,M=1000)
      REAL A(N,0:1), B(N,0:1)
!     Make units 10 and 11 unbuffered and unblocked.
      CALL ASNUNIT (10,'-s u',IERR)
      CALL ASNUNIT (11,'-s u',IERR)
!     IRD selects the buffer being read; IWK the buffer being worked on.
      IRD=0
!     Start the first read before entering the loop.
      BUFFERIN (10,0) (A(1,IRD),A(N,IRD))
      DO 10 I=1,M
        IWK=IRD
        IRD=MOD(IRD+1,2)
!       Start the next read, or wait for the last one to complete.
        IF (I.NE.M) BUFFERIN (10,0) (A(1,IRD),A(N,IRD))
        IF (I.EQ.M) FERR=UNIT(10)
        CALL WORK(A(1,IWK),B(1,IWK))
        BUFFEROUT (11,0) (B(1,IWK),B(N,IWK))
   10 CONTINUE
      END

4.2.2.3 Using effective library buffer sizes for large, sequential, unformatted I/O
For large, sequential, unformatted I/O requests, enlarge the program's library buffer to at least the size of its largest record, if possible. To specify the library buffer size for Fortran, use the assign(1) command with the following options:

      assign -b sz f:filename

For C++, use the setvbuf(3) library function.
If the code is issuing sequential, unformatted I/O requests that are 1 Mword or
smaller, use the techniques described in the following sections to optimize I/O.
4.3.3 Using the memory-resident (MR) FFIO layer
For small, sequential, unformatted I/O requests, if the file called by the code is heavily reused, the memory-resident (MR) layer in FFIO can improve performance over disk I/O by allowing the first portion of the file to reside in memory. For information on the MR layer, see Section 4.6.
Direct access indicates that a program can access records or data at any point in the
file. This also can be called nonsequential or random access I/O.
CF90 direct access example

      OPEN (22,ACCESS='DIRECT',RECL=8000)
      READ (22,REC=10) (DATA(I),I=1,1000)
      WRITE (22,REC=2) (OUTNUM(J),J=1,150)
4.4.2 C++ direct access I/O
C++ programs do not use the I/O functions that transfer data to accomplish random access. C++ programs use the fseek(3) function or the lseek(2) system call to set the position in the file of the next input or output operation. The position is set in bytes, beginning at zero. Thus, C++ programmers are completely responsible for record keeping and indexing.
C++ direct access example
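A minimal sketch of this bookkeeping, assuming fixed-length records: the program computes each byte offset itself and positions the stream with fseek before every transfer. The Record layout and the helper names are hypothetical, and records are numbered from 1 here only to mirror the Fortran REC= convention.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical fixed-length record layout.
struct Record {
    long id;
    double value;
};

// Byte offset of record `rec` (1-based); file positions start at zero.
static long record_offset(long rec) {
    assert(rec >= 1);
    return (rec - 1) * static_cast<long>(sizeof(Record));
}

// Writes record number `rec`; returns true on success.
bool write_record(std::FILE* f, long rec, const Record& r) {
    if (std::fseek(f, record_offset(rec), SEEK_SET) != 0) return false;
    return std::fwrite(&r, sizeof(Record), 1, f) == 1;
}

// Reads record number `rec` into *r; returns false if it cannot be read,
// for example when the record lies beyond the end of the file.
bool read_record(std::FILE* f, long rec, Record* r) {
    if (std::fseek(f, record_offset(rec), SEEK_SET) != 0) return false;
    return std::fread(r, sizeof(Record), 1, f) == 1;
}
```

The stream must be opened for binary update (for example with mode "wb+" or "rb+") so that reads and writes can be mixed; the fseek call before each transfer also satisfies the C library rule that a positioning call must separate a write from a following read on the same stream.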
4.4.3 Optimizing techniques for direct access code
If the program is reading or writing files in direct access (as opposed to sequential access) you may be able to improve performance by using the following techniques:
In most code, synchronous I/O is used more often than asynchronous I/O (also known as raw I/O). With synchronous I/O, control is returned to the calling program only after all requested data has been transferred, so the I/O transfer runs serially with respect to the CPU work. With asynchronous I/O, control is returned to the calling program after the I/O process has started but before the I/O is completed, so the I/O transfer runs in parallel with the CPU work: the user program continues executing while the I/O operation is in progress. If the code is using asynchronous I/O, use the techniques described in the following sections. Some of these methods increase CPU overhead but decrease total elapsed time if there is significant work to do during the I/O transfer.
For example, the following assign statements specify the unblocked file structure:

      assign -s unblocked f:filename
      assign -s u f:filename
      assign -s bin f:filename
4.5.2 Avoiding the system cache
For asynchronous I/O, avoid using the system cache by using the assign -s u command. This allows the data to transfer directly between the user process and the actual device without a stopover (with synchronization) in system cache.
4.5.3 Using effective library buffer sizes for asynchronous I/O requests
If the program is using the default I/O file format for sequential unformatted Fortran I/O, which is COS blocked (with the assign -F cos command), to optimize asynchronous I/O requests, ensure that the largest record size is less than or equal to half the library buffer size. COS blocked I/O file format indicates that the I/O request uses the library buffer and bypasses the system cache.
Setting the library buffer size to an even number greater than 63 blocks causes COS blocked files to perform double-buffered asynchronous I/O by dividing the library buffer in half. When the library buffer size is an even number of disk sectors, each half of the buffer is well-formed. Thus, I/O requests for either half-buffer do not need to be rerouted through the system cache.
You can change the buffer size by using the SEGLDR directive SET, as follows:

      SET=_def_cos_thrsh:size

You can also change the buffer size by using the assign(1) command to specify a special FFIO layer, as follows:

      assign -F cos.async:size f:filename
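The two sizing rules above (an even number of blocks greater than 63, and a largest record no bigger than half the buffer) can be combined into a small calculation. The C++ sketch below is only that arithmetic; the function names are hypothetical, and it assumes 512-word blocks and that the size is expressed in blocks.

```cpp
#include <cassert>
#include <cstddef>

// Assumed block size in words (one 512-word disk sector).
const std::size_t kWordsPerBlock = 512;

// True if a buffer of `blocks` blocks meets the rule for double-buffered
// asynchronous I/O: an even number greater than 63.
bool cos_double_buffers(std::size_t blocks) {
    return blocks > 63 && blocks % 2 == 0;
}

// Smallest buffer size, in blocks, that satisfies the rule above and whose
// half-buffer can hold a record of `largest_record_words` words.
std::size_t cos_buffer_blocks(std::size_t largest_record_words) {
    // Blocks needed for one half-buffer, rounded up.
    const std::size_t half_blocks =
        (largest_record_words + kWordsPerBlock - 1) / kWordsPerBlock;
    std::size_t blocks = 2 * half_blocks;  // even by construction
    if (blocks < 64) blocks = 64;          // smallest even number above 63
    assert(cos_double_buffers(blocks));
    return blocks;
}
```

For a largest record of 100,000 words, each half-buffer needs 196 blocks, so the sketch suggests a 392-block buffer; records of 16,384 words or fewer fit the 64-block minimum.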
4.5.4 Balancing workload
Device I/O speeds are typically slower than CPU computation speeds by several orders of magnitude. If the code does not perform sufficient computation between I/O requests, it will spend most of its time waiting on I/O and lose the benefit of using asynchronous I/O. Try to balance both the I/O activity and the computation involving its data by moving as much of the CPU work as possible into the code that lies between asynchronous I/O requests.
4.5.5 Minimizing required synchronization
During asynchronous I/O processing, code reaches a synchronization point at which it has to wait for I/O completion before continuing. With an imbalance between CPU and I/O activity, this causes extended I/O wait time and an idle CPU. If this happens frequently, attempt to restructure the code to reduce required synchronization points.
4.5.6 Tune FFIO user cache
If you are using asynchronous I/O through the cachea, bufa, or cos.async FFIO layers, you can adjust their sizes by using the UNICOS assign(1) command. For complete information on controlling buffers and cache pages, see the Application Programmer's I/O Guide, publication SG-2168.
If you have some flexibility with the storage devices your code uses, ensure that it
uses the fastest devices available for the appropriate situations. The following sections
describe storage devices and the situations in which they are best used.
With few exceptions, system calls are required for all physical I/O requests and data movement to or from the library buffer. The following options minimize system calls:
Next | Section 5: Analyzing CPU-bound code.
Previous | Section 3: Analyzing I/O-bound code
Return | Introduction and Table of Contents