
The purpose of this page is to convey tips for getting better performance with your I/O on the Kraken Lustre file systems. You can also view our list of I/O Best Practices.

Lustre File System
The Lustre file system on Kraken exists across a set of 336 block storage devices, referred to as Object Storage Targets (OSTs), that are managed by 48 service nodes serving as Object Storage Servers (OSSs). Each file in a Lustre file system is broken into chunks and stored on a subset of the OSTs. A single service node serving as the Metadata Server (MDS) assigns and tracks all of the storage locations associated with each file in order to direct file I/O requests to the correct set of OSTs and corresponding OSSs. The metadata itself is stored on a block storage device referred to as the Metadata Target (MDT).
When a compute node needs to create or access a file, it requests the associated storage locations from the MDS and the associated MDT. I/O operations then occur directly with the OSSs and OSTs associated with the file.
Figure 1: View of the Lustre File System. The route for data movement from application process memory to disk is shown by arrows. Figure 1 (c)2009 Cray Inc
Striping
Storing a single file across multiple OSTs (referred to as striping) offers two benefits: 1) an increase in the bandwidth available when accessing the file and 2) an increase in the available disk space for storing the file. However, striping is not without disadvantages, namely: 1) increased overhead due to network operations and server contention and 2) increased risk of file damage due to hardware malfunction. Given the tradeoffs involved, the Lustre file system allows users to specify the striping policy for each file or directory of files using the lfs utility.
Figure 2: Logical and physical views of striping. Four application processes write a variable amount of data sequentially within a shared file. This shared file is striped over 4 OSTs with 1 MB stripe sizes. This write operation is not stripe aligned; therefore, some processes write their data to stripes used by other processes. Some stripes are accessed by more than one process (which may cause contention). Additionally, OSTs are accessed by variable numbers of processes (3 for OST0, 1 for OST1, 2 for OST2, and 2 for OST3). Figure 2 (c)2009 Cray Inc
Performance concerns related to file striping include resource contention on the block device (OST) and request contention on the OSS associated with the OST. This contention is minimized when processes that access the file in parallel access file locations that reside on different stripes. Additionally, performance can be improved by minimizing the number of OSTs with which a process must communicate. An effective strategy to accomplish this is to stripe align your I/O requests: ensure that processes access the file at offsets that correspond to stripe boundaries. Stripe settings should take into account the I/O pattern utilized to access the file.
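To make stripe alignment concrete, the following minimal Python sketch maps a byte range to the OSTs it touches. It assumes a simple round-robin layout in which stripe number k of a file lives on the (k mod stripe count)-th OST assigned to that file; the function names are hypothetical and are not part of any Lustre tool.

```python
MIB = 1024 * 1024


def ost_for_offset(offset, stripe_size=MIB, stripe_count=4):
    """Return the position (0..stripe_count-1) in the file's OST list holding byte `offset`."""
    return (offset // stripe_size) % stripe_count


def osts_touched(offset, length, stripe_size=MIB, stripe_count=4):
    """Return the sorted OST positions touched by `length` bytes starting at `offset`."""
    if length <= 0:
        return []
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    # Once a request spans stripe_count stripes, every OST is involved.
    n = min(last - first + 1, stripe_count)
    return sorted({(first + i) % stripe_count for i in range(n)})


def is_stripe_aligned(offset, length, stripe_size=MIB):
    """True when the request starts and ends on stripe boundaries."""
    return offset % stripe_size == 0 and length % stripe_size == 0


if __name__ == "__main__":
    # An aligned 1 MB request stays on one OST.
    print(osts_touched(0, MIB))
    # The same request shifted by 512 KB straddles two stripes.
    print(osts_touched(MIB // 2, MIB))
```

Under these assumptions, the aligned request touches a single OST, while the shifted request of the same size touches two, which is the extra communication that stripe alignment avoids.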
Two commonly used lfs suboptions are getstripe and setstripe. The command lfs getstripe can be used to get striping information on files and directories, while the command lfs setstripe can be used to set the striping (and Lustre stripe buffer size).
The setstripe usage is as follows:
lfs setstripe <filename|dirname> -s <size> -i <index> -c <count>
where
| size | = | the number of bytes on each OST (0 indicating default of 1 MB) specified with k, m, or g to indicate units of KB, MB, or GB, respectively, |
| index | = | the OST index of first stripe (-1 indicating default), and |
| count | = | the number of OSTs to stripe over (0 indicating default of 4 and -1 indicating all OSTs [limit of 160]). |
For example, the command
lfs setstripe <dir> -s 0 -i -1 -c 1
sets the stripe count (width) to 1 on a directory. By default, files created within this directory inherit its stripe settings. This method is the easiest way to control the stripe settings of files created by an application.
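If stripe settings need to be applied from a job script or workflow tool, the command line can be assembled programmatically. The sketch below only builds the argument list, using the option letters exactly as shown on this page; the helper names and the directory name output_dir are hypothetical, and option letters may differ on other Lustre versions, so check the lfs documentation on your system before running the result.

```python
import shlex


def setstripe_command(path, size="0", index=-1, count=0):
    """Build an `lfs setstripe` argument list with the options described on this page.

    size  : bytes per stripe, e.g. "32m" (0 means the default)
    index : OST index of the first stripe (-1 means the default)
    count : number of OSTs to stripe over (0 means the default, -1 means all)
    """
    if count < -1:
        raise ValueError("count must be -1, 0, or a positive integer")
    return ["lfs", "setstripe", path,
            "-s", str(size), "-i", str(index), "-c", str(count)]


def getstripe_command(path):
    """Build an `lfs getstripe` argument list for a file or directory."""
    return ["lfs", "getstripe", path]


if __name__ == "__main__":
    # Hypothetical directory: 32 MB stripes over 8 OSTs, default starting OST.
    cmd = setstripe_command("output_dir", size="32m", count=8)
    print(shlex.join(cmd))
    # On a Lustre system the list could be passed to subprocess.run(cmd, check=True).
```

Building a list rather than a single string avoids shell quoting problems when directory names contain spaces.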
Serial I/O
Serial I/O includes those application I/O patterns in which one process performs I/O operations to one or more files. In general, serial I/O is not scalable.
Serial I/O is limited by the single process that performs I/O. I/O operations can only occur as quickly as that single process can read or write data to disk.
Parallelism in the Lustre file system cannot be exploited to increase I/O performance.
Larger I/O operations and matching Lustre stripe settings may improve performance. This reduces the latency of I/O operations.
Figure 3: Write performance for serial I/O at various Lustre stripe counts. File size is 32 MB per OST utilized and write operations are 32 MB in size. Utilizing more OSTs does not increase write performance. The best performance is seen by utilizing a stripe size which matches the size of write operations.
Figure 4: Write Performance for serial I/O at various Lustre stripe sizes and I/O operation sizes. File utilized is 256 MB written to a single OST. Performance is limited by small operation sizes and small stripe sizes. Either can become a limiting factor in write performance. The best performance is obtained in each case when the I/O operation and stripe sizes are similar.
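One way to obtain larger I/O operations in a serial writer is to collect small records in memory and write them out in blocks that match the stripe size. The following is a minimal Python sketch of that idea, not a tuned implementation; the class name AggregatingWriter and its parameters are hypothetical.

```python
import os
import tempfile

MIB = 1024 * 1024


class AggregatingWriter:
    """Collect small records and write them out in blocks of `op_size` bytes."""

    def __init__(self, path, op_size=32 * MIB):
        self.op_size = op_size
        self.buffer = bytearray()
        self.writes_issued = 0
        # buffering=0 disables Python's own buffering so block sizes are ours.
        self.file = open(path, "wb", buffering=0)

    def write(self, record):
        self.buffer += record
        while len(self.buffer) >= self.op_size:
            self._flush(self.op_size)

    def _flush(self, nbytes):
        chunk = bytes(self.buffer[:nbytes])
        del self.buffer[:nbytes]
        written = 0
        while written < len(chunk):  # an unbuffered write may be partial
            written += self.file.write(chunk[written:])
        self.writes_issued += 1

    def close(self):
        if self.buffer:  # write whatever is left over
            self._flush(len(self.buffer))
        self.file.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.bin")
        writer = AggregatingWriter(path, op_size=MIB)
        for _ in range(1000):  # 1000 records of 4 KB each
            writer.write(b"x" * 4096)
        writer.close()
        print(writer.writes_issued, os.path.getsize(path))
```

In the demonstration, 1000 records of 4 KB are written as three full 1 MB blocks plus one final partial block instead of 1000 separate small writes.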
File-per-Process
File-per-process is an I/O pattern in which each process of a parallel application writes its data to a private file. This pattern creates N or more files for an application run of N processes. The performance of each process's file write is governed by the statements made above for serial I/O. However, this pattern is the simplest implementation of parallel I/O, and it can obtain improved I/O performance from a parallel file system.
Each file is subject to the limitations of serial I/O.
Improved performance can be obtained from a parallel file system such as Lustre. However, at large process counts (large number of files) metadata operations may hinder overall performance. Additionally, at large process counts (large number of files) OSS and OST contention will hinder overall performance.
Figure 5: Write performance of a file-per-process I/O pattern as a function of number of files/processes. The file size is 128 MB with 32 MB sized write operations. Performance increases as the number of processes/files increases until OST and metadata contention hinder performance improvements.
Single-shared-file
A single-shared-file I/O pattern involves multiple application processes which either independently or concurrently share access to the same file. This particular I/O pattern can take advantage of both process and file system parallelism to achieve high levels of performance. However, at large process counts, contention for file system resources (OSTs) can hinder performance gains.
The layout of the single shared file and its interaction with Lustre settings is particularly important with respect to performance.
At large core counts file system contention limits the performance gains of utilizing a single shared file. The major limitation is the 160 OST limit on the striping of a single file.
Figure 6: Two possible shared file layouts. The aggregate file size is 1 GB or 2 GB, depending on which block size is utilized. The major difference between the file layouts is the locality of the data from each process. Layout #1 keeps data from a process in a contiguous block, while Layout #2 strides this data throughout the file. Thirty-two (32) processes will concurrently access this shared file.
Figure 7: Write performance utilizing a single shared file accessed by 32 processes. Stripe counts utilized are 32 (1 GB file) and 64 (2 GB file) with stripe sizes of 32 MB and 1 MB. A 1 MB stripe size on Layout #1 results in the lowest performance due to OST contention, because each OST is accessed by every process. In contrast, the highest performance is seen from a 32 MB stripe size on Layout #1, where each OST is accessed by only one process. A 1 MB stripe size gives better performance with Layout #2 than with Layout #1, because each OST is again accessed by only one process. However, the overall performance is lower than the best case due to the increased latency of the smaller write operations. With a stripe count of 64, each process communicates with 2 OSTs.
Figure 8: Write performance of a single shared file as the number of processes increases. A file size of 32 MB per process is utilized with 32 MB write operations. For each I/O library (POSIX, MPI-IO, and HDF5), performance levels off at high core counts.
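The OST access counts described for Figure 7 can be checked with a short calculation. The Python sketch below assumes round-robin striping, 32 MB blocks for Layout #1, and 1 MB blocks for Layout #2 (the block sizes are assumptions, since they are not stated explicitly here); all function names are hypothetical.

```python
MIB = 1024 * 1024


def layout1_offsets(rank, nprocs, bytes_per_proc, block):
    """Layout #1: each process owns one contiguous region of the file."""
    start = rank * bytes_per_proc
    return [start + i * block for i in range(bytes_per_proc // block)]


def layout2_offsets(rank, nprocs, bytes_per_proc, block):
    """Layout #2: blocks from all processes are interleaved through the file."""
    return [(i * nprocs + rank) * block for i in range(bytes_per_proc // block)]


def osts_per_process(offsets_fn, nprocs, bytes_per_proc, block,
                     stripe_size, stripe_count):
    """Map each rank to the set of OST positions that its blocks land on."""
    result = {}
    for rank in range(nprocs):
        touched = set()
        for off in offsets_fn(rank, nprocs, bytes_per_proc, block):
            first = off // stripe_size
            last = (off + block - 1) // stripe_size
            for stripe in range(first, last + 1):
                touched.add(stripe % stripe_count)
        result[rank] = touched
    return result


def summarize(per_proc, stripe_count):
    """Return (max OSTs used by one process, max processes using one OST)."""
    per_ost = {ost: 0 for ost in range(stripe_count)}
    for touched in per_proc.values():
        for ost in touched:
            per_ost[ost] += 1
    return max(len(t) for t in per_proc.values()), max(per_ost.values())


if __name__ == "__main__":
    cases = [
        ("Layout 1, 32 MB stripes, 32 OSTs", layout1_offsets, 32, 32, 32, 32),
        ("Layout 1,  1 MB stripes, 32 OSTs", layout1_offsets, 32, 32, 1, 32),
        ("Layout 2,  1 MB stripes, 32 OSTs", layout2_offsets, 32, 1, 1, 32),
        ("Layout 1, 32 MB stripes, 64 OSTs", layout1_offsets, 64, 32, 32, 64),
    ]
    for name, fn, mb_per_proc, block_mb, stripe_mb, count in cases:
        per_proc = osts_per_process(fn, 32, mb_per_proc * MIB, block_mb * MIB,
                                    stripe_mb * MIB, count)
        print(name, summarize(per_proc, count))
```

Under these assumptions, the calculation reproduces the caption: with 1 MB stripes on Layout #1 every process touches all 32 OSTs, with 32 MB stripes on Layout #1 or 1 MB stripes on Layout #2 each OST is used by a single process, and with a stripe count of 64 each process touches 2 OSTs.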
Subsetting I/O
At large core counts I/O performance can be hindered by the collection of metadata operations (File-per-process) or file system contention (Single-shared-file). One solution is to use a subset of application processes to perform I/O. This action will limit the number of files (File-per-process) or limit the number of processes accessing file system resources (Single-shared-file).
The following example creates an MPI communicator that includes only the I/O nodes (a subset of the total number of processes). It also shows independent and collective I/O with MPI-IO.
! listofionodes is an array of the ranks of writers/readers
call MPI_COMM_GROUP(MPI_COMM_WORLD, WORLD_GROUP, ierr)
call MPI_GROUP_INCL(WORLD_GROUP, nionodes, listofionodes, IO_GROUP, ierr)
call MPI_COMM_CREATE(MPI_COMM_WORLD, IO_GROUP, MPI_COMMio, ierr)

! ranks outside IO_GROUP receive MPI_COMM_NULL and must skip the file calls
if (MPI_COMMio /= MPI_COMM_NULL) then

  ! open
  call MPI_FILE_OPEN &
    (MPI_COMMio, trim(filename), filemode, finfo, mpifh, ierr)

  ! read/write
  call MPI_FILE_WRITE_AT &
    (mpifh, offset, iobuf, bufsize, MPI_REAL8, status, ierr)

  ! OR utilizing collective writes
  ! call MPI_FILE_SET_VIEW &
  !   (mpifh, disp, MPI_REAL8, MPI_REAL8, "native", finfo, ierr)
  ! call MPI_FILE_WRITE_ALL &
  !   (mpifh, iobuf, bufsize, MPI_REAL8, status, ierr)

  ! close
  call MPI_FILE_CLOSE(mpifh, ierr)

end if
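The example above leaves the contents of listofionodes to the application. One simple choice, shown here as a hedged Python sketch rather than a recommendation, is to spread the I/O ranks evenly over the rank range and have every other rank hand its data to the nearest I/O rank at or below it. The function names are hypothetical, and the data exchange itself (for example, an MPI gather within each group) is not shown.

```python
import bisect


def choose_io_ranks(nprocs, nionodes):
    """Return `nionodes` evenly spaced ranks chosen from 0..nprocs-1."""
    if not 1 <= nionodes <= nprocs:
        raise ValueError("need 1 <= nionodes <= nprocs")
    return [(i * nprocs) // nionodes for i in range(nionodes)]


def io_rank_for(rank, io_ranks):
    """Return the I/O rank responsible for `rank`.

    `io_ranks` must be sorted and start at 0, as produced by choose_io_ranks,
    so each rank maps to the closest I/O rank that is not larger than itself.
    """
    return io_ranks[bisect.bisect_right(io_ranks, rank) - 1]


if __name__ == "__main__":
    io_ranks = choose_io_ranks(12, 4)
    print(io_ranks)
    print([io_rank_for(r, io_ranks) for r in range(12)])
```

For 12 processes and 4 I/O ranks this selects ranks 0, 3, 6, and 9, each serving itself and the two ranks that follow it.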
If you cannot implement a subsetting approach, it is still to your advantage to limit the number of simultaneous file opens. Doing so limits the number of requests hitting the metadata server (of which there is only one) at the same time.
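One possible way to limit simultaneous opens is to let processes open their files in waves. The Python sketch below only computes which wave each rank belongs to; the function names and the limit of 4 concurrent opens are hypothetical. In an MPI code, each rank would open its file when the loop reaches its wave, with a barrier between waves.

```python
def open_wave(rank, max_concurrent):
    """Return the wave (0, 1, 2, ...) in which `rank` opens its file."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    return rank // max_concurrent


def wave_count(nprocs, max_concurrent):
    """Return how many waves are needed for `nprocs` ranks."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    return (nprocs + max_concurrent - 1) // max_concurrent


if __name__ == "__main__":
    nprocs, max_concurrent = 10, 4
    for wave in range(wave_count(nprocs, max_concurrent)):
        ranks = [r for r in range(nprocs)
                 if open_wave(r, max_concurrent) == wave]
        print(wave, ranks)
```

With 10 ranks and at most 4 concurrent opens, the ranks open their files in three waves, so the metadata server sees at most 4 open requests at a time.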

