The tips described here are divided into the following categories:
MPI version
MPI Environment Variables
Rank Placement
MPI Programming Techniques
It is assumed that the reader has a good knowledge of MPI and has run MPI programs on a cluster or other parallel HPC resource. Then these tips will show ways to improve performance on a Cray XT5 (and likely other XT versions). In addition, some of the tips apply to other cluster resources as well.
Warning: there are no silver bullets. Some tips work well only in certain circumstances, and thus is behooves the reader to understand their code and be willing and able to do performance testing.
MPI Version
On Cray XTs, it is important to always use the latest MPI from Cray. Cray packages their MPI (based on MPICH) into a package called MPT. Cray is always trying to improve their MPI and, quite often, newer versions of MPT have significant performance improvements over predecessors.
In addition to these performance improvements, sometimes a new version of MPT will have changed default behaviors/settings. For example, from MPT 3.0.x to MPT 3.1.x, many of the default buffer settings changed and so and settings users used to use with 3.0.x may not be appropriate for 3.1.x. In particular, MPT 3.1.x tries to dynamically set the right buffer sizes at job launch time (rather than use static settings like MPT 3.0.x.)
MPI Environment Variables
In this section, some environment variables that affect MPI performance are discussed. Cray has default settings based on average best performance of codes they have tested, but many codes may benefit from adjusting environment variables. Most of this information is available in the MPI man page – it is expected that users read the man page, and revisit it when new versions are installed.
Note that this document does not attempt to describe the many MPI buffer environment variables and how to set them. Please see Kim McMahon's presentation.
MPICH_FAST_MEMCPY
When set, this environment variable enables an optimized memcpyroutine in MPI. The optimized routine is used for local memory copies in the point-to-point and collective MPI operations. This can help performance of some collectives that send large (256KB and greater) messages. Very often this makes collectives faster, though speedup varies by size.
In particular, if message sizes are known to be greater than 1MB, then an optimized memcpycan be used that works well for large sizes, but not for smaller sizes. The default is not enabled.
Example: PHASTA code run at 2,048 cores: setting this environment variable reduced the walltime to completion from 262 seconds to 195 seconds.
MPICH_COLL_SYNC
If set, then a barrier is performed at the beginning of each specified MPI collective function. This forces all processes participating in that collective to sync up before the collective can begin. (Makes the implicit barrier explicit.) To enable this feature for all MPI collectives, set the value to 1. Default is off. This can be enabled for a selected list of MPI collectives.
There are rare examples where this helps; if the code has lots of collectives and MPI profiling shows imbalance (lots of sync time), this may help. For example, the PHASTA (CFD-turbulent flows) code sees an improvement at 2,048 processes – reduction in walltimefrom 262 seconds to 218 seconds. This is due to the fact that PHASTA has many MPI_Allreduce calls. However, setting this environment variable slowed down the NekTarG (CFD-Blood Flow) code by 7%.
MPICH_MPIIO_HINTS
If set, this override the default value of one or more MPI-IO hints. This also overrides any value set in the application code with calls to the MPI_Info_set routine. The hints are applied to the file when it is opened with an MPI_File_open() call.
Another environment variable that would likely be used in conjunction with this is the MPICH_MPIIO_HINTS_DISPLAY variable. If set, causes rank 0 in the participating communicator to display the names and values of all MPI-IO hints that are set for the file being opened with the MPI_File_open call.
Default settings:
PE 0: MPIIO hints for c2F.TILT3d.hdf5:
cb_buffer_size = 16777216
romio_cb_read = automatic
romio_cb_write = automatic
cb_nodes = #nodes/8
romio_no_indep_rw = false
ind_rd_buffer_size = 4194304
ind_wr_buffer_size = 524288
romio_ds_read = automatic
romio_ds_write = automatic
direct_io = false
cb_config_list = *:1
The syntax for this environment variable is:
export MPICH_MPIIO_HINTS=data.hdf5:direct_io=true # bash,ksh
Examples:
For FlashIO at 5,000 processes writing out 500MB per MPI thread, the following improved performance:
romio_cb_write = "ENABLE" romio_cb_read= "ENABLE" cb_buffer_size = 32MWhen the “romio_cb_*” hints are enabled, all collective reads/writes will use collective buffering. When disabled, all collective reads/writes will be serviced with individual operations by each process. When set to automatic, ROMIO will use heuristics to determine when to enable the optimization.
For S3D at 10K cores, the following settings improved performance.
- romio_ds_write = ‘disable'
- specifies if data sieving is to be done on read. Data; sieving is a technique for efficiently accessing noncontiguous regions of data
- romio_no_indep_rw = 'true'
- specifies whether deferred open is used.
ROMIO docs say that “romoio_no_indep_rw” indicates no independent read or write operations will be performed. This can be used to limit the number of processes that open the file.
Data sieving (romio_ds_*) is a technique for efficiently accessing noncontiguous regions of data in files when noncontiguous accesses are not provided as a file system primitive or where the noncontiguous access primitives are inefficient for a certain datatype. In the data sieving technique, a number of noncontiguous regions are accessed by reading a block of data containing all of the regions, including the unwanted data between them (called "holes"). The regions of interest are then extracted from this large block by the client. This technique has the advantage of a single I/O call, but additional data is read from the disk and passed across the network. For file systems with locking the data sieving technique can also be used for writes through the use of a read-modify-write process
MPICH_MPIIO_CB_ALIGN
If set to 1, new algorithms (in MPT 3.1.x) that take into account physical I/O boundaries and the size of I/O requests are used to determine how to divide the I/O workload when collective buffering is enabled. This can improve performance by causing the I/O requests of each collective buffering node (aggregator) to start and end on physical I/O boundaries and by preventing more than one aggregator making reference to any given stripe on a single collective I/O call. If set to zero or not defined, the algorithms in MPT release 3.0 are used. The default is not set.
MPICH_ENV_DISPLAY
If set, then rank 0 will display all MPICH environment variables and their current settings at MPI initialization time. The default is not enabled.
This can be useful for debugging purposes.
Similarly, MPICH_VERSION_DISPLAY displays the version of Cray MPT being used.
MPICH_SMP_OFF
If set, this disables the on-node SMP device and uses the Portals device for all MPI message transfers. Can be effective in rare cases where codes benefit from using Portals matching instead of MPI matching. The default is not enabled. Can be useful for debugging reproducibility issues.
MPICH_PTL_MATCH_OFF
If set, this disables registration of receive requests with portals. Setting this allows MPI to perform the message matching for the portals device. It may be beneficial to set this variable when an application exhausts portals internal resources and for latency-sensitive applications.
MPICH_PTL_SEND_CREDITS
Enables flow control to prevent the Portals event queue from being overflowed. Value of ‘-1’ should prevent queue overflow in any situation, but it does not always. Should only be used as needed, as flow control will result in less optimal performing code. If the Portals unexpected event queue cannot be increased enough, then flow control may need to be enabled.
Rank Placement
In some cases changing how the processes are laid out on the machine may affect performance, by relieving synchronization/imbalance time. The default is currently SMP-style placement. This means that for or a multi-core node, sequential MPI ranks are placed on the same node. In general, MPI codes perform better using SMP placement nearest neighbor. Note that the Cray MPI collectives have been optimized to be SMP aware.
For example, an 8-process job launched on a XT5 node with 2 quad-core processors would be placed as:
PROCESSOR 0 1 RANK 0,1,2,3 4,5,6,7
The default ordering can be changed using the environment variable:
MPICH_RANK_REORDER_METHOD
The following are the different values that you can set it to:
- 0
- Round-robin placement. Sequential MPI ranks are placed on the next node in the list.
- 1
- SMP-style placement. All cores from all nodes are allocated in a sequential order.
- 2
- Folded rank placement. Similar to default ordering except that the tasks N+1 ... 2N are mapped to slave cores of nodes N ... 1.
- 3
- Custom ordering. The ordering is specified in a file named MPICH_RANK_ORDER. (See the mpi man page for details.)
Generally, rank reordering is useful when (1) point-to-point communication consumes significant fraction of program time and load imbalance detected, (2) also shown to help for collectives (alltoall) on subcommunicators (GYRO), and (3) spread out I/O across nodes (POP).
Rank order and CrayPAT
The CrayPat performance measurement tools can be used to generate a custom ordering. This is available if MPI functions are traced (-g mpi or -O apa). For example,
pat_build -O apa my_program
Then, the
pat_reportutility can be used to generate suggested MPI rank orderings.- mpi_sm_rank_order
- Uses message data from tracing MPI to generate suggested MPI rank order. Requires the program to be instrumented using the pat_build -g mpi option.
- mpi_rank_order
- Uses time in user functions, or alternatively, any other metric specified by using the -s mro_metric options, to generate suggested MPI rank order.
Example output from CrayPAT:
Table 1: Suggested MPI Rank Order Eight cores per node: USER Samp per node Rank Max Max/ Avg Avg/ Max Node Order USER Samp SMP USER Samp SMP Ranks d 17062 97.6% 16907 100.0% 832,328,820,797,113,478,898,600 2 17213 98.4% 16907 100.0% 53,202,309,458,565,714,821,970 0 17282 98.8% 16907 100.0% 53,181,309,437,565,693,821,949 1 17489 100.0% 16907 100.0% 0,1,2,3,4,5,6,7The output from CrayPAT indicates that:
- the custom ordering “d” might be the best
- Folded-rank next best
- Round-robin 3rd best
- Default ordering last
Example: GYRO 8.0
Ran the GYRO code on problem B3-GTC with 1024 processes with alternate MPI orderings including the suggested custom ordering from CrayPAT.
Reorder method
comm time
Default
11.26s
0 – round-robin
6.94s
2 – folded-rank
6.68s
d-custom from apa
8.03s
The table indicates that the best time was achieved with the “folded-rank” MPI placement.
MPI Programming Techniques
Pre-posting receives
If possible, pre-post receives before sender posts the matching send. This is an optimization technique for all MPICH installations, not just Cray. Even an IBM manual states: “well-written applications try to pre-post their receives.” They also warn about posting too many.
Don't go crazy pre-posting receives though. Will hit Portals internal resource limitations eventually.
Example: Halo update – with four buffers (n,s,e,w), post all receive requests as early as possible. Makes a big difference on CNL (not as important on Catamount).
Overlapping communication with computation
Overlapping communication with computation is in some sense a corollary of the pre-posting receives advice. If non-blocking receives and sends can be intermixed with computational loops, then on some architectures communication may overlap computational work. However, not all architectures support overlapping communication and computation. Briefly, it is best to program with non-blocking MPI send and receives when possible so as to enable overlapped communication and computation. However, just the use of non-blocking sends and receives will not achieve this goal, rather the non-blocking communication needs to be intermixed with computational work. (And the pre-posting of receives described above really only works if there is computational work in-between the receives and sends to ensure the receives are clearly posted first.)
Example
Basic
9 pt computation Update ghost cell boundaries East/West IRECV, ISEND, WAITALL North/South IRECV, ISEND, WAITALLMaximal Irecv preposting
Prepost all IRECV 9 pt computation Update ghost cell boundaries East/West ISEND Wait on E/W IRECV only North/South ISEND, Wait on the rest *Makes use of temp buffersAggregating data
For very small buffers, aggregating data into fewer MPI calls (especially for collectives)is a simple way to vastly improve MPI performance. For example, one MPI_alltoall with an array of 3 reals is clearly better than three MPI_alltoalls with 1 real.
Note, this approach can be taken too far, and so a warning is “Do not aggregate too much.” The MPI protocol switches from an short (eager) protocol to a long message protocol using a receiver pull method once the message is larger than the eager limit. This limit is by default 128000 bytes, but it can be changed with the MPICH_MAX_SHORT_MSG_SIZE environment variable. The optimal size for messages most of the time is less than the eager limit.
Example
Turbulence code (DNS) replaced 3 AllGatherv calls by one with a larger message resulting in 25% less runtime for one routine.
Original for (index = 0; index < No; index++){ double tmp; tmp = 0.0; out_area[index] = Bndry_Area_out(A,labels[index]); gdsum(&outlet_area[index],1,&tmp); } for (index = 0; index < Ni; index++){ double tmp; tmp = 0.0; in_area[index] = Bndry_Area_in(A, labels[index]); gdsum(&inlet_area[index],1,&tmp);} void gdsum (double *x, int n, double *work) { register int i; MPI_Allreduce (x, work, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD); /* *x = *work; */ dcopy (n,work,1,x,1); return; }Improved for (index = 0; index < No; index++){ out_area[index] = Bndry_Area_out(A, labels[index]);} /* Get gdsum out of for loop */ tmp = new double[No]; gdsum (outlet_area, No, tmp); delete tmp; for (index = 0; index < Nin; index++){ in_area[index] = Bndry_Area_in(A, labels[index]);} /* Get gdsum out of for loop */ tmp = new double[Ni]; gdsum(inlet_area, Ni, tmp); delete tmp;

