Mira MPI Documentation

The MPI Standard

The MPI standard is a formal specification document that is backed by the MPI Forum but not by an official standardization body (e.g. ISO or ANSI). Fortunately, it has wide vendor support in HPC and is available on all common platforms.

Please see this page for all of the MPI standardization documents, including MPI-2.2 and MPI 3.0.

Blue Gene/Q currently provides support for the MPI-2.2 standard except for a few features that are incompatible with the underlying operating system (e.g. those that may require fork() or support for TCP/IP on the compute nodes). These features have never been provided on Blue Gene systems so there is no change in this respect.

MPI-3

There is no official plan to support MPI-3 on Blue Gene/Q but a subset of these features may be available at some point in the future.

A portable implementation of MPI-3 nonblocking collectives (NBC) on top of MPI-1 is provided by LibNBC.

All of the features of MPI-3 NBC and RMA are available already from PAMI, so adventurous users wishing to use e.g. remote atomics or nonblocking reductions can use PAMI for this purpose.
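
For reference, the sketch below shows what a nonblocking reduction looks like in standard MPI-3 syntax. It is not part of the MPI-2.2 library on Blue Gene/Q and will not compile against it; the function name nonblocking_sum is made up for this page, and the snippet only illustrates the interface that LibNBC or a PAMI-based workaround can stand in for.

#include <mpi.h>

/* Sketch of a nonblocking reduction in MPI-3 syntax.  This will not build
 * against the MPI-2.2 library on Blue Gene/Q; it only shows the interface
 * that LibNBC (or a hand-written PAMI equivalent) can provide today. */
void nonblocking_sum(double *local, double *global, int n, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Iallreduce(local, global, n, MPI_DOUBLE, MPI_SUM, comm, &req);

    /* ... independent computation can overlap the reduction here ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
}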

Many other features of MPI-3 can be implemented on Blue Gene/Q using a mixture of MPI, MPIX, PAMI and SPI calls. If you are using MPI-3 on other systems and would like these features on Blue Gene/Q, please contact support and we may be able to assist you in this respect.

Tutorials

The LLNL tutorial is excellent. Many other resources can be found online via popular search engines.

Explanation of MPI variants

I will refer only to mpicc here, but almost everything written will be language- and compiler-agnostic. The few exceptions will be noted.

Official MPI variants

There are six official versions of MPI on Blue Gene/Q:

$ ls -1 /bgsys/drivers/ppcfloor/comm/*/bin/mpicc
/bgsys/drivers/ppcfloor/comm/gcc/bin/mpicc
/bgsys/drivers/ppcfloor/comm/gcc.legacy/bin/mpicc
/bgsys/drivers/ppcfloor/comm/xl/bin/mpicc
/bgsys/drivers/ppcfloor/comm/xl.legacy/bin/mpicc
/bgsys/drivers/ppcfloor/comm/xl.legacy.ndebug/bin/mpicc
/bgsys/drivers/ppcfloor/comm/xl.ndebug/bin/mpicc

These are the only officially supported versions of MPI on Blue Gene/Q and will always correspond to the latest driver release installed on the system.

Debugging vs. no debugging

The "no debugging" variants of MPI are denoted by the presence of .ndebug in the directory path.

If you use an MPI build without debugging enabled (ndebug), software overhead is reduced. This lowers the latency of some functions, which benefits highly optimized and thoroughly debugged codes that are latency-limited.

Do not use the ndebug builds of MPI unless you are confident that your application will execute correctly. Otherwise you will lose the opportunity to debug erroneous usage of MPI and potentially some internal errors in MPI.

Legacy behavior vs. fine-grain locking

The "legacy" variants of MPI are denoted by the presence of .legacy in the directory path

The legacy versions of MPI use coarse-grain locking, which means mutual exclusion between threads at the MPI function level. The non-legacy versions of MPI use fine-grain locking, which means that some paths - point-to-point communication is the most important case - allow concurrent access to MPI by multiple threads, because locks are taken at a per-object level.

Fine-grain locking entails slightly higher locking overhead, so it should not be used unless it is required or its side effects (e.g. asynchronous progress) are desirable. Applications that send many small messages should also benefit from fine-grain locking, as it increases the injection rate by exploiting communication threads within MPI.

Traditional MPI codes that use only blocking MPI calls and do not make MPI calls from multiple threads at the same time (i.e. those that do not use MPI_THREAD_MULTIPLE) should perform better with the legacy builds of MPI.
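
For contrast, the following sketch (written for this page; the function name and parameters are invented) shows the kind of multi-threaded point-to-point usage that the fine-grain builds are designed for, with several threads issuing sends concurrently under MPI_THREAD_MULTIPLE.

#include <omp.h>
#include <mpi.h>

/* Illustrative sketch of the usage that benefits from fine-grain locking:
 * several OpenMP threads call MPI_Isend concurrently, which requires
 * MPI_THREAD_MULTIPLE and a non-legacy build.  The function name and
 * parameters are made up for this example; it assumes nthreads <= 64. */
void concurrent_sends(double *bufs, int nthreads, int count,
                      int dest, MPI_Comm comm)
{
    MPI_Request reqs[64];

    #pragma omp parallel num_threads(nthreads)
    {
        int t = omp_get_thread_num();
        /* per-object locking allows these sends to be injected in parallel */
        MPI_Isend(&bufs[t*count], count, MPI_DOUBLE,
                  dest, t /* tag */, comm, &reqs[t]);
    }

    MPI_Waitall(nthreads, reqs, MPI_STATUSES_IGNORE);
}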

Asynchronous progress

Submitting with the environment variable PAMID_THREAD_MULTIPLE set to 1 will activate communication helper threads that will drive MPI in the background, thereby enabling passive-target progress of MPI one-sided communication as well as nonblocking point-to-point. It is absolutely critical to use this feature if you use MPI RMA passive-mode communication. Other uses of MPI may or may not benefit from it.

You cannot enable asynchronous progress in c64 mode because there are no threads available to use for communication. If an MPI code requiring asynchronous progress also uses threads, threads should be scheduled such that both will be active. This should work for OpenMP but may fail for codes that use Pthreads directly if the user does not implement their threaded code in such a way that the kernel can deschedule them in favor of the communication threads.
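
The following sketch (written for this page; the function name and the two-rank scenario are invented) illustrates the passive-target pattern for which asynchronous progress matters most: while the target is busy with local work and makes no MPI calls, only a communication helper thread can move the data.

#include <unistd.h>  /* sleep(), standing in for computation */
#include <mpi.h>

/* Minimal sketch of passive-target one-sided communication.  Rank 0 puts
 * one double into rank 1's window while rank 1 is busy with local work;
 * without helper threads (PAMID_THREAD_MULTIPLE=1), the transfer may not
 * complete until rank 1 re-enters MPI at the barrier. */
void passive_put_example(MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    double buf = 0.0;
    MPI_Win win;
    MPI_Win_create(&buf, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, comm, &win);

    if (rank == 0) {
        double value = 42.0;
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1 /* target rank */, 0, win);
        MPI_Put(&value, 1, MPI_DOUBLE, 1, 0 /* displacement */, 1, MPI_DOUBLE, win);
        MPI_Win_unlock(1, win);
    } else {
        sleep(10);   /* local work only, no MPI calls */
    }

    MPI_Barrier(comm);    /* synchronize before anyone reads buf */
    MPI_Win_free(&win);
}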

The MPI Fortran module

You must use a module compiled with the same Fortran compiler - GNU or XLF - that you are using for your application. If you don't, it won't work and some versions (including most or all of the ones released in 2012) will crash.

All you need to remember is to use one of the four versions of MPI compiled with XL if you are using the XL Fortran compiler, and one of the two versions of MPI compiled with GCC if you are using the GNU Fortran compiler.

ALCF expects that this problem will be solved at some point in the future, but we have no timeline from IBM for this.

Unofficial MPI variants

Building MPI from source

The MPICH-based MPI source code for Blue Gene/Q is open source, so anyone can compile their own version and run with that. This is unsupported but may be perfectly functional for some users. See this page for instructions on how to build MPICH from source on Blue Gene/Q.

Because ALCF provides the LLVM compiler for users, there are MPI build scripts for Clang and Clang++. One can also build MPI from source using these compilers, although this is not necessary in order to use MPI with Clang and Clang++, since the C interface is compiler-agnostic. The C++ interface of MPI is probably compiler-sensitive, but since the C++ bindings were deprecated in MPI-2.2 and removed in MPI-3.0 (see MPI and C++ for more information), users should stop using the C++ interface to MPI as soon as possible.

TODO Document paths to LLVM MPI compiler scripts and site installs of MPICH built with Clang, should they exist (they don't yet).

Arbitrary ranks per node

Some application codes require a particular number of MPI ranks per node to match a static domain decomposition. However, on Blue Gene/Q, one can only request a power of two number of ranks per node (i.e. 1, 2, 4, 8, 16, 32 and 64). To ameliorate this inflexibility, ALCF has developed MARPN, which allows a user to run any number of ranks per node. MARPN is not officially supported but ALCF will try to ensure it is functional for users who need it.

Please note that MARPN is not the best solution to the problem it solves, but it is a solution that requires no code changes. Please see the MARPN page for more information on why one should or should not use it.

Environment variables

The Application Development Redbook documents a number of environment variables that affect the behavior of MPI. Here we document some of the important ones. Users are encouraged to read the Redbook for more information.

The prefix of an environment variable indicates what software it affects. The prefix PAMID indicates an option that affects MPI, specifically the PAMI device inside of MPICH. These environment variables have no effect upon Charm++, GASNet or codes that use PAMI directly.

The prefix MUSPI controls very low-level network options that will affect all communication software. Modification of these environment variables is strongly discouraged unless the user is certain that their usage is both necessary and correct.

Memory

  • PAMI_MEMORY_OPTIMIZED - Determines whether PAMI is configured for a restricted memory job. If not set, PAMI is not memory optimized and uses memory as needed to increase performance. A related option for reducing memory usage inside of MPI is PAMID_EAGER=497, which disables eager messages larger than one packet.
  • PAMID_DISABLE_INTERNAL_EAGER_TASK_LIMIT - This prevents the use of the eager protocol in point-to-point messages generated inside of MPI (e.g. for collective algorithms that do not use PAMI collectives), as the eager protocol can lead to excessive memory usage when the application enters a collective at different times on different processes. The default is 512k but should be set lower if memory usage leads to job failure and the stack trace indicates it has happened inside of MPI collective operations. This option is likely required when running with 32 or 64 processes-per-node (i.e. c32 and c64 modes).

If you find that PAMID_DISABLE_INTERNAL_EAGER_TASK_LIMIT is necessary for your code to run successfully, you might consider placing barriers in front of the collectives that are known to cause the application to crash. Some of the collective operations known to cause problems related to internal flow control are MPI_Comm_create and MPI_Comm_split, due to the allgather operation that happens internally.
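
For example, a barrier placed immediately in front of a problematic communicator creation might look like the following sketch (the function name and the color and key arguments are placeholders for whatever the application already uses):

#include <mpi.h>

/* Illustrative sketch: enter MPI_Comm_split with all ranks roughly
 * synchronized so that the internal allgather does not accumulate
 * unexpected eager messages on slow ranks. */
MPI_Comm split_with_barrier(MPI_Comm comm, int color, int key)
{
    MPI_Comm newcomm;
    MPI_Barrier(comm);                          /* synchronize entry */
    MPI_Comm_split(comm, color, key, &newcomm);
    return newcomm;
}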

Debugging

  • PAMID_VERBOSE - Provides debugging information during MPI_Abort() and during various MPI function calls. Some settings affect performance. To simplify debugging, set this variable to 1 for all applications. Setting this variable to 2 will generate output related to collective protocols on a per-communicator basis. Setting this variable to 3 will generate a very large amount of output and is discouraged when running on a large number of nodes (more than 128 is probably large in this case).
  • PAMID_STATISTICS - Turns on the printing of statistics for the message layer such as the maximum receive queue depth.
  • PAMID_COLLECTIVES - Setting this variable to 0 disables the use of optimized collectives. It can be very useful for debugging since optimized collectives are more likely to fail in certain cases.
  • PAMID_COLLECTIVE_name - Equivalent to PAMID_COLLECTIVES but on a per-collective basis, where the collective is determined by name. See the Redbook for details. It is probably necessary to use this option only after consulting the output generated by PAMID_VERBOSE=2.

Performance tuning

  • PAMID_THREAD_MULTIPLE - This variable sets three separate options - PAMID_ASYNC_PROGRESS, PAMID_CONTEXT_POST and PAMID_CONTEXT_MAX - such that MPI will make asynchronous progress on one-sided and two-sided operations and use internal parallelism in such a way as to maximize injection rate. This option cannot be used with the .legacy variants of MPI.
  • PAMID_EAGER and PAMID_RZV control the switchover between the eager and rendezvous protocols for MPI point-to-point. Some applications may benefit from non-default options, but users should not change these settings unless they have good reason to believe that the defaults are not optimal. See the Redbook for details.

There are a number of variables related to flow control for which the default may not be optimal for some codes. Because tuning these variables is non-trivial, users are encouraged to work with ALCF Support instead of experimenting with them on their own.

MPIX topology information

Many applications have a well-defined communication topology that can be mapped to a torus network topology. On Blue Gene/P, the torus was three-dimensional (ignoring the cores as a fourth dimension). On Blue Gene/Q, the network topology is a five-dimensional torus, although the fifth dimension is always of length two.

Because the majority of ALCF codes use MPI, only the MPIX-based topology API is documented here. There is also a low-level topology API that is documented here, although there is no performance benefit to using it. The MPIX topology API is a thin wrapper to the low-level topology API.

Example Code

This example code demonstrates the high-level torus query calls available through MPIX.

#include <stdio.h>
#include <mpi.h>
#ifdef __bgq__
#  include <mpix.h>
#else
#  warning This test should be run on a Blue Gene.
#endif

int main(int argc, char *argv[])
{
    int provided;
    MPI_Init_thread( &argc, &argv, MPI_THREAD_SINGLE, &provided );

    int rank, size;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    MPIX_Hardware_t hw;
    MPIX_Hardware(&hw);

    if (rank==0)
    {
        printf("%d: MPIX test on %d MPI processes \n", rank, size);
        printf("%d: clock freq    = %u MHz memory size   = %u MB \n", rank, hw.clockMHz, hw.memSize);
        printf("%d: torus dim.    = %u sizeOfPset    = %u\n", rank, hw.torus_dimension, hw.sizeOfPset);
        printf("%d: torus size    = (%u,%u,%u,%u,%u,%u) \n", rank, hw.Size[0], hw.Size[1], hw.Size[2], hw.Size[3], hw.Size[4], hw.Size[5] );
        printf("%d: torus wraps?  = (%u,%u,%u,%u,%u,%u) \n", rank, hw.isTorus[0], hw.isTorus[1], hw.isTorus[2], hw.isTorus[3], hw.isTorus[4], hw.isTorus[5] );
    }

    fflush(stdout);
    MPI_Barrier(MPI_COMM_WORLD);

    for (int i=0; i<size; i++)
    {
        if (rank==i)
        {
            printf("%d: physical rank = %u physical size = %u \n", rank, hw.prank, hw.psize);
            printf("%d: idOfPset      = %u rankInPset    = %u \n", rank, hw.idOfPset, hw.rankInPset);
            printf("%d: core ID       = %u proc per node = %u \n", rank, hw.coreID, hw.ppn);
            printf("%d: torus coords = (%u,%u,%u,%u,%u,%u) \n", rank, hw.Coords[0], hw.Coords[1], hw.Coords[2], hw.Coords[3], hw.Coords[4], hw.Coords[5] );
            fflush(stdout);
        }
        MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Finalize();

    return 0;
}

The output looks like this:

0: MPIX test on 2048 MPI processes 
0: clock freq    = 1600 MHz memory size   = 16384 MB 
0: torus dim.    = 5 sizeOfPset    = 0
0: torus size    = (2,2,4,4,2) 
0: torus wraps?  = (0,0,1,1,1) 
0: physical rank = 0 physical size = 2048 
0: idOfPset      = 0 rankInPset    = 0 
0: core ID       = 0 proc per node = 16 
0: torus coords = (0,0,0,0,0) 
1: physical rank = 1 physical size = 2048 
1: idOfPset      = 0 rankInPset    = 0 
1: core ID       = 1 proc per node = 16 
1: torus coords = (0,0,0,0,0) 
...

MPI and threads

Blue Gene/Q provides a wide variety of hybrid programming models. It is important to use these in the proper manner compliant with the MPI standard and other specifications. The MPI standard requires the user to initialize MPI using MPI_Init_thread and appropriate arguments when using threads. Below are a few examples.

MPI and OpenMP

The most common usage of OpenMP is fork-join, in which case MPI_THREAD_FUNNELED is sufficient. In this mode, only the main thread is permitted to make MPI calls. In practice, this is essentially identical to MPI_THREAD_SINGLE: no mutual exclusion or thread-local state is required in the MPI implementation, so no overhead is imposed. The only change when MPI_THREAD_FUNNELED is used is that MPI must use thread-safe runtime calls (e.g. malloc).

If you make MPI calls within OpenMP parallel regions but use one of OpenMP's mutual exclusion mechanisms (e.g. critical) then you need MPI_THREAD_SERIALIZED, which also imposes no runtime overhead on Blue Gene/Q (or any other MPICH-based implementation). If OpenMP threads make MPI calls without mutual exclusion, then MPI_THREAD_MULTIPLE is required, which does impose overhead since MPI must perform some locking internally. However, MPI on Blue Gene/Q is highly optimized for this use case relative to other implementations, so users may be pleasantly surprised that the overhead is not as bad as expected.
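
As a rough illustration of the serialized pattern described above, here is a minimal sketch written for this page (the function and its parameters are invented); the FUNNELED case is covered by the full example in the next subsection.

#include <mpi.h>

/* Illustrative sketch of the MPI_THREAD_SERIALIZED pattern: threads do
 * independent work inside an OpenMP parallel loop, but every MPI call is
 * wrapped in 'omp critical' so that only one thread is ever inside MPI. */
void serialized_sends(double *chunks, int nchunks, int count,
                      int dest, MPI_Comm comm)
{
    #pragma omp parallel for
    for (int i = 0; i < nchunks; i++) {
        /* ... thread-private computation on chunk i ... */
        #pragma omp critical
        {
            MPI_Send(&chunks[i*count], count, MPI_DOUBLE,
                     dest, i /* tag */, comm);
        }
    }
}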

Example Code

#include <stdio.h>
#include <stdlib.h>

#ifdef _OPENMP
#include <omp.h>
#else
#warning Your compiler does not support OpenMP, at least with the flags you're using. \
If you're using GCC or XLC, add -fopenmp or -qsmp=omp, respectively.
#endif

#include <mpi.h>

#define MPI_THREAD_STRING(level)  \
        ( level==MPI_THREAD_SERIALIZED ? "THREAD_SERIALIZED" : \
                ( level==MPI_THREAD_MULTIPLE ? "THREAD_MULTIPLE" : \
                        ( level==MPI_THREAD_FUNNELED ? "THREAD_FUNNELED" : \
                                ( level==MPI_THREAD_SINGLE ? "THREAD_SINGLE" : "THIS_IS_IMPOSSIBLE" ) ) ) )

int main(int argc, char ** argv)
{
    /* These are the desired and available thread support.
       A hybrid code where all MPI calls are made from the main thread can use FUNNELED.
       If threads are making MPI calls, MULTIPLE is appropriate. */
    int requested = MPI_THREAD_FUNNELED, provided;

    /* MPICH2 will be substantially more efficient than Open MPI
       for MPI_THREAD_{FUNNELED,SERIALIZED}, but this is unlikely
       to be a serious bottleneck. */
    MPI_Init_thread(&argc, &argv, requested, &provided);
    if (provided<requested)
    {
        printf("MPI_Init_thread provided %s when %s was requested.  Exiting. \n",
               MPI_THREAD_STRING(provided), MPI_THREAD_STRING(requested) );
        exit(1);
    }

    int world_size, world_rank;

    MPI_Comm_size(MPI_COMM_WORLD,&world_size);
    MPI_Comm_rank(MPI_COMM_WORLD,&world_rank);

    printf("Hello from %d of %d processors\n", world_rank, world_size);

#ifdef _OPENMP
    #pragma omp parallel
    {
        int omp_id  = omp_get_thread_num();
        int omp_num = omp_get_num_threads();
        printf("MPI rank = %2d OpenMP thread = %2d of %2d \n", world_rank, omp_id, omp_num);
        fflush(stdout);
    }
#else
    printf("MPI rank = %2d \n", world_rank);
    fflush(stdout);
#endif

    MPI_Finalize();
    return 0;
}

MPI and Pthreads

If multiple threads make MPI calls, MPI_THREAD_SERIALIZED or MPI_THREAD_MULTIPLE is required: the former if the user code implements mutual exclusion between different threads' access to MPI, and the latter if the MPI implementation is expected to do this.

Example code

This is a simple example that prints out information about MPI, Pthreads and OpenMP used together.

You should submit this job with different values of the environment variables POSIX_NUM_THREADS and OMP_NUM_THREADS. On Blue Gene/Q, this test will show where the threads are executing. On other systems, the hardware location queries simply return -1.

Makefile

CC      = mpicc
COPT    = -g -O2 -std=gnu99 -fopenmp
LABEL   = gnu

LD      = $(CC)
CFLAGS  = $(COPT)
LDFLAGS = $(COPT) bgq_threadid.o -lm -lpthread

all: mpi_omp_pthreads.$(LABEL).x

%.$(LABEL).x: %.o bgq_threadid.o
	$(LD) $(LDFLAGS) $< -o $@

%.o: %.c
	$(CC) $(CFLAGS) -c $< -o $@

clean:
	$(RM) $(RMFLAGS) *.o *.lst 

realclean: clean
	$(RM) $(RMFLAGS) *.x

bgq_threadid.c

#ifdef __bgq__
#include </bgsys/drivers/ppcfloor/spi/include/kernel/location.h>
#endif

/*=======================================*/
/* routine to return the BGQ core number */
/*=======================================*/
int get_bgq_core(void)
{
#ifdef __bgq__
    //int core = Kernel_PhysicalProcessorID();
    int core = Kernel_ProcessorCoreID();
    return core;
#else
    return -1;
#endif
}

/*==========================================*/
/* routine to return the BGQ hwthread (0-3) */
/*==========================================*/
int get_bgq_hwthread(void)
{
#ifdef __bgq__
    //int hwthread = Kernel_PhysicalHWThreadID();
    int hwthread = Kernel_ProcessorThreadID();
    return hwthread;
#else
    return -1;
#endif
}

/*======================================================*/
/* routine to return the BGQ virtual core number (0-67) */
/*======================================================*/
int get_bgq_vcore(void)
{
#ifdef __bgq__
    int hwthread = Kernel_ProcessorID();
    return hwthread;
#else
    return -1;
#endif
}

mpi_omp_pthreads.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <unistd.h>
#include <pthread.h>
#include <omp.h>
#include <mpi.h>

/* prototypes for the helper routines defined in bgq_threadid.c */
int get_bgq_core(void);
int get_bgq_hwthread(void);
int get_bgq_vcore(void);

/* this is to ensure that the threads overlap in time */
#define NAPTIME 3

#define MAX_POSIX_THREADS 64

static pthread_t thread_pool[MAX_POSIX_THREADS];

static int mpi_size, mpi_rank;
static int num_posix_threads;

void* foo(void* dummy)
{
    int i, my_pth = -1;
    pthread_t my_pthread = pthread_self();

    for (i=0 ; i<num_posix_threads ; i++)
        if (my_pthread==thread_pool[i]) my_pth = i;
    
    sleep(NAPTIME);

    int my_core = -1, my_hwth = -1;
    int my_omp, num_omp;
    #pragma omp parallel private(my_core,my_hwth,my_omp,num_omp) shared(my_pth)
    {
        sleep(NAPTIME);

        my_core = get_bgq_core();
        my_hwth = get_bgq_hwthread();

        my_omp  = omp_get_thread_num();
        num_omp = omp_get_num_threads();
        fprintf(stdout,"MPI rank = %2d Pthread = %2d OpenMP thread = %2d of %2d core = %2d:%1d \n",
                       mpi_rank, my_pth, my_omp, num_omp, my_core, my_hwth);
        fflush(stdout);

        sleep(NAPTIME);
    }

    sleep(NAPTIME);

    pthread_exit(0);
}

void bar()
{
    sleep(NAPTIME);

    int my_core = -1, my_hwth = -1;
    int my_omp, num_omp;
    #pragma omp parallel private(my_core,my_hwth,my_omp,num_omp)
    {
        sleep(NAPTIME);

        my_core = get_bgq_core();
        my_hwth = get_bgq_hwthread();

        my_omp  = omp_get_thread_num();
        num_omp = omp_get_num_threads();
        fprintf(stdout,"MPI rank = %2d OpenMP thread = %2d of %2d core = %2d:%1d \n",
                       mpi_rank, my_omp, num_omp, my_core, my_hwth);
        fflush(stdout);

        sleep(NAPTIME);
    }
    sleep(NAPTIME);
}

int main(int argc, char *argv[])
{
    int i, rc;
    int provided;
 
    MPI_Init_thread(&argc,&argv,MPI_THREAD_MULTIPLE,&provided);
    if ( provided != MPI_THREAD_MULTIPLE ) exit(1);
 
    MPI_Comm_size(MPI_COMM_WORLD,&mpi_size);
    MPI_Comm_rank(MPI_COMM_WORLD,&mpi_rank);
 
    MPI_Barrier(MPI_COMM_WORLD);
 
    sleep(NAPTIME);

#ifdef __bgq__
    /* BG_THREADLAYOUT may not be set; report 0 in that case */
    char * layout_env = getenv("BG_THREADLAYOUT");
    int bg_threadlayout = (layout_env != NULL) ? atoi(layout_env) : 0;
    if (mpi_rank==0) fprintf(stdout,"BG_THREADLAYOUT = %2d\n", bg_threadlayout);
#endif

    /* fall back to zero Pthreads if the variable is not set */
    char * pth_env = getenv("POSIX_NUM_THREADS");
    num_posix_threads = (pth_env != NULL) ? atoi(pth_env) : 0;
    if (num_posix_threads<0)                 num_posix_threads = 0;
    if (num_posix_threads>MAX_POSIX_THREADS) num_posix_threads = MAX_POSIX_THREADS;

    if (mpi_rank==0) fprintf(stdout,"POSIX_NUM_THREADS = %2d\n", num_posix_threads);
    if (mpi_rank==0) fprintf(stdout,"OMP_MAX_NUM_THREADS = %2d\n", omp_get_max_threads());
    fflush(stdout);

    if ( num_posix_threads > 0 ) {
        //fprintf(stdout,"MPI rank %2d creating %2d POSIX threads\n", mpi_rank, num_posix_threads); fflush(stdout);
        for (i=0 ; i<num_posix_threads ; i++){
            rc = pthread_create(&thread_pool[i], NULL, foo, NULL);
            assert(rc==0);
        }

        MPI_Barrier(MPI_COMM_WORLD);

        sleep(NAPTIME);

        for (i=0 ; i<num_posix_threads ; i++){
            rc = pthread_join(thread_pool[i],NULL);
            assert(rc==0);
        }
        //fprintf(stdout,"MPI rank %2d joined %2d POSIX threads\n", mpi_rank, num_posix_threads); fflush(stdout);
    } else {
        bar();
    }

    MPI_Barrier(MPI_COMM_WORLD);

    sleep(NAPTIME);

    MPI_Finalize();

    return 0;
}

Implementation Details

A presentation on the design of the communication software on Blue Gene/Q is frequently given at ALCF workshops. Slides from a past workshop are available.

MPICH

MPI for Blue Gene/Q is based upon MPICH. MPICH2 1.4 was the basis for MPI in the V1R1 driver, while MPICH2 1.5 is the basis for MPI in the V1R2 driver. Future drivers may offer support for more recent releases of MPICH but IBM has not committed to this.

Curious users wishing to understand the design of MPICH may consult the MPICH Developer Documentation but this should not be necessary. It is documented here for students and other researchers who may want to learn more about system software.

PAMI

MPI on Blue Gene/Q is implemented on top of PAMI, which is a high-level active-message API that has been used to implement many programming models, including Charm++ and PGAS languages (via GASNet).

Adventurous users who wish to understand how PAMI works should examine the comments in /bgsys/drivers/ppcfloor/comm/sys/include/pami.h and possibly the example code available in the driver source or here.
