This document presents MPP, a lightweight C++ wrapper for MPI that aims to improve the integration of MPI into the C++ language. MPP uses object-oriented concepts such as streams to simplify point-to-point communication through send/receive operators, and it introduces endpoints for reading from and writing to other processes. MPP handles user-defined data types generically, without requiring inheritance or serialization. Performance tests show that MPP outperforms Boost.MPI and adds only a small latency over the native C bindings. The goal of MPP is to provide a high-performance yet easy-to-use C++ interface to MPI.
1  if ( rank == 0 ) {
2    MPI_Send((const int[1]) { 2 }, 1, MPI_INT, 1, 1,
3             MPI_COMM_WORLD);
4    std::array<float,2> val = {3.14f, 2.95f};
5    MPI_Send(&val.front(), val.size(), MPI_FLOAT, 1, 0,
6             MPI_COMM_WORLD);
7  } else if (rank == 1) {
8    int n;
9    MPI_Recv(&n, 1, MPI_INT, 0, 1, MPI_COMM_WORLD,
10            MPI_STATUS_IGNORE);
11   std::vector<float> values(n);
12   MPI_Recv(&values.front(), n, MPI_FLOAT, 0, 0,
13            MPI_COMM_WORLD, MPI_STATUS_IGNORE);
14 }
Figure 1. Simple MPI program using C bindings.
Boost.MPI relies on the Boost serialization library [10] to handle the transmission of user-defined data types (i.e. the merging of objects with a sparse memory representation into a contiguous data chunk), which negatively impacts performance.
1 if ( world.rank() ==0 ) {
2 world.send( 1, 1, 2 );
3 world.send( 1, 0, std::array<float,2>({3.14f, 2.95f}) );
4 } else if (world.rank() == 1) {
5 int n;
6 world.recv(0, 1, n);
7 std::vector<float> values(n);
8 world.recv(0, 0, values);
9 }
Figure 2. Boost.MPI version of the program
from Figure 1
An object-oriented approach to improving the C++ MPI interface is OOMPI [9], which specifies send and receive operations in a more user-friendly way by overloading the C++ insertion << and extraction >> operators. In OOMPI, a Port towards a process rank is obtained by using the array subscript operator [] on a communicator object (see line 2 in Figure 3). A further advantage is the convenience of combining these operators in one C++ instruction when inserting or extracting data to/from the same stream. A drawback of OOMPI is the poor integration of arrays and user data types in general. For example, sending an array instance requires the programmer to explicitly instantiate an object of class OOMPI_Array_message, for which the size and type of the data must be manually specified, as in the current MPI specification (line 4). The support for generic user data types requires the objects being sent to inherit from the OOMPI_User_type interface. This is a rather severe limitation, as it does not allow any legacy class (e.g. the STL's containers) to be supported directly.
1  if ( OOMPI_COMM_WORLD.rank() == 0 ) {
2    OOMPI_COMM_WORLD[1] << 2;
3    std::array<float,2> val = {3.14f, 2.95f};
4    OOMPI_COMM_WORLD[1] <<
5      OOMPI_Array_message(&val.front(), val.size());
6  } else if (OOMPI_COMM_WORLD.rank() == 1) {
7    int n;
8    OOMPI_COMM_WORLD[0] >> n;
9    std::vector<float> values(n);
10   OOMPI_COMM_WORLD[0] >>
11     OOMPI_Array_message(&values.front(), n);
12 }
Figure 3. OOMPI version of the program from
Figure 1
In this paper, we combine some of the concepts presented in Boost.MPI and OOMPI and propose an advanced lightweight MPI C++ interface called MPP that aims at transparently integrating the message passing paradigm into the C++ programming language without sacrificing performance. Our approach focuses on point-to-point communications and the integration of user data types which, unlike Boost.MPI, relies entirely on native MPI_Datatypes for better performance. Our interface also utilizes advanced concepts from other parallel programming languages, such as future objects [5], which simplify the use of the asynchronous MPI routines.
Overall, MPP is designed with a specific focus on performance. As we target HPC systems, we understand how critical performance is and spent significant effort to reduce the interface overhead. We compare the performance of MPP with Boost.MPI and show that, for a simple ping-pong application, MPP achieves a four times higher throughput (in terms of messages per second). Compared to the pure C bindings, MPP has an increased latency of only 9%. As far as the handling of user data types is concerned, MPP is able to reduce the transfer time of a linked list (i.e. std::list<T> from the C++ STL) by up to 20 times compared to Boost.MPI. To determine the real benefit of using MPP for real applications, we rewrote the computational kernel of QUAD_MPI [7] to use Boost.MPI and MPP. The obtained results show a performance improvement of around 12% compared to Boost.MPI.
The rest of the paper is organized as follows. In Section 2 we introduce MPP as a lightweight C++ wrapper to MPI using small code snippets. In Section 3 we compare our library against Boost.MPI and a plain MPI implementation using two micro-benchmark codes and an application code called QUAD_MPI. Section 4 concludes the paper.
2 MPP: C++ Interface to MPI
We use object-oriented programming concepts and C++ templates to design a lightweight wrapper for MPI routines that simplifies the way in which MPI programs are written. Similar to Boost.MPI, we achieve this goal by reducing the amount of information required by MPI routines and by inferring as much as possible at compile time. By reducing the amount of code written by the users, we expect fewer programming errors. Furthermore, by making type checking safer, most common programming mistakes can be caught at compile time. In this work, we focus on point-to-point operators, as the specialised semantics of collective operations has no counterpart in the C++ STL. We also present a generic mechanism for handling C++ user data types which allows for easy transfer of C++ objects to any existing MPI routine (including collective operations).
2.1 Point-to-Point Communication
While Boost.MPI maintains in its API design the style of the traditional send/receive MPI routines, our approach is closer to OOMPI, aiming at a better C++ integration by defining these basic operations using streams. A stream is an abstraction that represents a device on which input and output operations are performed. Therefore, sending or receiving a message through an MPI channel can be seen as a stream operation. We introduce an mpi::endpoint class which has the semantics of a bidirectional stream from which data can be read (received) or to which data can be written (sent) using the << and >> operators. The concept of endpoints is similar to the Port abstraction of OOMPI; however, because our mechanism is based on generic programming, user-defined data types can be handled transparently. In contrast, OOMPI is based on inheritance, which forces the programmer to instantiate an OOMPI_Message class containing the data type and size required by the underlying MPI routines [11] (see line 4 in Figure 3).
Because an MPI send/receive operation offers more capabilities than C++ streams (e.g. message tags, non-blocking semantics), endpoints cannot be directly modelled using an "is-a" relationship. Fortunately, the STL's utilities (e.g. algorithms) are mostly based on templates, and endpoints can be passed to any generic function which relies on the << or >> stream operations. Figure 4 shows an example that uses an endpoint as argument to a generic read_from function. An endpoint is generated from a communicator using the () operator, to which the process rank is passed (line 3). The mpi::comm class is a simple wrapper for an MPI communicator with the capability of creating endpoints and retrieving the current process rank and the communicator size. mpi::comm::world refers to an instance of the comm class which wraps the MPI_COMM_WORLD communicator.
1  namespace mpi {
2    struct comm {
3      mpi::endpoint operator()(int) const;
4    };
5  } // end mpi namespace
6
7  template <class InStream, class T>
8  void read_from(InStream& in, T& val) {
9    in >> val;
10 }
11
12 int val[2];
13 // reads the first element of the val array from std::cin
14 read_from(std::cin, val[0]);
15
16 // receives 2nd element of val array from rank 1
17 read_from(mpi::comm::world(1), val[1]);
Figure 4. Example of usage of endpoints in a
generic function.
Figure 5 shows how the program in Figure 1 can be rewritten with MPP. First of all, the objects are either sent or received using stream operations, which makes the code more compact than the C MPI bindings (half the size) or Boost.MPI. Secondly, objects are automatically wrapped by a generic mpi::msg<T> object, which does not need to be specified by the user (as opposed to OOMPI). Adding this level of indirection allows MPP to handle both primitive and user data types in a way that is transparent to the user. R-values (i.e. values with no address, such as constants) are handled like any regular L-value (e.g. variables) using C++ constant references via the msg class, which avoids unnecessary memory allocation. The interface also allows message tags to be specified by manually constructing the message wrapper (e.g. line 4 in Figure 5).
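To illustrate the idea, the following is a minimal sketch, not MPP's actual implementation, of how an endpoint's insertion operator could wrap its argument in a msg<T> object and dispatch to MPI_Send through the mpi_type_traits introduced in Section 2.2; the dest and comm members of the endpoint and the exact class layout are assumptions made for this example.

namespace mpi {
  // hypothetical wrapper: holds a constant reference to the value plus a tag
  template <class T>
  struct msg {
    const T& value;
    int tag;
    msg(const T& v, int t = 0) : value(v), tag(t) {}
  };

  struct endpoint {
    MPI_Comm comm;
    int dest;

    // blocking send: address, size, and MPI type are inferred from the traits,
    // not spelled out by the user
    template <class T>
    endpoint& operator<<(const msg<T>& m) {
      const void* addr = mpi_type_traits<T>::get_addr(m.value);
      MPI_Send(const_cast<void*>(addr),
               static_cast<int>(mpi_type_traits<T>::get_size(m.value)),
               mpi_type_traits<T>::get_type(m.value),
               dest, m.tag, comm);
      return *this;
    }

    // plain values are wrapped automatically with the default tag 0
    template <class T>
    endpoint& operator<<(const T& value) {
      return *this << msg<T>(value);
    }
  };
} // end mpi namespace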
MPP also supports non-blocking semantics for the send and receive operations through the overloaded < and > operators. Unlike blocking send/receives, asynchronous operations return a future object [5] of class mpi::request<T> which can be polled to test whether the pending operation has completed or not. An example of non-blocking operations in MPP is shown in Figure 6. For non-blocking receives, the method T& get() waits for the underlying operation to complete (line 2) and, upon completion, returns a reference to the received value. The mpi::request<T> class also provides a void wait() and a bool test() method implementing the semantics of MPI_Wait and MPI_Test, respectively.
1  using namespace mpi;
2  if ( comm::world.rank() == 0 ) {
3    comm::world(1) << std::array<float,2>({3.14f, 2.95f});
4    comm::world(1) << msg(2, 1);
5  } else if (comm::world.rank() == 1) {
6    int n;
7    comm::world(0) >> msg(n, 1);
8    std::vector<float> values(n);
9    comm::world(0) >> values;
10 }
Figure 5. MPP version of the program from
Figure 1.
The example also shows MPP's support for receive operations which listen for messages coming from an unknown process by using the mpi::any constant rank when creating an endpoint (line 3).
1 float real;
2 mpi::request<float>&& req =
3 mpi::comm::world(mpi::any) > real;
4 // ... do something else ...
5 use( req.get() );
Figure 6. Non-blocking MPP endpoints.
Errors, which every MPI routine reports as an error code, are handled in MPP via C++ exceptions. Any call to an MPP routine can potentially throw an exception derived from mpi::exception. The get_error_code() method of this class allows the retrieval of the native error code.
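As an illustration, a failing MPP call could be handled as sketched below; the try/catch structure and the get_error_code() accessor follow the description above, while the particular recovery action in the catch body is only an example.

try {
  int n;
  mpi::comm::world(0) >> n;  // any MPP call may throw
} catch ( const mpi::exception& e ) {
  // inspect the native MPI error code carried by the exception
  std::cerr << "MPI error: " << e.get_error_code() << std::endl;
}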
2.2 User Data Types
OOMPI is one of the first APIs that tried to introduce support for user data types, through inheritance from an OOMPI_User_type class. Unfortunately, this mechanism is relatively weak because, by relying on inheritance, it does not allow the handling of class instances provided by third-party libraries (e.g. STL containers). Another attempt is the use of serialization in Boost.MPI which, although elegant, introduces a high runtime overhead. The objective of MPP is to reach the same level of integration with user data types as Boost.MPI without the performance loss, which we achieve by relying on the existing MPI support for user data types, i.e. MPI_Datatype. The definition of an MPI_Datatype is rather cumbersome and therefore not commonly used. Indeed, defining an MPI_Datatype requires the programmer to specify several pieces of information related to its memory layout, which often leads to programming errors that are very difficult to debug.
1  template <class T>
2  struct mpi_type_traits<std::vector<T>> {
3    static inline const T*
4    get_addr( const std::vector<T>& vec ) {
5      return mpi_type_traits<T>::get_addr(vec.front());
6    }
7    static inline size_t
8    get_size( const std::vector<T>& vec ) {
9      return vec.size();
10   }
11   static inline MPI_Datatype
12   get_type( const std::vector<T>& ) {
13     return mpi_type_traits<T>::get_type( T() );
14   }
15 };
16 ...
17 typedef mpi_type_traits<vector<int>> vect_traits;
18 vector<int> v = { 2, 3, 5, 7, 11, 13, 17, 19 };
19 MPI_Ssend( vect_traits::get_addr(v),
20            vect_traits::get_size(v),
21            vect_traits::get_type(v), ... );
Figure 7. Example of using mpi_type_traits to handle STL vectors.
However, because operations on data types are mapped to DMA transfers by the MPI library, the use of an MPI_Datatype outperforms other techniques based on software serialization.
The integration of user data types is achieved by using a design pattern called type traits [4]. An example is illustrated in Figure 7 for the C++ STL's std::vector<T> class. We let the user specialize a class which statically provides the compiler with the three pieces of information required to map a user data type to MPI_Datatypes:
1. the memory address from which the data type instance
begins;
2. the type of each element;
3. the number of elements.
Because a C++ vector is contiguously allocated in memory, its starting address is that of the first element; this address is computed recursively so that regular nested types (e.g. vector<array<float,10>>) are also handled (lines 3–6). The length is the number of elements in the vector (line 9), and the type is the data type of a vector element (lines 11–14). Because our mechanism is not based on inheritance (as in OOMPI), it is open for integration and use with third-party class libraries. Lines 17–21 show how the introduced type traits can be used with the MPI C binding.
This method can also be used for collective operations or for one of the several flavors of MPI_Send for which an appropriate operator cannot be defined.
Figure 8. MPP performance evaluation results: (a) number of ping-pong operations per second; (b) comparison of Boost.MPI and MPP for the STL's linked list (std::list<T>).
MPP also provides type traits for several STL containers such as vector, array, and list.
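As an example of such a specialization, a traits class for std::array<T,N> could look as sketched below; this follows the structure of Figure 7 and is an illustration rather than MPP's shipped code.

template <class T, size_t N>
struct mpi_type_traits<std::array<T,N>> {
  // std::array is contiguous, so the address of the first element suffices
  static inline const T* get_addr( const std::array<T,N>& arr ) {
    return mpi_type_traits<T>::get_addr(arr.front());
  }
  // the number of elements is known at compile time
  static inline size_t get_size( const std::array<T,N>& ) {
    return N;
  }
  // the element type determines the MPI datatype
  static inline MPI_Datatype get_type( const std::array<T,N>& ) {
    return mpi_type_traits<T>::get_type( T() );
  }
};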
3 Performance Evaluation
In this section we compare the performance of MPP against Boost.MPI and the standard C binding of MPI. We used Open MPI version 1.4.2 to execute the experiments. We did not consider OOMPI for the performance evaluation since its development stopped several years ago. We first compared the MPI bindings using micro-benchmarks and then using a real MPI application called QUAD_MPI, a C program that approximates an integral based on a quadrature rule [7].
3.1 Micro-Benchmarks
The purpose of the first experiment is to measure the latency overhead introduced by MPP and by Boost.MPI over the standard C interface to MPI. We implemented a simple ping-pong application which we executed on a shared memory machine with a single AMD Phenom II X2 555 dual-core processor (3.5 GHz, 1 MB of L2 cache, and 6 MB of L3 cache). This way, any data transmission overhead is minimized and the focus is solely on the interface overhead. Figure 8(a) displays the number of ping-pong operations per second for varying message sizes. MPP has approximately 9% higher latency for small messages compared to the native MPI routines. This overhead is due to the creation of a temporary status object corresponding to the MPI_Status returned by the MPI receive routine, containing the message source, size, tag, and error (if any). Compared to Boost.MPI, MPP nevertheless shows a consistent performance improvement of around 75% for small message sizes. Because both implementations use plain vectors to store the exchanged message, no serialization is involved that could explain the difference. We believe the main reason is that Boost.MPI is implemented as a library, so every call to an MPI routine pays the overhead of an additional function call. We avoid this problem in MPP with a purely header-based implementation, which allows all MPP routines to be inlined by the compiler, thus eliminating the extra call overhead. The graph also illustrates that, as expected, the overhead decreases for larger messages as the communication time becomes predominant.
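For reference, the measurement loop follows the usual ping-pong pattern sketched below using the plain C bindings; the buffer, message size, and iteration count are assumptions of this sketch and not the exact benchmark code.

double start = MPI_Wtime();
for (int i = 0; i < iterations; ++i) {
  if (rank == 0) {   // ping
    MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  } else {           // pong
    MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
  }
}
// ping-pong operations per second for this message size
double ops_per_sec = iterations / (MPI_Wtime() - start);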
In the second experiment, we compared MPP with Boost.MPI for the support of user-defined data types. We used a std::list<double> of varying size, exchanged between two processes in a loop repeated one thousand times. We executed the experiment on an IBM blade cluster with quad-core Intel Xeon X5570 processors interconnected through an InfiniBand network. We allocated the two MPI processes on different blades in order to simulate a realistic use case. Figure 8(b) shows the time necessary to perform this micro-benchmark for different list sizes, together with the speedup achieved by MPP over Boost.MPI.
1  double my_a, my_b;
2  my_total = 0.0;
3  if ( rank == 0 ) {
4    for ( unsigned q = 1; q < p; ++q ) {
5      my_a = ( ( p - q ) * a + ( q - 1 ) * b ) / ( p - 1 );
6      MPI_Send ( &my_a, 1, MPI_DOUBLE, q, 0, MPI_COMM_WORLD );
7
8      my_b = ( ( p - q - 1 ) * a + ( q ) * b ) / ( p - 1 );
9      MPI_Send ( &my_b, 1, MPI_DOUBLE, q, 0, MPI_COMM_WORLD );
10   }
11 } else {
12   MPI_Recv ( &my_a, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status );
13   MPI_Recv ( &my_b, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status );
14
15   for ( unsigned i = 1; i <= my_n; ++i ) {
16     x = ((my_n - i) * my_a + (i - 1) * my_b) / (my_n - 1);
17     my_total = my_total + f ( x );
18   }
19   my_total = (my_b - my_a) * my_total / (double) my_n;
20 }
Figure 9. Computational kernel of QUAD_MPI.
For small lists of 100 elements, the speedup is approximately 20; however, the performance gap closes as the list size increases. The reason lies in MPP's std::list support, which uses MPI_Type_struct and therefore requires enumerating all memory addresses that compose the object being sent. To create an MPI_Datatype for a linked list, three arrays have to be provided (a sketch of this construction follows the list):
• the displacement of each list element relative to the starting address;
• the size of each element;
• the data type of each element (i.e. O(3·N) memory overhead).
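The following is a minimal sketch, not MPP's internal code, of how such a datatype could be assembled for a std::list<double> from the three arrays listed above; it uses MPI_Type_create_struct (the current name of MPI_Type_struct) with absolute addresses, so the list can be sent from MPI_BOTTOM.

MPI_Datatype make_list_type( const std::list<double>& l ) {
  const int n = static_cast<int>(l.size());
  std::vector<int>          lengths(n, 1);          // one double per element
  std::vector<MPI_Aint>     displs(n);              // absolute element addresses
  std::vector<MPI_Datatype> types(n, MPI_DOUBLE);   // element datatype

  int i = 0;
  for ( const double& elem : l )
    MPI_Get_address( const_cast<double*>(&elem), &displs[i++] );

  MPI_Datatype list_type;
  MPI_Type_create_struct( n, lengths.data(), displs.data(),
                          types.data(), &list_type );
  MPI_Type_commit( &list_type );
  // send with MPI_Send(MPI_BOTTOM, 1, list_type, dest, tag, comm);
  return list_type;
}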
We observe in Figure 8(b) that building such a data type becomes more expensive as the list size increases, so that for large linked lists of over 50,000 elements software serialization outperforms the MPI data typing mechanism. A future optimization could improve the support of large data structures by integrating in MPP a mechanism that switches from MPI_Datatype to serialization beyond a critical size.
3.2 QUAD_MPI Application Code
The micro-benchmarks highlighted the low latency of the MPP bindings; however, this does not say much about the benefits of using MPP in real application codes.
1  my_total = 0.0;
2  if ( rank == 0 ) {
3    for ( unsigned q = 1; q < p; ++q ) {
4      world.send(q, 1, (( p - q ) * a + ( q - 1 ) * b) / ( p - 1 ));
5      world.send(q, 2, (( p - q - 1 ) * a + ( q ) * b) / ( p - 1 ));
6    }
7  } else {
8    double my_a, my_b;
9    world.recv(0, 1, my_a);
10   world.recv(0, 2, my_b);
11
12   for ( unsigned i = 1; i <= my_n; ++i ) {
13     x = ((my_n - i) * my_a + (i - 1) * my_b) / (my_n - 1);
14     my_total = my_total + f ( x );
15   }
16   my_total = (my_b - my_a) * my_total / (double) my_n;
17 }
Figure 10. Computational kernel of QUAD_MPI rewritten using Boost.MPI.
For this purpose we took a simple MPI application kernel called QUAD_MPI and rewrote it using Boost.MPI and MPP. QUAD_MPI is a C program which approximates an integral using a quadrature rule [7] and can be efficiently parallelized using MPI. From the original code [7], we extracted the computational kernel depicted in Figure 9. The process with rank 0 assigns to every other process a sub-interval of [A, B], and these bounds are then communicated using message passing routines. The number of communication statements in the code is limited, i.e. 2 · (P − 1), where P is the number of processes. Therefore, this code represents a good balance between communication and computation, making it a good choice for determining the benefits of the MPP bindings.
This QUAD_MPI kernel can be easily rewritten using Boost.MPI and MPP, as shown in Figures 10 and 11, respectively. In both cases, we removed the need to assign the value being sent to the my_a and my_b variables, because both Boost.MPI and MPP support sending R-values that are computed and directly sent to the destination (lines 4 and 5). The code at the receiver side is similar, the only difference being that we can now restrict the scope of the my_a and my_b variables to the else body only (lines 9 and 10), which allows faster machine code generation as the compiler can utilize the CPU registers more efficiently. Additionally, MPP allows for a further reduction of the code, as shown in Figure 11, since the two sends (line 4) and the two receives (line 9) can be combined into a single statement. MPP also relieves the programmer from the burden of specifying a message tag by using tag 0 by default. With MPP we are able to shrink the input code by 30% (in terms of the number of characters), which reduces the chance of programming errors and increases overall productivity.
1  my_total = 0.0;
2  if ( rank == 0 ) {
3    for ( unsigned q = 1; q < p; ++q ) {
4      comm::world(q) << ((p - q) * a + (q - 1) * b) / (p - 1)
5                     << ((p - q - 1) * a + q * b) / (p - 1);
6    }
7  } else {
8    double my_a, my_b;
9    comm::world(0) >> my_a >> my_b;
10
11   for ( unsigned i = 1; i <= my_n; ++i ) {
12     x = ((my_n - i) * my_a + (i - 1) * my_b) / (my_n - 1);
13     my_total = my_total + f ( x );
14   }
15   my_total = (my_b - my_a) * my_total / (double) my_n;
16 }
Figure 11. Computational kernel of QUAD_MPI rewritten using MPP.
We ran the three versions of the QUAD_MPI kernel on a machine with 16 cores (a dual-socket Intel Xeon CPU) and used shared memory to minimize communication costs and highlight the library overhead. We compiled the input programs with optimizations enabled (i.e. the -O3 flag), repeated each experiment 10 times, and report the average and standard deviation of the execution time (see Figure 12).
Because of the removal of the superfluous assignments to the my_a and my_b variables, the MPP version performs slightly faster than the original code. It is worth noting that, although the same optimization was applied to the Boost.MPI version, the large overhead of Boost.MPI cancels any benefit, making the resulting code the slowest of the three. Compared to Boost.MPI, the MPP version shows a performance improvement of around 12%.
4 Conclusions
In this paper we presented MPP as an advanced C++ interface to MPI. We combined some of the ideas of OOMPI and Boost.MPI into a lightweight, header-only interface smoothly integrated with the C++ environment. We introduced a transparent mechanism for dealing with user data types which, for small objects, is up to 20 times faster than Boost.MPI due to the use of MPI_Datatypes instead of software serialization. We showed that programs written using MPP are more compact than with the MPI C bindings and that the overhead introduced by the object-oriented design is negligible.
Figure 12. QUAD_MPI performance comparison.
Furthermore, MPP helps avoid common programming errors in two ways:
1. through its interface design that uses future objects to
avoid reading the buffer of an asynchronous receive
before data has been written;
2. by automatically inferring most of the input arguments
required by MPI routines.
The MPP interface is freely available at [2].
In the future we intend to extend the interface to support easier use of other complex MPI features such as dynamic process management, operations on communicators and groups, and the creation of process topologies.
5 Acknowledgments
This research has been partially funded by the Austrian Research Promotion Agency (FFG) under grant P7030-025-011 and by the Tiroler Zukunftsstiftung under the Translational Research Grant "Parallel Computing with Java for Manycore Computers".
References
[1] C99 standard. www.open-std.org/JTC1/SC22/wg14/www/docs/n1124.pdf
[2] MPI C++ Interface. https://github.com/motonacciu/mpp
[3] The MPI-1 Specification. http://www.mpi-forum.org/docs/docs.html
[4] A. Alexandrescu. Traits: The else-if-then of types. C++ Report, pages 22–25, 2000. http://erdani.com/publications/traits.html
[5] H. C. Baker, Jr. and C. Hewitt. The incremental garbage collection of processes. In Proceedings of the 1977 Symposium on Artificial Intelligence and Programming Languages, pages 55–59, New York, NY, USA, 1977. ACM.
[6] J. M. Squyres, B. Saphir, and A. Lumsdaine. The design and evolution of the MPI-2 C++ interface. In Proceedings of the 1997 International Conference on Scientific Computing in Object-Oriented Parallel Computing, Lecture Notes in Computer Science. Springer-Verlag, 1997.
[7] J. Burkardt. QUAD_MPI. http://people.sc.fsu.edu/~jburkardt/c_src/quad_mpi/quad_mpi.html
[8] P. Kambadur, D. Gregor, A. Lumsdaine, and A. Dharurkar. Modernizing the C++ interface to MPI. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science, pages 266–274. Springer Berlin / Heidelberg, 2006.
[9] B. C. McCandless, J. M. Squyres, and A. Lumsdaine. Object-Oriented MPI (OOMPI): A class library for the Message Passing Interface. In Proceedings of the Second MPI Developers Conference, pages 87–, Washington, DC, USA, 1996. IEEE Computer Society.
[10] R. Ramsey. Boost serialization library. www.boost.org/doc/libs/release/libs/serialization/
[11] A. Skjellum, D. G. Wooley, A. Lumsdaine, Z. Lu, M. Wolf, J. M. Squyres, B. McCandless, and P. V. Bangalore. Object-oriented analysis and design of the Message Passing Interface, 1998.