Checkpoint/Restart approaches for a thread-based MPI runtime Article
ADAM Julien, KERMARQUER Maxime, BESNARD Jean-Baptiste, et al.
Parallel Computing, 2019, vol. 85, p. 204-219.
Abstract
Fault-tolerance has always been an important topic when it comes to running
massively parallel programs at scale. Statistically, hardware and software
failures are expected to occur more often on systems gathering millions of
computing units. Moreover, the larger jobs are, the more computing hours would
be wasted by a crash. In this paper, we describe the work done in our MPI
runtime to enable both transparent and application-level checkpointing
mechanisms. Unlike the MPI 4.0 User-Level Failure Mitigation (ULFM) interface,
our work targets solely Checkpoint/Restart and ignores other features such as
resiliency. We show how existing checkpointing methods can be practically
applied to a thread-based MPI implementation given sufficient runtime
collaboration. The two main contributions are the preservation of high-speed
network performance during transparent C/R and the over-subscription of
checkpoint data replication thanks to a dedicated user-level scheduler support.
These techniques are measured on MPI benchmarks such as IMB, Lulesh and Heatdis,
and associated overhead and trade-offs are discussed.
Transparent high-speed network checkpoint/restart in MPI Article
ADAM Julien, BESNARD Jean-Baptiste, MALONY Allen D., et al.
Proceedings of the 25th European MPI Users’ Group Meeting. ACM, 2018. p. 12.
Abstract
Fault-tolerance has always been an important topic when it comes to running
massively parallel programs at scale. Statistically, hardware and software
failures are expected to occur more often on systems gathering millions of
computing units. Moreover, the larger jobs are, the more computing hours would
be wasted by a crash. In this paper, we describe the work done in our MPI
runtime to enable transparent checkpointing mechanism. Unlike the MPI 4.0
User-Level Failure Mitigation (ULFM) interface, our work targets solely
Checkpoint/Restart (C/R) and ignores wider features such as resiliency. We show
how existing transparent checkpointing methods can be practically applied to MPI
implementations given a sufficient collaboration from the MPI runtime. Our C/R
technique is then measured on MPI benchmarks such as IMB and Lulesh relying on
Infiniband high-speed network, demonstrating that the chosen approach is
sufficiently general and that performance is mostly preserved. We argue that
enabling fault-tolerance without any modification inside target MPI applications
is possible, and show how it could be the first step for more integrated
resiliency combined with failure mitigation like ULFM.
Introducing task-containers as an alternative to runtime-stacking Article
BESNARD Jean-Baptiste, ADAM Julien, SHENDE Sameer, et al.
Proceedings of the 23rd European MPI Users’ Group Meeting. ACM, 2016. p. 51-63.
Abstract
The advent of many-core architectures poses new challenges to the MPI
programming model which has been designed for distributed memory message
passing. It is now clear that MPI will have to evolve in order to exploit
shared-memory parallelism, either by collaborating with other programming models
(MPI+ X) or by introducing new shared-memory approaches. This paper considers
extensions to C and C++ to make it possible for MPI Processes to run into
threads. More generally, a thread-local storage (TLS) library is developed to
simplify the collocation of arbitrary tasks and services in a shared-memory
context called a task-container. The paper discusses how such containers
simplify model and service mixing at the OS process level, eventually easing the
collocation of arbitrary tasks with MPI processes in a runtime agnostic fashion,
opening alternatives to runtime stacking.
A Parallel and Resilient Frontend for High Performance Validation Suites Article
ADAM Julien, PÉRACHE Marc
International Conference on Vector and Parallel Processing. Springer, Cham, 2016. p. 248-255.
Abstract
In any well-structured software project, a necessary step consists in validating
results relatively to functional expectations. However, in the high-performance
computing (HPC) context, this process can become cumbersome due to specific
constraints such as scalability and/or specific job launchers. In this paper we
present an original validation front-end taking advantage of HPC resources for
HPC workloads. By adding an abstraction level between users and the batch
manager, our tool JCHRONOSS, drastically reduces test-suite running time, while
taking advantage of distributed resources available to HPC developers. We will
first introduce validation work-flow challenges before presenting the
architecture of our tool and its contribution to HPC validation suites.
Eventually, we present results from real test-cases, demonstrating effective
speed-up up to 25x compared to sequential validation time – paving the way to
more thorough validation of HPC applications.