“Advanced Multi Fault Tolerance” in parallel applications: Multi-level storage with the FTI middleware
Eric BOYER
CINES
Abstract: Fault tolerance is a key concern for future HPC systems. The time needed to write a checkpoint to a global file system increases dramatically with scale, while the mean time to interrupt (MTTI) is expected to shrink as the number of hardware components grows in future exascale systems.
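The combined effect of a longer checkpoint time and a shorter MTTI can be illustrated with Young's first-order approximation, a textbook model that is not taken from this abstract. The sketch below assumes a checkpoint cost C much smaller than the MTTI M; the function names and the sample figures are hypothetical.

```python
import math


def optimal_interval(checkpoint_cost_s, mtti_s):
    """Young's first-order estimate of the best time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtti_s)


def waste_fraction(checkpoint_cost_s, mtti_s):
    """Approximate fraction of machine time lost at the optimal interval.

    First-order model: waste = C/T + T/(2M), where C/T is the time spent
    writing checkpoints and T/(2M) is the expected rework after a failure.
    Only meaningful while C is much smaller than M.
    """
    t = optimal_interval(checkpoint_cost_s, mtti_s)
    return checkpoint_cost_s / t + t / (2.0 * mtti_s)


if __name__ == "__main__":
    # Hypothetical figures: a 10-minute checkpoint with a 24-hour MTTI,
    # then a 30-minute checkpoint with a 4-hour MTTI.
    for cost, mtti in [(600, 86400), (1800, 14400)]:
        print(cost, mtti, optimal_interval(cost, mtti), waste_fraction(cost, mtti))
```

In this model the waste at the optimal interval reduces to sqrt(2C/M), so it grows both when checkpoints get slower and when interrupts get more frequent. With the hypothetical 30-minute checkpoint and 4-hour MTTI, about half of the machine time is lost.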
The AMFT approach relies on several storage levels, fed by asynchronous processes, so that checkpointing continues to scale. The FTI library is available at the application level to implement advanced checkpointing features.
The project started at INRIA in 2010 with TiTech. CEA, CINES and GENCI then contributed through a PRACE prototype. The FTI library was enriched with new features and assessed at CEA-TGCC and at CINES.
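The idea of several storage levels fed asynchronously can be sketched in a few lines. The example below is a minimal illustration only: it is not FTI's API or implementation, it uses a single process with a background thread in place of dedicated asynchronous processes, and the class name, method names and file layout are all hypothetical.

```python
import os
import pickle
import shutil
import threading


class TwoLevelCheckpointer:
    """Hypothetical sketch: blocking write to fast local storage, then an
    asynchronous copy to slower shared storage. Not the FTI library."""

    def __init__(self, local_dir, global_dir):
        self.local_dir = local_dir
        self.global_dir = global_dir
        self._flush = None  # background copy currently in flight, if any

    def checkpoint(self, step, state):
        # Level 1: the application only blocks for the fast local write.
        name = f"ckpt_{step:06d}.pkl"
        local_path = os.path.join(self.local_dir, name)
        with open(local_path, "wb") as f:
            pickle.dump(state, f)
        # Level 2: copy to global storage in the background.
        self.wait()  # keep at most one flush in flight
        self._flush = threading.Thread(
            target=self._copy_to_global, args=(local_path, name)
        )
        self._flush.start()
        return local_path

    def _copy_to_global(self, local_path, name):
        # Write under a temporary name, then rename, so that a reader
        # never sees a partially copied checkpoint.
        tmp_path = os.path.join(self.global_dir, name + ".tmp")
        shutil.copyfile(local_path, tmp_path)
        os.replace(tmp_path, os.path.join(self.global_dir, name))

    def wait(self):
        # Block until the pending background copy (if any) has finished.
        if self._flush is not None:
            self._flush.join()
            self._flush = None

    def restore(self):
        # Prefer the local level; fall back to the global level if the
        # local copies were lost (for example after a node failure).
        for directory in (self.local_dir, self.global_dir):
            names = sorted(n for n in os.listdir(directory) if n.endswith(".pkl"))
            if names:
                with open(os.path.join(directory, names[-1]), "rb") as f:
                    return pickle.load(f)
        return None
```

The design choice shown here is that the application pays only for the local write, while the slower global copy overlaps with computation. A restart reads from the fastest level that still holds a checkpoint.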
Biography: Eric Boyer has been a High Performance Computing Research Engineer at CINES since 1994. He is an HPC architect, with expertise in data architectures specific to high-end supercomputers, and an evaluation expert in the acquisition process for HPC platforms. Since 2008 he has been responsible for national and international partnerships; he has led the CINES participation in HPC-Europa, PRACE PP, 1IP, 2IP and 3IP, and takes part in the definition of PRACE-2020. He has been involved in several prototype design and assessment programs: "Multicore processors and accelerator ClearSpeed based platform", in collaboration with LRZ; "ExaScale I/O", focusing on LUSTRE HSM integration, in partnership with CEA; and "Advanced Multi Fault Tolerance", addressing resiliency as an exascale challenge, with INRIA and CEA. He is also an expert member of a recruitment committee for the ministry of higher education. He holds an engineering diploma in computer science and is a graduate in mathematics and physics.