FFMK - A Fast and Fault-Tolerant Microkernel-Based System for Exascale Computing

With their vastly increased number of functional components, exascale computers will be extremely vulnerable to system failures and performance losses due to imbalances and operating system jitter. This project addresses these problems by designing, implementing, and testing a prototype of a microkernel-based operating system (OS) with a fast, integrated checkpointing/restart mechanism and a load/checkpoint management system.

Topics:

Exascale systems require the operating system's impact on applications to be minimal and deterministic. We propose to foster and enhance microkernels for use in HPC. The motivation is that current HPC clusters, which run full-featured all-in-one operating systems, will not scale to exascale because of OS noise.

Exascale systems require fast checkpointing and restart mechanisms. In exascale systems, the mean time between failures (MTBF) will eventually become shorter than the time needed to write a traditional disk-based checkpoint, even if the mechanism uses costly parallel I/O. We therefore propose to store checkpoints in memory and to provide flexible fault-tolerance levels by applying erasure codes.
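To illustrate the idea of protecting in-memory checkpoints with an erasure code, here is a minimal sketch using XOR parity, the simplest such code: it tolerates the loss of exactly one checkpoint in a group. The function names (`xor_blocks`, `make_parity`, `recover`) and the sample data are hypothetical and chosen for illustration only. This is not the project's implementation, and stronger codes would be needed to survive multiple simultaneous failures.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized byte strings."""
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def make_parity(checkpoints):
    """Compute one parity block over the checkpoints of a node group."""
    return xor_blocks(checkpoints)


def recover(surviving, parity):
    """Rebuild the single missing checkpoint from the survivors and the parity."""
    return xor_blocks(surviving + [parity])


# Hypothetical in-memory checkpoints of three nodes (equal length, assumed).
checkpoints = [b"node0-state", b"node1-state", b"node2-state"]
parity = make_parity(checkpoints)

# Simulate the failure of node 1 and rebuild its checkpoint.
survivors = [checkpoints[0], checkpoints[2]]
restored = recover(survivors, parity)
assert restored == checkpoints[1]
```

In this sketch the parity block would be held in the memory of a node outside the group, so a single node failure never destroys both a checkpoint and its parity. Varying the group size trades memory overhead against the number of nodes that one parity block protects, which is one way to obtain different fault-tolerance levels.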

The viability of our approach will be demonstrated with prototype implementations and suitable HPC application showcases.

Group

Scalable Algorithms

Heads

Reinefeld, Alexander, Prof. Dr.

Members

Schütt, Thorsten, Dr.

Steinke, Thomas, Dr.

Schintke, Florian, Dr.

Gholami Estahbanati, Masoud

Partners

TU Dresden

Hebrew University of Jerusalem, Israel

Related projects

XtreemFS

BabuDB

Funding

DFG

Publications

2024

Optimizing Checkpoint/Restart and Input/Output for Large Scale Applications. Doctoral thesis, Humboldt-Universität zu Berlin, 2024. Advisors: Alexander Reinefeld, Björn Scheuermann, Jens-Peter Redlich. Masoud Gholami.

2021

Combining XOR and Partner Checkpointing for Resilient Multilevel Checkpoint/Restart. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 277-288, 2021. Masoud Gholami, Florian Schintke.

2019

FFMK: A Fast and Fault-Tolerant Microkernel-Based System for Exascale Computing. Software for Exascale Computing - SPPEXA 2016-2019, pp. 483-516, 2019. Carsten Weinhold, Adam Lackorzynski, Jan Bierbaum, Martin Küttler, Maksym Planeta, Hannes Weisbach, Matthias Hille, Hermann Härtig, Alexander Margolin, Dror Sharf, Ely Levy, Pavel Gak, Amnon Barak, Masoud Gholami, Florian Schintke, Thorsten Schütt, Alexander Reinefeld, Matthias Lieber, Wolfgang Nagel.

Multilevel Checkpoint/Restart for Large Computational Jobs on Distributed Computing Resources. 2019 IEEE 38th Symposium on Reliable Distributed Systems (SRDS), pp. 143-152, 2019. Masoud Gholami, Florian Schintke.

2018

Checkpoint Scheduling for Shared Usage of Burst-Buffers in Supercomputers. Proceedings of the 47th International Conference on Parallel Processing Companion; SRMPDS 2018: The 14th International Workshop on Scheduling and Resource Management for Parallel and Distributed Systems, pp. 44:1-44:10, 2018. Masoud Gholami, Florian Schintke, Thorsten Schütt.

Modeling Checkpoint Schedules for Concurrent HPC Applications. CoSaS 2018 International Symposium on Computational Science at Scale, 2018. Masoud Gholami, Florian Schintke, Thorsten Schütt, Alexander Reinefeld.

2016

FFMK: A Fast and Fault-tolerant Microkernel-based System for Exascale Computing. SPPEXA Symposium 2016, 2016. Carsten Weinhold, Adam Lackorzynski, Jan Bierbaum, Martin Küttler, Maksym Planeta, Hermann Härtig, Amnon Shiloh, Ely Levy, Tal Ben-Nun, Amnon Barak, Thomas Steinke, Thorsten Schütt, Jan Fajerski, Alexander Reinefeld, Matthias Lieber, Wolfgang Nagel.

Fast In-Memory Checkpointing with POSIX API for Legacy Exascale-Applications. SPPEXA Symposium 2016, 2016. Jan Fajerski, Matthias Noack, Alexander Reinefeld, Florian Schintke, Thorsten Schütt, Thomas Steinke.

2015

Resilience in Exascale Computing (Dagstuhl Seminar 14402). Dagstuhl Reports, 4(9), pp. 124-139, 2015. Hermann Härtig, Satoshi Matsuoka, Frank Mueller, Alexander Reinefeld.

The Impact of Process Placement and Oversubscription on Application Performance: A Case Study for Exascale Computing. Proceedings of the 3rd International Conference on Exascale Applications and Software, EASC 2015, A. Gray, L. Smith, M. Weiland (Eds.), pp. 13-18, 2015, ISBN: 978-0-9926615-1-9 (preprint available as ZIB-Report 15-05). Florian Wende, Thomas Steinke, Alexander Reinefeld.

2014

QoS-aware Storage Virtualization for Cloud File Systems. Proceedings of the 1st ACM International Workshop on Programmable File Systems, pp. 19-26, PFSW '14, 2014. Christoph Kleineweber, Alexander Reinefeld, Thorsten Schütt.