FFMK
A fast and fault-tolerant microkernel-based system for exascale computing
Description
With their vastly increased number of functional components, exascale computers will be extremely vulnerable to system failures and to performance losses caused by load imbalances and operating system jitter. This project addresses these problems by designing, implementing, and testing a prototype of a microkernel-based operating system (OS) with a fast, integrated checkpoint/restart mechanism and a load/checkpoint management system.
- Exascale systems require the operating system's impact on applications to be minimal and deterministic. We propose to foster and enhance microkernels for use in HPC. This is motivated by the fact that current HPC clusters, which run fully fledged all-in-one operating systems, will not scale to exascale because of OS noise.
- Exascale systems require fast checkpoint and restart mechanisms. In exascale systems, the mean time between failures (MTBF) will eventually become shorter than the time needed to write a traditional disk-based checkpoint, even if the mechanism uses costly parallel I/O. Therefore, we propose to store checkpoints in memory and to provide flexible fault-tolerance levels by applying erasure codes.
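To illustrate the idea of protecting in-memory checkpoints with an erasure code, the following is a minimal sketch using single XOR parity, the simplest such code. It is not the project's implementation: the function names and the sample checkpoint data are hypothetical, and all checkpoints are assumed to have equal length. XOR parity tolerates the loss of exactly one checkpoint; codes with more redundancy, such as Reed-Solomon, would be needed for higher fault-tolerance levels.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(checkpoints):
    """Compute one parity block over all node checkpoints."""
    return xor_blocks(checkpoints)


def recover(surviving, parity):
    """Rebuild the single missing checkpoint from the survivors and the parity."""
    return xor_blocks(surviving + [parity])


# Hypothetical in-memory checkpoints of three nodes (equal length assumed).
checkpoints = [b"node0-state", b"node1-state", b"node2-state"]
parity = make_parity(checkpoints)

# Simulate the failure of node 1 and rebuild its checkpoint.
survivors = [checkpoints[0], checkpoints[2]]
restored = recover(survivors, parity)
assert restored == checkpoints[1]
```

In this sketch the parity block would be held in the memory of a further node, so that a single node failure never destroys both a checkpoint and the data needed to rebuild it.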
The viability of our approach will be demonstrated with prototype implementations and suitable HPC application showcases.
Members
Alexander Reinefeld, Thorsten Schütt, Thomas Steinke
Partners
TU Dresden; Hebrew University of Jerusalem, Israel
Funding
DFG
Duration
01/2013 - 12/2015
Department
Parallel and Distributed Systems
