With their vastly increased number of functional components, exascale computers will be extremely vulnerable to system failures and to performance losses caused by load imbalances and operating system jitter. This project addresses these problems by designing, implementing, and testing a prototype of a microkernel-based operating system (OS) with a fast, integrated checkpoint/restart mechanism and a load/checkpoint management system.
Topics:
- Exascale systems require minimal and deterministic operating system impact on applications. We propose to foster and enhance microkernels for use in HPC. The motivation is that current HPC clusters, which run full-featured all-in-one operating systems, will not scale to exascale because of the OS noise they introduce.
- Exascale systems require fast checkpoint and restart mechanisms. In exascale systems, the mean time between failures (MTBF) will eventually become shorter than the time needed to write a traditional disk-based checkpoint, even if the mechanism uses costly parallel I/O. Therefore, we propose to store checkpoints in memory and to provide flexible fault-tolerance levels by applying erasure codes.
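
To illustrate the idea of erasure-coded in-memory checkpoints, the following minimal sketch uses a single XOR parity block, the simplest erasure code, which tolerates the loss of one checkpoint. The project text does not name a specific code; the function names (`xor_blocks`, `make_parity`, `recover`) and the sample node states are hypothetical and chosen only for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_parity(checkpoints):
    """Compute one parity block over all in-memory checkpoints."""
    return xor_blocks(checkpoints)

def recover(survivors, parity):
    """Rebuild the single missing checkpoint from the survivors and the parity."""
    return xor_blocks(survivors + [parity])

# Hypothetical in-memory checkpoints of three nodes (equal length assumed).
checkpoints = [b"node0-state", b"node1-state", b"node2-state"]
parity = make_parity(checkpoints)  # would be held in a fourth node's memory

# Simulate the failure of node 1 and restore its state without touching disk.
lost = 1
survivors = [c for i, c in enumerate(checkpoints) if i != lost]
restored = recover(survivors, parity)
assert restored == checkpoints[lost]
```

A single parity block survives only one failure per group. Codes with more redundancy blocks, such as Reed-Solomon codes, tolerate more simultaneous failures at the cost of extra memory and computation, which is one way to obtain different fault-tolerance levels.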
The viability of our approach will be demonstrated with prototype implementations and suitable HPC application showcases.