With their vastly increased number of functional components, exascale computers will be extremely vulnerable to system failures and performance losses due to imbalances...
Combining XOR and Partner Checkpointing for Resilient Multilevel Checkpoint/Restart
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 277-288, 2021
Masoud Gholami, Florian Schintke
FFMK: A Fast and Fault-Tolerant Microkernel-Based System for Exascale Computing
Software for Exascale Computing - SPPEXA 2016-2019, pp. 483-516, 2019
Carsten Weinhold, Adam Lackorzynski, Jan Bierbaum, Martin Küttler, Maksym Planeta, Hannes Weisbach, Matthias Hille, Hermann Härtig, Alexander Margolin, Dror Sharf, Ely Levy, Pavel Gak, Amnon Barak, Masoud Gholami, Florian Schintke, Thorsten Schütt, Alexander Reinefeld, Matthias Lieber, Wolfgang Nagel
Multilevel Checkpoint/Restart for Large Computational Jobs on Distributed Computing Resources
2019 IEEE 38th Symposium on Reliable Distributed Systems (SRDS), pp. 143-152, 2019
Masoud Gholami, Florian Schintke
Checkpoint Scheduling for Shared Usage of Burst-Buffers in Supercomputers
Proceedings of the 47th International Conference on Parallel Processing Companion; SRMPDS 2018: The 14th International Workshop on Scheduling and Resource Management for Parallel and Distributed Systems, pp. 44:1-44:10, 2018
Masoud Gholami, Florian Schintke, Thorsten Schütt