While the compute power of HPC systems increases substantially, the bandwidth and latency for communication is not improved at the same rate. In this project, we address two main sources for limiting scalability: inter-process data exchange in parallel simulation codes, and I/O, e.g., for writing checkpoints that allow a quick restart in case of systems failure (fault tolerance). Large scale PDE solvers are increasingly communication-bound, which limits the exploitation of modern HPC systems. It is thus a requirement to reduce the communication even at cost of more computation. One approach are time-parallel methods, such as PFASST, which allow to exploit additional nodes for parallelization, but rely on the transfer of fine grid corrections from one time subinterval to the next. Fault tolerance becomes increasingly relevant for large scale HPC systems, where node failure is a common event. There is a trade-off between restart time, amount of data to be written, and frequency of checkpoints. OS controlled checkpoint writing is convenient, but incurs a high communication cost (potential large memory dumps). Application-level fault tolerance allows to restrict the amount of data to be written to checkpoints to the smaller subset of data needed for a restart. The project aims at exploring the use of lossy FE data compression for reducing the communication cost in HPC computations in exchange for a small increase in computation cost. We will investigate both inter-node communication in time-parallel PDE solves and I/O for checkpointing.

Background

Ongoing trends in high performance computing are

  • rapidly increasing compute power per node (MIC, GPU)
  • modestly growing memory per node
  • modestly growing inter-node communication bandwidth
  • modestly growing mass storage bandwidth.

As a result, massively parallel simulations are increasingly communication-bound, and for scalability, increasing the compute/communication ratio is necessary.

PFASST

The Parallel Full Approximation Scheme in Space and Time by Emmet and Minion is an efficient time-parallel PDE solver. It is based on spectral deferred correction (SDC) methods for ordinary differential equations, e.g. arising from a method-of-lines finite element discretization of the PDE. SDC methods utilize a low-order time-stepping method like (semi-) implicit Euler to iteratively solve the collocation equation corresponding to the Picard integral formulation of the initial value problem. PFASST subdivides the time domain into sub-intervals, on which SDC iterations are performed in parallel, with frequent communication of updated finite element solution vectors between parallel ranks.

Compression of FE solutions
 

To reduce the communication overhead, we use transform coding to compress MPI messages. Techniques developed in the project Trajectory Compression are based on changing the basis representation of finite element coefficient vectors, e.g. from nodal to hierarchical basis, quantization of the resulting coefficients, and entropy coding. The resulting inexactness influences the convergence behavior of PFASST, such that suitable quantization tolerances need to be determined.

 

Application-based fault tolerance

In future exa-scale HPC systems, node failure will be a common event, requiring implementation of restart facilities. While OS controlled checkpointing is convenient, it results in large memory dumps. Application-based checkpointing allows the application to decide when and what to store. Additionally, checkpoints can be written with reduced accuracy, e.g. in the context of finite element methods by using lossy compression to reduce the checkpoint size.