Checkpointing to Save Job Progress

Saving Checkpoints for Long Simulations

What is Checkpointing?

Checkpointing is the process of saving necessary data from a running simulation, usually implemented either to restart a job or as a safe point in case of system failure.

Good Practices for Checkpointing

Plan Ahead

  • Many software applications provide options for checkpointing and restarting simulations. Before starting a simulation, ensure that the relevant checkpointing flags are turned on.
  • Every application uses different nomenclature and formatting for these options, so check that you are using the appropriate flags for the application in use.
  • Alternatively, the Rescale platform has a native checkpointing function, Snapshot, that enables users to easily store intermediate files. Note that this method is not optimal for restarts.

Software-Based Checkpointing and Restart Procedures

Checkpoint Relevant Information Only

  • It is good practice to save only the information needed to restart your simulation.
  • Writing excessive data can slow down the simulation or lead to out-of-memory system failures.
  • Most applications can write restart files that are used to resume a simulation. For example, Abaqus writes a restart (.res) file that can be used to restart the simulation from the last saved step or increment.
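
For custom scripts or in-house solvers that have no built-in restart option, the same idea can be sketched in Python. This is a minimal illustration, not the mechanism any particular application uses: the function names, the JSON format, and the contents of the state are all assumptions. It stores only the step counter and the state needed to resume, and writes through a temporary file so that a crash mid-write does not corrupt the previous checkpoint.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, state):
    """Write only what is needed to resume: the step counter and the state."""
    payload = {"step": step, "state": state}
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory first, so an
    # interrupted write does not damage the existing checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(payload, handle)
    # Replace the old checkpoint with the new one in a single rename.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (step, state), or (0, None) when no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0, None
    with open(path) as handle:
        payload = json.load(handle)
    return payload["step"], payload["state"]
```

On restart, a script would call load_checkpoint first and continue from the returned step instead of starting from zero.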

Monitoring Simulation

  • For long-running jobs, we recommend monitoring your job at regular intervals so that you can catch errors as they arise.
  • In addition to identifying errors, regular monitoring allows you to check progress and to stop a simulation in cases where the application does not stop automatically after an error.
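
Part of this monitoring can be scripted. The sketch below scans a solver log for error markers; the marker list and function names are hypothetical and should be replaced with the strings your application actually prints.

```python
import re

# Hypothetical error markers; adjust to match your solver's log messages.
ERROR_PATTERN = re.compile(r"\b(error|fatal|diverg\w*|nan)\b", re.IGNORECASE)


def find_errors(log_text):
    """Return the log lines that contain one of the error markers."""
    return [line for line in log_text.splitlines() if ERROR_PATTERN.search(line)]


def check_log(path):
    """Read a log file and return its suspicious lines."""
    # errors="replace" keeps the scan going if the log has undecodable bytes.
    with open(path, errors="replace") as handle:
        return find_errors(handle.read())
```

A non-empty result is a prompt to inspect the job by hand; a simple keyword scan like this can produce false positives and cannot replace a review of the solver output.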

Try to Avoid the Following

Checkpointing Too Often

  • Excessive checkpointing uses up the available storage on the cloud instance. This can interrupt the simulation and lead to system errors caused by insufficient storage.
  • Writing output files too frequently also slows down the simulation and increases the overall job time.
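
One way to limit storage use, when your application does not manage this itself, is to keep only the most recent checkpoints. The following is a sketch under assumptions: the checkpoint_*.dat naming pattern and the number of files to keep are placeholders, and the function names are illustrative.

```python
import glob
import os
import shutil


def prune_checkpoints(directory, pattern="checkpoint_*.dat", keep=3):
    """Delete all but the `keep` most recently modified checkpoint files."""
    files = sorted(glob.glob(os.path.join(directory, pattern)), key=os.path.getmtime)
    # Everything except the last `keep` entries is considered stale.
    stale = files[:-keep] if keep > 0 else files
    for path in stale:
        os.remove(path)
    return stale


def free_fraction(directory):
    """Return the fraction of disk space still free on the directory's volume."""
    usage = shutil.disk_usage(directory)
    return usage.free / usage.total
```

Calling prune_checkpoints after each new checkpoint keeps storage use bounded, and free_fraction gives a quick check of how close the instance is to running out of space.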

No Checkpointing

  • Failure to perform regular checkpointing could result in the loss of progress and data in the event of a system failure.
  • For example, if your simulation runs for several days, we advise checkpointing every few hours of run time.
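
For a custom simulation loop, a fixed checkpoint interval can be sketched as a small timer. This assumes the interval is measured in wall-clock run time; the class name and the two-hour value in the usage note are illustrative only.

```python
import time


class CheckpointTimer:
    """Report when a checkpoint is due, based on elapsed wall-clock time."""

    def __init__(self, interval_seconds, clock=time.monotonic):
        self.interval = interval_seconds
        self.clock = clock  # injectable so the timer can be tested
        self.last = clock()

    def due(self):
        """Return True and reset the timer once the interval has elapsed."""
        now = self.clock()
        if now - self.last >= self.interval:
            self.last = now
            return True
        return False
```

Inside the main loop, a script would create CheckpointTimer(2 * 3600) once and write a checkpoint whenever due() returns True, which avoids both checkpointing on every step and never checkpointing at all.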