Consulting

Checkpoint/Restore

Fault tolerance, preemption recovery, and context migration for AI and HPC workloads.

Why Checkpoint/Restore?

Long-running AI training jobs, HPC simulations, and bioinformatics pipelines are increasingly run on preemptible or spot instances to reduce cost. When those instances are reclaimed, hours or days of computation can be lost. Checkpoint/Restore (C/R) addresses this by saving application state to persistent storage and resuming from the last saved state on a new instance. With transparent tools such as CRIU, this can often be done without changing application code.
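To make the idea concrete, here is a minimal sketch of the in-app flavor of checkpointing, written in Python. It is an illustration only, not how CRIU or any specific product works: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON state format, and the `stop_after` parameter used to simulate an instance reclaim are all hypothetical. The one detail worth noting is the atomic write: the state goes to a temporary file first and is then renamed over the target, so a preemption in the middle of a save should not leave a corrupt checkpoint behind.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic replacement of the old checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, num_steps, checkpoint_every=10, stop_after=None):
    """Toy workload: sum the integers 0..num_steps-1, checkpointing periodically.

    stop_after (hypothetical) simulates preemption by returning early
    once that many steps have completed, without saving.
    """
    state = load_checkpoint(path)
    while state["step"] < num_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return state  # simulated instance reclaim
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

In this sketch, a run that is interrupted at step 25 with `checkpoint_every=10` loses only the five steps since the last save; calling `run` again with the same path picks up from step 20. In a real deployment the checkpoint path would need to live on storage that outlasts the instance, and the checkpoint interval is a trade-off between save overhead and the amount of work lost on preemption.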

What I Offer

  • Architecture review: evaluate your workloads for C/R readiness and identify the right approach (CRIU, in-app checkpointing, MemVerge MemMachine, or hybrid)
  • Integration: integrate C/R into existing pipelines on AWS Batch, Kubernetes, SLURM, or custom schedulers
  • AI memory management: use memory platforms to make model context portable across LLMs and infrastructure
  • Cost optimization: combine C/R with spot/preemptible instances to cut compute costs while maintaining reliability

Background

As an HPC Developer Advocate at AWS, I helped drive Checkpoint/Restore integration into AWS Batch for bioinformatics and HPC workloads. At MemVerge, I work on MemMachine, an AI memory platform that makes context portable across different LLMs and cloud infrastructure.

This page is a work in progress. More details coming soon.

Book a call to discuss your needs