Checkpoint/Restore Consulting

Why Checkpoint/Restore?

Long-running AI training jobs, HPC simulations, and bioinformatics pipelines are increasingly run on preemptible or spot instances to reduce cost. When those instances are reclaimed, hours or days of computation can be lost. Checkpoint/Restore (C/R) addresses this by saving application state to persistent storage and resuming it on a new instance. With transparent, system-level tools such as CRIU, this can often be done without changes to application code.
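To make the save-and-resume idea concrete, here is a minimal sketch of in-application checkpointing in Python. It shows the general pattern only, not how CRIU or any specific product works. The function names (save_checkpoint, load_checkpoint, run), the JSON state layout, and the stop_after parameter used to simulate a reclaimed instance are all hypothetical and chosen for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state as JSON via a temp file and rename, so an interruption
    in the middle of a write cannot leave a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic when source and target are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, num_steps, checkpoint_every=10, stop_after=None):
    """Toy workload: sums the step indices, checkpointing periodically.
    stop_after (hypothetical) simulates the instance being reclaimed."""
    state = load_checkpoint(path)
    while state["step"] < num_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return state  # simulated preemption: unsaved progress is lost
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling run a second time with the same path picks up from the last saved step rather than from zero. In a real deployment the checkpoint path would point at storage that outlives the instance, such as a shared filesystem or an object store; that choice is left out of this sketch.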

What I Offer

Architecture review — evaluate your workloads for C/R readiness and identify the right approach (CRIU, in-app checkpointing, MemVerge MemMachine, or hybrid)
Integration — implement C/R into existing pipelines on AWS Batch, Kubernetes, SLURM, or custom schedulers
AI memory management — leverage memory platforms to make model context portable across LLMs and infrastructure
Cost optimization — combine C/R with spot/preemptible instances to cut compute costs while maintaining reliability
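
One question that comes up when pairing C/R with spot instances is how often to checkpoint. A rough starting point is Young's first-order approximation, which assumes that interruptions arrive at random with a known mean time between them and that writing a checkpoint is cheap relative to that mean. The sketch below is an illustration under those assumptions, not a sizing tool; the function and parameter names are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s, mean_time_between_interruptions_s):
    """Young's approximation for the checkpoint interval that minimizes
    expected wasted time: sqrt(2 * C * M)."""
    if checkpoint_cost_s <= 0 or mean_time_between_interruptions_s <= 0:
        raise ValueError("both arguments must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mean_time_between_interruptions_s)


def expected_overhead(interval_s, checkpoint_cost_s, mean_time_between_interruptions_s):
    """First-order fraction of time lost: time spent writing checkpoints
    (C / T) plus expected rework after an interruption (T / (2 * M))."""
    return (checkpoint_cost_s / interval_s
            + interval_s / (2.0 * mean_time_between_interruptions_s))
```

For example, with a hypothetical checkpoint cost of 60 seconds and a mean of two hours between interruptions, young_interval(60, 7200) suggests checkpointing roughly every 15 to 16 minutes. Real interruption patterns vary by instance type and region, so measured values should replace these placeholders.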

Background

As an HPC Developer Advocate at AWS, I helped bring Checkpoint/Restore integration to AWS Batch for bioinformatics and HPC workloads. At MemVerge, I work on MemMachine, an AI memory platform that makes context portable across different LLMs and cloud infrastructure.

This page is a work in progress. More details coming soon.

Book a call to discuss your needs