HPC Disaster Recovery & Emergency Response

When Standard DR Fails

Standard enterprise DR plans, built on "replicate everything," fail in HPC. Replicating 20 petabytes of scratch data in real time is neither economically nor technically practical. Our Tiered Resilience Strategy prioritizes metadata over data and essential capabilities over convenience, so that your institution can survive a catastrophic event.

1. The HPC Emergency Response Plan (ERP)

Thermal Panic Protocol

HPC racks are dense enough that a cooling failure can push hardware past safe operating temperatures within minutes. We implement automated "Soft Kill" (pause the scheduler and checkpoint jobs) and "Hard Kill" (cut compute power) scripts to protect the infrastructure before heat causes permanent damage.
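The two-stage escalation can be sketched as a small planning function. This is a minimal illustration, not a production script: the thresholds, the function name `plan_thermal_response`, and its inputs are hypothetical, and the sketch only builds command lists rather than running them. It assumes a Slurm scheduler and IPMI-capable nodes. Suspending jobs stands in for checkpointing here, since real checkpointing depends on the application.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real values depend on hardware and vendor guidance.
SOFT_KILL_C = 35.0   # inlet temperature that triggers a Soft Kill
HARD_KILL_C = 45.0   # inlet temperature that triggers a Hard Kill


@dataclass(frozen=True)
class Plan:
    action: str                 # "none", "soft_kill", or "hard_kill"
    commands: list[list[str]]   # commands to run, in order


def plan_thermal_response(inlet_c, partitions, running_jobs, bmc_hosts):
    """Map an inlet temperature reading to an ordered list of commands."""
    if inlet_c >= HARD_KILL_C:
        # Hard Kill: power off compute nodes through their BMCs.
        # Credentials (-U/-P) are omitted from this sketch.
        cmds = [["ipmitool", "-I", "lanplus", "-H", host,
                 "chassis", "power", "off"] for host in bmc_hosts]
        return Plan("hard_kill", cmds)
    if inlet_c >= SOFT_KILL_C:
        # Soft Kill: stop new jobs from starting, then suspend running jobs.
        cmds = [["scontrol", "update", f"PartitionName={p}", "State=DOWN"]
                for p in partitions]
        cmds += [["scontrol", "suspend", str(job)] for job in running_jobs]
        return Plan("soft_kill", cmds)
    return Plan("none", [])
```

Keeping the decision separate from execution lets the escalation logic be tested without touching the scheduler or any hardware.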

Cyber Kill-Switch

In case of active intrusion, we use a "Containment Mode" script. Instead of powering nodes off (which destroys evidence held in RAM), we change border router ACLs to drop all external user traffic while preserving internal management access for forensics.
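The policy behind Containment Mode can be modeled in a few lines. The sketch below is not router configuration, because ACL syntax is vendor-specific. It only illustrates the permit/deny decision, and the management subnet and function name are hypothetical.

```python
import ipaddress

# Hypothetical address range; substitute the site's real management networks.
MANAGEMENT_NETS = [ipaddress.ip_network("10.10.0.0/24")]


def containment_verdict(src_ip, containment_on):
    """Return "permit" or "deny" for a packet arriving at the border."""
    if not containment_on:
        return "permit"          # normal operation: the usual ACLs apply instead
    addr = ipaddress.ip_address(src_ip)
    if any(addr in net for net in MANAGEMENT_NETS):
        return "permit"          # forensics and admin access stay reachable
    return "deny"                # all other traffic is dropped
```

For example, `containment_verdict("203.0.113.7", True)` returns `"deny"`, while an address inside the management subnet is still permitted.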

2. The Tiered Data Recovery Model

Data Tier	Priority / Strategy	RTO (Recovery Time Objective)
Tier 0: Metadata	Synchronous Replication (MDTs, Slurm DB, LDAP). Critical for FS integrity.	< 4 Hours
Tier 1: Home/Source	Daily Incremental Backups (Code, Binaries) to Tape or S3 Glacier.	< 24 Hours
Tier 2: Project Data	Async Snapshots on local storage. No offsite backup (cost-saving).	1-3 Days
Tier 3: Scratch	Sacrificial. Intermediate files; users accept 100% loss risk.	N/A (Re-run)
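One way to make the tier table actionable is to encode it as data that scripts can query. The sketch below copies the RTO values from the table above. The mount points and the helper names `tier_for_path` and `recovery_order` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    strategy: str
    rto_hours: Optional[float]   # None means "no recovery; re-run the job"


# Values taken from the table above; "1-3 Days" is stored as its upper bound.
TIERS = [
    Tier("Tier 0: Metadata", "synchronous replication", 4),
    Tier("Tier 1: Home/Source", "daily incremental backup", 24),
    Tier("Tier 2: Project Data", "local async snapshots", 72),
    Tier("Tier 3: Scratch", "none (sacrificial)", None),
]

# Hypothetical mount points; real paths differ per site.
PATH_TO_TIER = {"/home": 1, "/projects": 2, "/scratch": 3}


def tier_for_path(path):
    """Return the Tier for a file path, or None if the path is unclassified."""
    for prefix, index in PATH_TO_TIER.items():
        if path == prefix or path.startswith(prefix + "/"):
            return TIERS[index]
    return None


def recovery_order():
    """Tiers that are actually restored, most urgent first."""
    return sorted((t for t in TIERS if t.rto_hours is not None),
                  key=lambda t: t.rto_hours)
```

A user-facing tool could call `tier_for_path` to warn people before they store irreplaceable results on scratch.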

3. Specialized Technical Recovery

Parallel Filesystem Recovery

After a hard crash, we never mount Lustre/GPFS read-write immediately. We follow a strict protocol:

Metadata Check: Run lfsck on Metadata Targets only.
Read-Only Mount: Verify directory structures on a single client node.
Degraded Mount: Bring OSTs online one by one to isolate corrupted hardware.
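The ordering of these three stages can be sketched as a small driver. The four hooks passed in are hypothetical placeholders for site-specific scripts (the actual filesystem commands are deliberately left out). The sketch only shows the control flow: stop at the first failed stage, and activate OSTs individually so that a failure identifies one target.

```python
def staged_recovery(check_metadata, mount_read_only, ost_names, activate_ost):
    """Run the three stages in order; stop early if an early stage fails.

    All hooks are hypothetical and supplied by the caller: the first two
    return True on success, and activate_ost(name) returns True if that
    OST came online cleanly.
    """
    if not check_metadata():
        return {"stage": "metadata_check", "ok": False, "bad_osts": []}
    if not mount_read_only():
        return {"stage": "read_only_mount", "ok": False, "bad_osts": []}

    bad_osts = []
    for name in ost_names:
        # One OST at a time, so a failure points at a specific target.
        if not activate_ost(name):
            bad_osts.append(name)
    return {"stage": "degraded_mount", "ok": not bad_osts, "bad_osts": bad_osts}
```

The returned dictionary names the stage that was reached, which gives operators a clear record of where recovery stopped.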

"Cloud Bursting" as DR

If the physical datacenter is destroyed, the cloud may be the only viable recovery site. We implement a "Pilot Light" strategy:

Maintain dormant cluster images on AWS ParallelCluster or Azure CycleCloud.
Continuously sync Tier 1 data via Globus to S3 buckets.
Spin up thousands of cores on demand in case of total site loss.
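The sync and activation steps might be wrapped as follows. This is a sketch under assumptions: it uses the AWS ParallelCluster 3 CLI and the Globus CLI, all endpoint IDs, paths, and cluster names are placeholders, and commands are only printed unless `dry_run` is turned off.

```python
import subprocess


def sync_command(src_endpoint, src_path, dst_endpoint, dst_path):
    """Globus CLI call that transfers only new or changed files."""
    return ["globus", "transfer",
            f"{src_endpoint}:{src_path}", f"{dst_endpoint}:{dst_path}",
            "--recursive", "--sync-level", "mtime"]


def activate_command(cluster_name, config_file, region):
    """AWS ParallelCluster 3 CLI call that builds the dormant cluster."""
    return ["pcluster", "create-cluster",
            "--cluster-name", cluster_name,
            "--cluster-configuration", config_file,
            "--region", region]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
```

Defaulting to a dry run is a deliberate choice: a DR runbook should be easy to rehearse without creating billable cloud resources.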

Disaster Recovery Toolset

Category	Tool	Usage
File System Health	lfsck / mmfsck	Critical consistency checks for Lustre and GPFS metadata/objects.
Cloud DR	AWS ParallelCluster	Spinning up elastic "Pilot Light" clusters in case of site disaster.
Data Sync	Globus	High-speed asynchronous replication of Tier 1 data to the cloud.
State Persistence	Slurm StateSaveLocation	Ensuring the job queue and controller state survive controller failure; the accounting database (slurmdbd) is replicated separately under Tier 0.
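As a small illustration of the state-persistence row, the sketch below checks slurm.conf text for two settings that matter for controller failover: `StateSaveLocation` and a second `SlurmctldHost` line. The parser is a simplification (one setting per line, `#` always starts a comment), and the function names are hypothetical.

```python
def parse_slurm_conf(text):
    """Collect key=value settings from slurm.conf text (keys lower-cased)."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()      # drop comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings.setdefault(key.strip().lower(), []).append(value.strip())
    return settings


def state_persistence_issues(text):
    """Return a list of human-readable findings; empty means no issue found."""
    issues = []
    conf = parse_slurm_conf(text)
    if "statesavelocation" not in conf:
        issues.append("StateSaveLocation is not set explicitly")
    if len(conf.get("slurmctldhost", [])) < 2:
        issues.append("no backup controller (fewer than two SlurmctldHost lines)")
    return issues
```

The check cannot confirm that the state directory sits on storage both controllers can reach, so that still needs manual verification.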

Protect Your Institution's Future

Download our "HPC Tiered Disaster Recovery Plan" to define RTOs and RPOs that fit your budget and research needs.
