Disaster Recovery & Emergency Response
Resilience at Scale: Prioritizing Metadata, Capabilities, and Survival.
When Standard DR Fails
Standard enterprise DR plans, built on "replicate everything," fail in HPC. Replicating 20 petabytes of scratch data in real time is neither economically nor technically practical. Our Tiered Resilience Strategy prioritizes metadata over data and essential capabilities over convenience, so your institution can survive a catastrophic event.
1. The HPC Emergency Response Plan (ERP)
Thermal Panic Protocol
HPC racks are so dense that a cooling failure can push hardware past safe operating temperatures within minutes. We implement automated "Soft Kill" (pause the scheduler and checkpoint jobs) and "Hard Kill" (cut compute power) scripts to protect the infrastructure before heat causes permanent damage.
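As a rough illustration of the two-stage escalation described above, the following minimal sketch maps temperature readings to a Soft Kill or Hard Kill decision. The thresholds, the `Action` enum, and `decide_action` are hypothetical names chosen for this example; real limits depend on the hardware and the facility, and the actual scheduler and power commands are deliberately left out.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    SOFT_KILL = "soft_kill"  # pause the scheduler and checkpoint jobs
    HARD_KILL = "hard_kill"  # cut compute power


# Hypothetical thresholds in degrees Celsius; tune these per site.
SOFT_LIMIT_C = 35.0
HARD_LIMIT_C = 45.0


def decide_action(inlet_temps_c, soft_limit=SOFT_LIMIT_C, hard_limit=HARD_LIMIT_C):
    """Return the most severe action required by the hottest sensor reading."""
    if not inlet_temps_c:
        # A missing reading is itself an alarm condition; surface it loudly.
        raise ValueError("no sensor readings available")
    hottest = max(inlet_temps_c)
    if hottest >= hard_limit:
        return Action.HARD_KILL
    if hottest >= soft_limit:
        return Action.SOFT_KILL
    return Action.NONE


if __name__ == "__main__":
    print(decide_action([24.0, 26.5, 25.1]))
    print(decide_action([24.0, 37.2, 25.1]))
    print(decide_action([24.0, 37.2, 46.0]))
```

Keeping the decision separate from the commands that act on it makes the thresholds easy to test without touching the scheduler or the power distribution units.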
Cyber Kill-Switch
In case of active intrusion, we use a "Containment Mode" script. Instead of powering systems off (which destroys volatile evidence in RAM) or pulling network cables (which also cuts off our own access), we change Border Router ACLs to drop all external user traffic while preserving internal management access for forensics.
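The containment rule above amounts to "permit management sources, deny everything else." A minimal sketch of that decision follows, using Python's standard `ipaddress` module. The management range `10.10.0.0/16` and the function name `containment_verdict` are assumptions made for illustration; the real rule set lives in the border router's own ACL syntax, which is not shown here.

```python
import ipaddress

# Hypothetical internal management network; replace with the site's own ranges.
MANAGEMENT_NETS = [ipaddress.ip_network("10.10.0.0/16")]


def containment_verdict(src_ip, management_nets=MANAGEMENT_NETS):
    """Return "permit" for management sources and "deny" for all other traffic."""
    addr = ipaddress.ip_address(src_ip)
    if any(addr in net for net in management_nets):
        return "permit"
    return "deny"


if __name__ == "__main__":
    for ip in ("10.10.3.7", "203.0.113.9"):
        print(ip, containment_verdict(ip))
```

A default-deny rule like this fails closed: any source not explicitly listed as management is dropped.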
2. The Tiered Data Recovery Model
| Data Tier | Priority / Strategy | RTO (Recovery Time) |
|---|---|---|
| Tier 0: Metadata | Synchronous Replication (MDTs, Slurm DB, LDAP). Critical for FS integrity. | < 4 Hours |
| Tier 1: Home/Source | Daily Incremental Backups (Code, Binaries) to Tape or S3 Glacier. | < 24 Hours |
| Tier 2: Project Data | Async Snapshots on local storage. No offsite backup (cost-saving). | 1-3 Days |
| Tier 3: Scratch | Sacrificial. Intermediate files; users accept 100% loss risk. | N/A (Re-run) |
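The tier table can also be kept as data, so that recovery runbooks and monitoring read from a single definition. The sketch below encodes the four tiers and derives a restore order from their RTOs. The `Tier` class and `recovery_order` function are hypothetical names, and the RTO figures are the upper bounds from the table, with "1-3 Days" taken as 72 hours.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    strategy: str
    rto_hours: Optional[float]  # None means no recovery: the data is re-generated


TIERS = {
    0: Tier("Metadata", "synchronous replication", 4),
    1: Tier("Home/Source", "daily incremental backup", 24),
    2: Tier("Project Data", "async local snapshots", 72),
    3: Tier("Scratch", "sacrificial", None),
}


def recovery_order(tiers=TIERS):
    """Return the tier numbers that get restored, tightest RTO first."""
    recoverable = [(num, tier) for num, tier in tiers.items() if tier.rto_hours is not None]
    return [num for num, tier in sorted(recoverable, key=lambda item: item[1].rto_hours)]


if __name__ == "__main__":
    for num in recovery_order():
        tier = TIERS[num]
        print(f"Tier {num}: {tier.name} within {tier.rto_hours} h via {tier.strategy}")
```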
3. Specialized Technical Recovery
Parallel Filesystem Recovery
After a hard crash, we never mount Lustre/GPFS read-write immediately. We follow a strict protocol:
- Metadata Check: Run lfsck on Metadata Targets only.
- Read-Only Mount: Verify directory structures on a single client node.
- Degraded Mount: Bring OSTs online one by one to isolate corrupted hardware.
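The protocol above is a gated sequence: each stage must pass before the next begins, and storage targets are brought online individually so a bad one can be isolated. A minimal sketch of that control flow follows. The function names and the OST labels are hypothetical, and the checks are stand-in callables rather than real filesystem commands.

```python
def run_staged_recovery(steps):
    """Run (name, check) pairs in order and stop at the first failing check.

    Returns the list of completed step names and the name of the failed
    step, or None if every step passed.
    """
    completed = []
    for name, check in steps:
        if not check():
            return completed, name
        completed.append(name)
    return completed, None


def bring_osts_online(ost_names, activate):
    """Activate OSTs one at a time, separating healthy targets from suspect ones."""
    healthy, suspect = [], []
    for ost in ost_names:
        (healthy if activate(ost) else suspect).append(ost)
    return healthy, suspect


if __name__ == "__main__":
    # Stand-in checks; a real runbook would wrap the site's own tooling here.
    steps = [
        ("metadata check", lambda: True),
        ("read-only mount", lambda: True),
        ("degraded mount", lambda: True),
    ]
    print(run_staged_recovery(steps))
    print(bring_osts_online(["OST0000", "OST0001"], lambda ost: ost != "OST0001"))
```

Stopping at the first failed stage keeps a damaged filesystem from being mounted read-write before its metadata has been verified.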
"Cloud Bursting" as DR
If the physical datacenter is destroyed, the cloud is the only viable site. We implement a "Pilot Light" strategy:
- Maintain dormant cluster images on AWS ParallelCluster or Azure CycleCloud.
- Continuously sync Tier 1 data via Globus to S3 buckets.
- Spin up thousands of cores on demand in case of total site loss.
Disaster Recovery Toolset
| Category | Tool | Usage |
|---|---|---|
| File System Health | lfsck / mmfsck | Critical consistency checks for Lustre and GPFS metadata/objects. |
| Cloud DR | AWS ParallelCluster | Spinning up elastic "Pilot Light" clusters in case of site disaster. |
| Data Sync | Globus | High-speed asynchronous replication of Tier 1 data to the cloud. |
| State Persistence | Slurm StateSaveLocation | Preserving scheduler and job queue state across controller failure; accounting data is held separately by slurmdbd. |
Protect Your Institution's Future
Download our "HPC Tiered Disaster Recovery Plan" to define RTOs and RPOs that fit your budget and research needs.