Infrastructure Maintenance Training for HPC is about moving from "Reactive Firefighting" to "Proactive Reliability."

An HPC cluster is a Formula 1 car, not a family sedan. It runs hot, it runs fast, and parts break constantly. If your staff waits for a red light on a dashboard to fix something, the cluster is already losing money. Effective training focuses on Predictive Maintenance, Hardware Swapping Drills, and Root Cause Analysis.

Here is the detailed breakdown of the maintenance curriculum (Hardware, System, Facility), the "Game Day" drill methodology, and the toolset.

1. The Maintenance Layers

Staff need to be trained on three distinct physical layers; a minimal per-layer check is sketched after this list.

A. The Hardware Layer (The Metal)

B. The System Layer (The OS)

C. The Facility Layer (Power & Cooling)
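
To make the three layers concrete, here is a minimal Python sketch of a per-node check that touches each one: sensor readings from the BMC (hardware), failed services from the OS (system), and chassis power state (facility). The `<node>-bmc` naming convention, the `IPMI_USER` / `IPMI_PASS` environment variables, and the specific checks are illustrative assumptions, not part of the original curriculum.

```python
#!/usr/bin/env python3
"""Minimal per-node health check sketch covering the three maintenance layers.

Assumptions (not from the source text): BMCs answer at <node>-bmc and IPMI
credentials come from the IPMI_USER / IPMI_PASS environment variables.
"""
import os
import subprocess

def ipmi(bmc: str, *args: str) -> str:
    """Run an ipmitool command against a BMC over the network and return stdout."""
    cmd = [
        "ipmitool", "-I", "lanplus", "-H", bmc,
        "-U", os.environ["IPMI_USER"], "-P", os.environ["IPMI_PASS"],
        *args,
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def check_node(node: str) -> None:
    bmc = f"{node}-bmc"  # hypothetical BMC hostname convention

    # A. Hardware layer: fan speeds, voltages, CPU/DIMM temperatures.
    print(ipmi(bmc, "sensor"))

    # C. Facility layer: is the chassis actually receiving power? Inlet
    # temperature (rack-level cooling) also appears in the sensor dump above.
    print(ipmi(bmc, "chassis", "power", "status"))

    # B. System layer: ask the OS over ssh which services have failed.
    failed = subprocess.run(
        ["ssh", node, "systemctl", "--failed", "--no-legend"],
        capture_output=True, text=True,
    ).stdout
    print(failed or f"{node}: no failed systemd units")

if __name__ == "__main__":
    check_node("node001")
```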

2. The "Game Day" Methodology

Lectures don't work for maintenance. You need Chaos Engineering drills.
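
A drill only teaches something if it is unannounced and measurable. Below is a minimal Python sketch of one such drill: pick a random compute node, hard power it off through its BMC, and record the start time so the team can measure how long detection, drain, and re-imaging take. The node list, the `<node>-bmc` convention, and the IPMI credentials in environment variables are assumptions for illustration; run anything like this only against nodes you are allowed to break.

```python
#!/usr/bin/env python3
"""Minimal "Game Day" drill sketch: simulate a sudden node failure.

Assumptions (not from the source text): nodes are named node001..node064,
BMCs answer at <node>-bmc, and IPMI credentials come from the environment.
"""
import datetime
import os
import random
import subprocess

NODES = [f"node{i:03d}" for i in range(1, 65)]  # hypothetical node list

def power_off(node: str) -> None:
    """Hard power-off through the BMC, simulating an abrupt hardware failure."""
    subprocess.run(
        [
            "ipmitool", "-I", "lanplus",
            "-H", f"{node}-bmc",
            "-U", os.environ["IPMI_USER"],
            "-P", os.environ["IPMI_PASS"],
            "chassis", "power", "off",
        ],
        check=True,
    )

if __name__ == "__main__":
    victim = random.choice(NODES)
    started = datetime.datetime.now().isoformat(timespec="seconds")
    print(f"[{started}] Drill: powering off {victim}. Start the stopwatch.")
    power_off(victim)
    # Scoring happens outside the script: how long until someone is paged,
    # how long until the node is drained, repaired, and re-imaged?
```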

3. Operational Best Practices

4. Key Applications & Tools

| Category | Tool | Usage |
| --- | --- | --- |
| Provisioning | Warewulf / xCAT | The standard tools for stateless cluster management; nodes are re-imaged on every reboot. |
| Monitoring | Nagios / Icinga | The "Red Light" dashboard. Alerts staff if a temperature sensor goes above threshold. |
| Metrics | Prometheus + Grafana | The "Trend" dashboard. Shows whether CPU temperature is slowly creeping up over six months, indicating dust buildup (see the sketch after this table). |
| Hardware | IPMItool | Command-line tool that talks to the Baseboard Management Controller (BMC). Used to remotely power-cycle a frozen node. |
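
As an example of the "Trend" dashboard idea, here is a hedged Python sketch that asks Prometheus whether each node's average temperature over the past week is higher than it was six months ago, which is the slow creep that points at dust buildup or a tired fan. The Prometheus URL, the `node_hwmon_temp_celsius` metric (exposed by node_exporter's hwmon collector), and the 5 °C threshold are assumptions to adapt to your environment.

```python
#!/usr/bin/env python3
"""Trend-check sketch: flag nodes whose average temperature has crept up.

Assumptions (not from the source text): Prometheus runs at
http://prometheus:9090 and node_exporter exposes node_hwmon_temp_celsius.
"""
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://prometheus:9090"  # hypothetical endpoint
QUERY = (
    "avg_over_time(node_hwmon_temp_celsius[7d])"
    " - avg_over_time(node_hwmon_temp_celsius[7d] offset 180d)"
)

def query_prometheus(expr: str) -> list:
    """Run an instant query against the Prometheus HTTP API and return the result vector."""
    url = f"{PROMETHEUS}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["data"]["result"]

if __name__ == "__main__":
    for series in query_prometheus(QUERY):
        labels = series["metric"]
        delta = float(series["value"][1])
        if delta > 5.0:  # arbitrary threshold: 5 degC warmer than six months ago
            print(f"{labels.get('instance', '?')} {labels.get('sensor', '')}: "
                  f"+{delta:.1f} degC vs. six months ago, schedule a cleaning")
```

In practice the same expression would live in a Grafana panel or a Prometheus alerting rule rather than a script; the training goal is that staff can read the trend, not which tool they read it with.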