Security Assessments & Vulnerability Analysis in HPC is about finding the cracks in the "Soft Center" without crashing the supercomputer.
Standard IT security tools (like aggressive port scanners) can actually knock over an HPC cluster by saturating the low-latency network or freezing the scheduler. Therefore, HPC security assessments require a specialized "White Hat" approach that respects the delicate performance requirements of the hardware.
The sections below cover the assessment methodology, the HPC-specific attack surfaces (Scheduler & Interconnect), and the remediation workflow.
1. The Assessment Methodology
We use a tiered approach to probe the system without causing an outage.
- Phase 1: Discovery (The Scan)
  - Action: Use tools like OpenVAS or Nessus to scan the Login Nodes and Management Nodes for outdated packages with known CVEs. These are active scanners that send real probes, so rate-limit them.
  - HPC Rule: Never scan the high-speed interconnect (InfiniBand/Omni-Path) with standard Ethernet scanners. It can cause a "Broadcast Storm" that crashes the fabric.
- Phase 2: Configuration Audit (The Review)
  - Action: Review the config files of the Scheduler (Slurm/PBS) and the Filesystem (Lustre/GPFS).
  - Check: Can users supply or modify the "Prolog/Epilog" scripts that the scheduler runs? If yes, can they use them to trigger root commands? Are setuid binaries allowed on the shared storage?
- Phase 3: Active Penetration Test (The Red Team)
  - Action: We give a security expert a standard "Student" account.
  - Goal: Can they escalate privileges to root? Can they read the data in another user's folder? Can they crash a compute node?
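
The Phase 1 rule about keeping scanners off the interconnect can be enforced before a scan is launched. The following is a minimal Python sketch, not a definitive implementation: the subnet ranges and the function name `filter_scan_targets` are hypothetical and chosen only for illustration, so a real site would substitute its own address plan.

```python
import ipaddress

# Hypothetical address plan, for illustration only.
SCAN_ALLOWED = [ipaddress.ip_network("10.10.0.0/24")]      # login/management Ethernet
FABRIC_FORBIDDEN = [ipaddress.ip_network("10.20.0.0/16")]  # IP addresses on the interconnect


def filter_scan_targets(candidates):
    """Split candidate addresses into (approved, rejected) lists.

    An address is approved only if it lies inside an allowed subnet
    and outside every forbidden interconnect subnet.
    """
    approved, rejected = [], []
    for raw in candidates:
        addr = ipaddress.ip_address(raw)
        in_scope = any(addr in net for net in SCAN_ALLOWED)
        on_fabric = any(addr in net for net in FABRIC_FORBIDDEN)
        if in_scope and not on_fabric:
            approved.append(raw)
        else:
            rejected.append(raw)
    return approved, rejected
```

Only the approved list would be handed to the scanner. Defaulting to rejection means an address that nobody classified is never scanned by accident.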
2. Common HPC Vulnerabilities
HPC systems have unique weak points that standard web servers do not.
- The Scheduler (Privilege Escalation):
  - Risk: Scheduler daemons such as Slurm's slurmd run as root on the compute nodes to manage resources. If a user can trick the scheduler into running a script they wrote with those privileges, they become root on the entire cluster.
- The Shared Filesystem (Data Leakage):
  - Risk: Parallel filesystems (Lustre) rely on client-side trust: the servers accept the user identity that the client reports. If a user can mount the filesystem on their own laptop (connected to the network), where they are root, they can often bypass user permissions and read everything.
- Container Breakout:
  - Risk: Users bring their own Docker containers. If a container is not properly isolated (e.g., it runs with --privileged), the user can "break out" of the container and attack the host kernel.
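
The container risk above can be screened for before a job is accepted. The sketch below is an assumption-laden illustration in Python: the function name `risky_container_flags` and the flag list are hypothetical choices, the list is not exhaustive, and it only matches flags written as a single token (it would miss `--cap-add SYS_ADMIN` written with a space).

```python
import shlex

# Docker flags that weaken isolation. Illustrative, not exhaustive.
RISKY_FLAGS = {
    "--privileged": "grants all capabilities and access to host devices",
    "--pid=host": "shares the host process namespace",
    "--net=host": "shares the host network namespace",
    "--network=host": "shares the host network namespace",
    "--cap-add=SYS_ADMIN": "grants a very broad kernel capability",
}


def risky_container_flags(command_line):
    """Return the risky flags that appear in a `docker run` style command line."""
    tokens = shlex.split(command_line)
    return [tok for tok in tokens if tok in RISKY_FLAGS]
```

For example, `risky_container_flags("docker run --privileged -it ubuntu bash")` returns `["--privileged"]`. A check like this is a first filter only; it does not replace running containers under an unprivileged runtime.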
3. The "Zero Trust" Mitigation Strategy
Since you cannot encrypt every packet (it's too slow), you rely on strict segmentation.
- Mitigation 1: Root Squashing:
  - Configure the filesystem (NFS/Lustre) so that even if a user is root on their own node, the filesystem treats them as the unprivileged user nobody.
- Mitigation 2: Prolog/Epilog Sanitization:
  - Ensure that any script running before or after a job is strictly controlled by the admin, not the user.
- Mitigation 3: Outbound Blocking:
  - Compute nodes should have no route to the internet. This prevents malware from "phoning home" to a Command & Control server.
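
Mitigation 2 can be partly verified by checking who is able to modify the prolog and epilog scripts. The following Python sketch assumes the auditor already knows the script path (for Slurm, the value of `Prolog=` or `Epilog=` in slurm.conf). The function name `tamper_findings` is hypothetical, and the check covers only the script and its immediate parent directory, not the full path chain.

```python
import os
import stat


def tamper_findings(path, trusted_uid=0):
    """List reasons why `path` could be changed by someone other than trusted_uid.

    An empty list means no problem was found by these checks.
    """
    findings = []
    # A writable parent directory lets a user replace the script even
    # when the file itself is locked down, so both are inspected.
    for target in (path, os.path.dirname(os.path.abspath(path))):
        info = os.stat(target)
        if info.st_uid != trusted_uid:
            findings.append(f"{target}: owned by uid {info.st_uid}")
        if info.st_mode & (stat.S_IWGRP | stat.S_IWOTH):
            findings.append(f"{target}: group- or world-writable")
    return findings
```

Run as `tamper_findings("/etc/slurm/prolog.sh")`, where the path is only an example, an empty result suggests the script is owned by root and not writable by other users.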
4. Key Applications & Tools
| Category | Tool | Usage |
| --- | --- | --- |
| Scanner | OpenVAS / Nessus | Standard vulnerability scanning for the Login/Mgmt nodes. Identifies old kernels and unpatched SSH versions. |
| Audit | Lynis | A security auditing tool for Linux. It checks for weak password policies, open ports, and file permissions. |
| HPC Specific | Check_Slurm | Scripts designed to check for known misconfigurations in the Slurm workload manager. |
| Forensics | Auditd | The Linux Audit Daemon. Essential for tracking who ran that sudo command three weeks ago. |
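
To make the Auditd row concrete, the sketch below pulls the "who ran that sudo command" answer out of an audit record. It is a hedged illustration in Python: the sample line is made up for the example, real records differ between distributions and auditd versions, and in practice `ausearch -m USER_CMD -i` is the usual way to query these logs. The name `parse_user_cmd` is hypothetical.

```python
import re
from datetime import datetime, timezone

# Made-up USER_CMD record for illustration; field layout varies in practice.
SAMPLE = (
    "type=USER_CMD msg=audit(1700000000.123:456): pid=4242 uid=1000 auid=1000 "
    "ses=3 msg='cwd=\"/home/alice\" cmd=6C73202F726F6F74 "
    "exe=\"/usr/bin/sudo\" terminal=pts/0 res=success'"
)

HEADER = re.compile(r"type=USER_CMD msg=audit\((\d+)\.\d+:\d+\)")
AUID = re.compile(r"\bauid=(\d+)")
CMD = re.compile(r'\bcmd=("[^"]*"|\S+)')


def parse_user_cmd(line):
    """Return {'when', 'auid', 'cmd'} for a USER_CMD record, or None otherwise."""
    header, auid, cmd = HEADER.search(line), AUID.search(line), CMD.search(line)
    if not (header and auid and cmd):
        return None
    raw = cmd.group(1)
    if raw.startswith('"'):
        command = raw.strip('"')          # plain quoted value
    else:
        try:
            # Assumption: unquoted values are hex-encoded command strings.
            command = bytes.fromhex(raw).decode("utf-8", errors="replace")
        except ValueError:
            command = raw                 # leave unrecognized values untouched
    when = datetime.fromtimestamp(int(header.group(1)), tz=timezone.utc)
    return {"when": when, "auid": int(auid.group(1)), "cmd": command}
```

Calling `parse_user_cmd(SAMPLE)` yields the decoded command `ls /root` together with the `auid`, the login UID that stays attached to the session even after the user switches to root, which is what makes the record useful weeks later.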