Implementing advanced access control in HPC is fundamentally different from doing so in enterprise IT. In enterprise IT, you isolate machines. In HPC, you must isolate jobs while allowing them to run across thousands of machines simultaneously.
Here is the implementation guide for a "Gateway & Ticket" access model, designed to secure the cluster without blocking science.
1. The Architecture: "Gateway & Ticket"
Moving beyond simple passwords or static SSH keys is essential. The modern standard uses Short-Lived Certificates for entry and Job-Based Authorization for internal movement.
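From a researcher's point of view, the flow might look like the sketch below (the hpc-cert helper, hostnames, and node names are placeholders, not a specific product):
Bash
# Entry: obtain a short-lived SSH certificate from the site CA
hpc-cert login                   # hypothetical helper; writes ~/.ssh/id_ed25519-cert.pub
ssh login.cluster.example.edu    # the login node trusts the CA, not static keys

# Internal movement: compute-node access is tied to an active job
sbatch job.sh                    # schedule work as usual
squeue -u $USER                  # see which node the job landed on
ssh c0123                        # permitted only while your job is running there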
2. Perimeter: Implementing SSH Certificates
Static SSH keys (public/private keypairs) are a security liability because they never expire. If a researcher's laptop is stolen, your cluster is compromised until you manually revoke that key.
Implementation Plan:
Bash
# /etc/ssh/sshd_config on the login nodes

# Trust the CA's public key
TrustedUserCAKeys /etc/ssh/user_ca.pub

# Revocation list (crucial for immediate bans)
RevokedKeys /etc/ssh/revoked_keys

# Leave static keys enabled during the transition period;
# set this to "no" once everyone has moved to certificates
PubkeyAuthentication yes
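Issuing the short-lived certificates themselves is a standard ssh-keygen operation. A minimal sketch, assuming the CA private key lives at /etc/ssh/user_ca and "jdoe" is the researcher's username (both placeholders):
Bash
# Sign the researcher's public key; the certificate expires after 12 hours
ssh-keygen -s /etc/ssh/user_ca \
    -I "jdoe@university.example.edu" \
    -n jdoe \
    -V +12h \
    id_ed25519.pub
# Produces id_ed25519-cert.pub, which the researcher presents at login
In production you would wrap this signing step in an issuance service tied to your identity provider rather than running it by hand.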
3. Internal: Locking Down Compute Nodes
A common breach pattern: a user logs into a compute node where they have no job and uses it to mine crypto or snoop on another user's memory.
The Solution: pam_slurm_adopt
This is a Slurm-specific PAM module that rejects SSH connections to compute nodes unless the user has an active job running on that specific node.
Implementation Steps:
Bash
# /etc/pam.d/sshd (account stack) on every compute node
account    required    pam_slurm_adopt.so action_no_jobs=deny
Bash
# /etc/slurm/slurm.conf
TaskPlugin=task/cgroup
PrologFlags=Contain   # Ensures SSH sessions are adopted into the job's cgroup
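After deployment, a quick smoke test might look like this (the node name c001 and the time limit are placeholders):
Bash
# From a login node, with no job on c001: the connection is rejected by pam_slurm_adopt
ssh c001

# Allocate a job on that node, then SSH in; the session is adopted
# into the job's cgroup and shares its CPU/memory limits
salloc -w c001 -t 00:30:00
ssh c001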
4. Federated Identity: "Bring Your Own Identity"
For research collaborations (e.g., a multi-university grant), creating local accounts for every external collaborator is unmanageable.
Implementation: CILogon
Use CILogon to bridge federated university credentials (InCommon/eduGAIN) with your local Linux accounts.
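For the web side (Open OnDemand in the checklist below), the bridge is typically mod_auth_openidc pointed at CILogon. A minimal sketch; the client ID, secret, redirect URI, and claim mapping are placeholders you would obtain when registering your portal with CILogon:
Apache
# /etc/httpd/conf.d/auth_openidc.conf
OIDCProviderMetadataURL https://cilogon.org/.well-known/openid-configuration
OIDCClientID            cilogon:/client_id/REPLACE_ME
OIDCClientSecret        REPLACE_ME
OIDCRedirectURI         https://ondemand.example.edu/oidc
OIDCRemoteUserClaim     eppn
OIDCScope               "openid email profile org.cilogon.userinfo"

<Location "/">
  AuthType  openid-connect
  Require   valid-user
</Location>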
5. Privileged Access Management (PAM) for Admins
Admins should never log in as root. They should log in as themselves and escalate only when necessary.
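A minimal sudo policy along these lines, assuming an "hpcadmins" Unix group for your operators (the group name and log path are placeholders):
Bash
# /etc/sudoers.d/hpc-admins  (edit with: visudo -f /etc/sudoers.d/hpc-admins)
%hpcadmins  ALL=(ALL)  ALL

# Log every escalation, including full session I/O (replayable with sudoreplay)
Defaults:%hpcadmins  log_input, log_output
Defaults             logfile=/var/log/sudo.log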
Summary Configuration Checklist
| Control | Location | Configuration / Tool |
| Kill Static Keys | Login Nodes | TrustedUserCAKeys |
| Block Node Hopping | Compute Nodes | pam_slurm_adopt |
| Web Portal Auth | Open OnDemand | mod_auth_openidc + CILogon |
| Internal Trust | All Nodes | MUNGE (MUNGE Uid 'N' Gid Emporium) key, rotated annually |
Here is the complete toolkit for implementing these changes: the Ansible Playbook to enforce the technical controls and the Transition Email to manage the human side of the change.
Part 1: The Ansible Playbook (pam_slurm_adopt)
This playbook targets your compute nodes. It automates the installation of the pam_slurm_adopt module and configures the PAM stack to reject users who do not have an active job on the node.
Prerequisites: This assumes your nodes are running a RHEL derivative (Rocky, Alma, CentOS) and you have the Slurm packages available.
YAML
---
- name: Secure Compute Nodes with pam_slurm_adopt
  hosts: compute_nodes
  become: yes

  vars:
    # Adjust this path if your Slurm libs are elsewhere (e.g., /usr/local/lib/...)
    pam_module_path: /usr/lib64/security/pam_slurm_adopt.so

  tasks:
    - name: Ensure Slurm PAM module package is installed
      dnf:
        name: slurm-pam_slurm
        state: present
      # Note: If you built Slurm from source, ensure the .so file exists manually

    - name: Verify pam_slurm_adopt.so exists
      stat:
        path: "{{ pam_module_path }}"
      register: pam_file

    - name: Fail if PAM module is missing
      fail:
        msg: "pam_slurm_adopt.so not found at {{ pam_module_path }}. Check Slurm installation."
      when: not pam_file.stat.exists

    - name: Backup current PAM sshd config
      copy:
        src: /etc/pam.d/sshd
        dest: /etc/pam.d/sshd.bak_{{ ansible_date_time.iso8601 }}
        remote_src: yes

    # We insert the Slurm rule BEFORE the standard 'password-auth' include
    # so the Slurm check happens first.
    - name: Insert pam_slurm_adopt into PAM account stack
      pamd:
        name: sshd
        type: account
        control: include
        module_path: password-auth
        new_type: account
        new_control: required
        new_module_path: "{{ pam_module_path }}"
        module_arguments: "action_no_jobs=deny"
        state: before
      notify: Restart SSHD

    - name: Ensure Slurm is configured for cgroups (required for adoption)
      lineinfile:
        path: /etc/slurm/slurm.conf
        regexp: '^TaskPlugin='
        line: 'TaskPlugin=task/cgroup'
        state: present
      notify: Reload Slurmd

    - name: Ensure SSH sessions are contained in the job's cgroup
      lineinfile:
        path: /etc/slurm/slurm.conf
        regexp: '^PrologFlags='
        line: 'PrologFlags=Contain'
        state: present
      notify: Reload Slurmd

  handlers:
    - name: Restart SSHD
      service:
        name: sshd
        state: restarted

    - name: Reload Slurmd
      service:
        name: slurmd
        state: reloaded
Key Technical Note: The action_no_jobs=deny argument is the "Security Enforcer." Without this, the module might just log a warning but still let the user in. We want a hard deny.
Part 2: The Transition Email (Change Management)
Implementing strict access control often frustrates researchers who are used to treating compute nodes like personal workstations. This email frames the restriction as a benefit to their work (performance protection) rather than just a security rule.
Subject: [Action Required] Important Security & Performance Upgrade to HPC Cluster
Date: January 2026
To: All HPC Users
From: Research Computing Infrastructure Team

Dear Research Community,

To ensure the integrity of your data and maximize the performance of your simulations, we are upgrading the access control policies on the [Cluster Name] compute environment, effective [Date].
What is changing?
We are implementing Job-Based Access Control on all compute nodes. You will only be able to SSH into a compute node while you have an active job running on it.

Why are we doing this?
Stray logins on compute nodes compete with scheduled jobs for CPU and memory and put other users' data at risk. Tying node access to active jobs keeps the hardware dedicated to the work the scheduler placed there.

How does this affect you?
Normal workflows (sbatch, srun, salloc, and web portal sessions) are unaffected. You can still SSH to a node to check on a running job; the session will simply be confined to that job's resources.
Action Required: No immediate action is required on your part. However, if you have automated scripts that rely on SSHing to specific compute nodes without a job reservation, they will break on [Date]. Please update your workflows to use Slurm dependencies.
If you have a specific use case that this policy prevents, please reply to this ticket so we can discuss a "Special Partition" solution.

Thank you for helping us keep [Cluster Name] fast and secure.

Best regards,
[Your Name/Team]
[Contact Info]
Part 3: Next Steps for "Soft Launch"
Before you run that Ansible playbook on the whole cluster, stage the rollout on a small, low-risk subset of nodes first.
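One possible staging sequence with plain ansible-playbook flags (the playbook filename and inventory group names are placeholders):
Bash
# 1. Dry run against a test partition only, showing the changes it would make
ansible-playbook pam_slurm_adopt.yml --limit test_nodes --check --diff

# 2. Apply to the test nodes and let a few friendly users exercise them
ansible-playbook pam_slurm_adopt.yml --limit test_nodes

# 3. Roll out to the full compute fleet once verified
ansible-playbook pam_slurm_adopt.yml --limit compute_nodes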