Implementing advanced access control in HPC is fundamentally different from enterprise IT. In enterprise IT, you isolate machines. In HPC, you must isolate jobs while allowing them to run across thousands of machines simultaneously.

Here is the implementation guide for a "Gateway & Ticket" access model, designed to secure the cluster without blocking science.

1. The Architecture: "Gateway & Ticket"

Moving beyond simple passwords or static SSH keys is essential. The modern standard uses Short-Lived Certificates for entry and Job-Based Authorization for internal movement.


2. Perimeter: Implementing SSH Certificates

Static SSH keys (public/private keypairs) are a security liability because they never expire. If a researcher's laptop is stolen, your cluster is compromised until you manually revoke that key.

Implementation Plan:

  1. Deploy an SSH CA: Use tools like HashiCorp Vault (for enterprise) or Netflix BLESS (AWS Lambda based) to act as the Certificate Authority.
  2. The Workflow: The researcher authenticates to the CA (ideally through your institutional SSO/MFA), receives a certificate valid for only a few hours, and uses it to reach the login nodes. When it expires, they simply re-authenticate; nothing has to be revoked by hand. A minimal signing example follows the server configuration below.
  3. Server Config (/etc/ssh/sshd_config on Login Nodes):

Bash

# Trust the CA's public key
TrustedUserCAKeys /etc/ssh/user_ca.pub

# Revocation list (crucial for immediate bans)
RevokedKeys /etc/ssh/revoked_keys

# Transition period: leave ordinary public keys enabled alongside certificates
PubkeyAuthentication yes

# To retire static keys later, stop reading authorized_keys files rather than
# disabling PubkeyAuthentication (certificates also use the publickey method):
# AuthorizedKeysFile none
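For a concrete picture of the workflow in step 2, the commands below sketch the signing step with plain OpenSSH; Vault and BLESS automate exactly this operation behind an authentication step. The CA key name, identity, principal, and validity window are placeholders, not values from this environment.

Bash

# On the signing host: 'user_ca' is the CA private key whose public half
# is published via TrustedUserCAKeys above.
# Sign the researcher's existing public key for principal 'alice', valid 8 hours.
ssh-keygen -s user_ca -I "alice@university.edu" -n alice -V +8h id_ed25519.pub

# Inspect the result; the ssh client presents id_ed25519-cert.pub automatically
# alongside the matching private key, so no per-user change is needed on the
# login nodes.
ssh-keygen -L -f id_ed25519-cert.pub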


3. Internal: Locking Down Compute Nodes

A common breach pattern: a user SSHes into a compute node where they have no job and uses it to mine crypto or snoop on another user's memory.

The Solution: pam_slurm_adopt

This is a Slurm-specific PAM module that rejects SSH connections to compute nodes unless the user has an active job running on that specific node.

Implementation Steps:

  1. Install the Module: Ensure pam_slurm_adopt.so is compiled and available in /lib64/security/.
  2. Configure PAM (/etc/pam.d/sshd on Compute Nodes):

Bash

account    required     pam_slurm_adopt.so action_no_jobs=deny

  3. Configure Slurm (slurm.conf):

Bash

TaskPlugin=task/cgroup
PrologFlags=Contain  # Ensures ssh sessions are adopted into the job's cgroup
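Once both settings are live, a quick spot check confirms adoption is actually happening. This is a sketch: the exact cgroup path depends on your Slurm version and whether the node runs cgroup v1 or v2, and the node name is whatever the scheduler grants.

Bash

# From a login node: get an allocation, then SSH into the allocated node
salloc -N1 -t 10
ssh <allocated-node>

# Inside that SSH session: the shell should sit inside the job's cgroup,
# not in a generic user or system slice (path format varies by cgroup version)
grep -i slurm /proc/self/cgroup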


4. Federated Identity: "Bring Your Own Identity"

For research collaborations (e.g., a multi-university grant), creating local accounts for every external collaborator is unmanageable.

Implementation: CILogon

Use CILogon to bridge federated university credentials (InCommon / eduGAIN) to your local Linux accounts.
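As a rough starting point, the Apache snippet below shows how the Open OnDemand portal from the checklist can be fronted by CILogon through mod_auth_openidc. The hostname, client ID, secret, and claim choice are placeholders you receive when registering the portal as a CILogon client, and the discovery URL should be confirmed against CILogon's current documentation. Mapping the federated identity (e.g., eppn) to a local Linux account still has to be provisioned on your side.

Apache

# Placeholders throughout: values come from your CILogon client registration.
OIDCProviderMetadataURL https://cilogon.org/.well-known/openid-configuration
OIDCClientID            cilogon:/client_id/<your-registered-client>
OIDCClientSecret        <secret-from-registration>
OIDCRedirectURI         https://ondemand.example.edu/oidc
OIDCCryptoPassphrase    <long-random-string>
OIDCScope               "openid email profile"
# Which claim becomes REMOTE_USER depends on what your identity providers release.
OIDCRemoteUserClaim     eppn

<Location "/">
  AuthType  openid-connect
  Require   valid-user
</Location>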


5. Privileged Access Management (PAM) for Admins

Admins should never log in as root. They should log in as themselves and escalate only when necessary.
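A minimal sketch of that policy, assuming an hpc-admins group (the group name and file paths are placeholders):

Bash

# /etc/ssh/sshd_config (all nodes): no direct root logins
PermitRootLogin no

# /etc/sudoers.d/hpc-admins (edit with: visudo -f /etc/sudoers.d/hpc-admins)
# Members escalate as themselves, and session I/O is recorded for audit
# (by default under /var/log/sudo-io).
%hpc-admins ALL=(ALL) ALL
Defaults:%hpc-admins log_output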

Summary Configuration Checklist

Control               Location         Configuration / Tool
-------               --------         --------------------
Kill Static Keys      Login Nodes      TrustedUserCAKeys
Block Node Hopping    Compute Nodes    pam_slurm_adopt
Web Portal Auth       Open OnDemand    mod_auth_openidc + CILogon
Internal Trust        All Nodes        MUNGE (MUNGE Uid 'N' Gid Emporium) key, rotated annually

Here is the complete toolkit for implementing these changes: the Ansible Playbook to enforce the technical controls and the Transition Email to manage the human side of the change.

Part 1: The Ansible Playbook (pam_slurm_adopt)

This playbook targets your compute nodes. It automates the installation of the pam_slurm_adopt module and configures the PAM stack to reject users who do not have an active job on the node.

Prerequisites: This assumes your nodes are running a RHEL-derivative (Rocky, Alma, CentOS) and you have the Slurm packages available.

YAML

---
- name: Secure Compute Nodes with pam_slurm_adopt
  hosts: compute_nodes
  become: yes
  vars:
    # Adjust this path if your Slurm libs are elsewhere (e.g., /usr/local/lib/...)
    pam_module_path: /usr/lib64/security/pam_slurm_adopt.so

  tasks:
    - name: Ensure Slurm PAM module package is installed
      dnf:
        name: slurm-pam_slurm
        state: present
      # Note: If you built Slurm from source, ensure the .so file exists manually

    - name: Verify pam_slurm_adopt.so exists
      stat:
        path: "{{ pam_module_path }}"
      register: pam_file

    - name: Fail if PAM module is missing
      fail:
        msg: "pam_slurm_adopt.so not found at {{ pam_module_path }}. Check Slurm installation."
      when: not pam_file.stat.exists

    - name: Backup current PAM sshd config
      copy:
        src: /etc/pam.d/sshd
        dest: /etc/pam.d/sshd.bak_{{ ansible_date_time.iso8601 }}
        remote_src: yes

    - name: Insert pam_slurm_adopt into PAM account stack
      # type/control/module_path identify the EXISTING anchor rule (the standard
      # 'account include password-auth' line on RHEL derivatives; use system-auth
      # if that is what your stack includes). The new_* keys describe the rule
      # being added, and state: before places it ahead of the anchor so the
      # Slurm check happens first.
      community.general.pamd:
        name: sshd
        type: account
        control: include
        module_path: password-auth
        new_type: account
        new_control: required
        new_module_path: "{{ pam_module_path }}"
        module_arguments: "action_no_jobs=deny"
        state: before
      notify: Restart SSHD

    - name: Ensure Slurm is configured for cgroups (required for adoption)
      lineinfile:
        path: /etc/slurm/slurm.conf
        regexp: '^TaskPlugin='
        line: 'TaskPlugin=task/cgroup'
        state: present
      notify: Reload Slurmd

    - name: Ensure SSH sessions are adopted into the job's cgroup
      lineinfile:
        path: /etc/slurm/slurm.conf
        regexp: '^PrologFlags='
        line: 'PrologFlags=Contain'
        state: present
      notify: Reload Slurmd

  handlers:
    - name: Restart SSHD
      service:
        name: sshd
        state: restarted

    - name: Reload Slurmd
      service:
        name: slurmd
        state: reloaded

Key Technical Note: The action_no_jobs=deny argument is the "Security Enforcer." Without this, the module might just log a warning but still let the user in. We want a hard deny.
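Following the soft-launch plan in Part 3, a sensible first run is check mode against the small test group. The inventory path, playbook filename, and group name below are placeholders:

Bash

# Dry run against the debug nodes; --diff shows the pending PAM and slurm.conf edits
ansible-playbook -i inventory.ini secure_compute_nodes.yml --limit debug_nodes --check --diff

# Apply for real to the same group before rolling out cluster-wide
ansible-playbook -i inventory.ini secure_compute_nodes.yml --limit debug_nodes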


Part 2: The Transition Email (Change Management)

Implementing strict access control often frustrates researchers who are used to treating compute nodes like personal workstations. This email frames the restriction as a benefit to their work (performance protection) rather than just a security rule.

Subject: [Action Required] Important Security & Performance Upgrade to HPC Cluster

Date: January 2026
To: All HPC Users
From: Research Computing Infrastructure Team

Dear Research Community,

To ensure the integrity of your data and maximize the performance of your simulations, we are upgrading the access control policies on the [Cluster Name] compute environment, effective [Date].

What is changing? We are implementing Job-Based Access Control on all compute nodes.

Why are we doing this?

  1. Performance Protection: "Node hopping" (logging into nodes you aren't using) inadvertently consumes memory and CPU cycles, slowing down the science of colleagues who have reserved that node.
  2. Security Compliance: New grant requirements (including NIH and NSF standards) require us to strictly isolate workloads.

How does this affect you?

Action Required: Most users need to do nothing. However, if you have automated scripts that SSH directly to specific compute nodes without a job reservation, they will break on [Date]. Please update those workflows to request the node through Slurm first (for example, via salloc or srun).

If you have a specific use case that this policy prevents, please reply to this ticket so we can discuss a "Special Partition" solution.

Thank you for helping us keep [Cluster Name] fast and secure.

Best regards,

[Your Name/Team] [Contact Info]


Part 3: Next Steps for "Soft Launch"

Before you run that Ansible playbook on the whole cluster:

  1. Test on a single partition: Create a debug partition with 2 nodes and apply the playbook there.
  2. Verify the Deny: Log in as a user with no active job and confirm the connection is rejected (even with valid SSH keys or certificates).
  3. Verify the Allow: Submit an interactive job (salloc -p debug), wait for it to start, and confirm you can SSH in.
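A rough command sequence for steps 2 and 3, assuming a debug partition and a test account; the node name is whichever one Slurm allocates:

Bash

# Step 2 - the deny case: no job on the node, so the connection should be rejected
ssh testuser@<debug-node>

# Step 3 - the allow case: request an allocation, find the node, then SSH in
salloc -p debug -N1 -t 15
squeue -u $USER          # note which node the job landed on
ssh <debug-node>         # this session should now be accepted and adopted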