Implementing advanced access control in HPC is fundamentally different from enterprise IT. In enterprise IT, you isolate machines. In HPC, you must isolate jobs while allowing them to run across thousands of machines simultaneously.

Here is the implementation guide for a "Gateway & Ticket" access model, designed to secure the cluster without blocking science.

1. The Architecture: "Gateway & Ticket"

Moving beyond simple passwords or static SSH keys is essential. The modern standard uses Short-Lived Certificates for entry and Job-Based Authorization for internal movement.


2. Perimeter: Implementing SSH Certificates

Static SSH keys (public/private keypairs) are a security liability because they never expire. If a researcher's laptop is stolen, your cluster is compromised until you manually revoke that key.

Implementation Plan:

  1. Deploy an SSH CA: Use tools like HashiCorp Vault (for enterprise) or Netflix BLESS (AWS Lambda based) to act as the Certificate Authority.
  2. The Workflow: The researcher authenticates to the CA (ideally through your institutional SSO/MFA), receives a certificate valid for only a few hours, and uses it to reach the login nodes. When it expires, they simply re-authenticate; nothing has to be revoked by hand. A minimal signing example follows the server configuration below.
  3. Server Config (/etc/ssh/sshd_config on Login Nodes):

Bash

# Trust the CA's public key
TrustedUserCAKeys /etc/ssh/user_ca.pub

# Revocation list (crucial for immediate bans)
RevokedKeys /etc/ssh/revoked_keys

# Transition period: leave ordinary public keys enabled alongside certificates
PubkeyAuthentication yes

# To retire static keys later, stop reading authorized_keys files rather than
# disabling PubkeyAuthentication (certificates also use the publickey method):
# AuthorizedKeysFile none
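For a concrete picture of the workflow in step 2, the commands below sketch the signing step with plain OpenSSH; Vault and BLESS automate exactly this operation behind an authentication step. The CA key name, identity, principal, and validity window are placeholders, not values from this environment.

Bash

# On the signing host: 'user_ca' is the CA private key whose public half
# is published via TrustedUserCAKeys above.
# Sign the researcher's existing public key for principal 'alice', valid 8 hours.
ssh-keygen -s user_ca -I "alice@university.edu" -n alice -V +8h id_ed25519.pub

# Inspect the result; the ssh client presents id_ed25519-cert.pub automatically
# alongside the matching private key, so no per-user change is needed on the
# login nodes.
ssh-keygen -L -f id_ed25519-cert.pub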


3. Internal: Locking Down Compute Nodes

A common breach pattern: a user SSHes into a compute node where they have no job and uses it to mine crypto or snoop on another user's memory.

The Solution: pam_slurm_adopt

This is a Slurm-specific PAM module that rejects SSH connections to compute nodes unless the user has an active job running on that specific node.

Implementation Steps:

  1. Install the Module: Ensure pam_slurm_adopt.so is compiled and available in /lib64/security/.
  2. Configure PAM (/etc/pam.d/sshd on Compute Nodes):

Bash

account    required     pam_slurm_adopt.so action_no_jobs=deny

  3. Configure Slurm (slurm.conf):

Bash

TaskPlugin=task/cgroup
PrologFlags=Contain  # Ensures ssh sessions are adopted into the job's cgroup
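Once both settings are live, a quick spot check confirms adoption is actually happening. This is a sketch: the exact cgroup path depends on your Slurm version and whether the node runs cgroup v1 or v2, and the node name is whatever the scheduler grants.

Bash

# From a login node: get an allocation, then SSH into the allocated node
salloc -N1 -t 10
ssh <allocated-node>

# Inside that SSH session: the shell should sit inside the job's cgroup,
# not in a generic user or system slice (path format varies by cgroup version)
grep -i slurm /proc/self/cgroup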


4. Federated Identity: "Bring Your Own Identity"

For research collaborations (e.g., a multi-university grant), creating local accounts for every external collaborator is unmanageable.

Implementation: CILogon

Use CILogon to bridge federated university credentials (InCommon / eduGAIN) to your local Linux accounts.
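As a rough starting point, the Apache snippet below shows how the Open OnDemand portal from the checklist can be fronted by CILogon through mod_auth_openidc. The hostname, client ID, secret, and claim choice are placeholders you receive when registering the portal as a CILogon client, and the discovery URL should be confirmed against CILogon's current documentation. Mapping the federated identity (e.g., eppn) to a local Linux account still has to be provisioned on your side.

Apache

# Placeholders throughout: values come from your CILogon client registration.
OIDCProviderMetadataURL https://cilogon.org/.well-known/openid-configuration
OIDCClientID            cilogon:/client_id/<your-registered-client>
OIDCClientSecret        <secret-from-registration>
OIDCRedirectURI         https://ondemand.example.edu/oidc
OIDCCryptoPassphrase    <long-random-string>
OIDCScope               "openid email profile"
# Which claim becomes REMOTE_USER depends on what your identity providers release.
OIDCRemoteUserClaim     eppn

<Location "/">
  AuthType  openid-connect
  Require   valid-user
</Location>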


5. Privileged Access Management (PAM) for Admins

Admins should never log in as root. They should log in as themselves and escalate only when necessary.
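A minimal sketch of that policy, assuming an hpc-admins group (the group name and file paths are placeholders):

Bash

# /etc/ssh/sshd_config (all nodes): no direct root logins
PermitRootLogin no

# /etc/sudoers.d/hpc-admins (edit with: visudo -f /etc/sudoers.d/hpc-admins)
# Members escalate as themselves, and session I/O is recorded for audit
# (by default under /var/log/sudo-io).
%hpc-admins ALL=(ALL) ALL
Defaults:%hpc-admins log_output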

Summary Configuration Checklist

Control               Location         Configuration / Tool
-------               --------         --------------------
Kill Static Keys      Login Nodes      TrustedUserCAKeys
Block Node Hopping    Compute Nodes    pam_slurm_adopt
Web Portal Auth       Open OnDemand    mod_auth_openidc + CILogon
Internal Trust        All Nodes        MUNGE (MUNGE Uid 'N' Gid Emporium) key, rotated annually

Here is the complete toolkit for implementing these changes: the Ansible Playbook to enforce the technical controls and the Transition Email to manage the human side of the change.

Part 1: The Ansible Playbook (pam_slurm_adopt)

This playbook targets your compute nodes. It automates the installation of the pam_slurm_adopt module and configures the PAM stack to reject users who do not have an active job on the node.

Prerequisites: This assumes your nodes are running a RHEL-derivative (Rocky, Alma, CentOS) and you have the Slurm packages available.

YAML

---
- name: Secure Compute Nodes with pam_slurm_adopt
  hosts: compute_nodes
  become: yes
  vars:
    # Adjust this path if your Slurm libs are elsewhere (e.g., /usr/local/lib/...)
    pam_module_path: /usr/lib64/security/pam_slurm_adopt.so

  tasks:
    - name: Ensure Slurm PAM module package is installed
      dnf:
        name: slurm-pam_slurm
        state: present
      # Note: If you built Slurm from source, ensure the .so file exists manually

    - name: Verify pam_slurm_adopt.so exists
      stat:
        path: "{{ pam_module_path }}"
      register: pam_file

    - name: Fail if PAM module is missing
      fail:
        msg: "pam_slurm_adopt.so not found at {{ pam_module_path }}. Check Slurm installation."
      when: not pam_file.stat.exists

    - name: Backup current PAM sshd config
      copy:
        src: /etc/pam.d/sshd
        dest: /etc/pam.d/sshd.bak_{{ ansible_date_time.iso8601 }}
        remote_src: yes

    - name: Insert pam_slurm_adopt into PAM account stack
      # type/control/module_path identify the EXISTING anchor rule (the standard
      # 'account include password-auth' line on RHEL derivatives; use system-auth
      # if that is what your stack includes). The new_* keys describe the rule
      # being added, and state: before places it ahead of the anchor so the
      # Slurm check happens first.
      community.general.pamd:
        name: sshd
        type: account
        control: include
        module_path: password-auth
        new_type: account
        new_control: required
        new_module_path: "{{ pam_module_path }}"
        module_arguments: "action_no_jobs=deny"
        state: before
      notify: Restart SSHD

    - name: Ensure Slurm is configured for cgroups (required for adoption)
      lineinfile:
        path: /etc/slurm/slurm.conf
        regexp: '^TaskPlugin='
        line: 'TaskPlugin=task/cgroup'
        state: present
      notify: Reload Slurmd

    - name: Ensure SSH sessions are adopted into the job's cgroup
      lineinfile:
        path: /etc/slurm/slurm.conf
        regexp: '^PrologFlags='
        line: 'PrologFlags=Contain'
        state: present
      notify: Reload Slurmd

  handlers:
    - name: Restart SSHD
      service:
        name: sshd
        state: restarted

    - name: Reload Slurmd
      service:
        name: slurmd
        state: reloaded

Key Technical Note: The action_no_jobs=deny argument is the "Security Enforcer." Without this, the module might just log a warning but still let the user in. We want a hard deny.
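Following the soft-launch plan in Part 3, a sensible first run is check mode against the small test group. The inventory path, playbook filename, and group name below are placeholders:

Bash

# Dry run against the debug nodes; --diff shows the pending PAM and slurm.conf edits
ansible-playbook -i inventory.ini secure_compute_nodes.yml --limit debug_nodes --check --diff

# Apply for real to the same group before rolling out cluster-wide
ansible-playbook -i inventory.ini secure_compute_nodes.yml --limit debug_nodes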


Part 2: The Transition Email (Change Management)

Implementing strict access control often frustrates researchers who are used to treating compute nodes like personal workstations. This email frames the restriction as a benefit to their work (performance protection) rather than just a security rule.

Subject: [Action Required] Important Security & Performance Upgrade to HPC Cluster

Date: January 2026
To: All HPC Users
From: Research Computing Infrastructure Team

Dear Research Community,

To ensure the integrity of your data and maximize the performance of your simulations, we are upgrading the access control policies on the [Cluster Name] compute environment, effective [Date].

What is changing? We are implementing Job-Based Access Control on all compute nodes.

Why are we doing this?

  1. Performance Protection: "Node hopping" (logging into nodes you aren't using) inadvertently consumes memory and CPU cycles, slowing down the science of colleagues who have reserved that node.
  2. Security Compliance: New grant requirements (including NIH and NSF standards) require us to strictly isolate workloads.

How does this affect you?

Action Required: Most users need to do nothing. However, if you have automated scripts that SSH directly to specific compute nodes without a job reservation, they will break on [Date]. Please update those workflows to request the node through Slurm first (for example, via salloc or srun).

If you have a specific use case that this policy prevents, please reply to this ticket so we can discuss a "Special Partition" solution.

Thank you for helping us keep [Cluster Name] fast and secure.

Best regards,

[Your Name/Team] [Contact Info]


Part 3: Next Steps for "Soft Launch"

Before you run that Ansible playbook on the whole cluster:

  1. Test on a single partition: Create a debug partition with 2 nodes and apply the playbook there.
  2. Verify the Deny: Log in as a user with no active job and confirm the connection is rejected (even with valid SSH keys or certificates).
  3. Verify the Allow: Submit an interactive job (salloc -p debug), wait for it to start, and confirm you can SSH in.
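A rough command sequence for steps 2 and 3, assuming a debug partition and a test account; the node name is whichever one Slurm allocates:

Bash

# Step 2 - the deny case: no job on the node, so the connection should be rejected
ssh testuser@<debug-node>

# Step 3 - the allow case: request an allocation, find the node, then SSH in
salloc -p debug -N1 -t 15
squeue -u $USER          # note which node the job landed on
ssh <debug-node>         # this session should now be accepted and adopted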