Cluster Tuning & Resource Management is the operational science of ensuring your supercomputer is never idle.

A "well-tuned" cluster is not just one where jobs run fast (Performance); it is one where the hardware is utilized at 95%+ capacity 24/7 (Throughput). If you have 1,000 cores and 200 are empty because the scheduler is waiting for a "Big Job" to start, you are losing money.

Effective management combines Scheduler Logic (playing Tetris with jobs) with Kernel Tuning (optimizing how the OS manages RAM and CPU cycles).

Here is the detailed breakdown of the scheduling strategies, NUMA awareness, and resource isolation.

1. The Scheduler Strategy: Playing "Tetris"

The primary tool for throughput is the Scheduler (Slurm, PBS, LSF). The simplest scheduling policy is strict "First In, First Out" (FIFO). This is terrible for throughput because one massive job at the head of the queue can block everyone behind it.

Backfill Scheduling (The Tetris Algorithm):

Scenario: A massive job needs 1,000 cores. Only 200 are free, so the massive job waits and those 200 cores sit idle.
Optimization: The scheduler looks at the queue. It sees a tiny job that needs 10 cores for 1 hour. Using the time limit each job declared, it calculates: "Can I run this tiny job and finish it before the massive job gets its resources?"
Result: It slips the tiny job into the gap. The idle cores are used, throughput increases, and the massive job is not delayed. This only works if users request realistic time limits; inflated limits leave the scheduler with fewer gaps it can safely fill.
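The backfill decision above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Slurm's actual algorithm: the `Job` class, the `can_backfill` function, and all the numbers are hypothetical, and every job is assumed to finish within its declared time limit.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int
    walltime_hours: float  # user-declared time limit (hypothetical field)


def can_backfill(job: Job, free_cores: int, now: float, big_job_start: float) -> bool:
    """True if the job fits in the idle cores and its time limit guarantees
    it ends before the reserved start time of the blocked big job."""
    fits = job.cores <= free_cores
    finishes_in_time = now + job.walltime_hours <= big_job_start
    return fits and finishes_in_time


# Toy scenario: 200 cores idle; the big job is expected to start in 2 hours.
free_cores, now, big_job_start = 200, 0.0, 2.0
queue = [
    Job("tiny", cores=10, walltime_hours=1.0),
    Job("too_long", cores=10, walltime_hours=3.0),   # would delay the big job
    Job("too_wide", cores=400, walltime_hours=0.5),  # does not fit in the gap
]

backfilled = []
for job in queue:
    if can_backfill(job, free_cores, now, big_job_start):
        backfilled.append(job.name)
        free_cores -= job.cores  # the gap shrinks as jobs are slotted in

print(backfilled)  # ['tiny']
```

Only the job that both fits in the gap and ends before the reservation is started early; the other two stay queued.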

Preemption:

Scenario: An urgent "Hurricane Forecast" job arrives.
Optimization: The scheduler suspends, requeues, or cancels low-priority "Research" jobs to free up resources for the urgent job right away.
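One way to picture the victim selection is the sketch below. It is a simplified illustration, not how any particular scheduler implements preemption: the `pick_victims` function, the tuple layout, and the priority numbers are all hypothetical.

```python
def pick_victims(running, cores_needed, urgent_priority):
    """Choose the lowest-priority running jobs to preempt until enough
    cores are freed. `running` is a list of (name, priority, cores).
    Returns the job names to preempt, or [] if not enough can be freed."""
    # Only jobs with a lower priority than the urgent job are eligible.
    candidates = sorted(
        (job for job in running if job[1] < urgent_priority),
        key=lambda job: job[1],
    )
    victims, freed = [], 0
    for name, _priority, cores in candidates:
        if freed >= cores_needed:
            break
        victims.append(name)
        freed += cores
    return victims if freed >= cores_needed else []


running = [("research_a", 10, 64), ("research_b", 5, 64), ("ops", 90, 32)]
print(pick_victims(running, cores_needed=100, urgent_priority=50))
# ['research_b', 'research_a']
```

The high-priority "ops" job is never considered, and if the eligible jobs cannot free enough cores the function preempts nothing rather than killing work for no benefit.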

2. Kernel & Memory Tuning

Standard Linux is tuned for web servers and desktops, not supercomputers. You must retune the OS kernel.

A. NUMA (Non-Uniform Memory Access) Awareness

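The Physics: In a dual-socket server, each CPU socket has its own directly attached RAM: CPU 1 has its own RAM, and CPU 2 has its own RAM.
The Problem: If CPU 1 reads data from CPU 2's RAM, the request has to cross the inter-socket link, so it is noticeably slower than a local read. The exact penalty depends on the hardware.
The Fix: numactl. We force jobs to stay "Local." We tell the OS: "If this process starts on CPU 1, allocate all its memory on RAM 1. Do not cross the bridge."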
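The same "stay local" idea can be sketched from inside a Python process. This is a minimal illustration, assuming a Linux node that exposes NUMA topology under `/sys/devices/system/node/`; the helper names `parse_cpulist` and `pin_to_numa_node` are hypothetical. Note that `os.sched_setaffinity` only pins CPUs. Memory locality then relies on Linux's default policy of allocating pages on the node of the CPU that first touches them, which is weaker than the strict binding `numactl --membind` provides.

```python
import os


def parse_cpulist(text: str) -> set[int]:
    """Parse a kernel cpulist string such as '0-3,8-11' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_numa_node(node: int) -> set[int]:
    """Restrict the current process to the CPUs of one NUMA node (Linux only)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)  # pid 0 means "the calling process"
    return cpus


print(sorted(parse_cpulist("0-3,8-11")))  # [0, 1, 2, 3, 8, 9, 10, 11]
```

Calling `pin_to_numa_node(0)` early in a program, before large arrays are allocated, keeps both the threads and their first-touched memory on node 0.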

B. Hugepages

The Problem: Standard RAM is divided into tiny 4KB pages. If a simulation uses 1TB of RAM, that is hundreds of millions of pages. The CPU's translation cache (the TLB) can hold only a small fraction of those mappings, so the CPU spends a huge amount of time walking the "Page Table" index to find data.
The Fix: Enable Transparent Huge Pages (THP) or explicit 2MB/1GB Hugepages. Each entry then covers far more memory, which shrinks the index and makes memory lookups faster.
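The arithmetic behind this is easy to check. The snippet below is a back-of-the-envelope calculation only; it counts how many pages are needed to map 1 TiB at each page size and ignores page-table hierarchy details.

```python
KIB, MIB, GIB, TIB = 1024, 1024**2, 1024**3, 1024**4


def pages_needed(memory_bytes: int, page_bytes: int) -> int:
    """Number of pages required to map the given amount of memory."""
    return -(-memory_bytes // page_bytes)  # ceiling division


for label, size in [("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB), ("1 GiB", GIB)]:
    print(f"{label} pages: {pages_needed(TIB, size):,}")

# 4 KiB pages: 268,435,456
# 2 MiB pages: 524,288
# 1 GiB pages: 1,024
```

Moving from 4 KiB to 2 MiB pages cuts the number of mappings by a factor of 512, which is why the same TLB suddenly covers far more of the working set.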

C. Swappiness

The Rule: vm.swappiness = 1.
Why: A low swappiness value tells the kernel to avoid swapping except as a last resort. In HPC, if a node runs out of RAM and starts writing to Disk (Swapping), performance can drop by orders of magnitude. It is better for the application to fail fast (OOM Kill) so it can be resubmitted on a larger node, rather than run painfully slow for days.
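A node health check for this setting could look like the sketch below. The function names and the `target` default are hypothetical; the only system detail relied on is that Linux exposes the current value in `/proc/sys/vm/swappiness`. The setting itself is usually persisted as a `vm.swappiness = 1` line in a file under `/etc/sysctl.d/`.

```python
from pathlib import Path

SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")  # standard Linux procfs location


def parse_swappiness(text: str) -> int:
    """The procfs file holds a single integer followed by a newline."""
    return int(text.strip())


def swappiness_ok(value: int, target: int = 1) -> bool:
    """True if the node is at or below the target swappiness."""
    return 0 <= value <= target


if SWAPPINESS_PATH.exists():  # skipped on non-Linux systems
    current = parse_swappiness(SWAPPINESS_PATH.read_text())
    status = "ok" if swappiness_ok(current) else "too high"
    print(f"vm.swappiness = {current} ({status})")
```

On a stock Linux install this typically reports the distribution default of 60 as too high for a compute node.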

3. Resource Isolation (Cgroups)

How do you stop a user from crashing a node?

Control Groups (Cgroups): This is the "Box."
Implementation: When a job requests 4GB of RAM, Slurm (with its Cgroup support enabled) places the job in a Cgroup with a 4GB memory limit.
Enforcement: If the user's code tries to use 4.1GB and the kernel cannot reclaim enough memory inside that Cgroup, the Linux Kernel kills only that job. The rest of the node (and other users on it) remains unaffected. Without Cgroups, that job could consume all system RAM and crash the entire server.
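The limit check can be illustrated with a small model. This is a simplified sketch, not Slurm or kernel code: the function names are hypothetical, the suffixes are assumed to be 1024-based, and a bare number is treated as megabytes in the way Slurm's `--mem` option does. In cgroup v2 the real limit lives in the job's `memory.max` file and current usage in `memory.current`.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def mem_request_to_bytes(request: str) -> int:
    """Convert a request such as '4G' to bytes (assumed 1024-based units).
    A bare number is interpreted as megabytes."""
    request = request.strip().upper()
    if request[-1] in UNITS:
        return int(request[:-1]) * UNITS[request[-1]]
    return int(request) * UNITS["M"]


def exceeds_limit(usage_bytes: int, limit_bytes: int) -> bool:
    """True if usage is over the cgroup limit, i.e. the job is an OOM-kill
    candidate (this model ignores the kernel's attempt to reclaim first)."""
    return usage_bytes > limit_bytes


limit = mem_request_to_bytes("4G")
print(limit)                                         # 4294967296
print(exceeds_limit(int(4.1 * 1024**3), limit))      # True
print(exceeds_limit(int(3.9 * 1024**3), limit))      # False
```

The key property is that the comparison is made per job against that job's own limit, so one runaway process cannot eat into memory allocated to its neighbours.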

4. Key Applications & Tools

Category	Tool	Usage
Scheduler	Slurm	The industry standard. Highly tunable for Backfill, Fairshare, and Preemption policies.
Memory Tuning	numactl	Command-line tool to bind processes to the CPUs and memory of specific NUMA nodes (e.g., numactl --cpunodebind=0 --membind=0).
Process Isolation	Cgroups (v2)	Linux kernel feature used to enforce strict CPU and RAM limits per job.
I/O Tuning	tuned-adm	The command-line client for Red Hat's TuneD service. It applies "Profiles" (e.g., tuned-adm profile throughput-performance) that set a group of kernel and power-management parameters in one step.