ML and AI workshops on HPC are unique because they must bridge two different cultures: the "Data Science" culture (interactive, Jupyter-based, pip-install everything) and the "HPC" culture (batch-based, Slurm, strict modules). A successful workshop doesn't just teach Machine Learning; it teaches how to do ML without breaking the supercomputer.

Here is a detailed breakdown of the infrastructure, the 3-day curriculum, and the hands-on labs.
1. The Workshop Infrastructure
You cannot run a workshop on the Login Node. You need a dedicated environment.
- Open OnDemand (OOD): The "Killer App" for training.
  - Why: It gives users a "Jupyter" button in their web browser.
  - Magic: When they click "Launch," OOD secretly submits a Slurm job, waits for a GPU node, starts Jupyter there, and tunnels it back to the user. The user feels like they are on a laptop, but they are actually on an A100 GPU node.
- Magic Castle:
  - Tool: An open-source project that creates a temporary, full-featured HPC cluster in the cloud (AWS/Azure/GCP) just for the workshop.
  - Benefit: If a student accidentally deletes /etc, they only destroy a temporary node, not your production cluster.
2. The Curriculum: From Interactive to Batch
Day 1: The Environment (Stop using pip install)
- The Conflict: Python users love conda and pip. HPC admins hate them because they create millions of tiny files that kill the file system.
- The Solution: Apptainer (Singularity).
- Lab: "Building Your First Container."
  - Task: Take a standard Docker container (from NVIDIA NGC), convert it to a .sif file, and run it on the cluster (a sketch of the steps follows this list).
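A minimal sketch of the lab's core steps, wrapped in Python (via subprocess) so every example in this guide stays in one language; the `apptainer` commands are the real content. The NGC tag and `train.py` are placeholders for whatever your site recommends.

```python
"""Day 1 lab sketch: pull an NVIDIA NGC image as a single .sif file, then run
a script inside it with GPU passthrough. Image tag and script name are
placeholders; substitute your site's supported versions."""
import subprocess

NGC_IMAGE = "docker://nvcr.io/nvidia/pytorch:24.01-py3"  # placeholder tag
SIF_FILE = "pytorch.sif"

# Convert the Docker image into one big .sif file -- a single file instead of
# thousands of tiny ones, which is exactly what the shared file system wants.
subprocess.run(["apptainer", "pull", SIF_FILE, NGC_IMAGE], check=True)

# Run a training script inside the container; --nv exposes the host GPUs.
subprocess.run(
    ["apptainer", "exec", "--nv", SIF_FILE, "python", "train.py"],
    check=True,
)
```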
Day 2: Scaling Up (Distributed Data Parallel - DDP)
- The Concept: Moving from 1 GPU to 4 GPUs.
- The Tool: PyTorch DDP or PyTorch Lightning.
- Lab: "ResNet Resized."
  - Task: Run a training script on 1 GPU and record the speed. Change 3 lines of code to enable DDP (see the sketch after this list). Run it on 4 GPUs and watch the speed increase (and discuss why it isn't perfectly 4x faster).
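A minimal sketch of what those few lines look like in raw PyTorch DDP (PyTorch Lightning hides them entirely). It assumes a launch like `torchrun --nproc_per_node=4 train_ddp.py` on one GPU node; the linear model and random tensors are stand-ins for the lab's ResNet script.

```python
"""Single-GPU script turned into DDP: the three marked lines are the change.
Assumes launch via torchrun, which sets LOCAL_RANK for each process."""
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # (1) Join the process group torchrun sets up -- one process per GPU.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in model and dataset; in the lab this is the ResNet script.
    model = torch.nn.Linear(32, 10).cuda()
    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))

    # (2) Wrap the model so gradients are averaged across all GPUs.
    model = DDP(model, device_ids=[local_rank])

    # (3) Give each rank its own shard of the data.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle differently each epoch
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x.cuda()), y.cuda())
            loss.backward()  # DDP all-reduces gradients here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The discussion point about sub-4x scaling comes mostly from the gradient all-reduce inside `loss.backward()` and from any data-loading bottleneck.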
Day 3: The Frontier (LLMs & Model Parallelism)
- The Challenge: Models like Llama-3-70B are too big to fit in a single GPU's memory.
- The Solution: FSDP (Fully Sharded Data Parallel).
- Lab: "Fine-Tuning a Llama."
  - Task: Use Low-Rank Adaptation (LoRA) to fine-tune a large language model on a custom text dataset, sharding the model's parameters across multiple nodes (a sketch follows this list).
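A minimal sketch of how the two pieces fit together, assuming the `transformers` and `peft` libraries and a multi-node launch via torchrun or srun. The model name is a placeholder (real Llama-3-70B weights are gated), and a production run would also need an FSDP auto-wrap policy plus a tokenized dataset.

```python
"""Day 3 lab sketch: attach LoRA adapters (peft) and shard the base model
with PyTorch FSDP so no single GPU has to hold all 70B parameters.
Model name is a placeholder; the training loop is indicated in comments."""
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Placeholder checkpoint -- use a small open model for the first test run.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B", torch_dtype=torch.bfloat16
)

# LoRA: freeze the base weights and train only small low-rank adapter matrices.
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_cfg)

# FSDP: shard parameters, gradients, and optimizer state across every rank
# (and therefore across nodes). use_orig_params preserves the frozen/trainable
# split that LoRA created.
model = FSDP(model, device_id=local_rank, use_orig_params=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Training loop (per batch of tokenized text):
#   loss = model(input_ids=batch, labels=batch).loss
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```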
3. Critical "Soft Skills" for AI Users
- Checkpointing: "Your job will die after 24 hours. Does your code save its state every hour?" (A minimal save/resume pattern is sketched below.)
- I/O Patterns: "Do not read 1 million JPEG files one by one. Pack them into a WebDataset (Tar) file so the storage system doesn't choke." (See the second sketch below.)
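A minimal sketch of the checkpointing habit in plain PyTorch; the file name, the hourly interval, and the atomic-rename trick are conventions rather than requirements.

```python
"""Checkpointing sketch: save enough state at regular intervals that the job
can be resubmitted after the 24-hour wall clock kills it."""
import os
import time
import torch

CKPT = "checkpoint.pt"
SAVE_EVERY_SEC = 3600  # once an hour


def save_checkpoint(model, optimizer, step):
    # Write atomically: dump to a temp file, then rename over the old one,
    # so a job killed mid-write never corrupts the only copy.
    tmp = CKPT + ".tmp"
    torch.save(
        {"model": model.state_dict(), "optim": optimizer.state_dict(), "step": step},
        tmp,
    )
    os.replace(tmp, CKPT)


def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT):
        return 0  # fresh start
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return state["step"]  # resume from here


# Inside the training loop:
#   start_step = load_checkpoint(model, optimizer)
#   last_save = time.time()
#   for step in range(start_step, total_steps):
#       ...train...
#       if time.time() - last_save > SAVE_EVERY_SEC:
#           save_checkpoint(model, optimizer, step)
#           last_save = time.time()
```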
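And a minimal sketch of the I/O pattern with the `webdataset` library: pack once into large tar shards, then stream them during training. The shard paths, shard size, and the tiny fake-JPEG generator are placeholders.

```python
"""I/O pattern sketch: pack many small JPEGs into a few large tar shards with
ShardWriter, then stream them at training time so storage sees sequential
reads of big files instead of a million tiny opens."""
import os
from io import BytesIO

from PIL import Image
import webdataset as wds


def fake_samples():
    # Stand-in for your real data: yields (jpeg_bytes, integer_label) pairs.
    buf = BytesIO()
    Image.new("RGB", (8, 8)).save(buf, format="JPEG")
    for i in range(10):
        yield buf.getvalue(), i % 2


# Packing (done once, not per job): ~1 GB shards instead of a million files.
os.makedirs("shards", exist_ok=True)
with wds.ShardWriter("shards/train-%06d.tar", maxsize=1e9) as sink:
    for idx, (jpeg_bytes, label) in enumerate(fake_samples()):
        sink.write({"__key__": f"sample{idx:08d}", "jpg": jpeg_bytes, "cls": label})

# Training time: stream the shards sequentially and decode on the fly.
dataset = (
    wds.WebDataset("shards/train-{000000..000000}.tar")
    .decode("torchrgb")  # JPEG bytes -> float image tensors
    .to_tuple("jpg", "cls")
)
loader = wds.WebLoader(dataset, batch_size=4, num_workers=2)
for images, labels in loader:
    pass  # feed the GPU from here
```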
4. Key Applications & Tools

| Category | Tool | Usage |
| --- | --- | --- |
| Portal | Open OnDemand | The standard interface for workshops. Allows "Zero-Install" participation via Chrome/Firefox. |
| Container | Apptainer | The required format for running Docker containers safely on shared HPC systems. |
| Framework | PyTorch Lightning | The best teaching tool for scaling. It abstracts away the complex distributed engineering code so students focus on the ML. |
| Dataset | WebDataset | A library for high-performance I/O. Essential for teaching users how to feed GPUs fast enough. |