ML and AI workshops on HPC are unique because they must bridge two different cultures: the "Data Science" culture (interactive, Jupyter-based, pip-install everything) and the "HPC" culture (batch-based, Slurm, strict modules). A successful workshop doesn't just teach Machine Learning; it teaches how to do ML without breaking the supercomputer.

Here is a detailed breakdown of the infrastructure, the 3-day curriculum, and the hands-on labs.
1. The Workshop Infrastructure
You cannot run a workshop on the Login Node. You need a dedicated environment.
- Open OnDemand (OOD): The "Killer App" for training.
  - Why: It gives users a "Jupyter" button in their web browser.
  - Magic: When they click "Launch," OOD secretly submits a Slurm job, waits for a GPU node, starts Jupyter there, and tunnels it back to the user. The user feels like they are on a laptop, but they are actually on an A100 GPU node.
- Magic Castle:
  - Tool: An open-source project that creates a temporary, full-featured HPC cluster in the cloud (AWS/Azure/GCP) just for the workshop.
  - Benefit: If a student accidentally deletes /etc, they only destroy a temporary node, not your production cluster.
2. The Curriculum: From Interactive to Batch
Day 1: The Environment (Stop using pip install)
- The Conflict: Python users love conda and pip. HPC admins hate them because they create millions of tiny files that kill the file system.
- The Solution: Apptainer (Singularity).
- Lab: "Building Your First Container."
  - Task: Take a standard Docker container (from NVIDIA NGC), convert it to a .sif file, and run it on the cluster (a sketch of the steps follows this list).
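A minimal sketch of the lab's core steps, wrapped in Python (via subprocess) so every example in this guide stays in one language; the `apptainer` commands are the real content. The NGC tag and `train.py` are placeholders for whatever your site recommends.

```python
"""Day 1 lab sketch: pull an NVIDIA NGC image as a single .sif file, then run
a script inside it with GPU passthrough. Image tag and script name are
placeholders; substitute your site's supported versions."""
import subprocess

NGC_IMAGE = "docker://nvcr.io/nvidia/pytorch:24.01-py3"  # placeholder tag
SIF_FILE = "pytorch.sif"

# Convert the Docker image into one big .sif file -- a single file instead of
# thousands of tiny ones, which is exactly what the shared file system wants.
subprocess.run(["apptainer", "pull", SIF_FILE, NGC_IMAGE], check=True)

# Run a training script inside the container; --nv exposes the host GPUs.
subprocess.run(
    ["apptainer", "exec", "--nv", SIF_FILE, "python", "train.py"],
    check=True,
)
```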
Day 2: Scaling Up (Distributed Data Parallel - DDP)
- The Concept: Moving from 1 GPU to 4 GPUs.
- The Tool: PyTorch DDP or PyTorch Lightning.
- Lab: "ResNet Resized."
  - Task: Run a training script on 1 GPU and record the speed. Change 3 lines of code to enable DDP (see the sketch after this list). Run it on 4 GPUs and watch the speed increase (and discuss why it isn't perfectly 4x faster).
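A minimal sketch of what those few lines look like in raw PyTorch DDP (PyTorch Lightning hides them entirely). It assumes a launch like `torchrun --nproc_per_node=4 train_ddp.py` on one GPU node; the linear model and random tensors are stand-ins for the lab's ResNet script.

```python
"""Single-GPU script turned into DDP: the three marked lines are the change.
Assumes launch via torchrun, which sets LOCAL_RANK for each process."""
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # (1) Join the process group torchrun sets up -- one process per GPU.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in model and dataset; in the lab this is the ResNet script.
    model = torch.nn.Linear(32, 10).cuda()
    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))

    # (2) Wrap the model so gradients are averaged across all GPUs.
    model = DDP(model, device_ids=[local_rank])

    # (3) Give each rank its own shard of the data.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle differently each epoch
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x.cuda()), y.cuda())
            loss.backward()  # DDP all-reduces gradients here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The discussion point about sub-4x scaling comes mostly from the gradient all-reduce inside `loss.backward()` and from any data-loading bottleneck.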
Day 3: The Frontier (LLMs & Model Parallelism)
- The Challenge: Models like Llama-3-70B are too big to fit in a single GPU's memory.
- The Solution: FSDP (Fully Sharded Data Parallel).
- Lab: "Fine-Tuning a Llama."
  - Task: Use Low-Rank Adaptation (LoRA) to fine-tune a large language model on a custom text dataset, sharding the model's parameters across multiple nodes (a sketch follows this list).
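A minimal sketch of how the two pieces fit together, assuming the `transformers` and `peft` libraries and a multi-node launch via torchrun or srun. The model name is a placeholder (real Llama-3-70B weights are gated), and a production run would also need an FSDP auto-wrap policy plus a tokenized dataset.

```python
"""Day 3 lab sketch: attach LoRA adapters (peft) and shard the base model
with PyTorch FSDP so no single GPU has to hold all 70B parameters.
Model name is a placeholder; the training loop is indicated in comments."""
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Placeholder checkpoint -- use a small open model for the first test run.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B", torch_dtype=torch.bfloat16
)

# LoRA: freeze the base weights and train only small low-rank adapter matrices.
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_cfg)

# FSDP: shard parameters, gradients, and optimizer state across every rank
# (and therefore across nodes). use_orig_params preserves the frozen/trainable
# split that LoRA created.
model = FSDP(model, device_id=local_rank, use_orig_params=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Training loop (per batch of tokenized text):
#   loss = model(input_ids=batch, labels=batch).loss
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```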
3. Critical "Soft Skills" for AI Users
- Checkpointing: "Your job will die after 24 hours. Does your code save its state every hour?" (A minimal save/resume pattern is sketched below.)
- I/O Patterns: "Do not read 1 million JPEG files one by one. Pack them into a WebDataset (Tar) file so the storage system doesn't choke." (See the second sketch below.)
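A minimal sketch of the checkpointing habit in plain PyTorch; the file name, the hourly interval, and the atomic-rename trick are conventions rather than requirements.

```python
"""Checkpointing sketch: save enough state at regular intervals that the job
can be resubmitted after the 24-hour wall clock kills it."""
import os
import time
import torch

CKPT = "checkpoint.pt"
SAVE_EVERY_SEC = 3600  # once an hour


def save_checkpoint(model, optimizer, step):
    # Write atomically: dump to a temp file, then rename over the old one,
    # so a job killed mid-write never corrupts the only copy.
    tmp = CKPT + ".tmp"
    torch.save(
        {"model": model.state_dict(), "optim": optimizer.state_dict(), "step": step},
        tmp,
    )
    os.replace(tmp, CKPT)


def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT):
        return 0  # fresh start
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return state["step"]  # resume from here


# Inside the training loop:
#   start_step = load_checkpoint(model, optimizer)
#   last_save = time.time()
#   for step in range(start_step, total_steps):
#       ...train...
#       if time.time() - last_save > SAVE_EVERY_SEC:
#           save_checkpoint(model, optimizer, step)
#           last_save = time.time()
```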
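And a minimal sketch of the I/O pattern with the `webdataset` library: pack once into large tar shards, then stream them during training. The shard paths, shard size, and the tiny fake-JPEG generator are placeholders.

```python
"""I/O pattern sketch: pack many small JPEGs into a few large tar shards with
ShardWriter, then stream them at training time so storage sees sequential
reads of big files instead of a million tiny opens."""
import os
from io import BytesIO

from PIL import Image
import webdataset as wds


def fake_samples():
    # Stand-in for your real data: yields (jpeg_bytes, integer_label) pairs.
    buf = BytesIO()
    Image.new("RGB", (8, 8)).save(buf, format="JPEG")
    for i in range(10):
        yield buf.getvalue(), i % 2


# Packing (done once, not per job): ~1 GB shards instead of a million files.
os.makedirs("shards", exist_ok=True)
with wds.ShardWriter("shards/train-%06d.tar", maxsize=1e9) as sink:
    for idx, (jpeg_bytes, label) in enumerate(fake_samples()):
        sink.write({"__key__": f"sample{idx:08d}", "jpg": jpeg_bytes, "cls": label})

# Training time: stream the shards sequentially and decode on the fly.
dataset = (
    wds.WebDataset("shards/train-{000000..000000}.tar")
    .decode("torchrgb")  # JPEG bytes -> float image tensors
    .to_tuple("jpg", "cls")
)
loader = wds.WebLoader(dataset, batch_size=4, num_workers=2)
for images, labels in loader:
    pass  # feed the GPU from here
```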
4. Key Applications & Tools

| Category | Tool | Usage |
| --- | --- | --- |
| Portal | Open OnDemand | The standard interface for workshops. Allows "Zero-Install" participation via Chrome/Firefox. |
| Container | Apptainer | The required format for running Docker containers safely on shared HPC systems. |
| Framework | PyTorch Lightning | The best teaching tool for scaling. It abstracts away the complex distributed engineering code so students focus on the ML. |
| Dataset | WebDataset | A library for high-performance I/O. Essential for teaching users how to feed GPUs fast enough. |