Data Analysis Best Practices training is about curing the "Laptop Syndrome." Most data scientists are used to working on a laptop: load a CSV into RAM, run Pandas, save a result. When they try the same workflow on an HPC cluster against 10 TB of data, the job runs out of memory and dies. The training must shift their mindset from "load everything" to "stream/chunk everything" and from "sequential" to "parallel."
Here is the detailed breakdown of the best-practices curriculum, the "I/O Golden Rules," and the tool migration strategy.
1. The Mindset Shift: Laptop vs. Cluster

The first session must explain why the cluster breaks if you treat it like a big laptop.
2. The Four Pillars of Best Practices

A. I/O Hygiene (Don't Kill the Filesystem)
B. Memory Management (The OOM Killer)
C. Scaling Strategy (Parallelism; see the sketch after this list)
D. Reproducibility (Data Versioning)
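
A minimal illustration of pillar C, assuming a directory of per-day CSV files and a hypothetical summarise() helper: the sequential per-file loop becomes a pool of worker processes.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import pandas as pd


def summarise(path: Path) -> float:
    # One independent unit of work per file: an ideal shape for parallelism.
    df = pd.read_csv(path)
    return df["value"].mean()


if __name__ == "__main__":
    files = sorted(Path("data").glob("*.csv"))
    # Fan the files out across worker processes instead of one sequential loop.
    with ProcessPoolExecutor(max_workers=8) as pool:
        means = list(pool.map(summarise, files))
    print(dict(zip([f.name for f in files], means)))
```

On a cluster the same pattern scales further with scheduler-aware tools such as Dask, but the mental model of independent work units is identical.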
3. The Tool Migration Path

Show teams the direct "HPC Equivalent" of their favorite tools. Short sketches of each migration follow the table.
| Laptop Tool | HPC / Scalable Tool | Why Switch? |
| --- | --- | --- |
| Pandas | Dask / Polars | Pandas is single-core. Dask distributes the dataframe across 100 cores; Polars is multi-threaded and much faster. |
| CSV / JSON | Parquet / Avro | Binary formats support compression, and Parquet adds column pruning and predicate pushdown (reading only the columns and row groups a query needs). |
| Matplotlib | Datashader | Matplotlib crashes with millions of points. Datashader renders billions of points by rasterizing them on the server side. |
| Pickle | Joblib | Pickle is insecure and slow for large arrays. Joblib is optimized for NumPy array serialization. |
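
A minimal sketch of the Pandas-to-Dask/Polars move; the file pattern events-*.csv and the "user"/"value" columns are hypothetical:

```python
import dask.dataframe as dd
import polars as pl

# Dask keeps the familiar Pandas-style API but partitions the dataframe,
# so the groupby runs across many cores (or many nodes with a Dask cluster).
ddf = dd.read_csv("events-*.csv")
per_user = ddf.groupby("user")["value"].mean().compute()

# Polars stays on one machine but is multi-threaded and lazy:
# scan_csv builds a query plan instead of materialising the whole file.
# (Older Polars releases spell group_by as groupby.)
result = (
    pl.scan_csv("events-0.csv")
    .group_by("user")
    .agg(pl.col("value").mean())
    .collect()
)
```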
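
A sketch of the format switch, assuming the same hypothetical events.csv and the pyarrow engine; the conversion happens once, and later reads touch only what the query needs:

```python
import pandas as pd

# One-time conversion: CSV -> compressed, columnar Parquet.
pd.read_csv("events.csv").to_parquet("events.parquet", compression="zstd")

# Later reads deserialize only the requested columns, and the filter is
# pushed down so non-matching row groups can be skipped entirely.
df = pd.read_parquet(
    "events.parquet",
    columns=["user", "value"],
    filters=[("value", ">", 0)],
)
```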
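
For the plotting row, a sketch of the Datashader pattern, with a synthetic dataframe standing in for real data:

```python
import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=10_000_000),
                   "y": rng.normal(size=10_000_000)})

# Aggregate points onto a fixed-size raster, then shade it: the cost is
# bounded by the canvas size, not by the number of points.
canvas = ds.Canvas(plot_width=800, plot_height=600)
agg = canvas.points(df, "x", "y")
img = tf.shade(agg)
img.to_pil().save("scatter.png")
```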
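
Finally, the Pickle-to-Joblib swap is nearly a one-line change; the array below is synthetic:

```python
import numpy as np
from joblib import dump, load

arr = np.random.default_rng(0).normal(size=(10_000, 1_000))

# joblib.dump handles large NumPy buffers efficiently and can compress on the fly.
dump(arr, "features.joblib", compress=3)
restored = load("features.joblib")
assert np.allclose(arr, restored)
```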