Data Analysis Best Practices training is about curing the "Laptop Syndrome." Most data scientists are used to working on a laptop: load a CSV into RAM, run Pandas, save a result. When they try the same workflow on an HPC cluster against 10 TB of data, the job runs out of memory and dies. The training must shift their mindset from "load everything" to "stream/chunk everything" and from "sequential" to "parallel."
Here is the detailed breakdown of the best-practices curriculum, the "I/O Golden Rules," and the tool migration strategy.
1. The Mindset Shift: Laptop vs. Cluster

The first session must explain why the cluster breaks if you treat it like a big laptop.
2. The Four Pillars of Best Practices

A. I/O Hygiene (Don't Kill the Filesystem)
B. Memory Management (The OOM Killer)
C. Scaling Strategy (Parallelism; see the sketch after this list)
D. Reproducibility (Data Versioning)
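
A minimal illustration of pillar C, assuming a directory of per-day CSV files and a hypothetical summarise() helper: the sequential per-file loop becomes a pool of worker processes.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import pandas as pd


def summarise(path: Path) -> float:
    # One independent unit of work per file: an ideal shape for parallelism.
    df = pd.read_csv(path)
    return df["value"].mean()


if __name__ == "__main__":
    files = sorted(Path("data").glob("*.csv"))
    # Fan the files out across worker processes instead of one sequential loop.
    with ProcessPoolExecutor(max_workers=8) as pool:
        means = list(pool.map(summarise, files))
    print(dict(zip([f.name for f in files], means)))
```

On a cluster the same pattern scales further with scheduler-aware tools such as Dask, but the mental model of independent work units is identical.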
3. The Tool Migration Path

Show teams the direct "HPC Equivalent" of their favorite tools. Short sketches of each migration follow the table.
| Laptop Tool | HPC / Scalable Tool | Why Switch? |
| --- | --- | --- |
| Pandas | Dask / Polars | Pandas is single-core. Dask distributes the dataframe across 100 cores; Polars is multi-threaded and much faster. |
| CSV / JSON | Parquet / Avro | Binary formats support compression, and Parquet adds column pruning and predicate pushdown (reading only the columns and row groups a query needs). |
| Matplotlib | Datashader | Matplotlib crashes with millions of points. Datashader renders billions of points by rasterizing them on the server side. |
| Pickle | Joblib | Pickle is insecure and slow for large arrays. Joblib is optimized for NumPy array serialization. |
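
A minimal sketch of the Pandas-to-Dask/Polars move; the file pattern events-*.csv and the "user"/"value" columns are hypothetical:

```python
import dask.dataframe as dd
import polars as pl

# Dask keeps the familiar Pandas-style API but partitions the dataframe,
# so the groupby runs across many cores (or many nodes with a Dask cluster).
ddf = dd.read_csv("events-*.csv")
per_user = ddf.groupby("user")["value"].mean().compute()

# Polars stays on one machine but is multi-threaded and lazy:
# scan_csv builds a query plan instead of materialising the whole file.
# (Older Polars releases spell group_by as groupby.)
result = (
    pl.scan_csv("events-0.csv")
    .group_by("user")
    .agg(pl.col("value").mean())
    .collect()
)
```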
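
A sketch of the format switch, assuming the same hypothetical events.csv and the pyarrow engine; the conversion happens once, and later reads touch only what the query needs:

```python
import pandas as pd

# One-time conversion: CSV -> compressed, columnar Parquet.
pd.read_csv("events.csv").to_parquet("events.parquet", compression="zstd")

# Later reads deserialize only the requested columns, and the filter is
# pushed down so non-matching row groups can be skipped entirely.
df = pd.read_parquet(
    "events.parquet",
    columns=["user", "value"],
    filters=[("value", ">", 0)],
)
```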
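
For the plotting row, a sketch of the Datashader pattern, with a synthetic dataframe standing in for real data:

```python
import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=10_000_000),
                   "y": rng.normal(size=10_000_000)})

# Aggregate points onto a fixed-size raster, then shade it: the cost is
# bounded by the canvas size, not by the number of points.
canvas = ds.Canvas(plot_width=800, plot_height=600)
agg = canvas.points(df, "x", "y")
img = tf.shade(agg)
img.to_pil().save("scatter.png")
```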
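
Finally, the Pickle-to-Joblib swap is nearly a one-line change; the array below is synthetic:

```python
import numpy as np
from joblib import dump, load

arr = np.random.default_rng(0).normal(size=(10_000, 1_000))

# joblib.dump handles large NumPy buffers efficiently and can compress on the fly.
dump(arr, "features.joblib", compress=3)
restored = load("features.joblib")
assert np.allclose(arr, restored)
```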