Data Management Solutions in HPC address the single biggest bottleneck in modern science: Data Gravity. Computing is fast, but moving data is slow. If you have 5 Petabytes of simulation data, you cannot simply "copy-paste" it to the cloud or another server: at a sustained 1 Gbit/s the transfer takes over a year, and even a dedicated 10 Gbit/s link needs about a month and a half of uninterrupted copying.
Effective Data Management is about Tiering (placing data on the right hardware at the right time) and Governance (knowing what you have so you don't lose it).
Here is a detailed breakdown of the tiered architecture, the concept of HSM (Hierarchical Storage Management), and the toolset.
1. The Architecture: The Storage Layer Cake
You cannot afford to keep all of your data on the fastest storage; it is economically impossible. Instead, we implement a Multi-Tiered Architecture: hot, frequently accessed data sits on fast, expensive media, while cold data sinks to slower, cheaper media.
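As a concrete illustration of the layer cake, the sketch below encodes a placement policy in plain Python. The tier names, relative costs, and idle thresholds are assumptions made for the example, not values from any particular site.

```python
"""Minimal sketch of a tier-placement policy (illustrative only)."""
from dataclasses import dataclass

@dataclass
class Tier:
    name: str            # e.g. "scratch", "project", "archive"
    cost_per_tb: float   # relative cost of keeping 1 TB on this tier
    max_idle_days: int   # data idler than this belongs on a lower tier

# Example layer cake: fast/expensive at the top, slow/cheap at the bottom.
TIERS = [
    Tier("scratch (parallel FS / NVMe)", cost_per_tb=100.0, max_idle_days=30),
    Tier("project (capacity disk)",      cost_per_tb=20.0,  max_idle_days=180),
    Tier("archive (tape)",               cost_per_tb=2.0,   max_idle_days=10**6),
]

def place(days_since_last_access: int) -> Tier:
    """Pick the hottest tier whose idle threshold the data still satisfies."""
    for tier in TIERS:
        if days_since_last_access <= tier.max_idle_days:
            return tier
    return TIERS[-1]

if __name__ == "__main__":
    for idle in (3, 90, 400):
        print(f"{idle:>4} days idle -> {place(idle).name}")
```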
2. The Strategy: Hierarchical Storage Management (HSM)
Managing these tiers manually is a disaster: users fill up the expensive Scratch drive and refuse to move their files. HSM software automates this, migrating cold files down the tiers on a policy schedule and recalling them when they are accessed again.
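The sketch below shows the core idea of an HSM-style demotion sweep: walk the scratch tree and move anything that has not been touched recently down to the archive tier. Real HSM products hook directly into the file system; the paths and the 30-day threshold here are assumptions for illustration only.

```python
"""Minimal sketch of an HSM-style "demote" sweep (illustrative only)."""
import os
import shutil
import time

SCRATCH = "/scratch/project_x"   # hypothetical fast tier
ARCHIVE = "/archive/project_x"   # hypothetical cheap tier
MAX_IDLE_DAYS = 30

def demote_cold_files(dry_run: bool = True) -> None:
    cutoff = time.time() - MAX_IDLE_DAYS * 86400
    for root, _dirs, files in os.walk(SCRATCH):
        for name in files:
            src = os.path.join(root, name)
            # Demote files that have not been accessed since the cutoff.
            if os.stat(src).st_atime < cutoff:
                rel = os.path.relpath(src, SCRATCH)
                dst = os.path.join(ARCHIVE, rel)
                print(f"demote {src} -> {dst}")
                if not dry_run:
                    os.makedirs(os.path.dirname(dst), exist_ok=True)
                    shutil.move(src, dst)

if __name__ == "__main__":
    demote_cold_files(dry_run=True)  # audit what would move before committing
```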
3. Data Governance: Finding the Needle
HPC clusters often hold 500 million files, so finding "that one simulation from 2019" is impossible with standard tools like ls and find. Governance means tagging data with searchable metadata as it is created, so it can be found again years later.
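Here is a minimal sketch of the catalogue idea, using SQLite purely for illustration; production sites would use iRODS or a similar system, and the schema, tags, and paths below are assumptions made for the example.

```python
"""Minimal sketch of a searchable file metadata catalogue (illustrative only)."""
import sqlite3

con = sqlite3.connect(":memory:")  # use a persistent database file in practice
con.execute("""CREATE TABLE files (
    path    TEXT PRIMARY KEY,
    project TEXT,
    year    INTEGER,
    kind    TEXT
)""")

# Ingest: tag files with metadata as they are produced.
con.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?)",
    [
        ("/archive/projx/run_0421/out.h5", "projx", 2019, "simulation"),
        ("/archive/projx/run_0987/out.h5", "projx", 2023, "simulation"),
        ("/archive/projy/mesh.vtk",        "projy", 2019, "mesh"),
    ],
)

# Query: "that one simulation from 2019" becomes a one-line lookup.
rows = con.execute(
    "SELECT path FROM files WHERE kind = 'simulation' AND year = 2019"
).fetchall()
print(rows)  # [('/archive/projx/run_0421/out.h5',)]
```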
4. Key Applications & Tools
| Category | Tool | Usage |
| --- | --- | --- |
| Data Movement | Globus | The standard for moving massive datasets between institutions (e.g., University A to University B) securely and reliably (it auto-retries if the network fails). |
| Governance | iRODS | "Integrated Rule-Oriented Data System." Middleware that enforces policies (e.g., "Encrypt all data tagged 'medical'"). |
| Archive | OpenArchive / CTA | Manages the robotic tape libraries (CERN Tape Archive). |
| Sync | Rclone | The "Swiss Army Knife" for syncing HPC data to Cloud Storage (S3, Dropbox, Google Drive). |
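To make the Sync row concrete, here is a minimal sketch that drives Rclone from Python. It assumes a remote named "s3backup" has already been configured with rclone config; the remote name and paths are placeholders, not part of the original text.

```python
"""Minimal sketch of syncing an HPC directory to cloud storage via rclone."""
import subprocess

def sync_to_cloud(local_dir: str, remote_path: str, dry_run: bool = True) -> None:
    cmd = ["rclone", "sync", local_dir, remote_path, "--progress"]
    if dry_run:
        cmd.append("--dry-run")  # show what would change without copying anything
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Hypothetical remote "s3backup" pointing at an S3 bucket.
    sync_to_cloud("/archive/projx", "s3backup:my-bucket/projx")
```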