Data Management Solutions in HPC address one of the biggest bottlenecks in modern science: Data Gravity, the tendency of large datasets to stay where they were created because moving them is so expensive.

Computing is fast, but moving data is slow. If you have 5 petabytes of simulation data, you cannot simply "copy-paste" it to the cloud or another server: even over a dedicated 10 Gbps link, the transfer would take roughly a month and a half.
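To make that concrete, here is the back-of-the-envelope arithmetic (the dedicated 10 Gbps link and full utilisation are assumptions; shared, real-world links are slower):

```python
# Back-of-the-envelope: how long does 5 PB take over a 10 Gbps link?
data_bits = 5e15 * 8      # 5 petabytes expressed in bits
link_bps = 10e9           # assumed dedicated 10 Gbps link at full utilisation
seconds = data_bits / link_bps
print(f"transfer time: {seconds / 86400:.0f} days")  # ~46 days, and real links are slower
```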

Effective Data Management is about Tiering (placing data on the right hardware at the right time) and Governance (knowing what you have so you don't lose it).

Here is the detailed breakdown of the tiered architecture, the concept of HSM (Hierarchical Storage Management), and the core toolset.

1. The Architecture: The Storage Layer Cake

You cannot afford to store all your data on the fastest storage media; it is economically impossible at petabyte scale. Instead, we implement a Multi-Tiered Architecture (a rough sketch of the trade-offs follows this list):

  1. Tier 0: Scratch (The Formula 1 Car): blazing-fast parallel flash for data that running jobs are actively reading and writing; small, expensive per terabyte, not backed up, and purged on a regular schedule.
  2. Tier 1: Project/Home (The Family Van): capacity disk for active projects, code, and results still in use; slower, but backed up.
  3. Tier 2: Archive (The Cargo Ship): tape or cold object storage for finished datasets; enormous and cheap, but retrieval can take minutes to hours.
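The exact numbers differ from site to site, but the trade-offs look roughly like this (a minimal, illustrative sketch; the media types and cost labels are generalisations, not any specific vendor's figures):

```python
# Ballpark characteristics of each tier; actual figures vary widely by site.
TIERS = {
    "tier0_scratch": {
        "media": "parallel NVMe flash",
        "typical_use": "files being actively read/written by running jobs",
        "backed_up": False,           # purged automatically after a retention window
        "relative_cost_per_tb": "highest",
    },
    "tier1_project": {
        "media": "capacity disk (HDD)",
        "typical_use": "active project data, code, results in progress",
        "backed_up": True,
        "relative_cost_per_tb": "moderate",
    },
    "tier2_archive": {
        "media": "tape or cold object storage",
        "typical_use": "finished datasets kept for years",
        "backed_up": True,            # often dual-copy across libraries or sites
        "relative_cost_per_tb": "lowest",
    },
}
```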

2. The Strategy: Hierarchical Storage Management (HSM)

Managing these tiers manually is a disaster. Users will fill up the expensive Scratch drive and refuse to move files.

HSM software automates this: it tracks file age and access patterns, migrates cold files down to cheaper tiers according to policy, and recalls them when they are needed again, often transparently via a stub left in the original location.
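Real HSM implementations (for example, Lustre HSM or HPSS-backed systems) do this transparently inside the filesystem; the standalone sketch below only illustrates the idea of an age-based migration policy, and the paths and 30-day threshold are placeholders:

```python
import shutil
import time
from pathlib import Path

# Illustrative paths and policy threshold; real deployments differ.
SCRATCH = Path("/scratch")
ARCHIVE = Path("/archive")
MAX_IDLE_DAYS = 30  # migrate anything untouched for 30+ days

def migrate_cold_files() -> None:
    """Move files that have not been accessed recently from the scratch tier
    to the archive tier, leaving a symlink behind as a crude 'stub'."""
    cutoff = time.time() - MAX_IDLE_DAYS * 86400
    for path in SCRATCH.rglob("*"):
        if not path.is_file() or path.is_symlink():
            continue
        if path.stat().st_atime < cutoff:                  # last access older than cutoff
            dest = ARCHIVE / path.relative_to(SCRATCH)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(dest))              # push the data down a tier
            path.symlink_to(dest)                          # old path still resolves

if __name__ == "__main__":
    migrate_cold_files()
```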

3. Data Governance: Finding the Needle

HPC clusters often hold 500 million files or more, and finding "that one simulation from 2019" is effectively impossible with ls and grep. The fix is to attach searchable metadata (project, owner, instrument, creation date) to datasets as they are written, and to query a catalog instead of walking the filesystem.
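A production site would use middleware such as iRODS for this (see the tools table below); as a minimal illustration of the idea, here is a sketch of a metadata catalog built on SQLite, with hypothetical field names and paths:

```python
import sqlite3

# Minimal metadata catalog sketch; the schema and example values are illustrative.
conn = sqlite3.connect("catalog.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS datasets (
        path TEXT PRIMARY KEY,
        project TEXT,
        owner TEXT,
        created TEXT,          -- ISO date, e.g. '2019-06-12'
        tags TEXT
    )
""")

def register(path, project, owner, created, tags=""):
    """Record a dataset's metadata at write time, not years later."""
    conn.execute("INSERT OR REPLACE INTO datasets VALUES (?, ?, ?, ?, ?)",
                 (path, project, owner, created, tags))
    conn.commit()

# Tag a simulation output as it lands on the project tier.
register("/project/climate/run_0421", "climate", "alice", "2019-06-12", "cmip6,ocean")

# Later: find "that one simulation from 2019" in milliseconds, not days.
rows = conn.execute(
    "SELECT path FROM datasets WHERE project = ? AND created LIKE ?",
    ("climate", "2019-%"),
).fetchall()
print(rows)
```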

4. Key Applications & Tools

| Category | Tool | Usage |
| --- | --- | --- |
| Data Movement | Globus | The standard for moving massive datasets between institutions (e.g., University A to University B) securely and reliably; it automatically retries if the network fails. (See the Python sketch after this table.) |
| Governance | iRODS | "Integrated Rule-Oriented Data System." Middleware that enforces policies (e.g., "Encrypt all data tagged 'medical'"). |
| Archive | OpenArchive / CTA | Manage the robotic tape libraries (CTA is the CERN Tape Archive). |
| Sync | Rclone | The "Swiss Army Knife" for syncing HPC data to cloud storage (S3, Dropbox, Google Drive). |