Scalability and Flexibility Planning is the engineering discipline of designing an HPC system that can grow without breaking.
In standard IT, "scaling" often just means "buying a bigger server" (Vertical Scaling). In Supercomputing, physics limits how big a single server can be. Therefore, HPC relies on Horizontal Scaling (adding more nodes).
However, if you add 1,000 nodes to a cluster that was designed for 100, the network will choke, the storage will freeze, and the power breakers will trip. Scalability planning prevents this.
Here is the detailed breakdown of the strategy, the "Scale-Up vs. Scale-Out" concepts, and the infrastructure considerations.
1. The Two Types of Scalability
To plan for the future, you must understand the two directions of growth:
- Scale-Out (Horizontal):
  - Concept: Adding more compute nodes to the cluster.
  - Use Case: Running more simulations at the same time (Throughput) or running one massive simulation across more cores (Parallelism).
  - Bottleneck: The Network (Interconnect). If the switch topology isn't "Non-Blocking," adding nodes actually slows down the system.
- Scale-Up (Vertical):
  - Concept: Making individual nodes stronger (e.g., adding GPUs or more RAM to existing servers).
  - Use Case: AI training or huge in-memory databases.
  - Bottleneck: Power & Cooling. A standard air-cooled rack handles roughly 10 kW; a rack full of GPUs might need 60 kW (a rack-count sketch follows this list).
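To make the power-density point concrete, here is a minimal Python sketch comparing how many racks an expansion needs under a 10 kW air-cooled budget versus a 60 kW liquid-cooled budget. The per-node wattages are illustrative assumptions, not vendor figures.

```python
# Back-of-the-envelope rack-count check for a node expansion.
# All wattages below are illustrative assumptions, not measurements.

import math

def racks_needed(node_count: int, watts_per_node: float, rack_budget_w: float) -> int:
    """Racks required if power (not space) is the limiting factor."""
    nodes_per_rack = max(1, int(rack_budget_w // watts_per_node))
    return math.ceil(node_count / nodes_per_rack)

if __name__ == "__main__":
    cpu_node_w = 800        # assumed dual-socket CPU node
    gpu_node_w = 6_000      # assumed 8-GPU training node
    print("100 CPU nodes, 10 kW racks:", racks_needed(100, cpu_node_w, 10_000))   # 9
    print("100 GPU nodes, 10 kW racks:", racks_needed(100, gpu_node_w, 10_000))   # 100
    print("100 GPU nodes, 60 kW racks:", racks_needed(100, gpu_node_w, 60_000))   # 10
```

The same 100 nodes fit in 10 racks or sprawl across 100, purely depending on how much power and cooling each rack position can deliver.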
2. Strategic Planning Layers
A. Infrastructure Readiness (The Room)
You cannot plug in a new rack if you don't have the power.
- Dark Power/Cooling: We design data centers with "Day 1" capacity (e.g., 500 kW) but install the piping and breakers for "Day 5" capacity (e.g., 2 MW). This allows you to roll in new hardware instantly without construction work (a headroom check follows this list).
- Floor Weight: Future liquid-cooled racks are heavy. We ensure the raised floor can support 3,000 lbs per rack, even if current racks are light.
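Here is a minimal sketch of the "Day 1 vs. Day 5" headroom check, using the hypothetical 500 kW / 2 MW figures above as placeholders for a real facility's numbers. The point is simply that every expansion is validated against the envelope that was provisioned up front.

```python
# Capacity-headroom check: can a proposed expansion roll in without construction?
# The figures are hypothetical placeholders for a real facility's numbers.

DAY5_POWER_KW = 2_000      # breakers and piping installed on day 1
DAY1_LOAD_KW = 500         # load actually drawing power today

def expansion_fits(new_racks: int, kw_per_rack: float,
                   current_kw: float = DAY1_LOAD_KW,
                   provisioned_kw: float = DAY5_POWER_KW) -> bool:
    """True if the new racks fit under the already-provisioned envelope."""
    return current_kw + new_racks * kw_per_rack <= provisioned_kw

print(expansion_fits(20, 60))   # 500 + 1,200 = 1,700 kW -> fits under 2 MW -> True
print(expansion_fits(30, 60))   # 500 + 1,800 = 2,300 kW -> exceeds 2 MW -> False
```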
B. Network Topology (The Spine)
- Fat Tree Pruning: We design the core network switches (the Spine) with empty ports.
- Strategy: On Day 1, you might only connect 100 nodes, but the Spine is sized for 500. When you expand, you just plug in new leaf switches; you don't have to rip out the cabling backbone (a port-count sketch follows this list).
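As a rough illustration of why the spine is sized up front, here is a Python sketch of two-level non-blocking leaf/spine arithmetic. The 64-port switch radix and the 100/500 node counts are assumptions carried over from the example above, not a recommendation.

```python
# Two-level (leaf/spine) non-blocking fat-tree sizing.
# radix = ports per switch; half of each leaf's ports face nodes, half face spines.

import math

def fat_tree_plan(nodes: int, radix: int = 64) -> dict:
    down_per_leaf = radix // 2                   # node-facing ports per leaf
    leaves = math.ceil(nodes / down_per_leaf)
    uplinks = leaves * down_per_leaf             # one uplink per node-facing port
    spines = math.ceil(uplinks / radix)
    return {"leaves": leaves, "spines": spines,
            "max_nodes": (radix ** 2) // 2}      # ceiling for a 2-level non-blocking tree

# Size the spine tier for the Day-5 node count so expansion only adds leaf switches.
day1 = fat_tree_plan(100)
day5 = fat_tree_plan(500)
print("spines to install on Day 1:", day5["spines"])   # 8 spines, mostly empty ports
print("leaves needed on Day 1:", day1["leaves"])        # 4
print("leaves needed at Day 5:", day5["leaves"])        # 16 -> plug into spare spine ports
```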
C. Storage Namespaces
- Global Namespace: Users should never have to know that you added a new storage array. They just see /scratch getting bigger.
- Strategy: We use Parallel File Systems (Lustre/GPFS) where adding capacity is as simple as adding a new "Object Storage Target" (OST) to the live pool (a toy model follows this list).
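Below is a toy Python model of the global-namespace idea: one mount point whose capacity is the sum of the OSTs behind it. It does not talk to Lustre or GPFS; the names and sizes are purely illustrative.

```python
# Toy model of a global namespace: users see one mount point whose capacity
# is the sum of the Object Storage Targets (OSTs) behind it.

from dataclasses import dataclass, field

@dataclass
class ScratchNamespace:
    mount: str = "/scratch"
    ost_tb: list[float] = field(default_factory=list)

    def add_ost(self, capacity_tb: float) -> None:
        """Rolling in a new storage array is just another OST in the live pool."""
        self.ost_tb.append(capacity_tb)

    @property
    def capacity_tb(self) -> float:
        return sum(self.ost_tb)

fs = ScratchNamespace(ost_tb=[500.0] * 4)   # Day-1 pool: 4 x 500 TB OSTs
print(fs.mount, fs.capacity_tb, "TB")       # 2000.0 TB
fs.add_ost(500.0)                           # expansion: one more OST joins the pool
print(fs.mount, fs.capacity_tb, "TB")       # 2500.0 TB -- users just see /scratch grow
```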
3. Flexibility: Cloud Bursting
Sometimes, "Scalability" means handling a temporary spike that is too big to buy hardware for.
- Hybrid Architecture: We configure the scheduler (Slurm) so that it sees the Cloud (AWS/Azure) as just another partition.
- Result: When the on-prem cluster is full, the system automatically spins up 1,000 cloud nodes, runs the job, and shuts them down. Effectively unlimited scalability, zero permanent footprint (a sketch of the resume hook follows).
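To give a sense of how "the cloud as just another partition" gets wired up, here is a hedged Python sketch in the spirit of a Slurm power-save ResumeProgram. The region, the Name-tag mapping, and the pre-created instances are assumptions; a production setup would follow Slurm's documented elastic/cloud scheduling configuration rather than this simplified script.

```python
#!/usr/bin/env python3
# Sketch of a Slurm ResumeProgram that powers on cloud nodes when the
# on-prem partition is full. Hypothetical tags and region; not a drop-in script.

import subprocess
import sys

import boto3  # assumes AWS; Azure would use its own SDK

def expand_hostlist(hostlist: str) -> list[str]:
    """Expand Slurm hostlist syntax (e.g. 'cloud[001-100]') into node names."""
    out = subprocess.run(["scontrol", "show", "hostnames", hostlist],
                         capture_output=True, text=True, check=True)
    return out.stdout.split()

def start_cloud_nodes(node_names: list[str]) -> None:
    ec2 = boto3.client("ec2", region_name="us-east-1")   # assumed region
    # Assumption: each Slurm node name matches an EC2 instance's Name tag.
    resp = ec2.describe_instances(
        Filters=[{"Name": "tag:Name", "Values": node_names}])
    ids = [inst["InstanceId"]
           for res in resp["Reservations"]
           for inst in res["Instances"]]
    if ids:
        ec2.start_instances(InstanceIds=ids)

if __name__ == "__main__":
    # Slurm passes the hostlist of nodes to resume as the first argument.
    start_cloud_nodes(expand_hostlist(sys.argv[1]))
```

A matching SuspendProgram would stop the same instances once Slurm marks the nodes idle, which is what keeps the permanent footprint at zero.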