Conducting a comparative performance analysis between HPC systems requires moving beyond "peak theoretical" numbers to empirical, workload-specific data. To identify the optimal configuration, you must evaluate how different architectures (CPU vs. GPU, Ethernet vs. InfiniBand, Shared vs. Parallel storage) handle your specific scientific kernels.

Here is a structured methodology for conducting a detailed comparative analysis.


1. Defining the Benchmarking Suite

A valid comparison requires a tiered approach, starting from synthetic hardware tests and moving to real-world application performance.


2. Analysis Framework: The Roofline Model

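One of the most effective ways to compare systems is the Roofline Model. It plots Attainable Performance (GFLOP/s) against Arithmetic Intensity (floating-point operations per byte of memory traffic), so each kernel is bounded either by a system's peak compute rate or by its memory bandwidth.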
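As a minimal sketch of the roofline bound, the attainable performance of a kernel is the smaller of the peak compute rate and the product of memory bandwidth and arithmetic intensity. The system names, peak figures, and kernel intensities below are hypothetical placeholders, not measurements; substitute values measured on your own hardware.

```python
def attainable_gflops(peak_gflops: float, mem_bw_gbs: float, intensity: float) -> float:
    """Roofline bound: min(peak compute, memory bandwidth * arithmetic intensity)."""
    return min(peak_gflops, mem_bw_gbs * intensity)


def ridge_point(peak_gflops: float, mem_bw_gbs: float) -> float:
    """Arithmetic intensity (FLOP/byte) above which a kernel becomes compute-bound."""
    return peak_gflops / mem_bw_gbs


# Hypothetical systems (assumed numbers, for illustration only).
SYSTEMS = {
    "cpu_node": {"peak_gflops": 3000.0, "mem_bw_gbs": 300.0},
    "gpu_node": {"peak_gflops": 20000.0, "mem_bw_gbs": 2000.0},
}

# Hypothetical kernels with assumed arithmetic intensities in FLOP/byte.
KERNELS = {"stream_like": 0.125, "stencil_like": 0.5, "dense_matmul_like": 50.0}


def roofline_table(systems: dict, kernels: dict) -> dict:
    """Return {system: {kernel: (bound_gflops, 'memory' or 'compute')}}."""
    table = {}
    for sys_name, spec in systems.items():
        ridge = ridge_point(spec["peak_gflops"], spec["mem_bw_gbs"])
        table[sys_name] = {}
        for kernel_name, intensity in kernels.items():
            bound = attainable_gflops(spec["peak_gflops"], spec["mem_bw_gbs"], intensity)
            limiter = "memory" if intensity < ridge else "compute"
            table[sys_name][kernel_name] = (bound, limiter)
    return table


if __name__ == "__main__":
    for sys_name, rows in roofline_table(SYSTEMS, KERNELS).items():
        for kernel_name, (bound, limiter) in rows.items():
            print(f"{sys_name:10s} {kernel_name:18s} {bound:10.1f} GFLOP/s ({limiter}-bound)")
```

Comparing a kernel's measured GFLOP/s against this bound on each system shows how much headroom remains and whether more bandwidth or more compute would help.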


3. Key Comparison Variables

When identifying the "optimal" configuration, compare these three pillars:

| Variable | Metrics to Compare | Optimal Configuration Indicator |
| --- | --- | --- |
| Compute Density | Time-to-Solution, Energy-per-Solution (Joules) | The system that completes the task with the lowest wall-clock time and power draw. |
| Interconnect | Injection Bandwidth, Bisection Bandwidth | High efficiency when scaling from 2 to 128 nodes; look for the "Scaling Efficiency" curve to remain above 80%. |
| Storage I/O | Metadata Ops/sec, Sustained Throughput | Look for minimal "I/O Wait" times during large-scale checkpointing. |
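
Two of these indicators reduce to simple arithmetic. The sketch below assumes strong scaling (fixed problem size) with the 2-node run as the baseline, and treats energy-per-solution as average power multiplied by wall-clock time. The timings are invented for illustration, and the function names are hypothetical.

```python
def strong_scaling_efficiency(base_nodes: int, base_time: float,
                              nodes: int, time: float) -> float:
    """Measured speedup divided by ideal speedup, relative to the baseline run."""
    speedup = base_time / time
    ideal_speedup = nodes / base_nodes
    return speedup / ideal_speedup


def largest_efficient_scale(timings: dict, threshold: float = 0.8) -> int:
    """Largest node count reached before efficiency first drops below threshold."""
    node_counts = sorted(timings)
    base_nodes = node_counts[0]
    base_time = timings[base_nodes]
    best = base_nodes
    for nodes in node_counts:
        eff = strong_scaling_efficiency(base_nodes, base_time, nodes, timings[nodes])
        if eff < threshold:
            break
        best = nodes
    return best


def energy_to_solution(avg_power_watts: float, seconds: float) -> float:
    """Energy in joules, assuming a roughly constant average power draw."""
    return avg_power_watts * seconds


# Hypothetical wall-clock times in seconds, keyed by node count.
TIMINGS = {2: 1000.0, 4: 510.0, 8: 265.0, 16: 140.0, 32: 76.0, 64: 44.0, 128: 28.0}

if __name__ == "__main__":
    for n in sorted(TIMINGS):
        eff = strong_scaling_efficiency(2, TIMINGS[2], n, TIMINGS[n])
        print(f"{n:4d} nodes: efficiency {eff:.2f}")
    print("largest scale at or above 80%:", largest_efficient_scale(TIMINGS))
```

Running the same calculation on timings from each candidate system gives directly comparable efficiency curves.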

4. Cost-Performance Analysis (TCO)

Performance isn't just about speed; it’s about Efficiency-per-Dollar. A system that is 10% faster but 50% more expensive is rarely the "optimal" choice for sustainable growth.
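
The example in the paragraph above can be checked numerically. This is a rough sketch: the TCO formula (capital cost plus energy and support over the system lifetime) is a simplifying assumption, and every parameter name and figure is hypothetical.

```python
HOURS_PER_YEAR = 8760


def tco_dollars(capex: float, avg_power_kw: float, years: float,
                price_per_kwh: float, pue: float = 1.0,
                annual_support: float = 0.0) -> float:
    """Simplified total cost of ownership: purchase + energy + support."""
    annual_energy_cost = avg_power_kw * pue * HOURS_PER_YEAR * price_per_kwh
    return capex + years * (annual_energy_cost + annual_support)


def relative_value(speedup: float, cost_ratio: float) -> float:
    """Performance-per-dollar of a candidate relative to a baseline system.

    speedup    = candidate throughput / baseline throughput
    cost_ratio = candidate cost / baseline cost
    A result below 1.0 means the candidate delivers less work per dollar.
    """
    return speedup / cost_ratio


if __name__ == "__main__":
    # 10% faster but 50% more expensive, as in the text.
    print(f"relative value: {relative_value(1.10, 1.50):.3f}")
```

A system that is 10% faster at 1.5 times the cost delivers roughly 0.73 times the work per dollar of the baseline, which is why raw speed alone is a poor selection criterion.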


5. Identifying the "Bottleneck" (Comparative Profiling)

Use Performance Co-Pilot (PCP) or Intel VTune to compare hardware counters across systems.

  1. Instructions per Cycle (IPC): If System A achieves a higher IPC than System B on the same code, it suggests better branch prediction, instruction scheduling, or fewer memory stalls for your logic. Compare IPC only between runs of the same binary, since different instruction sets or compiler flags change the instruction count.
  2. NUMA Locality: Compare the performance degradation when tasks cross NUMA boundaries. This shows whether your code needs specific process-pinning optimizations to run well on high-core-count processors.
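
Both comparisons can be reduced to two ratios once the raw numbers are collected. The sketch below assumes the counter totals and timings were gathered separately, for example with `perf stat -e instructions,cycles` for IPC and with runs pinned via `numactl --cpunodebind` and `--membind` for the local and remote cases. The sample values and function names are hypothetical.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions retired per CPU cycle."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def numa_penalty(local_seconds: float, remote_seconds: float) -> float:
    """Fractional slowdown when memory is on a remote NUMA node."""
    return (remote_seconds - local_seconds) / local_seconds


# Hypothetical measurements of the same binary on two systems.
SAMPLES = {
    "system_a": {"instructions": 8_000_000_000, "cycles": 4_000_000_000,
                 "local_s": 10.0, "remote_s": 13.0},
    "system_b": {"instructions": 8_000_000_000, "cycles": 5_000_000_000,
                 "local_s": 11.0, "remote_s": 12.1},
}

if __name__ == "__main__":
    for name, s in SAMPLES.items():
        print(f"{name}: IPC {ipc(s['instructions'], s['cycles']):.2f}, "
              f"NUMA penalty {numa_penalty(s['local_s'], s['remote_s']):.0%}")
```

A large NUMA penalty on one system indicates that process and memory pinning should be part of the tuned configuration before the systems are compared.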

6. Comparative Analysis Checklist