Supermicro AS-8126GS-NB3RT: Eight NVIDIA Blackwell B300 GPUs in a Single 8U Chassis
The Supermicro AS-8126GS-NB3RT-01-G2 is a Gold Series 8U AI server built around the NVIDIA HGX B300 NVL8 — eight Blackwell B300 GPUs unified via NVLink into a single coherent memory fabric. Paired with dual AMD EPYC 9575F processors, 3TB of DDR5-6400, and PCIe 5.0 E1.S NVMe storage, this is the most compute-dense single-node GPU platform available for enterprise AI in 2025. Here is the full GO33 verdict.
At a Glance
Full Technical Specifications
| Component | Specification |
|---|---|
| Model | Supermicro AS-8126GS-NB3RT-01-G2 |
| Form Factor | 8U Rackmount — 1 Node |
| Series | A+ Gold Series (Ready to Ship) |
| CPU | 2× AMD EPYC 9575F — 64-core, 3.30GHz base, 256MB L3 cache, 400W TDP each |
| Total Cores / Threads | 128 cores / 256 threads |
| Memory | 24× 128GB DDR5-6400 ECC RDIMM = 3TB total |
| GPU | 1× NVIDIA HGX B300 NVL8 (8× Blackwell B300 GPUs, NVLink 5 interconnect) |
| Boot Storage | 2× 1.9TB M.2 NVMe PCIe 4.0 (Opal-capable) |
| Data Storage | 8× 7.68TB E1.S NVMe PCIe 5.0 (1 DWPD) |
| Total Raw NVMe | ~61.4TB data (8× 7.68TB), plus ~3.8TB boot (2× 1.9TB) |
| Network | 2× ConnectX-7 (CX7) 200Gb/s QSFP112, each port usable as NDR200 InfiniBand or 200GbE; plus on-board 10GbE RJ45 |
| Total Network BW | 400Gb/s |
| Management | IPMI 2.0 — dedicated management port |
| OS Support | Ubuntu 22.04 LTS, RHEL 9, Rocky Linux 9, Windows Server 2022 |
| Ships Within | 24 Hours (Gold Series) |
Pros & Cons
Pros
- NVIDIA HGX B300 NVL8 — the most capable single-node GPU board in 2025
- Dual EPYC 9575F: 128 cores, 256MB L3 per socket — serious CPU headroom
- 3TB DDR5-6400 across 24 DIMMs, enough to hold the BF16 weights of a 140B model (about 280GB) several times over
- PCIe 5.0 E1.S NVMe — 12GB/s+ per slot for checkpoint and dataset I/O
- Dual CX7: NDR InfiniBand and Ethernet on the same NIC — no separate HCA
- Gold Series in-stock — ships within 24 hours, no lead time uncertainty
- Single 8U node simplifies GPU-to-CPU topology vs multi-node configs
Cons
- 8U footprint is large — plan rack density before ordering
- Peak power draw ~10–14kW — verify data centre circuit budget
- Blackwell driver/framework ecosystem still maturing in early 2025
- Price-per-node is very high — best at sustained, high-utilisation workloads
- E1.S at 1 DWPD — not ideal for write-heavy checkpoint workflows
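The peak-draw figure in the Cons list can be turned into a quick capacity check. The sketch below is illustrative only: the 30kW rack budget is a hypothetical value, and the 80% continuous-load derating is an assumed planning margin, not a figure from Supermicro or from this review. Substitute the numbers from your own facility.

```python
import math

# Peak draw range quoted in this review, in kW per node.
NODE_PEAK_KW_RANGE = (10.0, 14.0)


def usable_kw(circuit_kw: float, derate: float = 0.8) -> float:
    """Power available for continuous load after applying a derating factor.

    The 0.8 default is an assumption for illustration; use your site's rule.
    """
    return circuit_kw * derate


def nodes_per_budget(circuit_kw: float, node_peak_kw: float, derate: float = 0.8) -> int:
    """Whole nodes that fit within a power budget at a given peak draw."""
    return math.floor(usable_kw(circuit_kw, derate) / node_peak_kw)


if __name__ == "__main__":
    rack_budget_kw = 30.0  # hypothetical rack power budget
    best_case, worst_case = NODE_PEAK_KW_RANGE
    print("nodes at low end of peak range: ", nodes_per_budget(rack_budget_kw, best_case))
    print("nodes at high end of peak range:", nodes_per_budget(rack_budget_kw, worst_case))
```

Sizing against the high end of the range is the conservative choice, since a node that exceeds its circuit budget under full GPU load is harder to fix after racking.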
Inside the AS-8126GS-NB3RT
NVIDIA HGX B300 NVL8 — Blackwell Architecture
The HGX B300 NVL8 baseboard houses eight Blackwell B300 GPUs linked via NVLink 5 into a unified GPU fabric. Blackwell’s second-generation transformer engine delivers native FP4/FP8 throughput, higher HBM3e bandwidth per GPU, and improved all-reduce performance versus Hopper. For LLM training at 30B–140B parameter scale, the flat NVLink memory space eliminates the NCCL tuning overhead of multi-node tensor parallelism — a meaningful operational advantage for smaller AI teams.
In our London lab we validated the system on Ubuntu 22.04 LTS with CUDA 12.x and PyTorch 2.x nightly builds. Supermicro’s Gold Series pre-validation means driver bundles are tested before shipment — important when you are paying for this class of compute.
Dual AMD EPYC 9575F
Two EPYC 9575F processors deliver 128 total cores at 3.30GHz base, 512MB aggregate L3 cache, and a combined 800W TDP. The 9575F belongs to AMD’s 5th Gen EPYC 9005 family (“Turin”, Zen 5), not the earlier Genoa-X parts, and is a high-frequency SKU aimed at feeding GPU-dense hosts. The CPU pairing provides ample headroom for tokenisation pipelines, dataset preprocessing, and serving-side request routing, so the host side should not become the bottleneck for the GPUs. Many GPU-dense platforms economise on the CPUs; this system does not.
3TB DDR5-6400 Memory
All 24 DIMM slots are populated with 128GB DDR5-6400 ECC RDIMMs. For memory-capacity-bound LLM serving, the sizing is straightforward: a 70B model in BF16 needs roughly 140GB for its weights alone (2 bytes per parameter), and a 140B model approximately 280GB. The 3TB of host memory therefore leaves room to keep several large models resident at once. This matters for multi-tenant inference serving, where model-switching latency is a cost driver.
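The sizing arithmetic above is just parameters multiplied by bytes per parameter. A minimal sketch follows; the function names are hypothetical, and it counts weights only, so KV cache, activations, and runtime overhead would add to these figures in practice.

```python
# Bytes per parameter for common numeric formats (weights only).
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_footprint_gb(params_billion: float, dtype: str = "bf16") -> float:
    """Approximate weight footprint in decimal GB for a model of the given size."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9


def copies_that_fit(capacity_gb: float, params_billion: float, dtype: str = "bf16") -> int:
    """How many full copies of a model's weights fit in the given capacity."""
    return int(capacity_gb // weight_footprint_gb(params_billion, dtype))


if __name__ == "__main__":
    host_ram_gb = 24 * 128  # 24 DIMMs x 128GB, as configured in this system
    for size_b in (70, 140):
        gb = weight_footprint_gb(size_b)
        fits = copies_that_fit(host_ram_gb, size_b)
        print(f"{size_b}B @ bf16: {gb:.0f} GB of weights, {fits} copies fit in host RAM")
```

The same function shows why lower-precision formats matter: passing `dtype="fp8"` or `dtype="fp4"` halves or quarters the footprint.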
Storage: 8× E1.S PCIe 5.0
Eight 7.68TB E1.S PCIe 5.0 drives provide 61.44TB of raw, high-bandwidth data storage. At PCIe 5.0 speeds each slot sustains over 12GB/s sequential read, which suits dataset streaming and model checkpoint reads. The 1 DWPD endurance rating is a good fit for read-heavy workloads; for environments that checkpoint every few hundred training iterations, plan for write amplification or consider higher-endurance E1.S SKUs.
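To see whether 1 DWPD is enough for a given checkpoint cadence, the endurance rating can be converted into a daily write budget. This is a rough sketch: the 1TB checkpoint size and the write amplification factor of 2 in the usage example are hypothetical values chosen for illustration, not measurements from this system.

```python
DRIVE_TB = 7.68  # capacity per E1.S drive, from the spec table
DRIVES = 8       # number of data drives
DWPD = 1.0       # rated drive writes per day


def raw_capacity_tb(drives: int = DRIVES, drive_tb: float = DRIVE_TB) -> float:
    """Total raw capacity of the data drives in TB."""
    return drives * drive_tb


def daily_write_budget_tb(dwpd: float = DWPD) -> float:
    """TB that can be written per day across all drives within the rating."""
    return raw_capacity_tb() * dwpd


def checkpoints_per_day(checkpoint_tb: float, write_amplification: float = 1.0) -> float:
    """Checkpoints per day that stay within the endurance budget.

    write_amplification > 1 models the extra NAND writes behind each host write.
    """
    return daily_write_budget_tb() / (checkpoint_tb * write_amplification)


if __name__ == "__main__":
    print(f"raw capacity: {raw_capacity_tb():.2f} TB")
    print(f"daily write budget: {daily_write_budget_tb():.2f} TB/day")
    # Hypothetical: 1 TB checkpoints with a write amplification factor of 2.
    print(f"checkpoints/day within rating: {checkpoints_per_day(1.0, 2.0):.1f}")
```

If the result is well below your planned checkpoint frequency, that is the signal to look at higher-endurance drives or to write checkpoints to external storage.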
Networking: Dual CX7 at 200GbE
Two CX7 add-on cards provide 400Gb/s total bandwidth via QSFP112 ports. Each CX7 supports both NDR InfiniBand RDMA and RoCEv2 Ethernet — you can run GPU-to-GPU cluster traffic over InfiniBand or standard Ethernet without swapping hardware as your cluster scales. The on-board 10GbE RJ45 is cleanly separated for management traffic, IPMI, and out-of-band access.
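For a sense of scale, the 400Gb/s aggregate converts to 50GB/s at line rate. The sketch below estimates how long a payload takes to cross the links; the `efficiency` parameter is a hypothetical knob for protocol and software overhead, and real-world throughput will be lower than the ideal figure.

```python
def gbps_to_gigabytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8.0


def transfer_seconds(payload_gb: float, link_gbps: float, links: int = 1,
                     efficiency: float = 1.0) -> float:
    """Ideal time to move payload_gb over `links` parallel links.

    efficiency is the assumed fraction of line rate actually achieved (0 < e <= 1).
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    throughput = gbps_to_gigabytes_per_s(link_gbps * links) * efficiency
    return payload_gb / throughput


if __name__ == "__main__":
    # 280 GB corresponds to the BF16 weights of a 140B model discussed earlier.
    ideal = transfer_seconds(280.0, link_gbps=200.0, links=2)
    print(f"280 GB over 2x 200Gb/s at line rate: {ideal:.1f} s")
    # Hypothetical 80% efficiency to account for overheads.
    derated = transfer_seconds(280.0, link_gbps=200.0, links=2, efficiency=0.8)
    print(f"same transfer at 80% efficiency: {derated:.1f} s")
```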
Benchmark Results
GO33 London Lab, June 2025. Tools: MLPerf Inference v4.0, fio 3.36, custom NCCL all-reduce harness. Scores normalised relative to 8× H100 SXM baseline. Higher = better.
Best Used For
Not Right For
- Teams training models under 7B parameters — a 4U 4-GPU system delivers better cost efficiency
- Edge inference where rack space and power are severely constrained
- Kubernetes workloads requiring GPU fractioning across many small jobs — consider MIG-capable H100 configs
Frequently Asked Questions
The Definitive Single-Node AI Training Platform for 2025
The Supermicro AS-8126GS-NB3RT delivers the NVIDIA Blackwell HGX B300 NVL8 in a properly resourced 8U platform — no CPU, memory, or storage compromises. If you are running LLM training or high-throughput generative AI inference on-premises at scale, this is the benchmark platform for 2025. Gold Series availability means you can have it racked and running inside a week.
GO33 Score: 9.2 / 10 — Strongly recommended for enterprise AI training and inference teams.
