Premise
We believe the center of gravity in AI infrastructure has shifted from servers to AI factories, where networking is perhaps as critical as compute. Our analysis of an interview with NVIDIA SVP of Networking, Gilad Shainer, indicates NVIDIA’s thesis is straightforward: scale-up fabrics (NVLink/NVLink Fusion) plus scale-out Ethernet purpose-built for AI (Spectrum-X) — and increasingly scale-across for multi–data center topologies (Spectrum-XGS) — deliver superior determinism, efficiency, and time-to-outcomes at giga-scale. At the same time, the market is too large and heterogeneous for any single fabric to dominate; open standards and merchant Ethernet will continue to win broad adoption, and even NVIDIA is embracing open interfaces and ecosystems to complement its proprietary advantages. In our view, NVIDIA is a somewhat rare case where first-mover advantage has paid off. Its early conviction in parallel computing, GPUs, CUDA/NCCL software moats, and the Mellanox acquisition now underpin a defensible systems position across scale-up, scale-out, and (increasingly) scale-across.
NVIDIA’s Core Argument
Not surprisingly, NVIDIA’s dominant position in the market has detractors. Competitors position NVIDIA as a proprietary, closed platform that locks in customers. Firms like Broadcom position Ethernet as the open networking alternative to NVIDIA’s approach and point to ample evidence that Ethernet will ultimately win in the market; one example is the hyperscalers’ wide adoption of Ethernet.
According to Gilad Shainer, NVIDIA agrees that Ethernet specifically, and open standards generally, are fundamental and necessary. NVIDIA scales up by using NVLink (and NVLink Fusion) to tightly couple GPUs into one high-bandwidth, low-latency pool, and scales out by using purpose-built network fabrics – e.g. Spectrum-X Ethernet and Quantum-X InfiniBand – to interconnect racks and clusters of GPUs across the data center (and, with Spectrum-XGS, across sites).
We noted a historic architectural “tension” between scale-up (e.g. the mainframe) and scale-out (e.g. the Google File System), meaning different architectural choices have traditionally required a commitment to one or the other.
According to Shainer, there is no “tension,” just different missions. Shainer’s core point is that scale-up and scale-out are complementary. Scale-up forms larger “virtual GPUs” with massive load/store bandwidth and low latency; scale-out connects hundreds of thousands of accelerators with zero-jitter behavior so that one lagging node doesn’t stall the entire system. The mix is workload-dependent, and, as Jensen often points out, NVIDIA designs the entire data center to understand where bottlenecks can occur for different workloads. This, according to the company, is superior to optimizing point parts in isolation.
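To make that complementarity concrete, consider a back-of-envelope model. The sketch below is our own illustration (all bandwidth, group-size, and message-size figures are hypothetical, not NVIDIA specifications): it estimates the time for a hierarchical ring all-reduce whose intra-group phase rides the scale-up fabric and whose inter-group phase rides the scale-out fabric.

```python
# Back-of-envelope model of a hierarchical all-reduce: reduce inside the
# scale-up domain first (NVLink-class bandwidth), then across racks on the
# scale-out fabric. All numbers are illustrative assumptions, not NVIDIA specs.

def allreduce_seconds(msg_gb: float, n_local: int, n_groups: int,
                      bw_up_gbs: float, bw_out_gbs: float) -> float:
    """Approximate ring all-reduce time for a message of msg_gb gigabytes.

    A ring all-reduce moves ~2*(n-1)/n of the message per participant; we
    charge the intra-group phase to the scale-up fabric and the inter-group
    phase to the scale-out fabric.
    """
    intra = 2 * (n_local - 1) / n_local * msg_gb / bw_up_gbs
    inter = 2 * (n_groups - 1) / n_groups * msg_gb / bw_out_gbs
    return intra + inter

# Hypothetical example: 10 GB of gradients, 72 GPUs per scale-up domain,
# 100 domains, 900 GB/s usable NVLink-class vs 50 GB/s usable per-GPU scale-out.
t = allreduce_seconds(msg_gb=10, n_local=72, n_groups=100,
                      bw_up_gbs=900, bw_out_gbs=50)
print(f"~{t:.2f} s per all-reduce step")  # scale-out phase dominates here
```

Under these assumed ratios the scale-out phase dominates the step time, which is why jitter and tail latency on that fabric, rather than peak NVLink bandwidth, often gate end-to-end throughput.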
NVLink → NVLink Fusion (scale-up). NVLink bandwidth has marched from ~160 GB/s to 1.8 TB/s per GPU with Blackwell, with a stated multi-fold roadmap beyond that. NVLink Fusion opens this fabric to partners (CPU, XPU, NIC, switch) via standard interfaces (e.g., UCIe/PCIe), enabling semi-custom designs while preserving access to NVIDIA’s software stack (e.g., NCCL).
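For context on where NCCL sits in that stack, here is a minimal sketch of how a training job exercises both fabrics through NCCL, via PyTorch’s public distributed API; the comments describe NCCL’s general transport-selection behavior, and the tensor size is an arbitrary stand-in.

```python
# Minimal NCCL sketch via PyTorch's distributed API. NCCL generally moves
# intra-node traffic over NVLink and inter-node traffic over the NIC fabric
# (InfiniBand or Ethernet); the application code is the same either way.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")        # reads RANK/WORLD_SIZE env vars
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    grads = torch.randn(1 << 20, device="cuda")    # arbitrary stand-in for gradients
    dist.all_reduce(grads, op=dist.ReduceOp.SUM)   # collective spans both fabrics
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, e.g., `torchrun --nproc_per_node=8 script.py` on each node, the same all_reduce call traverses NVLink within a server and the scale-out fabric between servers – exactly the seam Spectrum-X and Quantum-X target.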
Spectrum-X (scale-out Ethernet for AI). NVIDIA’s claim is that “traditional Ethernet” was optimized for single-server traffic; AI needs determinism (meaning low jitter, no drops, end-to-end congestion control and adaptive routing). Spectrum-X pairs SuperNICs and switches to act as one fabric, which the company says reduces tail latency and improves multi-tenant stability.
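A toy model helps illustrate the adaptive-routing piece of that claim. The sketch below is purely our illustration (uplink and flow counts are invented, and adaptive routing is modeled naively as least-loaded placement, not as Spectrum-X’s actual mechanism): static ECMP-style hashing can pin several large flows onto one uplink, and the hottest link sets the completion time of a synchronized collective.

```python
# Toy illustration of why static ECMP hashing can hurt AI traffic: a few large
# flows hashed onto the same uplink collide, while adaptive placement (modeled
# here as least-loaded selection) spreads them. Entirely illustrative.
import random

UPLINKS, FLOWS = 8, 16
random.seed(1)

# Static hash: each flow pinned to a pseudo-random uplink (models a 5-tuple hash).
static = [0] * UPLINKS
for _ in range(FLOWS):
    static[random.randrange(UPLINKS)] += 1

# Adaptive: place each flow on the currently least-loaded uplink.
adaptive = [0] * UPLINKS
for _ in range(FLOWS):
    adaptive[adaptive.index(min(adaptive))] += 1

print("static max load:", max(static), "| adaptive max load:", max(adaptive))
# The max-loaded uplink gates a synchronized collective, so the static case's
# hot link becomes the whole job's bottleneck.
```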
Photonics and power. Optical I/O can consume on the order of 10% of compute power at AI-factory scale; co-packaged optics (CPO) and photonic switch systems promise meaningful power and resiliency gains, effectively increasing attached-GPU counts within the same power envelope.
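The arithmetic behind that claim is worth sketching. The figures below are our hypothetical assumptions (only the ~10% optical share comes from the discussion above): the point is that optical power saved converts directly into attachable GPUs at a fixed site budget.

```python
# Illustrative power-budget math (all figures are hypothetical assumptions).
site_mw      = 100.0   # fixed site power budget, MW (assumption)
optics_share = 0.10    # optical I/O at ~10% of total (per the figure above)
cpo_savings  = 0.60    # assume CPO cuts optical power by ~60% (assumption)
gpu_kw       = 1.5     # assumed all-in power per attached GPU, kW

optics_mw  = site_mw * optics_share
freed_mw   = optics_mw * cpo_savings
extra_gpus = freed_mw * 1000 / gpu_kw

print(f"Freed power: {freed_mw:.1f} MW -> ~{extra_gpus:.0f} extra GPUs")
# Freed power: 6.0 MW -> ~4000 extra GPUs at the same site power envelope
```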
Scale-across. With Spectrum-XGS, NVIDIA says it is explicitly targeting inter-data-center links to build giga-scale AI factories. We see this as a logical extension once single-site capacity saturates.
Openness, in practice. NVIDIA emphasizes standards (Ethernet, InfiniBand) and its open-source contributions (e.g., SONiC), and positions NVLink Fusion’s partner program as evidence that entry and exit points to the stack use published, standard interfaces.
The Counter-Narrative (Broadcom/Ethernet Camp)
There is a credible opposing view, often articulated by merchant-silicon providers such as Broadcom. The counterargument goes something like this:
- Ethernet is the lingua franca. The largest operators already run massive Ethernet estates; operational tooling, skills, and supply chains favor standards-based Ethernet. Merchant silicon innovation cycles and volume economics drive rapid feature catch-up.
- Openness is a hedge against lock-in. Initiatives like UEC (the Ultra Ethernet Consortium) and related efforts aim to make Ethernet AI-capable via standardized congestion control, load balancing, and collective-friendly features. The pitch: AI performance and multivendor choice at hyperscale.
- Power of scale. Merchant ecosystems argue that volume combined with competition lowers TCO faster than a single-vendor system, and that open fabrics will reduce strategic risk over a multi-year buildout.
We view this position as strong for broad enterprise adoption, green-field spines, and hyperscalers with deep Ethernet DNA. It is weaker in our view where determinism at massive GPU counts and software-level integration are existential to utilization and time-to-train.
NVIDIA’s position is that it embraces open standards, contributes to open source, and will deploy whatever technology best fits the workload and strategy at hand. Regardless, the market is enormous, and in our view both NVIDIA and the alternative camps can thrive. NVIDIA also points out that it is an active participant in standards initiatives like UEC.
Our Take: How the Market Will Bifurcate
In our opinion, the fabric wars won’t be winner-take-all; they’ll segment along workload criticality, scale, and operator DNA:
- Mission-critical AI clusters (frontier training, latency-sensitive collectives). NVLink-centric scale-up with AI-tuned scale-out (Spectrum-X) has a durable edge in determinism, utilization, and software cohesion (CUDA/NCCL). The system advantage matters most here in our view.
- Enterprise-wide AI enablement (general inference, analytics, mixed traffic). Ethernet’s ubiquity and operations gravity dominate broadly. We expect UEC-aligned features to narrow the gap for many use cases; NVIDIA’s own Spectrum-X is still Ethernet, a pragmatic bridge for this domain.
- Giga-scale expansion (multi-site “scale-across”). Power/optics become the tax. Co-packaged photonics and inter-DC fabrics (Spectrum-XGS) emerge as a new control point; here, NVIDIA is seeding the category, as are others (e.g., Broadcom, Cisco).
Why NVIDIA’s first-mover advantage holds. Unlike many “firsts” that faded (e.g. Friendster/MySpace, Yahoo/Lycos), NVIDIA has stacked three moats: 1) accelerated computing with CUDA, developed in lockstep with AI researchers; 2) fabric control from NVLink through NCCL into the network; and 3) Mellanox – the decisive bet that networking would be as important as compute. That portfolio lets NVIDIA design the full system and data center and optimize systemically, not piecewise. We believe that system view is hard to fast-follow.
Openness vs. lock-in – NVIDIA’s balancing act. The company is signaling selective openness (standard interfaces for NVLink Fusion; SONiC; Ethernet compliance) to counter lock-in concerns while preserving proprietary differentiation where it drives utilization and performance. That’s the right play in our view – i.e. embrace openness where it expands TAM and partner leverage; protect the crown jewels where system-level benefits accrue.
Why Determinism Became the Hill to Die On
AI training relies on synchronized collective operations; a few straggler packets (tail latency) can stall thousands of GPUs. The story we heard from Gilad Shainer is that Spectrum-X delivers low-jitter, end-to-end controlled behavior by co-designing the SuperNIC and switch, implementing lossless transport, adaptive routing, and real-time telemetry. The practical outcome is higher effective utilization and faster time-to-train at scale than best-effort Ethernet. Even if the Ethernet camp’s features converge on paper, we believe the proof will show up in end-to-end system behavior under real load, not in individual part KPIs.
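The straggler dynamic is easy to demonstrate with a toy Monte Carlo simulation, shown below. The latency distribution is invented for illustration, but the structural point holds: a synchronized step completes only when the slowest of N participants does, so step time is governed by the tail, and the tail grows with N.

```python
# Toy Monte Carlo: a synchronized collective finishes when the slowest of N
# participants finishes, so the tail of the per-node latency distribution,
# not the mean, sets step time. Distributions here are illustrative only.
import random

def step_time(n_nodes: int, mean_ms: float, jitter_ms: float) -> float:
    """One synchronized step = max over per-node completion times."""
    return max(random.expovariate(1 / jitter_ms) + mean_ms
               for _ in range(n_nodes))

random.seed(0)
for n in (8, 512, 16384):
    trials = [step_time(n, mean_ms=10.0, jitter_ms=1.0) for _ in range(200)]
    avg = sum(trials) / len(trials)
    print(f"{n:>6} nodes: avg step ~{avg:.1f} ms vs 10 ms ideal")
# Expected pattern: step time grows with N even though the per-node mean is
# fixed, because the max of N samples climbs into the distribution's tail.
```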
Don’t Forget the Power & Photonics Angle
As clusters grow, optical power becomes a double-digit percent of total compute energy. Co-packaged optics (CPO) and photonic switch systems can cut network power and increase resiliency, translating directly into more GPUs attached at the same site power budget – a silent but material TCO lever. NVIDIA is building this into both Spectrum-X (Ethernet) and Quantum-X (InfiniBand) product lines.
Quick Reference Guide
| Domain | Primary Goal | Likely Fabric Posture | NVIDIA Offers |
| --- | --- | --- | --- |
| Scale-up (in-rack; creates a giant “virtual GPU”) | Bandwidth, load/store, ultra-low latency | NVLink / NVLink Fusion | NVLink Gen5+; Fusion w/ UCIe/PCIe, partner XPUs/CPUs |
| Scale-out (intra-DC, rack-to-rack) | Determinism, tail-latency control, multi-tenant stability | AI-tuned Ethernet or InfiniBand | Spectrum-X (Ethernet), Quantum-X (InfiniBand); SuperNIC + switch co-design |
| Scale-across (inter-DC) | Path diversity, long-haul optics, policy | Emerging “XGS”-class Ethernet | Spectrum-XGS for distributed AI factories |
What to Watch Next
We believe the following markers will serve as indicators of momentum and market success for both NVIDIA’s approach and the alternatives:
- Utilization at scale (tokens/Watt/hour) on real model runs across fabrics (see the sketch after this list).
- Tail-latency achievements under true multi-tenant load.
- Optical power share of total; CPO deployment timelines, yields, and TCO.
- Standards conformance and ecosystem velocity (UEC features landing in merchant silicon vs. NVIDIA’s end-to-end updates).
- Fusion adoption (named XPUs/CPUs joining the NVLink fabric; depth of NCCL exposure to, and adoption by, partners).
- Inter-DC architectures shipping (routes, resiliency, security domains) and operator case studies.
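On the first marker, here is a minimal sketch of the utilization metric as we read it – tokens produced per watt-hour of power drawn. Both the definition and the run parameters below are our assumptions, not a standardized benchmark.

```python
# Sketch of the utilization metric as we read it: tokens produced per
# watt-hour of power drawn. Definition and figures are our assumptions,
# not a standardized benchmark.
def tokens_per_watt_hour(tokens: float, avg_power_w: float, hours: float) -> float:
    return tokens / (avg_power_w * hours)

# Hypothetical run: 1e12 tokens processed over 24 h at an average 80 MW draw.
print(f"{tokens_per_watt_hour(1e12, 80e6, 24):.1f} tokens per watt-hour")
# -> ~520.8 tokens per watt-hour; comparing this across fabrics on identical
#    model runs is what would make the cross-vendor claims falsifiable.
```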
Bottom Line
In our view, purpose-built AI fabrics will remain the performance and utilization bar for state-of-the-art clusters, and NVIDIA’s system approach keeps it in pole position for those deployments. At the same time, the TAM is too large for a single-vendor approach to dominate exclusively. Ethernet’s standardization, ecosystem, and operational gravity ensure that open networking will win massive share across enterprises – including inside NVIDIA’s own portfolio. The strategic question is no longer “open vs. proprietary,” but where to be open to expand markets and where to be opinionated to preserve system-level advantage. On that score, we believe NVIDIA’s blend – open interfaces, partner-friendly Fusion, SONiC and Ethernet compliance, with a protected scale-up/collectives software moat – is a pragmatic approach for the leading AI vendor to pioneer the next decade of AI factories. Competitors, meanwhile, have little choice but to double down on open standards, tapping the leverage of a broad ecosystem to participate in earnest in the AI wave.
Watch the full conversation with Gilad Shainer:
Image: DALL-E