AI Data Centre Networking

AI training workloads demand a network fabric where every packet arrives, every time. VelOS DC delivers production-ready RDMA over Converged Ethernet version 2 (RoCEv2), Priority Flow Control (PFC), Explicit Congestion Notification (ECN), Data Center Quantized Congestion Notification (DCQCN) and Dynamic Load Balancing (DLB) -- giving GPU clusters the lossless, low-latency fabric with seamless integration into existing AI infrastructure through open standards.

Book a Consultation

Download Brochure

Lossless Ethernet for GPU Clusters - Shipping Today on Open Hardware

About Evollabs

Evollabs delivers a production-ready, lossless Ethernet fabric specifically engineered to eliminate the bottlenecks of distributed AI training. By integrating RoCEv2, PFC, and DCQCN on open NeoX hardware, we provide the zero-tolerance, low-latency environment required for massive GPU clusters to synchronize gradients without stalling. Our Linux-native approach ensures that your network fabric operates with the same precision and automation as your compute nodes, maximizing GPU utilization and accelerating time-to-insight.

THE CHALLENGE

Distributed AI training is fundamentally a networking problem. When hundreds or thousands of GPUs collaborate on a training job, they must synchronise gradients after every iteration -- exchanging massive volumes of data across the fabric with minimal latency and zero tolerance for packet loss. A single dropped packet in an RDMA over Converged Ethernet version 2 (RoCEv2) flow does not simply trigger a retransmission: it forces the entire Remote Direct Memory Access (RDMA) operation to restart, stalling every GPU in the training job while the network recovers. GPU idle time is the most expensive waste in AI infrastructure, and the network is overwhelmingly the cause.

Traditional Ethernet was designed for best-effort delivery. It assumes that packet loss is acceptable and that higher-layer protocols (TCP) will handle retransmission. This assumption is fundamentally incompatible with the requirements of RDMA workloads, where the network must guarantee lossless delivery at line rate. Building a lossless Ethernet fabric requires Priority Flow Control to prevent buffer overflow, Explicit Congestion Notification to signal congestion before drops occur, and end-to-end congestion management to prevent hotspots from cascading across the fabric. These are not optional features -- they are prerequisites for any network carrying RoCEv2 traffic.

The scale of AI training compounds the challenge. A 256-GPU training cluster generates east-west traffic patterns that dwarf traditional data centre workloads: every GPU communicates with every other GPU during gradient synchronisation, creating all-to-all traffic bursts that stress every link in the fabric simultaneously. As clusters grow to 512, 1,024, and beyond, the fabric must scale bandwidth and maintain lossless behaviour without introducing congestion-induced tail latency that degrades training throughput. And the proprietary AI fabric solutions offered by legacy vendors come with price tags that rival the GPU infrastructure itself -- locking operators into single-vendor networking stacks for the most rapidly evolving segment of their infrastructure.

THE SOLUTION

VelOS DC delivers the lossless Ethernet fabric that AI workloads demand, built on production capabilities that are shipping today -- not roadmap items.

RoCEv2 enables GPU-to-GPU communication with minimal latency and near-zero CPU overhead, bypassing the traditional TCP/IP stack to deliver the consistent, low-latency performance that large-scale distributed training requires. Priority Flow Control (PFC) prevents packet drops on priority traffic classes by sending pause frames to upstream devices when buffer thresholds are reached. Explicit Congestion Notification (ECN) signals congestion to endpoints before drops occur, enabling proactive rate adjustment. Together, PFC and ECN create the lossless Ethernet fabric that RDMA workloads require -- ensuring that every packet reaches its destination without retransmission.

Data Centre Quantised Congestion Notification (DCQCN) provides end-to-end congestion control by combining ECN signalling with fine-grained rate adjustment at the source. It prevents congestion hotspots in large-scale AI training clusters by dynamically throttling senders before queue depths trigger PFC pause frames. DCQCN is essential for maintaining consistent fabric performance across hundreds or thousands of GPU nodes operating in parallel -- preventing the tail latency spikes that degrade training throughput and waste GPU cycles.

These capabilities run on VelOS DC's Linux-native architecture, meaning AI infrastructure teams can manage the network fabric using the same Linux tools and automation frameworks they use for GPU cluster management. There is no separate proprietary toolchain for the network -- Ansible configures switches alongside GPU nodes, and standard Linux monitoring integrates fabric telemetry into existing AI infrastructure dashboards.

Recommended Hardware

VelOS DC runs on NeoX data centre switches built on Broadcom switching silicon. The following hardware is validated for AI fabric deployments, with high-bandwidth spine switches providing the aggregate capacity that GPU-to-GPU traffic demands.

Role	Model	Ports	Chipset	Why This Hardware
Role AI Fabric Leaf	Model NeoX NS7726-32X	Ports 32x 100G QSFP28 + 2x 10G SFP+	Chipset Broadcom Trident 3	Why This Hardware 100G server-facing ports for GPU node connectivity with rich L2/L3 feature set for lossless fabric configuration
Role AI Fabric Spine (400G)	Model NeoX NS9716-32D	Ports 32x 400G QSFP-DD	Chipset Broadcom Tomahawk 3	Why This Hardware High-bandwidth spine, 12.8 Tbps class, with 400G breakout to 4x 100G for flexible leaf connectivity
Role AI Fabric Spine (Hyperscale)	Model NeoX NS9736-64D	Ports 64x 400G QSFP56-DD	Chipset Broadcom Tomahawk 4	Why This Hardware Maximum bandwidth, 25.6 Tbps class for large-scale AI clusters requiring the highest port density and aggregate throughput

DEPLOYMENT TOPOLOGY

AI/GPU Cluster Fabric with VelOS DC

AI/GPU Cluster Fabric with VelOS DC enables high-performance, low-latency networking for large-scale AI and GPU workloads. Powered by advanced data center orchestration and intelligent fabric management, it ensures seamless scalability, optimized resource utilization, and reliable connectivity across distributed GPU clusters for accelerated AI training and inference environments.

Request A Demo

Use Cases

USE CASE SCENARIOS

An enterprise AI team is deploying a 256-GPU training cluster for large language model fine-tuning and computer vision workloads. The existing data center network uses traditional Ethernet switches that were designed for general-purpose traffic -- not the lossless, low-latency requirements of RoCEv2.

Deployment: Build a dedicated AI fabric pod using leaf switches (32x 100G for GPU node connectivity) and spine switches (32x 400G, broken out to 4x 100G per port). VelOS DC configures PFC, ECN, and DCQCN across the fabric to deliver the lossless Ethernet that RoCEv2 requires. Eight leaf switches provide 256 x 100G ports for GPU nodes, with two NS9716-32D spine switches providing the aggregate bandwidth for all-to-all gradient synchronisation traffic.

Result: Lossless fabric with RoCEv2 eliminates GPU idle time from packet loss. Linux-native VelOS DC integrates into the existing AI infrastructure management pipeline -- the same Ansible playbooks that provision GPU nodes configure the network switches. Open Standards Evollabs switches delivers the fabric at a significantly lower cost than proprietary AI networking solutions, while maintaining full hardware choice and operational flexibility.

Capabilities

POWERED BY VELOS DC

VelOS DC is a data center network operating system built from the ground up on Linux, with AI networking capabilities shipping in production today. It integrates natively with compute environments and AI infrastructure management frameworks, delivering lossless fabric with seamless integration into existing AI infrastructure through open standards.

Book a Consultation

RDMA over Converged Ethernet v2 for zero-copy, low-latency GPU-to-GPU communication with near-zero CPU overhead