Protocol

What is RoCE (RDMA over Converged Ethernet)?

RoCE is a network protocol that enables RDMA (Remote Direct Memory Access) to operate over standard Ethernet networks, providing low-latency kernel-bypass I/O without requiring specialized InfiniBand hardware.

Technical Overview

RoCE (pronounced "Rocky") was developed by the InfiniBand Trade Association and standardized in the IBTA specification. Two versions exist: RoCE v1 operates at Layer 2 (Ethernet frames) and cannot be routed beyond a single broadcast domain; RoCE v2 (also called RoCEv2 or "Routable RoCE") encapsulates RDMA transport packets inside UDP/IP, enabling routing across Layer-3 boundaries. RoCEv2 uses UDP destination port 4791 and is the version used in virtually all modern datacenter deployments.
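Because a RoCE v2 packet is just an ordinary IPv4/UDP datagram addressed to port 4791, it can be recognized with plain header parsing. The sketch below classifies a raw Ethernet frame this way; it is a simplified illustration (IPv4 only, no VLAN tags or IP options), and the function name is ours, not from any standard API.

```python
import struct

ROCEV2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCE v2

def is_rocev2(frame: bytes) -> bool:
    """Heuristically classify a raw Ethernet frame as RoCE v2.

    A RoCE v2 packet is a normal IPv4/UDP datagram whose UDP destination
    port is 4791; the RDMA Base Transport Header (BTH) follows the UDP
    header. Simplified sketch: ignores VLAN tags, IPv6, and IP options.
    """
    if len(frame) < 14 + 20 + 8:          # Ethernet + minimal IPv4 + UDP
        return False
    ethertype = struct.unpack_from("!H", frame, 12)[0]
    if ethertype != 0x0800:               # IPv4 only in this sketch
        return False
    ihl = (frame[14] & 0x0F) * 4          # IPv4 header length in bytes
    if frame[14 + 9] != 17:               # IP protocol 17 = UDP
        return False
    dport = struct.unpack_from("!H", frame, 14 + ihl + 2)[0]
    return dport == ROCEV2_UDP_PORT
```

This is also why RoCE v2 traffic traverses routers and shows up in standard packet captures as UDP, with the RDMA transport headers opaque to anything that does not understand BTH.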

RoCE requires a lossless Ethernet fabric because the RDMA transport was designed for InfiniBand's link layer, which guarantees losslessness through credit-based flow control. When a packet is dropped due to congestion on a standard Ethernet switch, the RDMA connection must recover — which means retransmitting from the last acknowledged position (go-back-N) and can severely impact latency. To prevent drops, RoCE deployments must configure Priority Flow Control (PFC) on all switches and NICs to pause specific traffic classes when buffers fill, and typically also deploy DCQCN (Data Center Quantized Congestion Notification) so senders slow down before PFC pauses are triggered.
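DCQCN's sender-side reaction can be sketched as a small state machine: on each Congestion Notification Packet (CNP) the rate is cut multiplicatively, and during quiet intervals it recovers halfway back toward the remembered target. The model below is a simplified illustration of that core rate math (after the published DCQCN algorithm); the class name, the gain `G`, and the single-timer structure are our assumptions, not part of any NIC's actual firmware interface.

```python
G = 1 / 256  # smoothing gain for the congestion estimate (illustrative value)

class DcqcnSender:
    """Simplified model of DCQCN sender-side (reaction point) rate control."""

    def __init__(self, line_rate_gbps: float):
        self.rc = line_rate_gbps   # current sending rate
        self.rt = line_rate_gbps   # target rate remembered for recovery
        self.alpha = 1.0           # congestion estimate in [0, 1]

    def on_cnp(self) -> None:
        """React to a Congestion Notification Packet from the receiver."""
        self.rt = self.rc                        # remember where we were
        self.rc *= 1 - self.alpha / 2            # multiplicative decrease
        self.alpha = (1 - G) * self.alpha + G    # congestion looks worse

    def on_quiet_interval(self) -> None:
        """One interval with no CNPs: decay alpha, recover toward target."""
        self.alpha = (1 - G) * self.alpha        # congestion looks better
        self.rc = (self.rt + self.rc) / 2        # fast recovery
```

With `alpha` starting at 1, the first CNP halves the rate; each quiet interval then closes half the remaining gap back to the pre-congestion rate — the binary-search-like recovery that lets DCQCN back off quickly without staying slow.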

RoCE-capable NICs (RNICs) implement the RDMA transport in hardware. Common RNIC vendors include NVIDIA (ConnectX series), Broadcom (NetXtreme series), Marvell (FastLinQ series), and Intel (E810 series). These NICs expose RDMA verbs to applications via the libibverbs library on Linux or the Network Direct provider model on Windows. The NVMe/RDMA transport driver uses these verbs to implement NVMe-oF over RoCE.

How It Relates to NVMe/TCP

RoCE is the most common RDMA fabric for NVMe-oF in datacenter environments where RDMA performance is required. It sits between InfiniBand (highest performance, highest cost) and NVMe/TCP (standard Ethernet, commodity NICs) in both performance and operational complexity. The lossless Ethernet requirement is the most significant operational burden of RoCE compared to NVMe/TCP — misconfigured PFC can trigger pause storms and buffer-dependency deadlocks that stall all traffic in a priority class, not just RDMA. NVMe/TCP avoids this entirely because TCP's built-in congestion control and retransmission handle packet loss gracefully without requiring switch-level PFC.

Key Characteristics

  • Versions: RoCE v1 (L2 only), RoCE v2 (UDP/IP, routable)
  • UDP port: 4791 (RoCE v2)
  • Latency: 2–10 µs (NIC-to-NIC)
  • Network requirement: Lossless Ethernet (PFC + DCQCN)
  • NIC requirement: RDMA-capable NIC (RNIC)
  • NVMe-oF binding: NVMe/RDMA (defined in the base NVMe over Fabrics specification)

RoCE vs NVMe/TCP Tradeoffs

RoCE's 2–10 µs NIC-to-NIC latency and near-zero CPU data-path overhead make it attractive for workloads where every microsecond matters — financial trading, HPC checkpointing, and AI/ML training with large gradient synchronization. However, it requires RNIC hardware ($500–$2,000+ per port), lossless Ethernet fabric configuration, and ongoing PFC monitoring. NVMe/TCP sacrifices some latency (typically 15–30 µs more end-to-end) in exchange for running on any standard NIC, any Ethernet switch, and standard IP networking tools — a compelling tradeoff for general-purpose cloud and enterprise storage.