Overview
How GPUDirect Storage works — and why the ceilings are what they are
A short tour of the data paths that produced the numbers in this report.
Path 1
Local NVMe → GPU (direct)
cuFile issues a P2P DMA on behalf of the NVMe driver. Bytes go NVMe → CPU root complex → GPU BAR1, never touching DRAM.
- Requires: nvidia-fs kernel module, GPU BAR1 exposed via ReBAR, IOMMU in passthrough or off.
- Verified with: `gdscheck -p` and `XferType: GPUD` in the benchmark output.
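From application code, this direct path is what a cuFile read gives you. Below is a minimal sketch using kvikio, the RAPIDS Python bindings over libcufile; the path `/mnt/nvme/data.bin` and the 1 MiB size are illustrative, and actually running it requires a GPU, CUDA, and a GDS-capable local NVMe mount:

```python
import cupy
import kvikio

# Destination buffer lives in VRAM; with GDS active, the NVMe
# controller DMAs into it through BAR1, never staging in host DRAM.
buf = cupy.empty(1 << 20, dtype=cupy.uint8)

# Hypothetical file on a GDS-capable local NVMe mount.
f = kvikio.CuFile("/mnt/nvme/data.bin", "r")
nbytes = f.read(buf)  # blocking read straight into GPU memory
f.close()
print(f"read {nbytes} bytes into VRAM")
```

Note that kvikio falls back to the bounce-buffer compat path silently when the direct path is unavailable, which is why the `XferType: GPUD` check above matters.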
Path 2
NFSoRDMA → GPU (direct)
vastnfs-dkms plumbs RDMA reads/writes through libcufile_rdma directly into GPU BAR1. The NIC DMAs into VRAM the same way local NVMe does.
- Requires: nvidia_peermem, libcufile_rdma, vastnfs-dkms (not the in-tree rpcrdma), MOFED-built mlnx-nfsrdma-dkms.
- Verified with: `cat /proc/fs/nfsfs/cbstats` showing RDMA xprts, and `XferType: GPUD` in gdsio output.
Why the ceilings are what they are
Three independent ceilings apply: GPU uplink, drive aggregate, and NIC line rate. Each one takes over in turn as the workload scales.
| Ceiling | Theoretical | Measured | % of theoretical | Limiting factor |
|---|---|---|---|---|
| Single-GPU PCIe 4.0 x16 | ~28 GiB/s | 26.4 GiB/s | 94% | TLP overhead + GPU endpoint credits. Run 5 peak. |
| 4× drive raid0 | ~58 GB/s | 53.0 GB/s | 91% | 4× Solidigm D7-PS1010, ~14.5 GB/s sequential read each. Run 2 peak. |
| 400 GbE NIC line rate | ~45 GiB/s | 43.4 GiB/s | 96% | Ethernet + RoCE v2 + NFS framing overhead. Run 6 peak. |
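The percentage column is just measured ÷ theoretical; a quick sanity check of the arithmetic (each row uses a single unit, so the ratio is unit-free):

```python
# (ceiling, measured) pairs from the table above.
rows = {
    "single-GPU PCIe 4.0 x16": (28.0, 26.4),   # GiB/s
    "4x drive raid0":          (58.0, 53.0),   # GB/s
    "400 GbE NIC line rate":   (45.0, 43.4),   # GiB/s
}
for name, (ceiling, measured) in rows.items():
    print(f"{name}: {measured / ceiling:.0%} of theoretical")
```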
GDS vs POSIX compatibility mode
When libcufile can't use the direct path, it falls back to POSIX compatibility mode: a read into a DRAM bounce buffer followed by a cudaMemcpyAsync into VRAM. It always works, but it costs bandwidth.
On writes, compat mode costs nearly 2× bandwidth: every byte is copied into DRAM before being pushed back out to the device. On reads the penalty is smaller because kernel readahead can prefetch into DRAM ahead of the copy; nixlbench Run 3 shows POSIX actually winning single-threaded thanks to readahead, but that advantage disappears once you fan out across workers.
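The ~2× write penalty falls out of a back-of-the-envelope model: in compat mode every byte crosses DRAM twice (GPU → bounce buffer, then bounce buffer → device DMA), so the bounce buffer halves whatever DRAM bandwidth the copy can claim. A toy model, with all numbers illustrative rather than measured:

```python
def effective_write_bw(link_bw, dram_bw, direct):
    """Toy throughput model for a GPU -> storage write.

    direct (GDS):   bytes go straight from VRAM to the device, so only
                    the link (PCIe or NIC) bounds throughput.
    compat (POSIX): each byte is written into a DRAM bounce buffer and
                    read back out by the device DMA, so DRAM carries
                    2x the traffic of the actual transfer.
    """
    if direct:
        return link_bw
    return min(link_bw, dram_bw / 2)

# Illustrative: a 26 GB/s link on a box with 30 GB/s of DRAM
# bandwidth left over for I/O.
print(effective_write_bw(26, 30, direct=True))   # link-bound
print(effective_write_bw(26, 30, direct=False))  # DRAM-bound
```

The same model explains why the penalty grows on contested boxes: `dram_bw` is whatever is left after the other tenants, so the compat path degrades while the direct path does not.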
Where GDS wins vs where POSIX wins
GDS wins when
- Large batches + many submitter threads (multi-GPU, multi-worker)
- Small KV blocks where per-I/O overhead dominates (DeepSeek R1)
- DRAM bandwidth is contested (multi-tenant, high-core-count boxes)
- Real inference/training workloads where the GPU pulls from VRAM anyway
POSIX holds up when
- Single-threaded AIO is the shape of the workload
- Large sequential reads where kernel readahead is effective
- Multi-MB blocks at shallow queue depth (Llama 70B r=1)
- DRAM is not contested — plenty of memory channels free