Overview
How GPUDirect Storage works — and why the ceilings are what they are
A short tour of the data paths that produced the numbers in this report.
Path 1
Local NVMe → GPU (direct)
cuFile issues a P2P DMA on behalf of the NVMe driver. Bytes go NVMe → CPU root complex → GPU BAR1, never touching DRAM.
- Requires: nvidia-fs kernel module, GPU BAR1 exposed via ReBAR, IOMMU in passthrough or off.
- Verified with: `gdscheck -p` and `XferType: GPUD` in the benchmark output.
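From application code, this direct path is what a cuFile read gives you. Below is a minimal sketch using kvikio, the RAPIDS Python bindings over libcufile; the path `/mnt/nvme/data.bin` and the 1 MiB size are illustrative, and actually running it requires a GPU, CUDA, and a GDS-capable local NVMe mount:

```python
import cupy
import kvikio

# Destination buffer lives in VRAM; with GDS active, the NVMe
# controller DMAs into it through BAR1, never staging in host DRAM.
buf = cupy.empty(1 << 20, dtype=cupy.uint8)

# Hypothetical file on a GDS-capable local NVMe mount.
f = kvikio.CuFile("/mnt/nvme/data.bin", "r")
nbytes = f.read(buf)  # blocking read straight into GPU memory
f.close()
print(f"read {nbytes} bytes into VRAM")
```

Note that kvikio falls back to the bounce-buffer compat path silently when the direct path is unavailable, which is why the `XferType: GPUD` check above matters.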
Path 2
NFSoRDMA → GPU (direct)
vastnfs-dkms plumbs RDMA reads/writes through libcufile_rdma directly into GPU BAR1. The NIC DMAs into VRAM the same way local NVMe does.
- Requires: nvidia_peermem, libcufile_rdma, vastnfs-dkms (not the in-tree rpcrdma), MOFED-built mlnx-nfsrdma-dkms.
- Verified with: `cat /proc/fs/nfsfs/cbstats` showing RDMA xprts, and `XferType: GPUD` in gdsio output.
Why the ceilings are what they are
Three independent ceilings apply: GPU uplink, drive aggregate, and NIC line rate. Each one takes over in turn as the workload scales.
| Ceiling | Theoretical | Measured | % of theoretical | Limiting factor |
|---|---|---|---|---|
| Single-GPU PCIe 4.0 x16 | ~28 GiB/s | 26.4 GiB/s | 94% | TLP overhead + GPU endpoint credits. Run 5 peak. |
| 4× drive raid0 | ~58 GB/s | 53.0 GB/s | 91% | 4× Solidigm D7-PS1010, ~14.5 GB/s sequential read each. Run 2 peak. |
| 400 GbE NIC line rate | ~45 GiB/s | 43.4 GiB/s | 96% | Ethernet + RoCE v2 + NFS framing overhead. Run 6 peak. |
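The percentage column is just measured ÷ theoretical; a quick sanity check of the arithmetic (each row uses a single unit, so the ratio is unit-free):

```python
# (ceiling, measured) pairs from the table above.
rows = {
    "single-GPU PCIe 4.0 x16": (28.0, 26.4),   # GiB/s
    "4x drive raid0":          (58.0, 53.0),   # GB/s
    "400 GbE NIC line rate":   (45.0, 43.4),   # GiB/s
}
for name, (ceiling, measured) in rows.items():
    print(f"{name}: {measured / ceiling:.0%} of theoretical")
```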
GDS vs POSIX compatibility mode
When libcufile can't use the direct path, it falls back to POSIX compatibility mode: a read into a DRAM bounce buffer followed by a cudaMemcpyAsync into VRAM. It always works, but it costs bandwidth.
On writes, compat mode costs nearly 2× bandwidth: every byte is copied into DRAM before being pushed back out to the device. On reads the penalty is smaller because kernel readahead can prefetch into DRAM ahead of the copy; nixlbench Run 3 shows POSIX actually winning single-threaded thanks to readahead, but that advantage disappears once you fan out across workers.
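The ~2× write penalty falls out of a back-of-the-envelope model: in compat mode every byte crosses DRAM twice (GPU → bounce buffer, then bounce buffer → device DMA), so the bounce buffer halves whatever DRAM bandwidth the copy can claim. A toy model, with all numbers illustrative rather than measured:

```python
def effective_write_bw(link_bw, dram_bw, direct):
    """Toy throughput model for a GPU -> storage write.

    direct (GDS):   bytes go straight from VRAM to the device, so only
                    the link (PCIe or NIC) bounds throughput.
    compat (POSIX): each byte is written into a DRAM bounce buffer and
                    read back out by the device DMA, so DRAM carries
                    2x the traffic of the actual transfer.
    """
    if direct:
        return link_bw
    return min(link_bw, dram_bw / 2)

# Illustrative: a 26 GB/s link on a box with 30 GB/s of DRAM
# bandwidth left over for I/O.
print(effective_write_bw(26, 30, direct=True))   # link-bound
print(effective_write_bw(26, 30, direct=False))  # DRAM-bound
```

The same model explains why the penalty grows on contested boxes: `dram_bw` is whatever is left after the other tenants, so the compat path degrades while the direct path does not.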
Where GDS wins vs where POSIX wins
GDS wins when
- Large batches + many submitter threads (multi-GPU, multi-worker)
- Small KV blocks where per-I/O overhead dominates (DeepSeek R1)
- DRAM bandwidth is contested (multi-tenant, high-core-count boxes)
- Real inference/training workloads where the GPU pulls from VRAM anyway
POSIX holds up when
- Single-threaded AIO is the shape of the workload
- Large sequential reads where kernel readahead is effective
- Multi-MB blocks at shallow queue depth (Llama 70B r=1)
- DRAM is not contested — plenty of memory channels free