Overview

NIXL Bench — GDS vs POSIX vs GDS_MT

nixlbench run with one submitter thread and batch size 64, sweeping block sizes from 64 KiB to 16 MiB against /gds. It compares three NIXL backends: GDS (GPUDirect Storage), POSIX (AIO through a DRAM bounce buffer), and GDS_MT (a multi-threaded GDS variant).
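For scale, the sweep above works out to the submission counts below. The endpoints (64 KiB, 16 MiB) and the 2 GiB total / batch=64 figures come from this page; the power-of-two doubling between endpoints is an assumption, since only the endpoints are stated.

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3

# Sweep endpoints are from the page; doubling steps are an assumption.
def sweep(start=64 * KIB, stop=16 * MIB):
    size = start
    while size <= stop:
        yield size
        size *= 2

# 2 GiB total transfer, batch=64 (from the chart captions): how many
# blocks and batched submissions each sweep point amounts to.
TOTAL, BATCH = 2 * GIB, 64
for block in sweep():
    blocks = TOTAL // block
    print(f"{block // KIB:>6} KiB  {blocks:>6} blocks  {blocks // BATCH:>4} batches")
```

At 64 KiB a 2 GiB transfer is 512 batched submissions; at 16 MiB it is only 2, which is why per-submission overhead dominates the left side of the sweep.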

Read · 16 MiB · POSIX
21.6 GiB/s
POSIX wins at a single submitter thread; the kernel page cache and libaio's request reordering both help here.
Read · 16 MiB · GDS
12.6 GiB/s
Single-thread GDS is submission-rate-bound. See /gdsio for how this changes with more workers.
Read · 16 MiB · GDS_MT
~12 GiB/s
GDS_MT did not improve single-thread numbers — the parallelism it adds is at the file level, not the worker level.
READ — block-size sweep
1 submitter thread · batch=64 · 2 GiB total xfer · VRAM target · /gds
WRITE — block-size sweep
1 submitter thread · batch=64 · 2 GiB total xfer · VRAM initiator · /gds
What this tells us
  • Single-thread benchmarks favor POSIX AIO. Linux AIO, the kernel's readahead machinery, and DRAM-side prefetch all compound. GDS moves the data via peer-to-peer DMA, but with one submitter the submission path is the bottleneck and the GPU sits waiting.
  • This is not representative of real workloads. Training and inference fan out across many submitter threads (one per DataLoader worker, or one per KV shard). See the Run 2 multi-GPU numbers on /gdsio — 53 GiB/s aggregate vs ~22 GiB/s here.
  • GDS wins where DRAM bandwidth is the real constraint. Here we have 2 × 12-channel DDR5 ≈ 920 GiB/s of DRAM BW — nowhere near the bottleneck. On machines with tighter DRAM headroom or in multi-tenant setups, POSIX's DRAM bounce gets expensive fast.
  • Small blocks are cache-hostile. All backends start below 10 GiB/s at 64 KiB and climb with block size. The floor is the fixed kernel-entry cost per submission (through nvidia-fs on the GDS path), amortized over more bytes as blocks grow.
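The small-block falloff in the last bullet can be sketched with a toy amortization model: each submission pays a fixed cost plus wire time at some peak rate. The fixed cost (5 µs) and peak (22 GiB/s) below are illustrative assumptions, not measured values from this run.

```python
def effective_gib_s(block_bytes: int, c_us: float, peak_gib_s: float) -> float:
    """Throughput when every submission pays a fixed cost c_us (microseconds)
    plus the wire time to move block_bytes at peak_gib_s."""
    gib = 1024 ** 3
    wire_s = block_bytes / (peak_gib_s * gib)
    total_s = c_us * 1e-6 + wire_s
    return (block_bytes / gib) / total_s

# Illustrative parameters only: 5 us fixed cost per submission, 22 GiB/s peak.
for kib in (64, 256, 1024, 4096, 16384):
    print(f"{kib:>6} KiB  {effective_gib_s(kib * 1024, 5.0, 22.0):5.1f} GiB/s")
```

With these made-up parameters the model lands below 10 GiB/s at 64 KiB and approaches the peak at 16 MiB, reproducing the shape of the sweep: the fixed cost is amortized over more bytes as blocks grow.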
Raw CSVs (space-delimited, columns: block, batch, GiB/s, avg-lat, prep-p50/p99, post-p50/p99, tx-p50/p99): GDS-W · GDS-R · POSIX-W · POSIX-R · GDS_MT-W · GDS_MT-R
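A minimal parser for the space-delimited layout listed above, splitting the p50/p99 pairs into separate fields. The example row's values are fabricated placeholders, not data from these runs.

```python
from typing import NamedTuple

class Row(NamedTuple):
    block: int       # block size in bytes
    batch: int
    gib_s: float     # throughput, GiB/s
    avg_lat: float
    prep_p50: float
    prep_p99: float
    post_p50: float
    post_p99: float
    tx_p50: float
    tx_p99: float

def parse_line(line: str) -> Row:
    """Parse one space-delimited row in the column order given above."""
    f = line.split()
    return Row(int(f[0]), int(f[1]), *map(float, f[2:10]))

# Placeholder row, not real data: 16 MiB block, batch 64, 21.6 GiB/s.
row = parse_line("16777216 64 21.6 1.2 0.1 0.3 0.2 0.5 1.0 1.4")
print(row.block, row.gib_s)
```

Latency units are not stated on this page, so the parser leaves them as raw floats.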