Overview

NIXL Bench — GDS vs POSIX vs GDS_MT

nixlbench run with one submitter thread and batch size 64, sweeping block sizes from 64 KiB to 16 MiB against /gds. It compares three NIXL backends: GDS (GPUDirect Storage), POSIX (AIO through a DRAM bounce buffer), and GDS_MT (a multi-threaded GDS variant).
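For scale, the sweep above works out to the submission counts below. The endpoints (64 KiB, 16 MiB) and the 2 GiB total / batch=64 figures come from this page; the power-of-two doubling between endpoints is an assumption, since only the endpoints are stated.

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3

# Sweep endpoints are from the page; doubling steps are an assumption.
def sweep(start=64 * KIB, stop=16 * MIB):
    size = start
    while size <= stop:
        yield size
        size *= 2

# 2 GiB total transfer, batch=64 (from the chart captions): how many
# blocks and batched submissions each sweep point amounts to.
TOTAL, BATCH = 2 * GIB, 64
for block in sweep():
    blocks = TOTAL // block
    print(f"{block // KIB:>6} KiB  {blocks:>6} blocks  {blocks // BATCH:>4} batches")
```

At 64 KiB a 2 GiB transfer is 512 batched submissions; at 16 MiB it is only 2, which is why per-submission overhead dominates the left side of the sweep.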

Read · 16 MiB · POSIX
21.6 GiB/s
POSIX wins at a single submitter thread; the kernel page cache and libaio's request reordering both help here.
Read · 16 MiB · GDS
12.6 GiB/s
Single-thread GDS is submission-rate-bound. See /gdsio for how this changes with more workers.
Read · 16 MiB · GDS_MT
~12 GiB/s
GDS_MT did not improve single-thread numbers — the parallelism it adds is at the file level, not the worker level.
READ — block-size sweep
1 submitter thread · batch=64 · 2 GiB total xfer · VRAM target · /gds
WRITE — block-size sweep
1 submitter thread · batch=64 · 2 GiB total xfer · VRAM initiator · /gds
What this tells us
  • Single-thread benchmarks favor POSIX AIO. Linux AIO, the kernel's readahead machinery, and DRAM-side prefetch all compound. GDS moves the data via peer-to-peer DMA, but with one submitter the submission path is the bottleneck and the GPU sits waiting.
  • This is not representative of real workloads. Training and inference fan out across many submitter threads (one per DataLoader worker, or one per KV shard). See the Run 2 multi-GPU numbers on /gdsio — 53 GiB/s aggregate vs ~22 GiB/s here.
  • GDS wins where DRAM bandwidth is the real constraint. Here we have 2 × 12-channel DDR5 ≈ 920 GiB/s of DRAM BW — nowhere near the bottleneck. On machines with tighter DRAM headroom or in multi-tenant setups, POSIX's DRAM bounce gets expensive fast.
  • Small blocks are cache-hostile. All backends start below 10 GiB/s at 64 KiB and climb with block size. The floor is the fixed kernel-entry cost per submission (through nvidia-fs on the GDS path), amortized over more bytes as blocks grow.
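The small-block falloff in the last bullet can be sketched with a toy amortization model: each submission pays a fixed cost plus wire time at some peak rate. The fixed cost (5 µs) and peak (22 GiB/s) below are illustrative assumptions, not measured values from this run.

```python
def effective_gib_s(block_bytes: int, c_us: float, peak_gib_s: float) -> float:
    """Throughput when every submission pays a fixed cost c_us (microseconds)
    plus the wire time to move block_bytes at peak_gib_s."""
    gib = 1024 ** 3
    wire_s = block_bytes / (peak_gib_s * gib)
    total_s = c_us * 1e-6 + wire_s
    return (block_bytes / gib) / total_s

# Illustrative parameters only: 5 us fixed cost per submission, 22 GiB/s peak.
for kib in (64, 256, 1024, 4096, 16384):
    print(f"{kib:>6} KiB  {effective_gib_s(kib * 1024, 5.0, 22.0):5.1f} GiB/s")
```

With these made-up parameters the model lands below 10 GiB/s at 64 KiB and approaches the peak at 16 MiB, reproducing the shape of the sweep: the fixed cost is amortized over more bytes as blocks grow.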
Raw CSVs (space-delimited, columns: block, batch, GiB/s, avg-lat, prep-p50/p99, post-p50/p99, tx-p50/p99): GDS-W · GDS-R · POSIX-W · POSIX-R · GDS_MT-W · GDS_MT-R
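A minimal parser for the space-delimited layout listed above, splitting the p50/p99 pairs into separate fields. The example row's values are fabricated placeholders, not data from these runs.

```python
from typing import NamedTuple

class Row(NamedTuple):
    block: int       # block size in bytes
    batch: int
    gib_s: float     # throughput, GiB/s
    avg_lat: float
    prep_p50: float
    prep_p99: float
    post_p50: float
    post_p99: float
    tx_p50: float
    tx_p99: float

def parse_line(line: str) -> Row:
    """Parse one space-delimited row in the column order given above."""
    f = line.split()
    return Row(int(f[0]), int(f[1]), *map(float, f[2:10]))

# Placeholder row, not real data: 16 MiB block, batch 64, 21.6 GiB/s.
row = parse_line("16777216 64 21.6 1.2 0.1 0.3 0.2 0.5 1.0 1.4")
print(row.block, row.gib_s)
```

Latency units are not stated on this page, so the parser leaves them as raw floats.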