Overview

KV Bench — model-realistic KV-cache transfers

kvbench wraps nixlbench with model-appropriate block sizes and batch counts drawn from three real model configs. This is the closest single-GPU benchmark to actual inference-serving behavior.

DeepSeek R1 · GDS
16.4 GiB/s
Beats POSIX (14.8 GiB/s) at this model's 549 KB blocks; the first single-thread result in the study where GDS outperforms POSIX.
Llama 70B · POSIX · r=1
21.4 GiB/s
Best overall result. A deeper batch (r=10) dropped to 19.5 GiB/s; returns diminish past a batch depth of ~100.
Llama 8B · POSIX · r=10
17.2 GiB/s
GDS at r=10 hung on this libcufile build; a known issue with batch ≥ 256 under nvidia-fs 2.28.4.
Throughput by model × backend
Single GPU, single thread, VRAM target, KV-cache block sizes per model. GDS r=10 is omitted because it hung on this libcufile.
Key insight
Small KV blocks favor GDS

At DeepSeek R1's 549 KB blocks, POSIX-AIO's per-I/O syscall + DMA-bounce overhead starts to dominate. GDS's in-kernel P2P path has lower per-op cost, so it pulls ahead even at single-thread depth.

Rule of thumb from this run: for sub-MB KV blocks, prefer GDS. For multi-MB blocks, POSIX is competitive until batch depth overwhelms the AIO pool.
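The rule of thumb above can be captured as a small selection helper. This is an illustrative sketch, not part of kvbench: the function name, the 1 MiB threshold, and the backend labels are assumptions drawn from this run's numbers.

```python
# Illustrative backend-selection heuristic based on this run's results.
# choose_backend() and the 1 MiB cutoff are assumptions, not a kvbench API.

MIB = 1024 * 1024

def choose_backend(block_bytes: int) -> str:
    """Pick a transfer backend for a given KV-cache block size."""
    if block_bytes < MIB:
        # Sub-MB blocks: per-op cost dominates, and GDS's in-kernel
        # P2P path is cheaper per operation than POSIX-AIO's
        # syscall + DRAM-bounce path.
        return "GDS"
    # Multi-MB blocks: POSIX stays competitive until batch depth
    # overwhelms the AIO pool (r=10 already regressed vs r=1 here).
    return "POSIX"

print(choose_backend(549 * 1024))   # DeepSeek R1-sized block
print(choose_backend(2_500_000))    # Llama 70B-sized block
```

At DeepSeek R1's 549 KB blocks this picks GDS, matching the result above; at Llama 70B's 2.5 MB blocks it picks POSIX.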

Known issue
GDS hangs at batch ≥ 256

With nvidia-fs 2.28.4 + CUDA 12.9, GDS hangs indefinitely when kvbench --num_requests scales the effective batch past ~256 submissions in flight. POSIX r=10 completes normally. We did not pursue workarounds here; tracked for a later libcufile upgrade.

r=1 worked for all three models. Those are the GDS numbers shown.
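While this bug is open, one defensive pattern is to clamp the effective GDS batch below the hang threshold before launching a run. The wrapper below is a hypothetical sketch for this environment (nvidia-fs 2.28.4 + CUDA 12.9); the limit and the cap-vs-fallback choice are assumptions, not kvbench behavior.

```python
# Guard against the GDS hang observed at effective batch >= 256.
# Hypothetical wrapper, not kvbench code; the limit is from this run.

GDS_BATCH_LIMIT = 256  # hang threshold observed on this libcufile

def plan_run(backend: str, num_requests: int) -> tuple[str, int]:
    """Return a (backend, num_requests) pair that avoids the known hang."""
    if backend == "GDS" and num_requests >= GDS_BATCH_LIMIT:
        # Cap the batch; alternatively fall back to POSIX, which
        # completes normally at r=10 on the same setup.
        return backend, GDS_BATCH_LIMIT - 1
    return backend, num_requests

print(plan_run("GDS", 1024))    # capped below the threshold
print(plan_run("POSIX", 1024))  # unchanged
```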

Per-request latency breakdown (Llama 70B · 2.5 MB)
Shows where the work happens: GDS spends almost nothing in the Post phase because there is no DRAM bounce.
Backend        Prep μs    Post μs    Tx μs      Avg GiB/s
GDS (r=1)      ~28        ~8         ~20000     12.5
POSIX (r=1)    ~25        ~14000     ~1000      21.4
POSIX (r=10)   ~25        ~20000     ~250       19.5

Figures extracted from nixlbench output in kvbench logs. GDS Tx dominates because the whole transfer happens inside that phase; POSIX Post dominates because the DRAM bounce is counted there.
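The "which phase dominates" claim can be checked directly from the table. A small sketch, with the timings copied from the table above; it treats the phases as strictly sequential, which overcounts slightly if any phases overlap:

```python
# Compute each phase's share of per-request time from the latency table.
# Assumes phases run sequentially (no overlap), an approximation.

rows = {
    "GDS (r=1)":    {"prep": 28, "post": 8,     "tx": 20000},
    "POSIX (r=1)":  {"prep": 25, "post": 14000, "tx": 1000},
    "POSIX (r=10)": {"prep": 25, "post": 20000, "tx": 250},
}

for name, phases in rows.items():
    total = sum(phases.values())
    dominant = max(phases, key=phases.get)
    share = phases[dominant] / total
    print(f"{name}: {dominant} dominates ({share:.1%} of ~{total} us)")
```

This reproduces the observation above: Tx is >99% of the GDS request, while Post is >90% of both POSIX requests.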