Overview

KV Bench — model-realistic KV-cache transfers

kvbench wraps nixlbench with model-appropriate block sizes and batch counts drawn from three real model configs. This is the closest single-GPU benchmark to actual inference-serving behavior.

DeepSeek R1 · GDS
16.4 GiB/s
Beats POSIX (14.8 GiB/s) at this model's 549 KB blocks; the first single-thread result in the study where GDS outperforms POSIX.
Llama 70B · POSIX · r=1
21.4 GiB/s
Best overall result. A deeper batch (r=10) dropped to 19.5 GiB/s; returns diminish past a batch depth of ~100.
Llama 8B · POSIX · r=10
17.2 GiB/s
GDS at r=10 hung on this libcufile build; a known issue with batch ≥ 256 under nvidia-fs 2.28.4.
Throughput by model × backend
Single GPU, single thread, VRAM target, KV-cache block sizes per model. GDS r=10 is omitted because it hung on this libcufile.
Key insight
Small KV blocks favor GDS

At DeepSeek R1's 549 KB blocks, POSIX-AIO's per-I/O syscall + DMA-bounce overhead starts to dominate. GDS's in-kernel P2P path has lower per-op cost, so it pulls ahead even at single-thread depth.

Rule of thumb from this run: for sub-MB KV blocks, prefer GDS. For multi-MB blocks, POSIX is competitive until batch depth overwhelms the AIO pool.
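The rule of thumb above can be captured as a small selection helper. This is an illustrative sketch, not part of kvbench: the function name, the 1 MiB threshold, and the backend labels are assumptions drawn from this run's numbers.

```python
# Illustrative backend-selection heuristic based on this run's results.
# choose_backend() and the 1 MiB cutoff are assumptions, not a kvbench API.

MIB = 1024 * 1024

def choose_backend(block_bytes: int) -> str:
    """Pick a transfer backend for a given KV-cache block size."""
    if block_bytes < MIB:
        # Sub-MB blocks: per-op cost dominates, and GDS's in-kernel
        # P2P path is cheaper per operation than POSIX-AIO's
        # syscall + DRAM-bounce path.
        return "GDS"
    # Multi-MB blocks: POSIX stays competitive until batch depth
    # overwhelms the AIO pool (r=10 already regressed vs r=1 here).
    return "POSIX"

print(choose_backend(549 * 1024))   # DeepSeek R1-sized block
print(choose_backend(2_500_000))    # Llama 70B-sized block
```

At DeepSeek R1's 549 KB blocks this picks GDS, matching the result above; at Llama 70B's 2.5 MB blocks it picks POSIX.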

Known issue
GDS hangs at batch ≥ 256

With nvidia-fs 2.28.4 + CUDA 12.9, GDS hangs indefinitely when kvbench --num_requests scales the effective batch past ~256 submissions in flight. POSIX r=10 completes normally. We did not pursue workarounds here; tracked for a later libcufile upgrade.

r=1 worked for all three models. Those are the GDS numbers shown.
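While this bug is open, one defensive pattern is to clamp the effective GDS batch below the hang threshold before launching a run. The wrapper below is a hypothetical sketch for this environment (nvidia-fs 2.28.4 + CUDA 12.9); the limit and the cap-vs-fallback choice are assumptions, not kvbench behavior.

```python
# Guard against the GDS hang observed at effective batch >= 256.
# Hypothetical wrapper, not kvbench code; the limit is from this run.

GDS_BATCH_LIMIT = 256  # hang threshold observed on this libcufile

def plan_run(backend: str, num_requests: int) -> tuple[str, int]:
    """Return a (backend, num_requests) pair that avoids the known hang."""
    if backend == "GDS" and num_requests >= GDS_BATCH_LIMIT:
        # Cap the batch; alternatively fall back to POSIX, which
        # completes normally at r=10 on the same setup.
        return backend, GDS_BATCH_LIMIT - 1
    return backend, num_requests

print(plan_run("GDS", 1024))    # capped below the threshold
print(plan_run("POSIX", 1024))  # unchanged
```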

Per-request latency breakdown (Llama 70B · 2.5 MB)
Shows where the work happens: GDS spends almost nothing in the Post phase because there is no DRAM bounce.
Backend        Prep μs    Post μs    Tx μs      Avg GiB/s
GDS (r=1)      ~28        ~8         ~20000     12.5
POSIX (r=1)    ~25        ~14000     ~1000      21.4
POSIX (r=10)   ~25        ~20000     ~250       19.5

Figures extracted from nixlbench output in kvbench logs. GDS Tx dominates because the whole transfer happens inside that phase; POSIX Post dominates because the DRAM bounce is counted there.
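The "which phase dominates" claim can be checked directly from the table. A small sketch, with the timings copied from the table above; it treats the phases as strictly sequential, which overcounts slightly if any phases overlap:

```python
# Compute each phase's share of per-request time from the latency table.
# Assumes phases run sequentially (no overlap), an approximation.

rows = {
    "GDS (r=1)":    {"prep": 28, "post": 8,     "tx": 20000},
    "POSIX (r=1)":  {"prep": 25, "post": 14000, "tx": 1000},
    "POSIX (r=10)": {"prep": 25, "post": 20000, "tx": 250},
}

for name, phases in rows.items():
    total = sum(phases.values())
    dominant = max(phases, key=phases.get)
    share = phases[dominant] / total
    print(f"{name}: {dominant} dominates ({share:.1%} of ~{total} us)")
```

This reproduces the observation above: Tx is >99% of the GDS request, while Post is >90% of both POSIX requests.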