Two passes. First one OOM’d. Second one ran at 166 t/s. The difference was one CUDA-graph flag.
Here’s the story.
The model
NVIDIA dropped a quiet release on May 8th: nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-NVFP4. 4k+ downloads/week, very little community noise. The interesting bits:
- 30B parameters, MoE-A3B (128 routed experts + 1 shared) — only ~3B active per token
- Native NVFP4 quantization (NVIDIA’s E2M1 4-bit, ModelOpt checkpoint format) — Blackwell tensor cores accept FP4 inputs natively, no dequantize-and-multiply step
- Hybrid Mamba+Attention architecture (NemotronHForCausalLM): half the layers are linear-cost state-space, half are quadratic attention
- 262K native context (we’ll see we can’t actually use that on 24 GB)
- ~19 GB safetensors
This is the first NVIDIA-pushed model targeting consumer Blackwell that’s actually on HuggingFace. (Nemotron-3 Super from April is still gated.) Worth a test.
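For anyone who hasn't met E2M1 before, here is a minimal sketch of how a 4-bit code maps to a number, assuming the usual layout of 1 sign bit, 2 exponent bits (bias 1), and 1 mantissa bit. The function name `decode_e2m1` is mine, not something from vLLM or ModelOpt, and the sketch ignores the scale factors that NVFP4 applies on top of the raw codes.

```python
def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    if not 0 <= code <= 0xF:
        raise ValueError("E2M1 codes are 4 bits wide")
    sign = -1.0 if code & 0b1000 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 0b1
    if exponent == 0:
        # Subnormal range: no implicit leading 1, so the value is 0 or 0.5.
        return sign * mantissa * 0.5
    # Normal range: implicit leading 1, exponent bias of 1.
    return sign * (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)


# The eight non-negative magnitudes a single FP4 element can represent.
positive_values = [decode_e2m1(code) for code in range(8)]
print(positive_values)
```

Eight magnitudes per sign is all a raw element gets, which is why the format only works in combination with scaling.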
Pass 1 — vLLM defaults, OOM
image: vllm/vllm-openai:nightly
args:
- nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-NVFP4
- --max-model-len 32768
- --gpu-memory-utilization 0.92
- --trust-remote-code
Boot:
CUDA out of memory. Tried to allocate 316.00 MiB. GPU 0 has a total capacity
of 23.42 GiB of which 416.38 MiB is free. Including non-PyTorch memory, this
process has 23.13 GiB memory in use.
Math:
- Model: 18.78 GB
- vLLM’s default CUDA graph capture (FULL_AND_PIECEWISE mode, sizes [1, 2, 4, 8]) reserves ~4 GB
- HAMi takes ~600 MB off the top of the physical 24 GB
- 23.42 GB usable - 19 GB model - 4 GB graphs ≈ 0.4 GB free
- KV cache and compute buffers both need that 0.4 GB
It doesn’t fit by a few hundred MB.
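The same budget as a few lines of Python, using the rounded figures from the list above. This is back-of-envelope arithmetic only; the helper name `headroom_gb` is made up for illustration, and it says nothing about how vLLM actually accounts for memory.

```python
# Rounded figures from the list above, all in GB.
USABLE_GB = 23.42          # what the pod sees after HAMi takes its share
MODEL_GB = 19.0            # NVFP4 weights
DEFAULT_GRAPHS_GB = 4.0    # default CUDA graph capture


def headroom_gb(usable: float, model: float, graphs: float) -> float:
    """Memory left for KV cache and compute buffers."""
    return usable - model - graphs


default_headroom = headroom_gb(USABLE_GB, MODEL_GB, DEFAULT_GRAPHS_GB)
eager_headroom = headroom_gb(USABLE_GB, MODEL_GB, 0.0)

print(f"default graphs: {default_headroom:.2f} GB left")
print(f"no graphs:      {eager_headroom:.2f} GB left")
```

Everything that isn't weights or graphs has to come out of that last number, which is why shrinking the graph pool matters so much on a 24 GB card.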
Workaround #1: --enforce-eager — disable CUDA graphs entirely. Pod boots. Bench: 32.60 t/s (10 runs, σ ≈ 0.08 — extremely stable). But eager mode is ~half the speed of a graph-captured path on decode-heavy workloads. This is the floor, not the ceiling.
Filed away. Moved on. (Then circled back when the user asked why eager and the answer “graphs OOM” felt like it should be solvable.)
Pass 2 — PIECEWISE graphs with restricted capture sizes
vLLM’s CLI doesn’t expose --cudagraph-mode directly. The setting lives inside --compilation-config, as a JSON blob:
args:
- nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-NVFP4
- --max-model-len 4096
- --gpu-memory-utilization 0.95
- --max-num-seqs 1
- --compilation-config
- '{"cudagraph_mode": 1, "max_cudagraph_capture_size": 4, "cudagraph_capture_sizes": [1, 2, 4]}'
- --trust-remote-code
What changed:
- cudagraph_mode: 1 selects CUDAGraphMode.PIECEWISE (vs. the default FULL_AND_PIECEWISE). PIECEWISE captures the graph segments between the splitting ops (attention itself runs outside the graphs) rather than full forward graphs. Drops half the capture memory.
- cudagraph_capture_sizes: [1, 2, 4] is explicit. The default would scan [1, 2, 4, 8] and capture all four.
- max_cudagraph_capture_size: 4 is a hard cap.
- max_model_len: 4096 trades KV cache for graph headroom (more on this below).
- max_num_seqs: 1 means single-stream, no concurrency.
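Hand-writing a JSON blob inside a YAML list is easy to get wrong, so here is a small sketch that generates the same args from Python. The helper names (`build_compilation_config`, `build_vllm_args`) are hypothetical; the flags and config keys are exactly the ones used in the args above, nothing more.

```python
import json


def build_compilation_config(capture_sizes, cudagraph_mode=1):
    """Build the compilation-config dict, keeping the cap consistent with the sizes."""
    sizes = sorted(set(capture_sizes))
    if not sizes or sizes[0] < 1:
        raise ValueError("capture sizes must be positive integers")
    return {
        "cudagraph_mode": cudagraph_mode,
        "max_cudagraph_capture_size": max(sizes),
        "cudagraph_capture_sizes": sizes,
    }


def build_vllm_args(model, max_model_len, config):
    """Assemble the container args list used in this post."""
    return [
        model,
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", "0.95",
        "--max-num-seqs", "1",
        "--compilation-config", json.dumps(config),
        "--trust-remote-code",
    ]


config = build_compilation_config([1, 2, 4])
args = build_vllm_args(
    "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-NVFP4", 4096, config
)
print(args)
```

Deriving max_cudagraph_capture_size from the list means the two values can't drift apart when you experiment with other sizes.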
Boot log:
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 3/3 [00:00<00:00, 8.30it/s]
CUDA graph pool memory: 0.04 GiB (actual), 0.05 GiB (estimated)
0.04 GiB. 40 megabytes. That’s the entire CUDA graph pool. Versus the default’s ~4 GiB. 100× reduction in graph memory, and vLLM still captures three useful sizes.
Pod loads cleanly. 23.13 GB used / 23.42 GB available. 290 MB margin.
Bench
10 runs of Space Invaders HTML completion (2000 tokens each), temp=0.6 top_k=20 min_p=0.
| Run | t/s | Wall time |
|---|---|---|
| 1 (graph capture / JIT warmup) | 23.28 | 85.89s |
| 2 | 166.62 | 12.00s |
| 3 | 166.46 | 12.02s |
| 4 | 165.91 | 12.05s |
| 5 | 165.46 | 12.09s |
| 6 | 165.31 | 12.10s |
| 7 | 165.25 | 12.10s |
| 8 | 165.40 | 12.09s |
| 9 | 166.41 | 12.02s |
| 10 | 166.39 | 12.02s |
Post-warmup: 165.91 t/s AVG, σ ≈ 0.5 (range 165.25 – 166.62).
That’s 5.1× the eager-mode bench. Almost all the difference comes from CUDA graphs — same model weights, same backend kernels, same context window. The host-CPU roundtrip cost on small-batch decode is just that brutal without graphs.
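The summary numbers can be reproduced from the table with the standard library. A quick sketch, assuming population standard deviation over the nine post-warmup runs and the 32.60 t/s eager figure from Pass 1:

```python
import statistics

# t/s per run, copied from the table above (run 1 is the warmup run).
runs_tps = [23.28, 166.62, 166.46, 165.91, 165.46,
            165.31, 165.25, 165.40, 166.41, 166.39]
EAGER_TPS = 32.60  # eager-mode bench from Pass 1

post_warmup = runs_tps[1:]  # drop the graph-capture / JIT warmup run
mean_tps = statistics.mean(post_warmup)
sigma_tps = statistics.pstdev(post_warmup)
speedup = mean_tps / EAGER_TPS

print(f"mean  = {mean_tps:.2f} t/s")
print(f"sigma = {sigma_tps:.2f} t/s")
print(f"speedup vs eager = {speedup:.1f}x")
```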
The new ranking on Olares One
This is my full Olares One leaderboard as of tonight (single user, single-stream, 2000-token completions):
| Stack | t/s AVG | Model | Quant |
|---|---|---|---|
| Nemotron-Labs Elastic 30B-A3B NVFP4 + vLLM + PIECEWISE | 165.91 | NemotronH 30B-A3B | NVFP4 ModelOpt |
| Gemma 4 26B-A4B vLLM | 135.97 | Gemma 4 26B-A4B | AWQ-INT4 |
| BeeLlama Qwen3.6 27B + DFlash + turbo3 KV | 107.54 | Qwen3.6 27B dense | Q3_K_XL |
| llama.cpp MTP master ad27757 | 74.28 | Qwen3.6 27B dense | Q3_K_XL |
| Nemotron-Labs eager (no graphs) | 32.60 | NemotronH 30B-A3B | NVFP4 ModelOpt |
vs the prior Gemma 4 champion: +22%. vs my best Qwen3.6 path (BeeLlama): +54%. vs Qwen3.6 + MTP-master: +123%.
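The relative gains come straight from the t/s column. A small sketch for recomputing them whenever the leaderboard changes; the short dictionary keys are my own labels for the rows above.

```python
# t/s averages from the leaderboard table above.
leaderboard_tps = {
    "nemotron_piecewise": 165.91,
    "gemma4_vllm": 135.97,
    "beellama_qwen36": 107.54,
    "llamacpp_mtp": 74.28,
    "nemotron_eager": 32.60,
}


def gain_pct(new_tps: float, old_tps: float) -> float:
    """Relative throughput gain of new over old, in percent."""
    return (new_tps / old_tps - 1.0) * 100.0


best = leaderboard_tps["nemotron_piecewise"]
for name, tps in leaderboard_tps.items():
    if name != "nemotron_piecewise":
        print(f"vs {name}: +{gain_pct(best, tps):.1f}%")
```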
For reference, the Reddit r/LocalLLaMA bench reports for the same model class on desktop hardware:
- RTX 5090 desktop 32 GB on Qwen3.6 27B UD-Q4_K_XL: ~180-185 t/s
- RTX 4090 24 GB on Qwen3.6 27B Q3_K_XL: ~115 t/s
On paper, the 5090M Laptop has ~50% of the memory bandwidth of the 5090 desktop. Closing to within 10% of the desktop 5090 on a 30B model is something native NVFP4 kernels can do that quantized GGUF formats can’t, because there’s no dequantize step in the inference loop.
Why this works so well
Three things compound to give us +22% over the prior champion:
- Native NVFP4 on Blackwell tensor cores. AWQ-INT4 (Gemma 4 path) and Q3_K_XL (Qwen3.6 GGUF) both need a dequantize-and-multiply: take 4-bit, expand to FP16/BF16, then run the GEMM. Blackwell’s FP4 tensor cores skip that. vLLM picks FlashInferCutlassNvFp4LinearKernel for the GEMM and FLASHINFER_CUTLASS for the MoE, both written specifically for the NVFP4 path. The same 30B model in BF16 wouldn’t fit in 24 GB at all.
- MoE-A3B routing on FlashInfer. 128 experts, 3B active per token. The active-parameter count is comparable to a dense 3B model, but you get the breadth of a 30B model when the router needs it. The FlashInfer CUTLASS MoE backend has been tuned for this exact pattern; the routing overhead is small (~5%) compared to the savings from not running all 30B params per token.
- Hybrid Mamba+Attention. Half the layers are state-space (Mamba): O(1) work per decoded token regardless of context length, versus attention’s cost that grows with the KV cache. At 4K context this matters less, but it means roughly half the decode cost is constant-per-token instead of scaling with KV cache size.
Constraints
max_model_len = 4096. This is the painful one. The KV cache is what’s left over after model + graphs + compute buffers. At 32K context you’d need ~4 GB more KV space which doesn’t fit. Options to push it:
- Wait for vLLM PR #40082 (FlashInfer b12x MoE + FP4 GEMM for SM120/121) to land in nightly — should reduce per-layer overhead and free a few hundred MB
- Try kv_cache_dtype=fp8 to halve KV memory (fp8 was already the default on the Gemma 4 path)
- Accept the 4K limit for use cases that don’t need long context
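To see why fp8 KV is worth trying, here is the standard KV cache sizing formula as a sketch. The layer count, KV head count, and head dimension below are hypothetical placeholders, NOT values read from this model's config, so treat the absolute numbers as illustrative. Only the attention layers are counted; the Mamba layers keep a fixed-size state that this sketch does not model.

```python
def kv_cache_bytes(tokens, attn_layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes needed to hold K and V for `tokens` positions across all attention layers."""
    # Factor 2: one K tensor and one V tensor per attention layer.
    return 2 * attn_layers * kv_heads * head_dim * bytes_per_elem * tokens


# HYPOTHETICAL dimensions for illustration only (not this model's real config).
ATTN_LAYERS, KV_HEADS, HEAD_DIM = 24, 8, 128

for context in (4096, 32768):
    bf16 = kv_cache_bytes(context, ATTN_LAYERS, KV_HEADS, HEAD_DIM, 2)
    fp8 = kv_cache_bytes(context, ATTN_LAYERS, KV_HEADS, HEAD_DIM, 1)
    print(f"{context:>6} tokens: bf16 {bf16 / 2**20:.0f} MiB, fp8 {fp8 / 2**20:.0f} MiB")
```

The formula is linear in both context length and bytes per element, so halving the element size buys exactly twice the context for the same memory.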
For my use case — Hermes Agent calls, code completion, single-shot Q&A — 4K is enough most of the time. Long PDFs and multi-turn coding sessions need a different path.
max_num_seqs = 1. Single-stream. For multi-user / concurrent agentic workflows you want more, but on 24 GB with this model we can’t afford more KV cache. Same constraint as max_model_len.
Warmup: Run 1 = 23 t/s. CUDA graph capture + JIT compilation + first-pass kernel autotune. Subsequent runs hit 165 immediately. For one-shot evals the first response is slow; for sustained use, no impact.
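If you want to reproduce the bench and keep the warmup run out of your average, a minimal client sketch looks like this. It assumes the vLLM OpenAI-compatible server is reachable at a base URL such as http://localhost:8000 (an assumption about your deployment, not something from my chart) and that the response carries a usage.completion_tokens field. The function names are mine.

```python
import json
import time
import urllib.request


def build_payload(model, prompt):
    """Request body with the sampling settings used in this post."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": 2000,
        "temperature": 0.6,
        "top_k": 20,
        "min_p": 0.0,
    }


def timed_completion(base_url, payload):
    """Send one completion request and return generated tokens per second."""
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    elapsed = time.perf_counter() - start
    return body["usage"]["completion_tokens"] / elapsed


def bench(send, runs=10, warmup=1):
    """Call `send` `runs` times; return (warmup results, steady-state results)."""
    results = [send() for _ in range(runs)]
    return results[:warmup], results[warmup:]
```

Usage would be something like `bench(lambda: timed_completion(base_url, payload))`, then averaging only the second list. Note that this measures wall-clock throughput including prefill, which is close enough for 2000-token completions on short prompts.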
NVFP4 quality. vLLM warns “Detected ModelOpt NVFP4 checkpoint. Please note that the format is experimental and could change in future.” For Kimi K2.5 / K2.6 NVFP4 (the only other NVFP4 models out so far), NVIDIA’s published MMLU divergences were within 1% of the INT4 baseline. For this 30B I haven’t done quality eval yet — needs an MMLU / HumanEval sweep before recommending for production.
What I’m shipping next
Working on packaging this as nemotronlabselastic30bnvfp4one (or whatever shorter name I can fit) in the Olares market source. Once the chart is up, it’ll be a one-click install: pull the image, download the weights, boot with the compilation-config above.
Two things to watch:
- vLLM PR #40082: when it lands in nightly (probably tomorrow’s build), I’ll re-bench. If FlashInfer b12x cuts per-layer memory, we may be able to push max_model_len to 8K or 16K at the same throughput.
- The 8B variant of Nemotron-Labs Diffusion: much smaller (~3 GB), would fit with the entire 262K KV cache enabled. Different model class (diffusion not transformer for prose) but worth a side-by-side.
If you’re on consumer Blackwell (5090 desktop, 5090M mobile, 5080) and want to try this, the compilation-config trick is the key. Default vLLM args OOM. cudagraph_mode: 1 + restricted capture sizes is the unlock. Same recipe works on any NVFP4 model where the issue is graph capture eating too much VRAM.
Hardware: Olares One — RTX 5090M Laptop (24 GB GDDR7, sm_120 Blackwell consumer mobile), Intel Core Ultra 9 275HX 24-core, 96 GB DDR5. Software: vLLM v0.19.2rc1.dev107+g4eafc7292 nightly (2026-05-20 06:22 UTC). Bench prompt: Space Invaders HTML game completion, 2000 tokens, temp=0.6 top_k=20 min_p=0. Ten runs single-stream, single-user.