Aurélien AMSELLEM

166 t/s on Nemotron-Labs 30B-A3B NVFP4 — the new fastest LLM on Olares One, hidden behind one CUDA-graph flag

NVIDIA released Nemotron-Labs Elastic 30B-A3B with native NVFP4 quantization two weeks ago. On Olares One (RTX 5090M, consumer mobile Blackwell, sm_120, 24 GB), vLLM's default config OOMs at load. With one CUDA-graph flag set right (PIECEWISE mode and explicit cudagraph_capture_sizes [1, 2, 4]), the model boots and runs at 165.91 t/s. That's +22% over Gemma 4, +54% over BeeLlama on Qwen3.6 27B, +123% over my MTP-master build. New champion.

Two passes. First one OOM’d. Second one ran at 166 t/s. The difference was one CUDA-graph flag.

Here’s the story.

The model

NVIDIA dropped a quiet release on May 8th: nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-NVFP4. 4k+ downloads/week, very little community noise. The interesting bits:

This is the first NVIDIA-pushed model targeting consumer Blackwell that’s actually on HuggingFace. (Nemotron-3 Super from April is still gated.) Worth a test.

Pass 1 — vLLM defaults, OOM

image: vllm/vllm-openai:nightly
args:
  - nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-NVFP4
  - --max-model-len 32768
  - --gpu-memory-utilization 0.92
  - --trust-remote-code

Boot:

CUDA out of memory. Tried to allocate 316.00 MiB. GPU 0 has a total capacity
of 23.42 GiB of which 416.38 MiB is free. Including non-PyTorch memory, this
process has 23.13 GiB memory in use.

Math:

It doesn’t fit by a few hundred MB.

Workaround #1: --enforce-eager, which disables CUDA graphs entirely. Pod boots. Bench: 32.60 t/s (10 runs, σ ≈ 0.08, extremely stable). But eager mode pays a host-side kernel-launch cost on every decode step, so it runs far slower than a graph-captured path on decode-heavy workloads. This is the floor, not the ceiling.

Filed away. Moved on. (Then circled back when the user asked why eager and the answer “graphs OOM” felt like it should be solvable.)

Pass 2 — PIECEWISE graphs with restricted capture sizes

vLLM’s CLI doesn’t expose --cudagraph-mode directly. The setting lives inside --compilation-config, as a JSON blob:

args:
  - nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-NVFP4
  - --max-model-len 4096
  - --gpu-memory-utilization 0.95
  - --max-num-seqs 1
  - --compilation-config
  - '{"cudagraph_mode": 1, "max_cudagraph_capture_size": 4, "cudagraph_capture_sizes": [1, 2, 4]}'
  - --trust-remote-code
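If you generate the pod spec from a script, building the JSON blob in code avoids shell-quoting mistakes. A minimal sketch: `build_vllm_args` is a name I made up for illustration (it is not part of vLLM), and the flags are just the ones from the args above.

```python
import json


def build_vllm_args(model: str, capture_sizes: list[int]) -> list[str]:
    """Return an argv list mirroring the Pass 2 args (hypothetical helper)."""
    compilation_config = {
        "cudagraph_mode": 1,  # the value used above; the boot log reports it as PIECEWISE
        "max_cudagraph_capture_size": max(capture_sizes),
        "cudagraph_capture_sizes": sorted(capture_sizes),
    }
    return [
        model,
        "--max-model-len", "4096",
        "--gpu-memory-utilization", "0.95",
        "--max-num-seqs", "1",
        # The JSON blob is passed as a single argv element, so no extra quoting is needed.
        "--compilation-config", json.dumps(compilation_config),
        "--trust-remote-code",
    ]


args = build_vllm_args(
    "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-NVFP4", [1, 2, 4]
)
print(args)
```

Deriving max_cudagraph_capture_size from the list keeps the two fields from drifting apart when you experiment with other capture sizes.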

What changed:

Boot log:

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 3/3 [00:00<00:00, 8.30it/s]
CUDA graph pool memory: 0.04 GiB (actual), 0.05 GiB (estimated)

0.04 GiB. 40 megabytes. That’s the entire CUDA graph pool. Versus the default’s ~4 GiB. 100× reduction in graph memory, and vLLM still captures three useful sizes.

Pod loads cleanly. 23.13 GB used / 23.42 GB available. 290 MB margin.
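The memory numbers above are easy to re-derive from the boot log. A rough sketch: the regex only targets the one pool line quoted above (other vLLM versions may word it differently), and the 4 GiB default figure is the approximate one from this post, not a measured constant.

```python
import re

BOOT_LINE = "CUDA graph pool memory: 0.04 GiB (actual), 0.05 GiB (estimated)"


def parse_pool_gib(line: str) -> tuple[float, float]:
    """Extract (actual, estimated) graph pool size in GiB from the quoted log line."""
    match = re.search(
        r"CUDA graph pool memory: ([\d.]+) GiB \(actual\), ([\d.]+) GiB \(estimated\)",
        line,
    )
    if match is None:
        raise ValueError("not a CUDA graph pool line")
    return float(match.group(1)), float(match.group(2))


actual_gib, estimated_gib = parse_pool_gib(BOOT_LINE)

default_pool_gib = 4.0  # the "~4 GiB" default quoted in this post (approximate)
reduction = default_pool_gib / actual_gib

# Headroom between what the process uses and the card's total capacity.
headroom_mib = (23.42 - 23.13) * 1024

print(f"graph pool: {actual_gib * 1024:.0f} MiB")
print(f"reduction vs default: {reduction:.0f}x")
print(f"headroom: {headroom_mib:.0f} MiB")
```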

Bench

10 runs of Space Invaders HTML completion (2000 tokens each), temp=0.6 top_k=20 min_p=0.

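| Run | t/s | Wall time |
|---|---|---|
| 1 (graph capture / JIT warmup) | 23.28 | 85.89s |
| 2 | 166.62 | 12.00s |
| 3 | 166.46 | 12.02s |
| 4 | 165.91 | 12.05s |
| 5 | 165.46 | 12.09s |
| 6 | 165.31 | 12.10s |
| 7 | 165.25 | 12.10s |
| 8 | 165.40 | 12.09s |
| 9 | 166.41 | 12.02s |
| 10 | 166.39 | 12.02s |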

Post-warmup: 165.91 t/s AVG, σ ≈ 0.5 (range 165.25 – 166.62).

That’s 5.1× the eager-mode bench. Almost all the difference comes from CUDA graphs — same model weights, same backend kernels, same context window. The host-CPU roundtrip cost on small-batch decode is just that brutal without graphs.
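For anyone who wants to check the summary line, here is the arithmetic on the ten runs. A small sketch using only the numbers from the table; I use the population standard deviation, which is an assumption about how σ was computed.

```python
from statistics import mean, pstdev

# t/s for runs 1..10, copied from the bench table.
runs_tps = [23.28, 166.62, 166.46, 165.91, 165.46,
            165.31, 165.25, 165.40, 166.41, 166.39]

steady = runs_tps[1:]   # drop run 1 (graph capture / JIT warmup)
avg = mean(steady)
sigma = pstdev(steady)  # population std dev; an assumption, see above

eager_tps = 32.60       # the --enforce-eager bench from Pass 1
speedup = avg / eager_tps

print(f"post-warmup avg: {avg:.2f} t/s")
print(f"sigma: {sigma:.2f}")
print(f"range: {min(steady):.2f} to {max(steady):.2f}")
print(f"speedup vs eager: {speedup:.1f}x")
```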

The new ranking on Olares One

This is my full Olares One leaderboard as of tonight (single user, single-stream, 2000-token completions):

| Stack | t/s AVG | Model | Quant |
|---|---|---|---|
| Nemotron-Labs Elastic 30B-A3B NVFP4 + vLLM + PIECEWISE | 165.91 | NemotronH 30B-A3B | NVFP4 ModelOpt |
| Gemma 4 26B-A4B vLLM | 135.97 | Gemma 4 26B-A4B | AWQ-INT4 |
| BeeLlama Qwen3.6 27B + DFlash + turbo3 KV | 107.54 | Qwen3.6 27B dense | Q3_K_XL |
| llama.cpp MTP master ad27757 | 74.28 | Qwen3.6 27B dense | Q3_K_XL |
| Nemotron-Labs eager (no graphs) | 32.60 | NemotronH 30B-A3B | NVFP4 ModelOpt |

vs the prior Gemma 4 champion: +22%. vs my best Qwen3.6 path (BeeLlama): +54%. vs Qwen3.6 + MTP-master: +123%.
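The relative gains fall straight out of the leaderboard averages. A quick sketch with the t/s values read off the table; the dictionary keys are shortened labels of my own.

```python
champion_tps = 165.91  # Nemotron-Labs NVFP4 + vLLM + PIECEWISE

# t/s AVG for the other stacks, read off the leaderboard.
others = {
    "Gemma 4 26B-A4B vLLM": 135.97,
    "BeeLlama Qwen3.6 27B": 107.54,
    "llama.cpp MTP master": 74.28,
    "Nemotron-Labs eager": 32.60,
}


def gain_pct(new_tps: float, old_tps: float) -> float:
    """Percentage throughput gain of new_tps over old_tps."""
    return (new_tps / old_tps - 1.0) * 100.0


gains = {name: gain_pct(champion_tps, tps) for name, tps in others.items()}

for name, gain in gains.items():
    print(f"vs {name}: +{gain:.1f}%")
```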

For reference, the Reddit r/LocalLLaMA bench reports for the same model class on desktop hardware:

On paper, the 5090M Laptop has ~50% of the memory bandwidth of the desktop 5090. Getting within 10% of a desktop 5090 on a 30B model is something native NVFP4 kernels can do and quantized GGUF formats can't, because there's no dequantize step in the inference loop.

Why this works so well

Three things compound to give us +22% over the prior champion:

  1. Native NVFP4 on Blackwell tensor cores. AWQ-INT4 (Gemma 4 path) and Q3_K_XL (Qwen3.6 GGUF) both need a dequantize-and-multiply: take the low-bit weights, expand them to FP16/BF16, then run the GEMM. Blackwell’s FP4 tensor cores skip that. vLLM picks FlashInferCutlassNvFp4LinearKernel for the GEMM and FLASHINFER_CUTLASS for the MoE, both written specifically for the NVFP4 path. The same 30B model in BF16 wouldn’t fit in 24 GB at all.

  2. MoE-A3B routing on FlashInfer. 128 experts, 3B active per token. The active-parameter count is comparable to a dense 3B model, but you get the breadth of a 30B model when the router needs it. The FlashInfer CUTLASS MoE backend has been tuned for this exact pattern; the routing overhead is small (~5%) compared to the savings from not running all 30B params per token.

  3. Hybrid Mamba+Attention. Half the layers are state-space (Mamba), which cost O(1) per decoded token regardless of context length (O(n) over the whole sequence). At 4K context this matters less, but it means roughly half the decode cost is constant-per-token instead of scaling with KV cache size.
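The "wouldn't fit in BF16" point from item 1 is simple arithmetic. A back-of-the-envelope sketch under two assumptions of mine: a nominal 30e9 parameters, and every parameter stored in NVFP4 (4-bit values plus one 8-bit scale per 16-weight block, as I understand the format). A real checkpoint keeps some tensors at higher precision, so treat the NVFP4 figure as a lower bound.

```python
GIB = 1024 ** 3
VRAM_GIB = 23.42  # total capacity reported in the OOM message


def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Weight storage in GiB for n_params at the given average bit width."""
    return n_params * bits_per_param / 8 / GIB


N_PARAMS = 30e9          # nominal "30B"; the exact count is an assumption
BF16_BITS = 16
NVFP4_BITS = 4 + 8 / 16  # 4-bit value + one 8-bit scale shared by 16 weights

bf16_gib = weight_gib(N_PARAMS, BF16_BITS)
nvfp4_gib = weight_gib(N_PARAMS, NVFP4_BITS)

print(f"BF16 weights:  {bf16_gib:.1f} GiB (fits: {bf16_gib < VRAM_GIB})")
print(f"NVFP4 weights: {nvfp4_gib:.1f} GiB (fits: {nvfp4_gib < VRAM_GIB})")
```

Whatever the exact checkpoint size, the gap between the weights and 24 GB is what the KV cache, CUDA graphs and compute buffers have to share.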

Constraints

max_model_len = 4096. This is the painful one. The KV cache is what’s left over after model + graphs + compute buffers. At 32K context you’d need ~4 GB more KV space which doesn’t fit. Options to push it:

For my use case — Hermes Agent calls, code completion, single-shot Q&A — 4K is enough most of the time. Long PDFs and multi-turn coding sessions need a different path.

max_num_seqs = 1. Single-stream. For multi-user / concurrent agentic workflows you want more, but on 24 GB with this model we can’t afford more KV cache. Same constraint as max_model_len.
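Both limits come from the same place: the attention KV cache grows linearly with tokens × sequences. Here is the standard sizing formula as a sketch. The layer and head counts are made-up placeholders, not this model's real architecture, so only the ratios mean anything.

```python
def kv_cache_bytes(tokens: int, seqs: int, attn_layers: int,
                   kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """KV cache size: K and V tensors for every attention layer, token and sequence.

    Only attention layers count here; Mamba layers keep a fixed-size state.
    """
    return 2 * attn_layers * kv_heads * head_dim * dtype_bytes * tokens * seqs


# HYPOTHETICAL dimensions, for illustration only (not taken from the model card).
DIMS = dict(attn_layers=8, kv_heads=8, head_dim=128)

base = kv_cache_bytes(tokens=4096, seqs=1, **DIMS)
long_ctx = kv_cache_bytes(tokens=32768, seqs=1, **DIMS)
two_streams = kv_cache_bytes(tokens=4096, seqs=2, **DIMS)

print(f"32K context needs {long_ctx / base:.0f}x the 4K cache")
print(f"2 concurrent sequences need {two_streams / base:.0f}x the single-stream cache")
```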

Warmup: Run 1 = 23 t/s. CUDA graph capture + JIT compilation + first-pass kernel autotune. Subsequent runs hit 165 immediately. For one-shot evals the first response is slow; for sustained use, no impact.
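If you benchmark this yourself, keep the warmup run out of the average. A minimal timing harness, assuming you supply a `generate` callable that sends one request and returns the number of completion tokens; `fake_generate` below is a stand-in so the sketch runs without a server.

```python
import time
from typing import Callable


def bench(generate: Callable[[], int], runs: int = 10, warmup: int = 1) -> dict:
    """Time `generate` and report warmup runs separately from steady state."""
    if runs <= warmup:
        raise ValueError("need at least one run after warmup")
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate()  # must return the completion token count
        elapsed = time.perf_counter() - start
        rates.append(tokens / elapsed)
    steady = rates[warmup:]
    return {
        "warmup_tps": rates[:warmup],
        "steady_avg_tps": sum(steady) / len(steady),
    }


def fake_generate() -> int:
    """Placeholder for a real request to the server (hypothetical)."""
    time.sleep(0.001)
    return 2000


result = bench(fake_generate, runs=3, warmup=1)
print(result)
```

For a long-running service, sending one throwaway request right after boot has the same effect: the first real user never sees the slow run.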

NVFP4 quality. vLLM warns “Detected ModelOpt NVFP4 checkpoint. Please note that the format is experimental and could change in future.” For Kimi K2.5 / K2.6 NVFP4 (the only other NVFP4 models out so far), NVIDIA’s published MMLU divergences were within 1% of the INT4 baseline. For this 30B I haven’t done quality eval yet — needs an MMLU / HumanEval sweep before recommending for production.

What I’m shipping next

Working on packaging this as nemotronlabselastic30bnvfp4one (or whatever shorter name I can fit) in the Olares market source. Once the chart is up, it’ll be a one-click install: pull the image, download the weights, boot with the compilation-config above.

Two things to watch:

If you’re on consumer Blackwell (5090 desktop, 5090M mobile, 5080) and want to try this, the compilation-config trick is the key. Default vLLM args OOM. cudagraph_mode: 1 + restricted capture sizes is the unlock. Same recipe works on any NVFP4 model where the issue is graph capture eating too much VRAM.


Hardware: Olares One — RTX 5090M Laptop (24 GB GDDR7, sm_120 Blackwell consumer mobile), Intel Core Ultra 9 275HX 24-core, 96 GB DDR5. Software: vLLM v0.19.2rc1.dev107+g4eafc7292 nightly (2026-05-20 06:22 UTC). Bench prompt: Space Invaders HTML game completion, 2000 tokens, temp=0.6 top_k=20 min_p=0. Ten runs single-stream, single-user.
