
# CUDA graphs

reflex serve --cuda-graphs captures the two ONNX Runtime sessions on the decomposed path (vlm_prefix + expert_denoise) into CUDA graphs at startup and replays them thereafter. Replay is 3-4× faster than eager execution because it skips the per-kernel launch overhead on every call: the whole session executes as a single pre-recorded graph.
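Reflex wires this up internally; for orientation only, enabling capture on a raw ONNX Runtime session looks roughly like the sketch below. The model filename, input/output names, and tensor shape here are placeholders, not the actual interface of a Reflex export.

```python
import numpy as np
import onnxruntime as ort

# CUDA-graph capture is a CUDAExecutionProvider option: ORT captures the
# graph on the first run and replays it on subsequent runs.
providers = [("CUDAExecutionProvider", {"enable_cuda_graph": "1"})]
sess = ort.InferenceSession("expert_denoise.onnx", providers=providers)

# Captured graphs require stable device addresses, so inputs and outputs
# must go through IOBinding rather than plain sess.run().
x = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((1, 32, 1024), dtype=np.float32), "cuda", 0
)
io = sess.io_binding()
io.bind_ortvalue_input("hidden_states", x)  # placeholder input name
io.bind_output("actions", "cuda")           # placeholder output name

sess.run_with_iobinding(io)  # first call: eager run + graph capture
sess.run_with_iobinding(io)  # subsequent calls: graph replay
```

Because the input buffer address is pinned by the binding, later calls update the data in place and replay the same graph; this is also why capture only works for static-shape exports.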

Tier-aware out of the box:

| GPU | vlm_prefix | expert_denoise | Combined effect |
| --- | --- | --- | --- |
| A100-40 / A100-80 | Captured | Captured | Full speedup on every /act |
| A10G (24 GB) | Eager (graceful degrade at init) | Captured | ~95% of total speedup, since expert fires 10× per /act vs vlm_prefix 1× per cache-miss |
| Jetson Orin Nano (8 GB) | Not validated | Not validated | Run without --cuda-graphs for now (research tier) |
| Jetson AGX Orin (64 GB) | Expected to capture | Expected to capture | Validate before production |

Measured on Modal 2026-04-25 against a SnapFlow pi0.5 decomposed export (Franka): A100 hit 4.44× on vlm_prefix + 3.03× on expert_denoise; A10G hit 3.76× on expert_denoise and fell back to eager on vlm_prefix.

```sh
reflex serve ./my-export/ --cuda-graphs
```

That’s the whole knob. Run, hit /act, observe latency drop. The first /act per session is slower than eager (graph capture cost, ~50-400 ms depending on session); every subsequent /act hits the fast-path replay.

What happens at startup:

  1. The server constructs the vlm_prefix ONNX session with enable_cuda_graph=1
  2. Probes capture by running one synthetic forward pass
  3. On success: wraps with CudaGraphWrapper, emits reflex_cuda_graph_captured_total{session=vlm_prefix}
  4. On capture failure (OOM, unsupported op): rebuilds the session without enable_cuda_graph, wraps with EagerSessionWrapper, emits reflex_cuda_graph_capture_failed_at_init_total{session=vlm_prefix,reason=<ExceptionClass>} ONCE, logs an INFO line. This is the A10G vlm_prefix path today.
  5. Same sequence for expert_denoise

Request handling proceeds transparently — the wrappers expose the same .run() API whether the session is captured or eager.
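The init sequence and the uniform .run() surface can be sketched in a few lines of Python. This is an illustration, not Reflex source: make_session, probe_input, and the metrics dict are stand-ins, and the real wrappers also emit the Prometheus metrics and log lines described above.

```python
class CudaGraphWrapper:
    """Session whose capture probe succeeded; .run() hits graph replay."""
    def __init__(self, session):
        self._session = session

    def run(self, inputs):
        return self._session.run(inputs)


class EagerSessionWrapper:
    """Session rebuilt without CUDA graphs; .run() executes eagerly."""
    def __init__(self, session):
        self._session = session

    def run(self, inputs):
        return self._session.run(inputs)


def init_session(make_session, probe_input, metrics, name):
    """Probe CUDA-graph capture once at init; fall back to eager for the
    lifetime of the process if the probe fails."""
    try:
        session = make_session(cuda_graph=True)
        session.run(probe_input)  # synthetic forward pass: capture happens here
        metrics[f"captured:{name}"] += 1
        return CudaGraphWrapper(session)
    except Exception as exc:  # OOM, unsupported op, ...
        metrics[f"capture_failed_at_init:{name}:{type(exc).__name__}"] += 1
        session = make_session(cuda_graph=False)  # rebuild without capture
        return EagerSessionWrapper(session)
```

Callers never branch on the wrapper type: both expose the same .run(), which is why request handling is unchanged whether a session captured or degraded at init.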

Cardinality is bounded. Labels: embodiment (franka / so100 / ur5 / custom) × session (vlm_prefix / expert_denoise) × reason (capture_failed / replay_failed / capture_failed_oom).
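As a sanity check on that bound, multiplying the label sets out gives the worst-case series count for a metric that carries all three labels:

```python
from itertools import product

embodiments = ["franka", "so100", "ur5", "custom"]
sessions = ["vlm_prefix", "expert_denoise"]
reasons = ["capture_failed", "replay_failed", "capture_failed_oom"]

# Worst case: a metric labeled by embodiment, session, and reason.
combos = list(product(embodiments, sessions, reasons))
print(len(combos))  # 4 × 2 × 3 = 24 series per metric
```

Most of the metrics below carry fewer labels, so 24 series is the ceiling, not the norm.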

| Metric | Type | Fires when |
| --- | --- | --- |
| reflex_cuda_graph_captured_total | Counter | First successful capture per session |
| reflex_cuda_graph_replayed_total | Counter | Every replay |
| reflex_cuda_graph_eager_fallback_total | Counter | In-request capture or replay raised |
| reflex_cuda_graph_capture_failed_at_init_total | Counter | Session-init probe failed → eager for process lifetime |
| reflex_cuda_graph_capture_seconds | Histogram | Capture wall-clock (first run) |
| reflex_cuda_graph_replay_seconds | Histogram | Replay wall-clock |

Distinguishing capture_failed_at_init from eager_fallback matters: an init failure is a hardware-tier signal (this process will NEVER capture this session), while a replay-time fallback is a per-request error signal.

**reflex_cuda_graph_capture_failed_at_init_total fires at startup on A10G.** Expected for session=vlm_prefix: A10G's CUDA-graph memory overhead (~128 MB reserved per captured graph) exceeds the BFCArena pre-allocated pool for the vlm_prefix model. The session runs eager instead; expert_denoise still captures and delivers ~95% of the total speedup. No customer action is needed.

**eager_fallback_total climbing during traffic.** A replay-time failure, which is rarer. Likely causes: CUDA context contention with another process on the same GPU, a changed input shape (impossible on a static-shape Reflex export, so file a bug), or GPU memory pressure from a concurrent workload.

**No speedup after enabling --cuda-graphs.** Your serve backend isn't on the decomposed dispatch path; the flag only applies to Pi05DecomposedInference dispatch. Check the startup log for: "--cuda-graphs was set but this backend does not consume the flag."

**Latency spike on the first request after startup.** Expected: this is graph-capture cost, ~100-400 ms on the first /act. Keep --prewarm enabled (the default) so capture runs at startup rather than on the first user request.

When to skip --cuda-graphs:

  • Not on the decomposed dispatch path (the flag is a no-op)
  • Hardware tier we haven’t validated (Orin Nano, Thor, custom Hopper/Blackwell). Baseline first, then enable + compare p99
  • You depend on torch.cuda.graph semantics — Reflex uses ORT-native capture (per ADR 2026-04-24-cuda-graphs-architecture)

Day 8-9 A/B on A100-80GB, N=200: vlm_prefix 1.07× mean + jitter 1.4× tighter; expert_denoise 1.47× mean + p99 -40% + jitter 4.1× tighter. Combined per-chunk pi0.5 num_steps=10: 270.85 → 207.74 ms = 1.30× speedup. See reflex_context/03_experiments/2026-04-29-cuda-graphs-ab-modal-a100.md.