# CUDA graphs
`reflex serve --cuda-graphs` captures the two ONNX Runtime sessions on the decomposed path (`vlm_prefix` + `expert_denoise`) into CUDA graphs at startup and replays them thereafter. Replay is 3-4× faster than eager execution because the GPU skips kernel-launch overhead on every call.
Tier-aware out of the box:
| GPU | `vlm_prefix` | `expert_denoise` | Combined effect |
|---|---|---|---|
| A100-40 / A100-80 | Captured | Captured | Full speedup on every `/act` |
| A10G (24 GB) | Eager (graceful degrade at init) | Captured | ~95% of total speedup, since the expert fires 10× per `/act` vs. `vlm_prefix` 1× per cache miss |
| Jetson Orin Nano (8 GB) | Not validated | Not validated | Run without `--cuda-graphs` for now (research tier) |
| Jetson AGX Orin (64 GB) | Expected to capture | Expected to capture | Validate before production |
Measured on Modal 2026-04-25 against a SnapFlow pi0.5 decomposed export (Franka): A100 hit 4.44× on `vlm_prefix` and 3.03× on `expert_denoise`; A10G hit 3.76× on `expert_denoise` and fell back to eager on `vlm_prefix`.
## Quick start

```sh
reflex serve ./my-export/ --cuda-graphs
```

That's the whole knob. Run it, hit `/act`, and observe the latency drop. The first `/act` per session is slower than eager (graph-capture cost, ~50-400 ms depending on the session); every subsequent `/act` hits the fast-path replay.
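To see the effect, time a few consecutive `/act` calls: the first pays the capture cost, the rest hit replay. A minimal probe sketch, assuming the server listens on `localhost:8000` and using a placeholder payload (both are assumptions; substitute your deployment's address and your export's actual observation schema):

```python
# Hypothetical /act latency probe. The port and the payload shape are
# placeholders, not the real Reflex request schema -- adapt both.
import time

import requests

URL = "http://localhost:8000/act"  # assumption: default local serve address
payload = {"observation": {"image": "<base64>", "state": [0.0] * 7}}  # placeholder

for i in range(5):
    t0 = time.perf_counter()
    requests.post(URL, json=payload, timeout=30).raise_for_status()
    print(f"/act #{i + 1}: {(time.perf_counter() - t0) * 1000:.1f} ms")
# Expect request #1 to be slow (capture), #2 onward to drop to replay latency.
```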
## What happens at startup

- Server constructs the `vlm_prefix` ONNX session with `enable_cuda_graph=1`
- Probes capture by running one synthetic forward pass
- On success: wraps with `CudaGraphWrapper`, emits `reflex_cuda_graph_captured_total{session=vlm_prefix}`
- On capture failure (OOM, unsupported op): rebuilds the session without `enable_cuda_graph`, wraps with `EagerSessionWrapper`, emits `reflex_cuda_graph_capture_failed_at_init_total{session=vlm_prefix,reason=<ExceptionClass>}` once, and logs an INFO line. This is the A10G `vlm_prefix` path today.
- Same sequence for `expert_denoise`
Request handling proceeds transparently: the wrappers expose the same `.run()` API whether the session is captured or eager.
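For intuition, here is a minimal sketch of that sequence using ONNX Runtime's public `enable_cuda_graph` CUDA EP option. The wrapper class names mirror the ones above, but their bodies are illustrative, not Reflex's implementation; real ORT CUDA-graph replay additionally requires I/O binding to fixed device buffers, which is omitted here for brevity:

```python
import numpy as np
import onnxruntime as ort


class CudaGraphWrapper:
    """Captured-graph fast path (simplified stand-in)."""
    def __init__(self, sess):
        self.sess = sess

    def run(self, feeds):
        return self.sess.run(None, feeds)


class EagerSessionWrapper:
    """Plain eager path, same .run() surface."""
    def __init__(self, sess):
        self.sess = sess

    def run(self, feeds):
        return self.sess.run(None, feeds)


def synthetic_feeds(sess):
    # Static-shape export: every input dim is a concrete int, so zero-filled
    # arrays work as a probe. dtype assumed float32 for illustration.
    return {i.name: np.zeros(i.shape, dtype=np.float32) for i in sess.get_inputs()}


def build_session(model_path: str, name: str):
    try:
        sess = ort.InferenceSession(
            model_path,
            providers=[("CUDAExecutionProvider", {"enable_cuda_graph": "1"})],
        )
        sess.run(None, synthetic_feeds(sess))  # probe forward pass triggers capture
        # emit reflex_cuda_graph_captured_total{session=<name>} here
        return CudaGraphWrapper(sess)
    except Exception as exc:  # OOM, unsupported op, ...
        # emit reflex_cuda_graph_capture_failed_at_init_total once, log INFO
        print(f"INFO: {name}: capture failed ({type(exc).__name__}); running eager")
        sess = ort.InferenceSession(model_path, providers=["CUDAExecutionProvider"])
        return EagerSessionWrapper(sess)
```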
## Metrics

Cardinality is bounded. Labels: `embodiment` (franka / so100 / ur5 / custom) × `session` (vlm_prefix / expert_denoise) × `reason` (capture_failed / replay_failed / capture_failed_oom).
| Metric | Type | Fires when |
|---|---|---|
| `reflex_cuda_graph_captured_total` | Counter | First successful capture per session |
| `reflex_cuda_graph_replayed_total` | Counter | Every replay |
| `reflex_cuda_graph_eager_fallback_total` | Counter | In-request capture or replay raised |
| `reflex_cuda_graph_capture_failed_at_init_total` | Counter | Session-init probe failed → eager for process lifetime |
| `reflex_cuda_graph_capture_seconds` | Histogram | Capture wall-clock (first run) |
| `reflex_cuda_graph_replay_seconds` | Histogram | Replay wall-clock |
Distinguishing `capture_failed_at_init` from `eager_fallback` matters: the init failure is a hardware-tier signal (this process will *never* capture this session), while the replay-time fallback is a per-request error signal.
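In code, the difference is where the fallback lives: the init failure rebuilds the session once and the process stays eager, while a replay failure is handled per request. A sketch of one plausible shape for the request-time path, assuming `prometheus_client` and a standby eager session (both assumptions; the counter name and the `replay_failed` reason value come from the table above):

```python
from prometheus_client import Counter

# Counter name and label set from the metrics table above; whether Reflex
# keeps a spare eager session for fallback is an assumption of this sketch.
EAGER_FALLBACK = Counter(
    "reflex_cuda_graph_eager_fallback_total",
    "In-request capture or replay raised",
    ["embodiment", "session", "reason"],
)


class CudaGraphWrapper:
    def __init__(self, graph_sess, eager_sess, embodiment, session_name):
        self.graph_sess = graph_sess  # captured-graph session (fast path)
        self.eager_sess = eager_sess  # hypothetical standby eager session
        self.embodiment = embodiment
        self.session_name = session_name

    def run(self, feeds):
        try:
            return self.graph_sess.run(None, feeds)  # graph replay
        except Exception:
            # Per-request signal: context contention, memory pressure, ...
            EAGER_FALLBACK.labels(
                self.embodiment, self.session_name, "replay_failed"
            ).inc()
            return self.eager_sess.run(None, feeds)  # serve this request eager
```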
## Troubleshooting

**`reflex_cuda_graph_capture_failed_at_init_total` fires at startup on A10G.** Expected for `session=vlm_prefix`. A10G's CUDA-graph memory overhead (~128 MB reserved per captured graph) exceeds the BFCArena pre-allocated pool for the `vlm_prefix` model. The session runs eager instead; `expert_denoise` still captures and delivers ~95% of the total speedup. No customer action needed.

**`eager_fallback_total` climbing during traffic.** A replay-time failure, which is rarer. Likely causes: CUDA context contention with another process on the same GPU, a changed input shape (impossible on a static-shape Reflex export, so file a bug), or GPU memory pressure from a concurrent workload.

**No speedup after enabling `--cuda-graphs`.** Your serve backend isn't on the decomposed dispatch path. The flag applies to `Pi05DecomposedInference` dispatch. Check the startup log for: `--cuda-graphs was set but this backend does not consume the flag.`

**Latency spike on the first request after startup.** Expected: capture cost is ~100-400 ms on the first `/act`. Keep `--prewarm` enabled (the default) so capture runs at startup, not on the first user request.
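When triaging, a quick check is to scrape the metrics endpoint and compare the two failure counters. A hedged helper, assuming a Prometheus text-format `/metrics` endpoint on the serve port (the path and port are assumptions about your deployment):

```python
import re

import requests


def failure_counts(base="http://localhost:8000"):
    # Scrape Prometheus text format and pull the two failure counters.
    text = requests.get(f"{base}/metrics", timeout=5).text
    out = {}
    for metric in (
        "reflex_cuda_graph_capture_failed_at_init_total",
        "reflex_cuda_graph_eager_fallback_total",
    ):
        out[metric] = re.findall(rf"^{metric}{{.*}} (\S+)$", text, re.M)
    return out


# capture_failed_at_init > 0, eager_fallback flat -> hardware-tier limit (fine)
# eager_fallback climbing with traffic            -> per-request problem (investigate)
print(failure_counts())
```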
## When NOT to enable

- Not on the decomposed dispatch path (no-op)
- A hardware tier we haven't validated (Orin Nano, Thor, custom Hopper/Blackwell): baseline first, then enable and compare p99
- You depend on `torch.cuda.graph` semantics; Reflex uses ORT-native capture (per ADR 2026-04-24-cuda-graphs-architecture)
## Validation

Day 8-9 A/B on A100-80GB, N=200: `vlm_prefix` 1.07× mean with jitter 1.4× tighter; `expert_denoise` 1.47× mean, p99 −40%, jitter 4.1× tighter. Combined per-chunk pi0.5 `num_steps=10`: 270.85 ms → 207.74 ms, a 1.30× speedup. See `reflex_context/03_experiments/2026-04-29-cuda-graphs-ab-modal-a100.md`.