
# CUDA graphs

reflex serve --cuda-graphs captures the two ONNX Runtime sessions on the decomposed path (vlm_prefix + expert_denoise) into CUDA graphs at startup and replays them thereafter. Replay is 3-4× faster than eager execution because it skips the per-kernel launch overhead on every call: the whole session executes as a single pre-recorded graph.
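Reflex wires this up internally; for orientation only, enabling capture on a raw ONNX Runtime session looks roughly like the sketch below. The model filename, input/output names, and tensor shape here are placeholders, not the actual interface of a Reflex export.

```python
import numpy as np
import onnxruntime as ort

# CUDA-graph capture is a CUDAExecutionProvider option: ORT captures the
# graph on the first run and replays it on subsequent runs.
providers = [("CUDAExecutionProvider", {"enable_cuda_graph": "1"})]
sess = ort.InferenceSession("expert_denoise.onnx", providers=providers)

# Captured graphs require stable device addresses, so inputs and outputs
# must go through IOBinding rather than plain sess.run().
x = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((1, 32, 1024), dtype=np.float32), "cuda", 0
)
io = sess.io_binding()
io.bind_ortvalue_input("hidden_states", x)  # placeholder input name
io.bind_output("actions", "cuda")           # placeholder output name

sess.run_with_iobinding(io)  # first call: eager run + graph capture
sess.run_with_iobinding(io)  # subsequent calls: graph replay
```

Because the input buffer address is pinned by the binding, later calls update the data in place and replay the same graph; this is also why capture only works for static-shape exports.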

Tier-aware out of the box:

| GPU | vlm_prefix | expert_denoise | Combined effect |
| --- | --- | --- | --- |
| A100-40 / A100-80 | Captured | Captured | Full speedup on every /act |
| A10G (24 GB) | Eager (graceful degrade at init) | Captured | ~95% of total speedup, since expert fires 10× per /act vs vlm_prefix 1× per cache-miss |
| Jetson Orin Nano (8 GB) | Not validated | Not validated | Run without --cuda-graphs for now (research tier) |
| Jetson AGX Orin (64 GB) | Expected to capture | Expected to capture | Validate before production |

Measured on Modal 2026-04-25 against a SnapFlow pi0.5 decomposed export (Franka): A100 hit 4.44× on vlm_prefix + 3.03× on expert_denoise; A10G hit 3.76× on expert_denoise and fell back to eager on vlm_prefix.

```sh
reflex serve ./my-export/ --cuda-graphs
```

That’s the whole knob. Run, hit /act, observe latency drop. The first /act per session is slower than eager (graph capture cost, ~50-400 ms depending on session); every subsequent /act hits the fast-path replay.

What happens at startup:

  1. The server constructs the vlm_prefix ONNX session with enable_cuda_graph=1
  2. Probes capture by running one synthetic forward pass
  3. On success: wraps with CudaGraphWrapper, emits reflex_cuda_graph_captured_total{session=vlm_prefix}
  4. On capture failure (OOM, unsupported op): rebuilds the session without enable_cuda_graph, wraps with EagerSessionWrapper, emits reflex_cuda_graph_capture_failed_at_init_total{session=vlm_prefix,reason=<ExceptionClass>} ONCE, logs an INFO line. This is the A10G vlm_prefix path today.
  5. Same sequence for expert_denoise

Request handling proceeds transparently — the wrappers expose the same .run() API whether the session is captured or eager.
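The init sequence and the uniform .run() surface can be sketched in a few lines of Python. This is an illustration, not Reflex source: make_session, probe_input, and the metrics dict are stand-ins, and the real wrappers also emit the Prometheus metrics and log lines described above.

```python
class CudaGraphWrapper:
    """Session whose capture probe succeeded; .run() hits graph replay."""
    def __init__(self, session):
        self._session = session

    def run(self, inputs):
        return self._session.run(inputs)


class EagerSessionWrapper:
    """Session rebuilt without CUDA graphs; .run() executes eagerly."""
    def __init__(self, session):
        self._session = session

    def run(self, inputs):
        return self._session.run(inputs)


def init_session(make_session, probe_input, metrics, name):
    """Probe CUDA-graph capture once at init; fall back to eager for the
    lifetime of the process if the probe fails."""
    try:
        session = make_session(cuda_graph=True)
        session.run(probe_input)  # synthetic forward pass: capture happens here
        metrics[f"captured:{name}"] += 1
        return CudaGraphWrapper(session)
    except Exception as exc:  # OOM, unsupported op, ...
        metrics[f"capture_failed_at_init:{name}:{type(exc).__name__}"] += 1
        session = make_session(cuda_graph=False)  # rebuild without capture
        return EagerSessionWrapper(session)
```

Callers never branch on the wrapper type: both expose the same .run(), which is why request handling is unchanged whether a session captured or degraded at init.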

Cardinality is bounded. Labels: embodiment (franka / so100 / ur5 / custom) × session (vlm_prefix / expert_denoise) × reason (capture_failed / replay_failed / capture_failed_oom).
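As a sanity check on that bound, multiplying the label sets out gives the worst-case series count for a metric that carries all three labels:

```python
from itertools import product

embodiments = ["franka", "so100", "ur5", "custom"]
sessions = ["vlm_prefix", "expert_denoise"]
reasons = ["capture_failed", "replay_failed", "capture_failed_oom"]

# Worst case: a metric labeled by embodiment, session, and reason.
combos = list(product(embodiments, sessions, reasons))
print(len(combos))  # 4 × 2 × 3 = 24 series per metric
```

Most of the metrics below carry fewer labels, so 24 series is the ceiling, not the norm.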

| Metric | Type | Fires when |
| --- | --- | --- |
| reflex_cuda_graph_captured_total | Counter | First successful capture per session |
| reflex_cuda_graph_replayed_total | Counter | Every replay |
| reflex_cuda_graph_eager_fallback_total | Counter | In-request capture or replay raised |
| reflex_cuda_graph_capture_failed_at_init_total | Counter | Session-init probe failed → eager for process lifetime |
| reflex_cuda_graph_capture_seconds | Histogram | Capture wall-clock (first run) |
| reflex_cuda_graph_replay_seconds | Histogram | Replay wall-clock |

Distinguishing capture_failed_at_init from eager_fallback matters: an init failure is a hardware-tier signal (this process will NEVER capture this session), while a replay-time fallback is a per-request error signal.

**reflex_cuda_graph_capture_failed_at_init_total fires at startup on A10G.** Expected for session=vlm_prefix: A10G's CUDA-graph memory overhead (~128 MB reserved per captured graph) exceeds the BFCArena pre-allocated pool for the vlm_prefix model. The session runs eager instead; expert_denoise still captures and delivers ~95% of the total speedup. No customer action is needed.

**eager_fallback_total climbing during traffic.** A replay-time failure, which is rarer. Likely causes: CUDA context contention with another process on the same GPU, a changed input shape (impossible on a static-shape Reflex export, so file a bug), or GPU memory pressure from a concurrent workload.

**No speedup after enabling --cuda-graphs.** Your serve backend isn't on the decomposed dispatch path; the flag only applies to Pi05DecomposedInference dispatch. Check the startup log for: "--cuda-graphs was set but this backend does not consume the flag."

**Latency spike on the first request after startup.** Expected: this is graph-capture cost, ~100-400 ms on the first /act. Keep --prewarm enabled (the default) so capture runs at startup rather than on the first user request.

When to skip --cuda-graphs:

  • Not on the decomposed dispatch path (the flag is a no-op)
  • Hardware tier we haven’t validated (Orin Nano, Thor, custom Hopper/Blackwell). Baseline first, then enable + compare p99
  • You depend on torch.cuda.graph semantics — Reflex uses ORT-native capture (per ADR 2026-04-24-cuda-graphs-architecture)

Day 8-9 A/B on A100-80GB, N=200: vlm_prefix 1.07× mean + jitter 1.4× tighter; expert_denoise 1.47× mean + p99 -40% + jitter 4.1× tighter. Combined per-chunk pi0.5 num_steps=10: 270.85 → 207.74 ms = 1.30× speedup. See reflex_context/03_experiments/2026-04-29-cuda-graphs-ab-modal-a100.md.