Decomposed pi0.5 (9× speedup)

The default ONNX export of pi0 / pi0.5 is monolithic: VLM backbone + action expert + 10-step Euler denoise loop, all baked into one graph. It’s correct (cos = +1.0) but slow on edge GPUs because every /act re-runs the VLM even when the image hasn’t changed.

The decomposed export splits the model into two ONNX sessions:

  • vlm_prefix.onnx — the vision-language backbone. Runs once per cache miss (i.e. once per new image / instruction).
  • expert_denoise.onnx — the action expert + one Euler step. Runs 10× per /act (once per denoise step), reading the cached past_kv from the prefix.

Combined with KV cache reuse across denoise steps and across timesteps that share a prefix, this delivers a 9× speedup over the monolithic export on Jetson AGX Orin.

Within a single /act call, the VLM prefix is computed once and the action expert runs 10 times against the same KV cache. The monolithic graph re-executes the VLM 10 times (once per denoise step). Most of that work is redundant.
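
In code, the decomposed call path looks roughly like this. It is a minimal sketch assuming onnxruntime sessions and made-up tensor names (image, tokens, x_t, t); the real names are recorded in reflex_config.json:

```python
import numpy as np
import onnxruntime as ort

# Hypothetical shapes for illustration; the real values come from the export.
HORIZON, ACTION_DIM, NUM_STEPS = 50, 32, 10

prefix = ort.InferenceSession("pi05/vlm_prefix.onnx", providers=["CUDAExecutionProvider"])
expert = ort.InferenceSession("pi05/expert_denoise.onnx", providers=["CUDAExecutionProvider"])
kv_names = [o.name for o in prefix.get_outputs()]  # cached past_kv tensors

def act(image, tokens, past_kv=None):
    # VLM backbone runs once per cache miss -- never once per denoise step.
    if past_kv is None:
        past_kv = prefix.run(None, {"image": image, "tokens": tokens})
    kv_feed = dict(zip(kv_names, past_kv))  # assumes expert inputs mirror prefix outputs

    # 10 Euler steps, each a single expert_denoise.onnx call against the same KV.
    x = np.random.randn(1, HORIZON, ACTION_DIM).astype(np.float32)  # noise init
    dt = 1.0 / NUM_STEPS
    for i in range(NUM_STEPS):
        t = np.full((1,), i * dt, dtype=np.float32)
        (v,) = expert.run(None, {"x_t": x, "t": t, **kv_feed})
        x = x + dt * v  # integrate the velocity field toward the action chunk
    return x, past_kv
```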

Across consecutive /act calls, if the image and instruction are the same (the usual case during steady-state operation), the KV cache from the previous call is still valid. Reflex’s EpisodeCache keeps the prefix’s past_kv warm for the duration of an episode, so cache-hit calls skip the vlm_prefix.onnx step entirely.
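
A sketch of the hit/miss rule, with an assumed content-hash key (Reflex's EpisodeCache is the real implementation; this only illustrates the idea):

```python
import hashlib

class PrefixCache:
    """Keeps the prefix past_kv warm for an episode (illustrative sketch)."""

    def __init__(self):
        self.key = None
        self.past_kv = None

    def lookup(self, image, tokens):
        # Hash the raw frame bytes plus instruction tokens; any change is a miss.
        key = hashlib.sha256(image.tobytes() + tokens.tobytes()).hexdigest()
        if key == self.key:
            return self.past_kv  # hit: skip vlm_prefix.onnx entirely
        self.key = key
        return None

    def store(self, past_kv):
        self.past_kv = past_kv
```

On a hit, /act reduces to 10 expert calls and zero VLM work, which is exactly the cache-hit row in the table below.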

| Call type | Monolithic | Decomposed |
| --- | --- | --- |
| Cache miss (new image) | VLM + 10 × expert | VLM + 10 × expert |
| Cache hit (same image) | VLM + 10 × expert (redundant!) | 10 × expert only |

Steady-state robotics traffic is mostly cache hits: the instruction changes once per task, and the image changes only when the camera frame meaningfully shifts. The decomposed path exploits this.

pi0.5 is the most pi-Flow-like of the four supported VLAs:

  • 3.62B parameters
  • PaliGemma-2-3B backbone (Gemma-2-2B + SigLIP-So400M)
  • AdaRMSNorm time conditioning (faster to capture in CUDA graphs than DynamicCache flavors; see the sketch after this list)
  • Bake-able 10-step Euler loop (the action expert is the same per step; only t and the velocity field shift)
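
The reason AdaRMSNorm captures so well: the timestep enters as a plain tensor that modulates the norm's gain, so the graph topology is identical at every denoise step. A rough sketch of the pattern (shapes and the w_scale projection name are assumptions, not pi0.5's exact parameterization):

```python
import numpy as np

def ada_rmsnorm(x, t_emb, gamma, w_scale, eps=1e-6):
    """RMSNorm whose gain is shifted by a projection of the timestep embedding.

    x:       (batch, seq, dim) activations
    t_emb:   (batch, t_dim) timestep embedding
    gamma:   (dim,) learned base gain
    w_scale: (t_dim, dim) learned projection (hypothetical name)
    """
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    scale = 1.0 + t_emb @ w_scale           # time-conditioned per-channel gain
    return (x / rms) * gamma * scale[:, None, :]
```

Every step reuses the same weights and shapes; only t_emb changes between steps, which is exactly the shape-stable pattern CUDA graphs want.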

SmolVLA is small enough that the monolithic path is fine on Orin Nano. GR00T uses DDPM diffusion (4 steps) with a different KV layout — decomposing it gains less. pi0 has the same architecture as pi0.5 and benefits similarly from decomposition.

Export and serve:

```sh
reflex export lerobot/pi05_base --output ./pi05 --mode decomposed
reflex serve ./pi05/ --embodiment franka --cuda-graphs
```

The --cuda-graphs serve flag composes naturally with decomposition: the two ONNX sessions are captured into CUDA graphs at startup and replayed thereafter, for another 1.3-1.5× on top of the decomposition win.
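
One way to get this behavior with stock ONNX Runtime is the CUDA execution provider's enable_cuda_graph option plus I/O binding to fixed device buffers. Whether Reflex uses this exact mechanism internally is an assumption, and the tensor names below are placeholders:

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "pi05/expert_denoise.onnx",
    providers=[("CUDAExecutionProvider", {"enable_cuda_graph": "1"})],
)

# Graph replay requires stable device addresses, so inputs and outputs are
# bound once to preallocated GPU buffers instead of being passed per call.
io = sess.io_binding()
x_t = ort.OrtValue.ortvalue_from_numpy(np.zeros((1, 50, 32), np.float32), "cuda", 0)
io.bind_ortvalue_input("x_t", x_t)   # placeholder input name
io.bind_output("velocity", "cuda")   # placeholder output name
# ... bind t and the past_kv buffers the same way ...

sess.run_with_iobinding(io)  # first run captures the CUDA graph
x_t.update_inplace(np.ones((1, 50, 32), np.float32))  # refresh inputs in place
sess.run_with_iobinding(io)  # subsequent runs replay the graph
```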

Output structure:

```
./pi05/
├── vlm_prefix.onnx           # VLM backbone (~1 GB)
├── vlm_prefix.onnx.data
├── expert_denoise.onnx       # Action expert (~250 MB)
├── expert_denoise.onnx.data
├── reflex_config.json        # decomposed=true, cache_dim, etc.
└── (optional) *.trt          # TRT engines built by trtexec at export
```

Decomposed pi0.5 still hits machine-precision parity with monolithic PyTorch:

  • Single-step ONNX vs sample_actions(num_steps=1): max_abs 2.38 × 10⁻⁷
  • 10-step ONNX vs sample_actions(num_steps=10): max_abs 2.38 × 10⁻⁷
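
Those numbers are plain elementwise comparisons between the ONNX actions and the PyTorch sample_actions output; a sketch of the check (the parity function is ours, not a Reflex API):

```python
import numpy as np

def parity(onnx_actions, torch_actions):
    a = np.asarray(onnx_actions, dtype=np.float64).ravel()
    b = np.asarray(torch_actions, dtype=np.float64).ravel()
    max_abs = float(np.max(np.abs(a - b)))
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max_abs, cos  # expect ~2.4e-7 and +1.0 per the ledger
```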

Verified end-to-end. See Verified parity for the full ledger.

| Hardware | Memory | Decomposed pi0.5 fit |
| --- | --- | --- |
| Orin Nano (8 GB) | tight | Doesn't fit (12.5 GB monolithic; even decomposed is too large for 8 GB at FP32). Use the SnapFlow distill student. |
| Orin AGX (32 GB / 64 GB) | comfortable | Fits at FP16 with headroom |
| Desktop NVIDIA (RTX 4090) | comfortable | Fits at FP16 with headroom |
| Cloud A10G (24 GB) | tight on vlm_prefix CUDA-graph capture | expert_denoise captures cleanly; vlm_prefix falls back to eager (still ~95% of the speedup) |
| Cloud A100 / H100 | comfortable | Both sessions capture |

State-out distill retry: 14/15 = 93.3% on LIBERO, matching the pi0.5 PyTorch teacher baseline. This unlocks the prefix-cache moat in production.

See reflex_context/03_experiments/2026-04-22-v0.5-retry-stage3-libero-result.md.