Decomposed pi0.5 (9× speedup)
The default ONNX export of pi0 / pi0.5 is monolithic: VLM backbone + action expert + 10-step Euler denoise loop, all baked into one graph. It’s correct (cos = +1.0) but slow on edge GPUs because every /act re-runs the VLM even when the image hasn’t changed.
The decomposed export splits the model into two ONNX sessions:
- `vlm_prefix.onnx` — the vision-language backbone. Runs once per cache miss (i.e. once per new image / instruction).
- `expert_denoise.onnx` — the action expert + one Euler step. Runs 10× per `/act` (once per denoise step), reading the cached `past_kv` from the prefix.
Combined with KV cache reuse across denoise steps and across timesteps with the same prefix, this delivers a 9× speedup over monolithic on Jetson AGX Orin.
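The two-session control flow can be sketched as a plain Python loop. This is a toy simulation, not the Reflex runtime: the two `run_*` functions stand in for the ONNX Runtime sessions, and the velocity field is a dummy chosen so the Euler loop is easy to check.

```python
import numpy as np

def run_vlm_prefix(image, instruction):
    """Stand-in for the vlm_prefix.onnx session: produce the prefix KV cache."""
    return {"past_kv": None}  # placeholder; the real output is a stack of KV tensors

def run_expert_denoise(x, t, past_kv):
    """Stand-in for the expert_denoise.onnx session: velocity at denoise time t."""
    return -x  # toy velocity field so the loop below is checkable

def sample_actions(image, instruction, x0, num_steps=10):
    past_kv = run_vlm_prefix(image, instruction)  # runs once per /act
    x = x0.copy()                                 # noise-initialised action chunk
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = run_expert_denoise(x, t, past_kv)     # runs num_steps times
        x = x + dt * v                            # one explicit Euler step
    return x
```

With the dummy velocity field `v = -x`, each Euler step multiplies the state by `(1 - dt)`, so the loop structure (prefix once, expert 10×) is verifiable without any model weights.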
Why it works
Within a single /act call, the VLM prefix is computed once and the action expert runs 10 times against the same KV cache. The monolithic graph re-executes the VLM 10 times (once per denoise step). Most of that work is redundant.
Across consecutive /act calls, if the image and instruction are the same (the usual case during steady-state operation), the KV cache from the previous call is still valid. Reflex’s EpisodeCache keeps the prefix’s past_kv warm for the duration of an episode, so cache-hit calls skip the vlm_prefix.onnx step entirely.
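The cache-warm behaviour can be illustrated with a minimal sketch. This is not Reflex's `EpisodeCache` implementation — the real one handles episode reset, eviction, and device memory — but it shows the keying and skip-on-hit logic the paragraph describes:

```python
import hashlib

class EpisodeCacheSketch:
    """Toy prefix cache keyed on (image bytes, instruction).

    `compute_prefix` stands in for running the vlm_prefix.onnx session;
    on a key match we reuse the warm past_kv and skip that session entirely.
    """
    def __init__(self, compute_prefix):
        self.compute_prefix = compute_prefix
        self.key = None
        self.past_kv = None
        self.misses = 0

    def get(self, image_bytes, instruction):
        key = hashlib.sha256(image_bytes + instruction.encode()).hexdigest()
        if key != self.key:  # cache miss: new image or instruction
            self.past_kv = self.compute_prefix(image_bytes, instruction)
            self.key = key
            self.misses += 1
        return self.past_kv  # cache hit: warm past_kv, no VLM run
```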
| Call type | Monolithic | Decomposed |
|---|---|---|
| Cache miss (new image) | VLM + 10 × expert | VLM + 10 × expert |
| Cache hit (same image) | VLM + 10 × expert (redundant!) | 10 × expert only |
Steady-state robotics traffic is mostly cache hits: once a task is underway, the prefix changes only when the camera frame meaningfully shifts. The decomposed path exploits this.
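The table above reduces to simple arithmetic. The per-call costs below are illustrative made-up numbers, not measured Reflex latencies; the point is that on cache hits the decomposed path drops the dominant prefix term, which is where the measured 9× on AGX Orin comes from once traffic is hit-dominated:

```python
# Hypothetical cost model; P and E are illustrative, not measured.
P = 90.0  # ms, cost of one VLM prefix run
E = 1.0   # ms, cost of one expert denoise step

def act_cost(decomposed, cache_hit, num_steps=10):
    """Cost of one /act call under the table above."""
    prefix = 0.0 if (decomposed and cache_hit) else P
    return prefix + num_steps * E

# On a cache hit the decomposed path skips the prefix entirely:
speedup_hit = act_cost(decomposed=False, cache_hit=True) / act_cost(decomposed=True, cache_hit=True)
```

With these numbers the cache-hit speedup is (P + 10E) / 10E = 10×; the effective end-to-end speedup depends on the hit rate.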
Why pi0.5 specifically
pi0.5 is the most pi-Flow-like of the four supported VLAs:
- 3.62B parameters
- PaliGemma-2-3B backbone (Gemma-2-2B + SigLIP-So400M)
- AdaRMSNorm time conditioning (faster to capture in CUDA graphs than DynamicCache flavors)
- Bake-able 10-step Euler loop (the action expert is the same per step; only `t` and the velocity field shift)
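To make the AdaRMSNorm bullet concrete, here is a minimal numpy sketch of the pattern (the layout and parameter names are assumptions, not the pi0.5 implementation): RMS-normalize the hidden state, then apply a gain shifted by a timestep-conditioned scale. Because this carries no data-dependent cache state, it replays cleanly inside a captured CUDA graph.

```python
import numpy as np

def ada_rms_norm(x, gamma, t_scale, eps=1e-6):
    """RMSNorm whose gain is modulated by a denoise-timestep embedding.

    x:       (..., d) hidden states
    gamma:   (d,) learned base gain
    t_scale: (d,) per-channel scale produced from the timestep t
    """
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * (gamma * (1.0 + t_scale))
```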
SmolVLA is small enough that the monolithic path is fine on Orin Nano. GR00T uses DDPM diffusion (4 steps) with a different KV layout — decomposing it gains less. pi0 has the same architecture as pi0.5 and benefits similarly from decomposition.
Use it
```
reflex export lerobot/pi05_base --output ./pi05 --mode decomposed
reflex serve ./pi05/ --embodiment franka --cuda-graphs
```

The serve flag `--cuda-graphs` composes naturally: capture the two ONNX sessions into CUDA graphs at startup, then replay them thereafter for another 1.3-1.5× on top of the decomposition win.
Output structure:
```
./pi05/
├── vlm_prefix.onnx          # VLM backbone (~1 GB)
├── vlm_prefix.onnx.data
├── expert_denoise.onnx      # Action expert (~250 MB)
├── expert_denoise.onnx.data
├── reflex_config.json       # decomposed=true, cache_dim, etc.
└── (optional) *.trt         # TRT engines built by trtexec at export
```

Numerical parity
Decomposed pi0.5 still matches monolithic PyTorch at machine precision:
- Single-step ONNX vs `sample_actions(num_steps=1)`: max_abs 2.38 × 10⁻⁷
- 10-step ONNX vs `sample_actions(num_steps=10)`: max_abs 2.38 × 10⁻⁷
Verified end-to-end. See Verified parity for the full ledger.
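The parity numbers above are max-absolute-difference comparisons. A minimal version of that check (the function names are illustrative; the real harness compares ONNX Runtime outputs against `sample_actions`):

```python
import numpy as np

def max_abs_diff(a, b):
    """max |a - b|, the metric reported in the parity ledger."""
    return float(np.max(np.abs(np.asarray(a) - np.asarray(b))))

def assert_parity(onnx_actions, torch_actions, tol=1e-6):
    """Fail loudly if the exported model drifts past tol from the teacher."""
    diff = max_abs_diff(onnx_actions, torch_actions)
    assert diff <= tol, f"parity broken: max_abs {diff:.3e} > {tol:.0e}"
    return diff
```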
Memory fit
| Hardware | Memory | Decomposed pi0.5 fit |
|---|---|---|
| Orin Nano (8 GB) | tight | Doesn’t fit (12.5 GB monolithic; even decomposed is too large for 8 GB at FP32). Use SnapFlow distill student. |
| Orin AGX (32 GB / 64 GB) | comfortable | Fits at FP16 with headroom |
| Desktop NVIDIA (RTX 4090) | comfortable | Fits at FP16 with headroom |
| Cloud A10G (24 GB) | tight on vlm_prefix CUDA-graph capture | expert_denoise captures cleanly; vlm_prefix falls back to eager (still gets ~95% of the speedup) |
| Cloud A100 / H100 | comfortable | Both sessions capture |
Validated 2026-04-22
State-out distill retry: 14/15 = 93.3% on LIBERO, matching the pi0.5 PyTorch teacher baseline. Unlocks the prefix-cache moat in production.
See reflex_context/03_experiments/2026-04-22-v0.5-retry-stage3-libero-result.md.