
export — ONNX with parity

reflex export <hf_id> is the export front door. It dispatches to a model-specific exporter, runs the export under a torch.export trace (with the patches each architecture needs), validates numerical parity against the PyTorch reference, and writes ONNX + config to the output directory.

# Smolest, easiest first try
reflex export lerobot/smolvla_base --target orin-nano --output ./smolvla
# pi0
reflex export lerobot/pi0_base --target orin-nano --output ./pi0
# pi0.5 — newer pi0 with AdaRMSNorm time conditioning
reflex export lerobot/pi05_base --target orin-nano --output ./pi05
# GR00T N1.6
reflex export nvidia/GR00T-N1.6-3B --target orin-nano --output ./groot

Each command:

  1. Downloads the checkpoint from HuggingFace (cached after first run)
  2. Runs the model-specific exporter
  3. Writes ONNX + reflex_config.json to --output
  4. Validates the ONNX against PyTorch (max_abs_diff under the threshold: 1e-4 by default, 1e-5 in strict mode)
  5. If trtexec is available, builds and caches a TensorRT engine
For pi0, the output directory looks like this:

./pi0/
├── expert_stack.onnx       # the graph (~1.25 MB metadata)
├── expert_stack.onnx.data  # the weights (~1.3 GB for pi0)
├── reflex_config.json      # model meta — used by serve
└── expert_stack.trt        # TRT engine (only if trtexec was available)

Monolithic (default): the whole model — VLM backbone, action expert, denoise loop — is exported as a single ONNX graph. The 10-step Euler integration is unrolled into the graph itself (sketched below). This is the production default for all four supported VLAs.
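To make "unrolled into the graph" concrete, here is a minimal sketch of the wrapper shape. The class and the expert signature are illustrative assumptions, not the exporter's actual code; the point is that a fixed-count Python loop gets unrolled at trace time, so the exported graph contains ten copies of the expert forward.

```python
import torch


class DenoiseUnrolled(torch.nn.Module):
    # Illustrative wrapper: a fixed-count Euler loop that the trace unrolls,
    # baking num_steps copies of the expert forward into the exported graph.
    def __init__(self, expert: torch.nn.Module, num_steps: int = 10):
        super().__init__()
        self.expert = expert
        self.num_steps = num_steps
        self.dt = 1.0 / num_steps

    def forward(self, x_t: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        t = torch.zeros(())
        for _ in range(self.num_steps):    # plain Python loop: unrolled at trace time
            v = self.expert(x_t, t, cond)  # predicted velocity
            x_t = x_t + self.dt * v        # Euler step
            t = t + self.dt
        return x_t
```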

Decomposed: splits the model into vlm_prefix + expert_denoise. The VLM (large, runs once per cache miss) and the action expert (small, runs 10× per call) become separate ONNX sessions. Combined with KV cache reuse across denoise steps, this delivers a 9× speedup over monolithic on pi0.5. A serving sketch follows.
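A minimal sketch of how the two sessions interact at serve time. The session input/output names here are assumptions for illustration; serve derives the real ones from reflex_config.json.

```python
import numpy as np
import onnxruntime as ort


def run_decomposed(pixels: np.ndarray, input_ids: np.ndarray,
                   x_t: np.ndarray, num_steps: int = 10) -> np.ndarray:
    prefix = ort.InferenceSession("vlm_prefix.onnx")
    expert = ort.InferenceSession("expert_denoise.onnx")

    # The large VLM prefix runs once per cache miss...
    kv_names = [o.name for o in prefix.get_outputs()]
    kv_vals = prefix.run(None, {"pixel_values": pixels, "input_ids": input_ids})
    kv_feed = dict(zip(kv_names, kv_vals))

    # ...and the small expert runs num_steps times, reusing the cached KV tensors.
    for step in range(num_steps):
        t = np.array(step / num_steps, dtype=np.float32)
        (x_t,) = expert.run(None, {"x_t": x_t, "t": t, **kv_feed})
    return x_t
```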

reflex export lerobot/pi05_base --output ./pi05 --mode decomposed

--export-mode {auto,parallel,sequential} controls how the two ONNX sessions are constructed during export. auto picks based on a VRAM probe; parallel runs both at once (faster but needs 2 × model_vram + 1 GB); sequential runs them in series (slower but works on lower-VRAM hosts).
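For example, forcing the lower-VRAM path explicitly:

reflex export lerobot/pi05_base --output ./pi05 --mode decomposed --export-mode sequential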

Getting pi0 / pi0.5 to export at machine precision required three interacting patches under torch.export (sketched after the list):

  • F.pad for causal masks (the default trace produced incorrect mask shapes for pi0’s PaliGemma backbone)
  • Frozen DynamicLayer.update (transformers 5.x’s KV cache dynamism doesn’t trace cleanly without it)
  • Manually computing past_kv.get_seq_length() for mask assembly
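A minimal sketch of the patch-context pattern these fixes share. The function name build_causal_mask and the modeling module argument are hypothetical stand-ins for the real patch targets inside the transformers PaliGemma code; only the pattern (temporarily swapping in trace-safe implementations during the trace) is the point.

```python
import contextlib
from unittest import mock

import torch
import torch.nn.functional as F


def traceable_causal_mask(seq_len: int, kv_len: int) -> torch.Tensor:
    # Build the mask with F.pad instead of in-place slice assignment, so the
    # torch.export trace records shape-correct ops for the prefix tokens.
    causal = torch.tril(torch.ones(seq_len, seq_len))
    return F.pad(causal, (kv_len - seq_len, 0), value=1.0).bool()


@contextlib.contextmanager
def export_patches(modeling_module):
    # Hypothetical target name; the real patch set also freezes
    # DynamicLayer.update and feeds a precomputed past_kv.get_seq_length()
    # into mask assembly.
    with mock.patch.object(modeling_module, "build_causal_mask",
                           traceable_causal_mask):
        yield

# Usage: with export_patches(modeling_module):
#            exported = torch.export.export(wrapper, example_args)
```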

GR00T’s simpler DiT graph (no DynamicCache, no PaliGemma masking) traces cleanly via torch.onnx.export(opset=19). Details in reflex_context/01_architecture/pi0_monolithic_wrap_pattern.md.
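For reference, a minimal sketch of that call. The wrapper and tensor names are assumptions; the call itself is the stock torch.onnx.export API at opset 19, with no custom patches needed.

```python
import torch


def export_groot_dit(dit_wrapper: torch.nn.Module,
                     example_inputs: tuple[torch.Tensor, ...],
                     out_path: str = "expert_stack.onnx") -> None:
    # Hypothetical wrapper and I/O names; the DiT graph goes straight
    # through the standard exporter.
    torch.onnx.export(
        dit_wrapper,
        example_inputs,
        out_path,
        opset_version=19,
        input_names=["noisy_actions", "timesteps", "vlm_features"],
        output_names=["actions"],
    )
```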

Every export runs a parity check before writing the output:

fixture_idx   max_abs_diff   mean_abs_diff   passed
0             3.21e-06       8.40e-07        PASS
1             2.98e-06       7.92e-07        PASS
...

Summary
max_abs_diff_across_all   3.21e-06
passed                    PASS

The threshold is 1e-4 by default, 1e-5 strict. All four supported VLAs pass at strict on their canonical paths. See Verified parity for the full ledger.
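The check itself is simple. A minimal sketch, with the model and fixture plumbing as illustrative assumptions (the CLI runs this automatically before writing output):

```python
import numpy as np
import onnxruntime as ort
import torch


def check_parity(model: torch.nn.Module, onnx_path: str,
                 fixtures: list[dict[str, torch.Tensor]],
                 threshold: float = 1e-4) -> float:
    # Compare the PyTorch reference against the exported ONNX on each fixture.
    sess = ort.InferenceSession(onnx_path)
    worst = 0.0
    for fixture in fixtures:
        with torch.no_grad():
            ref = model(**fixture).cpu().numpy()
        (out,) = sess.run(None, {k: v.cpu().numpy() for k, v in fixture.items()})
        worst = max(worst, float(np.max(np.abs(ref - out))))
    assert worst < threshold, f"parity failure: max_abs_diff={worst:.2e}"
    return worst
```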

Target      Hardware                Precision
orin-nano   Jetson Orin Nano        fp16
orin        Jetson Orin (32 GB)     fp16
orin-64     Jetson Orin (64 GB)     fp16
thor        Jetson Thor             fp8
desktop     RTX / A100              fp16
cpu         Apple Silicon, x86_64   fp32
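The target selects the precision profile for the engine build; for example, an fp8 export for Thor:

reflex export nvidia/GR00T-N1.6-3B --target thor --output ./groot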

reflex inspect targets lists the current profiles and shows which models support each.

If the export fails, the tool exits non-zero with a remediation hint. Common cases:

  • Where Cast missing — the post-export Where→Cast insertion wasn’t applied. Re-run with --patch-where-cast. (v0.5+ does this automatically.)
  • InsufficientVRAMError during decomposed parallel export — drop --export-mode parallel, or use a higher-VRAM host. The error is fail-loud by design (no silent fallback to sequential).
  • max_abs_diff > 1e-4 — file a bug. This means PyTorch and ONNX disagree, which is the deployment failure mode the tool exists to prevent.