# export — ONNX with parity
`reflex export <hf_id>` is the export front door. It dispatches to a model-specific exporter, runs the export under a `torch.export` trace (with the patches each architecture needs), validates numerical parity against the PyTorch reference, and writes ONNX + config to the output directory.
## Quick start

```sh
# Smolest, easiest first try
reflex export lerobot/smolvla_base --target orin-nano --output ./smolvla

# pi0
reflex export lerobot/pi0_base --target orin-nano --output ./pi0

# pi0.5 — newer pi0 with AdaRMSNorm time conditioning
reflex export lerobot/pi05_base --target orin-nano --output ./pi05

# GR00T N1.6
reflex export nvidia/GR00T-N1.6-3B --target orin-nano --output ./groot
```

Each command:
- Downloads the checkpoint from HuggingFace (cached after first run)
- Runs the model-specific exporter
- Writes ONNX + `reflex_config.json` to `--output`
- Validates the ONNX against PyTorch (`max_diff < 1e-5`)
- If `trtexec` is available, builds and caches a TensorRT engine
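Schematically, those steps compose into one gated pipeline. The sketch below is illustrative only; every function name is a hypothetical stand-in, not the tool's actual internals:

```python
from pathlib import Path

def export_pipeline(hf_id, output, download, export_onnx, check_parity,
                    build_trt=None, threshold=1e-5):
    """Illustrative sketch of the export flow described above.

    The callables are stand-ins for the real steps: `download` fetches the
    checkpoint, `export_onnx` runs the model-specific exporter,
    `check_parity` returns the max abs diff vs. the PyTorch reference, and
    `build_trt` (optional) builds a TensorRT engine when trtexec exists.
    """
    ckpt = download(hf_id)                       # cached after first run
    onnx_path = export_onnx(ckpt, Path(output))  # model-specific exporter
    max_diff = check_parity(ckpt, onnx_path)
    if max_diff >= threshold:                    # parity gates success
        raise RuntimeError(f"parity failed: max_diff={max_diff:.2e}")
    if build_trt is not None:                    # only if trtexec is around
        build_trt(onnx_path)
    return onnx_path
```

The point of the shape: parity is checked before anything is declared successful, and the TensorRT build is strictly optional.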
## Output structure

```
./pi0/
├── expert_stack.onnx       # the graph (~1.25 MB metadata)
├── expert_stack.onnx.data  # the weights (~1.3 GB for pi0)
├── reflex_config.json      # model meta — used by serve
└── expert_stack.trt        # TRT engine (only if trtexec was available)
```

## Two export modes
### Monolithic (default)

The whole model — VLM backbone, action expert, denoise loop — exported as a single ONNX graph. The 10-step Euler integration is unrolled into the graph itself. This is the production default for all four supported VLAs.
### Decomposed

Splits the model into `vlm_prefix` + `expert_denoise`. The VLM (large, runs once per cache miss) and the action expert (small, runs 10× per call) become separate ONNX sessions. Combined with KV cache reuse across denoise steps, this delivers a 9× speedup over monolithic on pi0.5.
```sh
reflex export lerobot/pi05_base --output ./pi05 --mode decomposed
```

`--export-mode {auto,parallel,sequential}` controls how the two ONNX sessions are constructed during export. `auto` picks based on a VRAM probe; `parallel` runs both at once (faster, but needs 2 × model_vram + 1 GB); `sequential` runs them in series (slower, but works on lower-VRAM hosts).
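The `auto` probe can be sketched as a simple headroom check. Illustrative only: the function name and probe mechanics are assumptions; the 2 × model_vram + 1 GB rule comes from the description above.

```python
GiB = 1024 ** 3

def pick_export_mode(free_vram_bytes, model_vram_bytes):
    """Sketch of the `auto` decision (names are illustrative).

    `parallel` constructs both ONNX sessions at once and needs roughly
    twice the model's VRAM plus 1 GB of headroom; anything less falls
    back to `sequential`, which exports the sessions one after another.
    """
    needed = 2 * model_vram_bytes + 1 * GiB
    return "parallel" if free_vram_bytes >= needed else "sequential"
```

For a 7 GB model, a 24 GB host would get `parallel` and an 8 GB host would get `sequential` under this rule.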
## What was hard

Getting pi0 / pi0.5 to export at machine precision required three interacting patches under `torch.export`:

- `F.pad` for causal masks (the default trace produced incorrect mask shapes for pi0’s PaliGemma backbone)
- Frozen `DynamicLayer.update` (transformers 5.x’s KV cache dynamism doesn’t trace cleanly without it)
- Manually computing `past_kv.get_seq_length()` for mask assembly
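All three patches follow the same temporary-override pattern: swap the behavior in for the duration of the trace, then restore the original so nothing leaks into normal execution. A generic sketch of that pattern (not the exporter's actual helper):

```python
from contextlib import contextmanager

@contextmanager
def patched(obj, name, replacement):
    """Temporarily replace an attribute, restoring it on exit.

    The exporter's patches are applied in this style: the override is
    live only while the torch.export trace runs, and the finally block
    guarantees restoration even if the trace raises.
    """
    original = getattr(obj, name)
    setattr(obj, name, replacement)
    try:
        yield
    finally:
        setattr(obj, name, original)
```

Nesting three of these context managers gives the "interacting patches" their well-defined scope.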
GR00T’s simpler DiT graph (no DynamicCache, no PaliGemma masking) traces cleanly via `torch.onnx.export(opset=19)`. Details in `reflex_context/01_architecture/pi0_monolithic_wrap_pattern.md`.
## Parity verification

Every export runs a parity check before writing the output:

```
fixture_idx  max_abs_diff  mean_abs_diff  passed
0            3.21e-06      8.40e-07      PASS
1            2.98e-06      7.92e-07      PASS
...

Summary
max_abs_diff_across_all  3.21e-06
passed                   PASS
```

The threshold is 1e-4 by default, 1e-5 strict. All four supported VLAs pass at strict on their canonical paths. See Verified parity for the full ledger.
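The table reduces to a per-fixture diff plus one global worst-case gate. A schematic version, with flat float lists standing in for real tensors (this is not reflex's code):

```python
def parity_report(reference_outputs, onnx_outputs, threshold=1e-4):
    """Schematic of the per-fixture parity table shown above.

    Compares each PyTorch reference output against the corresponding
    ONNX output, then gates on the worst absolute difference seen
    across all fixtures.
    """
    rows = []
    for idx, (ref, out) in enumerate(zip(reference_outputs, onnx_outputs)):
        diffs = [abs(r - o) for r, o in zip(ref, out)]
        rows.append({
            "fixture_idx": idx,
            "max_abs_diff": max(diffs),
            "mean_abs_diff": sum(diffs) / len(diffs),
            "passed": max(diffs) < threshold,
        })
    worst = max(r["max_abs_diff"] for r in rows)
    return rows, worst, worst < threshold
```

The same fixtures can be rerun with `threshold=1e-5` for the strict gate.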
## Targets

| Target | Hardware | Precision |
|---|---|---|
| `orin-nano` | Jetson Orin Nano | fp16 |
| `orin` | Jetson Orin (32 GB) | fp16 |
| `orin-64` | Jetson Orin 64 | fp16 |
| `thor` | Jetson Thor | fp8 |
| `desktop` | RTX / A100 | fp16 |
| `cpu` | Apple Silicon, x86_64 | fp32 |
`reflex inspect targets` lists the current profiles and shows which models support each.
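As a sketch, the table above could back a profile resolver like this (the dict layout and function are illustrative, not reflex's internals):

```python
# Target profiles as listed in the table above (illustrative layout).
TARGET_PROFILES = {
    "orin-nano": {"hardware": "Jetson Orin Nano", "precision": "fp16"},
    "orin":      {"hardware": "Jetson Orin (32 GB)", "precision": "fp16"},
    "orin-64":   {"hardware": "Jetson Orin 64", "precision": "fp16"},
    "thor":      {"hardware": "Jetson Thor", "precision": "fp8"},
    "desktop":   {"hardware": "RTX / A100", "precision": "fp16"},
    "cpu":       {"hardware": "Apple Silicon, x86_64", "precision": "fp32"},
}

def resolve_target(name):
    """Look up a --target profile, failing loudly on unknown names."""
    try:
        return TARGET_PROFILES[name]
    except KeyError:
        known = ", ".join(sorted(TARGET_PROFILES))
        raise ValueError(f"unknown target {name!r}; known targets: {known}")
```

Keeping precision in the profile is what lets the same export command produce fp16 on Orin, fp8 on Thor, and fp32 on CPU.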
## Failure modes

If the export fails, the tool exits non-zero with a remediation hint. Common cases:

- `Where Cast missing` — the post-export Where Cast wasn’t inserted. Re-run with `--patch-where-cast`. (v0.5+ does this automatically.)
- `InsufficientVRAMError` during decomposed parallel export — drop `--export-mode parallel`, or use a higher-VRAM host. The error is fail-loud by design (no silent fallback to sequential).
- `max_abs_diff > 1e-4` — file a bug. This means PyTorch and ONNX disagree, which is the deployment failure mode the tool exists to prevent.
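The fail-loud VRAM behavior can be sketched as an explicit precondition check. Names here are illustrative; the 2 × model_vram + 1 GB requirement comes from the export-mode description above:

```python
GiB = 1024 ** 3

class InsufficientVRAMError(RuntimeError):
    """Raised instead of silently falling back to sequential export."""

def require_vram_for_parallel(free_vram_bytes, model_vram_bytes):
    """Sketch of the fail-loud precondition described above.

    Parallel export needs roughly 2 x model VRAM + 1 GB of headroom;
    if that is not available, raise with a remediation hint rather
    than quietly degrading to a slower mode the user did not ask for.
    """
    needed = 2 * model_vram_bytes + 1 * GiB
    if free_vram_bytes < needed:
        raise InsufficientVRAMError(
            f"parallel export needs ~{needed / GiB:.1f} GiB, "
            f"have {free_vram_bytes / GiB:.1f} GiB; "
            "drop --export-mode parallel or use a higher-VRAM host"
        )
```

Failing loudly keeps a requested `--export-mode parallel` from silently becoming a much slower sequential run.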