
export — ONNX with parity

reflex export <hf_id> is the export front door. It dispatches to a model-specific exporter, runs the export under a torch.export trace (with the patches each architecture needs), validates numerical parity against the PyTorch reference, and writes ONNX + config to the output directory.

# Smolest, easiest first try
reflex export lerobot/smolvla_base --target orin-nano --output ./smolvla
# pi0
reflex export lerobot/pi0_base --target orin-nano --output ./pi0
# pi0.5 — newer pi0 with AdaRMSNorm time conditioning
reflex export lerobot/pi05_base --target orin-nano --output ./pi05
# GR00T N1.6
reflex export nvidia/GR00T-N1.6-3B --target orin-nano --output ./groot

Each command:

  1. Downloads the checkpoint from HuggingFace (cached after first run)
  2. Runs the model-specific exporter
  3. Writes ONNX + reflex_config.json to --output
  4. Validates the ONNX against PyTorch (max_abs_diff under the threshold: 1e-4 by default, 1e-5 in strict mode)
  5. If trtexec is available, builds and caches a TensorRT engine
For pi0, the output directory looks like this:

./pi0/
├── expert_stack.onnx       # the graph (~1.25 MB metadata)
├── expert_stack.onnx.data  # the weights (~1.3 GB for pi0)
├── reflex_config.json      # model meta — used by serve
└── expert_stack.trt        # TRT engine (only if trtexec was available)

Monolithic (default): the whole model — VLM backbone, action expert, denoise loop — is exported as a single ONNX graph. The 10-step Euler integration is unrolled into the graph itself (sketched below). This is the production default for all four supported VLAs.
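To make "unrolled into the graph" concrete, here is a minimal sketch of the wrapper shape. The class and the expert signature are illustrative assumptions, not the exporter's actual code; the point is that a fixed-count Python loop gets unrolled at trace time, so the exported graph contains ten copies of the expert forward.

```python
import torch


class DenoiseUnrolled(torch.nn.Module):
    # Illustrative wrapper: a fixed-count Euler loop that the trace unrolls,
    # baking num_steps copies of the expert forward into the exported graph.
    def __init__(self, expert: torch.nn.Module, num_steps: int = 10):
        super().__init__()
        self.expert = expert
        self.num_steps = num_steps
        self.dt = 1.0 / num_steps

    def forward(self, x_t: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        t = torch.zeros(())
        for _ in range(self.num_steps):    # plain Python loop: unrolled at trace time
            v = self.expert(x_t, t, cond)  # predicted velocity
            x_t = x_t + self.dt * v        # Euler step
            t = t + self.dt
        return x_t
```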

Decomposed: splits the model into vlm_prefix + expert_denoise. The VLM (large, runs once per cache miss) and the action expert (small, runs 10× per call) become separate ONNX sessions. Combined with KV cache reuse across denoise steps, this delivers a 9× speedup over monolithic on pi0.5. A serving sketch follows.
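A minimal sketch of how the two sessions interact at serve time. The session input/output names here are assumptions for illustration; serve derives the real ones from reflex_config.json.

```python
import numpy as np
import onnxruntime as ort


def run_decomposed(pixels: np.ndarray, input_ids: np.ndarray,
                   x_t: np.ndarray, num_steps: int = 10) -> np.ndarray:
    prefix = ort.InferenceSession("vlm_prefix.onnx")
    expert = ort.InferenceSession("expert_denoise.onnx")

    # The large VLM prefix runs once per cache miss...
    kv_names = [o.name for o in prefix.get_outputs()]
    kv_vals = prefix.run(None, {"pixel_values": pixels, "input_ids": input_ids})
    kv_feed = dict(zip(kv_names, kv_vals))

    # ...and the small expert runs num_steps times, reusing the cached KV tensors.
    for step in range(num_steps):
        t = np.array(step / num_steps, dtype=np.float32)
        (x_t,) = expert.run(None, {"x_t": x_t, "t": t, **kv_feed})
    return x_t
```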

reflex export lerobot/pi05_base --output ./pi05 --mode decomposed

--export-mode {auto,parallel,sequential} controls how the two ONNX sessions are constructed during export. auto picks based on a VRAM probe; parallel runs both at once (faster but needs 2 × model_vram + 1 GB); sequential runs them in series (slower but works on lower-VRAM hosts).
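For example, forcing the lower-VRAM path explicitly:

reflex export lerobot/pi05_base --output ./pi05 --mode decomposed --export-mode sequential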

Getting pi0 / pi0.5 to export at machine precision required three interacting patches under torch.export (sketched after the list):

  • F.pad for causal masks (the default trace produced incorrect mask shapes for pi0’s PaliGemma backbone)
  • Frozen DynamicLayer.update (transformers 5.x’s KV cache dynamism doesn’t trace cleanly without it)
  • Manually computing past_kv.get_seq_length() for mask assembly
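A minimal sketch of the patch-context pattern these fixes share. The function name build_causal_mask and the modeling module argument are hypothetical stand-ins for the real patch targets inside the transformers PaliGemma code; only the pattern (temporarily swapping in trace-safe implementations during the trace) is the point.

```python
import contextlib
from unittest import mock

import torch
import torch.nn.functional as F


def traceable_causal_mask(seq_len: int, kv_len: int) -> torch.Tensor:
    # Build the mask with F.pad instead of in-place slice assignment, so the
    # torch.export trace records shape-correct ops for the prefix tokens.
    causal = torch.tril(torch.ones(seq_len, seq_len))
    return F.pad(causal, (kv_len - seq_len, 0), value=1.0).bool()


@contextlib.contextmanager
def export_patches(modeling_module):
    # Hypothetical target name; the real patch set also freezes
    # DynamicLayer.update and feeds a precomputed past_kv.get_seq_length()
    # into mask assembly.
    with mock.patch.object(modeling_module, "build_causal_mask",
                           traceable_causal_mask):
        yield

# Usage: with export_patches(modeling_module):
#            exported = torch.export.export(wrapper, example_args)
```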

GR00T’s simpler DiT graph (no DynamicCache, no PaliGemma masking) traces cleanly via torch.onnx.export(opset=19). Details in reflex_context/01_architecture/pi0_monolithic_wrap_pattern.md.
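For reference, a minimal sketch of that call. The wrapper and tensor names are assumptions; the call itself is the stock torch.onnx.export API at opset 19, with no custom patches needed.

```python
import torch


def export_groot_dit(dit_wrapper: torch.nn.Module,
                     example_inputs: tuple[torch.Tensor, ...],
                     out_path: str = "expert_stack.onnx") -> None:
    # Hypothetical wrapper and I/O names; the DiT graph goes straight
    # through the standard exporter.
    torch.onnx.export(
        dit_wrapper,
        example_inputs,
        out_path,
        opset_version=19,
        input_names=["noisy_actions", "timesteps", "vlm_features"],
        output_names=["actions"],
    )
```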

Every export runs a parity check before writing the output:

fixture_idx   max_abs_diff   mean_abs_diff   passed
0             3.21e-06       8.40e-07        PASS
1             2.98e-06       7.92e-07        PASS
...

Summary
max_abs_diff_across_all   3.21e-06
passed                    PASS

The threshold is 1e-4 by default, 1e-5 strict. All four supported VLAs pass at strict on their canonical paths. See Verified parity for the full ledger.
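The check itself is simple. A minimal sketch, with the model and fixture plumbing as illustrative assumptions (the CLI runs this automatically before writing output):

```python
import numpy as np
import onnxruntime as ort
import torch


def check_parity(model: torch.nn.Module, onnx_path: str,
                 fixtures: list[dict[str, torch.Tensor]],
                 threshold: float = 1e-4) -> float:
    # Compare the PyTorch reference against the exported ONNX on each fixture.
    sess = ort.InferenceSession(onnx_path)
    worst = 0.0
    for fixture in fixtures:
        with torch.no_grad():
            ref = model(**fixture).cpu().numpy()
        (out,) = sess.run(None, {k: v.cpu().numpy() for k, v in fixture.items()})
        worst = max(worst, float(np.max(np.abs(ref - out))))
    assert worst < threshold, f"parity failure: max_abs_diff={worst:.2e}"
    return worst
```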

Target      Hardware                Precision
orin-nano   Jetson Orin Nano        fp16
orin        Jetson Orin (32 GB)     fp16
orin-64     Jetson Orin (64 GB)     fp16
thor        Jetson Thor             fp8
desktop     RTX / A100              fp16
cpu         Apple Silicon, x86_64   fp32
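The target selects the precision profile for the engine build; for example, an fp8 export for Thor:

reflex export nvidia/GR00T-N1.6-3B --target thor --output ./groot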

reflex inspect targets lists the current profiles and shows which models support each.

If the export fails, the tool exits non-zero with a remediation hint. Common cases:

  • Where Cast missing — the post-export Where→Cast insertion wasn’t applied. Re-run with --patch-where-cast. (v0.5+ does this automatically.)
  • InsufficientVRAMError during decomposed parallel export — drop --export-mode parallel, or use a higher-VRAM host. The error is fail-loud by design (no silent fallback to sequential).
  • max_abs_diff > 1e-4 — file a bug. This means PyTorch and ONNX disagree, which is the deployment failure mode the tool exists to prevent.