Reflex is a deployment toolchain. It does not train models; it does not run cloud inference at scale. It takes a trained vision-language-action policy and runs it on the robot you have, with verified numerical parity to the PyTorch reference and a composable runtime around it.
Each stage hands off a verifiable artifact. The export step writes ONNX + a reflex_config.json and refuses to ship if the output deviation from the PyTorch reference exceeds the parity threshold (default 1e-4). The serve step reads that artifact, picks the right ONNX Runtime execution provider for your hardware, and listens on /act.
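A minimal sketch of that parity gate, assuming a single-output graph; the helper name verify_parity and the input schema are illustrative, and the real export step owns the actual loading and thresholds:

```python
import numpy as np
import onnxruntime as ort
import torch

ATOL = 1e-4  # default parity threshold; export refuses to ship past this

def verify_parity(torch_model: torch.nn.Module, onnx_path: str,
                  example_inputs: dict[str, np.ndarray]) -> float:
    """Compare the ONNX export against the PyTorch reference on one batch."""
    with torch.no_grad():
        ref = torch_model(*[torch.from_numpy(v) for v in example_inputs.values()])
    ref = ref.cpu().numpy()

    sess = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    (out,) = sess.run(None, example_inputs)  # assumes a single output tensor

    max_abs_diff = float(np.max(np.abs(ref - out)))
    if max_abs_diff > ATOL:
        raise SystemExit(f"parity failed: max|ref-onnx| = {max_abs_diff:.2e} > {ATOL}")
    return max_abs_diff
```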
reflex serve is one FastAPI process that loads one ONNX session and exposes /act. Every other capability is a wedge — a composable opt-in flag that adds telemetry, safety, optimization, or transport without changing the core call.
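Conceptually, the bare server is little more than the following; the artifact file name, observation schema, and response shape here are assumptions, not the real layout:

```python
import json
from pathlib import Path

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel

EXPORT_DIR = Path("./export")                  # artifact written by `reflex export`
config = json.loads((EXPORT_DIR / "reflex_config.json").read_text())

# One process, one session. Provider order falls back left to right.
session = ort.InferenceSession(
    str(EXPORT_DIR / "policy.onnx"),           # illustrative file name
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider",
               "CPUExecutionProvider"],
)

app = FastAPI()

class Observation(BaseModel):
    state: list[float]                         # real schema comes from the config

@app.post("/act")
def act(obs: Observation) -> dict:
    x = np.asarray(obs.state, dtype=np.float32)[None, :]
    (chunk,) = session.run(None, {session.get_inputs()[0].name: x})
    return {"actions": chunk.tolist()}
```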
Every wedge is independently opt-in. A bare reflex serve ./export is a working server. Adding flags layers in policy versioning, cost-weighted batching, an SLO tracker, an A2C2 correction head, and so on — each composable in any order.
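One way to picture the wedge model, with illustrative wedges rather than the real flags: each wedge wraps the core act function without changing its signature, so any subset composes in any order:

```python
import time
from functools import reduce
from typing import Callable

import numpy as np

# The core call: observation in, action chunk out.
ActFn = Callable[[np.ndarray], np.ndarray]

def with_latency_slo(act: ActFn, budget_ms: float = 50.0) -> ActFn:
    """Illustrative telemetry wedge: report when /act blows its latency budget."""
    def wrapped(obs: np.ndarray) -> np.ndarray:
        t0 = time.perf_counter()
        chunk = act(obs)
        ms = (time.perf_counter() - t0) * 1e3
        if ms > budget_ms:
            print(f"SLO miss: {ms:.1f}ms > {budget_ms}ms")
        return chunk
    return wrapped

def with_action_clamp(act: ActFn, lo: float = -1.0, hi: float = 1.0) -> ActFn:
    """Illustrative safety wedge: clamp the chunk to fixed limits."""
    def wrapped(obs: np.ndarray) -> np.ndarray:
        return np.clip(act(obs), lo, hi)
    return wrapped

def compose(core: ActFn, *wedges) -> ActFn:
    """Wrap left to right, so the last wedge runs outermost; no wedges is valid."""
    return reduce(lambda fn, wedge: wedge(fn), wedges, core)

# e.g. serve_fn = compose(session_act, with_action_clamp, with_latency_slo)
```

Because every wedge sees only the same observation-in, chunk-out contract, none of them touches the ONNX session, which is what keeps them order-independent.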
For pi0.5 specifically, the export defaults to a decomposed mode that splits the model into a vision-language prefix and an action-expert denoise step. The prefix runs once per cache miss; the expert runs 10 times per /act (once per Euler step), reading from the cached past_kv.
The decomposition is mathematically equivalent — cos = +1.000000 parity verified end-to-end. The win is purely from removing the redundant VLM forwards that the monolithic graph bakes in. Details: decomposed pi0.5.
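A sketch of the decomposed /act path under those numbers: 10 Euler steps and a prefix cache keyed on the vision-language inputs. The graph file names, input names, and chunk shape below are illustrative assumptions, not the exported interface:

```python
import hashlib

import numpy as np
import onnxruntime as ort

EULER_STEPS = 10      # one action-expert forward per denoise step
CHUNK, DOF = 50, 7    # illustrative action-chunk shape

# Two graphs instead of one monolith (file names are illustrative).
prefix = ort.InferenceSession("export/prefix.onnx", providers=["CPUExecutionProvider"])
expert = ort.InferenceSession("export/expert.onnx", providers=["CPUExecutionProvider"])

kv_cache: dict[bytes, list[np.ndarray]] = {}

def act(image: np.ndarray, tokens: np.ndarray, state: np.ndarray) -> np.ndarray:
    # VLM prefix runs once per cache miss, keyed on its own inputs.
    key = hashlib.sha1(image.tobytes() + tokens.tobytes()).digest()
    if key not in kv_cache:
        kv_cache[key] = prefix.run(None, {"image": image, "tokens": tokens})
    past_kv = kv_cache[key]

    # Denoise: 10 Euler steps, each one expert forward over the cached past_kv.
    actions = np.random.randn(1, CHUNK, DOF).astype(np.float32)  # noise init
    dt = 1.0 / EULER_STEPS
    for step in range(EULER_STEPS):
        feeds = {"state": state, "noisy_actions": actions,
                 "t": np.full((1,), step * dt, dtype=np.float32)}
        feeds.update({f"past_kv_{i}": kv for i, kv in enumerate(past_kv)})
        (velocity,) = expert.run(None, feeds)   # assumes a single output
        actions = actions + dt * velocity       # Euler update toward the clean chunk
    return actions
```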
- **Edge GPU**: Jetson Orin Nano / AGX / Thor or a desktop NVIDIA GPU. The whole inference stack runs locally; nothing crosses the network for /act.
- **reflex serve**: One Python process per robot. FastAPI on the front, ORT-TensorRT under it; see the provider-selection sketch after this list. Wedges layered as middleware.
- **reflex export**: Run once at deploy time. Validates parity, writes the artifact, exits. Not in the hot path.
- **Cloud (optional)**: Modal A100/H100 for benchmark runs and reflex eval LIBERO sweeps. Never on the robot's hot path.
| Layer | Owner | Reflex’s role |
|---|---|---|
| Model training | PyTorch / JAX, lerobot, openpi | Consumer — Reflex pulls trained weights from HuggingFace |
| Cloud inference at scale | vLLM, Triton, Baseten | Out of scope — those serve LLMs at scale, not VLAs at the edge |
| Robot controllers | ROS2, SO-ARM firmware, manufacturer SDKs | Out of scope — Reflex returns action chunks; the controller actuates |
| Model registry | HuggingFace Hub | Consumer — Reflex pulls; doesn’t host |
This is the “win narrow” thesis. Generic inference servers are far more polished than Reflex along their own dimensions. Reflex wins by being VLA-specific in ways they can't match (numerical parity verification at machine precision, decomposed architecture, A2C2 chunk correction, ActionGuard with URDF-derived joint limits, episode-aware policy routing).