
Architecture

Reflex is a deployment toolchain. It does not train models; it does not run cloud inference at scale. It takes a trained vision-language-action policy and runs it on the robot you have, with verified numerical parity to the PyTorch reference and a composable runtime around it.

[Pipeline diagram] HuggingFace (trained model: pi0 · pi0.5 · SmolVLA · GR00T) → reflex export (deployable artifact: torch → ONNX, cos = +1.0 verified, monolithic / decomposed) → reflex serve (edge GPU process: FastAPI + ORT-TRT, composable wedges, /act · /health · /metrics) → real robot (physical actuation: Franka · UR5 · SO-100 · Trossen)

Each stage hands off a verifiable artifact. The export step writes ONNX + a reflex_config.json and refuses to ship if numerical parity to the PyTorch reference fails the threshold (default 1e-4). The serve step reads that artifact, picks the right ONNX provider for your hardware, and listens on /act.
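A minimal sketch of what that parity gate does, assuming a cosine-similarity report plus an element-wise threshold check; the function name, file handling, and exact metric are illustrative, not Reflex's actual export code:

```python
# Sketch of the export-time parity gate: run the same inputs through the
# PyTorch reference and the exported ONNX graph, then compare the outputs.
# The 1e-4 threshold follows the text; everything else is illustrative.
import numpy as np
import onnxruntime as ort
import torch

def verify_parity(torch_model, onnx_path, example_inputs, threshold=1e-4):
    torch_model.eval()
    with torch.no_grad():
        ref = torch_model(*example_inputs).cpu().numpy().ravel()

    sess = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    feeds = {inp.name: x.cpu().numpy() for inp, x in zip(sess.get_inputs(), example_inputs)}
    out = sess.run(None, feeds)[0].ravel()

    cos = float(np.dot(ref, out) / (np.linalg.norm(ref) * np.linalg.norm(out)))
    max_abs = float(np.max(np.abs(ref - out)))
    if max_abs > threshold:
        raise RuntimeError(
            f"parity failed: max |diff| = {max_abs:.2e} > {threshold:.0e} (cos = {cos:+.6f})"
        )
    return cos, max_abs
```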

reflex serve is one FastAPI process that loads one ONNX session and exposes /act. Every other capability is a wedge — a composable opt-in flag that adds telemetry, safety, optimization, or transport without changing the core call.
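The serve core can be pictured as the sketch below: one process, one ONNX session, one route. The request fields follow the /act payload shown in the request-path diagram further down; the artifact path, provider list, and preprocessing are placeholders, not Reflex's actual code.

```python
# Minimal picture of the serve core: one FastAPI process, one ONNX session, /act.
import base64

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
session = ort.InferenceSession(
    "export/policy.onnx",  # illustrative artifact path
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)

class ActRequest(BaseModel):
    instruction: str
    state: list[float]
    image: str          # base64-encoded camera frame
    episode_id: str

def preprocess(req: ActRequest) -> dict:
    # Placeholder: a real server tokenizes the instruction and decodes/resizes
    # the image to match the exported graph's input names and shapes.
    frame = np.frombuffer(base64.b64decode(req.image), dtype=np.uint8)
    return {"state": np.asarray([req.state], dtype=np.float32), "image": frame}

@app.post("/act")
def act(req: ActRequest):
    actions = session.run(None, preprocess(req))[0]   # (chunk_len, action_dim)
    return {"actions": actions.tolist(), "episode_id": req.episode_id}
```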

[Request-path diagram] Client (control loop / ROS2 / MCP agent) → POST /act { instruction, state, image, episode_id } → FastAPI + middleware (CORS · auth · OpenTelemetry tracing · request_id) → wedge stack (policy router with 2-policy A/B · cost-batching scheduler · SLO + circuit breaker · deadline + cloud fallback · A2C2 head, post-policy · ActionGuard + audit log) → inference engine (ORT TensorRT EP, fp16/fp8; CUDA graphs · cache reuse · adaptive denoise) → edge GPU (Jetson Orin / Thor / desktop NVIDIA)
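From the client side, one /act call might look like the sketch below. The field names follow the payload in the diagram above; the base64 image encoding, port, and timeout are assumptions.

```python
# Illustrative control-loop client for /act; payload fields follow the diagram,
# the encoding and endpoint details are assumptions.
import base64
import requests

def get_action_chunk(image_bytes: bytes, joint_state: list[float]) -> list[list[float]]:
    resp = requests.post(
        "http://localhost:8000/act",
        json={
            "instruction": "pick up the red block",
            "state": joint_state,
            "image": base64.b64encode(image_bytes).decode("ascii"),
            "episode_id": "episode-001",
        },
        timeout=1.0,
    )
    resp.raise_for_status()
    return resp.json()["actions"]
```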

Every wedge is independently opt-in. A bare reflex serve ./export is a working server. Adding flags layers in policy versioning, cost-weighted batching, an SLO tracker, an A2C2 correction head, and so on — each composable in any order.
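Conceptually, a wedge is a wrapper around the core inference call: it observes or adjusts the request/response without touching the call itself, which is why any subset composes. The names below are illustrative only, not Reflex's wedge API.

```python
# Conceptual sketch of wedge composition: each wedge wraps the unchanged core call.
import time
from typing import Callable

ActFn = Callable[[dict], dict]

def with_latency_metrics(inner: ActFn) -> ActFn:
    """Wedge: attach per-call latency to the response."""
    def wrapped(request: dict) -> dict:
        start = time.perf_counter()
        response = inner(request)
        response["latency_ms"] = (time.perf_counter() - start) * 1e3
        return response
    return wrapped

def with_deadline(inner: ActFn, budget_ms: float = 100.0) -> ActFn:
    """Wedge: flag calls that blow a latency budget (measures its own time)."""
    def wrapped(request: dict) -> dict:
        start = time.perf_counter()
        response = inner(request)
        if (time.perf_counter() - start) * 1e3 > budget_ms:
            response["deadline_missed"] = True
        return response
    return wrapped

def core_act(request: dict) -> dict:
    # Stand-in for the single ONNX session call.
    return {"actions": [[0.0] * 7] * 50}

# Wedges stack in any order around the same core call.
act = with_deadline(with_latency_metrics(core_act))
```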

For pi0.5 specifically, the export defaults to a decomposed mode that splits the model into a vision-language prefix and an action-expert denoise step. The prefix runs once per cache miss; the expert runs 10 times per /act (once per Euler step) reading from the cached past_kv.

[Diagram: decomposed pi0.5, one /act call] vlm_prefix.onnx (PaliGemma backbone; image + instruction; runs once per cache miss) → past_kv cache (warm for the whole episode) → expert_denoise.onnx (action expert + 1 step of Euler integration; runs 10× per /act with num_steps=10) → action chunk (50 × 7-dim, Franka). Within-call latency on Jetson AGX Orin: monolithic 900 ms, decomposed 100 ms (9.0×). Monolithic re-runs the VLM 10× per call; decomposed reuses the prefix.

The decomposition is mathematically equivalent — cos = +1.000000 parity verified end-to-end. The win is purely from removing the redundant VLM forwards that the monolithic graph bakes in. Details: decomposed pi0.5.
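As a rough sketch of the within-call structure (the two file names come from the diagram above; tensor names, shapes, the noise initialization, and the caching key are assumptions):

```python
# Sketch of one decomposed /act call: run the VLM prefix once per cache miss,
# then take num_steps Euler steps through the action expert, reusing past_kv.
import numpy as np
import onnxruntime as ort

prefix_sess = ort.InferenceSession("export/vlm_prefix.onnx")
expert_sess = ort.InferenceSession("export/expert_denoise.onnx")

_kv_cache: dict[str, list[np.ndarray]] = {}   # episode_id -> cached past_kv

def act(episode_id: str, image: np.ndarray, tokens: np.ndarray,
        state: np.ndarray, num_steps: int = 10) -> np.ndarray:
    if episode_id not in _kv_cache:            # prefix runs once per cache miss
        _kv_cache[episode_id] = prefix_sess.run(None, {"image": image, "tokens": tokens})
    past_kv = _kv_cache[episode_id]

    actions = np.random.randn(50, 7).astype(np.float32)   # noisy action chunk
    dt = 1.0 / num_steps
    for step in range(num_steps):              # expert runs num_steps times per /act
        feeds = {"actions": actions, "state": state,
                 "t": np.array(step * dt, dtype=np.float32)}
        feeds.update({f"past_kv_{i}": kv for i, kv in enumerate(past_kv)})
        velocity = expert_sess.run(None, feeds)[0]
        actions = actions + dt * velocity       # one Euler integration step
    return actions                              # 50 × 7 action chunk
```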

Edge GPU

Jetson Orin Nano / AGX / Thor or a desktop NVIDIA GPU. The whole inference stack runs locally; nothing crosses the network for /act.

Reflex serve

One Python process per robot. FastAPI on the front, ORT-TensorRT under it. Wedges layered as middleware.

Reflex export

Run once at deploy time. Validates parity, writes the artifact, exits. Not in the hot path.

Cloud (optional)

Modal A100/H100 for benchmark runs and reflex eval LIBERO sweeps. Never on the robot’s hot path.

Layer                    | Owner                                      | Reflex's role
Model training           | PyTorch / JAX, lerobot, openpi             | Consumer: Reflex pulls trained weights from HuggingFace
Cloud inference at scale | vLLM, Triton, Baseten                      | Out of scope: those serve LLMs at scale, not VLAs at the edge
Robot controllers        | ROS2, SO-ARM firmware, manufacturer SDKs   | Out of scope: Reflex returns action chunks; the controller actuates
Model registry           | HuggingFace Hub                            | Consumer: Reflex pulls; doesn't host

This is the “win narrow” thesis. Generic inference servers are infinitely more polished than Reflex along their own dimensions. Reflex wins by being VLA-specific where they can’t (numerical parity verification at machine precision, decomposed architecture, A2C2 chunk correction, ActionGuard with URDF-derived joint limits, episode-aware policy routing).
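As one concrete example of that VLA-specific surface area, a post-policy guard that clamps an action chunk against joint limits might look like the sketch below. The limit values and shapes are illustrative; ActionGuard's real behaviour, URDF parsing, and audit-log format are not shown.

```python
# Illustrative joint-limit clamp in the spirit of ActionGuard: the guard sits
# after the policy and bounds each action in the chunk. The limits here are
# approximate stand-ins; the real ones would be derived from the robot's URDF.
import numpy as np

LOWER = np.array([-2.90, -1.76, -2.90, -3.07, -2.90, -0.02, -2.90], dtype=np.float32)
UPPER = np.array([ 2.90,  1.76,  2.90, -0.07,  2.90,  3.75,  2.90], dtype=np.float32)

def guard_chunk(actions: np.ndarray) -> tuple[np.ndarray, list[int]]:
    """Clamp a (chunk_len, 7) action chunk to joint limits; report violating rows."""
    clipped = np.clip(actions, LOWER, UPPER)
    violations = np.where(np.any(clipped != actions, axis=1))[0].tolist()
    return clipped, violations
```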