
Architecture

Reflex is a deployment toolchain. It does not train models; it does not run cloud inference at scale. It takes a trained vision-language-action policy and runs it on the robot you have, with verified numerical parity to the PyTorch reference and a composable runtime around it.

[Pipeline diagram] HuggingFace (trained model: pi0 · pi0.5 · SmolVLA · GR00T) → reflex export (deployable artifact: torch → ONNX, cos = +1.0 verified, monolithic / decomposed) → reflex serve (edge GPU process: FastAPI + ORT-TRT, composable wedges, /act · /health · /metrics) → real robot (physical actuation: Franka · UR5 · SO-100 · Trossen)

Each stage hands off a verifiable artifact. The export step writes ONNX + a reflex_config.json and refuses to ship if numerical parity to the PyTorch reference fails the threshold (default 1e-4). The serve step reads that artifact, picks the right ONNX provider for your hardware, and listens on /act.
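A minimal sketch of what that parity gate does, assuming a cosine-similarity report plus an element-wise threshold check; the function name, file handling, and exact metric are illustrative, not Reflex's actual export code:

```python
# Sketch of the export-time parity gate: run the same inputs through the
# PyTorch reference and the exported ONNX graph, then compare the outputs.
# The 1e-4 threshold follows the text; everything else is illustrative.
import numpy as np
import onnxruntime as ort
import torch

def verify_parity(torch_model, onnx_path, example_inputs, threshold=1e-4):
    torch_model.eval()
    with torch.no_grad():
        ref = torch_model(*example_inputs).cpu().numpy().ravel()

    sess = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    feeds = {inp.name: x.cpu().numpy() for inp, x in zip(sess.get_inputs(), example_inputs)}
    out = sess.run(None, feeds)[0].ravel()

    cos = float(np.dot(ref, out) / (np.linalg.norm(ref) * np.linalg.norm(out)))
    max_abs = float(np.max(np.abs(ref - out)))
    if max_abs > threshold:
        raise RuntimeError(
            f"parity failed: max |diff| = {max_abs:.2e} > {threshold:.0e} (cos = {cos:+.6f})"
        )
    return cos, max_abs
```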

reflex serve is one FastAPI process that loads one ONNX session and exposes /act. Every other capability is a wedge — a composable opt-in flag that adds telemetry, safety, optimization, or transport without changing the core call.
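The serve core can be pictured as the sketch below: one process, one ONNX session, one route. The request fields follow the /act payload shown in the request-path diagram further down; the artifact path, provider list, and preprocessing are placeholders, not Reflex's actual code.

```python
# Minimal picture of the serve core: one FastAPI process, one ONNX session, /act.
import base64

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
session = ort.InferenceSession(
    "export/policy.onnx",  # illustrative artifact path
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)

class ActRequest(BaseModel):
    instruction: str
    state: list[float]
    image: str          # base64-encoded camera frame
    episode_id: str

def preprocess(req: ActRequest) -> dict:
    # Placeholder: a real server tokenizes the instruction and decodes/resizes
    # the image to match the exported graph's input names and shapes.
    frame = np.frombuffer(base64.b64decode(req.image), dtype=np.uint8)
    return {"state": np.asarray([req.state], dtype=np.float32), "image": frame}

@app.post("/act")
def act(req: ActRequest):
    actions = session.run(None, preprocess(req))[0]   # (chunk_len, action_dim)
    return {"actions": actions.tolist(), "episode_id": req.episode_id}
```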

[Request-path diagram] Client (control loop / ROS2 / MCP agent) → POST /act { instruction, state, image, episode_id } → FastAPI + middleware (CORS · auth · OpenTelemetry tracing · request_id) → wedge stack (policy router with 2-policy A/B · cost-batching scheduler · SLO + circuit breaker · deadline + cloud fallback · A2C2 head, post-policy · ActionGuard + audit log) → inference engine (ORT TensorRT EP, fp16/fp8; CUDA graphs · cache reuse · adaptive denoise) → edge GPU (Jetson Orin / Thor / desktop NVIDIA)
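From the client side, one /act call might look like the sketch below. The field names follow the payload in the diagram above; the base64 image encoding, port, and timeout are assumptions.

```python
# Illustrative control-loop client for /act; payload fields follow the diagram,
# the encoding and endpoint details are assumptions.
import base64
import requests

def get_action_chunk(image_bytes: bytes, joint_state: list[float]) -> list[list[float]]:
    resp = requests.post(
        "http://localhost:8000/act",
        json={
            "instruction": "pick up the red block",
            "state": joint_state,
            "image": base64.b64encode(image_bytes).decode("ascii"),
            "episode_id": "episode-001",
        },
        timeout=1.0,
    )
    resp.raise_for_status()
    return resp.json()["actions"]
```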

Every wedge is independently opt-in. A bare reflex serve ./export is a working server. Adding flags layers in policy versioning, cost-weighted batching, an SLO tracker, an A2C2 correction head, and so on — each composable in any order.
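Conceptually, a wedge is a wrapper around the core inference call: it observes or adjusts the request/response without touching the call itself, which is why any subset composes. The names below are illustrative only, not Reflex's wedge API.

```python
# Conceptual sketch of wedge composition: each wedge wraps the unchanged core call.
import time
from typing import Callable

ActFn = Callable[[dict], dict]

def with_latency_metrics(inner: ActFn) -> ActFn:
    """Wedge: attach per-call latency to the response."""
    def wrapped(request: dict) -> dict:
        start = time.perf_counter()
        response = inner(request)
        response["latency_ms"] = (time.perf_counter() - start) * 1e3
        return response
    return wrapped

def with_deadline(inner: ActFn, budget_ms: float = 100.0) -> ActFn:
    """Wedge: flag calls that blow a latency budget (measures its own time)."""
    def wrapped(request: dict) -> dict:
        start = time.perf_counter()
        response = inner(request)
        if (time.perf_counter() - start) * 1e3 > budget_ms:
            response["deadline_missed"] = True
        return response
    return wrapped

def core_act(request: dict) -> dict:
    # Stand-in for the single ONNX session call.
    return {"actions": [[0.0] * 7] * 50}

# Wedges stack in any order around the same core call.
act = with_deadline(with_latency_metrics(core_act))
```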

For pi0.5 specifically, the export defaults to a decomposed mode that splits the model into a vision-language prefix and an action-expert denoise step. The prefix runs once per cache miss; the expert runs 10 times per /act (once per Euler step) reading from the cached past_kv.

[Diagram: decomposed pi0.5, one /act call] vlm_prefix.onnx (PaliGemma backbone; image + instruction; runs once per cache miss) → past_kv cache (warm for the whole episode) → expert_denoise.onnx (action expert + 1 step of Euler integration; runs 10× per /act with num_steps=10) → action chunk (50 × 7-dim, Franka). Within-call latency on Jetson AGX Orin: monolithic 900 ms, decomposed 100 ms (9.0×). Monolithic re-runs the VLM 10× per call; decomposed reuses the prefix.

The decomposition is mathematically equivalent — cos = +1.000000 parity verified end-to-end. The win is purely from removing the redundant VLM forwards that the monolithic graph bakes in. Details: decomposed pi0.5.
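As a rough sketch of the within-call structure (the two file names come from the diagram above; tensor names, shapes, the noise initialization, and the caching key are assumptions):

```python
# Sketch of one decomposed /act call: run the VLM prefix once per cache miss,
# then take num_steps Euler steps through the action expert, reusing past_kv.
import numpy as np
import onnxruntime as ort

prefix_sess = ort.InferenceSession("export/vlm_prefix.onnx")
expert_sess = ort.InferenceSession("export/expert_denoise.onnx")

_kv_cache: dict[str, list[np.ndarray]] = {}   # episode_id -> cached past_kv

def act(episode_id: str, image: np.ndarray, tokens: np.ndarray,
        state: np.ndarray, num_steps: int = 10) -> np.ndarray:
    if episode_id not in _kv_cache:            # prefix runs once per cache miss
        _kv_cache[episode_id] = prefix_sess.run(None, {"image": image, "tokens": tokens})
    past_kv = _kv_cache[episode_id]

    actions = np.random.randn(50, 7).astype(np.float32)   # noisy action chunk
    dt = 1.0 / num_steps
    for step in range(num_steps):              # expert runs num_steps times per /act
        feeds = {"actions": actions, "state": state,
                 "t": np.array(step * dt, dtype=np.float32)}
        feeds.update({f"past_kv_{i}": kv for i, kv in enumerate(past_kv)})
        velocity = expert_sess.run(None, feeds)[0]
        actions = actions + dt * velocity       # one Euler integration step
    return actions                              # 50 × 7 action chunk
```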

Edge GPU

Jetson Orin Nano / AGX / Thor or a desktop NVIDIA GPU. The whole inference stack runs locally; nothing crosses the network for /act.

Reflex serve

One Python process per robot. FastAPI on the front, ORT-TensorRT under it. Wedges layered as middleware.

Reflex export

Run once at deploy time. Validates parity, writes the artifact, exits. Not in the hot path.

Cloud (optional)

Modal A100/H100 for benchmark runs and reflex eval LIBERO sweeps. Never on the robot’s hot path.

Layer                    | Owner                                      | Reflex's role
Model training           | PyTorch / JAX, lerobot, openpi             | Consumer: Reflex pulls trained weights from HuggingFace
Cloud inference at scale | vLLM, Triton, Baseten                      | Out of scope: those serve LLMs at scale, not VLAs at the edge
Robot controllers        | ROS2, SO-ARM firmware, manufacturer SDKs   | Out of scope: Reflex returns action chunks; the controller actuates
Model registry           | HuggingFace Hub                            | Consumer: Reflex pulls; doesn't host

This is the “win narrow” thesis. Generic inference servers are infinitely more polished than Reflex along their own dimensions. Reflex wins by being VLA-specific where they can’t (numerical parity verification at machine precision, decomposed architecture, A2C2 chunk correction, ActionGuard with URDF-derived joint limits, episode-aware policy routing).
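As one concrete example of that VLA-specific surface area, a post-policy guard that clamps an action chunk against joint limits might look like the sketch below. The limit values and shapes are illustrative; ActionGuard's real behaviour, URDF parsing, and audit-log format are not shown.

```python
# Illustrative joint-limit clamp in the spirit of ActionGuard: the guard sits
# after the policy and bounds each action in the chunk. The limits here are
# approximate stand-ins; the real ones would be derived from the robot's URDF.
import numpy as np

LOWER = np.array([-2.90, -1.76, -2.90, -3.07, -2.90, -0.02, -2.90], dtype=np.float32)
UPPER = np.array([ 2.90,  1.76,  2.90, -0.07,  2.90,  3.75,  2.90], dtype=np.float32)

def guard_chunk(actions: np.ndarray) -> tuple[np.ndarray, list[int]]:
    """Clamp a (chunk_len, 7) action chunk to joint limits; report violating rows."""
    clipped = np.clip(actions, LOWER, UPPER)
    violations = np.where(np.any(clipped != actions, axis=1))[0].tolist()
    return clipped, violations
```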