Why edge-first
The dominant pattern in LLM inference is cloud-first: vLLM, Triton, Modal, and Baseten all target datacenter GPUs first. Robotics is the opposite. Most production robot deployments need inference on the robot, not in a datacenter, and that means Jetson Orin / Thor or a desktop NVIDIA GPU bolted to the robot’s chassis. Reflex is built around that constraint.
The deployment-tier pyramid
| Tier | Hardware | Dominant use case | Reflex priority |
|---|---|---|---|
| Edge | Jetson Orin Nano / AGX / Thor | Production deployment on a real robot | Primary |
| Workstation | Desktop NVIDIA (RTX 4090, RTX 5090, RTX PRO 6000) | Researcher’s desk, dev rig | Primary |
| Cloud (datacenter) | A10G, A100, H100 | Benchmarking, training, validation runs | Secondary — supported but not the design center |
Why edge-first matters in practice:
- Latency. Robots can’t wait for a cloud round-trip. Stripe-tier latency to the cloud (~50–100 ms) is fine for `/v1/charge`; it’s a death sentence for a 30 Hz control loop, where the entire loop budget is ~33 ms.
- Privacy / regulation. Surgical robots, household robots, defense: all have legal or contractual reasons not to ship camera frames to the cloud.
- Network availability. Warehouse robots, agricultural drones, exploration robots — none can rely on stable network access.
- Cost. A Jetson Orin AGX is $1,999 once. A cloud A10G runs ~$0.80/hour, which works out to roughly $7,000 a year for one always-on robot and adds up fast at robotics scale.
- Power. Edge silicon is power-budget-aware. A robot can’t spend 300W on inference.
What this means for Reflex’s design
Static-shape ONNX, not dynamic-batch
Reflex exports are shape-specialized at export time: per embodiment, per model, per target. This bakes in better TRT optimizations and avoids the “dynamic batch dim explodes engine size” problem on Jetson’s smaller VRAM.
The cost: cross-shape batching needs separate engines per shape. The win: every shape gets best-in-class TRT FP16 (or FP8 on Thor) without runtime overhead.
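To make the mechanism concrete, here is a minimal sketch of a shape-specialized export, assuming a PyTorch source model. The stand-in `PolicyHead` module and its shapes are illustrative, not Reflex’s actual export code; the key detail is the absence of `dynamic_axes`:

```python
import torch

# Illustrative stand-in for a VLA component; not Reflex's export path.
class PolicyHead(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(1024, 32)

    def forward(self, x):
        return self.proj(x)

model = PolicyHead().eval()
# Static-shape export: batch=1, seq=256 are baked into the graph.
# Omitting dynamic_axes fixes every dimension, so TensorRT can
# specialize the engine for exactly this shape.
torch.onnx.export(
    model,
    torch.randn(1, 256, 1024),
    "policy_head.onnx",
    input_names=["features"],
    output_names=["action"],
    opset_version=17,
    # no dynamic_axes: one engine per embodiment / model / target
)
```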
Memory-fit checks at startup, not runtime
`reflex doctor` flags memory pressure before serve starts. The decomposed pi0.5 export includes a refuse-to-load check on Orin Nano, because the 12.5 GB monolithic model can’t load even in FP16. We’d rather fail loud at startup than silently OOM on the first `/act`.
Cloud-first tools default to “load it and see” — an Orin Nano user can’t afford that.
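As a sketch of the idea (not `reflex doctor`’s actual implementation), a startup fit check on a discrete GPU could look like the following. `MODEL_BYTES` and the headroom factor are assumptions, and on Jetson’s unified memory you would probe system RAM instead of NVML:

```python
import pynvml

MODEL_BYTES = int(12.5 * 1024**3)  # e.g. the 12.5 GB monolithic export
HEADROOM = 1.2                     # assumed margin for activations/workspace

def check_memory_fit(model_bytes: int, device_index: int = 0) -> None:
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        free = pynvml.nvmlDeviceGetMemoryInfo(handle).free
        needed = int(model_bytes * HEADROOM)
        if free < needed:
            # Refuse to load: fail loud at startup, not on the first /act.
            raise RuntimeError(
                f"refusing to load: need ~{needed / 1e9:.1f} GB, "
                f"only {free / 1e9:.1f} GB free"
            )
    finally:
        pynvml.nvmlShutdown()

check_memory_fit(MODEL_BYTES)
```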
CUDA graphs default-off, tier-aware
CUDA graphs add ~128 MB of VRAM overhead per captured graph. On A100, that’s a rounding error. On A10G, it costs the `vlm_prefix` capture (graceful degradation to eager). On Orin Nano, we haven’t validated. So `--cuda-graphs` is opt-in for Phase 1, with explicit per-tier behavior documented.
Cloud-first tools default this on; edge-first tools surface the trade-off.
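For reference, the opt-in capture-with-fallback shape looks roughly like this in PyTorch. `run_step` and the static buffers are hypothetical names, and Reflex’s actual capture logic is not shown here:

```python
import torch

def maybe_capture(run_step, static_input, enable_cuda_graphs: bool):
    """Capture run_step into a CUDA graph if enabled; fall back to eager."""
    if not enable_cuda_graphs:
        return run_step  # default-off: plain eager execution

    try:
        # Warm up on a side stream first, as CUDA graph capture requires.
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            static_output = run_step(static_input)
        torch.cuda.current_stream().wait_stream(s)

        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            static_output = run_step(static_input)

        def replay(x):
            static_input.copy_(x)  # graphs replay fixed memory addresses
            graph.replay()
            return static_output

        return replay
    except RuntimeError:
        # E.g. not enough VRAM for the ~128 MB capture: degrade to eager.
        return run_step
```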
TensorRT EP, not just CUDA EP
A cloud-first VLA stack would settle for ORT-CUDA, which is fine in a datacenter where memory is cheap. Edge-first tools care about the 5.55× ORT-CUDA → ORT-TRT win because that’s the difference between “Jetson works” and “Jetson is too slow.” Reflex pulls `tensorrt>=10.0` in the `[serve,gpu]` extras and patches `LD_LIBRARY_PATH` automatically at import.
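In ONNX Runtime terms, preferring the TensorRT EP with a CUDA fallback looks like the snippet below. The provider option names are standard ORT-TRT options; the model path and settings are illustrative, not Reflex’s actual session config:

```python
import onnxruntime as ort

providers = [
    (
        "TensorrtExecutionProvider",
        {
            "trt_fp16_enable": True,          # FP16 engines on Orin / Ampere
            "trt_engine_cache_enable": True,  # reuse built engines across runs
            "trt_engine_cache_path": "./trt_cache",
        },
    ),
    "CUDAExecutionProvider",  # fallback if TRT can't take a subgraph
    "CPUExecutionProvider",
]
session = ort.InferenceSession("policy_head.onnx", providers=providers)
```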
One-process-per-robot, not multi-tenant
Cloud inference servers are multi-tenant by default: many models, many queues, many requests in flight. Edge robotics is the opposite: one model, one robot, one process. Reflex’s queue/scheduler optimizes for low tail latency on a single workload, not throughput across many. The 1000-request queue cap with `Retry-After: 1` backpressure is overkill at home; on a busy warehouse fleet, it’s the right shape.
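A minimal sketch of that shape, assuming a FastAPI-style server (illustrative, not Reflex’s server code): a bounded queue feeding one worker, shedding load with `429` and `Retry-After: 1` once full.

```python
# Hypothetical sketch of single-workload backpressure; names and payloads
# are illustrative, not Reflex's actual implementation.
import asyncio

from fastapi import FastAPI, Response

app = FastAPI()
queue: asyncio.Queue = asyncio.Queue(maxsize=1000)  # the 1000-request cap

@app.on_event("startup")
async def start_worker():
    async def worker():
        while True:
            fut = await queue.get()            # one model, one robot, one process
            fut.set_result({"action": [0.0]})  # stand-in for actual inference
    asyncio.create_task(worker())

@app.post("/act")
async def act(response: Response):
    fut = asyncio.get_running_loop().create_future()
    try:
        queue.put_nowait(fut)                  # enqueue without blocking
    except asyncio.QueueFull:
        response.status_code = 429             # shed load; keep tail latency flat
        response.headers["Retry-After"] = "1"
        return {"error": "queue full"}
    return await fut
```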
What we still get right for cloud
Reflex isn’t edge-only:
- All four supported VLAs run on cloud A10G / A100 / H100 with verified parity
- Modal-hosted benchmarks and regression evals are part of the development loop (see `reflex eval`)
- Cloud-hosted serve is fine for stress testing or as a fallback target via `--cloud-fallback`
The trade-off lands on defaults and design priority, not capability. Reflex on H100 works great; it’s just not what the defaults are tuned for.
Hardware support matrix
| Architecture | Compute | Status |
|---|---|---|
| Ampere (RTX 30, A10G, A100) | sm_8.0–8.6 | Supported |
| Ada Lovelace (RTX 40, L4) | sm_8.9 | Supported |
| Hopper (H100, H200) | sm_9.0 | Supported |
| Jetson Orin (Orin Nano / NX / AGX) | sm_8.7 | Supported (JetPack 5.x or 6.x) |
| Jetson Thor | sm_10.x | Untested (Blackwell silicon; ORT-bundled CUDA EP needs Blackwell support) |
| Blackwell (RTX 5090, B200) | sm_12.0 / sm_10.0 | Not yet supported (ORT-bundled cuBLAS/cuDNN don’t ship Blackwell kernels) |
| Older NVIDIA (Turing RTX 20, GTX 16) | sm_7.5 | Best-effort |
See Supported hardware for the full breakdown.
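If you’re unsure which row you fall in, PyTorch can report the compute capability directly (assuming a CUDA build of PyTorch; this is not a Reflex command):

```python
import torch

# Prints the compute capability of GPU 0 in the matrix's sm_X.Y notation.
major, minor = torch.cuda.get_device_capability(0)
print(f"sm_{major}.{minor}")  # e.g. sm_8.7 on Jetson Orin, sm_9.0 on H100
```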