Why edge-first
The dominant pattern in LLM inference is cloud-first: vLLM, Triton, Modal, and Baseten all target datacenter GPUs first. Robotics is the opposite. Most production robot deployments need inference on the robot, not in a datacenter, and that means Jetson Orin / Thor or a desktop NVIDIA GPU bolted to the robot’s chassis. Reflex is built around that constraint.
The deployment-tier pyramid
| Tier | Hardware | Dominant use case | Reflex priority |
|---|---|---|---|
| Edge | Jetson Orin Nano / AGX / Thor | Production deployment on a real robot | Primary |
| Workstation | Desktop NVIDIA (RTX 4090, RTX 5090, RTX PRO 6000) | Researcher’s desk, dev rig | Primary |
| Cloud (datacenter) | A10G, A100, H100 | Benchmarking, training, validation runs | Secondary — supported but not the design center |
Why edge-first matters in practice:
- Latency. Robots can’t wait for a cloud round-trip. Stripe-tier latency to the cloud (~50–100 ms) is fine for `/v1/charge`; it’s a death sentence for a 30 Hz control loop, where the entire loop budget is ~33 ms.
- Privacy / regulation. Surgical robots, household robots, defense: all have legal or contractual reasons not to ship camera frames to the cloud.
- Network availability. Warehouse robots, agricultural drones, exploration robots — none can rely on stable network access.
- Cost. A Jetson Orin AGX is $1,999 once. A cloud A10G runs ~$0.80/hour, which works out to roughly $7,000 a year for one always-on robot and adds up fast at robotics scale.
- Power. Edge silicon is power-budget-aware. A robot can’t spend 300W on inference.
What this means for Reflex’s design
Static-shape ONNX, not dynamic-batch
Reflex exports are shape-specialized at export time: per embodiment, per model, per target. This bakes in better TRT optimizations and avoids the “dynamic batch dim explodes engine size” problem on Jetson’s smaller VRAM.
The cost: cross-shape batching needs separate engines per shape. The win: every shape gets best-in-class TRT FP16 (or FP8 on Thor) without runtime overhead.
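To make the mechanism concrete, here is a minimal sketch of a shape-specialized export, assuming a PyTorch source model. The stand-in `PolicyHead` module and its shapes are illustrative, not Reflex’s actual export code; the key detail is the absence of `dynamic_axes`:

```python
import torch

# Illustrative stand-in for a VLA component; not Reflex's export path.
class PolicyHead(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(1024, 32)

    def forward(self, x):
        return self.proj(x)

model = PolicyHead().eval()
# Static-shape export: batch=1, seq=256 are baked into the graph.
# Omitting dynamic_axes fixes every dimension, so TensorRT can
# specialize the engine for exactly this shape.
torch.onnx.export(
    model,
    torch.randn(1, 256, 1024),
    "policy_head.onnx",
    input_names=["features"],
    output_names=["action"],
    opset_version=17,
    # no dynamic_axes: one engine per embodiment / model / target
)
```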
Memory-fit checks at startup, not runtime
`reflex doctor` flags memory pressure before serve starts. The decomposed pi0.5 export includes a refuse-to-load check on Orin Nano, because the 12.5 GB monolithic model can’t load even in FP16. We’d rather fail loud at startup than silently OOM on the first `/act`.
Cloud-first tools default to “load it and see” — an Orin Nano user can’t afford that.
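As a sketch of the idea (not `reflex doctor`’s actual implementation), a startup fit check on a discrete GPU could look like the following. `MODEL_BYTES` and the headroom factor are assumptions, and on Jetson’s unified memory you would probe system RAM instead of NVML:

```python
import pynvml

MODEL_BYTES = int(12.5 * 1024**3)  # e.g. the 12.5 GB monolithic export
HEADROOM = 1.2                     # assumed margin for activations/workspace

def check_memory_fit(model_bytes: int, device_index: int = 0) -> None:
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        free = pynvml.nvmlDeviceGetMemoryInfo(handle).free
        needed = int(model_bytes * HEADROOM)
        if free < needed:
            # Refuse to load: fail loud at startup, not on the first /act.
            raise RuntimeError(
                f"refusing to load: need ~{needed / 1e9:.1f} GB, "
                f"only {free / 1e9:.1f} GB free"
            )
    finally:
        pynvml.nvmlShutdown()

check_memory_fit(MODEL_BYTES)
```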
CUDA graphs default-off, tier-aware
CUDA graphs add ~128 MB of VRAM overhead per captured graph. On A100, that’s a rounding error. On A10G, it costs the `vlm_prefix` capture (graceful degradation to eager). On Orin Nano, we haven’t validated. So `--cuda-graphs` is opt-in for Phase 1, with explicit per-tier behavior documented.
Cloud-first tools default this on; edge-first tools surface the trade-off.
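For reference, the opt-in capture-with-fallback shape looks roughly like this in PyTorch. `run_step` and the static buffers are hypothetical names, and Reflex’s actual capture logic is not shown here:

```python
import torch

def maybe_capture(run_step, static_input, enable_cuda_graphs: bool):
    """Capture run_step into a CUDA graph if enabled; fall back to eager."""
    if not enable_cuda_graphs:
        return run_step  # default-off: plain eager execution

    try:
        # Warm up on a side stream first, as CUDA graph capture requires.
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            static_output = run_step(static_input)
        torch.cuda.current_stream().wait_stream(s)

        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            static_output = run_step(static_input)

        def replay(x):
            static_input.copy_(x)  # graphs replay fixed memory addresses
            graph.replay()
            return static_output

        return replay
    except RuntimeError:
        # E.g. not enough VRAM for the ~128 MB capture: degrade to eager.
        return run_step
```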
TensorRT EP, not just CUDA EP
A cloud-first VLA stack would settle for ORT-CUDA, which is fine in a datacenter where memory is cheap. Edge-first tools care about the 5.55× ORT-CUDA → ORT-TRT win because that’s the difference between “Jetson works” and “Jetson is too slow.” Reflex pulls `tensorrt>=10.0` in the `[serve,gpu]` extras and patches `LD_LIBRARY_PATH` automatically at import.
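In ONNX Runtime terms, preferring the TensorRT EP with a CUDA fallback looks like the snippet below. The provider option names are standard ORT-TRT options; the model path and settings are illustrative, not Reflex’s actual session config:

```python
import onnxruntime as ort

providers = [
    (
        "TensorrtExecutionProvider",
        {
            "trt_fp16_enable": True,          # FP16 engines on Orin / Ampere
            "trt_engine_cache_enable": True,  # reuse built engines across runs
            "trt_engine_cache_path": "./trt_cache",
        },
    ),
    "CUDAExecutionProvider",  # fallback if TRT can't take a subgraph
    "CPUExecutionProvider",
]
session = ort.InferenceSession("policy_head.onnx", providers=providers)
```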
One-process-per-robot, not multi-tenant
Cloud inference servers are multi-tenant by default: many models, many queues, many requests in flight. Edge robotics is the opposite: one model, one robot, one process. Reflex’s queue/scheduler optimizes for low tail latency on a single workload, not throughput across many. The 1000-request queue cap with `Retry-After: 1` backpressure is overkill at home; on a busy warehouse fleet, it’s the right shape.
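A minimal sketch of that shape, assuming a FastAPI-style server (illustrative, not Reflex’s server code): a bounded queue feeding one worker, shedding load with `429` and `Retry-After: 1` once full.

```python
# Hypothetical sketch of single-workload backpressure; names and payloads
# are illustrative, not Reflex's actual implementation.
import asyncio

from fastapi import FastAPI, Response

app = FastAPI()
queue: asyncio.Queue = asyncio.Queue(maxsize=1000)  # the 1000-request cap

@app.on_event("startup")
async def start_worker():
    async def worker():
        while True:
            fut = await queue.get()            # one model, one robot, one process
            fut.set_result({"action": [0.0]})  # stand-in for actual inference
    asyncio.create_task(worker())

@app.post("/act")
async def act(response: Response):
    fut = asyncio.get_running_loop().create_future()
    try:
        queue.put_nowait(fut)                  # enqueue without blocking
    except asyncio.QueueFull:
        response.status_code = 429             # shed load; keep tail latency flat
        response.headers["Retry-After"] = "1"
        return {"error": "queue full"}
    return await fut
```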
What we still get right for cloud
Reflex isn’t edge-only:
- All four supported VLAs run on cloud A10G / A100 / H100 with verified parity
- Modal-hosted benchmarks and regression evals are part of the development loop (see `reflex eval`)
- Cloud-hosted serve is fine for stress testing or as a fallback target via `--cloud-fallback`
The trade-off lands on defaults and design priority, not capability. Reflex on H100 works great; it’s just not what the defaults are tuned for.
Hardware support matrix
| Architecture | Compute | Status |
|---|---|---|
| Ampere (RTX 30, A10G, A100) | sm_8.0–8.6 | Supported |
| Ada Lovelace (RTX 40, L4) | sm_8.9 | Supported |
| Hopper (H100, H200) | sm_9.0 | Supported |
| Jetson Orin (Orin Nano / NX / AGX) | sm_8.7 | Supported (JetPack 5.x or 6.x) |
| Jetson Thor | sm_10.x | Untested (Blackwell silicon; ORT-bundled CUDA EP needs Blackwell support) |
| Blackwell (RTX 5090, B200) | sm_12.0 / sm_10.0 | Not yet supported (ORT-bundled cuBLAS/cuDNN don’t ship Blackwell kernels) |
| Older NVIDIA (Turing RTX 20, GTX 16) | sm_7.5 | Best-effort |
See Supported hardware for the full breakdown.
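If you’re unsure which row you fall in, PyTorch can report the compute capability directly (assuming a CUDA build of PyTorch; this is not a Reflex command):

```python
import torch

# Prints the compute capability of GPU 0 in the matrix's sm_X.Y notation.
major, minor = torch.cuda.get_device_capability(0)
print(f"sm_{major}.{minor}")  # e.g. sm_8.7 on Jetson Orin, sm_9.0 on H100
```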