
Why edge-first

The dominant pattern in LLM inference is cloud — vLLM, Triton, Modal, Baseten all target datacenter GPUs first. Robotics is the opposite. Most production robot deployments need inference on the robot, not in a datacenter, and that means Jetson Orin / Thor or a desktop NVIDIA GPU bolted to the robot’s chassis. Reflex is built around that constraint.

| Tier | Hardware | Dominant use case | Reflex priority |
|---|---|---|---|
| Edge | Jetson Orin Nano / AGX / Thor | Production deployment on a real robot | Primary |
| Workstation | Desktop NVIDIA (RTX 4090, RTX 5090, RTX PRO 6000) | Researcher’s desk, dev rig | Primary |
| Cloud (datacenter) | A10G, A100, H100 | Benchmarking, training, validation runs | Secondary — supported but not the design center |

Why edge-first matters in practice:

  • Latency. Robots can’t wait for a cloud round-trip. Stripe-tier latency to the cloud (~50–100 ms) is fine for /v1/charge; it’s a control-loop death sentence at 30 Hz.
  • Privacy / regulation. Surgical robots, household robots, defense — all have legal or contractual reasons not to ship camera frames to the cloud.
  • Network availability. Warehouse robots, agricultural drones, exploration robots — none can rely on stable network access.
  • Cost. A Jetson Orin AGX is $1,999 once. A cloud A10G runs ~$0.80/hour, which adds up fast at robotics scale.
  • Power. Edge silicon is power-budget-aware. A robot can’t spend 300W on inference.
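The latency and cost bullets above reduce to quick arithmetic. A back-of-envelope sketch, using only the numbers already quoted (30 Hz loop, ~50–100 ms round-trip, $1,999 Jetson, ~$0.80/hr A10G):

```python
# Back-of-envelope math for the latency and cost bullets above.

CONTROL_HZ = 30
budget_ms = 1000 / CONTROL_HZ       # ~33.3 ms per control tick
cloud_rtt_ms = (50, 100)            # typical cloud round-trip range

# Even the optimistic round-trip exceeds the whole tick budget:
assert all(rtt > budget_ms for rtt in cloud_rtt_ms)

# Cost break-even: one-time Jetson Orin AGX vs. hourly cloud A10G.
jetson_usd = 1999
a10g_usd_per_hour = 0.80
break_even_hours = jetson_usd / a10g_usd_per_hour   # ~2,500 hours
break_even_days = break_even_hours / 24             # ~104 days of 24/7 use
```

At continuous robotics duty cycles, the cloud GPU passes the Jetson's sticker price in about three and a half months.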

Reflex exports are shape-specialized at export time. Per-embodiment, per-model, per-target. This bakes in better TRT optimizations and avoids the “dynamic batch dim explodes engine size” problem on Jetson’s smaller VRAM.

The cost: cross-shape batching needs separate engines per shape. The win: every shape gets best-in-class TRT FP16 (or FP8 on Thor) without runtime overhead.
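One way to picture the per-shape specialization. This is a minimal sketch of the idea, not Reflex's actual registry: the key fields, helper names, and engine path are illustrative.

```python
# Sketch: one TRT engine per (model, embodiment, batch) tuple.
# Static shapes let TRT pick the best kernels; the cost is one engine
# per shape instead of one dynamic-shape engine serving all of them.

from typing import NamedTuple

class EngineKey(NamedTuple):
    model: str        # e.g. "pi0.5"
    embodiment: str   # baked in at export time
    batch: int        # static batch dim, no runtime rebatching

engines: dict[EngineKey, str] = {}

def register(key: EngineKey, path: str) -> None:
    engines[key] = path

def lookup(key: EngineKey) -> str:
    # Cross-shape batching is impossible: a request with an unseen
    # shape needs its own engine, exported ahead of time.
    if key not in engines:
        raise KeyError(f"no engine exported for {key}")
    return engines[key]

register(EngineKey("pi0.5", "so100", 1), "pi05_so100_b1.plan")
```

The explicit lookup failure is the point: an unexported shape is a loud export-time gap, not a silent slow path.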

reflex doctor flags memory pressure before serve starts. The decomposed pi0.5 export includes a refuse-to-load check on Orin Nano because the 12.5 GB monolithic model can’t load even in FP16. We’d rather fail loudly at startup than silently OOM on the first /act.

Cloud-first tools default to “load it and see” — an Orin Nano user can’t afford that.
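The shape of a refuse-to-load check is simple. A minimal sketch, with a hypothetical helper and headroom figure (only the 12.5 GB model size comes from above; real checks also need to account for activations, CUDA context, and workspace memory):

```python
def preflight_vram_check(model_gb: float, device_vram_gb: float,
                         headroom_gb: float = 1.0) -> None:
    """Fail loudly at startup instead of OOM-ing on the first request.

    Hypothetical helper, not the actual reflex doctor implementation.
    """
    if model_gb + headroom_gb > device_vram_gb:
        raise MemoryError(
            f"model needs ~{model_gb:.1f} GB plus {headroom_gb:.1f} GB "
            f"headroom, device has {device_vram_gb:.1f} GB; refusing to load"
        )
```

A 12.5 GB monolith against an 8 GB device fails here, at startup, with a message naming the numbers, rather than mid-episode on the robot.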

CUDA graphs add ~128 MB of VRAM overhead per captured graph. On A100, that’s a rounding error. On A10G, it’s enough to force dropping the vlm_prefix capture (graceful degradation to eager execution). On Orin Nano, we haven’t validated it at all. So --cuda-graphs is opt-in for Phase 1, with explicit per-tier behavior documented.

Cloud-first tools default-on; edge-first tools surface the trade-off.
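The per-tier behavior amounts to a capture-budget decision. An illustrative planner, assuming only the ~128 MB per-graph figure from above (the candidate graph names and budget are hypothetical):

```python
GRAPH_OVERHEAD_MB = 128  # approx VRAM cost per captured CUDA graph

def plan_captures(candidates: list[str], free_vram_mb: int) -> dict[str, str]:
    """Per graph: capture if the overhead fits, otherwise run eager.

    Hypothetical planner sketching graceful degradation rather
    than hard failure when VRAM is tight.
    """
    plan, budget = {}, free_vram_mb
    for name in candidates:
        if budget >= GRAPH_OVERHEAD_MB:
            plan[name] = "captured"
            budget -= GRAPH_OVERHEAD_MB
        else:
            plan[name] = "eager"   # degrade gracefully, keep serving
    return plan
```

With 200 MB free, the first candidate gets captured and the second falls back to eager, which is the A10G behavior described above in miniature.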

A cloud-first VLA stack would settle for ORT-CUDA, which is fine in a datacenter where memory is cheap. Edge-first tools care about the 5.55× ORT-CUDA → ORT-TRT win because that’s the difference between “Jetson works” and “Jetson is too slow.” Reflex pulls tensorrt>=10.0 in the [serve,gpu] extras and patches LD_LIBRARY_PATH automatically at import.
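The TRT-over-CUDA preference can be expressed as an ordered fallback list. A sketch using ONNX Runtime's standard execution-provider names (passing a providers list to an InferenceSession is ordinary ORT usage; the helper itself is illustrative, not Reflex's code):

```python
# Prefer the TensorRT EP, fall back to plain CUDA EP, then CPU.
PREFERRED = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]

def choose_providers(available: list[str]) -> list[str]:
    """Order the available ORT execution providers by preference."""
    return [p for p in PREFERRED if p in available]

# With onnxruntime installed, the result would feed straight into:
#   ort.InferenceSession("model.onnx",
#       providers=choose_providers(ort.get_available_providers()))
```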

Cloud inference servers are multi-tenant by default — many models, many queues, many requests in flight. Edge robotics is the opposite — one model, one robot, one process. Reflex’s queue/scheduler optimizes for low tail latency on a single workload, not throughput across many. The 1000-request queue cap with Retry-After: 1 backpressure is overkill at home; on a busy warehouse fleet, it’s the right shape.
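That backpressure shape is small enough to sketch. A hypothetical handler; only the 1000-item cap and the Retry-After: 1 header come from the text, the class and method names are illustrative:

```python
from collections import deque

MAX_QUEUE = 1000  # bounded depth: refuse rather than buffer unboundedly

class ActQueue:
    """Single-workload queue with explicit backpressure."""

    def __init__(self) -> None:
        self._q: deque = deque()

    def submit(self, request) -> tuple[int, dict]:
        """Return (HTTP status, extra headers) for an incoming request."""
        if len(self._q) >= MAX_QUEUE:
            # Tell the client to back off and retry in one second,
            # instead of letting tail latency grow without bound.
            return 429, {"Retry-After": "1"}
        self._q.append(request)
        return 202, {}
```

Rejecting at a known depth keeps the tail latency of accepted requests predictable, which matters more than raw acceptance rate on a control path.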

Reflex isn’t only edge:

  • All four supported VLAs run on cloud A10G / A100 / H100 with verified parity
  • Modal-hosted benchmarks and regression evals are part of the development loop (see reflex eval)
  • Cloud-hosted serve is fine for stress testing or as a fallback target via --cloud-fallback

The trade-off shows up in defaults and design priority, not capability. Reflex on H100 works great; it’s just not the workload the trade-offs are optimized for.

| Architecture | Compute | Status |
|---|---|---|
| Ampere (RTX 30, A10G, A100) | sm_80–sm_86 | Supported |
| Ada Lovelace (RTX 40, L4) | sm_89 | Supported |
| Hopper (H100, H200) | sm_90 | Supported |
| Jetson Orin (Orin Nano / NX / AGX) | sm_87 | Supported (JetPack 5.x or 6.x) |
| Jetson Thor | sm_10x | Untested (Blackwell silicon; ORT-bundled CUDA EP needs Blackwell support) |
| Blackwell desktop (RTX 5090, B200) | sm_100 | Not yet supported (ORT-bundled cuBLAS/cuDNN don’t ship sm_100 kernels) |
| Older NVIDIA (Turing RTX 20, GTX 16) | sm_75 | Best-effort |

See Supported hardware for the full breakdown.