
serve — runtime + safety

reflex serve <export_dir> is the production entry point. It reads the export, picks the right ONNX provider, applies the embodiment config, and listens on :8000 for /act, /health, /config, and /metrics.

```sh
reflex serve ./my-export/
```

With no flags, /act returns raw, unscaled actions. Fine for smoke tests.

```sh
reflex serve ./my-export/ \
  --embodiment franka \
  --safety-config ./robot_limits.json \
  --adaptive-steps \
  --deadline-ms 33 \
  --cloud-fallback http://cloud:8000 \
  --inject-latency-ms 0 \
  --record /tmp/traces \
  --max-consecutive-crashes 5 \
  --cuda-graphs \
  --auto-calibrate \
  --slo p99=150ms \
  --robot-id warehouse-01
```
| Flag | Wedge | Doc |
| --- | --- | --- |
| `--embodiment` | per-robot action ranges + ActionGuard clamping | embodiments |
| `--safety-config` | URDF-derived joint limits + EU AI Act audit log | guard |
| `--adaptive-steps` | stop denoise loop early on velocity convergence | this page |
| `--deadline-ms` | return last-known-good action if over budget | this page |
| `--cloud-fallback` | edge-first with cloud backup | this page |
| `--inject-latency-ms` | synthetic delay (matches A2C2 paper methodology) | a2c2 |
| `--record` | JSONL request/response capture | record & replay |
| `--max-consecutive-crashes` | circuit breaker (503 + `Retry-After: 60` on trip) | this page |
| `--cuda-graphs` | capture and replay ORT sessions | cuda graphs |
| `--auto-calibrate` | hardware-tier auto-config | auto-calibrate |
| `--slo` | rolling p99 enforcement | SLO |
| `--robot-id` | per-robot Prometheus identity | fleet |

Every response surfaces telemetry from each enabled wedge. Example:

```json
{
  "actions": [[...], [...], ...],
  "latency_ms": 11.9,
  "inference_mode": "onnx_trt_fp16",
  "guard_clamped": false,
  "guard_violations": [],
  "adaptive_enabled": true,
  "adaptive_steps_used": 7,
  "injected_latency_ms": 0,
  "robot_id": "warehouse-01"
}
```
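A client can branch on these fields before handing actions to the controller. A minimal sketch, using the field names from the example above (the budget threshold and helper name are assumptions, not part of the reflex API):

```python
import json

def check_telemetry(raw: str, budget_ms: float = 33.0) -> list:
    """Flag telemetry conditions worth alerting on (client-side sketch)."""
    t = json.loads(raw)
    warnings = []
    if t["latency_ms"] > budget_ms:
        warnings.append(f"over budget: {t['latency_ms']}ms > {budget_ms}ms")
    if t["guard_clamped"]:
        # ActionGuard clamped the chunk; violations list which limits were hit
        warnings.append(f"guard clamped: {t['guard_violations']}")
    return warnings

sample = '{"latency_ms": 11.9, "guard_clamped": false, "guard_violations": []}'
print(check_telemetry(sample))  # []
```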
| Path | Method | Returns |
| --- | --- | --- |
| `/act` | POST | Action chunk + telemetry |
| `/health` | GET | `{status, model_loaded, inference_mode, robot_id}` |
| `/config` | GET | Saved `reflex_config.json` |
| `/metrics` | GET | Prometheus exposition format |

/health returns HTTP 503 during the cold-start warmup window (10-70 seconds depending on the model), so load balancers skip the server until it is ready.
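A deploy script can exploit this by polling /health until it flips to 200. A sketch, with the probe and timing hooks parameterized for clarity (the real probe would issue a GET against `http://host:8000/health`):

```python
import time

def wait_until_ready(probe, timeout_s=90.0, interval_s=2.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll /health until it returns 200; 503 means the server is still
    warming up. `probe` returns an HTTP status code."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if probe() == 200:
            return True   # model loaded, safe to route traffic
        sleep(interval_s)
    return False          # warmup exceeded our patience

# Simulated warmup: two 503s, then a healthy 200.
codes = iter([503, 503, 200])
print(wait_until_ready(lambda: next(codes), timeout_s=5.0, interval_s=0.0))  # True
```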

Adaptive steps (--adaptive-steps)

For flow-matching models (SmolVLA, pi0, pi0.5), the denoise loop normally runs 10 fixed Euler steps. --adaptive-steps measures velocity-field convergence between consecutive steps and stops early once the velocity stabilizes. Per-step telemetry exposes how many steps were actually taken.

Typical wins: 30-40% fewer steps on cache-hit calls; no savings on cache-miss calls. Composes with --cuda-graphs (the captured graph still benefits when the loop exits early).
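The early-exit logic can be sketched as a plain Euler loop with a convergence check between consecutive velocity estimates (the tolerance value and function names here are illustrative assumptions, not reflex internals):

```python
import math

def denoise(velocity_fn, x0, steps=10, dt=0.1, tol=0.1, adaptive=True):
    """Euler-integrate a flow-matching velocity field, optionally stopping
    early when consecutive velocity estimates converge (sketch)."""
    x, v_prev = list(x0), None
    for step in range(1, steps + 1):
        v = velocity_fn(x)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        if adaptive and v_prev is not None:
            delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(v, v_prev)))
            if delta < tol:
                return x, step  # velocity stabilized -> stop early
        v_prev = v
    return x, steps

# A decaying field converges before the fixed 10-step budget.
x, used = denoise(lambda x: [-xi for xi in x], [1.0] * 4)
print(used)  # fewer than 10 steps taken
```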

Deadline (--deadline-ms)

A hard wall on per-/act latency. If inference doesn’t return within the budget, the server returns the last-known-good action chunk and increments a Prometheus counter. Use it when your downstream control loop has a strict tick budget and a stale action is better than a missed tick.

```sh
reflex serve ./my-export/ --deadline-ms 33   # 30 Hz tick budget
```
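The last-known-good mechanism amounts to racing inference against the budget and falling back to the previous chunk on a miss. A minimal sketch (class and counter names are assumptions; the real server's internals may differ):

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

class DeadlineGuard:
    """Return the last-known-good action chunk if inference misses the
    budget (sketch of the --deadline-ms behavior)."""
    def __init__(self, infer, deadline_ms):
        self._infer = infer
        self._deadline_s = deadline_ms / 1000.0
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._last_good = None
        self.deadline_misses = 0  # what the Prometheus counter would track

    def act(self, obs):
        future = self._pool.submit(self._infer, obs)
        try:
            self._last_good = future.result(timeout=self._deadline_s)
            return self._last_good, False   # fresh action, on time
        except TimeoutError:
            self.deadline_misses += 1
            return self._last_good, True    # stale action, but on time

guard = DeadlineGuard(lambda obs: time.sleep(0.001) or [[0.1] * 7], deadline_ms=33)
actions, stale = guard.act({"image": "..."})
print(stale)  # False: inference beat the 33 ms budget
```

Note the timed-out inference keeps running in the worker thread; this sketch simply stops waiting for it.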

Cloud fallback (--cloud-fallback)

Edge-first deployment with a cloud backup. If local inference fails (OOM, crashed engine, timeout), the request transparently retries against the cloud URL. Useful for “my Jetson is the primary but I want a cloud safety net.”

```sh
reflex serve ./my-export/ --cloud-fallback https://cloud.example.com
```

The cloud endpoint must speak the same /act API as the local server (run reflex serve there too).
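Because both endpoints speak the same /act API, the fallback reduces to try-local-then-cloud. A sketch with the two backends abstracted as callables (function names are illustrative):

```python
def act_with_fallback(obs, local_act, cloud_act):
    """Edge-first: try local inference, transparently retry against the
    cloud /act endpoint on any failure (sketch)."""
    try:
        return local_act(obs), "local"
    except Exception:
        # OOM, crashed engine, timeout -> same request goes to the cloud
        return cloud_act(obs), "cloud"

def broken_local(obs):
    raise RuntimeError("engine crashed")

actions, source = act_with_fallback({}, broken_local, lambda obs: [[0.0] * 7])
print(source)  # "cloud"
```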

Circuit breaker (--max-consecutive-crashes)


If the model raises N consecutive exceptions, the server enters a degraded state, returns HTTP 503 with Retry-After: 60, and reports degraded on /health. Reset by a successful /act call after the cooldown.

The default is 5. Set it lower (e.g. 2) for a paranoid early-warning posture, or higher (e.g. 10) for a permissive one.
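The trip-and-reset behavior can be sketched as a small state machine (class name, clock injection, and status-code plumbing are assumptions for illustration):

```python
import time

class CircuitBreaker:
    """Trip to a degraded state after N consecutive model exceptions;
    serve 503 with Retry-After: 60 until the cooldown passes and a call
    succeeds (sketch of --max-consecutive-crashes)."""
    def __init__(self, max_consecutive=5, cooldown_s=60.0, clock=time.monotonic):
        self.max_consecutive = max_consecutive
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._tripped_at = None

    def call(self, infer, obs):
        if self._tripped_at is not None:
            if self._clock() - self._tripped_at < self.cooldown_s:
                return 503, {"Retry-After": "60"}  # degraded: reject fast
        try:
            result = infer(obs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_consecutive:
                self._tripped_at = self._clock()   # breaker trips
            return 500, {}
        self._failures, self._tripped_at = 0, None  # success resets
        return 200, result

now = [0.0]
cb = CircuitBreaker(max_consecutive=2, cooldown_s=60.0, clock=lambda: now[0])
def crash(obs): raise RuntimeError("model OOM")
cb.call(crash, {}); cb.call(crash, {})       # two consecutive failures -> trip
print(cb.call(lambda o: [[0.0]], {})[0])     # 503 while degraded
```

After the cooldown the breaker lets one real call through; only if it succeeds does the state reset.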