
serve — runtime + safety

reflex serve <export_dir> is the production entry point. It reads the export, picks the right ONNX provider, applies the embodiment config, and listens on :8000 for /act, /health, /config, and /metrics.

```sh
reflex serve ./my-export/
```

With no flags, /act returns raw, unscaled actions. Fine for smoke tests.

```sh
reflex serve ./my-export/ \
  --embodiment franka \
  --safety-config ./robot_limits.json \
  --adaptive-steps \
  --deadline-ms 33 \
  --cloud-fallback http://cloud:8000 \
  --inject-latency-ms 0 \
  --record /tmp/traces \
  --max-consecutive-crashes 5 \
  --cuda-graphs \
  --auto-calibrate \
  --slo p99=150ms \
  --robot-id warehouse-01
```
| Flag | Wedge | Doc |
| --- | --- | --- |
| `--embodiment` | per-robot action ranges + ActionGuard clamping | embodiments |
| `--safety-config` | URDF-derived joint limits + EU AI Act audit log | guard |
| `--adaptive-steps` | stop denoise loop early on velocity convergence | this page |
| `--deadline-ms` | return last-known-good action if over budget | this page |
| `--cloud-fallback` | edge-first with cloud backup | this page |
| `--inject-latency-ms` | synthetic delay (matches A2C2 paper methodology) | a2c2 |
| `--record` | JSONL request/response capture | record & replay |
| `--max-consecutive-crashes` | circuit breaker (503 + `Retry-After: 60` on trip) | this page |
| `--cuda-graphs` | capture and replay ORT sessions | cuda graphs |
| `--auto-calibrate` | hardware-tier auto-config | auto-calibrate |
| `--slo` | rolling p99 enforcement | SLO |
| `--robot-id` | per-robot Prometheus identity | fleet |

Every response surfaces telemetry from each enabled wedge. Example:

```json
{
  "actions": [[...], [...], ...],
  "latency_ms": 11.9,
  "inference_mode": "onnx_trt_fp16",
  "guard_clamped": false,
  "guard_violations": [],
  "adaptive_enabled": true,
  "adaptive_steps_used": 7,
  "injected_latency_ms": 0,
  "robot_id": "warehouse-01"
}
```
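A client can branch on these fields before handing actions to the controller. A minimal sketch, using the field names from the example above (the budget threshold and helper name are assumptions, not part of the reflex API):

```python
import json

def check_telemetry(raw: str, budget_ms: float = 33.0) -> list:
    """Flag telemetry conditions worth alerting on (client-side sketch)."""
    t = json.loads(raw)
    warnings = []
    if t["latency_ms"] > budget_ms:
        warnings.append(f"over budget: {t['latency_ms']}ms > {budget_ms}ms")
    if t["guard_clamped"]:
        # ActionGuard clamped the chunk; violations list which limits were hit
        warnings.append(f"guard clamped: {t['guard_violations']}")
    return warnings

sample = '{"latency_ms": 11.9, "guard_clamped": false, "guard_violations": []}'
print(check_telemetry(sample))  # []
```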
| Path | Method | Returns |
| --- | --- | --- |
| `/act` | POST | Action chunk + telemetry |
| `/health` | GET | `{status, model_loaded, inference_mode, robot_id}` |
| `/config` | GET | Saved `reflex_config.json` |
| `/metrics` | GET | Prometheus exposition format |

/health returns HTTP 503 during the cold-start warmup window (10-70 seconds depending on the model), so load balancers skip the server until it is ready.
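A deploy script can exploit this by polling /health until it flips to 200. A sketch, with the probe and timing hooks parameterized for clarity (the real probe would issue a GET against `http://host:8000/health`):

```python
import time

def wait_until_ready(probe, timeout_s=90.0, interval_s=2.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll /health until it returns 200; 503 means the server is still
    warming up. `probe` returns an HTTP status code."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if probe() == 200:
            return True   # model loaded, safe to route traffic
        sleep(interval_s)
    return False          # warmup exceeded our patience

# Simulated warmup: two 503s, then a healthy 200.
codes = iter([503, 503, 200])
print(wait_until_ready(lambda: next(codes), timeout_s=5.0, interval_s=0.0))  # True
```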

Adaptive steps (--adaptive-steps)

For flow-matching models (SmolVLA, pi0, pi0.5), the denoise loop normally runs 10 fixed Euler steps. --adaptive-steps measures velocity-field convergence between consecutive steps and stops early once the velocity stabilizes. Per-step telemetry exposes how many steps were actually taken.

Typical wins: 30-40% fewer steps on cache-hit calls; no savings on cache-miss calls. Composes with --cuda-graphs (the captured graph still benefits when the loop exits early).
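The early-exit logic can be sketched as a plain Euler loop with a convergence check between consecutive velocity estimates (the tolerance value and function names here are illustrative assumptions, not reflex internals):

```python
import math

def denoise(velocity_fn, x0, steps=10, dt=0.1, tol=0.1, adaptive=True):
    """Euler-integrate a flow-matching velocity field, optionally stopping
    early when consecutive velocity estimates converge (sketch)."""
    x, v_prev = list(x0), None
    for step in range(1, steps + 1):
        v = velocity_fn(x)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        if adaptive and v_prev is not None:
            delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(v, v_prev)))
            if delta < tol:
                return x, step  # velocity stabilized -> stop early
        v_prev = v
    return x, steps

# A decaying field converges before the fixed 10-step budget.
x, used = denoise(lambda x: [-xi for xi in x], [1.0] * 4)
print(used)  # fewer than 10 steps taken
```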

Deadline (--deadline-ms)

A hard wall on per-/act latency. If inference doesn’t return within the budget, the server returns the last-known-good action chunk and increments a Prometheus counter. Use it when your downstream control loop has a strict tick budget and a stale action is better than a missed tick.

```sh
reflex serve ./my-export/ --deadline-ms 33   # 30 Hz tick budget
```
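The last-known-good mechanism amounts to racing inference against the budget and falling back to the previous chunk on a miss. A minimal sketch (class and counter names are assumptions; the real server's internals may differ):

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

class DeadlineGuard:
    """Return the last-known-good action chunk if inference misses the
    budget (sketch of the --deadline-ms behavior)."""
    def __init__(self, infer, deadline_ms):
        self._infer = infer
        self._deadline_s = deadline_ms / 1000.0
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._last_good = None
        self.deadline_misses = 0  # what the Prometheus counter would track

    def act(self, obs):
        future = self._pool.submit(self._infer, obs)
        try:
            self._last_good = future.result(timeout=self._deadline_s)
            return self._last_good, False   # fresh action, on time
        except TimeoutError:
            self.deadline_misses += 1
            return self._last_good, True    # stale action, but on time

guard = DeadlineGuard(lambda obs: time.sleep(0.001) or [[0.1] * 7], deadline_ms=33)
actions, stale = guard.act({"image": "..."})
print(stale)  # False: inference beat the 33 ms budget
```

Note the timed-out inference keeps running in the worker thread; this sketch simply stops waiting for it.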

Cloud fallback (--cloud-fallback)

Edge-first deployment with a cloud backup. If local inference fails (OOM, crashed engine, timeout), the request transparently retries against the cloud URL. Useful for “my Jetson is the primary but I want a cloud safety net.”

```sh
reflex serve ./my-export/ --cloud-fallback https://cloud.example.com
```

The cloud endpoint must speak the same /act API as the local server (run reflex serve there too).
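Because both endpoints speak the same /act API, the fallback reduces to try-local-then-cloud. A sketch with the two backends abstracted as callables (function names are illustrative):

```python
def act_with_fallback(obs, local_act, cloud_act):
    """Edge-first: try local inference, transparently retry against the
    cloud /act endpoint on any failure (sketch)."""
    try:
        return local_act(obs), "local"
    except Exception:
        # OOM, crashed engine, timeout -> same request goes to the cloud
        return cloud_act(obs), "cloud"

def broken_local(obs):
    raise RuntimeError("engine crashed")

actions, source = act_with_fallback({}, broken_local, lambda obs: [[0.0] * 7])
print(source)  # "cloud"
```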

Circuit breaker (--max-consecutive-crashes)


If the model raises N consecutive exceptions, the server enters a degraded state, returns HTTP 503 with Retry-After: 60, and reports degraded on /health. Reset by a successful /act call after the cooldown.

The default is 5. Set it lower (e.g. 2) for a paranoid early-warning posture, or higher (e.g. 10) for a permissive one.
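The trip-and-reset behavior can be sketched as a small state machine (class name, clock injection, and status-code plumbing are assumptions for illustration):

```python
import time

class CircuitBreaker:
    """Trip to a degraded state after N consecutive model exceptions;
    serve 503 with Retry-After: 60 until the cooldown passes and a call
    succeeds (sketch of --max-consecutive-crashes)."""
    def __init__(self, max_consecutive=5, cooldown_s=60.0, clock=time.monotonic):
        self.max_consecutive = max_consecutive
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._tripped_at = None

    def call(self, infer, obs):
        if self._tripped_at is not None:
            if self._clock() - self._tripped_at < self.cooldown_s:
                return 503, {"Retry-After": "60"}  # degraded: reject fast
        try:
            result = infer(obs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_consecutive:
                self._tripped_at = self._clock()   # breaker trips
            return 500, {}
        self._failures, self._tripped_at = 0, None  # success resets
        return 200, result

now = [0.0]
cb = CircuitBreaker(max_consecutive=2, cooldown_s=60.0, clock=lambda: now[0])
def crash(obs): raise RuntimeError("model OOM")
cb.call(crash, {}); cb.call(crash, {})       # two consecutive failures -> trip
print(cb.call(lambda o: [[0.0]], {})[0])     # 503 while degraded
```

After the cooldown the breaker lets one real call through; only if it succeeds does the state reset.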