auto-calibrate

reflex serve --auto-calibrate picks the right pre-shipped configuration for your hardware + embodiment, then passively learns latency_compensation_ms from real /act traffic. One-flag first-deploy DX win.

Per ADR 2026-04-25-auto-calibration-architecture: selection, not tuning. A strict partial-order resolver plus passive actuator-latency observation (no boot-time probe).

# Calibrate + serve. First run: ~5-7s for the measurement pass.
# Subsequent runs hit the cache and start instantly.
reflex serve ./my-export/ --embodiment franka --auto-calibrate

# Override cache location (e.g., to ship a frozen cache inside a container)
reflex serve ./my-export/ --auto-calibrate \
  --calibration-cache /opt/reflex/calibration.json

# Force re-measurement on next start (after hardware swap or driver upgrade)
reflex serve ./my-export/ --auto-calibrate --calibrate-force

When --auto-calibrate is unset, behavior is unchanged from the baseline. Phase 1 is opt-in; Phase 1.5 will flip it to default-on.

Five parameters are resolved in a strict partial order, where each downstream choice is narrowed by the choices above it:

| Parameter | Choices | Selected by |
| --- | --- | --- |
| variant | fp16, int8, fp8 | Hardware + which .onnx files exist. fp8 only on sm_89+ (Ada Lovelace / Hopper / Blackwell). |
| provider | TensorRT-EP, CUDA-EP, CPU-EP | Variant + --max-batch. TRT-EP requires fp16 + batch=1. |
| nfe (denoise steps) | 1, 2, 4, 8, 10 | Largest NFE such that nfe × measured_expert_step_ms ≤ chunk_period × 0.7. Falls back to NFE=1 (forces the SnapFlow path) when no candidate fits. |
| chunk_size | embodiment default | Default unless even NFE=1 doesn't fit; then halves. |
| latency_compensation_ms | embodiment cold-start (franka=40, so100=60, ur5=40) | Cold-start at startup; warm-updated from real /act p95 after 30 s of traffic. |
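The NFE rule in the table can be sketched as follows. This is a hypothetical helper, not the real reflex internals; `select_nfe` and its parameter names are illustrative, and the 0.7 budget factor and candidate list come from the table above.

```python
# Pick the largest NFE whose total denoise time fits 70% of the chunk period
# (sketch of the selection rule; not the actual reflex resolver code).
NFE_CANDIDATES = [10, 8, 4, 2, 1]

def select_nfe(measured_expert_step_ms: float,
               control_frequency_hz: float,
               chunk_size: int) -> int:
    """Largest NFE such that nfe * step_ms <= chunk_period_ms * 0.7."""
    chunk_period_ms = chunk_size * 1000.0 / control_frequency_hz
    budget_ms = chunk_period_ms * 0.7
    for nfe in NFE_CANDIDATES:
        if nfe * measured_expert_step_ms <= budget_ms:
            return nfe
    return 1  # no candidate fits -> forces the SnapFlow path

# Illustrative numbers: 90 ms/step at 50 Hz with chunk_size=50
# gives a 700 ms budget, so NFE=4 is the largest that fits.
print(select_nfe(90.0, 50.0, 50))  # -> 4
```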

The resolver runs single-pass at startup. The warm-up tracker continuously refines latency_compensation_ms and writes back to the cache when stable.

The cache lives at ~/.reflex/calibration.json by default. JSON schema v1:

{
  "schema_version": 1,
  "reflex_version": "0.7.0",
  "calibration_date": "2026-04-25T16:30:00.000000Z",
  "hardware_fingerprint": {
    "gpu_uuid": "GPU-abc123...",
    "gpu_name": "NVIDIA A10G",
    "driver_version_major": 535,
    "driver_version_minor": 129,
    "cuda_version_major": 12,
    "cuda_version_minor": 2,
    "cpu_count": 8,
    "ram_gb": 32
  },
  "entries": {
    "franka::pi05_decomposed_libero_v3": {
      "chunk_size": 50,
      "nfe": 4,
      "latency_compensation_ms": 42.5,
      "provider": "TensorrtExecutionProvider",
      "variant": "fp16",
      "measurement_quality": {
        "warmup_iters": 10,
        "measurement_iters": 100,
        "median_ms": 38.2,
        "p99_ms": 47.1,
        "n_outliers_dropped": 10,
        "quality_score": 0.94
      }
    }
  }
}

Entries are keyed by {embodiment}::{model_hash}, so multi-embodiment / multi-model deployments coexist in one cache file.
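A minimal sketch of that lookup, assuming a `load_entry` helper (the function name and signature are illustrative, not part of the reflex API):

```python
# Look up one calibration entry by its {embodiment}::{model_hash} key.
# Returns None when this embodiment/model pair has not been calibrated yet.
import json
from pathlib import Path

def load_entry(cache_path: Path, embodiment: str, model_hash: str):
    cache = json.loads(cache_path.read_text())
    return cache["entries"].get(f"{embodiment}::{model_hash}")
```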

# Pretty-print
reflex doctor --show-calibration

# Machine-readable (for CI / scripts)
reflex doctor --show-calibration --format json | jq .

The cache is stale when ANY of these is true:

- Hardware fingerprint mismatches the current host (GPU swap, driver major-version bump, reflex_version change)
- Cache age > 30 days
- --calibrate-force is set

Driver patch-version bumps are ignored.
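The three staleness conditions can be sketched as one predicate. This is a hypothetical helper over the schema shown earlier, not the real reflex code; the 30-day limit and the force flag follow the list above, and patch-version bumps are ignored simply because the fingerprint only stores the driver's major and minor components.

```python
# A cache is stale if the fingerprint or reflex_version mismatches,
# it is older than 30 days, or --calibrate-force was passed.
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=30)

def is_stale(cache: dict, current_fingerprint: dict,
             current_version: str, force: bool) -> bool:
    if force:
        return True
    if cache["reflex_version"] != current_version:
        return True
    if cache["hardware_fingerprint"] != current_fingerprint:
        return True
    calibrated_at = datetime.fromisoformat(
        cache["calibration_date"].replace("Z", "+00:00"))
    return datetime.now(timezone.utc) - calibrated_at > MAX_AGE
```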

Precedence: explicit CLI flags > calibration cache > embodiment_config default.

If you pass --chunk-size 40 AND --auto-calibrate, the explicit 40 wins. A yellow stderr warning makes this visible at startup:

[auto-calibrate] ignoring cached chunk_size=50 because --chunk-size=40
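The precedence chain for a single parameter can be sketched like this (a hypothetical `resolve_chunk_size` helper; only the warning text mirrors the actual startup message shown above):

```python
# Explicit CLI flag > calibration cache > embodiment_config default.
from typing import Optional

def resolve_chunk_size(cli_value: Optional[int],
                       cached_value: Optional[int],
                       embodiment_default: int) -> int:
    if cli_value is not None:
        if cached_value is not None and cached_value != cli_value:
            print(f"[auto-calibrate] ignoring cached chunk_size={cached_value} "
                  f"because --chunk-size={cli_value}")
        return cli_value
    if cached_value is not None:
        return cached_value
    return embodiment_default

print(resolve_chunk_size(40, 50, 50))  # explicit flag wins -> 40
```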

Per the ADR, there is no active probe at boot: that would risk an unintended first move on a robot in an unsafe pose. Instead:

  1. Server starts with the embodiment cold-start default for latency_compensation_ms
  2. As real /act traffic flows, the warmup tracker records each request’s wall-clock latency in a rolling 100-sample window
  3. Once 30+ samples accumulate AND the rolling p95 has been stable within 5 ms for 3 consecutive checks, the tracker writes the new value to the cache and persists it atomically
  4. Subsequent server restarts pick up the warmed-up value
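The steps above can be sketched as a small tracker class. This is a hypothetical illustration of the described behavior (not the reflex implementation); the 100-sample window, 30-sample minimum, 5 ms tolerance, and 3-check stability requirement come directly from the text.

```python
# Passive warm-up tracker: record /act wall-clock latencies, and once the
# rolling p95 is stable, adopt it as the new latency_compensation_ms.
from collections import deque
import statistics

class WarmupTracker:
    def __init__(self, cold_start_ms: float):
        self.value_ms = cold_start_ms      # embodiment cold-start default
        self.window = deque(maxlen=100)    # rolling 100-sample window
        self.last_p95 = None
        self.stable_checks = 0

    def record(self, latency_ms: float) -> None:
        self.window.append(latency_ms)

    def check(self) -> bool:
        """True when the rolling p95 is stable enough to persist."""
        if len(self.window) < 30:
            return False
        # 19 cut points divide the data into 20 intervals; index 18 is p95.
        p95 = statistics.quantiles(self.window, n=20)[18]
        if self.last_p95 is not None and abs(p95 - self.last_p95) <= 5.0:
            self.stable_checks += 1
        else:
            self.stable_checks = 0
        self.last_p95 = p95
        if self.stable_checks >= 3:
            self.value_ms = p95  # would be written back to the cache here
            return True
        return False
```

Until `check()` returns True, the server keeps serving with the cold-start default, so the first request is already covered by a reasonable bound.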

Customer experience: works correctly from the first request (cold-start default is a reasonable bound), gets steadily more accurate over the first ~30 seconds.

Approximate expectations on Franka:

| Hardware | variant | provider | NFE | chunk_size | latency_comp_ms (warmed) |
| --- | --- | --- | --- | --- | --- |
| Cloud A100-80GB | fp16 | TRT-EP | 10 | 50 | 25-40 |
| Cloud A10G-24GB | fp16 | TRT-EP | 4-8 | 50 | 35-50 |
| Jetson AGX Orin (64GB) | fp16 | TRT-EP | 4 | 50 | 50-70 |
| Jetson Orin Nano (8GB) | fp16 | CUDA-EP | 1 (SnapFlow path) | 50 | 80-120 |
| H100 (Hopper) | fp8 | TRT-EP | 10 | 50 | 20-30 |

Illustrative. Your numbers depend on the export, the model, and customer-specific traffic shape.

auto-calibrate: cache stale (fingerprint mismatch ...) — expected after hardware swap or driver upgrade. Re-measurement runs on the next start.

No NFE fits budget — falling back to NFE=1 — your hardware × embodiment × model has no legal NFE that fits the chunk-period budget. Switch to a SnapFlow-distilled student (single-step inherently), or lower the embodiment’s control.frequency_hz.

reflex doctor --show-calibration says "No calibration cache found" — you haven't run reflex serve --auto-calibrate on this host yet, or the cache lives at a different path.