# Policy versioning
`reflex serve --policy-a ./v1/ --policy-b ./v2/ --split 80 --no-rtc` loads two policies side by side and routes `/act` traffic deterministically per episode: 80% of episodes go to A, 20% to B. Routing is sticky per episode (the router uses a SHA-256 hash of `episode_id`), so cache locality and RTC carry-over are preserved within an episode.
Per ADR `2026-04-25-policy-versioning-architecture`.
## Quick start
```sh
# 1. Verify both policies pass reflex doctor first
reflex doctor ./v1/
reflex doctor ./v2/
```
```sh
# 2. Start 2-policy serve (80/20 split, RTC off)
reflex serve \
  --policy-a ./v1/ \
  --policy-b ./v2/ \
  --split 80 \
  --no-rtc
```
```sh
# 3. /act requests now carry routing decisions in headers + record-replay traces
curl -X POST http://localhost:8000/act \
  -H "Content-Type: application/json" \
  -d '{"episode_id": "ep_xyz", "image": "...", "instruction": "pick up the cup"}'

# Response headers:
# X-Reflex-Policy-Slot: a
# X-Reflex-Model-Version: pi0-libero-v1@<hash>
```

## Why this exists
Three load-bearing customer signals:
- Production rollout — ship a new policy to 5% of traffic, watch metrics for an hour, ramp to 100%. The classic A/B framework, applied to robot policies.
- Risk-free comparison — load the next-gen policy alongside the current one and compare per-episode metrics in your dashboard with no production traffic risk (set `--split 100` to keep all traffic on A while B is loaded inert).
- Self-distilling-serve safety — the auto-distill loop needs a warm secondary slot for ≤60s rollback when the post-swap monitor trips.
## The 5 flags
| Flag | Default | Notes |
|---|---|---|
| `--policy-a <path>` | unset | Path to the policy A export. Must be set together with `--policy-b`. |
| `--policy-b <path>` | unset | Path to the policy B export. Mutually exclusive with `--shadow-policy`. |
| `--split <int>` | 50 | Percent of episodes routed to A. 0 = all to B; 100 = all to A. |
| `--shadow-policy <path>` | unset | Phase 1.5 — shadow inference. Phase 1 ships INERT (warning only). |
| `--no-rtc` | false | REQUIRED in 2-policy mode. RTC carry-over is per-policy; cross-policy carry-over produces OOD actions. |
## Sticky-per-episode routing
The routing decision hashes on `episode_id` (or `request_id` when `episode_id` is missing — the degraded path). The first request of an episode picks the slot; every subsequent request with the same `episode_id` gets the same slot. This preserves:
- 9× episode-cache moat — `Pi05DecomposedInference`'s `EpisodeCache` is per-policy. Switching mid-episode destroys the cached `past_kv` and falls back to full denoise.
- RTC carry-over — chunk N+1's denoise anchors to chunk N's trailing actions. Cross-policy carry-over produces out-of-distribution actions (this is why `--no-rtc` is enforced).
Hash distribution is deterministic across processes and Python restarts (SHA-256 of the routing key). Same `episode_id` → same slot, every time.
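The deterministic routing described above can be sketched in a few lines. This is an illustrative approximation, not the server's actual code; `pick_slot` and the digest-to-bucket reduction are assumptions, but the key property — SHA-256 of the routing key, so the decision survives restarts — is from the docs above.

```python
import hashlib

def pick_slot(episode_id: str, split: int = 80) -> str:
    """Deterministically map an episode to slot 'a' or 'b'.

    The same episode_id always yields the same slot, across processes
    and restarts, because SHA-256 is stable (unlike Python's salted
    built-in hash()).
    """
    digest = hashlib.sha256(episode_id.encode("utf-8")).digest()
    # Reduce the digest to a bucket in [0, 100); episodes whose bucket
    # falls below `split` go to A, the rest go to B.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "a" if bucket < split else "b"

# Sticky: repeated calls with the same episode_id always agree.
assert pick_slot("ep_xyz") == pick_slot("ep_xyz")
```

With `--split 80`, roughly 80% of distinct `episode_id`s land in bucket 0–79 and stay on slot A for their whole episode.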
## Degraded mode
Section titled “Degraded mode”When the caller doesn’t pass episode_id, the router falls back to hashing request_id. Each request gets an independent decision → flip-flopping between policies → cache + RTC discontinuities. The router logs a one-time warning per process.
```
WARN policy_router.degraded_mode request_id=req_abc — no episode_id provided
```

Fix: every client-side `/act` call should set `episode_id` to a stable identifier (e.g., `ep_<robot_id>_<task_start_ts>`).
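A minimal client-side sketch of that fix, using only the stdlib. The `start_episode` helper, the `robot7` id, and the payload shape are illustrative assumptions; the point is simply that the identifier is derived once per episode and reused on every call.

```python
import json
import time
import urllib.request

def start_episode(robot_id: str) -> str:
    # Derive the identifier once, at episode start, and reuse it for
    # every /act call until the episode ends (the ep_<robot_id>_<ts>
    # scheme suggested above).
    return f"ep_{robot_id}_{int(time.time())}"

def act(episode_id: str, instruction: str,
        url: str = "http://localhost:8000/act"):
    body = json.dumps({
        "episode_id": episode_id,   # stable across the whole episode
        "image": "...",
        "instruction": instruction,
    }).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    return urllib.request.urlopen(req)

# Every step of the episode reuses the same episode_id, so the router
# keeps the episode pinned to one policy slot.
episode = start_episode("robot7")
```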
## Per-policy circuit breaker
Each policy slot has its own consecutive-crash counter. When one slot exceeds `--max-consecutive-crashes` (default 5):
| Scenario | Verdict | Action |
|---|---|---|
| Slot A crashes ≥5x; B is clean | drain-a | Caller routes 100% to B |
| Slot B crashes ≥5x; A is clean | drain-b | Mirror: route 100% to A |
| Both slots crash ≥5x | degraded | Full server degraded state — both policies are contributing errors |
A clean response on a slot resets that slot's counter to 0. The drain decision is sticky until operator intervention or a clean response.
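The table above reduces to a small state machine. A hypothetical sketch (class and method names are not the server's real identifiers):

```python
from dataclasses import dataclass, field

@dataclass
class SlotBreaker:
    """Per-slot consecutive-crash counter, as described above."""
    max_consecutive_crashes: int = 5
    crashes: dict = field(default_factory=lambda: {"a": 0, "b": 0})

    def record(self, slot: str, ok: bool) -> None:
        # A clean response resets that slot's counter to 0;
        # a crash increments it.
        self.crashes[slot] = 0 if ok else self.crashes[slot] + 1

    def verdict(self) -> str:
        tripped = {s for s, n in self.crashes.items()
                   if n >= self.max_consecutive_crashes}
        if tripped == {"a", "b"}:
            return "degraded"   # both slots contributing errors
        if "a" in tripped:
            return "drain-a"    # route 100% to B
        if "b" in tripped:
            return "drain-b"    # route 100% to A
        return "healthy"
```

Because `record(slot, ok=True)` zeroes the counter, a drained slot recovers as soon as it serves one clean response, matching the stickiness rule above.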
## Record-replay schema additions
The JSONL trace gains two optional, additive fields in 2-policy mode (no `schema_version` bump):
The header record gains a `policies` block:
{ "kind": "header", "schema_version": 1, "policies": [ {"slot": "a", "model_id": "pi0-libero-v1", "model_hash": "aaaa..."}, {"slot": "b", "model_id": "pi0-libero-v2", "model_hash": "bbbb..."} ]}Per-request gains a routing block:
{ "kind": "request", "seq": 0, "routing": { "slot": "a", "routing_key": "ep_xyz", "degraded": false, "cached": false }}Replay tools that parse the trace can split per-slot statistics by grouping on routing.slot.
## Prometheus metrics
Five metrics gain a `policy_slot` bounded-enum label (`prod` | `a` | `b`):
- `reflex_act_latency_seconds`
- `reflex_cache_hit_total`
- `reflex_cache_miss_total`
- `reflex_denoise_steps_total`
- `reflex_in_flight_requests`
The default `policy_slot="prod"` preserves series meaning under single-policy deployments. Cardinality stays well within budget.
```promql
# A vs B p99 latency
histogram_quantile(0.99, sum(rate(reflex_act_latency_seconds_bucket{policy_slot="a"}[5m])) by (le))
histogram_quantile(0.99, sum(rate(reflex_act_latency_seconds_bucket{policy_slot="b"}[5m])) by (le))
```
```promql
# Cache hit rate per slot
rate(reflex_cache_hit_total{policy_slot="a"}[5m])
  / (rate(reflex_cache_hit_total{policy_slot="a"}[5m]) + rate(reflex_cache_miss_total{policy_slot="a"}[5m]))
```

## Memory check (refuse-to-load)
2-policy mode requires roughly 2× `model_size_bytes` of GPU VRAM. Before loading the second policy, the server checks:
```
2 × model_size_bytes > 0.7 × total_gpu_bytes
```

If this inequality holds, the server refuses to start with a clear error message (no silent OOM at first inference). The 0.7 factor leaves 30% of VRAM for cuDNN workspace, IO buffers, the OS, etc.
```
2-policy mode requires 16.0 GB VRAM but only 11.2 GB (70% of 16.0 GB)
is available. Either pick smaller models, run on a larger GPU, or drop
to single-policy mode.
```
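The refuse-to-load check can be sketched as follows. `check_vram` and its signature are assumptions for illustration; the real server reads the model size from the export and the total from the GPU driver.

```python
def check_vram(model_size_bytes: int, total_gpu_bytes: int,
               headroom: float = 0.7) -> None:
    """Fail fast before loading policy B if 2x the model won't fit."""
    need = 2 * model_size_bytes
    budget = headroom * total_gpu_bytes
    if need > budget:
        gib = 1024 ** 3
        raise RuntimeError(
            f"2-policy mode requires {need / gib:.1f} GB VRAM but only "
            f"{budget / gib:.1f} GB ({headroom:.0%} of "
            f"{total_gpu_bytes / gib:.1f} GB) is available. Either pick "
            "smaller models, run on a larger GPU, or drop to "
            "single-policy mode."
        )

# An 8 GB model on a 16 GB GPU needs 16 GB but only 11.2 GB is budgeted,
# so this configuration is refused before any weights are loaded.
```

Refusing at startup rather than at first inference is the point: an OOM mid-episode would trip the circuit breaker and drain a slot for a problem that was knowable up front.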