
policy versioning

reflex serve --policy-a ./v1/ --policy-b ./v2/ --split 80 --no-rtc loads two policies side-by-side and routes /act traffic deterministically per episode: 80% of episodes go to A, 20% to B. Routing is sticky per episode (the router uses a SHA-256 hash of episode_id), so cache locality + RTC carry-over are preserved within an episode.

Per ADR 2026-04-25-policy-versioning-architecture.

# 1. Verify both policies pass reflex doctor first
reflex doctor ./v1/
reflex doctor ./v2/
# 2. Start 2-policy serve (80/20 split, RTC off)
reflex serve \
  --policy-a ./v1/ \
  --policy-b ./v2/ \
  --split 80 \
  --no-rtc
# 3. /act requests now carry routing decisions in headers + record-replay traces
curl -X POST http://localhost:8000/act \
  -H "Content-Type: application/json" \
  -d '{"episode_id": "ep_xyz", "image": "...", "instruction": "pick up the cup"}'
# Response headers:
# X-Reflex-Policy-Slot: a
# X-Reflex-Model-Version: pi0-libero-v1@<hash>

Three load-bearing customer signals:

  1. Production rollout — ship a new policy to 5% of traffic, watch metrics for an hour, ramp to 100%. The classic A/B framework, applied to robot policies.
  2. Risk-free comparison — load the next-gen policy alongside the current one + compare per-episode metrics in your dashboard, no production traffic risk (set --split 100 to keep all traffic on A while B is loaded inert).
  3. Self-distilling-serve safety — the auto-distill loop needs a warm secondary slot for ≤60s rollback when the post-swap monitor trips.
| Flag | Default | Notes |
| --- | --- | --- |
| --policy-a <path> | unset | Path to policy A export. Must be set together with --policy-b. |
| --policy-b <path> | unset | Path to policy B. Mutually exclusive with --shadow-policy. |
| --split <int> | 50 | Percent of episodes routed to A. 0 = all to B; 100 = all to A. |
| --shadow-policy <path> | unset | Phase 1.5 — shadow inference. Phase 1 ships INERT (warning only). |
| --no-rtc | false | REQUIRED in 2-policy mode. RTC carry-over is per-policy; cross-policy carry-over produces OOD actions. |

Routing decision hashes on episode_id (or request_id when episode_id is missing — the degraded path). The first request of an episode picks the slot; all subsequent requests within the same episode_id get the same slot. This preserves:

  • 9× episode-cache moat — Pi05DecomposedInference’s EpisodeCache is per-policy. Switching mid-episode destroys the cached past_kv and falls back to full denoise.
  • RTC carry-over — chunk N+1’s denoise anchors to chunk N’s trailing actions. Cross-policy carry-over produces out-of-distribution actions (this is why --no-rtc is enforced).

Hash distribution is deterministic across processes + Python restarts (SHA-256 of the routing key). Same episode_id → same slot, every time.
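
The routing rule is simple enough to reproduce. A minimal Python sketch of the documented behavior (pick_slot is a hypothetical stand-in, not the router's actual implementation):

import hashlib

def pick_slot(episode_id, request_id, split=80):
    # Prefer episode_id; fall back to request_id (the degraded path below).
    degraded = episode_id is None
    routing_key = request_id if degraded else episode_id
    # SHA-256 is stable across processes and Python restarts,
    # unlike the builtin hash(), which is salted per process.
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # 0..99
    slot = "a" if bucket < split else "b"
    return slot, degraded

A given episode_id always lands in the same bucket, so the 80/20 split holds in aggregate across many episodes rather than per request. --split 100 maps every bucket to A, which is the inert-B configuration from signal 2 above.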

When the caller doesn’t pass episode_id, the router falls back to hashing request_id. Each request gets an independent decision → flip-flopping between policies → cache + RTC discontinuities. The router logs a one-time warning per process.

WARN policy_router.degraded_mode request_id=req_abc — no episode_id provided.

Fix: every client-side /act call should set episode_id to a stable identifier (e.g., ep_<robot_id>_<task_start_ts>).
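
For illustration, a client-side call that follows this rule, using the requests library (robot_id and the payload fields are placeholders):

import time
import requests

robot_id = "widowx_03"  # placeholder robot identifier
episode_id = f"ep_{robot_id}_{int(time.time())}"  # fixed at episode start, reused for every call

resp = requests.post(
    "http://localhost:8000/act",
    json={"episode_id": episode_id, "image": "...", "instruction": "pick up the cup"},
)
# Every call that reuses this episode_id gets the same slot.
print(resp.headers["X-Reflex-Policy-Slot"])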

Each policy slot has its own consecutive-crash counter. When one slot exceeds --max-consecutive-crashes (default 5):

| Scenario | Verdict | Action |
| --- | --- | --- |
| Slot A crashes ≥5×; B is clean | drain-a | Route 100% to B |
| Slot B crashes ≥5×; A is clean | drain-b | Mirror: route 100% to A |
| Both slots crash ≥5× | degraded | Full server degraded state — both policies are contributing errors |

A clean response on a slot resets that slot’s counter to 0. Drain decision is sticky until operator intervention or a clean response.
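
The bookkeeping fits in a few lines. A sketch with hypothetical names (SlotHealth is not the server's actual class):

class SlotHealth:
    """Per-slot consecutive-crash counters and the drain verdict."""

    def __init__(self, max_consecutive_crashes=5):
        self.max = max_consecutive_crashes
        self.crashes = {"a": 0, "b": 0}

    def record(self, slot, crashed):
        # A clean response resets the slot's counter; a crash increments it.
        self.crashes[slot] = self.crashes[slot] + 1 if crashed else 0
        a_bad = self.crashes["a"] >= self.max
        b_bad = self.crashes["b"] >= self.max
        if a_bad and b_bad:
            return "degraded"  # both slots contributing errors
        if a_bad:
            return "drain-a"   # route 100% to B
        if b_bad:
            return "drain-b"   # route 100% to A
        return "healthy"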

The JSONL trace gains two optional, additive fields in 2-policy mode (no schema_version bump):

Header gains a policies block:

{
  "kind": "header",
  "schema_version": 1,
  "policies": [
    {"slot": "a", "model_id": "pi0-libero-v1", "model_hash": "aaaa..."},
    {"slot": "b", "model_id": "pi0-libero-v2", "model_hash": "bbbb..."}
  ]
}

Per-request gains a routing block:

{
  "kind": "request",
  "seq": 0,
  "routing": {
    "slot": "a",
    "routing_key": "ep_xyz",
    "degraded": false,
    "cached": false
  }
}

Replay tools that parse the trace can split per-slot statistics by grouping on routing.slot.
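
For example, a minimal pass over the trace that tallies requests per slot (trace.jsonl is a placeholder path):

import json
from collections import Counter

slot_counts = Counter()
degraded_requests = 0
with open("trace.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        if rec.get("kind") != "request":
            continue  # skip the header record
        routing = rec.get("routing", {})
        # Single-policy traces have no routing block; count them as "prod".
        slot_counts[routing.get("slot", "prod")] += 1
        degraded_requests += routing.get("degraded", False)

print(dict(slot_counts), "degraded:", degraded_requests)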

5 metrics gain a policy_slot bounded-enum label (prod | a | b):

  • reflex_act_latency_seconds
  • reflex_cache_hit_total
  • reflex_cache_miss_total
  • reflex_denoise_steps_total
  • reflex_in_flight_requests

Default policy_slot="prod" preserves series meaning under single-policy deployments. Cardinality stays well within budget.

# A vs B p99 latency
histogram_quantile(0.99,
  sum(rate(reflex_act_latency_seconds_bucket{policy_slot="a"}[5m])) by (le)
)
histogram_quantile(0.99,
  sum(rate(reflex_act_latency_seconds_bucket{policy_slot="b"}[5m])) by (le)
)
# Cache hit rate per slot
rate(reflex_cache_hit_total{policy_slot="a"}[5m])
  / (rate(reflex_cache_hit_total{policy_slot="a"}[5m])
     + rate(reflex_cache_miss_total{policy_slot="a"}[5m]))

2-policy mode requires roughly 2× model_size_bytes of GPU VRAM. Before loading the second policy, the server checks:

2 × model_size_bytes > 0.7 × total_gpu_bytes

If the inequality holds, the server refuses to start with a clear error message (no silent OOM at first inference). The 0.7 factor leaves 30% of VRAM for cuDNN workspace, IO buffers, the OS, etc.

2-policy mode requires 16.0 GB VRAM but only 11.2 GB (70% of 16.0 GB)
is available. Either pick smaller models, run on a larger GPU, OR drop
to single-policy mode.
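
The preflight itself is a few lines. A sketch under the same 0.7 budget, with hypothetical names (preflight_vram is illustrative, not the server's actual function):

def preflight_vram(model_size_bytes, total_gpu_bytes):
    # Fail fast at startup instead of OOMing at first inference.
    required = 2 * model_size_bytes
    budget = 0.7 * total_gpu_bytes  # 30% headroom: cuDNN workspace, IO buffers, OS
    if required > budget:
        gib = 1024 ** 3
        raise SystemExit(
            f"2-policy mode requires {required / gib:.1f} GB VRAM but only "
            f"{budget / gib:.1f} GB (70% of {total_gpu_bytes / gib:.1f} GB) "
            "is available. Either pick smaller models, run on a larger GPU, "
            "OR drop to single-policy mode."
        )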