
policy versioning

reflex serve --policy-a ./v1/ --policy-b ./v2/ --split 80 --no-rtc loads two policies side-by-side and routes /act traffic deterministically per episode: 80% of episodes go to A, 20% to B. Routing is sticky per episode (the router uses a SHA-256 hash of episode_id), so cache locality + RTC carry-over are preserved within an episode.

Per ADR 2026-04-25-policy-versioning-architecture.

# 1. Verify both policies pass reflex doctor first
reflex doctor ./v1/
reflex doctor ./v2/
# 2. Start 2-policy serve (80/20 split, RTC off)
reflex serve \
  --policy-a ./v1/ \
  --policy-b ./v2/ \
  --split 80 \
  --no-rtc
# 3. /act requests now carry routing decisions in headers + record-replay traces
curl -X POST http://localhost:8000/act \
  -H "Content-Type: application/json" \
  -d '{"episode_id": "ep_xyz", "image": "...", "instruction": "pick up the cup"}'
# Response headers:
# X-Reflex-Policy-Slot: a
# X-Reflex-Model-Version: pi0-libero-v1@<hash>

Three load-bearing customer signals:

  1. Production rollout — ship a new policy to 5% of traffic, watch metrics for an hour, ramp to 100%. The classic A/B framework, applied to robot policies.
  2. Risk-free comparison — load the next-gen policy alongside the current one + compare per-episode metrics in your dashboard, no production traffic risk (set --split 100 to keep all traffic on A while B is loaded inert).
  3. Self-distilling-serve safety — the auto-distill loop needs a warm secondary slot for ≤60s rollback when the post-swap monitor trips.
| Flag | Default | Notes |
| --- | --- | --- |
| --policy-a <path> | unset | Path to policy A export. Must be set together with --policy-b. |
| --policy-b <path> | unset | Path to policy B. Mutually exclusive with --shadow-policy. |
| --split <int> | 50 | Percent of episodes routed to A. 0 = all to B; 100 = all to A. |
| --shadow-policy <path> | unset | Phase 1.5 — shadow inference. Phase 1 ships INERT (warning only). |
| --no-rtc | false | REQUIRED in 2-policy mode. RTC carry-over is per-policy; cross-policy carry-over produces OOD actions. |

Routing decision hashes on episode_id (or request_id when episode_id is missing — the degraded path). The first request of an episode picks the slot; all subsequent requests within the same episode_id get the same slot. This preserves:

  • 9× episode-cache moat — Pi05DecomposedInference’s EpisodeCache is per-policy. Switching mid-episode destroys the cached past_kv and falls back to full denoise.
  • RTC carry-over — chunk N+1’s denoise anchors to chunk N’s trailing actions. Cross-policy carry-over produces out-of-distribution actions (this is why --no-rtc is enforced).

Hash distribution is deterministic across processes + Python restarts (SHA-256 of the routing key). Same episode_id → same slot, every time.
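
The routing rule is simple enough to reproduce. A minimal Python sketch of the documented behavior (pick_slot is a hypothetical stand-in, not the router's actual implementation):

import hashlib

def pick_slot(episode_id, request_id, split=80):
    # Prefer episode_id; fall back to request_id (the degraded path below).
    degraded = episode_id is None
    routing_key = request_id if degraded else episode_id
    # SHA-256 is stable across processes and Python restarts,
    # unlike the builtin hash(), which is salted per process.
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # 0..99
    slot = "a" if bucket < split else "b"
    return slot, degraded

A given episode_id always lands in the same bucket, so the 80/20 split holds in aggregate across many episodes rather than per request. --split 100 maps every bucket to A, which is the inert-B configuration from signal 2 above.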

When the caller doesn’t pass episode_id, the router falls back to hashing request_id. Each request gets an independent decision → flip-flopping between policies → cache + RTC discontinuities. The router logs a one-time warning per process.

WARN policy_router.degraded_mode request_id=req_abc — no episode_id provided.

Fix: every client-side /act call should set episode_id to a stable identifier (e.g., ep_<robot_id>_<task_start_ts>).
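
For illustration, a client-side call that follows this rule, using the requests library (robot_id and the payload fields are placeholders):

import time
import requests

robot_id = "widowx_03"  # placeholder robot identifier
episode_id = f"ep_{robot_id}_{int(time.time())}"  # fixed at episode start, reused for every call

resp = requests.post(
    "http://localhost:8000/act",
    json={"episode_id": episode_id, "image": "...", "instruction": "pick up the cup"},
)
# Every call that reuses this episode_id gets the same slot.
print(resp.headers["X-Reflex-Policy-Slot"])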

Each policy slot has its own consecutive-crash counter. When one slot exceeds --max-consecutive-crashes (default 5):

| Scenario | Verdict | Action |
| --- | --- | --- |
| Slot A crashes ≥5×; B is clean | drain-a | Route 100% to B |
| Slot B crashes ≥5×; A is clean | drain-b | Mirror: route 100% to A |
| Both slots crash ≥5× | degraded | Full server degraded state — both policies are contributing errors |

A clean response on a slot resets that slot’s counter to 0. Drain decision is sticky until operator intervention or a clean response.
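
The bookkeeping fits in a few lines. A sketch with hypothetical names (SlotHealth is not the server's actual class):

class SlotHealth:
    """Per-slot consecutive-crash counters and the drain verdict."""

    def __init__(self, max_consecutive_crashes=5):
        self.max = max_consecutive_crashes
        self.crashes = {"a": 0, "b": 0}

    def record(self, slot, crashed):
        # A clean response resets the slot's counter; a crash increments it.
        self.crashes[slot] = self.crashes[slot] + 1 if crashed else 0
        a_bad = self.crashes["a"] >= self.max
        b_bad = self.crashes["b"] >= self.max
        if a_bad and b_bad:
            return "degraded"  # both slots contributing errors
        if a_bad:
            return "drain-a"   # route 100% to B
        if b_bad:
            return "drain-b"   # route 100% to A
        return "healthy"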

The JSONL trace gains two optional, additive fields in 2-policy mode (no schema_version bump):

Header gains a policies block:

{
  "kind": "header",
  "schema_version": 1,
  "policies": [
    {"slot": "a", "model_id": "pi0-libero-v1", "model_hash": "aaaa..."},
    {"slot": "b", "model_id": "pi0-libero-v2", "model_hash": "bbbb..."}
  ]
}

Per-request gains a routing block:

{
  "kind": "request",
  "seq": 0,
  "routing": {
    "slot": "a",
    "routing_key": "ep_xyz",
    "degraded": false,
    "cached": false
  }
}

Replay tools that parse the trace can split per-slot statistics by grouping on routing.slot.
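
For example, a minimal pass over the trace that tallies requests per slot (trace.jsonl is a placeholder path):

import json
from collections import Counter

slot_counts = Counter()
degraded_requests = 0
with open("trace.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        if rec.get("kind") != "request":
            continue  # skip the header record
        routing = rec.get("routing", {})
        # Single-policy traces have no routing block; count them as "prod".
        slot_counts[routing.get("slot", "prod")] += 1
        degraded_requests += routing.get("degraded", False)

print(dict(slot_counts), "degraded:", degraded_requests)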

5 metrics gain a policy_slot bounded-enum label (prod | a | b):

  • reflex_act_latency_seconds
  • reflex_cache_hit_total
  • reflex_cache_miss_total
  • reflex_denoise_steps_total
  • reflex_in_flight_requests

Default policy_slot="prod" preserves series meaning under single-policy deployments. Cardinality stays well within budget.

# A vs B p99 latency
histogram_quantile(0.99,
  sum(rate(reflex_act_latency_seconds_bucket{policy_slot="a"}[5m])) by (le)
)
histogram_quantile(0.99,
  sum(rate(reflex_act_latency_seconds_bucket{policy_slot="b"}[5m])) by (le)
)
# Cache hit rate per slot
rate(reflex_cache_hit_total{policy_slot="a"}[5m])
  / (rate(reflex_cache_hit_total{policy_slot="a"}[5m])
     + rate(reflex_cache_miss_total{policy_slot="a"}[5m]))

2-policy mode requires roughly 2× model_size_bytes of GPU VRAM. Before loading the second policy, the server checks:

2 × model_size_bytes > 0.7 × total_gpu_bytes

If the inequality holds, the server refuses to start with a clear error message (no silent OOM at first inference). The 0.7 factor leaves 30% of VRAM for cuDNN workspace, IO buffers, the OS, etc.

2-policy mode requires 16.0 GB VRAM but only 11.2 GB (70% of 16.0 GB)
is available. Either pick smaller models, run on a larger GPU, OR drop
to single-policy mode.
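
The preflight itself is a few lines. A sketch under the same 0.7 budget, with hypothetical names (preflight_vram is illustrative, not the server's actual function):

def preflight_vram(model_size_bytes, total_gpu_bytes):
    # Fail fast at startup instead of OOMing at first inference.
    required = 2 * model_size_bytes
    budget = 0.7 * total_gpu_bytes  # 30% headroom: cuDNN workspace, IO buffers, OS
    if required > budget:
        gib = 1024 ** 3
        raise SystemExit(
            f"2-policy mode requires {required / gib:.1f} GB VRAM but only "
            f"{budget / gib:.1f} GB (70% of {total_gpu_bytes / gib:.1f} GB) "
            "is available. Either pick smaller models, run on a larger GPU, "
            "OR drop to single-policy mode."
        )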