# Batching

`reflex serve` queues `/act` requests through a per-policy `PolicyRuntime` and flushes batches based on estimated GPU-ms cost rather than a fixed request count. There's no opt-in flag — it's just better default behavior.
## Quick start

```sh
# Default — 100 ms cost budget, 5 ms flush timeout
reflex serve ./my-export/

# Tune for throughput (deeper batches, longer per-request latency)
reflex serve ./my-export/ --max-batch-cost-ms 250

# Tune for tail latency (smaller batches, faster flush)
reflex serve ./my-export/ --max-batch-cost-ms 50 --batch-timeout-ms 2
```

## How it decides to flush

The scheduler flushes when ANY of these conditions is true:
- **Budget reached.** Summed estimated GPU-ms ≥ `--max-batch-cost-ms`. The good case.
- **Timeout.** The oldest queued request has waited ≥ `--batch-timeout-ms`. Fires under low traffic.
- **Single request over budget.** A single request's estimated cost exceeds the budget on its own — no point waiting for more.
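The three conditions above amount to a single predicate over the queue. A minimal sketch — the names `QueuedRequest` and `should_flush` are illustrative, not the actual runtime API:

```python
import time
from dataclasses import dataclass

@dataclass
class QueuedRequest:
    est_cost_ms: float   # estimate supplied by the cost model
    enqueued_at: float   # monotonic timestamp at enqueue

def should_flush(queue, max_batch_cost_ms=100.0, batch_timeout_ms=5.0, now=None):
    """Return the flush reason, or None if the batch should keep filling."""
    if not queue:
        return None
    now = time.monotonic() if now is None else now
    # Single request already over budget: waiting for more can't help.
    if len(queue) == 1 and queue[0].est_cost_ms >= max_batch_cost_ms:
        return "single_request_over_budget"
    # Summed estimated GPU-ms has reached the budget.
    if sum(r.est_cost_ms for r in queue) >= max_batch_cost_ms:
        return "budget_reached"
    # Oldest request has waited past the timeout (the low-traffic case).
    if (now - queue[0].enqueued_at) * 1000 >= batch_timeout_ms:
        return "timeout"
    return None
```

The returned strings match the `reason` label values emitted on `reflex_batch_flush_total`.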
Cost estimates come from a per-policy rolling-window cost model (`GpuMsCostModel`). It starts with a 50 ms cold-start default per shape; after 3 measurements it switches to the rolling median. After each flush, the model updates with the actual measured wall-clock time divided by the batch size.
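That update rule can be sketched as follows. This is a minimal illustration, not the actual `GpuMsCostModel` implementation; the window length of 32 is an assumption (the docs don't specify it):

```python
import statistics
from collections import defaultdict, deque

COLD_START_MS = 50.0   # per-shape default before enough measurements exist
WARMUP_SAMPLES = 3     # switch to the rolling median after this many
WINDOW = 32            # rolling-window length (ASSUMED, not documented)

class GpuMsCostModel:
    """Per-shape rolling-median GPU-ms estimator (illustrative sketch)."""

    def __init__(self):
        self._samples = defaultdict(lambda: deque(maxlen=WINDOW))

    def estimate_ms(self, shape):
        samples = self._samples[shape]
        if len(samples) < WARMUP_SAMPLES:
            return COLD_START_MS          # cold-start default
        return statistics.median(samples)  # rolling median once warm

    def record_flush(self, shape, wall_clock_ms, batch_size):
        # Post-flush update: measured wall-clock divided by batch size.
        self._samples[shape].append(wall_clock_ms / batch_size)
```

The median (rather than the mean) keeps a single GC pause or preemption stall from skewing every subsequent batching decision.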
## Why cost-weighted, not count-weighted

A naïve `--max-batch=8` ignores per-request cost variance:
- 8 cache-hit requests (~50 ms each) = 400 ms per batch — fine
- 8 cache-miss requests (~400 ms each) = 3200 ms per batch — blows the SLO
`--max-batch-cost-ms 100` adapts: 2 cache-hits + 1 cache-miss is the same scheduler decision as 8 cache-hits, even though one is "size 3" and the other is "size 8."
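The arithmetic behind that equivalence, as a standalone check (the per-request costs here are illustrative numbers chosen so both batches land exactly on the budget, not measurements from the doc):

```python
BUDGET_MS = 100.0

# "Many cheap" vs "few expensive": same budget comparison either way.
cheap_batch = [12.5] * 8            # 8 cache-hit-like requests, ~12.5 ms each
mixed_batch = [12.5, 12.5, 75.0]    # 2 cheap requests + 1 expensive one

# Both sums reach the budget, so the scheduler flushes both — the
# decision depends on summed estimated cost, not on request count.
assert sum(cheap_batch) >= BUDGET_MS
assert sum(mixed_batch) >= BUDGET_MS
assert len(cheap_batch) != len(mixed_batch)
```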
## Migration from `--max-batch`

The legacy `--max-batch=N` flag still parses but is ignored at the runtime layer. Setting `--max-batch` > 1 triggers a one-time deprecation warning at startup. Migration:
`--max-batch-cost-ms ≈ max_batch × per_request_cost_estimate_ms`

For the 50 ms cold-start default: `--max-batch 8` → `--max-batch-cost-ms 400`.
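The conversion as a checkable helper (illustrative only — this function is not part of the CLI):

```python
def migrate_max_batch(max_batch, per_request_cost_estimate_ms=50.0):
    """Translate a legacy --max-batch value into a --max-batch-cost-ms budget.

    Defaults to the 50 ms cold-start estimate; substitute your measured
    per-request GPU-ms if you have one.
    """
    return max_batch * per_request_cost_estimate_ms

# Legacy --max-batch 8 with the 50 ms default -> --max-batch-cost-ms 400
assert migrate_max_batch(8) == 400.0
```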
## Metrics

Every flush emits five Prometheus series. Labels: `embodiment` × `policy_slot` (`prod` for single-policy; `a` / `b` / `shadow` after policy-versioning lands).
| Metric | Type | Description |
|---|---|---|
| `reflex_batch_cost_per_flush_ms` | Histogram | Estimated GPU-ms cost of each flushed batch |
| `reflex_batch_size_per_flush` | Histogram | Request count per flush |
| `reflex_batch_flush_total` | Counter | Cumulative flushes by reason (`budget_reached` / `timeout` / `single_request_over_budget`) |
| `reflex_captured_graph_hit_rate` | Gauge | Fraction of recent flushes whose batch landed on a captured-graph shape |
| `reflex_policy_runtime_queue_depth` | Gauge | Current pending requests |
`captured_graph_hit_rate` composes with `--cuda-graphs`. When both are set, the hit rate tells you how often the scheduler actually exercised the captured-graph path vs. the eager fallback.
## Backpressure

When the queue reaches capacity (default 1000 requests), `/act` returns HTTP 503 with:

```json
{
  "error": "queue_full",
  "message": "policy runtime queue at capacity",
  "policy_id": "prod",
  "max_queue": 1000
}
```

and a `Retry-After: 1` header. Well-behaved clients with backoff are fine.
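A well-behaved retry loop might look like the sketch below. The `send` callable stands in for whatever HTTP client you use (it returns status, the parsed `Retry-After` value in seconds, and the response body); the retry policy itself is an assumption, not something the server mandates:

```python
import time

def act_with_backoff(send, max_retries=5, sleep=time.sleep):
    """Retry /act while the policy runtime queue is full (HTTP 503)."""
    for attempt in range(max_retries):
        status, retry_after_s, body = send()
        if status != 503:
            return body
        # Honor Retry-After, with exponential backoff on repeated 503s.
        sleep((retry_after_s or 1.0) * (2 ** attempt))
    raise RuntimeError(f"queue still full after {max_retries} attempts")
```

Injecting `sleep` keeps the loop testable; in production the default `time.sleep` applies.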
## Tuning

| Symptom | Likely cause | Fix |
|---|---|---|
| p99 latency spikes | Budget too high — long batches | Lower `--max-batch-cost-ms` |
| Throughput plateau | Budget too low — small batches dominate | Raise `--max-batch-cost-ms` (in 50 ms steps) |
| `timeout` flush ratio > 80% | Load too low to benefit from batching | Lower `--batch-timeout-ms` (e.g. 1 ms) |
| `queue_full` 503s | Sustained burst exceeds runtime drain rate | Raise the `--max-concurrent` ceiling, or scale horizontally |
| `captured_graph_hit_rate` < 1.0 | Phase 2 mixed-shape batches | Check `gen_ai.action.embodiment` distribution |
## What's not in Phase 1

- True dynamic-batch ORT dispatch. Decomposed exports are static-shape (per ADR 2026-04-21); batching today fans out sequentially under the queue. Per-request compute cost is unchanged.
- Sub-queue-by-shape separation. ADR-flagged as a Phase 1.5 follow-up.
- Cross-policy batching. Requires policy-versioning’s per-embodiment routing (Phase 2).
- Online bandit cost updates. Phase 2 refinement if profiled GPU-ms proves insufficient.