batching

reflex serve queues /act requests through a per-policy PolicyRuntime and flushes batches based on estimated GPU-ms cost rather than a fixed request count. There's no opt-in flag; it's just better default behavior.

```sh
# Default — 100ms cost budget, 5ms timeout
reflex serve ./my-export/

# Tune for throughput (deeper batches, longer per-request latency)
reflex serve ./my-export/ --max-batch-cost-ms 250

# Tune for tail latency (smaller batches, faster flush)
reflex serve ./my-export/ --max-batch-cost-ms 50 --batch-timeout-ms 2
```

The scheduler flushes when any of these is true (see the sketch after the list):

  1. Budget reached. Summed estimated GPU-ms ≥ --max-batch-cost-ms. The good case.
  2. Timeout. The oldest queued request has waited ≥ --batch-timeout-ms. Fires under low traffic.
  3. Single request over budget. A single request whose estimated cost exceeds the budget — no point waiting for more.
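
A minimal sketch of that decision in Python, using the default flag values above. The names QueuedRequest and flush_reason are illustrative stand-ins, not the actual PolicyRuntime internals:

```python
import time
from dataclasses import dataclass, field

@dataclass
class QueuedRequest:
    est_cost_ms: float  # estimate from the per-policy cost model
    enqueued_at: float = field(default_factory=time.monotonic)

def flush_reason(queue: list[QueuedRequest],
                 max_batch_cost_ms: float = 100.0,
                 batch_timeout_ms: float = 5.0) -> str | None:
    """Return a flush reason if any condition holds, else None (keep waiting)."""
    if not queue:
        return None
    # 3. One request alone exceeds the budget: no point waiting for more.
    if len(queue) == 1 and queue[0].est_cost_ms > max_batch_cost_ms:
        return "single_request_over_budget"
    # 1. Summed estimated GPU-ms has reached the budget.
    if sum(r.est_cost_ms for r in queue) >= max_batch_cost_ms:
        return "budget_reached"
    # 2. The oldest queued request has waited past the timeout.
    waited_ms = (time.monotonic() - queue[0].enqueued_at) * 1000.0
    if waited_ms >= batch_timeout_ms:
        return "timeout"
    return None
```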

Cost estimates come from a per-policy rolling-window cost model (GpuMsCostModel). It starts with a 50 ms cold-start default per shape, switches to the rolling median after 3 measurements, and updates post-flush with the actual measured wall-clock time divided by batch size.
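
A sketch of that behavior, assuming a 32-entry window (the window length isn't documented here); the class and method names are illustrative stand-ins for GpuMsCostModel:

```python
from collections import defaultdict, deque
from statistics import median

COLD_START_MS = 50.0  # default estimate before enough measurements exist
MIN_SAMPLES = 3       # switch to the rolling median after this many
WINDOW = 32           # rolling-window length (assumed, not documented)

class RollingCostModel:
    """Per-shape rolling-window cost model, mirroring the described behavior."""
    def __init__(self):
        self._samples = defaultdict(lambda: deque(maxlen=WINDOW))

    def estimate_ms(self, shape) -> float:
        samples = self._samples[shape]
        if len(samples) < MIN_SAMPLES:
            return COLD_START_MS   # cold-start default per shape
        return median(samples)     # rolling median thereafter

    def observe_flush(self, shape, wall_clock_ms: float, batch_size: int) -> None:
        # Post-flush update: measured wall-clock amortized over the batch.
        self._samples[shape].append(wall_clock_ms / batch_size)
```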

A naïve --max-batch=8 ignores per-request cost variance:

  • 8 cache-hit requests (~50 ms each) = 400 ms / batch — fine
  • 8 cache-miss requests (~400 ms each) = 3200 ms / batch — blows the SLO

A cost budget adapts: at --max-batch-cost-ms 400, 2 cache-hits + 1 cache-miss trips the same budget_reached flush as 8 cache-hits, even though one batch is “size 3” and the other is “size 8.”
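
Plugging those numbers into the flush_reason sketch above shows both queues triggering the same decision:

```python
hits = [QueuedRequest(est_cost_ms=50.0) for _ in range(8)]           # 8 × 50 = 400 ms
mixed = [QueuedRequest(est_cost_ms=c) for c in (50.0, 50.0, 400.0)]  # 500 ms

assert flush_reason(hits, max_batch_cost_ms=400.0) == "budget_reached"
assert flush_reason(mixed, max_batch_cost_ms=400.0) == "budget_reached"
```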

The legacy --max-batch=N flag still parses but is ignored at the runtime layer. Setting --max-batch > 1 triggers a one-time deprecation warning at startup. Migration:

--max-batch-cost-ms ≈ max_batch × per_request_cost_estimate_ms

For the 50 ms cold-start default: --max-batch 8 → --max-batch-cost-ms 400.

Every flush emits five Prometheus series. Labels: embodiment × policy_slot (prod for single-policy; a / b / shadow after policy-versioning lands).

| Metric | Type | Description |
| --- | --- | --- |
| reflex_batch_cost_per_flush_ms | Histogram | Estimated GPU-ms cost of each flushed batch |
| reflex_batch_size_per_flush | Histogram | Request count per flush |
| reflex_batch_flush_total | Counter | Cumulative flushes by reason (budget_reached / timeout / single_request_over_budget) |
| reflex_captured_graph_hit_rate | Gauge | Fraction of recent flushes whose batch landed on a captured-graph shape |
| reflex_policy_runtime_queue_depth | Gauge | Current pending requests |
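
As one example of consuming these, the timeout-flush ratio referenced in the troubleshooting table below can be derived from reflex_batch_flush_total. This sketch assumes a Prometheus server at localhost:9090 scraping the runtime, and a reason label carrying the flush reason:

```python
import requests  # third-party HTTP client

PROM_URL = "http://localhost:9090/api/v1/query"  # assumed Prometheus address

def flush_ratio(reason: str, window: str = "5m") -> float:
    """Fraction of recent flushes with the given reason (e.g. 'timeout')."""
    query = (
        f'sum(rate(reflex_batch_flush_total{{reason="{reason}"}}[{window}]))'
        f' / sum(rate(reflex_batch_flush_total[{window}]))'
    )
    resp = requests.get(PROM_URL, params={"query": query}, timeout=5)
    resp.raise_for_status()
    return float(resp.json()["data"]["result"][0]["value"][1])

if __name__ == "__main__":
    print(f"timeout flush ratio: {flush_ratio('timeout'):.1%}")
```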

captured_graph_hit_rate composes with --cuda-graphs. When both are set, the hit rate tells you how often the scheduler actually exercised the captured graph path vs eager fallback.

When the queue hits its capacity (default 1000 requests), /act returns HTTP 503 with:

```json
{
  "error": "queue_full",
  "message": "policy runtime queue at capacity",
  "policy_id": "prod",
  "max_queue": 1000
}
```

and a Retry-After: 1 header. Well-behaved clients with backoff are fine.
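
A minimal client sketch that honors that header, using the requests library (the endpoint URL and payload shape are whatever your deployment uses):

```python
import time
import requests

def act_with_backoff(url: str, payload: dict, max_attempts: int = 5) -> dict:
    """POST to /act, backing off on queue_full 503s per Retry-After."""
    for attempt in range(max_attempts):
        resp = requests.post(url, json=payload, timeout=10)
        if resp.status_code != 503:
            resp.raise_for_status()
            return resp.json()
        # Queue at capacity: honor Retry-After, escalating on repeats.
        delay_s = float(resp.headers.get("Retry-After", 1))
        time.sleep(delay_s * (2 ** attempt))
    raise RuntimeError("policy runtime queue still full after retries")
```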

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| p99 latency spikes | Budget too high (long batches) | Lower --max-batch-cost-ms |
| Throughput plateau | Budget too low (small batches dominate) | Raise --max-batch-cost-ms (in 50 ms steps) |
| timeout flush ratio > 80% | Load too low to benefit from batching | Lower --batch-timeout-ms (e.g. 1 ms) |
| queue_full 503s | Sustained burst exceeds runtime drain rate | Raise --max-concurrent ceiling, or scale horizontally |
| captured_graph_hit_rate < 1.0 | Phase 2 mixed-shape batches | Check gen_ai.action.embodiment distribution |
Not yet implemented:

  • True dynamic-batch ORT dispatch. Decomposed exports are static-shape (per ADR 2026-04-21); batching today fans out sequentially under the queue. Per-request compute cost is unchanged.
  • Sub-queue-by-shape separation. ADR-flagged as a Phase 1.5 follow-up.
  • Cross-policy batching. Requires policy-versioning’s per-embodiment routing (Phase 2).
  • Online bandit cost updates. Phase 2 refinement if profiled GPU-ms proves insufficient.