batching

reflex serve queues /act requests through a per-policy PolicyRuntime and flushes batches based on estimated GPU-ms cost rather than a fixed request count. There's no opt-in flag; it's just better default behavior.

```sh
# Default — 100ms cost budget, 5ms timeout
reflex serve ./my-export/

# Tune for throughput (deeper batches, longer per-request latency)
reflex serve ./my-export/ --max-batch-cost-ms 250

# Tune for tail latency (smaller batches, faster flush)
reflex serve ./my-export/ --max-batch-cost-ms 50 --batch-timeout-ms 2
```

The scheduler flushes when any of these is true (see the sketch after the list):

  1. Budget reached. Summed estimated GPU-ms ≥ --max-batch-cost-ms. The good case.
  2. Timeout. The oldest queued request has waited ≥ --batch-timeout-ms. Fires under low traffic.
  3. Single request over budget. A single request whose estimated cost exceeds the budget — no point waiting for more.
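
A minimal sketch of that decision in Python, using the default flag values above. The names QueuedRequest and flush_reason are illustrative stand-ins, not the actual PolicyRuntime internals:

```python
import time
from dataclasses import dataclass, field

@dataclass
class QueuedRequest:
    est_cost_ms: float  # estimate from the per-policy cost model
    enqueued_at: float = field(default_factory=time.monotonic)

def flush_reason(queue: list[QueuedRequest],
                 max_batch_cost_ms: float = 100.0,
                 batch_timeout_ms: float = 5.0) -> str | None:
    """Return a flush reason if any condition holds, else None (keep waiting)."""
    if not queue:
        return None
    # 3. One request alone exceeds the budget: no point waiting for more.
    if len(queue) == 1 and queue[0].est_cost_ms > max_batch_cost_ms:
        return "single_request_over_budget"
    # 1. Summed estimated GPU-ms has reached the budget.
    if sum(r.est_cost_ms for r in queue) >= max_batch_cost_ms:
        return "budget_reached"
    # 2. The oldest queued request has waited past the timeout.
    waited_ms = (time.monotonic() - queue[0].enqueued_at) * 1000.0
    if waited_ms >= batch_timeout_ms:
        return "timeout"
    return None
```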

Cost estimates come from a per-policy rolling-window cost model (GpuMsCostModel). It starts with a 50 ms cold-start default per shape, switches to the rolling median after 3 measurements, and updates post-flush with the actual measured wall-clock time divided by batch size.
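
A sketch of that behavior, assuming a 32-entry window (the window length isn't documented here); the class and method names are illustrative stand-ins for GpuMsCostModel:

```python
from collections import defaultdict, deque
from statistics import median

COLD_START_MS = 50.0  # default estimate before enough measurements exist
MIN_SAMPLES = 3       # switch to the rolling median after this many
WINDOW = 32           # rolling-window length (assumed, not documented)

class RollingCostModel:
    """Per-shape rolling-window cost model, mirroring the described behavior."""
    def __init__(self):
        self._samples = defaultdict(lambda: deque(maxlen=WINDOW))

    def estimate_ms(self, shape) -> float:
        samples = self._samples[shape]
        if len(samples) < MIN_SAMPLES:
            return COLD_START_MS   # cold-start default per shape
        return median(samples)     # rolling median thereafter

    def observe_flush(self, shape, wall_clock_ms: float, batch_size: int) -> None:
        # Post-flush update: measured wall-clock amortized over the batch.
        self._samples[shape].append(wall_clock_ms / batch_size)
```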

A naïve --max-batch=8 ignores per-request cost variance:

  • 8 cache-hit requests (~50 ms each) = 400 ms / batch — fine
  • 8 cache-miss requests (~400 ms each) = 3200 ms / batch — blows the SLO

A cost budget adapts: at --max-batch-cost-ms 400, 2 cache-hits + 1 cache-miss trips the same budget_reached flush as 8 cache-hits, even though one batch is “size 3” and the other is “size 8.”
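
Plugging those numbers into the flush_reason sketch above shows both queues triggering the same decision:

```python
hits = [QueuedRequest(est_cost_ms=50.0) for _ in range(8)]           # 8 × 50 = 400 ms
mixed = [QueuedRequest(est_cost_ms=c) for c in (50.0, 50.0, 400.0)]  # 500 ms

assert flush_reason(hits, max_batch_cost_ms=400.0) == "budget_reached"
assert flush_reason(mixed, max_batch_cost_ms=400.0) == "budget_reached"
```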

The legacy --max-batch=N flag still parses but is ignored at the runtime layer. Setting --max-batch > 1 triggers a one-time deprecation warning at startup. Migration:

--max-batch-cost-ms ≈ max_batch × per_request_cost_estimate_ms

For the 50 ms cold-start default: --max-batch 8 → --max-batch-cost-ms 400.

Every flush emits five Prometheus series. Labels: embodiment × policy_slot (prod for single-policy; a / b / shadow after policy-versioning lands).

| Metric | Type | Description |
| --- | --- | --- |
| reflex_batch_cost_per_flush_ms | Histogram | Estimated GPU-ms cost of each flushed batch |
| reflex_batch_size_per_flush | Histogram | Request count per flush |
| reflex_batch_flush_total | Counter | Cumulative flushes by reason (budget_reached / timeout / single_request_over_budget) |
| reflex_captured_graph_hit_rate | Gauge | Fraction of recent flushes whose batch landed on a captured-graph shape |
| reflex_policy_runtime_queue_depth | Gauge | Current pending requests |
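
As one example of consuming these, the timeout-flush ratio referenced in the troubleshooting table below can be derived from reflex_batch_flush_total. This sketch assumes a Prometheus server at localhost:9090 scraping the runtime, and a reason label carrying the flush reason:

```python
import requests  # third-party HTTP client

PROM_URL = "http://localhost:9090/api/v1/query"  # assumed Prometheus address

def flush_ratio(reason: str, window: str = "5m") -> float:
    """Fraction of recent flushes with the given reason (e.g. 'timeout')."""
    query = (
        f'sum(rate(reflex_batch_flush_total{{reason="{reason}"}}[{window}]))'
        f' / sum(rate(reflex_batch_flush_total[{window}]))'
    )
    resp = requests.get(PROM_URL, params={"query": query}, timeout=5)
    resp.raise_for_status()
    return float(resp.json()["data"]["result"][0]["value"][1])

if __name__ == "__main__":
    print(f"timeout flush ratio: {flush_ratio('timeout'):.1%}")
```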

captured_graph_hit_rate composes with --cuda-graphs. When both are set, the hit rate tells you how often the scheduler actually exercised the captured graph path vs eager fallback.

When the queue hits its capacity (default 1000 requests), /act returns HTTP 503 with:

```json
{
  "error": "queue_full",
  "message": "policy runtime queue at capacity",
  "policy_id": "prod",
  "max_queue": 1000
}
```

and a Retry-After: 1 header. Well-behaved clients with backoff are fine.
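
A minimal client sketch that honors that header, using the requests library (the endpoint URL and payload shape are whatever your deployment uses):

```python
import time
import requests

def act_with_backoff(url: str, payload: dict, max_attempts: int = 5) -> dict:
    """POST to /act, backing off on queue_full 503s per Retry-After."""
    for attempt in range(max_attempts):
        resp = requests.post(url, json=payload, timeout=10)
        if resp.status_code != 503:
            resp.raise_for_status()
            return resp.json()
        # Queue at capacity: honor Retry-After, escalating on repeats.
        delay_s = float(resp.headers.get("Retry-After", 1))
        time.sleep(delay_s * (2 ** attempt))
    raise RuntimeError("policy runtime queue still full after retries")
```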

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| p99 latency spikes | Budget too high (long batches) | Lower --max-batch-cost-ms |
| Throughput plateau | Budget too low (small batches dominate) | Raise --max-batch-cost-ms (in 50 ms steps) |
| timeout flush ratio > 80% | Load too low to benefit from batching | Lower --batch-timeout-ms (e.g. 1 ms) |
| queue_full 503s | Sustained burst exceeds runtime drain rate | Raise --max-concurrent ceiling, or scale horizontally |
| captured_graph_hit_rate < 1.0 | Phase 2 mixed-shape batches | Check gen_ai.action.embodiment distribution |
Not yet implemented:

  • True dynamic-batch ORT dispatch. Decomposed exports are static-shape (per ADR 2026-04-21); batching today fans out sequentially under the queue. Per-request compute cost is unchanged.
  • Sub-queue-by-shape separation. ADR-flagged as a Phase 1.5 follow-up.
  • Cross-policy batching. Requires policy-versioning’s per-embodiment routing (Phase 2).
  • Online bandit cost updates. Phase 2 refinement if profiled GPU-ms proves insufficient.