HTTP /act endpoint

reflex serve listens on :8000 (configurable) with four endpoints. /act is the inference path; the rest are observability.

POST /act

Send {instruction, state, image_b64?, episode_id?} and get back an action chunk (usually 50 steps).

Request

{
  "instruction": "pick up the red cup",
  "state": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
  "image_b64": "<base64-encoded JPEG/PNG>",
  "episode_id": "ep-2026-05-01-001"
}

Field	Type	Required	Notes
`instruction`	string	Yes	Natural-language task spec. ≤ 512 chars (Phase 1).
`state`	array of float	Yes	Proprioceptive state vector. Length must match the embodiment’s `mean_state`.
`image_b64`	string	Optional	Base64-encoded JPEG or PNG. Required when the model is multi-modal (most are).
`image`	array	Optional	Alternative to `image_b64`: a raw HxWxC pixel array. Prefer `image_b64` when sending over HTTP.
`episode_id`	string	Recommended	Stable identifier for the current task. Needed for cache locality, RTC carry-over, and policy-versioning routing.
`request_id`	string	Auto	Set this for client-side request tracing. Defaults to a generated UUID when omitted.
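
A minimal sketch of assembling this request body with only the Python standard library. The helper name `build_act_payload` is hypothetical and not part of Reflex; the field names and the 512-character limit are taken from the table above, and the image bytes shown are a stand-in rather than a real image.

```python
import base64
import json

MAX_INSTRUCTION_CHARS = 512  # Phase 1 limit from the field table


def build_act_payload(instruction, state, image_bytes=None, episode_id=None):
    """Return a JSON-serializable dict for POST /act (hypothetical helper)."""
    if len(instruction) > MAX_INSTRUCTION_CHARS:
        raise ValueError("instruction exceeds 512 characters")
    payload = {
        "instruction": instruction,
        "state": [float(x) for x in state],
    }
    if image_bytes is not None:
        # image_b64 carries the encoded JPEG/PNG file bytes as base64 text.
        payload["image_b64"] = base64.b64encode(image_bytes).decode("ascii")
    if episode_id is not None:
        payload["episode_id"] = episode_id
    return payload


body = json.dumps(build_act_payload(
    "pick up the red cup",
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    image_bytes=b"\x89PNG placeholder bytes",  # stand-in, not a real image
    episode_id="ep-2026-05-01-001",
))
```

Optional fields are left out of the payload entirely rather than sent as null, which keeps the body close to the example request above.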

Response (success)

{
  "actions": [[0.01, -0.02, 0.0, 0.0, 0.0, 0.0, 0.5], ...],
  "num_actions": 50,
  "action_dim": 7,
  "latency_ms": 11.9,
  "hz": 84.0,
  "denoising_steps": 10,
  "inference_mode": "onnx_trt_fp16",
  "guard_clamped": false,
  "guard_violations": []
}

Field	Type	Notes
`actions`	2D array	`num_actions × action_dim` action chunk. Pre-clamped to embodiment ranges.
`num_actions`	int	Length of the chunk. Usually 50 for flow-matching models.
`action_dim`	int	Per-action dimensionality (7 for Franka 6-DOF + gripper).
`latency_ms`	float	Wall-clock from request receipt to response.
`hz`	float	`1000 / latency_ms`.
`denoising_steps`	int	Actual denoise steps used (may be fewer than max if `--adaptive-steps` engaged).
`inference_mode`	string	One of `onnx_trt_fp16`, `onnx_cuda`, `onnx_cpu`, `decomposed`, `monolithic`.
`guard_clamped`	bool	True if ActionGuard clamped any action in the chunk.
`guard_violations`	array	Per-axis violations when `guard_clamped` is true.
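
A client may want to sanity-check a success body against its own metadata before executing the chunk. The sketch below is one way to do that; `check_act_response` is a hypothetical helper, and the 1% tolerance on `hz` is an assumption to allow for rounding in the response.

```python
import math


def check_act_response(resp):
    """Validate an /act success body; return the list of guard violations."""
    actions = resp["actions"]
    if len(actions) != resp["num_actions"]:
        raise ValueError("chunk length does not match num_actions")
    if any(len(a) != resp["action_dim"] for a in actions):
        raise ValueError("an action does not match action_dim")
    # hz is documented as 1000 / latency_ms; tolerate rounding (assumed 1%).
    expected_hz = 1000.0 / resp["latency_ms"]
    if not math.isclose(resp["hz"], expected_hz, rel_tol=0.01):
        raise ValueError("hz is inconsistent with latency_ms")
    if resp["guard_clamped"]:
        # Per-axis violations are reported when ActionGuard clamped something.
        return resp["guard_violations"]
    return []
```

Returning the violations instead of raising lets the caller decide whether a clamped chunk is still acceptable to execute.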

Response (telemetry — when wedges are active)

When you enable wedges, additional fields appear:

{
  "actions": [...],
  "latency_ms": 45.2,

  // a2c2 (when --a2c2-checkpoint is set)
  "a2c2_applied": true,
  "a2c2_reason": "applied",
  "a2c2_correction_magnitude": 0.073,

  // record (when --record is set)
  "record_seq": 1842,

  // policy-versioning (when --policy-a/--policy-b are set)
  "policy_slot": "a",

  // robot identity (when --robot-id is set)
  "robot_id": "warehouse-01"
}

These fields are additive — clients ignoring unknown fields stay backwards-compatible.
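
As an illustration of tolerant parsing, the sketch below picks out the telemetry fields listed above when they are present and ignores anything it does not recognize. `split_telemetry` and `KNOWN_OPTIONAL` are hypothetical names used only for this example.

```python
# Optional telemetry keys shown in the example response above.
KNOWN_OPTIONAL = (
    "a2c2_applied",
    "a2c2_reason",
    "a2c2_correction_magnitude",
    "record_seq",
    "policy_slot",
    "robot_id",
)


def split_telemetry(resp):
    """Return (actions, telemetry), skipping absent and unknown fields."""
    telemetry = {key: resp[key] for key in KNOWN_OPTIONAL if key in resp}
    return resp["actions"], telemetry
```

Because unknown keys are simply skipped, a client written this way keeps working when a newer server adds fields.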

When 2-policy mode is active, the response also carries headers:

X-Reflex-Policy-Slot: a
X-Reflex-Model-Version: pi0-libero-v1@<hash>
X-Reflex-Routing-Key: ep_xyz
X-Reflex-Routing-Degraded: false

Error responses

{
  "error": "queue_full",
  "message": "policy runtime queue at capacity",
  "policy_id": "prod",
  "max_queue": 1000
}

Status codes:

Code	Meaning
200	Success
400	Malformed request (missing field, wrong type, image decode failure)
422	Schema-valid but semantically invalid (e.g. state length doesn’t match embodiment)
500	Internal error (model crashed; check `/health` and the audit log)
503	Server unavailable (warming up, queue full, SLO violation, circuit breaker tripped)

Every 503 response carries a Retry-After header indicating when to retry.
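
A client without the SDK can honor Retry-After along these lines. This is a sketch under stated assumptions: `send` is a hypothetical transport callable returning `(status, headers, body)`, the one-second fallback delay is arbitrary, and both header forms permitted by HTTP (delta-seconds and HTTP-date) are handled because the format Reflex emits is not specified here.

```python
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, default=1.0):
    """Convert a Retry-After header value into a non-negative delay in seconds."""
    if value is None:
        return default
    try:
        return max(0.0, float(value))  # delta-seconds form, e.g. "3"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())


def act_with_retry(send, payload, max_attempts=5, sleep=time.sleep):
    """Call send(payload) and retry only on 503, waiting as Retry-After says."""
    for attempt in range(max_attempts):
        status, headers, body = send(payload)
        if status != 503:
            return status, body
        if attempt + 1 < max_attempts:
            sleep(retry_after_seconds(headers.get("Retry-After")))
    return status, body
```

Only 503 is retried; 400 and 422 indicate a problem with the request itself, so resending the same payload would not help.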

GET /health

curl http://localhost:8000/health

{
  "status": "ready",
  "model_loaded": true,
  "model_version": "pi0-libero-v1@7a8b3c1d",
  "inference_mode": "onnx_trt_fp16",
  "uptime_seconds": 1245,
  "robot_id": "warehouse-01",
  "cuda_graphs_active": true
}

status is one of:

warming — first cold-start (10-70 sec). Returns HTTP 503.
ready — operational. HTTP 200.
degraded — circuit breaker tripped or both 2-policy slots failed. HTTP 503.

Load balancers should treat 503 as “skip this instance” — Reflex never returns 503 for transient single-request failures, only durable instance state.
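
A startup script can wait for the instance to leave the warming state before sending traffic. The sketch below assumes a hypothetical `fetch_health` callable that returns the parsed /health body; the 90-second timeout is an assumption chosen to exceed the 10-70 second cold start, and treating degraded as an immediate failure is a design choice of this example, not documented Reflex behavior.

```python
import time

# Status-to-HTTP-code mapping as listed above.
HEALTH_HTTP_CODE = {"warming": 503, "ready": 200, "degraded": 503}


def wait_until_ready(fetch_health, timeout_s=90.0, poll_s=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_health() until status is "ready"; return True on success."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_health().get("status")
        if status == "ready":
            return True
        if status == "degraded":
            # Degraded is durable instance state, so stop waiting.
            return False
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

The `sleep` and `clock` parameters exist so the loop can be exercised in tests without real delays.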

GET /config

Returns the saved reflex_config.json from the loaded export. Useful for verifying which model + embodiment + version is actually serving.

GET /metrics

Prometheus exposition format. A 15-second scrape interval is typical.

Key metrics:

Metric	Type	Labels
`reflex_act_latency_seconds`	Histogram	`embodiment`, `policy_slot`, `inference_mode`
`reflex_act_total`	Counter	`embodiment`, `status`
`reflex_guard_clamped_total`	Counter	`embodiment`
`reflex_cache_hit_total` / `reflex_cache_miss_total`	Counter	`embodiment`, `policy_slot`
`reflex_in_flight_requests`	Gauge	`policy_slot`
`reflex_robot_info`	Gauge	`robot_id`, `embodiment`, `model_id`
`reflex_slo_violations_total`	Counter	`embodiment`, `kind`
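
For a quick check without a Prometheus server, the cache hit ratio can be computed directly from the scraped text. The sketch below uses a deliberately simplified parser for the text exposition format (it does not handle every escaping rule); the function names are hypothetical.

```python
import re

# metric name, optional {labels}, then the sample value
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')


def sum_metric(text, name):
    """Sum every sample of the metric `name` across all label sets."""
    total = 0.0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match and match.group(1) == name:
            total += float(match.group(3))
    return total


def cache_hit_ratio(text):
    """Return hits / (hits + misses), or None when there are no samples."""
    hits = sum_metric(text, "reflex_cache_hit_total")
    misses = sum_metric(text, "reflex_cache_miss_total")
    total = hits + misses
    return hits / total if total else None
```

In a real deployment the same ratio is usually better expressed as a query over the two counters, since counters reset on restart and a raw ratio ignores that.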

CORS

Default: Access-Control-Allow-Origin: * (open). Override with --cors-origins https://app.example.com to restrict origins. CORS headers exist for browser-based control loops; non-browser clients ignore them, so there is no need to disable CORS in non-browser deploys.

Client SDK

A Python client is provided:

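import numpy as np
from reflex.client import ReflexClient

# Placeholder camera frame (HxWxC, uint8); the 224x224 size is illustrative only.
numpy_frame = np.zeros((224, 224, 3), dtype=np.uint8)

with ReflexClient("http://localhost:8000") as client:
    # episode() scopes the calls below to one episode_id, managed by the client.
    with client.episode() as ep:
        result = ep.act(image=numpy_frame, state=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
        print(result["actions"])  # the returned action chunk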

The client handles 503 retries, episode_id management, and image base64 encoding. For other languages, use any HTTP client; the wire format is plain JSON.
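
To show how little the wire format needs, here is a sketch that builds a POST /act request with only the Python standard library, without the SDK. `make_act_request` is a hypothetical helper, and the `Content-Type: application/json` header is the conventional choice for a JSON body rather than a documented Reflex requirement.

```python
import json
import urllib.request


def make_act_request(base_url, payload):
    """Build (but do not send) a POST /act request carrying a JSON body."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/act",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = make_act_request(
    "http://localhost:8000",
    {"instruction": "pick up the red cup",
     "state": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]},
)

# To send it against a running server:
# with urllib.request.urlopen(req, timeout=5) as resp:
#     result = json.load(resp)
```

Unlike the SDK, this path does not retry on 503 or manage episode_id, so a client built this way has to handle both itself.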