
Fleet telemetry

When you deploy Reflex as one process per robot, --robot-id gives each process a human-readable identity that Prometheus and Grafana can pivot on: per-robot p99, per-robot error rates, and per-robot safety violations, all in one dashboard.

Zero cost when you’re not using it. Single-robot deploys see no extra cardinality.

# On robot A:
reflex serve ./my-export/ --robot-id warehouse-01 --port 8000
# On robot B:
reflex serve ./my-export/ --robot-id warehouse-02 --port 8000
# On robot C:
reflex serve ./my-export/ --robot-id arm-prototype-alpha --port 8000

Your Prometheus config scrapes each instance. In Grafana, import dashboards/reflex_fleet.json and select one or more robots from the dropdown.
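One way to keep the scrape list in step with the fleet is Prometheus file-based service discovery. The sketch below is an assumption about how you might wire it up, not something Reflex ships: it writes the JSON target list that a `file_sd_configs` entry can read. The hostnames and the helper name `file_sd_targets` are hypothetical; port 8000 matches the commands above.

```python
import json


def file_sd_targets(hosts, port=8000):
    """Build a Prometheus file_sd target group with one target per robot process.

    No robot_id label is attached here: that identity comes from the
    reflex_robot_info series each process exports.
    """
    return [{"targets": [f"{host}:{port}" for host in hosts]}]


if __name__ == "__main__":
    # Hypothetical hostnames; replace with the addresses of your robots.
    hosts = ["warehouse-01.local", "warehouse-02.local", "arm-prototype-alpha.local"]
    print(json.dumps(file_sd_targets(hosts), indent=2))
```

Save the output to a file and point a `file_sd_configs` entry in your scrape job at it; Prometheus picks up changes to that file without a restart.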

Each reflex serve process exports a single info-style gauge:

reflex_robot_info{robot_id="warehouse-01",embodiment="franka",model_id="pi0-libero"} 1

Grafana joins hot metrics to this gauge via instance:

histogram_quantile(0.99,
  sum by (le, instance) (rate(reflex_act_latency_seconds_bucket[5m]))
) * on (instance) group_left(robot_id) reflex_robot_info

robot_id appears as a label on p99 latency even though the underlying histogram doesn’t carry it. Cardinality stays flat: reflex_robot_info adds one series per process, and no histogram gains an extra label.
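To make the join concrete, here is a minimal Python sketch of what `* on (instance) group_left(robot_id)` does to the label sets. It is an illustration of the matching rules, not Prometheus code; the function name `join_on_instance` and the sample addresses are made up for the example.

```python
def join_on_instance(values, info_series, copy_labels=("robot_id",)):
    """Mimic `values * on (instance) group_left(robot_id) info`.

    values:      {instance: float}, e.g. p99 latency per scraped process
    info_series: list of label dicts for reflex_robot_info (each has value 1)
    """
    info_by_instance = {}
    for labels in info_series:
        instance = labels["instance"]
        if instance in info_by_instance:
            # PromQL rejects duplicate matches on the "one" side of the join.
            raise ValueError(f"duplicate info series for {instance}")
        info_by_instance[instance] = labels

    joined = []
    for instance, value in values.items():
        info = info_by_instance.get(instance)
        if info is None:
            continue  # unmatched series are dropped from the result
        labels = {"instance": instance}
        for name in copy_labels:
            labels[name] = info[name]  # labels copied in by group_left
        joined.append((labels, value * 1.0))  # the info gauge is always 1
    return joined


if __name__ == "__main__":
    p99 = {"10.0.0.1:8000": 0.12, "10.0.0.2:8000": 0.30}
    info = [{"instance": "10.0.0.1:8000", "robot_id": "warehouse-01"}]
    for labels, value in join_on_instance(p99, info):
        print(labels, value)
```

Note the `continue` branch: an instance with no matching reflex_robot_info series falls out of the joined result, so panels built on this query only show processes started with --robot-id.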

Why not put robot_id as a label on every metric?


A fleet of 1,000 robots × 3 embodiments × 6 models × N metrics multiplies out to 18,000 × N label combinations, which reaches hundreds of thousands of series once N passes a dozen or so. Prometheus handles that but pays memory for it, and the slicing most operators actually want (per-robot, not per-(robot × embodiment)) is available from the info-metric join.

We keep the existing label set tight (embodiment, model_id, violation_kind, etc.) and let operators opt into per-robot slicing via --robot-id + the info join.
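The arithmetic behind that trade-off can be sketched in a few lines. This follows the worst-case cross product used above, which real fleets rarely fill in completely, and the value N = 20 is an assumption chosen only for illustration.

```python
def worst_case_series(robots, embodiments, models, metrics):
    """Upper bound on series if every metric carried every label combination."""
    return robots * embodiments * models * metrics


def info_metric_series(processes):
    """Series added by reflex_robot_info: one per reflex serve process."""
    return processes


if __name__ == "__main__":
    n_metrics = 20  # assumed value for N; substitute your own metric count
    print("robot_id on every metric:", worst_case_series(1000, 3, 6, n_metrics))
    print("extra series from the info metric:", info_metric_series(1000))
```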

Every reflex serve process surfaces the robot_id via:

  • GET /health: "robot_id": "warehouse-01" in the JSON body
  • GET /config: same key
  • GET /metrics: reflex_robot_info{robot_id="warehouse-01",...} (when set)

When --robot-id is unset, robot_id is "" on /health and /config, and no reflex_robot_info series is emitted.
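A small client-side check can read the identity back from a running process. The sketch below assumes only what is described above: /health returns a JSON body with a "robot_id" key that is "" when the flag is unset. The helper names and the localhost URL are hypothetical.

```python
import json
import urllib.request


def robot_id_from_health(body):
    """Return the robot_id from a /health JSON body, or None when it is unset."""
    payload = json.loads(body)
    # An unset --robot-id is reported as an empty string.
    return payload.get("robot_id", "") or None


def fetch_robot_id(base_url):
    """Fetch /health from a reflex serve process and extract its robot_id."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as response:
        return robot_id_from_health(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical address; point this at one of your robots.
    print(fetch_robot_id("http://localhost:8000"))
```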

- alert: ReflexRobotLatencyHigh
  expr: |
    histogram_quantile(0.99,
      sum by (le, instance) (rate(reflex_act_latency_seconds_bucket[5m]))
    ) * on (instance) group_left(robot_id) reflex_robot_info{robot_id="warehouse-01"} > 0.2
  for: 3m
  labels: { severity: page, robot_id: warehouse-01 }
  annotations:
    summary: "Reflex on {{ $labels.robot_id }} over p99=200ms"

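Drop the robot_id= filter to alert on any robot in the fleet. Remove the static robot_id: warehouse-01 entry under labels as well, since rule labels override the robot_id that the join supplies.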

# systemd unit per robot: the %H specifier expands to the hostname, giving each robot its own identity
ExecStart=/usr/local/bin/reflex serve /opt/reflex/export \
  --robot-id %H \
  --port 8000 \
  --slo p99=150ms

Don’t route inference through a central aggregator. Reflex does per-process inference; a central aggregator adds network latency that violates the real-time invariant. Instead, scrape each robot’s /metrics from a central Prometheus and render one dashboard against the aggregate.