Fleet telemetry

When you run one Tether process per robot, --robot-id gives each process a human-readable identity that Prometheus and Grafana can pivot on: per-robot p99 latency, per-robot error rates, and per-robot safety violations, all in one dashboard.

The flag costs nothing when you’re not using it: without --robot-id no extra series is emitted, so single-robot deploys see no added cardinality.

Quick start

# On robot A:
tether serve ./my-export/ --robot-id warehouse-01 --port 8000

# On robot B:
tether serve ./my-export/ --robot-id warehouse-02 --port 8000

# On robot C:
tether serve ./my-export/ --robot-id arm-prototype-alpha --port 8000

Configure Prometheus to scrape each instance. In Grafana, import dashboards/tether_fleet.json and select one or more robots from the dropdown.
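One way to keep the scrape target list in sync with the fleet is Prometheus file-based service discovery, which reads target groups from a JSON file. The sketch below is only an illustration: the hostnames and the build_file_sd helper are hypothetical, and port 8000 is taken from the commands above. It deliberately attaches no robot_id target label, since the identity is meant to come from the process itself.

```python
import json


def build_file_sd(hosts, port=8000):
    """Build a Prometheus file_sd target list with one entry per robot host.

    `hosts` is a hypothetical list of robot hostnames or IPs. No robot_id
    label is added here; each process reports its own identity.
    """
    targets = [f"{host}:{port}" for host in sorted(hosts)]
    return [{"targets": targets}]


if __name__ == "__main__":
    # Placeholder hostnames, not real Tether defaults.
    fleet = ["robot-a.example", "robot-b.example", "robot-c.example"]
    print(json.dumps(build_file_sd(fleet), indent=2))
```

Write the output to a file referenced from a file_sd_configs entry in your scrape job; Prometheus picks up changes to that file without a restart.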

How it works

Each tether serve process exports a single info-style gauge:

reflex_robot_info{robot_id="warehouse-01",embodiment="franka",model_id="pi0-libero"} 1

Dashboard queries join hot metrics to this gauge on the instance label:

histogram_quantile(0.99,
  sum by (le, instance) (rate(reflex_act_latency_seconds_bucket[5m]))
) * on (instance) group_left(robot_id) reflex_robot_info

robot_id appears as a label on the p99 latency result even though the underlying histogram doesn’t carry it. Cardinality stays flat: the only added series is one reflex_robot_info per process, and no histogram gains a label.
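To make the join semantics concrete, here is a small Python emulation of the on (instance) group_left(robot_id) match. It is a sketch of how PromQL vector matching behaves on toy data, not Tether or Prometheus code; the join_robot_id helper and the sample values are made up for illustration.

```python
def join_robot_id(values, info):
    """Emulate `values * on (instance) group_left(robot_id) info`.

    values: {instance: float}, e.g. per-instance p99 latency in seconds
    info:   {instance: robot_id}, one entry per reflex_robot_info series
    """
    joined = {}
    for instance, value in values.items():
        robot_id = info.get(instance)
        if robot_id is None:
            # No matching info series: PromQL drops the sample from the result.
            continue
        # The info gauge is always 1, so multiplying leaves the value unchanged.
        joined[(instance, robot_id)] = value * 1
    return joined


if __name__ == "__main__":
    p99 = {"10.0.0.1:8000": 0.120, "10.0.0.2:8000": 0.310, "10.0.0.3:8000": 0.090}
    info = {"10.0.0.1:8000": "warehouse-01", "10.0.0.2:8000": "warehouse-02"}
    for (instance, robot_id), value in join_robot_id(p99, info).items():
        print(instance, robot_id, value)
```

Note the third instance: a process started without --robot-id emits no info series, so it drops out of any query that uses this join.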

Why not put `robot_id` as a label on every metric?

With robot_id on every metric, a fleet of 1,000 robots × 3 embodiments × 6 models × N metrics can reach hundreds of thousands of series. Prometheus can handle that, but it pays for every series in memory. The slicing most operators actually want is per-robot, not per-(robot × embodiment), and the info-metric join already provides it.

We keep the existing label set tight (embodiment, model_id, violation_kind, etc.) and let operators opt into per-robot slicing via --robot-id + the info join.
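The cardinality argument can be checked with quick arithmetic. The sketch below plugs in the fleet figures from the text; N = 20 is an assumed value chosen only for illustration, and the product is the upper bound used above rather than a measurement.

```python
def labelled_series_upper_bound(robots, embodiments, models, n_metrics):
    """Upper bound on series if robot_id were a label on every metric."""
    return robots * embodiments * models * n_metrics


def info_metric_series(robots):
    """Series added by the info-metric approach: one reflex_robot_info per process."""
    return robots


if __name__ == "__main__":
    n_metrics = 20  # assumed N, for illustration only
    print(labelled_series_upper_bound(1_000, 3, 6, n_metrics))  # 360000
    print(info_metric_series(1_000))  # 1000
```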

Endpoints that expose `robot_id`

Every tether serve process surfaces the robot_id via:

GET /health — "robot_id": "warehouse-01" in the JSON body
GET /config — same key
GET /metrics — reflex_robot_info{robot_id="warehouse-01",...} (when set)

When --robot-id is unset, robot_id is "" on /health and /config, and no reflex_robot_info series is emitted.
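A client that reads the identity back should treat the empty string as "unset". The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and it assumes nothing about the /health body beyond the robot_id key described above.

```python
import json
from typing import Optional
from urllib.request import urlopen


def robot_id_from_body(body: str) -> Optional[str]:
    """Return robot_id from a /health or /config JSON body, or None if unset."""
    robot_id = json.loads(body).get("robot_id", "")
    return robot_id or None  # "" means --robot-id was not passed


def fetch_robot_id(base_url: str, timeout: float = 2.0) -> Optional[str]:
    """Fetch /health from a running process and extract its robot_id."""
    with urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return robot_id_from_body(resp.read().decode("utf-8"))
```

For a process started as in the quick start, fetch_robot_id("http://localhost:8000") would be expected to return the value passed to --robot-id, or None when the flag was omitted.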

Alerting on a single robot

- alert: ReflexRobotLatencyHigh
  expr: |
    histogram_quantile(0.99,
      sum by (le, instance) (rate(reflex_act_latency_seconds_bucket[5m]))
    ) * on (instance) group_left(robot_id) reflex_robot_info{robot_id="warehouse-01"} > 0.2
  for: 3m
  labels: { severity: page, robot_id: warehouse-01 }
  annotations:
    summary: "Tether on {{ $labels.robot_id }} over p99=200ms"

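To alert on any robot in the fleet, drop the robot_id= filter and the static robot_id entry under labels. The join already attaches robot_id to each result, so every alert carries the right robot.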

Deployment patterns

One process per robot (recommended)

# systemd unit per robot; the %H specifier expands to the hostname, giving each robot its own identity
ExecStart=/usr/local/bin/tether serve /opt/tether/export \
    --robot-id %H \
    --port 8000 \
    --slo p99=150ms

Central aggregator

Don’t. Tether does per-process inference; a central aggregator adds network latency that violates the real-time invariant. Instead, scrape each robot’s /metrics from a central Prometheus and render one dashboard against the aggregate.