
tether doctor

tether doctor --model ./my-export/

tether doctor runs 10 registered diagnostic checks (in src/tether/diagnostics/) plus a system-probe layer, added in v0.9.3 and v0.9.4, that covers silent-failure traps the static checks can't see. Per the falsifiability gate, every registered check has at least one pass test and one fail test in tests/test_doctor_diagnostics.py.

| ID | Name | What it tests |
| --- | --- | --- |
| check_model_load | Model load | Export dir exists + contains ONNX + fits in available RAM (×1.4 overhead, 20% headroom) |
| check_onnx_provider | ONNX provider | onnxruntime importable + CPU EP present (always required) + GPU EP noted |
| check_vlm_tokenization | VLM tokenization | Tokenizer config loads + 5 probe prompts produce in-range token IDs |
| check_image_dims | Image dim mismatch | embodiment.cameras[*].resolution appears in ONNX image input shape |
| check_action_denorm | Action denormalization | embodiment.normalization.mean_action / std_action length == action_dim, no NaN/Inf, std > 0 |
| check_gripper | Gripper config | gripper.component_idx < action_dim, close_threshold ∈ [0, 1], inverted flag sanity |
| check_state_proprio | State/proprio dtype | ONNX state input is float32 (not float64; silent truncation drops fps to ~0.3) |
| check_gpu_memory | GPU memory | nvidia-smi reports ≥ 90% headroom over estimated model footprint (×1.6 file size for KV + activations) |
| check_rtc_chunks | RTC chunk boundary | chunk_size ≥ frequency_hz × rtc_execution_horizon (one horizon's worth of actions) |
| check_hardware_compat | Hardware compat | CUDA driver ≥ 12.x + ORT GPU EP present when CUDA detected |

Each check links back to a load-bearing LeRobot GitHub issue. The full table with issue links is in docs/doctor_check_list.md in the source repo.
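As an illustration of what one of these checks verifies, the action-denormalization conditions from the table (length == action_dim, no NaN/Inf, std > 0) can be sketched in a few lines. This is a minimal sketch, not the code in src/tether/diagnostics/; the function name validate_action_stats and its return shape are assumptions made for the example.

```python
import math
from typing import Sequence


def validate_action_stats(
    mean_action: Sequence[float],
    std_action: Sequence[float],
    action_dim: int,
) -> list[str]:
    """Return a list of problems; an empty list means the stats look sane.

    Hypothetical helper: mirrors the conditions listed for check_action_denorm.
    """
    problems: list[str] = []
    for label, values in (("mean_action", mean_action), ("std_action", std_action)):
        if len(values) != action_dim:
            problems.append(f"{label} has length {len(values)}, expected {action_dim}")
        if any(not math.isfinite(v) for v in values):
            problems.append(f"{label} contains NaN or Inf")
    # Non-finite values were already reported above, so only test finite ones here.
    if any(math.isfinite(s) and s <= 0 for s in std_action):
        problems.append("std_action must be strictly positive")
    return problems


print(validate_action_stats([0.0] * 7, [1.0] * 7, action_dim=7))
print(validate_action_stats([0.0, float("nan")], [1.0, 0.0], action_dim=2))
```

A zero entry in std_action matters because denormalization multiplies by it, which would pin that action component to its mean.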

System-probe rows (added in v0.9.3 – v0.9.4)


These fire from the CLI top-level (not the registered diagnostics/ pipeline) and target silent-failure traps that aren’t fingerprintable from the export alone:

| Row | Detects |
| --- | --- |
| Blackwell sm_120 / ORT bump | RTX 50-series / B200 / GB200 hardware on onnxruntime-gpu < 1.25.1 → loud fail with exact pip install -U upgrade command |
| JetPack target check | Jetson via /etc/nv_tegra_release: R35 (CUDA 11.4) fails ORT 1.20+'s CUDA 12.x requirement (silent CPU fallback). R36+ passes. |
| cuDNN-vs-driver skew | cuDNN 9.5+ needs NVIDIA driver R555+; pinning old driver via apt-hold + bundled cuDNN 9.5 silently fails at first inference. |
| ORT-TRT EP empirical session test | available_providers says lib loaded; this row creates a stub session + forces TRT EP + checks sess.get_providers() to catch missing libnvinfer.so.10 on dlopen path. |
| Multi-GPU mixed-architecture | 2+ GPUs of different generations (e.g. 1× H100 + 1× RTX 5090). ORT uses CUDA_VISIBLE_DEVICES[0] only, so the row surfaces the trap. |
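To make the JetPack row concrete, here is a minimal sketch of how an L4T major release could be read from the contents of /etc/nv_tegra_release and compared against the R36 threshold. The parsing used by tether itself is not shown in this page; the function names l4t_major and jetpack_ok are hypothetical, and the sample line assumes the usual "# R35 (release), REVISION: ..." layout of that file.

```python
import re
from typing import Optional

# Assumed example of the first line of /etc/nv_tegra_release on an R35 image.
SAMPLE = "# R35 (release), REVISION: 4.1, GCID: 33958178, BOARD: t186ref, EABI: aarch64"


def l4t_major(release_text: str) -> Optional[int]:
    """Extract the L4T major release (e.g. 35) from the file's contents."""
    match = re.search(r"^#\s*R(\d+)\s*\(release\)", release_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def jetpack_ok(release_text: str, minimum_major: int = 36) -> Optional[bool]:
    """True/False on a Jetson; None when the text is not a Tegra release file."""
    major = l4t_major(release_text)
    return None if major is None else major >= minimum_major


print(l4t_major(SAMPLE), jetpack_ok(SAMPLE))
```

Returning None for non-Jetson hosts lets a caller report the row as not applicable instead of as a failure.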

Every check returns a CheckResult (see src/tether/diagnostics/__init__.py):

| Field | Type | Notes |
| --- | --- | --- |
| check_id | str | Stable ID (e.g. check_model_load); used by --skip |
| name | str | Human-readable name |
| status | enum | pass / fail / warn / skip |
| expected | str | What the check wanted to see |
| actual | str | What it actually saw |
| remediation | str | Required when status="fail". Empty otherwise. |
| duration_ms | float | Wall-clock for the check |
| github_issue | str or None | URL to the load-bearing LeRobot issue |

Falsifiability gate: CheckResult.__post_init__ raises ValueError if status="fail" and remediation is empty. Enforced at construction time so a check with no fix-it suggestion can never ship.
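The gate can be pictured as a dataclass whose __post_init__ rejects a failing result that carries no remediation. The following is a sketch based only on the field table above, not a copy of src/tether/diagnostics/__init__.py: the real status is an enum (a plain string is used here for brevity), and the default values are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    check_id: str
    name: str
    status: str  # "pass" | "fail" | "warn" | "skip" (an enum in the real code)
    expected: str
    actual: str
    remediation: str = ""  # assumed default
    duration_ms: float = 0.0  # assumed default
    github_issue: Optional[str] = None

    def __post_init__(self) -> None:
        # Falsifiability gate: a failing result must say how to fix the problem.
        if self.status == "fail" and not self.remediation:
            raise ValueError(f"{self.check_id}: status='fail' requires a remediation")


try:
    CheckResult("check_state_proprio", "State/proprio dtype", "fail",
                expected="float32", actual="float64")
except ValueError as err:
    print(f"rejected: {err}")
```

Because the check runs in the constructor, the invalid object never exists, so no later code path has to re-validate it.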

  • pass — verified the expected condition. No action.
  • fail — verified a known-broken condition. Doctor exits 1. Caller should follow remediation.
  • warn — non-blocking concern (e.g. CPU-only on a system that should have GPU). Doctor exits 0 but the warning is surfaced.
  • skip — couldn’t run because a precondition wasn’t met (e.g. embodiment is custom so embodiment-dependent checks have nothing to compare against).
Tether Doctor v0.9.6
Checking: ./my-export/
✓ Model load (12 ms)
✓ ONNX provider (8 ms) — TensorRT, CUDA, CPU EPs available
✓ VLM tokenization (43 ms)
✓ Image dim mismatch (3 ms) — 224×224 matches export
✓ Action denormalization (2 ms)
✓ Gripper config (2 ms)
✗ State/proprio dtype (1 ms)
Expected: state input is float32
Actual: state input is float64
Fix: Cast state to np.float32 before sending to /act. Float64 silently
truncates to float32, dropping fps to ~0.3 in production.
See https://github.com/huggingface/lerobot/issues/2458
✓ GPU memory (15 ms) — 18 GB available, 4 GB needed
⚠ RTC chunk boundary (1 ms)
Expected: chunk_size ≥ frequency_hz × rtc_execution_horizon (15 ≥ 30 × 0.5)
Actual: chunk_size = 15, frequency_hz = 30, rtc_execution_horizon = 0.5
(15 = 15 — passes by hair; recommend chunk_size = 25 for headroom)
✓ Hardware compat (84 ms)
Summary: 8 pass / 1 warn / 1 fail
Exit: 1 (fail)
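The summary line and exit code in the run above follow directly from the status rules: any fail yields exit 1, while warn leaves the exit code at 0. A minimal sketch of that aggregation is below; the helper name summarize is hypothetical, and treating skip as non-blocking is an assumption, since only fail and warn have a documented effect on the exit code.

```python
from collections import Counter
from typing import Iterable


def summarize(statuses: Iterable[str]) -> tuple[dict[str, int], int]:
    """Return (counts per status, process exit code)."""
    counts = Counter(statuses)
    summary = {key: counts.get(key, 0) for key in ("pass", "warn", "fail", "skip")}
    # Only "fail" blocks; "skip" is assumed to be non-blocking like "warn".
    exit_code = 1 if summary["fail"] else 0
    return summary, exit_code


summary, exit_code = summarize(["pass"] * 8 + ["warn", "fail"])
print(summary, exit_code)
```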
tether doctor --model ./my-export/ --format json | jq .
{
  "schema_version": 1,
  "reflex_version": "0.9.6",
  "model_path": "./my-export/",
  "embodiment": "franka",
  "checks": [
    {
      "check_id": "check_model_load",
      "name": "Model load",
      "status": "pass",
      "expected": "...", "actual": "...",
      "remediation": "",
      "duration_ms": 12.4,
      "github_issue": "https://github.com/huggingface/lerobot/issues/386"
    },
    /* ... */
  ],
  "summary": {"pass": 8, "warn": 1, "fail": 1, "skip": 0}
}

Schema v1 is locked — additive fields don’t bump version, breaking changes do.
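A consumer of the JSON report might look like the sketch below, which collects the remediation for every failing check. The REPORT string is a trimmed stand-in written for this example, not captured output, and failed_checks is a hypothetical helper. It reads only the keys it needs, so additive fields in later releases would not break it.

```python
import json

# Trimmed stand-in for `tether doctor --format json` output (schema_version 1).
REPORT = """
{
  "schema_version": 1,
  "checks": [
    {"check_id": "check_model_load", "status": "pass", "remediation": ""},
    {"check_id": "check_state_proprio", "status": "fail",
     "remediation": "Cast state to np.float32 before sending to /act."}
  ],
  "summary": {"pass": 1, "warn": 0, "fail": 1, "skip": 0}
}
"""


def failed_checks(report_text: str) -> list[tuple[str, str]]:
    """Return (check_id, remediation) for every failing check."""
    report = json.loads(report_text)
    # A breaking change bumps schema_version, so refuse anything but v1.
    if report.get("schema_version") != 1:
        raise ValueError("unsupported doctor schema version")
    return [
        (check["check_id"], check["remediation"])
        for check in report["checks"]
        if check["status"] == "fail"
    ]


for check_id, fix in failed_checks(REPORT):
    print(f"{check_id}: {fix}")
```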

To add a new check:

  1. Create src/tether/diagnostics/check_<name>.py with a _run(model_path, embodiment_name, **kwargs) -> CheckResult function
  2. At the bottom of that file: register(Check(check_id=..., name=..., severity=..., github_issue=..., run_fn=_run))
  3. Import the new module in _ensure_registry_loaded() in src/tether/diagnostics/__init__.py
  4. Add at least 1 pass test + 1 fail test to tests/test_doctor_diagnostics.py
  5. Update the canonical doc table (docs/doctor_check_list.md)

The registry is auto-loaded; no other wiring needed.
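The shape of a new check module can be sketched as follows. To keep the example runnable on its own, CheckResult, Check, register, and REGISTRY are simplified stand-ins defined inline; in the real package they come from src/tether/diagnostics/__init__.py and may differ. The check itself (check_readme) and the severity value "warn" are purely hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


# Stand-ins for the real tether.diagnostics types (assumed, simplified).
@dataclass
class CheckResult:
    check_id: str
    name: str
    status: str
    expected: str
    actual: str
    remediation: str = ""


@dataclass
class Check:
    check_id: str
    name: str
    severity: str
    github_issue: Optional[str]
    run_fn: Callable[..., CheckResult]


REGISTRY: dict[str, Check] = {}


def register(check: Check) -> None:
    REGISTRY[check.check_id] = check


# Body of a hypothetical src/tether/diagnostics/check_readme.py
def _run(model_path: str, embodiment_name: str, **kwargs) -> CheckResult:
    found = (Path(model_path) / "README.md").is_file()
    return CheckResult(
        check_id="check_readme",
        name="Export README",
        status="pass" if found else "fail",
        expected="README.md present in export dir",
        actual="found" if found else "missing",
        # A fail must always carry a fix-it suggestion.
        remediation="" if found else "Add a README.md to the export directory.",
    )


register(Check(check_id="check_readme", name="Export README", severity="warn",
               github_issue=None, run_fn=_run))
```

The pass and fail tests from step 4 would then call the registered run_fn once against a directory that contains the file and once against one that does not.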

# Skip specific checks (CSV, by check_id)
tether doctor --model ./my-export/ --skip check_gpu_memory,check_hardware_compat
# Run only environment checks (no model needed)
tether doctor

Skipped checks return status=skip with a reason. Don’t silently drop them — operators want to see what wasn’t verified.
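One way to honor that rule is to decide per check whether it runs or is skipped, and keep both kinds in the report. The sketch below shows the idea with two hypothetical helpers, parse_skip and plan; it is not the CLI's actual argument handling.

```python
def parse_skip(csv_value: str) -> set[str]:
    """Split a --skip value such as 'check_a, check_b' into a set of IDs."""
    return {item.strip() for item in csv_value.split(",") if item.strip()}


def plan(check_ids: list[str], skip: set[str]) -> list[tuple[str, str]]:
    """Return (check_id, action) pairs, keeping skipped checks visible."""
    return [(cid, "skip" if cid in skip else "run") for cid in check_ids]


skip = parse_skip("check_gpu_memory,check_hardware_compat")
for check_id, action in plan(
    ["check_model_load", "check_gpu_memory", "check_hardware_compat"], skip
):
    print(f"{check_id}: {action}")
```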