
reflex eval

`reflex eval ./my-export/ --suite libero --num-episodes 3` — one command, and you get a LIBERO success rate, per-task numbers, optional MP4 clips, and cost transparency. Wraps the existing Modal image, the osmesa/MuJoCo recipe, and the vla-eval adapter.

Per ADR 2026-04-25-eval-as-a-service-architecture. Phase 1 ships LIBERO only on Modal (with Linux x86_64 local fallback); Phase 2 adds SimplerEnv + customer suite + HF Hub video upload.

```sh
# 1. Set up Modal auth (skip if --runtime local)
modal token new

# 2. Smoke run (~$0.20, ~3 minutes on A10G cold start)
reflex eval ./my-export/ \
  --suite libero \
  --num-episodes 3 \
  --tasks libero_spatial

# 3. Full bench (~$10, ~30 minutes)
reflex eval ./my-export/ \
  --suite libero \
  --num-episodes 50 \
  --video \
  --output ./eval-out/

# 4. Cost preview before kicking off something expensive
reflex eval ./my-export/ \
  --num-episodes 100 \
  --cost-preview
```

Output: `./eval-out/report.json` (machine-readable, schema v1), plus `./eval-out/videos/<task>_episode_<N>.mp4` when `--video` is set.

Every research group evaluating a new VLA asks for the same thing: “give me task-success numbers I can put in a paper.” The existing path is: clone `modal_libero_*.py`, figure out auth, figure out dependency pins, handle the 5 documented failure modes, and parse the output yourself — 1-2 days of yak-shaving per group. `reflex eval` ships the whole path in one verb.

| Flag | Default | Notes |
| --- | --- | --- |
| `--suite` | `libero` | Phase 1 ships LIBERO only. Phase 2: `simpler`, `customer`. |
| `--num-episodes` | `3` | Per task. 3 = smoke; 50-100 = published-paper grade. |
| `--tasks` | (all) | Comma-separated. Empty = the 4 LIBERO families (spatial / object / goal / 10). |
| `--runtime` | `modal` | `modal` = bundled image (turnkey). `local` = Linux x86_64 + the `[eval-local]` extra. |
| `--seed` | `0` | Pass `--seed 7` to reproduce prior `modal_libero_*.py` published runs. |
| `--max-parallel` | `1` | Honored when the runtime supports it (Modal: yes; local: no). |
| `--cost-preview` | `false` | Dry run: estimate cost without invoking. |
| `--video` | `false` | Per-episode MP4 to `<output>/videos/`. Cap ~10 MB per episode. |
| `--output` | `./eval_output` | Directory for the JSON envelope + (optional) videos. |
| `--preflight-timeout` | `300` | Seconds for the LIBERO smoke test. Cold osmesa scene compile can take 60-180 s. |

Before invoking the expensive run, `reflex eval` runs a pre-flight in an isolated subprocess that exercises the LIBERO init path. It catches 4 of the 5 documented LIBERO failure modes in ~2 seconds — before you spend money on a doomed run.

| Mode | Fix |
| --- | --- |
| `input-hang` | Run `scripts/patch_libero.py` first (or use `--runtime modal` — the bundled image patches this in). |
| `egl-black-frames` | Force `MUJOCO_GL=osmesa`. |
| `dep-version-conflict` | Pin `robosuite==1.4.1`, `bddl==1.0.1`, `mujoco==3.3.2` — or use `--runtime modal`. |
| `osmesa-compile-hang` | Increase `--preflight-timeout` (cold containers take 60-180 s for the first-scene compile). |
| `import-error` | `pip install 'reflex-vla[eval-local]'` for local; `--runtime modal` for the bundled image. |

The 5th failure (per-episode OOM) is per-call probabilistic; backoff + a legible error in the runner covers it.
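The pre-flight pattern itself — an init probe in an isolated subprocess, a hard timeout, stderr-signature classification — can be sketched as follows. This is a hypothetical helper, and the signature map is illustrative, not the runner's real classifier:

```python
import subprocess

# Stderr signatures mapped to failure-mode labels (illustrative subset).
SIGNATURES = {
    "EOFError": "input-hang",
    "ModuleNotFoundError": "import-error",
}

def preflight(child_cmd: list[str], timeout_s: float = 300.0) -> str:
    """Run an init probe in an isolated subprocess.

    Returns "ok" on clean exit, otherwise a failure-mode label.
    """
    try:
        proc = subprocess.run(
            child_cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # A stall past the deadline is treated as a compile hang here; the
        # real pre-flight can inspect partial output to distinguish causes.
        return "osmesa-compile-hang"
    if proc.returncode == 0:
        return "ok"
    for needle, mode in SIGNATURES.items():
        if needle in proc.stderr:
            return mode
    return "adapter_error"
```

In the real runner the child command would import LIBERO and compile one scene; any crash signature it does not recognize falls through to `adapter_error`, matching the catch-all semantics described below.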

`<output>/report.json` is the machine-readable envelope. Schema v1 is locked; Phase 2 evolution is additive-only. Customers grep these fields in CI, so renaming a field is a breaking change.

```jsonc
{
  "schema_version": 1,
  "reflex_version": "0.7.0",
  "suite": "libero",
  "runtime": "modal",
  "seed": 0,
  "started_at": "2026-04-25T14:30:00Z",
  "finished_at": "2026-04-25T14:33:21Z",
  "wall_clock_s": 201.0,
  "tasks": ["libero_spatial", "libero_object", "libero_goal", "libero_10"],
  "num_episodes_per_task": 3,
  "aggregate": {"success_rate": 0.83, "n_success": 10, "n_total": 12},
  "results": [
    {"task_id": "libero_spatial", "n_success": 3, "n_total": 3, "success_rate": 1.0}
    /* ... per-task ... */
  ],
  "episodes": [
    {"task_id": "libero_spatial", "episode_index": 0, "success": true,
     "terminal_reason": "success", "wall_clock_s": 28.4, "n_steps": 200,
     "video_path": "./eval-out/videos/libero_spatial_episode_0.mp4",
     "error_message": null}
    /* ... flat list across all tasks ... */
  ],
  "cost": {
    "total_usd": 0.50,
    "by_task": {"libero_spatial": 0.175 /* ... */},
    "cost_table_schema_version": 1
  },
  "modal": {"image_digest": "sha256:abc...", "provider": "modal.com"},
  "env": {
    "git_sha": "deadbeefcafe", "git_dirty": false,
    "python_version": "3.13.11", "platform": "Darwin-25.3.0-arm64",
    "onnx_files": [{"name": "model.onnx", "sha256": "...", "bytes": 12345}]
  }
}
```
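Because schema v1 is additive-only, a CI gate can read these fields directly. A minimal sketch (the function name and threshold are illustrative; field names come from the envelope above):

```python
import json
from pathlib import Path

def gate_on_report(path: str, min_rate: float = 0.8) -> float:
    """Read a v1 report envelope and fail loudly below a success-rate floor."""
    report = json.loads(Path(path).read_text())
    if report["schema_version"] != 1:
        raise ValueError(f"unexpected schema_version: {report['schema_version']}")
    rate = report["aggregate"]["success_rate"]
    if rate < min_rate:
        # Non-zero exit fails the CI job with a readable one-liner.
        raise SystemExit(f"success_rate {rate:.2f} below floor {min_rate:.2f}")
    return rate
```

Keying only on `schema_version`, `aggregate`, and `success_rate` is exactly why those names are load-bearing: any rename breaks gates like this one.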

`terminal_reason` is a bounded enum, stable across releases:

- `success` — task completed successfully (`success: true` required)
- `timeout` — episode hit `--preflight-timeout` or a runner-side cap
- `bddl_failure` — the task's BDDL file failed to parse
- `rendering_failure` — the osmesa / EGL render returned an error
- `adapter_error` — anything else the runner didn't classify

Cross-field invariant: `success == true` if and only if `terminal_reason == "success"`. Enforced at construction time.
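Construction-time enforcement can look like the following sketch (the class name and fields mirror the episode rows above; the runner's actual type is internal):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class EpisodeResult:
    task_id: str
    episode_index: int
    success: bool
    terminal_reason: str  # one of the bounded enum values above
    error_message: Optional[str] = None

    def __post_init__(self):
        # Cross-field invariant: success <=> terminal_reason == "success".
        if self.success != (self.terminal_reason == "success"):
            raise ValueError(
                f"success={self.success} inconsistent with "
                f"terminal_reason={self.terminal_reason!r}"
            )
```

Rejecting the inconsistent pair at construction means no downstream consumer ever has to decide which of the two fields to trust.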

`reflex eval --cost-preview` prints a dollar estimate before invoking anything. The cost table is baked in at ship time and refreshed quarterly against actual Modal billing logs.

| Suite × Runtime | $ / episode | $ / task startup |
| --- | --- | --- |
| libero × modal | $0.025 (A10G) | $0.10 (cold container + image pull + osmesa scene compile) |
| libero × local | $0 | $0 |

An estimate above $50 triggers an “are you sure?” prompt so you don't accidentally fire a 1000-episode × 90-task run.
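The preview arithmetic follows directly from the table: startup is paid once per task, episodes are linear after that. A sketch with the rates copied from above (function names are illustrative):

```python
# Rates from the cost table above (libero × modal).
PER_EPISODE_USD = 0.025       # A10G GPU time per episode
PER_TASK_STARTUP_USD = 0.10   # cold container + image pull + scene compile
GUARDRAIL_USD = 50.0

def estimate_usd(n_tasks: int, episodes_per_task: int) -> float:
    """Startup once per task, then per-episode GPU time."""
    return n_tasks * (PER_TASK_STARTUP_USD + episodes_per_task * PER_EPISODE_USD)

def needs_confirmation(n_tasks: int, episodes_per_task: int) -> bool:
    """True when the estimate crosses the are-you-sure guardrail."""
    return estimate_usd(n_tasks, episodes_per_task) > GUARDRAIL_USD
```

One smoke task at 3 episodes comes to 1 × (0.10 + 3 × 0.025) = $0.175, in line with the ~$0.20 smoke figure quoted earlier; a 90-task × 1000-episode run lands far past the guardrail.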

`modal_libero_*.py` produced our published 80%+ LIBERO numbers with `seed=7`. To reproduce:

```sh
reflex eval ./my-export/ \
  --suite libero \
  --num-episodes 50 \
  --seed 7 \
  --runtime modal
```

The `env` block in `report.json` captures `git_sha`, `python_version`, `platform`, and a sha256 per `*.onnx` file — enough to re-run and cross-check. Treat `cost_table_schema_version` as the canonical pin for cost numbers.
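Recomputing the `env.onnx_files` entries for a local export is a few lines; this sketch (hypothetical function name) produces dicts in the same shape as the envelope above, so they can be compared directly against a prior run:

```python
import hashlib
from pathlib import Path

def onnx_fingerprint(export_dir: str) -> list[dict]:
    """Recompute {name, sha256, bytes} for every *.onnx in a directory."""
    entries = []
    for p in sorted(Path(export_dir).glob("*.onnx")):
        data = p.read_bytes()
        entries.append({
            "name": p.name,
            "sha256": hashlib.sha256(data).hexdigest(),
            "bytes": len(data),
        })
    return entries
```

If the recomputed list matches `report["env"]["onnx_files"]`, you are evaluating byte-identical model files.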

`Pre-flight FAILED (osmesa-compile-hang, 300.1s)` — cold containers take 60-180 s for the first-scene compile. Bump `--preflight-timeout 600` and retry.

`Pre-flight FAILED (dep-version-conflict, ...)` — pin `robosuite==1.4.1`, `bddl==1.0.1`, `mujoco==3.3.2`, or drop `--runtime local` and use `--runtime modal`.

`Estimate exceeds $50 guardrail` — drop `--num-episodes` or use fewer `--tasks`. The guardrail is conservative; if you really do want a 1000-episode run, just hit Enter and it proceeds.

`All episodes returned adapter_error (exit 5)` — the Modal subprocess crashed mid-run. Check `report.json` for the `error_message` on per-episode rows; the first ~500 characters of stderr surface there.

`modal CLI not found on PATH (exit 6)` — install via `pip install modal`, then run `modal token new`. `--runtime modal` cannot proceed without it; we never silently fall back to local (that would mask real config issues and spring cost surprises).
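That guard is simple to express; a minimal sketch (hypothetical function name, exit code taken from the message above):

```python
import shutil
import sys

def require_modal_cli() -> None:
    """Exit 6 with an actionable message when the modal CLI is absent.

    Deliberately no silent fallback to --runtime local: that would mask
    real config issues and spring surprise costs.
    """
    if shutil.which("modal") is None:
        print(
            "modal CLI not found on PATH. Install with `pip install modal`, "
            "then run `modal token new`.",
            file=sys.stderr,
        )
        sys.exit(6)
```

Failing fast with a distinct exit code keeps the error greppable in CI logs, matching the exit-code conventions used in the messages above.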