# reflex eval

`reflex eval ./my-export/ --suite libero --num-episodes 3` — one command that returns a LIBERO success rate, per-task numbers, optional MP4 clips, and cost transparency. It wraps the existing Modal image, the osmesa/MuJoCo recipe, and the vla-eval adapter.

Per ADR `2026-04-25-eval-as-a-service-architecture`. Phase 1 ships LIBERO only, on Modal (with a Linux x86_64 local fallback); Phase 2 adds SimplerEnv, a customer suite, and HF Hub video upload.
## Quick start

```sh
# 1. Set up Modal auth (skip if --runtime local)
modal token new

# 2. Smoke run (~$0.20, ~3 minutes on A10G cold start)
reflex eval ./my-export/ \
  --suite libero \
  --num-episodes 3 \
  --tasks libero_spatial

# 3. Full bench (~$10, ~30 minutes)
reflex eval ./my-export/ \
  --suite libero \
  --num-episodes 50 \
  --video \
  --output ./eval-out/

# 4. Cost preview before kicking off something expensive
reflex eval ./my-export/ \
  --num-episodes 100 \
  --cost-preview
```

Output: `./eval-out/report.json` (machine-readable, schema v1), plus `./eval-out/videos/<task>_episode_<N>.mp4` when `--video` is set.
## Why this exists

Every research group evaluating a new VLA asks for the same thing: "give me task-success numbers I can put in a paper." The existing path is: clone `modal_libero_*.py`, figure out auth, figure out dependency pins, handle the 5 documented failure modes, and parse the output yourself — 1-2 days of yak-shaving per group. `reflex eval` ships the whole path in one verb.
## The 10 flags

| Flag | Default | Notes |
|---|---|---|
| `--suite` | `libero` | Phase 1 ships LIBERO only. Phase 2: `simpler`, `customer`. |
| `--num-episodes` | `3` | Per task. 3 = smoke; 50-100 = published-paper grade. |
| `--tasks` | (all) | Comma-separated. Empty = the 4 LIBERO families (spatial / object / goal / 10). |
| `--runtime` | `modal` | `modal` = bundled image (turnkey). `local` = Linux x86_64 + the `[eval-local]` extra. |
| `--seed` | `0` | Pass `--seed 7` to reproduce prior `modal_libero_*.py` published runs. |
| `--max-parallel` | `1` | Honored when the runtime supports it (Modal: yes; local: no). |
| `--cost-preview` | `false` | Dry run: estimate $ without invoking. |
| `--video` | `false` | Per-episode MP4 to `<output>/videos/`. Cap ~10 MB per episode. |
| `--output` | `./eval_output` | Directory for the JSON envelope + (optional) videos. |
| `--preflight-timeout` | `300` | Seconds for the LIBERO smoke test. Cold osmesa scene compile can take 60-180 s. |
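For budgeting, the flags combine multiplicatively: the run executes `--num-episodes` episodes for each selected task. A minimal sketch of that arithmetic (the helper and default-list constant here are illustrative, not part of the CLI):

```python
# Illustrative sketch: how --tasks and --num-episodes combine into a
# total episode count. The default task list mirrors the 4 LIBERO families.
DEFAULT_TASKS = ["libero_spatial", "libero_object", "libero_goal", "libero_10"]

def total_episodes(tasks: "str | None", num_episodes: int = 3) -> int:
    """Episodes the run will execute: num_episodes per selected task."""
    task_list = tasks.split(",") if tasks else DEFAULT_TASKS
    return len(task_list) * num_episodes

# Smoke run defaults: all 4 families x 3 episodes
print(total_episodes(None, 3))               # 12
# Published-grade run on one family
print(total_episodes("libero_spatial", 50))  # 50
```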
## Pre-flight smoke test

Before invoking the expensive run, `reflex eval` runs a pre-flight in an isolated subprocess that exercises the LIBERO init path. It catches 4 of the 5 documented LIBERO failure modes in ~2 seconds, before you spend money on a doomed run.
| Mode | Fix |
|---|---|
| `input-hang` | Run `scripts/patch_libero.py` first (or use `--runtime modal` — the bundled image patches this in) |
| `egl-black-frames` | Force `MUJOCO_GL=osmesa` |
| `dep-version-conflict` | Pin `robosuite==1.4.1`, `bddl==1.0.1`, `mujoco==3.3.2`, or use `--runtime modal` |
| `osmesa-compile-hang` | Increase `--preflight-timeout` (cold containers take 60-180 s for the first-scene compile) |
| `import-error` | `pip install 'reflex-vla[eval-local]'` for local; `--runtime modal` for the bundled image |
The 5th failure mode (per-episode OOM) is probabilistic per call; retry with backoff plus a legible error in the runner covers it.
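The pre-flight pattern itself is plain subprocess isolation: run the fragile init path in a child process under a hard timeout, and map a timeout to `osmesa-compile-hang`. A simplified sketch, using a Python snippet as a stand-in for the real LIBERO init path and collapsing the failure modes down to three outcomes:

```python
import subprocess
import sys

def preflight(check_code: str, timeout_s: float = 300.0) -> str:
    """Run an init-path check in an isolated subprocess.

    Returns "ok", "osmesa-compile-hang" on timeout, or "import-error" on a
    nonzero exit. (Simplified: the real pre-flight distinguishes more modes.)
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", check_code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "osmesa-compile-hang"
    return "ok" if proc.returncode == 0 else "import-error"

print(preflight("print('libero init ok')", timeout_s=30))   # ok
print(preflight("import sys; sys.exit(1)", timeout_s=30))   # import-error
```

The isolation matters: a hung or crashed init path kills only the child, so the CLI can print a legible failure mode instead of hanging itself.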
## JSON envelope (schema v1 — LOCKED)

`<output>/report.json` is the machine-readable envelope. Schema v1 is locked; Phase 2 evolution is additive-only. Customers grep on these fields in CI; renaming = breakage.

```json
{
  "schema_version": 1,
  "reflex_version": "0.7.0",
  "suite": "libero",
  "runtime": "modal",
  "seed": 0,
  "started_at": "2026-04-25T14:30:00Z",
  "finished_at": "2026-04-25T14:33:21Z",
  "wall_clock_s": 201.0,
  "tasks": ["libero_spatial", "libero_object", "libero_goal", "libero_10"],
  "num_episodes_per_task": 3,
  "aggregate": {"success_rate": 0.83, "n_success": 10, "n_total": 12},
  "results": [
    {"task_id": "libero_spatial", "n_success": 3, "n_total": 3, "success_rate": 1.0}
    /* ... per-task ... */
  ],
  "episodes": [
    {"task_id": "libero_spatial", "episode_index": 0, "success": true,
     "terminal_reason": "success", "wall_clock_s": 28.4, "n_steps": 200,
     "video_path": "./eval-out/videos/libero_spatial_episode_0.mp4",
     "error_message": null}
    /* ... flat list across all tasks ... */
  ],
  "cost": {
    "total_usd": 0.50,
    "by_task": {"libero_spatial": 0.175 /* ... */},
    "cost_table_schema_version": 1
  },
  "modal": {"image_digest": "sha256:abc...", "provider": "modal.com"},
  "env": {
    "git_sha": "deadbeefcafe",
    "git_dirty": false,
    "python_version": "3.13.11",
    "platform": "Darwin-25.3.0-arm64",
    "onnx_files": [{"name": "model.onnx", "sha256": "...", "bytes": 12345}]
  }
}
```

### terminal_reason enum

Bounded; stable across releases:

- `success` — task completed successfully (`success: true` REQUIRED)
- `timeout` — episode hit `--preflight-timeout` or a runner-side cap
- `bddl_failure` — task BDDL file failed to parse
- `rendering_failure` — osmesa / EGL render returned an error
- `adapter_error` — anything else the runner didn't classify
Cross-field invariant: `success == true` if and only if `terminal_reason == "success"`. Enforced at construction time.
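Because the schema is locked, the invariant is cheap to check anywhere the envelope is consumed, not just at construction time. A minimal validator sketch (field names from schema v1 above; the in-runner construction-time check may be implemented differently):

```python
def validate_episode(ep: dict) -> None:
    """Enforce the schema-v1 cross-field invariant:
    success is True if and only if terminal_reason == "success"."""
    if ep["success"] != (ep["terminal_reason"] == "success"):
        raise ValueError(
            f"invariant violated: success={ep['success']} "
            f"terminal_reason={ep['terminal_reason']!r}"
        )

validate_episode({"success": True, "terminal_reason": "success"})   # passes
validate_episode({"success": False, "terminal_reason": "timeout"})  # passes
# validate_episode({"success": True, "terminal_reason": "timeout"}) # raises
```

In CI, running this over every row of `episodes` catches a malformed envelope before any success-rate math is trusted.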
## Cost transparency

`reflex eval --cost-preview` prints a dollar estimate before invoking. The cost table is baked in at ship time and refreshed quarterly against actual Modal billing logs.

| Suite × Runtime | $ / episode | $ / task startup |
|---|---|---|
| libero × modal | $0.025 (A10G) | $0.10 (cold container + image pull + osmesa scene compile) |
| libero × local | $0 | $0 |
An estimate above $50 triggers an "are you sure?" warning so you don't accidentally fire a 1000-episode × 90-task run.
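The estimate reduces to the two rates in the table. A sketch of the arithmetic with the current libero × modal rates hardcoded (illustrative only — the shipped table is the source of truth and is refreshed quarterly):

```python
# Rates copied from the cost table above (libero x modal); illustrative only.
PER_EPISODE_USD = 0.025       # A10G
PER_TASK_STARTUP_USD = 0.10   # cold container + image pull + scene compile
GUARDRAIL_USD = 50.0

def estimate_usd(n_tasks: int, episodes_per_task: int) -> float:
    """Each task pays one startup cost plus a per-episode cost."""
    return n_tasks * (PER_TASK_STARTUP_USD + episodes_per_task * PER_EPISODE_USD)

print(f"${estimate_usd(1, 3):.2f}")   # $0.18  (single-family smoke, ~$0.20)
print(f"${estimate_usd(4, 50):.2f}")  # $5.40  (4 families x 50 episodes)
print(estimate_usd(90, 1000) > GUARDRAIL_USD)  # True: guardrail trips
```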
## Reproducing prior LIBERO numbers

`modal_libero_*.py` produced our published 80%+ LIBERO numbers using `seed=7`:

```sh
reflex eval ./my-export/ \
  --suite libero \
  --num-episodes 50 \
  --seed 7 \
  --runtime modal
```

The `env` block in `report.json` captures `git_sha`, `python_version`, `platform`, and per-`*.onnx` sha256 hashes — enough to re-run and cross-check. Treat `cost_table_schema_version` as the canonical pin for cost numbers.
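Cross-checking an export against the `env` block is a plain sha256 comparison. A sketch that produces one entry shaped like `env.onnx_files` (the file written here is a stand-in; in practice point it at your actual `model.onnx`):

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> dict:
    """One entry shaped like env.onnx_files in report.json."""
    data = path.read_bytes()
    return {
        "name": path.name,
        "sha256": hashlib.sha256(data).hexdigest(),
        "bytes": len(data),
    }

# Stand-in file; in practice: file_digest(Path("./my-export/model.onnx"))
demo = Path("demo.onnx")
demo.write_bytes(b"not a real model")
entry = file_digest(demo)
print(entry["bytes"])  # 16
```

Comparing `entry["sha256"]` against the recorded hash confirms you are re-running the exact export behind a published number, not a silently retrained one.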
## Common errors

**`Pre-flight FAILED (osmesa-compile-hang, 300.1s)`** — cold containers take 60-180 s for the first-scene compile. Bump `--preflight-timeout 600` and retry.

**`Pre-flight FAILED (dep-version-conflict, ...)`** — pin `robosuite==1.4.1`, `bddl==1.0.1`, `mujoco==3.3.2`, or drop `--runtime local` and use `--runtime modal`.

**`Estimate exceeds $50 guardrail`** — drop `--num-episodes` or use fewer `--tasks`. The guardrail is conservative; if you really do want a 1000-episode run, just hit Enter and it will proceed.

**All episodes returned `adapter_error` (exit 5)** — the Modal subprocess crashed mid-run. Check `report.json` for the `error_message` on per-episode rows; the first ~500 chars of stderr surface there.

**`modal` CLI not found on PATH (exit 6)** — install via `pip install modal`, then run `modal token new`. `--runtime modal` cannot proceed without it; we never silently fall back to local (that would mask real config issues and cost surprises).