
reflex eval

`reflex eval ./my-export/ --suite libero --num-episodes 3` — one command, and you get a LIBERO success rate, per-task numbers, optional MP4 clips, and cost transparency. Wraps the existing Modal image, the osmesa/MuJoCo recipe, and the vla-eval adapter.

Per ADR 2026-04-25-eval-as-a-service-architecture. Phase 1 ships LIBERO only on Modal (with Linux x86_64 local fallback); Phase 2 adds SimplerEnv + customer suite + HF Hub video upload.

```sh
# 1. Set up Modal auth (skip if --runtime local)
modal token new

# 2. Smoke run (~$0.20, ~3 minutes on A10G cold start)
reflex eval ./my-export/ \
  --suite libero \
  --num-episodes 3 \
  --tasks libero_spatial

# 3. Full bench (~$10, ~30 minutes)
reflex eval ./my-export/ \
  --suite libero \
  --num-episodes 50 \
  --video \
  --output ./eval-out/

# 4. Cost preview before kicking off something expensive
reflex eval ./my-export/ \
  --num-episodes 100 \
  --cost-preview
```

Output: `./eval-out/report.json` (machine-readable, schema v1), plus `./eval-out/videos/<task>_episode_<N>.mp4` when `--video` is set.

Every research group evaluating a new VLA asks for the same thing: “give me task-success numbers I can put in a paper.” The existing path is: clone `modal_libero_*.py`, figure out auth, figure out dependency pins, handle the 5 documented failure modes, and parse the output yourself — 1-2 days of yak-shaving per group. `reflex eval` ships the whole path in one verb.

| Flag | Default | Notes |
| --- | --- | --- |
| `--suite` | `libero` | Phase 1 ships LIBERO only. Phase 2: `simpler`, `customer`. |
| `--num-episodes` | `3` | Per task. 3 = smoke; 50-100 = published-paper grade. |
| `--tasks` | (all) | Comma-separated. Empty = the 4 LIBERO families (spatial / object / goal / 10). |
| `--runtime` | `modal` | `modal` = bundled image (turnkey). `local` = Linux x86_64 + the `[eval-local]` extra. |
| `--seed` | `0` | Pass `--seed 7` to reproduce prior `modal_libero_*.py` published runs. |
| `--max-parallel` | `1` | Honored when the runtime supports it (Modal: yes; local: no). |
| `--cost-preview` | `false` | Dry run: estimate cost without invoking. |
| `--video` | `false` | Per-episode MP4 to `<output>/videos/`. Cap ~10 MB per episode. |
| `--output` | `./eval_output` | Directory for the JSON envelope + (optional) videos. |
| `--preflight-timeout` | `300` | Seconds for the LIBERO smoke test. Cold osmesa scene compile can take 60-180 s. |

Before invoking the expensive run, `reflex eval` runs a pre-flight in an isolated subprocess that exercises the LIBERO init path. It catches 4 of the 5 documented LIBERO failure modes in ~2 seconds — before you spend money on a doomed run.

| Mode | Fix |
| --- | --- |
| `input-hang` | Run `scripts/patch_libero.py` first (or use `--runtime modal` — the bundled image patches this in). |
| `egl-black-frames` | Force `MUJOCO_GL=osmesa`. |
| `dep-version-conflict` | Pin `robosuite==1.4.1`, `bddl==1.0.1`, `mujoco==3.3.2` — or use `--runtime modal`. |
| `osmesa-compile-hang` | Increase `--preflight-timeout` (cold containers take 60-180 s for the first-scene compile). |
| `import-error` | `pip install 'reflex-vla[eval-local]'` for local; `--runtime modal` for the bundled image. |

The 5th failure (per-episode OOM) is per-call probabilistic; backoff + a legible error in the runner covers it.
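The pre-flight pattern itself — an init probe in an isolated subprocess, a hard timeout, stderr-signature classification — can be sketched as follows. This is a hypothetical helper, and the signature map is illustrative, not the runner's real classifier:

```python
import subprocess

# Stderr signatures mapped to failure-mode labels (illustrative subset).
SIGNATURES = {
    "EOFError": "input-hang",
    "ModuleNotFoundError": "import-error",
}

def preflight(child_cmd: list[str], timeout_s: float = 300.0) -> str:
    """Run an init probe in an isolated subprocess.

    Returns "ok" on clean exit, otherwise a failure-mode label.
    """
    try:
        proc = subprocess.run(
            child_cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # A stall past the deadline is treated as a compile hang here; the
        # real pre-flight can inspect partial output to distinguish causes.
        return "osmesa-compile-hang"
    if proc.returncode == 0:
        return "ok"
    for needle, mode in SIGNATURES.items():
        if needle in proc.stderr:
            return mode
    return "adapter_error"
```

In the real runner the child command would import LIBERO and compile one scene; any crash signature it does not recognize falls through to `adapter_error`, matching the catch-all semantics described below.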

`<output>/report.json` is the machine-readable envelope. Schema v1 is locked; Phase 2 evolution is additive-only. Customers grep these fields in CI, so renaming a field is a breaking change.

```jsonc
{
  "schema_version": 1,
  "reflex_version": "0.7.0",
  "suite": "libero",
  "runtime": "modal",
  "seed": 0,
  "started_at": "2026-04-25T14:30:00Z",
  "finished_at": "2026-04-25T14:33:21Z",
  "wall_clock_s": 201.0,
  "tasks": ["libero_spatial", "libero_object", "libero_goal", "libero_10"],
  "num_episodes_per_task": 3,
  "aggregate": {"success_rate": 0.83, "n_success": 10, "n_total": 12},
  "results": [
    {"task_id": "libero_spatial", "n_success": 3, "n_total": 3, "success_rate": 1.0}
    /* ... per-task ... */
  ],
  "episodes": [
    {"task_id": "libero_spatial", "episode_index": 0, "success": true,
     "terminal_reason": "success", "wall_clock_s": 28.4, "n_steps": 200,
     "video_path": "./eval-out/videos/libero_spatial_episode_0.mp4",
     "error_message": null}
    /* ... flat list across all tasks ... */
  ],
  "cost": {
    "total_usd": 0.50,
    "by_task": {"libero_spatial": 0.175 /* ... */},
    "cost_table_schema_version": 1
  },
  "modal": {"image_digest": "sha256:abc...", "provider": "modal.com"},
  "env": {
    "git_sha": "deadbeefcafe", "git_dirty": false,
    "python_version": "3.13.11", "platform": "Darwin-25.3.0-arm64",
    "onnx_files": [{"name": "model.onnx", "sha256": "...", "bytes": 12345}]
  }
}
```
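Because schema v1 is additive-only, a CI gate can read these fields directly. A minimal sketch (the function name and threshold are illustrative; field names come from the envelope above):

```python
import json
from pathlib import Path

def gate_on_report(path: str, min_rate: float = 0.8) -> float:
    """Read a v1 report envelope and fail loudly below a success-rate floor."""
    report = json.loads(Path(path).read_text())
    if report["schema_version"] != 1:
        raise ValueError(f"unexpected schema_version: {report['schema_version']}")
    rate = report["aggregate"]["success_rate"]
    if rate < min_rate:
        # Non-zero exit fails the CI job with a readable one-liner.
        raise SystemExit(f"success_rate {rate:.2f} below floor {min_rate:.2f}")
    return rate
```

Keying only on `schema_version`, `aggregate`, and `success_rate` is exactly why those names are load-bearing: any rename breaks gates like this one.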

`terminal_reason` is a bounded enum, stable across releases:

- `success` — task completed successfully (`success: true` required)
- `timeout` — episode hit `--preflight-timeout` or a runner-side cap
- `bddl_failure` — the task's BDDL file failed to parse
- `rendering_failure` — the osmesa / EGL render returned an error
- `adapter_error` — anything else the runner didn't classify

Cross-field invariant: `success == true` if and only if `terminal_reason == "success"`. Enforced at construction time.
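Construction-time enforcement can look like the following sketch (the class name and fields mirror the episode rows above; the runner's actual type is internal):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class EpisodeResult:
    task_id: str
    episode_index: int
    success: bool
    terminal_reason: str  # one of the bounded enum values above
    error_message: Optional[str] = None

    def __post_init__(self):
        # Cross-field invariant: success <=> terminal_reason == "success".
        if self.success != (self.terminal_reason == "success"):
            raise ValueError(
                f"success={self.success} inconsistent with "
                f"terminal_reason={self.terminal_reason!r}"
            )
```

Rejecting the inconsistent pair at construction means no downstream consumer ever has to decide which of the two fields to trust.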

`reflex eval --cost-preview` prints a dollar estimate before invoking anything. The cost table is baked in at ship time and refreshed quarterly against actual Modal billing logs.

| Suite × Runtime | $ / episode | $ / task startup |
| --- | --- | --- |
| libero × modal | $0.025 (A10G) | $0.10 (cold container + image pull + osmesa scene compile) |
| libero × local | $0 | $0 |

An estimate above $50 triggers an “are you sure?” prompt so you don't accidentally fire a 1000-episode × 90-task run.
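The preview arithmetic follows directly from the table: startup is paid once per task, episodes are linear after that. A sketch with the rates copied from above (function names are illustrative):

```python
# Rates from the cost table above (libero × modal).
PER_EPISODE_USD = 0.025       # A10G GPU time per episode
PER_TASK_STARTUP_USD = 0.10   # cold container + image pull + scene compile
GUARDRAIL_USD = 50.0

def estimate_usd(n_tasks: int, episodes_per_task: int) -> float:
    """Startup once per task, then per-episode GPU time."""
    return n_tasks * (PER_TASK_STARTUP_USD + episodes_per_task * PER_EPISODE_USD)

def needs_confirmation(n_tasks: int, episodes_per_task: int) -> bool:
    """True when the estimate crosses the are-you-sure guardrail."""
    return estimate_usd(n_tasks, episodes_per_task) > GUARDRAIL_USD
```

One smoke task at 3 episodes comes to 1 × (0.10 + 3 × 0.025) = $0.175, in line with the ~$0.20 smoke figure quoted earlier; a 90-task × 1000-episode run lands far past the guardrail.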

`modal_libero_*.py` produced our published 80%+ LIBERO numbers with `seed=7`. To reproduce:

```sh
reflex eval ./my-export/ \
  --suite libero \
  --num-episodes 50 \
  --seed 7 \
  --runtime modal
```

The `env` block in `report.json` captures `git_sha`, `python_version`, `platform`, and a sha256 per `*.onnx` file — enough to re-run and cross-check. Treat `cost_table_schema_version` as the canonical pin for cost numbers.
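Recomputing the `env.onnx_files` entries for a local export is a few lines; this sketch (hypothetical function name) produces dicts in the same shape as the envelope above, so they can be compared directly against a prior run:

```python
import hashlib
from pathlib import Path

def onnx_fingerprint(export_dir: str) -> list[dict]:
    """Recompute {name, sha256, bytes} for every *.onnx in a directory."""
    entries = []
    for p in sorted(Path(export_dir).glob("*.onnx")):
        data = p.read_bytes()
        entries.append({
            "name": p.name,
            "sha256": hashlib.sha256(data).hexdigest(),
            "bytes": len(data),
        })
    return entries
```

If the recomputed list matches `report["env"]["onnx_files"]`, you are evaluating byte-identical model files.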

`Pre-flight FAILED (osmesa-compile-hang, 300.1s)` — cold containers take 60-180 s for the first-scene compile. Bump `--preflight-timeout 600` and retry.

`Pre-flight FAILED (dep-version-conflict, ...)` — pin `robosuite==1.4.1`, `bddl==1.0.1`, `mujoco==3.3.2`, or drop `--runtime local` and use `--runtime modal`.

`Estimate exceeds $50 guardrail` — drop `--num-episodes` or use fewer `--tasks`. The guardrail is conservative; if you really do want a 1000-episode run, just hit Enter and it proceeds.

`All episodes returned adapter_error (exit 5)` — the Modal subprocess crashed mid-run. Check `report.json` for the `error_message` on per-episode rows; the first ~500 characters of stderr surface there.

`modal CLI not found on PATH (exit 6)` — install via `pip install modal`, then run `modal token new`. `--runtime modal` cannot proceed without it; we never silently fall back to local (that would mask real config issues and spring cost surprises).
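That guard is simple to express; a minimal sketch (hypothetical function name, exit code taken from the message above):

```python
import shutil
import sys

def require_modal_cli() -> None:
    """Exit 6 with an actionable message when the modal CLI is absent.

    Deliberately no silent fallback to --runtime local: that would mask
    real config issues and spring surprise costs.
    """
    if shutil.which("modal") is None:
        print(
            "modal CLI not found on PATH. Install with `pip install modal`, "
            "then run `modal token new`.",
            file=sys.stderr,
        )
        sys.exit(6)
```

Failing fast with a distinct exit code keeps the error greppable in CI logs, matching the exit-code conventions used in the messages above.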