Troubleshooting

If you hit something not on this page, run reflex doctor first — its 10 falsifiable checks against your environment and export catch most issues. Then check GitHub issues or ask in Discord.

pip install reflex-vla[serve,gpu] succeeds but reflex go fails with TensorRT errors

The [serve,gpu] extra pulls tensorrt>=10 and the NVIDIA CUDA libraries, but on some hosts the runtime can’t find them. Verify with reflex doctor:

Terminal window
reflex doctor

Look for these specific checks:

  • TensorRT runtime (libnvinfer.so.10) — loadable
  • CUDA cuBLAS (libcublas.so.12) — loadable
  • CUDA cuDNN (libcudnn.so.9) — loadable
  • ORT-TRT EP active

If any are red ✗, the hint says exactly which pip install to run. The most common causes are installing [serve,gpu-min] (which omits TensorRT) or an old release that didn't pull tensorrt automatically.
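
To spot-check these libraries by hand, try loading them directly with ctypes (note: pip-installed CUDA wheels live under site-packages, so ldconfig may not list them):

Terminal window
python -c "import ctypes; ctypes.CDLL('libnvinfer.so.10'); print('libnvinfer.so.10 loadable')"
python -c "import ctypes; ctypes.CDLL('libcublas.so.12'); print('libcublas.so.12 loadable')"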

Easiest fix on a fresh GPU box: use NVIDIA’s container image as your environment.

Terminal window
docker run --gpus all -it nvcr.io/nvidia/tensorrt:24.10-py3
apt-get install -y clang # for lerobot → evdev
pip install 'reflex-vla[serve,gpu,monolithic]'

--device cuda errors with “CUDAExecutionProvider not available”

The pip wheel for ONNX Runtime is the CPU-only version. Either:

Terminal window
# Get the GPU build
pip uninstall onnxruntime
pip install onnxruntime-gpu==1.20.*

Or just install with the right extra: pip install 'reflex-vla[serve,gpu]'. The [gpu] extra excludes the CPU wheel and includes the GPU one.
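
To confirm which build is actually installed, list the execution providers (onnxruntime.get_available_providers is part of the public onnxruntime API):

Terminal window
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
# GPU wheel: includes CUDAExecutionProvider (and TensorrtExecutionProvider)
# CPU-only wheel: just CPUExecutionProvider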

CPU-only path:

Terminal window
pip install 'reflex-vla[serve,onnx,monolithic]'

The [gpu] extra requires NVIDIA hardware; it’ll fail on macOS regardless.

lerobot's evdev dependency requires a C compiler:

Terminal window
apt-get install -y clang
# or the full build toolchain:
apt-get install -y build-essential

reflex export <model> fails partway through

Run reflex doctor first to verify the environment. Then match the error class against this table:

| Error message | Cause | Fix |
| --- | --- | --- |
| Where Cast missing | The post-export Where Cast insertion didn't run | v0.5+ does this automatically; on older versions, re-run with --patch-where-cast |
| InsufficientVRAMError (decomposed parallel) | Not enough VRAM for parallel export of two ONNX sessions | Drop --export-mode parallel, or use a higher-VRAM host |
| max_abs_diff > 1e-4 | PyTorch and ONNX disagree beyond the threshold | File a GitHub issue — this is a real bug, not user error |
| transformers.Cache type incompatible | Wrong transformers version | Pin: pip install 'transformers==5.3.0' |
| OOM during torch.export trace | RAM, not VRAM — the trace itself is heavy | Run on a host with ≥ 32 GB RAM, or use --device cpu --mode monolithic (slower trace, less RAM) |
If validation output ends like this:

max_abs_diff_across_all 3.21e-04
passed FAIL

the exported ONNX disagrees with PyTorch at the third decimal place. Don't deploy this artifact to a robot; it will silently misbehave.

Most common causes:

  1. Float64 vs float32 inputs. Cast all inputs to float32 before serving; otherwise ONNX Runtime silently truncates float64 to float32 and fps drops to ~0.3.
  2. Custom transformers version. Reflex pins transformers==5.3.0 for a reason: earlier versions produced incorrect ONNX exports for pi0/pi0.5.
  3. Comparing against the wrong PyTorch reference. reflex validate export ./p0 --model lerobot/pi0_base validates against lerobot/pi0_base specifically. Use --reference <hf_id> to compare against a different model, as in the sketch below.
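
For cause 3, a minimal sketch of validating against the checkpoint you actually exported (the model id here is a placeholder for your own):

Terminal window
reflex validate export ./p0 --reference your-org/your-pi0-finetune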

Server starts but /act returns 503 forever

The server is in degraded state. Check /health:

Terminal window
curl http://localhost:8000/health

If status is degraded, the circuit breaker tripped after N consecutive /act failures. Look at the server logs for the underlying error. After fixing it, restart the server (or wait out the cooldown if --max-consecutive-crashes is set to N > 5).

If status is warming, the cold-start hasn’t finished. First serve takes 10-70s. /health returns 503 until ready (load balancers correctly skip the server during this window).
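
Both states are visible in the health body. A quick status probe, plus a readiness gate for scripts (assumes the default port and jq installed):

Terminal window
curl -s http://localhost:8000/health | jq .status
# block until ready; -f makes curl treat the 503 as a failure
until curl -sf http://localhost:8000/health >/dev/null; do sleep 2; done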

If the server responds but is slower than expected, these are the causes in order of likelihood:

  1. TRT EP isn't actually active. Run reflex doctor and check for “ORT-TRT EP active” (see the one-liner after this list). If it's not, you're falling back to ORT-CUDA, which is 5.55× slower.
  2. The export is monolithic on a model where decomposed would help. For pi0.5, re-export with --mode decomposed.
  3. You're on a hardware tier that doesn't capture CUDA graphs. A10G vlm_prefix falls back to eager. Expected; expert_denoise still captures.
  4. CUDA context contention. Another process is using the same GPU. Move that process, or give Reflex a dedicated GPU.
  5. Cold start. The first /act per session is 100-400ms slower than subsequent ones (capture + warmup costs). Use --prewarm (the default) so this happens at startup.
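
For cause 1, grepping doctor's human-readable output for the check by name is enough:

Terminal window
reflex doctor | grep -i 'ORT-TRT'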

/act returns “queue_full” 503 errors under load

The cost-budget batching scheduler is dropping requests because the runtime queue hit capacity (default 1000 pending).

{
  "error": "queue_full",
  "message": "policy runtime queue at capacity",
  "policy_id": "prod",
  "max_queue": 1000
}

Two fixes:

Terminal window
# Allow more pending requests (memory cost; usually fine on edge GPUs)
reflex serve ./my-export/ --max-queue-size 5000
# Or tune the batching scheduler for higher throughput
reflex serve ./my-export/ --max-batch-cost-ms 250

For sustained traffic above what one GPU can handle, scale horizontally — one Reflex process per robot, not one process serving N robots. See fleet telemetry.
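
On the client side, curl's built-in retry treats 503 as transient, which works as a crude backoff while you re-tune (a sketch: obs.json stands in for your observation payload, and the endpoint shape is assumed to be a JSON POST):

Terminal window
curl --retry 5 --retry-delay 1 \
  -X POST -H 'Content-Type: application/json' \
  -d @obs.json http://localhost:8000/act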

Server crashes on first /act with Blackwell GPU (RTX 5090)

Known limitation. ORT’s bundled cuBLAS / cuDNN don’t ship sm_100 kernels yet. Workarounds:

  1. reflex chat works without GPU — natural-language CLI mode, no inference
  2. reflex doctor and reflex models list both work
  3. /act itself needs a non-Blackwell GPU temporarily — Modal A10G/A100, RTX 4090, etc.

Tracking ORT upstream for Blackwell support; v0.7+ may add a TensorRT-LLM runtime path.

--cuda-graphs enabled but I see no speedup

You’re not on the decomposed dispatch path. The flag applies to Pi05DecomposedInference dispatch (used in Modal scripts today; production wire-up pending the decomposed-dispatch fix).

In that case, the startup log reads:

--cuda-graphs was set but this backend does not consume the flag.

Fix: re-export with --mode decomposed, or wait for the legacy backend wire-up.
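
The re-export is the same command as in the export section, with the decomposed mode flag:

Terminal window
reflex export <model> --mode decomposed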

embodiment.normalization.mean_action length mismatch

The mean_action / std_action arrays don’t match action_space.dim. Fix the embodiment JSON:

{
  "action_space": {"dim": 7, "ranges": [...]},
  "normalization": {
    "mean_action": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  // length 7
    "std_action": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]    // length 7
  }
}
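
A quick consistency check before re-serving (assumes jq is installed and embodiment.json is the path to your file):

Terminal window
# prints true when both arrays match action_space.dim
jq '.action_space.dim as $d
    | (.normalization.mean_action | length) == $d
      and (.normalization.std_action | length) == $d' embodiment.json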

If the error is gripper.component_idx >= action_space.dim, pick a valid 0-based index for whichever action component controls the gripper.

If two cameras share the same name, rename one of them.

reflex eval fails with osmesa-compile-hang

Cold containers take 60-180s for the first-scene compile, which can exceed the default preflight timeout. Raise it:

Terminal window
reflex eval ./my-export/ --preflight-timeout 600

If reproducible, file an issue with the stderr from the smoke test.

reflex eval says “Estimate exceeds $50 guardrail”

The guardrail is conservative. If you really want a 1000-episode run, just hit Enter — it’ll proceed.

All episodes returned adapter_error (exit 5)

Either:

  • Modal subprocess crashed mid-run — check report.json for the per-episode error_message (see the jq one-liner after this list). The first ~500 chars of stderr surface there.
  • You’re on --runtime local and the local runner is broken — use --runtime modal for now.
  • The Modal run_libero_* script’s stdout format changed — the parser pins on ====== <suite> (ONNX monolithic) ====== headers. File a bug if the script was bumped.
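
To surface those per-episode errors quickly, something like this works, assuming report.json holds an episodes array with error_message fields (the exact schema may differ; adjust the paths):

Terminal window
jq '.episodes[] | select(.error_message != null) | .error_message' report.json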

If none of these match, collect diagnostics first:

Terminal window
reflex doctor --format json | jq .

Capture the full doctor output, the relevant log excerpt, and your platform info, and either open a GitHub issue or ask in Discord.

Don’t suffer alone — most common errors have known fixes; the diagnosis is usually faster with a second pair of eyes.