ADR-0009: Separate simulation from benchmarking

Status: Accepted
Date: 2026-05-24
Amended: 2026-05-24 (see Amendments below)

Context

OpenRAL ships two subsystems that both call themselves "eval", and the boundary between them is muddy enough that contributors ask which one to reach for. The user-visible artefact of the confusion: today's examples/configs/*.yaml pins all four sim axes (robot × scene × task × vla), so swapping only the rSkill (the apples-to-apples question a benchmark answers) is not a first-class operation.

What is actually shipped:

Sim eval — ral eval / ral-eval / openral_eval runs single rollouts driven by a SimEnvironment YAML. Every axis is free (python/core/src/openral_core/schemas.py:1136-1348 — PhysicsBackend, SceneSpec, TaskSpec, VLASpec, SimEnvironment; ADR-0002). Runtime in python/eval/src/openral_eval/ with three registries (SCENES, POLICIES, ROBOTS) and adapters for LIBERO, MetaWorld, gym-aloha, gym-pusht. Output: an EpisodeResult, optional video, OTel spans. Six YAMLs under examples/configs/, six just sim-* Justfile targets — every one a thin wrapper around ral eval --config <yaml>. Default n_episodes: 1. Used for dev, debug, demo videos.
Benchmark report — openral benchmark report is a read-only aggregator (python/cli/src/openral_cli/main.py:994-1118) over hand-curated rskills/<id>/eval/*.json files. Each JSON is validated against RSkillEvalResult (schemas.py:1006-1117) at rSkill load and at report time. It does not run anything. No openral benchmark run, no Justfile target.

Two additional facts make the gap concrete:

The seven shipped rskills/<id>/eval/*.json files all carry paper-cited numbers with reproduced_locally: false and a reproduction_cli that points at the external lerobot-eval tool — not at anything in this repo. The harness cannot today reproduce its own benchmark numbers without an out-of-tree CLI.
A LIBERO suite is 10 tasks × N seeds × success criteria. The SimEnvironment schema pins exactly one TaskSpec. Running a suite today means copying 10 YAMLs and orchestrating runs by hand. That is exactly the structure that should be a BenchmarkSpec, not ten SimEnvironments.

The user's framing — and the goal of this change — is to give the two subsystems different shapes that match their different purposes:

Benchmark = fixed (robot, scene, suite, protocol). Only the vla varies. Apples-to-apples leaderboard. Output: a validated RSkillEvalResult JSON dropped into rskills/<vla>/eval/<id>.json with reproduced_locally: true.
Simulation = everything is configurable. One-off rollouts, dev, demo videos. No reproducibility guarantee across rSkills.

Decision

New BenchmarkSpec (and ProtocolSpec) Pydantic model in openral_core, next to SimEnvironment. BenchmarkSpec pins id, robot_id, scene: SceneSpec, tasks: list[TaskSpec] (the full suite, not one task), and protocol: ProtocolSpec (n_episodes, seeds, success_key, max_steps, min_reps). The VLA is not a field on BenchmarkSpec — it is the only free axis and is supplied at the CLI.
New openral benchmark run --suite <id> --vla <id>:<weights_uri> command that iterates tasks × seeds internally by reusing openral_eval.run_evaluation (renamed below). It writes a validated RSkillEvalResult to rskills/<vla>/eval/<suite_id>.json with reproduced_locally: true. The existing openral benchmark report aggregator is unchanged — it consumes the JSONs the runner now produces.
Built-in benchmarks/ catalogue at repo root, parallel to robots/ and rskills/. Initial files mirror the existing paper-cited JSONs: libero_spatial.yaml, libero_object.yaml, libero_goal.yaml, libero_10.yaml, metaworld_mt50.yaml, aloha_transfer_cube.yaml, pusht.yaml.
Aggressive rename:
ral eval → openral sim run. CLI mounted under a new openral sim Typer group.
Package openral_eval → openral_sim.
examples/configs/ → scenes/. The YAML schema is unchanged — it is still a SimEnvironment — only the directory moves.
The ral-eval console script remains for one minor release as a thin shim that emits a DeprecationWarning and forwards to openral sim run. Imports of openral_eval.* re-export from openral_sim with the same deprecation strategy.
CLAUDE.md §6.4 update (lands in the runner PR, not this ADR PR): every published rSkill that targets a built-in benchmark must ship eval/<benchmark_id>.json generated by openral benchmark run with reproduced_locally: true. ral skill check validates the linkage.
Phased migration, one PR per phase (see Migration below). This ADR commits to the destination; the steps to it land separately so each PR stays small and reviewable per CLAUDE.md §7.2.
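
A minimal sketch of the shape the first two Decision items describe, using standard-library dataclasses as a stand-in for the Pydantic models. The field names come from the Decision text; the field types, the reduced SceneSpec / TaskSpec stubs, the example values, and the helper iter_rollouts are assumptions for illustration, not the shipped schema.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterator


@dataclass(frozen=True)
class SceneSpec:
    # Stand-in: the real SceneSpec in openral_core carries more fields.
    id: str


@dataclass(frozen=True)
class TaskSpec:
    # Stand-in for the real TaskSpec.
    id: str


@dataclass(frozen=True)
class ProtocolSpec:
    # Field names from the ADR; the types are assumed.
    n_episodes: int
    seeds: tuple[int, ...]
    success_key: str
    max_steps: int
    min_reps: int


@dataclass(frozen=True)
class BenchmarkSpec:
    # Deliberately no `vla` field: the VLA is the only free axis
    # and is supplied at the CLI.
    id: str
    robot_id: str
    scene: SceneSpec
    tasks: tuple[TaskSpec, ...]  # the full suite, not one task
    protocol: ProtocolSpec


def iter_rollouts(spec: BenchmarkSpec) -> Iterator[tuple[TaskSpec, int]]:
    """Yield every (task, seed) pair a runner would have to execute."""
    yield from product(spec.tasks, spec.protocol.seeds)


# All values below are hypothetical placeholders.
spec = BenchmarkSpec(
    id="libero_spatial",
    robot_id="example_robot",
    scene=SceneSpec(id="example_scene"),
    tasks=tuple(TaskSpec(id=f"task_{i}") for i in range(10)),
    protocol=ProtocolSpec(
        n_episodes=1,
        seeds=(0, 1, 2),
        success_key="success",
        max_steps=300,
        min_reps=1,
    ),
)
rollouts = list(iter_rollouts(spec))
print(len(rollouts))  # 10 tasks x 3 seeds = 30
```

Keeping the protocol on the spec rather than on each task is what lets a single object answer "how many rollouts does this suite require" without inspecting ten separate configs.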

Consequences

Pros
The two CLIs match the two purposes. openral sim run is for free-form rollouts; openral benchmark run is for reproducible suite evaluation. The word "eval" stays on the side that produces RSkillEvalResults.
openral benchmark run closes the "reproduction deferred — use external lerobot-eval" loop that the existing libero.json files document. The on-disk RSkillEvalResult format is unchanged; only its producer changes from "hand-edited" to "openral benchmark run output", so openral benchmark report and the rSkill loader keep working without modification.
BenchmarkSpec makes the suite a first-class object. CI matrices over (suite, rSkill) become a one-liner instead of ten copied YAMLs.
The rename removes the "two evals" overload that confuses contributors at first contact with the repo.
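
As an illustration of the CI-matrix point above, a (suite, rSkill) matrix could be expanded roughly as follows. The suite ids are taken from the initial benchmarks/ catalogue and the flags from the Decision text; the rSkill arguments and the ci_matrix helper are hypothetical.

```python
from itertools import product

# Suite ids from the initial benchmarks/ catalogue named in the Decision.
SUITES = ["libero_spatial", "libero_object", "libero_goal", "libero_10"]
# Hypothetical rSkill arguments in the <id>:<weights_uri> form.
VLAS = ["vla_a:file:///weights/a", "vla_b:file:///weights/b"]


def ci_matrix(suites: list[str], vlas: list[str]) -> list[list[str]]:
    """Build one `openral benchmark run` argv per (suite, rSkill) pair."""
    return [
        ["openral", "benchmark", "run", "--suite", suite, "--vla", vla]
        for suite, vla in product(suites, vlas)
    ]


jobs = ci_matrix(SUITES, VLAS)
print(len(jobs))  # 4 suites x 2 rSkills = 8
```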
Cons
Naming churn. Touches the Justfile, every examples/configs/ YAML's docstring, every rskills/<id>/eval/<benchmark>.json's reproduction_cli field (currently points at lerobot-eval), docs/contributing/development.md, docs/quickstart/so100.md, docs/adr/0002-eval-and-sim-environments.md, the repo state map, and docs/METHODS.md. Mitigated by the one-release ral-eval / openral_eval deprecation shim.
One more normative schema (BenchmarkSpec + ProtocolSpec) in openral_core. Lands on the existing pre-publish baseline (schema_version: "0.1"); no migrator while the schema is pre-publish (CLAUDE.md §1.6).
The Justfile gains bench-* targets symmetric with sim-*. Surface area grows.

Migration

Phased — one PR per phase. This ADR is PR A.

PR A (this PR): ADR-0009 + ADR-0002 amendment + repo state map planning blocks. No code, no schema, no rename. Verification: just docs-build (mkdocs --strict) and offline-load of the repo state map.
PR B: Schema-first. Add BenchmarkSpec and ProtocolSpec to openral_core on the existing pre-publish baseline (schema_version: "0.1"); add hypothesis fuzz tests; export updated JSON Schema via just schema-export. No CLI, no runtime change. The schema validates against a real fixture benchmarks/libero_spatial.yaml added in the same PR (CLAUDE.md §1.11 — real fixture, not a placeholder).
PR C: Rename. openral_eval → openral_sim; ral eval → openral sim run; examples/configs/ → scenes/. Ship the back-compat shim (ral-eval script + openral_eval re-export module emitting DeprecationWarning). Justfile, docs, repo state map, and docs/METHODS.md updates. No functional change beyond the rename — _check_rskill_compatibility, the registries, and the scene adapters are untouched.
PR D: Implement openral benchmark run. Reuses openral_sim.run_evaluation internally to loop over tasks × seeds. Adds the benchmarks/ catalogue. Writes a validated RSkillEvalResult JSON. CLAUDE.md §6.4 wording update lands here so the contract and the code that satisfies it ship together.
PR E: Regenerate rskills/<id>/eval/*.json via the new runner where feasible (GPU runner time permitting). Where infeasible (e.g. LIBERO-Long is documented as ~8 h on A100), keep the paper-quoted JSON, but replace its reproduction_cli text with the in-tree openral benchmark run … command so the documented reproduction path is the one this repo owns.
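
The back-compat shim in PR C follows a common Python pattern: the legacy entry point emits a DeprecationWarning and forwards its arguments unchanged. A rough sketch, in which sim_run stands in for the real openral sim run entry point and both function names are hypothetical:

```python
import warnings


def sim_run(argv: list[str]) -> int:
    """Stand-in for the real `openral sim run` entry point (hypothetical)."""
    return 0


def ral_eval_shim(argv: list[str]) -> int:
    """Legacy `ral-eval` entry point: warn once, then forward unchanged."""
    warnings.warn(
        "ral-eval is deprecated; use `openral sim run` instead.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not the shim
    )
    return sim_run(argv)


# Capture the warning so the forwarding behaviour can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    exit_code = ral_eval_shim(["--config", "scenes/example.yaml"])

print(exit_code, caught[0].category.__name__)
```

Python hides DeprecationWarning by default outside of __main__ and test runners, so a user-facing shim may need an explicit filter or a printed notice to be noticed at all.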

Why not other options

Keep ral eval and only add openral benchmark run. Lower churn, but the "two evals" overload survives and we resolve it only via docs. CLAUDE.md operating principle 4 ("explicit beats implicit") argues against carrying a documented-only disambiguation.
Wrap external lerobot-eval. Lowest implementation cost; what the current libero.json files already document. Rejected because we don't own the trace surface (no OTel rskill_span / inference_span across the suite), the external CLI's flags are not API-stable, and CLAUDE.md §8 ("reproducibility over speed") wants the reproduction path inside the repo. We can still record the external command in the RSkillEvalSource.reproduction_cli.notes field as a cross-reference.
Put BenchmarkSpec in a separate python/benchmark/ package instead of core. Lower blast radius on openral_core, but RSkillEvalResult already lives in core; splitting the spec from the result it produces is asymmetric. Mirroring SimEnvironment's home keeps both kinds of typed eval contract in one place and prevents a second package depending circularly on openral_core.
Soft rename (CLI only, keep openral_eval package name). Keeps internal imports stable but leaves the package name pointing at sim — exactly the overload we are trying to eliminate. The rename's whole value is removing the ambiguity at every layer the word appears.
Make BenchmarkSpec just a list of SimEnvironments. Conceptually close, but each SimEnvironment carries its own vla, seed, n_episodes, and record_video. The benchmark protocol must own those across all tasks for the numbers to compare; threading "this field is authoritative, that one is not" through a list of SimEnvironments is more fragile than a dedicated model.

Amendments

2026-05-16 — Reconciled with ADR-0010 amendment 1 (SimRunner unification)

The Decision text above describes openral_sim.run_evaluation as the loop driver and openral_eval as a one-release deprecation shim. Both are now gone:

run_evaluation (and run_episode) deleted; replaced by openral_sim.SimRunner, a per-step InferenceRunner that shares the same Protocol surface as openral_runner.HardwareRunner. openral sim run and openral benchmark run both drive SimRunner.activate / run / deactivate.
python/eval-shim/ (the openral_eval re-export package) removed in the same PR — the one-release clock declared in PR C had elapsed by the time of the unification.

The PR D protocol still applies for benchmark artifacts: openral benchmark run continues to emit RSkillEvalResult(reproduced_locally=true) JSONs at rskills/<dir>/eval/<benchmark_id>.json. Only the loop driver underneath changed. See ADR-0010 amendment 1 for the full unification.
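
The shared activate / run / deactivate surface can be pictured as a structural typing.Protocol. The sketch below is illustrative only: the method signatures, the result payload, and the FakeSimRunner and drive names are assumptions, not the openral_sim API.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class InferenceRunner(Protocol):
    """Assumed shape of the shared surface; real signatures may differ."""

    def activate(self) -> None: ...
    def run(self) -> dict: ...
    def deactivate(self) -> None: ...


class FakeSimRunner:
    """Toy runner that records the order of lifecycle calls."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def activate(self) -> None:
        self.calls.append("activate")

    def run(self) -> dict:
        self.calls.append("run")
        return {"success": True}  # hypothetical result payload

    def deactivate(self) -> None:
        self.calls.append("deactivate")


def drive(runner: InferenceRunner) -> dict:
    """Run one activate / run / deactivate cycle, always deactivating."""
    runner.activate()
    try:
        return runner.run()
    finally:
        runner.deactivate()


runner = FakeSimRunner()
result = drive(runner)
print(runner.calls)  # ['activate', 'run', 'deactivate']
```

Because the Protocol is structural, a simulated and a hardware runner can satisfy the same driver without sharing a base class.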

2026-05-18 — Status flipped Proposed → Accepted

All five migration phases declared above have landed on main:

PR A — this ADR + ADR-0002 amendment + repo state map planning blocks (merged).
PR B — BenchmarkSpec and ProtocolSpec Pydantic models live in python/core/src/openral_core/schemas.py alongside SimEnvironment, with hypothesis fuzz coverage in python/core/tests/ and exported JSON Schema under docs/reference/schemas/.
PR C — rename complete: package openral_eval → openral_sim, CLI ral eval → openral sim run, examples/configs/ → scenes/ (the back-compat shim from PR C has since been removed by ADR-0010 amendment 1 above; see that amendment for the unification details).
PR D — openral benchmark run --suite <id> --rskill <weights_uri> is implemented at python/sim/src/openral_sim/benchmark.py:53 and wired into the CLI at python/cli/src/openral_cli/main.py:1432 (the benchmark Typer group exposes list / run / report). The benchmarks/ catalogue ships 12 suites covering LIBERO (spatial / object / goal / 10), MetaWorld MT50, gym-aloha (transfer cube + insertion), gym-pusht, ManiSkill3 pick-place, SimplerEnv google-robot, RoboCasa PnP, and GR1 tabletop.
PR E — openral benchmark report aggregates rskills/<id>/eval/*.json into per-benchmark roll-ups. CLAUDE.md §6.4 carries the authoritative contract referencing this ADR as the canonical producer.

openral benchmark run is now the only canonical producer of RSkillEvalResult(reproduced_locally=true) per CLAUDE.md §6.4. No behavioural change against the Decision text — only the status field flips.

2026-06-08 — BenchmarkSpec deleted (ADR-0042)

PR B above declared BenchmarkSpec (and ProtocolSpec) the normative suite contract. That class is now gone.

ADR-0041 Task 10 (June 2026) first flattened BenchmarkSpec to a near-empty wrapper around list[BenchmarkScene] (each scene already carried its own n_episodes / seed / task.max_steps / task.success_key / metadata). ADR-0042 then deleted the wrapper class outright:

benchmarks/<id>.yaml is now a bare YAML list of BenchmarkScene mappings at the root; the suite id is the filename stem.
openral_core.BenchmarkSpec is gone. The schema export docs/reference/schemas/BenchmarkSpec.json and the 13 byte-identical baseline fixtures under tests/fixtures/benchmark_aggregator/ are deleted.
The five suite invariants previously enforced by BenchmarkSpec.model_post_init (non-empty list, unique task.id, uniform robot_id / n_episodes / seed / metadata) moved to a free function openral_core.raise_on_invalid_suite(scenes, *, suite_id). The matching loader is openral_core.load_benchmark_suite(path). Splitting them lets tests build invalid in-memory suites without disk I/O.
ProtocolSpec remains exported as a free-standing schema for ADR drafts and report tooling that quote a published protocol verbatim; it is no longer embedded in any other model.
BenchmarkMetadata gained two optional fields, display_name and simulator, that surface into RSkillEvalResult.benchmark.name / .simulator when present (previously a free-form dict on BenchmarkSpec.metadata). The suite-level uniformity invariant covers them via the per-scene metadata equality check.
openral_sim.run_benchmark now takes (scenes, vla, *, suite_id, ...); the aggregator pulls display_name / simulator from per-scene metadata and auto-derives arxiv from metadata.paper when the URL contains arxiv.org/.
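
A sketch of how the suite invariants listed above might look as a free function. The (scenes, *, suite_id) signature and the checks come from this amendment; the reduced Scene and Task stand-ins, the ValueError exception type, and the message wording are assumptions.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass(frozen=True)
class Task:
    id: str


@dataclass(frozen=True)
class Scene:
    # Stand-in for BenchmarkScene, reduced to the fields the checks read.
    robot_id: str
    task: Task
    n_episodes: int
    seed: int
    metadata: dict = field(default_factory=dict)


def raise_on_invalid_suite(scenes: list[Scene], *, suite_id: str) -> None:
    """Raise ValueError (assumed type) if the suite breaks an invariant."""
    if not scenes:
        raise ValueError(f"{suite_id}: suite must contain at least one scene")
    task_ids = [scene.task.id for scene in scenes]
    if len(set(task_ids)) != len(task_ids):
        raise ValueError(f"{suite_id}: task.id values must be unique")
    first = scenes[0]
    for name in ("robot_id", "n_episodes", "seed", "metadata"):
        if any(getattr(s, name) != getattr(first, name) for s in scenes):
            raise ValueError(f"{suite_id}: {name} must be uniform across scenes")


# The suite id is the filename stem.
suite_id = Path("benchmarks/libero_spatial.yaml").stem

# Invalid suites can be built in memory, with no disk I/O.
good = [Scene("robot", Task("t0"), 1, 0), Scene("robot", Task("t1"), 1, 0)]
raise_on_invalid_suite(good, suite_id=suite_id)  # passes silently

bad = [Scene("robot", Task("t0"), 1, 0), Scene("robot", Task("t0"), 1, 0)]
try:
    raise_on_invalid_suite(bad, suite_id=suite_id)
except ValueError as exc:
    error = str(exc)
print(error)  # libero_spatial: task.id values must be unique
```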

No behavioural change to RSkillEvalResult JSON for any shipped suite, and the openral benchmark report reader is unchanged. PR D's claim that the runner is the canonical producer still holds.