ADR-0009: Separate simulation from benchmarking
- Status: Accepted
- Date: 2026-05-24
- Amended: 2026-05-24 (see Amendments below)
Context
OpenRAL ships two subsystems that both call themselves "eval", and the
boundary between them is muddy enough that contributors ask which one to
reach for. The user-visible artefact of the confusion: today's
examples/configs/*.yaml pins all four sim axes
(robot × scene × task × vla), so swapping only the rSkill (the
apples-to-apples question a benchmark answers) is not a first-class operation.
What is actually shipped:
- Sim eval: ral eval / ral-eval / openral_eval runs single rollouts driven by a SimEnvironment YAML. Every axis is free (python/core/src/openral_core/schemas.py:1136-1348: PhysicsBackend, SceneSpec, TaskSpec, VLASpec, SimEnvironment; ADR-0002). The runtime lives in python/eval/src/openral_eval/ with three registries (SCENES, POLICIES, ROBOTS) and adapters for LIBERO, MetaWorld, gym-aloha, and gym-pusht. Output: an EpisodeResult, optional video, OTel spans. There are six YAMLs under examples/configs/ and six just sim-* Justfile targets, every one a thin wrapper around ral eval --config <yaml>. Default n_episodes: 1. Used for dev, debug, and demo videos.
- Benchmark report: openral benchmark report is a read-only aggregator (python/cli/src/openral_cli/main.py:994-1118) over hand-curated rskills/<id>/eval/*.json files. Each JSON is validated against RSkillEvalResult (schemas.py:1006-1117) at rSkill load and at report time. It does not run anything: there is no openral benchmark run and no Justfile target.
Two additional facts make the gap concrete:
- The seven shipped rskills/<id>/eval/*.json files all carry paper-cited numbers with reproduced_locally: false and a reproduction_cli that points at the external lerobot-eval tool, not at anything in this repo. The harness cannot today reproduce its own benchmark numbers without an out-of-tree CLI.
- A LIBERO suite is 10 tasks × N seeds × success criteria. The SimEnvironment schema pins exactly one TaskSpec. Running a suite today means copying 10 YAMLs and orchestrating runs by hand. That is exactly the structure that should be a BenchmarkSpec, not ten SimEnvironments.
The user's framing — and the goal of this change — is to give the two subsystems different shapes that match their different purposes:
- Benchmark = fixed (robot, scene, suite, protocol). Only the vla varies. Apples-to-apples leaderboard. Output: a validated RSkillEvalResult JSON dropped into rskills/<vla>/eval/<id>.json with reproduced_locally: true.
- Simulation = everything is configurable. One-off rollouts, dev, demo videos. No reproducibility guarantee across rSkills.
Decision
- New BenchmarkSpec (and ProtocolSpec) Pydantic model in openral_core, next to SimEnvironment. BenchmarkSpec pins id, robot_id, scene: SceneSpec, tasks: list[TaskSpec] (the full suite, not one task), and protocol: ProtocolSpec (n_episodes, seeds, success_key, max_steps, min_reps). The VLA is not a field on BenchmarkSpec: it is the only free axis and is supplied at the CLI.
- New openral benchmark run --suite <id> --vla <id>:<weights_uri> command that iterates tasks × seeds internally by reusing openral_eval.run_evaluation (renamed below). It writes a validated RSkillEvalResult to rskills/<vla>/eval/<suite_id>.json with reproduced_locally: true. The existing openral benchmark report aggregator is unchanged: it consumes the JSONs the runner now produces.
- Built-in benchmarks/ catalogue at repo root, parallel to robots/ and rskills/. Initial files mirror the existing paper-cited JSONs: libero_spatial.yaml, libero_object.yaml, libero_goal.yaml, libero_10.yaml, metaworld_mt50.yaml, aloha_transfer_cube.yaml, pusht.yaml.
- Aggressive rename:
  - ral eval → openral sim run. CLI mounted under a new openral sim Typer group.
  - Package openral_eval → openral_sim.
  - examples/configs/ → scenes/. The YAML schema is unchanged (it is still a SimEnvironment); only the directory moves.
  - The ral-eval console script remains for one minor release as a thin shim that emits a DeprecationWarning and forwards to openral sim run. Imports of openral_eval.* re-export from openral_sim with the same deprecation strategy.
- CLAUDE.md §6.4 update (lands in the runner PR, not this ADR PR): every published rSkill that targets a built-in benchmark must ship eval/<benchmark_id>.json generated by openral benchmark run with reproduced_locally: true. ral skill check validates the linkage.
- Phased migration, one PR per phase (see Migration below). This ADR commits to the destination; the steps to it land separately so each PR stays small and reviewable per CLAUDE.md §7.2.
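The shape described in the Decision can be sketched in a few lines. This is an illustration only: it uses stdlib dataclasses as a stand-in for the Pydantic models, the field types are assumptions, scene is reduced to a string instead of a SceneSpec, and plan_rollouts is a hypothetical helper that is not part of openral_core or the CLI.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class TaskSpec:
    # Hypothetical minimal stand-in; the real TaskSpec lives in openral_core.
    id: str


@dataclass(frozen=True)
class ProtocolSpec:
    # Field names follow the Decision text; the types are assumptions.
    n_episodes: int
    seeds: tuple[int, ...]
    success_key: str
    max_steps: int
    min_reps: int


@dataclass(frozen=True)
class BenchmarkSpec:
    # Deliberately no `vla` field: the VLA is the only free axis.
    id: str
    robot_id: str
    scene: str  # stand-in for SceneSpec
    tasks: tuple[TaskSpec, ...]
    protocol: ProtocolSpec


def plan_rollouts(spec: BenchmarkSpec, vla: str) -> list[tuple[str, str, int]]:
    """Expand a suite into one (vla, task_id, seed) entry per task/seed pair."""
    return [
        (vla, task.id, seed)
        for task, seed in product(spec.tasks, spec.protocol.seeds)
    ]
```

Because the VLA enters only as an argument to the expansion, two runs over the same spec differ in nothing but the policy, which is the apples-to-apples property the benchmark side is meant to guarantee.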
Consequences
- Pros
  - The two CLIs match the two purposes. openral sim run is for free-form rollouts; openral benchmark run is for reproducible suite evaluation. The word "eval" stays on the side that produces RSkillEvalResults.
  - openral benchmark run closes the loop in which reproduction is deferred to the external lerobot-eval tool, as the existing libero.json files document. The on-disk RSkillEvalResult format is unchanged; only its producer changes from "hand-edited" to "openral benchmark run output", so openral benchmark report and the rSkill loader keep working without modification.
  - BenchmarkSpec makes the suite a first-class object. CI matrices over (suite, rSkill) become a one-liner instead of ten copied YAMLs.
  - The rename removes the "two evals" overload that confuses contributors at first contact with the repo.
- Cons
  - Naming churn. Touches the Justfile, every examples/configs/ YAML's docstring, every rskills/<id>/eval/<benchmark>.json's reproduction_cli field (currently points at lerobot-eval), docs/contributing/development.md, docs/quickstart/so100.md, docs/adr/0002-eval-and-sim-environments.md, the repo state map, and docs/METHODS.md. Mitigated by the one-release ral-eval / openral_eval deprecation shim.
  - One more normative schema (BenchmarkSpec + ProtocolSpec) in openral_core. Lands on the existing pre-publish baseline (schema_version: "0.1"); no migrator while the schema is pre-publish (CLAUDE.md §1.6).
  - The Justfile gains bench-* targets symmetric with sim-*. Surface area grows.
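The deprecation shim that mitigates the naming churn could look roughly like the following. This is a minimal sketch under stated assumptions: sim_run and legacy_eval_main are hypothetical names standing in for the new and old entry points, and the real shim may forward arguments differently.

```python
import warnings


def sim_run(argv: list[str]) -> int:
    """Stand-in for the renamed entry point (openral sim run)."""
    print("openral sim run", *argv)
    return 0


def legacy_eval_main(argv: list[str]) -> int:
    """Stand-in for the deprecated ral-eval console script."""
    warnings.warn(
        "ral-eval is deprecated; use 'openral sim run' instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not the shim
    )
    # Forward unchanged so behaviour is identical apart from the warning.
    return sim_run(argv)
```

Python hides DeprecationWarning by default outside `__main__` and test runners, so a shim like this mostly surfaces in CI and under `-W default`.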
Migration
Phased — one PR per phase. This ADR is PR A.
- PR A (this PR): ADR-0009 + ADR-0002 amendment + repo state map planning blocks. No code, no schema, no rename. Verification: just docs-build (mkdocs --strict) and offline-load of the repo state map.
- PR B: Schema-first. Add BenchmarkSpec and ProtocolSpec to openral_core on the existing pre-publish baseline (schema_version: "0.1"); add hypothesis fuzz tests; export updated JSON Schema via just schema-export. No CLI, no runtime change. The schema validates against a real fixture benchmarks/libero_spatial.yaml added in the same PR (CLAUDE.md §1.11: real fixture, not a placeholder).
- PR C: Rename. openral_eval → openral_sim; ral eval → openral sim run; examples/configs/ → scenes/. Ship the back-compat shim (ral-eval script + openral_eval re-export module emitting DeprecationWarning). Justfile, docs, repo state map, and docs/METHODS.md updates. No functional change beyond the rename: _check_rskill_compatibility, the registries, and the scene adapters are untouched.
- PR D: Implement openral benchmark run. Reuses openral_sim.run_evaluation internally to loop over tasks × seeds. Adds the benchmarks/ catalogue. Writes a validated RSkillEvalResult JSON. CLAUDE.md §6.4 wording update lands here so the contract and the code that satisfies it ship together.
- PR E: Regenerate rskills/<id>/eval/*.json via the new runner where feasible (GPU runner time permitting). Where infeasible (e.g. LIBERO-Long is documented as ~8 h on A100), keep the paper-quoted JSON, but replace its reproduction_cli text with the in-tree openral benchmark run … command so the documented reproduction path is the one this repo owns.
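PR D's loop over tasks × seeds ends in a single result file. The folding step might be sketched as below, assuming each rollout reduces to a (task_id, seed, success) triple. Apart from reproduced_locally, the key names are hypothetical and do not reflect the real RSkillEvalResult schema; aggregate is likewise an illustrative helper, not the runner's API.

```python
import json
from collections import defaultdict


def aggregate(outcomes: list[tuple[str, int, bool]]) -> dict:
    """Fold (task_id, seed, success) outcomes into a per-task summary."""
    if not outcomes:
        raise ValueError("no rollout outcomes to aggregate")
    per_task: dict[str, list[bool]] = defaultdict(list)
    for task_id, _seed, success in outcomes:
        per_task[task_id].append(success)
    # Success rate per task, then an unweighted mean across tasks.
    task_rates = {t: sum(v) / len(v) for t, v in per_task.items()}
    return {
        "reproduced_locally": True,
        "per_task_success_rate": task_rates,  # hypothetical key
        "mean_success_rate": sum(task_rates.values()) / len(task_rates),  # hypothetical key
    }


summary = aggregate([("a", 0, True), ("a", 1, False), ("b", 0, True), ("b", 1, True)])
print(json.dumps(summary, indent=2))
```

Averaging per task first keeps a suite with uneven seed counts from weighting one task more heavily than another; whether the real runner does this is not stated in this ADR.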
Why not other options
- Keep ral eval and only add openral benchmark run. Lower churn, but the "two evals" overload survives and we resolve it only via docs. CLAUDE.md operating principle 4 ("explicit beats implicit") argues against carrying a documented-only disambiguation.
- Wrap external lerobot-eval. Lowest implementation cost; what the current libero.json files already document. Rejected because we don't own the trace surface (no OTel rskill_span / inference_span across the suite), the external CLI's flags are not API-stable, and CLAUDE.md §8 ("reproducibility over speed") wants the reproduction path inside the repo. We can still record the external command in the RSkillEvalSource.reproduction_cli.notes field as a cross-reference.
- Put BenchmarkSpec in a separate python/benchmark/ package instead of core. Lower blast radius on openral_core, but RSkillEvalResult already lives in core; splitting the spec from the result it produces is asymmetric. Mirroring SimEnvironment's home keeps both kinds of typed eval contract in one place and prevents a second package depending circularly on openral_core.
- Soft rename (CLI only, keep openral_eval package name). Keeps internal imports stable but leaves the package name pointing at sim, which is exactly the overload we are trying to eliminate. The rename's whole value is removing the ambiguity at every layer the word appears.
- Make BenchmarkSpec just a list of SimEnvironments. Conceptually close, but each SimEnvironment carries its own vla, seed, n_episodes, and record_video. The benchmark protocol must own those across all tasks for the numbers to compare; threading "this field is authoritative, that one is not" through a list of SimEnvironments is more fragile than a dedicated model.
Amendments
2026-05-16 — Reconciled with ADR-0010 amendment 1 (SimRunner unification)
The Decision text above describes openral_sim.run_evaluation as
the loop driver and openral_eval as a one-release deprecation
shim. Both are now gone:
- run_evaluation (and run_episode) deleted; replaced by openral_sim.SimRunner, a per-step InferenceRunner that shares the same Protocol surface as openral_runner.HardwareRunner. openral sim run and openral benchmark run both drive SimRunner.activate / run / deactivate.
- python/eval-shim/ (the openral_eval re-export package) removed in the same PR: the one-release clock declared in PR C had elapsed by the time of the unification.
The PR D protocol still applies for benchmark artifacts: openral
benchmark run continues to emit
RSkillEvalResult(reproduced_locally=true) JSONs at
rskills/<dir>/eval/<benchmark_id>.json. Only the loop driver
underneath changed. See ADR-0010 amendment 1 for the full unification.
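The activate / run / deactivate surface shared by the sim and hardware runners can be illustrated with a structural Protocol. This is a sketch only: the method signatures and return type are assumptions (see ADR-0010 for the real contract), and FakeSimRunner and drive are hypothetical names used for the example.

```python
from typing import Protocol


class InferenceRunner(Protocol):
    # Method names come from the amendment; signatures are assumed.
    def activate(self) -> None: ...
    def run(self) -> dict: ...
    def deactivate(self) -> None: ...


class FakeSimRunner:
    """Hypothetical runner that only records the order of lifecycle calls."""

    def __init__(self) -> None:
        self.events: list[str] = []

    def activate(self) -> None:
        self.events.append("activate")

    def run(self) -> dict:
        self.events.append("run")
        return {"success": True}

    def deactivate(self) -> None:
        self.events.append("deactivate")


def drive(runner: InferenceRunner) -> dict:
    """Drive any runner through the shared lifecycle."""
    runner.activate()
    try:
        return runner.run()
    finally:
        # Always release resources, even if the rollout raises.
        runner.deactivate()
```

Since Protocol typing is structural, a sim runner and a hardware runner can satisfy the same interface without sharing a base class, which is what lets both CLIs reuse one driver.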
2026-05-18 — Status flipped Proposed → Accepted
All five migration phases declared above have landed on main:
- PR A: this ADR + ADR-0002 amendment + repo state map planning blocks (merged).
- PR B: BenchmarkSpec and ProtocolSpec Pydantic models live in python/core/src/openral_core/schemas.py alongside SimEnvironment, with hypothesis fuzz coverage in python/core/tests/ and exported JSON Schema under docs/reference/schemas/.
- PR C: rename complete. Package openral_eval → openral_sim, CLI ral eval → openral sim run, examples/configs/ → scenes/ (the back-compat shim from PR C has since been removed by ADR-0010 amendment 1 above; see that amendment for the unification details).
- PR D: openral benchmark run --suite <id> --rskill <weights_uri> is implemented at python/sim/src/openral_sim/benchmark.py:53 and wired into the CLI at python/cli/src/openral_cli/main.py:1432 (the benchmark Typer group exposes list / run / report). The benchmarks/ catalogue ships 12 suites covering LIBERO (spatial / object / goal / 10), MetaWorld MT50, gym-aloha (transfer cube + insertion), gym-pusht, ManiSkill3 pick-place, SimplerEnv google-robot, RoboCasa PnP, and GR1 tabletop.
- PR E: openral benchmark report aggregates rskills/<id>/eval/*.json into per-benchmark roll-ups. CLAUDE.md §6.4 carries the authoritative contract referencing this ADR as the canonical producer.
openral benchmark run is now the only canonical producer of
RSkillEvalResult(reproduced_locally=true) per CLAUDE.md §6.4. No
behavioural change against the Decision text — only the status field
flips.
2026-06-08 — BenchmarkSpec deleted (ADR-0042)
PR B above declared BenchmarkSpec (and ProtocolSpec) the normative
suite contract. That class is now gone.
ADR-0041 Task 10 (June 2026)
first flattened BenchmarkSpec to a near-empty wrapper around
list[BenchmarkScene] (each scene already carried its own
n_episodes / seed / task.max_steps / task.success_key /
metadata). ADR-0042 then deleted the
wrapper class outright:
- benchmarks/<id>.yaml is now a bare YAML list of BenchmarkScene mappings at the root; the suite id is the filename stem.
- openral_core.BenchmarkSpec is gone. The schema export docs/reference/schemas/BenchmarkSpec.json and the 13 byte-identical baseline fixtures under tests/fixtures/benchmark_aggregator/ are deleted.
- The five suite invariants previously enforced by BenchmarkSpec.model_post_init (non-empty list, unique task.id, uniform robot_id / n_episodes / seed / metadata) moved to a free function openral_core.raise_on_invalid_suite(scenes, *, suite_id). The matching loader is openral_core.load_benchmark_suite(path). Splitting them lets tests build invalid in-memory suites without disk I/O.
- ProtocolSpec remains exported as a free-standing schema for ADR drafts and report tooling that quote a published protocol verbatim; it is no longer embedded in any other model.
- BenchmarkMetadata gained two optional fields, display_name and simulator, that surface into RSkillEvalResult.benchmark.name / .simulator when present (previously a free-form dict on BenchmarkSpec.metadata). The suite-level uniformity invariant covers them via the per-scene metadata equality check.
- openral_sim.run_benchmark now takes (scenes, vla, *, suite_id, ...); the aggregator pulls display_name / simulator from per-scene metadata and auto-derives arxiv from metadata.paper when the URL contains arxiv.org/.
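A free-function validator of the kind described above might look like this. It is a sketch under assumptions: scenes are plain dicts instead of BenchmarkScene models, check_suite is a hypothetical name (not the real openral_core.raise_on_invalid_suite), and raising ValueError is a guess at the error type.

```python
def check_suite(scenes: list[dict], *, suite_id: str) -> None:
    """Raise ValueError if the in-memory suite violates an invariant."""
    if not scenes:
        raise ValueError(f"{suite_id}: suite has no scenes")

    # Every scene must name a distinct task.
    task_ids = [scene["task"]["id"] for scene in scenes]
    if len(set(task_ids)) != len(task_ids):
        raise ValueError(f"{suite_id}: duplicate task.id")

    # These fields must be identical across all scenes in the suite.
    for key in ("robot_id", "n_episodes", "seed", "metadata"):
        first = scenes[0][key]
        if any(scene[key] != first for scene in scenes[1:]):
            raise ValueError(f"{suite_id}: {key} differs between scenes")
```

Because the check takes an in-memory list, a test can hand it a deliberately broken suite directly, with no YAML file on disk.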
No behavioural change to RSkillEvalResult JSON for any shipped suite,
and the openral benchmark report reader is unchanged. PR D's claim
that the runner is the canonical producer still holds.