ADR-0042: Drop BenchmarkSpec for a bare list[BenchmarkScene]
- Status: Accepted
- Date: 2026-06-09
- Deciders: TSC, sim-WG
- Related: ADR-0009 (the original BenchmarkSpec / ProtocolSpec proposal; this ADR rescinds the suite Pydantic model while keeping the eval contract unchanged); ADR-0041 (the three-tier scene hierarchy whose Task 10 flattened BenchmarkSpec to {id, tasks: list[BenchmarkScene], metadata}, the immediate precursor that exposed BenchmarkSpec as a near-empty wrapper); CLAUDE.md §1.3 (types are the contract), §1.4 (explicit beats implicit), §1.6 (schemas evolve, but never silently), §1.11 (real components, not mocks), §1.13 (no duplicate helpers).
Context
ADR-0041 / Task 10 flattened BenchmarkSpec from
{robot_id, scene, protocol, tasks: list[TaskSpec]} to
{id, tasks: list[BenchmarkScene], metadata}. Each BenchmarkScene
became self-contained — carrying its own robot_id, task,
n_episodes, seed, and metadata: BenchmarkMetadata (paper +
honest_scope provenance) — and the suite wrapper retained only:
- id: str, a stable suite identifier (e.g. "libero_spatial") that doubled as the JSON filename under rskills/<vla>/eval/<id>.json.
- metadata: dict[str, object], a free-form bag holding the suite display name (metadata["suite"] → "LIBERO-Spatial"), a human-readable simulator description (metadata["simulator"] → "gym-pusht (pymunk 2-D)"), and an optional metadata["arxiv"] URL.
- tasks: list[BenchmarkScene], the actual eval payload.
- model_post_init, enforcing five suite-level invariants (non-empty, unique task.ids, uniform robot_id / n_episodes / seed / metadata across the list).
The result is a class whose only structural job is to be a list with two adjacent labels and a validator. Three specific failures motivated removing it entirely:
- The id field duplicates the filename. Every benchmarks/<id>.yaml carries a top-level id: <id> matching its own filename stem. The CLI already addresses suites by filename (openral benchmark run --suite libero_spatial → benchmarks/libero_spatial.yaml). The YAML field is a redundancy that goes wrong when authors rename a file without updating the field, or vice versa. The validator never catches it because the only ground truth is the path the user typed.
- The free-form metadata dict is a structurally unbounded surface on a typed contract. BenchmarkSpec.metadata is declared as dict[str, object], so the aggregator does spec.metadata.get(...) with string keys and runtime isinstance narrowing, which is exactly the shape Pydantic v2 was adopted to retire (CLAUDE.md §1.3). The two strings the aggregator actually emits (benchmark.name, benchmark.simulator) are paper-comparison labels that belong with the per-paper provenance block (BenchmarkMetadata): they describe the published protocol the scene reproduces, not the wrapper that collects them.
- The invariants are list-shape invariants, not class invariants. "Every entry shares robot_id" and "all task.ids are unique" are properties of a list[BenchmarkScene]. Encoding them as model_post_init on a wrapper class hides them from callers that build the list programmatically (the unit tests, _make_tiny_libero_spec in tests/sim/, future scripted suite generators). Those callers either round-trip through the Pydantic wrapper (cost: re-validation plus a constructor call) or skip the invariants entirely.
Non-goals
- This ADR does not change the on-disk schema_version (CLAUDE.md §1.6: pre-publish, the surface evolves in place; the file stays "0.1").
- It does not change the RSkillEvalResult schema or its filename convention (rskills/<vla>/eval/<suite_id>.json).
- It does not alter BenchmarkScene or BenchmarkMetadata field semantics; it only adds two optional display fields to the latter.
- It does not alter ProtocolSpec, which remains an independent schema for ADR / report tooling that quotes published protocols verbatim.
- It does not introduce a versioned migrator. The 13 in-tree benchmarks/*.yaml files are rewritten in the same commit; there is no released artefact to migrate.
Decision
Delete BenchmarkSpec. A benchmark suite is a bare
list[BenchmarkScene] on disk and in memory. The suite identifier is
the filename basename. Two paper-comparison labels move from the
free-form suite dict onto the per-scene BenchmarkMetadata block.
Shape change
# Before (post-Task-10, pre-ADR-0042):
class BenchmarkSpec(BaseModel):
    id: str
    tasks: list[BenchmarkScene]
    metadata: dict[str, object]
    # model_post_init enforces suite invariants

# After (ADR-0042):
# (no class: a benchmark suite is just list[BenchmarkScene])
class BenchmarkMetadata(BaseModel):
    paper: str
    honest_scope: str
    display_name: str | None = None  # was BenchmarkSpec.metadata["suite"]
    simulator: str | None = None  # was BenchmarkSpec.metadata["simulator"]
On-disk YAML shape
# benchmarks/libero_spatial.yaml: bare YAML list, no top-level dict
- &libero_scene
  scene: &scene_block
    id: libero_spatial
    backend: mujoco
    observation_height: 256
    observation_width: 256
  task: &task_proto
    id: libero_spatial/0
    scene_id: libero_spatial
    max_steps: 280
    success_key: is_success
  robot_id: franka_panda
  n_episodes: 10
  seed: 0
  metadata: &meta_block
    paper: "https://arxiv.org/abs/2306.03310"
    honest_scope: "10 episodes per task across all 10 LIBERO-Spatial tasks (100 rollouts total)."
    display_name: "LIBERO-Spatial"
    simulator: "LIBERO (MuJoCo)"
- <<: *libero_scene
  task:
    <<: *task_proto
    id: libero_spatial/1
# … etc
The top-level id: and metadata: blocks are gone. The YAML anchor
pattern (&libero_scene / <<: *libero_scene) carries through
unchanged — DRY-ness was never tied to BenchmarkSpec.
Loader API
# python/core/src/openral_core/loaders.py
def load_benchmark_suite(path: str | Path) -> list[BenchmarkScene]:
    """Load benchmarks/<id>.yaml: a bare YAML list of BenchmarkScene entries.

    Suite-id is derived from the filename stem at the call site. Calls
    `raise_on_invalid_suite(scenes, suite_id=Path(path).stem)` so the
    same five invariants the deleted `BenchmarkSpec.model_post_init`
    enforced still hold.
    """


def raise_on_invalid_suite(
    scenes: list[BenchmarkScene],
    *,
    suite_id: str,
) -> None:
    """Suite-level invariants: non-empty, unique task ids, uniform
    robot_id (non-None) / n_episodes / seed / metadata across the list.
    """
raise_on_invalid_suite is public — callers building suites
programmatically (sim tests, future scripted generators) validate
without round-tripping through a Pydantic model. The invariants are
exactly the five from the deleted BenchmarkSpec.model_post_init;
their error messages name the offending suite_id so the rejection
points at the right benchmarks/*.yaml.
Runner API
# python/sim/src/openral_sim/benchmark.py
def run_benchmark(
    scenes: list[BenchmarkScene],
    *,
    suite_id: str,
    vla: VLASpec,
    device: str | None = None,
    save_dir: str | None = None,
) -> tuple[RSkillEvalResult, list[EpisodeResult]]:
    """Run a benchmark suite end-to-end against one VLA."""
Callers pass (scenes, suite_id) rather than a BenchmarkSpec.
suite_id is the only thing the deleted class added that the list
cannot represent itself; making it a keyword-only argument keeps the
runner signature self-documenting.
Aggregator changes
The _aggregate_results rollup that emits RSkillEvalResult was
reading three fields off spec.metadata. Their replacements:
| Old source | New source |
|---|---|
| spec.metadata.get("suite", spec.id) | first.metadata.display_name or suite_id |
| spec.metadata.get("simulator", first.scene.id) | first.metadata.simulator or first.scene.id |
| spec.metadata.get("arxiv") | first.metadata.paper if "arxiv.org/" in paper else None |
The arxiv auto-derivation mirrors the existing behaviour of
_aggregate_scene_results (the single-scene sibling added by
ADR-0041 Task 9), so paper-comparison reports built on top of both
runner entrypoints stay uniform.
CLI changes
openral benchmark run --suite <id-or-path> is unchanged at the user
surface. Internally, _resolve_benchmark_spec(suite, benchmarks_dir)
becomes _resolve_benchmark_suite(suite, benchmarks_dir) and returns
tuple[list[BenchmarkScene], str] (the scenes plus the derived
suite-id). openral benchmark list continues to walk
benchmarks/*.yaml and emit basenames; that path was always
filename-driven.
openral benchmark scene --config <BenchmarkScene> is untouched — it
already accepted a single BenchmarkScene YAML, never a suite.
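How the id-or-path argument maps to a suite id can be sketched with pathlib alone. The rule used here (an argument ending in .yaml is an explicit path, anything else is an id under benchmarks_dir) is an assumption, as are the helper names resolve_suite_path and list_suite_ids. Unlike _resolve_benchmark_suite, this sketch stops before loading and returns the resolved path instead of the scenes.

```python
from pathlib import Path


def resolve_suite_path(suite: str, benchmarks_dir: Path) -> tuple[Path, str]:
    """Map an id-or-path argument to (yaml_path, suite_id)."""
    candidate = Path(suite)
    # Assumed rule: a .yaml suffix marks an explicit path, otherwise it is an id.
    if candidate.suffix == ".yaml":
        path = candidate
    else:
        path = benchmarks_dir / f"{suite}.yaml"
    if not path.is_file():
        raise FileNotFoundError(f"no benchmark suite at {path}")
    # The suite id is always the filename stem, never a field inside the file.
    return path, path.stem


def list_suite_ids(benchmarks_dir: Path) -> list[str]:
    """Filename-driven listing, in the spirit of `openral benchmark list`."""
    return sorted(p.stem for p in benchmarks_dir.glob("*.yaml"))
```

Both entry points read the identifier from the filename, so renaming a file changes the suite id everywhere at once.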
Tests
- tests/unit/test_benchmark_schemas.py rewritten: the BenchmarkSpec happy-path / invariants tests become raise_on_invalid_suite + load_benchmark_suite tests. The 13-row catalogue parametric stays, now via load_benchmark_suite.
- tests/unit/test_benchmark_runner.py switches from _mini_spec() returning BenchmarkSpec to returning tuple[list[BenchmarkScene], str].
- tests/unit/test_benchmark_aggregator_byte_identical.py deleted along with its 13 baseline JSONs under tests/unit/fixtures/benchmark_eval_baseline/. The test was a one-shot Task-10 regression guard; with the BenchmarkSpec shape gone there is no pre-refactor surface to compare against. The new catalogue parametric (test_benchmarks_catalogue_fixture_is_a_valid_benchmark_spec, ironically retained name) plus the runner tests cover the same ground.
- tests/sim/test_*_cli_benchmark*.py rewrite their _make_tiny_*_spec helpers to emit a bare YAML list, so the sim CLI tests exercise the full new ingest path.
Consequences
Positive
- One fewer class on the public surface. BenchmarkSpec was a validator-on-a-list with two free-form labels. Deleting it removes ~170 lines of schema, one from_yaml classmethod, one model_post_init validator, one JSON Schema export, and one entry from openral_core.__all__.
- Suite identity is unforgeable. suite_id is the filename stem; there is no id: field that can desync. Renaming benchmarks/foo.yaml to benchmarks/bar.yaml is a one-step rename with no editing.
- Display labels travel with their paper provenance. BenchmarkMetadata.{display_name, simulator} live alongside paper + honest_scope on every BenchmarkScene. A scene reused in a different suite carries its labels with it.
- Invariants are reusable. raise_on_invalid_suite(scenes, suite_id=...) validates any list[BenchmarkScene] regardless of origin, so sim tests that build suites programmatically no longer have to construct a Pydantic wrapper just to get the validator.
- Aggregator output is structured all the way down. The auto-derived arxiv URL mirrors _aggregate_scene_results, so the two runner entrypoints emit identically-shaped RSkillEvalResult JSONs.
Negative
- Breaking change for any consumer importing BenchmarkSpec. from openral_core import BenchmarkSpec no longer works. Consumers switch to from openral_core import load_benchmark_suite, raise_on_invalid_suite (or accept the bare list[BenchmarkScene] shape). There is no deprecation shim: the symbol is removed in the same commit so the build fails loudly.
- Byte-identicality baselines deleted. The 13 JSONs under tests/unit/fixtures/benchmark_eval_baseline/ were captured against the pre-Task-10 BenchmarkSpec. Their content is now stale by construction (paths through _aggregate_results differ). Re-capturing them against the new aggregator would only re-pin the post-ADR-0042 shape; the dedicated catalogue + runner tests already cover that.
- YAML migration tax (again). All 13 benchmarks/*.yaml files are rewritten in the same commit to drop the top-level dict and inline the two display fields onto each scene's metadata. Hand-edited once; no migration script ships.
- Suite-level notes: and arxiv: fields disappear. A handful of YAMLs carried free-form metadata.notes strings that were never read by any consumer (only the suite name, simulator, and arxiv URL ever surfaced in RSkillEvalResult). The notes are preserved as YAML comments at the top of each rewritten file: visible to authors, invisible to the loader.
Neutral
- ProtocolSpec survives unchanged. Independent schema, no embedding in benchmark suites since Task 10. Kept for ADR drafts and benchmark-report tooling that wants to quote a published protocol outside a suite context.
- tasks field disappears with the wrapper. The on-disk YAML is a bare list now, so there is no tasks: key to bikeshed. Code-side the list is just scenes (variable name); keeping the old name would only confuse new readers about what kind of object it is.
Implementation status (this branch)
Phased delivery on the refactor/benchmark-spec-removal branch
(forked off refactor/scenes HEAD after ADR-0041 Task 16):
| Task | Scope | Status |
|---|---|---|
| 1 | This ADR | done |
| 2 | Schema: add BenchmarkMetadata.{display_name, simulator}; delete BenchmarkSpec; export load_benchmark_suite + raise_on_invalid_suite | done |
| 3 | Runner: run_benchmark(scenes, *, suite_id, vla, …); aggregator switches to per-scene metadata | done |
| 4 | CLI: _resolve_benchmark_spec → _resolve_benchmark_suite; --dry-run / --out paths updated | done |
| 5 | Rewrite all 13 benchmarks/*.yaml as bare lists with display fields on per-scene metadata | done |
| 6 | Tests: rewrite schema + runner tests; rewrite tiny-suite helpers in sim tests; delete byte-identicality fixtures + test | done |
| 7 | Docs: scenes/README.md, benchmarks/README.md, docs/reference/sim-environments.md, docs/METHODS.md, repo state map, regenerated JSON Schema export | done |
Regression coverage:
- tests/unit/test_benchmark_schemas.py: load_benchmark_suite happy path + raise_on_invalid_suite invariants (non-empty, unique ids, uniform robot_id / n_episodes / seed / metadata, first-scene robot_id non-None) + 13-row parametrised catalogue load.
- tests/unit/test_benchmark_runner.py: _aggregate_results rollup + run_benchmark end-to-end against the mock scene + zero policy (2 tasks × 3 episodes = 6 episodes without GPU).
- tests/sim/test_franka_panda_smolvla_cli_benchmark.py + tests/sim/test_panda_mobile_pi05_cli_benchmark_robocasa.py: real CLI invocation against LIBERO + SmolVLA and RoboCasa + pi05, with the tiny-suite helpers rewriting to a bare YAML list and the manifest writeback assertion preserved.
References
- ADR-0009: the original BenchmarkSpec / ProtocolSpec proposal. This ADR rescinds the suite class while leaving the eval contract intact.
- ADR-0041: three-tier scene hierarchy. Task 10 flattened BenchmarkSpec to the near-empty wrapper that ADR-0042 removes.
- CLAUDE.md §1.3 / §1.4 / §1.6 / §1.11 / §1.13.