ADR-0041: Scene three-tier hierarchy (DeployScene / SimScene / BenchmarkScene)
- Status: Accepted
- Date: 2026-06-08
- Deciders: TSC, sim-WG
- Related: ADR-0002 (eval and sim environments; the single-file SceneEnvironment shape this ADR retires); ADR-0009 (rSkill eval contract: BenchmarkSpec / RSkillEvalResult / openral benchmark run); ADR-0010 (inference runner amendment 1; the per-task seed-loop semantics now formalised on BenchmarkScene); ADR-0019 (dataset bridge; state/action dim contracts that travel via the rSkill manifest, not the scene); ADR-0034 (deploy-sim scene attach for arms; the no-task, env-only scene shape this ADR formalises as DeployScene); CLAUDE.md §1.3 (types are the contract), §1.4 (explicit beats implicit), §1.6 (schemas evolve, but never silently), §1.11 (real components, not mocks).
Context
SceneEnvironment (ADR-0002) was a single Pydantic model serving every
config-driven entrypoint in the codebase — openral sim run,
openral deploy sim, openral benchmark run, the audit tool, the
tutorials. The model accreted optional fields one entrypoint at a time:
n_episodes / seed / record_video for sim, task: TaskSpec | None
for deploy, and (via BenchmarkSpec) a protocol / per-task seeds
list for benchmark eval. The result was a permissive schema that
silently widened across tiers: a YAML missing task was a legal deploy
config and a silently broken sim config; a YAML with n_episodes: 500
+ rich metadata was a legal sim config and a silently sub-canonical
benchmark config; the audit tool, tutorials, and Justfile each had to
pick the right one by convention, not by type.
Three specific failures motivated the refactor:
- rSkill names leaked into scene filenames. scenes/benchmarks/ contained 21 YAMLs named <rskill>_<scene>.yaml (smolvla_libero_spatial.yaml, pi05_robocasa_pnp_nf4.yaml, …). Pairing any other rSkill with the same scene meant duplicating the YAML or editing it in place. The scene × policy cross-product was muddied at the filesystem layer.
- The CLI couldn't tell tiers apart. openral sim run happily loaded a benchmark YAML and silently dropped its n_episodes / seed / metadata.paper. openral deploy sim happily loaded a sim YAML and silently ignored its task. The user got the wrong run semantics with no error.
- BenchmarkSpec carried duplicate scene metadata. A suite YAML embedded robot_id + scene + protocol at the top level and per-task TaskSpec entries. When the protocol's max_steps contradicted a per-task value, the loader silently preferred one over the other. The per-task list was syntactically TaskSpec but semantically a (scene, task, n_episodes, seed, metadata) tuple, the same information SceneEnvironment carried in a flat YAML.
Non-goals
- This ADR does not change the openral deploy run real-HAL entrypoint, which never took a scene config (ADR-0032).
- It does not alter the on-disk schema_version (per CLAUDE.md §1.6 the surface evolves in place pre-publish; the file stays at "0.1").
- It does not add new actuation-path code; Skill / Reasoner / Safety layers are untouched.
- It does not introduce a rskill_ref field on any scene tier; every entrypoint takes --rskill on the CLI so scenes stay rSkill-agnostic.
Decision
Replace the single SceneEnvironment model with three Pydantic
models forming a strict inheritance chain:
DeployScene ⊆ SimScene ⊆ BenchmarkScene
each backing exactly one CLI entrypoint, each with its own subdirectory
under scenes/, and each strictly rejected by the other CLIs (no
silent widening across tiers).
scenes/
├── deploy/ # DeployScene YAMLs — env-only, no task
│ ├── libero_pnp.yaml
│ ├── openarm_tabletop.yaml
│ ├── robocasa_pnp.yaml
│ └── so101_box.yaml
├── sim/ # SimScene YAMLs — env + task, no metadata
│ ├── franka_libero_pnp.yaml
│ ├── libero_spatial.yaml
│ ├── openarm_tabletop.yaml
│ ├── robocasa_gr1_pnp_cup_to_drawer.yaml
│ ├── robocasa_panda_mobile_kitchen.yaml
│ ├── robocasa_pnp.yaml
│ ├── so101_tube_insertion.yaml
│ └── tabletop_cube_push.yaml
└── benchmark/ # BenchmarkScene YAMLs — env + task + n_episodes + seed + metadata
  ├── aloha_insertion.yaml
  ├── aloha_transfer_cube.yaml
  ├── libero_spatial.yaml
  ├── maniskill_pick_cube.yaml
  ├── metaworld_push.yaml
  ├── pusht.yaml
  └── widowx_carrot_on_plate.yaml
Schema contracts
| Field | DeployScene | SimScene | BenchmarkScene |
|---|---|---|---|
| scene: SceneSpec | required | required | required |
| robot_id: str \| None | optional | optional | required (non-None) |
| task: TaskSpec \| None | forbidden | optional | required |
| task.max_steps | n/a | optional | required |
| task.success_key | n/a | optional | required |
| n_episodes: int | not used | optional (defaults 1) | required |
| seed: int | not used | optional | required |
| metadata: BenchmarkMetadata | forbidden | forbidden | required (paper URL + honest_scope) |
TaskSpec loses the dead seed field (seeding lives on the scene
runtime, not per-task — fixed in a separate prior fix(core) commit)
and gains optional max_steps / success_key so SimScene tasks can
omit eval-only fields. BenchmarkScene.model_post_init enforces that
task.max_steps and task.success_key are non-None (no defaults
inherited).
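The BenchmarkScene rule above can be illustrated with a minimal sketch. It uses stdlib dataclasses as a stand-in for the real Pydantic models, so `BenchmarkSceneSketch`, its flattened `scene_id` field, and the error wording are assumptions made for illustration; metadata is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskSpec:
    """Simplified stand-in: eval-only fields are optional at this level."""

    id: str
    max_steps: Optional[int] = None    # SimScene tasks may omit this
    success_key: Optional[str] = None  # SimScene tasks may omit this


@dataclass(frozen=True)
class BenchmarkSceneSketch:
    """Hypothetical stand-in for BenchmarkScene (not the real model)."""

    scene_id: str
    robot_id: str
    task: TaskSpec
    n_episodes: int
    seed: int

    def __post_init__(self) -> None:
        # Plays the role of BenchmarkScene.model_post_init: the eval-only
        # task fields must be set explicitly, no defaults are inherited.
        missing = [
            name
            for name in ("max_steps", "success_key")
            if getattr(self.task, name) is None
        ]
        if missing:
            raise ValueError(
                f"benchmark task {self.task.id!r} is missing: {', '.join(missing)}"
            )
```

A task that is valid for a sim tier (no max_steps, no success_key) is therefore rejected as soon as it is placed in the benchmark tier, rather than failing later at run time.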
CLI strictness
A new openral_core.load_scene_strict(path, expected) helper is the
single ingest path for every scene-driven CLI. It loads path as a
YAML mapping, refuses to widen across tiers, and raises
ROSConfigError carrying a redirect message that names the right CLI
command:
| Entrypoint | Accepts | Rejects (with redirect) |
|---|---|---|
| openral deploy run | (no config) | n/a |
| openral deploy sim --config <DeployScene> | DeployScene | SimScene / BenchmarkScene |
| openral sim run --config <SimScene> | SimScene | DeployScene / BenchmarkScene |
| openral benchmark scene --config <BenchmarkScene> | BenchmarkScene | DeployScene / SimScene |
| openral benchmark run --suite <id> | BenchmarkSpec | n/a (suite-file loader) |
mypy --strict overloads on load_scene_strict narrow the return type
to the exact expected tier so call sites cannot accidentally use a
field that is None at runtime. Tier detection is by structural
matching of the YAML's top-level keys (presence of metadata +
n_episodes + seed ⇒ BenchmarkScene; absence of task ⇒
DeployScene; otherwise SimScene). The matching is conservative — a
DeployScene YAML that grows a task: key becomes a SimScene by virtue
of the new key, and the loader's existing extra-key strictness catches
misnamed keys early.
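The structural matching described above can be sketched as follows. This is not the load_scene_strict implementation: `detect_tier`, `check_tier`, `SceneTierError` (a stand-in for ROSConfigError), and the message wording are hypothetical, and the rule order simply follows the order given in the text. The input is an already-parsed YAML mapping.

```python
from typing import Any, Mapping

# Redirect targets copied from the entrypoint table; wording is illustrative.
_CLI_FOR_TIER = {
    "deploy": "openral deploy sim --config <DeployScene>",
    "sim": "openral sim run --config <SimScene>",
    "benchmark": "openral benchmark scene --config <BenchmarkScene>",
}


class SceneTierError(ValueError):
    """Hypothetical stand-in for ROSConfigError."""


def detect_tier(doc: Mapping[str, Any]) -> str:
    """Classify a parsed scene mapping by its top-level keys."""
    if {"metadata", "n_episodes", "seed"} <= set(doc):
        return "benchmark"
    if "task" not in doc:
        return "deploy"
    return "sim"


def check_tier(doc: Mapping[str, Any], expected: str) -> str:
    """Return the detected tier, or raise with a redirect to the right CLI."""
    got = detect_tier(doc)
    if got != expected:
        raise SceneTierError(
            f"expected a {expected} scene but this file is a {got} scene; "
            f"run `{_CLI_FOR_TIER[got]}` instead"
        )
    return got
```

Passing a sim-shaped mapping with `expected="deploy"` raises an error whose message names `openral sim run`, which is the redirect behaviour the table describes.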
BenchmarkSpec convergence (C2)
A multi-scene suite is exactly an ordered collection of reproducible
single-scene evals. To reflect that, BenchmarkSpec is flattened to
three fields:
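from typing import Any
from pydantic import BaseModel

# BenchmarkScene is the tier model defined in this ADR.
class BenchmarkSpec(BaseModel):
    id: str
    tasks: list[BenchmarkScene]  # field name kept for backward-compatible semantics
    metadata: dict[str, Any]  # free-form suite-level provenance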
model_post_init enforces suite-level invariants:
- tasks is non-empty
- per-task task.id is unique within the suite
- every scene shares robot_id (non-None), n_episodes, seed, and metadata: BenchmarkMetadata
Per-task success_key and task.max_steps may differ across scenes
(maniskill3_pick_place ships PickCube-v1=100 + StackCube-v1=200,
which the aggregator now reports as a worst-case max_steps=200 in the
RSkillEvalResult.protocol). The standalone ProtocolSpec schema is
retained for ADR-0009 report tooling but no longer embedded in
BenchmarkSpec.
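As a rough illustration of the suite-level invariants and the worst-case max_steps rollup, the sketch below works on plain dicts rather than the real models. `check_suite` and `worst_case_max_steps` are hypothetical names, and the error messages are invented for the example.

```python
from typing import Any, Mapping, Sequence

# Fields every scene in a suite must share, per the invariants above.
SHARED_KEYS = ("robot_id", "n_episodes", "seed", "metadata")


def check_suite(scenes: Sequence[Mapping[str, Any]]) -> None:
    """Raise ValueError if the suite breaks one of the listed invariants."""
    if not scenes:
        raise ValueError("suite has no tasks")

    task_ids = [s["task"]["id"] for s in scenes]
    dupes = sorted({t for t in task_ids if task_ids.count(t) > 1})
    if dupes:
        raise ValueError(f"duplicate task ids: {dupes}")

    first = scenes[0]
    if first["robot_id"] is None:
        raise ValueError("robot_id must be non-None")
    for key in SHARED_KEYS:
        if any(s[key] != first[key] for s in scenes[1:]):
            raise ValueError(f"scenes disagree on {key}")


def worst_case_max_steps(scenes: Sequence[Mapping[str, Any]]) -> int:
    """Per-task max_steps may differ; report the largest one."""
    return max(s["task"]["max_steps"] for s in scenes)
```

With the two maniskill3_pick_place tasks mentioned above (100 and 200 steps), `worst_case_max_steps` yields 200, while a suite whose scenes carry different seeds is rejected by `check_suite`.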
run_benchmark_scene: the single-scene sibling
openral benchmark scene --config <BenchmarkScene> (added by Task 9)
fills the gap between openral sim run (sim-only, drops eval
metadata) and openral benchmark run --suite <id> (multi-scene). It
iterates range(scene.seed, scene.seed + scene.n_episodes) against
the one (scene, task) pair, writes the same RSkillEvalResult JSON
shape as the suite runner (so openral benchmark report does not need
to distinguish entrypoints), and surgically updates the rSkill
manifest's benchmarks.<scene_id> field with the average success rate
(opt-out via --no-update-manifest).
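The seed loop and the averaged success rate can be sketched in a few lines. This is an assumption-laden outline, not the run_benchmark_scene runner: `run_scene_episodes` and the `run_episode` callback are hypothetical, and the returned dict is not the RSkillEvalResult shape.

```python
from typing import Callable, Dict, List, Union


def run_scene_episodes(
    seed: int,
    n_episodes: int,
    run_episode: Callable[[int], bool],
) -> Dict[str, Union[List[int], float]]:
    """Run one episode per seed in range(seed, seed + n_episodes)."""
    if n_episodes < 1:
        raise ValueError("n_episodes must be at least 1")
    seeds = list(range(seed, seed + n_episodes))
    # run_episode is a caller-supplied callback returning True on success.
    successes = [bool(run_episode(s)) for s in seeds]
    return {"seeds": seeds, "success_rate": sum(successes) / len(successes)}
```

Because the seeds are derived only from scene.seed and scene.n_episodes, two runs of the same BenchmarkScene visit the same episode seeds in the same order.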
Consequences
Positive
- No silent widening. Each CLI accepts exactly its tier; the loader's redirect message names the right command so users hit the right error in <5 s of staring at a stack trace.
- rSkills are a CLI flag, not a filename. Pairing a new rSkill with an existing scene is --rskill <id>, not a YAML duplicate. The cross-product of (scene × rSkill) is the catalogue (tools/audit_sim_configs.py), not the filesystem layout.
- BenchmarkSpec is a list of reproducible units. No more top-level robot_id / scene / protocol contradicting per-task values. The _aggregate_results rollup is byte-identical to the pre-refactor JSON on all 13 in-tree suites (tests/unit/test_benchmark_aggregator_byte_identical.py, gated by fixtures in tests/unit/fixtures/benchmark_eval_baseline/).
- Tier-aware audit. tools/audit_sim_configs.py now carries a per-row run_mode: Literal["sim", "benchmark", "deploy"] and dispatches to the matching CLI (openral sim run, openral benchmark scene --no-update-manifest --n-episodes 1, openral deploy sim --no-dashboard + SIGINT probe), mirroring the Justfile's sim-* recipes. The catalogue is pure scene × rSkill pairs that exist in tree; scenes without a matching in-tree rSkill are not represented (schema-load coverage stays in tests/unit/test_examples_sim_configs_load.py).
- mypy --strict clean across the boundary. The load_scene_strict overloads remove every existing # type: ignore on the scene-loader path.
Negative
- Breaking change for any consumer importing SceneEnvironment. Pre-refactor code that did from openral_core import SceneEnvironment must switch to SimScene / DeployScene / BenchmarkScene (whichever role the YAML plays). The symbol is removed, not deprecated, so the build fails loudly.
- Three Pydantic models instead of one. Marginal duplication on the common scene: SceneSpec block; a small amount of overlap on robot_id / task field declarations. Accepted as the cost of per-tier strictness.
- One additional CLI verb (openral benchmark scene). Mirrors openral sim run semantically; documented as the single-scene sibling of openral benchmark run --suite.
- YAML migration tax. All 13 benchmarks/*.yaml suite files were rewritten to inline BenchmarkScene entries via YAML anchors (&scene / &task_proto / <<: *libero_scene). Hand-edited once; no migration script ships (per CLAUDE.md §1.6, the on-disk schema_version stays at "0.1").
Neutral
- The scene-id binds to the @SCENES.register("…") Python factory key; the YAML's scene.id is the source of truth, not the filename. Renaming a YAML on disk is a free operation; renaming scene.id requires a matching factory rename.
- Historical ADRs (ADR-0007, 0010, 0017, 0019, 0025, 0026, 0038) keep their stale scenes/benchmarks/<rskill>_<scene>.yaml references, because ADRs are frozen historical text. Forward-looking ADRs (this one included) use the new paths.
Implementation status (this branch)
Phased delivery on the refactor/scenes branch, one PR (#274):
| Task | Scope | Status |
|---|---|---|
| 1 | Remove dead TaskSpec.seed field | done |
| 2 | TaskSpec.max_steps / success_key optional | done |
| 3 | Add DeployScene / SimScene / BenchmarkScene / BenchmarkMetadata Pydantic models | done |
| 4 | Add load_scene_strict() typed-overload helper | done |
| 5 | Create scenes/benchmark/*.yaml (7 files) | done |
| 6 | Create scenes/sim/*.yaml (9 files) and scenes/deploy/*.yaml (4 files) | done |
| 7 | Delete scenes/benchmarks/ (21 YAMLs) and scenes/native/ (4 YAMLs + 1 BDDL) | done |
| 8 | Migrate every SceneEnvironment callsite (3 production + 8 tests) to the right tier | done |
| 9 | Add openral benchmark scene --config <BenchmarkScene> CLI + run_benchmark_scene runner | done |
| 10 | Flatten BenchmarkSpec to list[BenchmarkScene] + migrate 13 benchmarks/*.yaml files | done (superseded by ADR-0042) |
| 11 | Justfile sim-* recipes repointed at the new layout | done |
| 12 | tools/audit_sim_configs.py catalogue rewritten with run_mode dispatch | done |
| 13 | This ADR + tier-aware scenes/README.md + tutorial rewrite | in progress |
Regression coverage:
- tests/unit/test_load_scene_strict.py: 15 tests asserting redirect behaviour for every (got, expected) mismatch.
- tests/unit/test_scene_tier_schemas.py: per-tier construction + invariant tests for DeployScene / SimScene / BenchmarkScene / BenchmarkMetadata.
- tests/unit/test_run_benchmark_scene.py (4) + test_cli_benchmark_scene.py (5): single-scene runner end-to-end + CLI wiring.
- tests/unit/test_benchmark_schemas.py: 30 tests including the 10-row parametrised catalogue load of benchmarks/*.yaml.
- tests/unit/test_benchmark_aggregator_byte_identical.py: 13 parametrised cases asserting RSkillEvalResult JSON output is byte-identical across the refactor for every in-tree suite.
- tests/unit/test_examples_sim_configs_load.py: per-tier guard that every YAML under scenes/<tier>/ validates as the matching schema.
Amendments
2026-06-08 — BenchmarkSpec wrapper deleted by ADR-0042
Task 10 above flattened BenchmarkSpec to a near-empty wrapper around
list[BenchmarkScene]. ADR-0042 then
deleted the wrapper class entirely: benchmarks/<id>.yaml is now a
bare YAML list of BenchmarkScene mappings at the root, the suite id
is the filename stem, and the five suite invariants previously enforced
in BenchmarkSpec.model_post_init moved to a free function
openral_core.raise_on_invalid_suite(scenes, *, suite_id) that the new
loader openral_core.load_benchmark_suite(path) calls separately. The
BenchmarkSpec.{from_yaml, model_post_init, byte-identical baseline
fixtures} and the matching public-symbol export are gone.
Schema-rejection note: the pre-Task-10 {robot_id, scene, protocol,
tasks, metadata} wrapper had already been removed when this task
landed; the loader's only legacy redirect is for the post-Task-10
shape ({id, tasks, metadata} ⇒ ADR-0042 redirect message).
run_benchmark now takes (scenes, vla, *, suite_id, ...) and its
aggregator pulls display_name / simulator from per-scene
BenchmarkMetadata instead of a free-form suite-level dict
(BenchmarkMetadata.display_name / .simulator added in ADR-0042).
arxiv is auto-derived from metadata.paper when the URL contains
arxiv.org/. Output JSON is unchanged for every shipped suite.
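The bare-list suite shape in this amendment (suite id taken from the filename stem, legacy {id, tasks, metadata} wrapper redirected) could look roughly like the sketch below. It is not the load_benchmark_suite implementation: `suite_from_parsed` is a hypothetical helper that receives the already-parsed YAML root, and the error text is invented.

```python
from pathlib import Path
from typing import Any, List, Tuple

# Top-level keys of the post-Task-10 wrapper that the loader redirects.
_LEGACY_WRAPPER_KEYS = {"id", "tasks", "metadata"}


def suite_from_parsed(path: str, root: Any) -> Tuple[str, List[Any]]:
    """Return (suite_id, scene mappings) for a parsed benchmarks/<id>.yaml."""
    suite_id = Path(path).stem  # the suite id is the filename stem

    if isinstance(root, dict) and _LEGACY_WRAPPER_KEYS <= set(root):
        raise ValueError(
            f"{path}: legacy BenchmarkSpec wrapper found; the file must now be "
            "a bare list of BenchmarkScene mappings (see ADR-0042)"
        )
    if not isinstance(root, list):
        raise ValueError(f"{path}: expected a YAML list at the root")
    return suite_id, root
```

Validation of the individual scenes and of the suite invariants would still run afterwards, on the returned list.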
References
- ADR-0002: eval and sim environments (the SceneEnvironment shape this ADR retires).
- ADR-0009: separate sim and benchmarking (BenchmarkSpec / RSkillEvalResult / openral benchmark run / openral benchmark report).
- ADR-0010: inference runner, amendment 1 (per-task seed-loop, now formalised on BenchmarkScene).
- ADR-0034: deploy-sim scene attach for arms (the no-task, env-only shape now formalised as DeployScene).
- ADR-0042: drop BenchmarkSpec for a bare list[BenchmarkScene] (the post-Task-10 wrapper deletion).
- CLAUDE.md §1.3 / §1.4 / §1.6 / §1.11.
- Implementation plan: docs/superpowers/plans/2026-06-07-scene-hierarchy-refactor.md.