ADR-0050: Single-resident-skill VRAM eviction (unload-on-switch)
- Status: Proposed
- Date: 2026-06-12
- Related: ADR-0018 (skill runtime + reasoner); ADR-0025 (LifecycleTransitionTool + lifecycle peers); ADR-0037 (kind: detector perception producers); ADR-0043 (live locate_in_view open-vocab detector); ADR-0046 (out-of-process VLA sidecars).
Context
On an 8 GB GPU (the RTX 4070 Laptop reference dev host) the autonomous
find→navigate→grab loop cannot complete: the open-vocab detector
(LocateAnything-3B, NF4 sidecar, ~5.3 GB peak) and the grab policy
(pi05-robocasa365, ~4.3 GB) together need ~9.6 GB, so they cannot co-reside
in 8 GB. They never run simultaneously in the reasoner cascade
(locate_in_view, then execute_rskill), but both stay resident, and the
overlap OOMs.
Today nothing evicts GPU models:
- The skill runner (rskill_runner_node._execute_cb) overwrites self._active_skill = skill on each dispatch. The prior model is never explicitly released (no empty_cache, no weight drop); it lingers until GC.
- rSkillBase has a symmetric lifecycle (configure/on_load_weights → activate → deactivate → shutdown) but no on_unload_weights hook, so there is no contract for releasing weights.
- The detector runs as an always-on producer.
- deploy_sim.py already carries a comment lamenting "GR00T/RLDX weights resident … starves the GPU (~6.5 GiB)".
A general policy is wanted: at most one heavy model GPU-resident at a time; the previous one unloads when the reasoner switches to another — generalized to all rSkills and the detector, not a one-off.
Constraint discovered during design
RosImageObjectDetectorNode is a plain rclpy.node.Node, not a
LifecycleNode (and is not in the reasoner's lifecycle_peer_node_ids, which
today lists only openral_slam_toolbox). So "the reasoner deactivates the
detector via the existing LifecycleTransitionTool" is not wireable without
first converting the detector to a lifecycle node. LocateAnythingDetector.close()
already terminates the sidecar subprocess (frees its VRAM) — the release
primitive exists; the lifecycle host does not.
Decision
A single-resident-skill eviction policy built from four parts:
- rSkillBase.on_unload_weights() hook (new, default no-op), symmetric with on_load_weights. shutdown() calls it before transitioning to FINALIZED and clears weights_loaded. Subclasses override it to drop model references and call torch.cuda.empty_cache() (or terminate their sidecar). This is the generalized contract: any GPU-backed skill releases here.
- Skill-runner eviction: the runner keeps the resolved _active_skill keyed by (rskill_id, revision). On an execute_rskill whose key differs from the resident skill, it calls old.shutdown() (→ on_unload_weights, freeing VRAM) before resolving/loading the new skill. Re-dispatching the same key reuses the resident skill (no reload). Node on_cleanup/on_shutdown evict the resident skill.
- Detector → LifecycleNode: convert RosImageObjectDetectorNode to a managed lifecycle node. on_activate builds/starts the detector (sidecar); on_deactivate calls self._detector.close() (releases sidecar VRAM); on_cleanup drops it. Launch wires it under the existing lifecycle_manager and adds openral_ros_image_detector to the reasoner's lifecycle_peer_node_ids when enable_object_detector is set.
- Reasoner sequencing: uses the existing LifecycleTransitionTool (ADR-0025). Before dispatching a GPU-heavy actuation skill it can deactivate the detector, and activate it again afterward. The detector being a lifecycle peer (part 3) is what makes this expressible.
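Parts 1 and 2 can be illustrated with a minimal, framework-free sketch. It is not the actual rSkillBase or runner code: SkillBaseSketch, RunnerSketch, CountingSkill, and the execute/evict method names are hypothetical stand-ins, and guarding the unload on weights_loaded is an assumption made here so that a repeated shutdown() stays harmless.

```python
from typing import Callable, Optional, Tuple

SkillKey = Tuple[str, str]  # (rskill_id, revision)


class SkillBaseSketch:
    """Hypothetical stand-in for rSkillBase; only the weight lifecycle is modelled."""

    def __init__(self, key: SkillKey) -> None:
        self.key = key
        self.weights_loaded = False
        self.finalized = False

    def on_load_weights(self) -> None:
        """Subclasses would load the model onto the GPU here."""

    def on_unload_weights(self) -> None:
        """Default no-op; subclasses drop model references or stop their sidecar."""

    def configure(self) -> None:
        self.on_load_weights()
        self.weights_loaded = True

    def shutdown(self) -> None:
        # Assumption: unload only if weights are loaded, so shutdown() is idempotent.
        if self.weights_loaded:
            self.on_unload_weights()
            self.weights_loaded = False
        self.finalized = True  # stands in for the FINALIZED state


class RunnerSketch:
    """Hypothetical runner that keeps at most one skill resident."""

    def __init__(self, factory: Callable[[SkillKey], SkillBaseSketch]) -> None:
        self._factory = factory
        self._active_skill: Optional[SkillBaseSketch] = None

    def execute(self, rskill_id: str, revision: str) -> SkillBaseSketch:
        key = (rskill_id, revision)
        resident = self._active_skill
        if resident is not None and resident.key == key:
            return resident  # same key: reuse, no reload
        self.evict()  # different key: release the old weights first
        skill = self._factory(key)
        skill.configure()
        self._active_skill = skill
        return skill

    def evict(self) -> None:
        """Also what node on_cleanup/on_shutdown would call."""
        if self._active_skill is not None:
            self._active_skill.shutdown()
            self._active_skill = None


class CountingSkill(SkillBaseSketch):
    """Fake skill that counts load/unload calls instead of touching a GPU."""

    loads = 0
    unloads = 0

    def on_load_weights(self) -> None:
        CountingSkill.loads += 1

    def on_unload_weights(self) -> None:
        CountingSkill.unloads += 1
```

Constructing RunnerSketch(CountingSkill) and calling execute twice with the same key leaves the counters at one load and zero unloads; a call with a different key adds one unload before the second load. The ordering (evict, then load) is the point: the old weights are released before the new ones are requested.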
Amendment (2026-06-12) — automatic pre-dispatch eviction. Relying on the
LLM to emit the deactivate was unreliable: in the live autonomous robocasa
run the reasoner dispatched a VLA without freeing the detector, so the
detector (~1.3 GB) co-resident with the policy (~4.5 GB) CUDA-OOM'd the 8 GB
card at load (rldx_sidecar_died_during_boot). The deactivation is now an
automatic, deterministic pre-dispatch policy in reasoner_node, not an
LLM choice: a vram_lifecycle_peers parameter (the deploy launch sets it to
openral_ros_image_detector when --enable-object-detector) lists the GPU
peers the reasoner deactivates before every execute_rskill and
reactivates on its result (_free_vram_peers_then_send /
_reactivate_vram_peers). The send is sequenced behind the change_state
responses so the VRAM is released before the goal reaches the runner.
Reactivation fires on the terminal result and on goal-reject/error (never on
deadline, where the policy may still be resident). This is distinct from
lifecycle_peer_node_ids, which only surfaces peers to the LLM tool palette.
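The ordering described in this amendment can be sketched synchronously as follows. This is an illustration under stated assumptions, not the reasoner_node implementation: PeerSketch, Outcome, and dispatch_with_vram_eviction are hypothetical names, the real change_state calls are asynchronous service requests, and how a failed deactivate is handled is not specified in this ADR, so the sketch only records the response.

```python
from enum import Enum
from typing import Callable, List


class Outcome(Enum):
    RESULT = "result"      # terminal result from the runner
    REJECTED = "rejected"  # goal rejected or errored
    DEADLINE = "deadline"  # deadline hit; the policy may still be resident


class PeerSketch:
    """Hypothetical stand-in for a lifecycle peer reached via change_state."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.active = True

    def change_state(self, transition: str) -> bool:
        if transition not in ("activate", "deactivate"):
            raise ValueError(f"unsupported transition: {transition}")
        self.active = transition == "activate"
        return True  # a real call would deliver a service response later


def dispatch_with_vram_eviction(
    peers: List[PeerSketch],
    send_goal: Callable[[], Outcome],
    log: List[str],
) -> Outcome:
    # 1. Deactivate every VRAM peer and collect each response first.
    for peer in peers:
        ok = peer.change_state("deactivate")
        log.append(f"deactivate:{peer.node_id}:{ok}")
    # 2. Only then send the goal, so VRAM is free before the runner loads.
    log.append("send_goal")
    outcome = send_goal()
    # 3. Reactivate on result or reject/error, never on deadline.
    if outcome is not Outcome.DEADLINE:
        for peer in peers:
            peer.change_state("activate")
            log.append(f"activate:{peer.node_id}")
    return outcome
```

With a single peer named openral_ros_image_detector, the log for a normal run reads deactivate, send_goal, activate, in that order. On Outcome.DEADLINE the peer stays inactive, matching the rule that the policy may still hold the GPU.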
The reasoner change alone was not sufficient: the launch autostart
(_autostart_lifecycle) re-activated the detector ~15 ms after each
deactivate, because its activate handler matched a bare
goal_state="inactive" — which also fires on a runtime deactivate
(active → deactivating → inactive), not just the boot configure. Scoping it
to start_state="configuring" makes the autostart one-shot, so a
reasoner-driven deactivate sticks. Verified live (2026-06-12): detector frees
~1.3 GB, the rldx1-ft-rc365-nf4 VLA then loads and runs policy steps on the
8 GB card instead of OOMing at load.
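The autostart fix comes down to which transition events the activate handler matches. The sketch below models that predicate in plain Python rather than with the launch API. TransitionEvent and both function names are hypothetical, and keeping the goal_state check alongside the new start_state check is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransitionEvent:
    """Hypothetical model of a lifecycle state-transition event."""

    start_state: str
    goal_state: str


def autostart_matches_bare(event: TransitionEvent) -> bool:
    # Old behaviour: any arrival in "inactive" triggers an activate.
    return event.goal_state == "inactive"


def autostart_matches_scoped(event: TransitionEvent) -> bool:
    # Fixed behaviour: only the boot-time configure triggers an activate.
    return event.start_state == "configuring" and event.goal_state == "inactive"


# The two ways a node reaches "inactive" that matter here.
BOOT_CONFIGURE = TransitionEvent(start_state="configuring", goal_state="inactive")
RUNTIME_DEACTIVATE = TransitionEvent(start_state="deactivating", goal_state="inactive")
```

The bare predicate accepts both events, which is why the detector was re-activated right after each reasoner-driven deactivate. The scoped predicate accepts only BOOT_CONFIGURE, so the autostart fires once.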
Alternatives considered
- Central VRAM arbiter service. A node all GPU consumers register with; acquiring exclusive GPU auto-evicts the holder. Cleanest fully-automatic policy, but new always-on infrastructure + protocol. Rejected as heavier than needed; the lifecycle primitives already exist for (1)–(4).
- Idle-timer sidecar unload. Detector sidecar self-releases after N seconds idle, reloads on next detect. No reasoner/lifecycle changes, but timer-based (not switch-driven), and pays full reload latency on every wake. Doesn't match the "unload when the reasoner switches" requirement. Kept as a possible detector-local optimization, not the mechanism.
Consequences
- Positive: the autonomous detect→navigate→grab loop fits in 8 GB; VRAM policy is explicit and generalized; reuses ADR-0025 lifecycle primitives.
- Negative / costs: switching skills now pays a reload (weights + warmup) on each change — acceptable for the S2-paced autonomous loop, not for tight S1 skill alternation. Converting the detector to a lifecycle node touches launch + lifecycle-manager wiring. Spans three layers (perception 1/3, skill runtime 3, reasoner 4) — phase the PR (see below).
- Eviction must never weaken safety: unload happens only between goals, never mid-step; ROSSafetyViolation handling is unchanged.
Testing
- Unit: rSkillBase.shutdown() invokes on_unload_weights + clears weights_loaded; runner evicts on key-change and reuses on same-key (fake skills counting load/unload calls).
- Integration (launch_testing): detector lifecycle activate→deactivate releases the sidecar (process gone); reasoner LifecycleTransition(detector, deactivate) succeeds with the detector as a peer.
- Sim/HIL: the 8 GB co-residency case, where locate_in_view then execute_rskill on panda_mobile/robocasa completes without CUDA OOM (the reproduction this ADR exists to fix).
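The "process gone" assertion in the integration bullet can be prototyped without ROS. The following sketch is an assumption-laden stand-in, not the launch_testing test: SidecarDetectorSketch and DetectorLifecycleSketch are hypothetical, the child process only sleeps in place of a real model sidecar, and the callbacks mirror the on_activate/on_deactivate/on_cleanup split from the Decision section.

```python
import subprocess
import sys
from typing import Optional


class SidecarDetectorSketch:
    """Hypothetical detector whose model would live in a child process."""

    def __init__(self) -> None:
        # Placeholder child that just sleeps; a real sidecar would hold the model.
        self.proc = subprocess.Popen(
            [sys.executable, "-c", "import time; time.sleep(60)"]
        )

    def close(self) -> None:
        # Terminating the child is what would release its VRAM.
        self.proc.terminate()
        self.proc.wait(timeout=10)


class DetectorLifecycleSketch:
    """Hypothetical lifecycle host; callbacks named after the ADR's hooks."""

    def __init__(self) -> None:
        self._detector: Optional[SidecarDetectorSketch] = None

    def on_activate(self) -> None:
        # Build/start the detector (and its sidecar).
        self._detector = SidecarDetectorSketch()

    def on_deactivate(self) -> None:
        # Close the detector so the sidecar process exits.
        if self._detector is not None:
            self._detector.close()

    def on_cleanup(self) -> None:
        # Drop the reference entirely.
        self._detector = None

    def sidecar_running(self) -> bool:
        return self._detector is not None and self._detector.proc.poll() is None
```

A test built on this would call on_activate, check sidecar_running() is true, call on_deactivate, and check it is false. The real integration test would make the same check against the actual sidecar process.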
Phasing (PR boundaries — avoid the >800-line / multi-layer single PR)
- P1: rSkillBase.on_unload_weights + runner eviction/caching (layer 3) + unit tests.
- P2: detector → LifecycleNode + sidecar release on deactivate + launch wiring (layer 1/3) + integration test.
- P3: reasoner detector lifecycle peer + sequencing + the 8 GB sim repro.