ADR-0035: Perception → spatial-memory object lift (2D→3D object-center lift)
- Status: Accepted
- Date: 2026-06-03
- Related: ADR-0030 (OccupancyVoxels on
/openral/world_voxels, base frame); ADR-0037 (object-detection rSkill that producesObjectsMetadataon/openral/perception/objects— the input this ADR consumes); ADR-0018 (reasoner consumesWorldState.detected_objectsvia the snapshot); CLAUDE.md §3 (layer boundaries — Sensors/rSkill → World State is a new cross-layer data flow).
Context
WorldState.detected_objects: list[DetectedObject] has existed in the schema since the earliest
spatial-memory work (PRs #217–#220) but has always been an empty list — no node produced it.
ADR-0037 deliberately deferred the "2D→3D pose-lift" step, noting that the detector was the
missing producer but that the lift itself (needing depth, intrinsics, TF2, and a temporal
association policy) was a separate, later decision.
Two existing systems provide the raw material:
- ADR-0037 object-detector rSkill publishes
ObjectsMetadata{ detections: list[ObjectDetection2D], model_id, sensor_id, frame_width, frame_height }on/openral/perception/objects. EachObjectDetection2Dcarries a pixel bounding box but no depth. - ADR-0030 occupancy voxels publishes a dense
OccupancyVoxelsgrid on/openral/world_voxels(base frame, row-major x-fastest). These cells are in 3D, but they are associated with the robot's local environment as a whole, not with individual objects.
The lift gap: no node correlates the 2D pixel boxes against the 3D voxel grid and writes the
resulting object centers into the aggregator. The reasoner cannot reason spatially about objects
because WorldState.detected_objects is always empty.
Decision
New cross-layer data flow: Sensors/rSkill perception (Layer 1/3) → World State (Layer 2). This boundary crossing is what requires this ADR (CLAUDE.md §3).
Lift algorithm
For each detected 2D box, lift to a single (x, y, z) center in map frame via
frustum-projection + K-nearest-to-box-center from the /openral/world_voxels grid:
- Decode occupied voxel centers from
OccupancyVoxelsinto base-frame 3D points. - Project all base-frame centers into the camera image plane using
SensorSpec.intrinsicsand thebase_link → <camera_optical>TF2 transform. Discardz ≤ 0. - Scale each detection's
bbox_xyxyfrom the detector'sframe_width × frame_heighttointrinsics.width × intrinsics.height(handles detector stream ≠ sensor native resolution). - Per detection: keep voxels whose projected
(u, v)falls inside the scaled box; rank by 2D distance from box center; take the K nearest (default K = 25). If fewer thanmin_voxels(default 3) survive → skip (insufficient 3D evidence; never fabricate a pose). - Transform the K selected centers to
mapframe using thebase_link → mapTF2 transform.center_map = mean(K points).bbox_3d = (min_xyz, max_xyz)over the K points in map frame. - Emit
DetectedObject(label, confidence, pose=Pose6D(center_map, (0,0,0,1), frame_id="map"), bbox_3d=bbox_3d).
Association policy — ObjectMemory
A per-run "tick" applies IoU-gated association (greedy, highest-confidence-first) to keep the memory stable across frames:
- Same-label + 3D AABB IoU ≥
iou_threshold(default 0.3) → freeze. Leave stored pose andbbox_3dunchanged; bumpconfidence = max(stored, candidate); reset miss counter. This prevents jitter from re-writing an object that is already well-remembered. - No match → create. New tracked entry with a fresh monotonic
track_id. - In-FOV + unmatched → evict after
max_misses(default 1). An object the camera should have seen but did not is presumed moved or removed. - Out-of-FOV → retain unchanged. Objects outside the current camera view are never evicted by the absence of detections — the memory persists across camera pans.
The FOV predicate projects each DetectedObject.pose.xyz (map frame) back into the current
camera image; an object is "in FOV" iff its center projects within (intrinsics.width × intrinsics.height).
Hosting
The lift and memory tick are hosted inside the existing _WorldStateLifecycleNode, feeding
the shared WorldStateAggregator via a new typed setter update_detected_objects(). This is
consistent with the existing pattern (the node subscribes sensor topics and calls update_*
setters on the aggregator; the aggregator stays ROS-free).
All lift logic lives in pure, ROS-free, unit-testable Python modules in openral_world_state:
object_lift.py (VoxelFrustumLifter, geometry helpers, ObjectsLiftError) and
object_memory.py (ObjectMemory). The node wiring (subscriptions, TF2 buffer, timer) is in
lifecycle_node.py as usual.
Best-effort contract (invariant)
A missing, empty, or stale voxel grid is a normal condition. When there is no usable grid the world-state node behaves exactly as today:
WorldState.detected_objects == []. No error, no warning spam, no degradation of the existing snapshot publish path. Never fabricate a pose (CLAUDE.md §1.2): any path without a truthful 3D lift — nomapTF, no camera intrinsics, no in-frustum voxels, stale grid — skips the detection silently.
Schema change
ObjectsMetadata gains two required fields:
frame_width: int = Field(gt=0) # pixel width of the detector's input frame
frame_height: int = Field(gt=0) # pixel height of that frame
These make the pixel space of bbox_xyxy explicit so the lifter can scale to intrinsics
resolution rather than silently assuming stream resolution == sensor native resolution
(CLAUDE.md §1.4 "explicit beats implicit"). The detector backends (ObjectsDetector,
NvmmObjectsDetector) populate these at detect() time. Pre-publish: schema_version
stays "0.1" (CLAUDE.md §1.6, no migrators).
Alternatives considered
- Approach B: standalone
object_memory_node+openral_msgs/DetectedObjectsIDL. A separate process would consume/openral/perception/objectsand/openral/world_voxels, do the lift, and publish a newDetectedObjectsIDL the world-state node subscribes to. Cleaner separation of concerns and allows the memory to be consumed by separate-process nodes (e.g. the reasoner node) without sharing a process. Deferred: adds a new IDL (a layer boundary), a new node binary, and inter-process latency. The v1 scope (in-process Pydantic memory for the collocated skill_runner / reasoner) does not require it; this ADR records it as the top documented follow-up. - Depth-image lift. Use a registered
DEPTH16sensor frame fromWorldState.image_framesinstead of the occupancy voxels. Rejected for v1: not every deployment carries a depth sensor alongside the RGB camera used by the detector; the voxel grid (ADR-0030) is an existing mandatory component. Depth-based lift is a future alternative. - K-nearest with depth tie-break. Background voxels directly behind the object center can bias the mean center outward. Documented as a future tunable (v1 accepts the simplification).
Consequences
Layer boundary (new, authorized by this ADR)
The world-state lifecycle node now subscribes to /openral/perception/objects (a perception
output published by an rSkill/Layer 3 node) and /openral/world_voxels (Layer 2 ADR-0030
output). This is the first data-flow edge from the rSkill layer into the World State layer
(perception → World State). It is authorized here.
Scope (v1) and top follow-ups
- Detected objects now on the wire (landed). The former top follow-up has shipped: the
shared in-process
WorldStateAggregatorstill ownsWorldState.detected_objects, butopenral_msgs/WorldStateStampednow also carries them asdetected_object_*parallel arrays (detected_object_labels/detected_object_confidences/detected_object_positions(geometry_msgs/Point[]) /detected_object_track_ids(int32[],-1= unset) /detected_object_frame). They are serialised by_fill_detected_objectsinsidebuild_world_state_stamped_msg, and read back byworld_state_from_idl. Separate-process consumers reading/openral/world_state_slow(e.g. a standalone reasoner node) now do see the spatial memory. This was a deliberate IDL boundary crossing, authorized by this ADR. - No orientation:
quat_xyzw = (0, 0, 0, 1)(identity). - Greedy (not Hungarian) data association.
- Single-camera FOV predicate.
- No safety-kernel feed: converting
detected_objectsintoWorldCollisionPrimitives is a separate downstream concern. - Approach B (standalone process) is the documented path if a separate process is later required.
Deploy-sim integration (landed)
To exercise the lift in openral deploy sim without standing up the GStreamer perception tee
(ADR-0018 F6 / ADR-0037) that the on-robot path uses, a standalone ROS-Image detector ships in
the new openral_perception_ros package. Its ros_image_detector_node
(RosImageObjectDetectorNode) subscribes the panda_mobile agentview_left RGB
sensor_msgs/Image, runs the GStreamer-free openral_runner ObjectsDetector (the
rtdetr-coco-r18 rSkill ONNX), and publishes openral_msgs/PromptStamped (carrying
ObjectsMetadata) on /openral/perception/objects — the same topic and schema the world-state
node already consumes. Detections are attributed to sensor_id (default front_depth) so the
lift projects through the co-located depth camera's optical frame.
It is wired into the generic sim_e2e.launch.py behind the enable_object_detector launch
argument, driven by openral deploy sim --enable-object-detector (auto-enabled when
rskills/rtdetr-coco-r18/model.onnx is present). The detection camera can render at up to 640²
with resolution-consistent intrinsics (scale_intrinsics_to / the depth_synth_kwargs
render_size) so the lift scales bbox_xyxy to the intrinsics resolution correctly.
New node parameters
See packages/world_state/README.md §Object-lift parameters for the full table.
Key parameters: object_lift_enabled (master toggle, default True), object_lift_k_nearest
(default 25), object_lift_min_voxels (default 3), object_lift_iou_threshold (default 0.3),
object_lift_memory_cadence_hz (default 2.0 Hz), object_lift_max_misses (default 1),
object_lift_voxel_staleness_s (default 1.0 s).
New exception
ObjectsLiftError(ValueError) — raised by geometry helpers on degenerate inputs (e.g. a
near-zero quaternion or an occupancy buffer whose byte length does not match size_x * size_y * size_z).
Amendment — 2026-06-15 (octomap-free depth fallback; GitHub #11)
The object lift previously required an octomap voxel grid (/openral/world_voxels) as its depth
source and silently dropped every detection when none was published — so with --no-enable-octomap
(common in dense scenes that false-positive the kernel's octomap collision check) spatial memory
never accumulated and recall_object always missed. The lift now falls back to the depth camera
point cloud when no fresh voxel grid is available: _WorldStateLifecycleNode subscribes the
configurable object_depth_points_topic (default /openral/cameras/front_depth/points,
sensor_msgs/PointCloud2, BEST_EFFORT) and, when _latest_voxels is stale/absent, decodes it via
depth_cloud_to_centers_base (drop non-finite returns, subsample to object_lift_depth_max_points
= 4000, transform the cloud's optical frame → base frame) and feeds the result to
VoxelFrustumLifter.lift as occupied_centers_base — interchangeable with the octomap path. The
octomap voxel grid remains preferred when fresh (filtered, persistent). Verified live on
robocasa_baguette with octomap off: world_state_slow.detected_object_labels populated and
recall_object returns matches. New public symbol: openral_world_state.depth_cloud_to_centers_base.
SLAM stays on-by-default for any robot that can run it (capabilities.has_lidar) so the map frame
the lift projects through is present (deploy-sim CLI, firm default).