ADR-0049: Dedicated HAL publisher thread + proprio snapshot (odom-starvation fix)
- Status: Proposed
- Date: 2026-06-11
- Related: ADR-0034 (deploy-sim scene-attach + the idle stepper; this ADR keeps its single-thread contention rule and extends it); ADR-0029 (unified HAL lifecycle node); ADR-0027 (TF + state assembly rates); ADR-0048 (deploy-sim clock domain — the sibling nav-leg fix that exposed this).
Context
openral deploy sim for panda_mobile (RoboCasa kitchen) runs the manifest-driven HAL
lifecycle node, which spins single-threaded (rclpy.spin(node) in
make_lifecycle_main_from_manifest). That one thread services every HAL timer/subscription:
| Callback | Rate | Cost | Touches MjData? |
|---|---|---|---|
_idle_step_tick → idle_step → env.step(+render) |
10 Hz | ~15–80 ms | writes |
_on_safe_action → send_action → env.step(+render) |
≤20 Hz | ~15–80 ms | writes |
_publish_scan (2-D raycast synthesize_laser_scan_2d) |
10 Hz | ~10–30 ms | reads |
_publish_images / _publish_depth (serve cached obs) |
10 Hz | small | reads cache |
_publish_joint_state (hal.read_state) |
30 Hz | small | reads |
_publish_odom (hal.base_pose_6dof) |
20 Hz | small | reads |
Because everything serializes on one thread, the expensive env.step+render and scan raycast
saturate it, and the cheap, control-critical publishers starve. Measured live on an
RTX 4070: odom→base_link TF was published at ~1.8 Hz (configured 20 Hz; slam's map→odom
by contrast was a clean 20 Hz). At a ~0.55 s odom interval > Nav2's costmap
transform_tolerance (0.3 s), the costmap dropped fresh /scan messages ("extrapolation into
the future") and Nav2's progress checker could not confirm base motion → "Failed to make
progress". (This was masked by the ADR-0048 frozen-clock bug, which broke nav entirely; fixing
the clock exposed it.)
Why not a MultiThreadedExecutor (tried; rejected, with live evidence)
The obvious fix — a MultiThreadedExecutor with the env work in one
MutuallyExclusiveCallbackGroup and the publishers in another — does not work for the
in-process MuJoCo HAL, for two reasons:
- MuJoCo's EGL/GL context is thread-affine. A
MutuallyExclusiveCallbackGroupprevents concurrency but not thread-hopping: the executor's worker pool dispatches successive sim-group callbacks on whatever thread is free. Soenv.step's render runs on a different thread than the one that created the GL context, and crashes. Observed live (this exact wiring,deploy simon an RTX 4070):idle stepper disabled after error: SimAttachedHAL.idle_step: env.step failed: EGLError. The MuJoCo viewer is thread-affine too. MjDatais not thread-safe. Even setting rendering aside, any callback readingMjData(read_state,base_pose, the raycast) concurrent with anenv.stepwrite is a data race. ADR-0034's Contention rule ("exactly one writer toenv.stepat a time") assumed a single thread.
Conclusion: keep all MuJoCo/GL work on one thread. The decoupling must come from moving the publishing off that thread — not from multi-threading the executor.
Decision
Keep the HAL node single-threaded (all env.step / render / raycast on one thread, so
MuJoCo's GL context never moves) and move only the publishing of odom / joint_state / TF onto
a dedicated publisher thread that reads a plain-data proprio snapshot.
1. Single executor thread + a dedicated publisher thread
- The node spins single-threaded (
rclpy.spin). All env /MjData/ GL callbacks (_on_safe_action→send_action,_idle_step_tick→idle_step, the scan raycast, the camera / depth publishers serving_last_obs, the viewer sync) keep running there, mutually exclusive — ADR-0034's "exactly oneenv.stepat a time" is preserved unchanged, and the GL context is only ever used on this thread. - A
threading.Threadstarted at activate (sim-attached HALs only) loops at the joint-state rate (~30 Hz) and publishesjoint_state+odom+odom→base_linkTF from the snapshot. It touches only the snapshot (thread-safe) and rclpy publishers (publish()/sendTransform()are thread-safe) — neverMjData/GL — so it runs truly concurrently with the executor thread's heavy sim work. Stopped (signalled + joined) at deactivate, before the publishers it writes to are torn down.
For sim-attached HALs the legacy joint_state / odom timers are not created (the thread owns
publishing). Real HALs keep the timers and read encoders directly (_proprio is None).
2. Proprio snapshot seam (so the publisher thread never touches MjData)
A small ProprioSnapshot holder (one threading.Lock + an immutable ProprioFrame: the last
JointState, planar + 6-DoF base pose, base twist) owned by the HAL node:
- Captured only on the executor thread — immediately after each
env.step(_capture_proprio()in_send_action_tracedafterhal.send_action, and via theSimSensorBridgeon_stephook afterhal.idle_step), plus once at activate. The capture callshal.read_state()+hal.base_pose_6dof()(safeMjDataread — same thread as the step) and stores plain data; the lock guards only the reference swap. - Read by the publisher thread —
_publish_joint_state/MobileBaseBridge._publish_odomreadsnapshot.latest()(a sub-microsecond locked read), then publish with a fresh wall-clock (or ADR-0048 sim-time) stamp. They never call intoMjData.
The snapshot is captured at the step rate (≥10 Hz: idle stepper idle, ≤20 Hz under active nav)
and re-emitted by the thread at ~30 Hz, giving Nav2 a steady high-rate odom→base_link stream
(TF interpolates between captures). Verified live: odom rose from ~1.8 Hz → ~28 Hz, the
costmap stopped dropping scans, and navigate_to_pose no longer aborts on the progress checker.
3. Scope
HAL layer only — make_lifecycle_main_from_manifest (stays rclpy.spin), the lifecycle base
(snapshot + capture + publisher thread + conditional joint_state timer), MobileBaseBridge
(publish_from_snapshot + conditional odom timer), SimSensorBridge (the on_step capture
hook). No change to sim_attached.py's sim semantics (so it composes cleanly with ADR-0048
Phase 1, which also edits that file), the safety kernel, or the real-hardware HAL (no
idle_step → _proprio is None → unchanged timer-based publishing). schema_version stays
"0.1".
Consequences
Positive
- odom→base_link (and joint_state TF) publish at ~30 Hz regardless of per-step sim cost →
Nav2 costmap stops dropping scans, progress checker works, the base navigates. Verified live
(~1.8 Hz → ~28 Hz).
- ADR-0034's single-writer / single-thread guarantee is preserved and MuJoCo's GL context
stays on one thread (no EGL breakage — the trap a multi-threaded executor falls into).
- Lower control-loop jitter generally (sensor/render work no longer head-of-line-blocks odom).
Negative / risk
- Correctness rests on the invariant "only the executor thread touches MjData/GL; the
publisher thread touches only the snapshot + publishers." A future publisher-thread change
that reads MjData would reintroduce a race — guarded by comments and the snapshot's
HAL-agnostic design (it physically cannot reach the HAL).
- The snapshot adds one copy per step + a tiny lock; negligible vs env.step.
- A raw thread needs clean lifecycle handling (start at activate, signal+join at deactivate,
daemon so a crash can't wedge shutdown) — implemented.
Alternatives considered
MultiThreadedExecutor+ callback groups (the original plan). Rejected — implemented and tested live, it crashesenv.stepwithEGLErrorbecause MuJoCo's GL context is thread-affine and aMutuallyExclusiveCallbackGroupdoes not pin a thread (see §Why not a MultiThreadedExecutor).- Single lock around all
MjDataaccess, one executor. The publishers would still block for the fullenv.step(~15–80 ms) whenever they race a step — only partial relief. The snapshot + separate thread removes the block entirely. - Publish odom/joint_state from inside the step callback. Couples publish rate to step rate (10 Hz idle) — too slow for Nav2's 20 Hz loop, and bursts during active nav. Rejected.
- Raise Nav2
transform_toleranceto ~1.0 s (config-only band-aid). Lets the costmap tolerate the laggy odom, but uses stale transforms for a moving base (mis-placed obstacles) and doesn't fix the progress checker, which needs real odom updates. Interim mitigation, not the root-cause fix. (Documented so the band-aid isn't mistaken for the solution.) - Reduce camera/render load (fewer cameras, lower rate). Treats the symptom; the cameras are a product requirement (perception bus). Rejected.
Tests
- Unit:
ProprioSnapshotunder contention (a writer thread swapping frames while readers spin) never yields a torn frame; the holder is HAL-agnostic so a reader cannot touchMjData. RealJointStatedata, no mocks. (test_proprio_snapshot.py, incl. a 20k-iteration race.) - Live (deploy-sim, reference host — done): under the full robocasa+nav+slam+octomap load,
odom→base_link≈ 28 Hz (was 1.8 Hz), zeroEGLError, zero costmap scan-drops, andnavigate_to_poseactuates the base without the odom-starvation "Failed to make progress" abort. Node-level wiring can't be unit-tested off a real ROS graph (no rclpy/GL under the test venv), so the live run is the verification of record.