ADR-0011: NVMM → CUDA tensor handoff across the Sensors / Skill boundary
- Status: Proposed (2026-05-24)
- Deciders: Adrian Llopart, OpenRAL Contributors
- Supersedes / amends: extends ADR-0010 §SensorReader and the §6.1 layer-boundary discipline.
Context
ADR-0010 defined a SensorReader Protocol with three carry-modes on
SensorFrame: data (inline bytes), topic (ROS 2 ref), and handle
(opaque in-process integer — reserved for "NVMM pointer, DMA-BUF fd").
M8 / PR I implemented the GStreamerSensorReader backend. On Jetson
Orin / Spark / Thor the GStreamer pipeline produces frames in
video/x-raw(memory:NVMM) caps; the underlying buffer is an
NvBufSurface whose surfaceList[i].dataPtr is a CUDA device
pointer. Passing that pointer to a downstream Skill / TensorRT
engine is the zero-copy path the inference runner needs to hit its
latency budgets — at 30 Hz × 4 cameras × 1080p RGB, a bytes(...)
copy through Python is roughly 30 ms of wasted CPU per tick before
inference even starts.
The unresolved question ADR-0010 left open: how does a Skill consume
a handle-flavoured SensorFrame? That's a cross-layer concern —
the producer is Layer 1 (Sensors), the consumer is Layer 3 (Skill),
and the handle's lifetime is owned by GStreamer at Layer 1.
Decision
A handle-flavoured SensorFrame carries three pieces of
information that together let any CUDA-aware consumer read pixels
without an intermediate copy:
SensorFrame.handle: int— the rawdataPtrfromNvBufSurfaceParams[0]. Suitable as a kernel argument, atorch.cuda.Tensorview via__cuda_array_interface__, or input tocudaMemcpyAsync.SensorFrame.encoding == FrameEncoding.CUDA_NV12— tells the consumer the layout is semi-planar Y (full-res) + UV (half-res interleaved). NV12 is whatnvarguscamerasrc/nvv4l2decoder/nvvidconvnatively produce; we do not force a colour conversion on the NVMM side because the cost of that conversion outweighs the convenience of BGR for most VLA inputs (which prefer NV12 or YUV420 anyway).SensorFrame.metadata["nvbufsurface"]— a Pydantic-validated dump ofNvBufSurfaceHandle(gpu_ptr, width, height, pitch, color_format, size, batch_size). Lets consumers assert the pitch and pick the right stride.
Lifetime contract
The Sensors layer owns the GStreamer buffer ref. The
GStreamerSensorReader holds both the latest Gst.Buffer and
its Gst.MapInfo in lock-guarded slots for the lifetime of the
latest-frame slot. The pair is released (buffer.unmap(map_info))
when a newer frame replaces the slot or close() is called.
Consumers that need the data to persist past one tick must copy
it — either to system memory (cudaMemcpy to a host buffer) or to
their own CUDA allocation (cudaMallocAsync + cudaMemcpyAsync).
This matches GStreamer's normal buffer-recycling semantics: the upstream element is free to reuse the buffer slot as soon as the appsink stops referencing it.
Shared CUDA context
openral_runner.backends.gstreamer.cuda_context.get_shared_cuda_context()
returns a process-wide singleton context bound to
OPENRAL_CUDA_DEVICE_INDEX (default 0). Consumers that go through this
helper share one context with GStreamer + TRT + any future Skill GPU
code — necessary to avoid the
CUDA_ERROR_INVALID_CONTEXT failure mode when multiple PyCUDA users
each call cuda.init() independently.
Skill consumer pattern
Skills that opt into the NVMM path import the helper module and
construct a CUDA-aware view in their step():
import torch
from torch.utils import dlpack
def step(self, world_state):
frame = world_state.image_frames["wrist_csi"]
if frame.encoding is FrameEncoding.CUDA_NV12:
meta = frame.metadata["nvbufsurface"]
# Build a __cuda_array_interface__ dict directly from the handle.
# Concrete adapter helpers will land alongside the first NVMM
# skill (out of scope for this ADR).
...
elif frame.data is not None:
# CPU fallback; current behavior for skills that don't opt in yet.
...
Skills that have not opted in continue to receive CPU-bytes frames because the reader's NVMM branch only fires when the pipeline itself negotiates NVMM caps — i.e. the user wrote a Jetson YAML. The CPU path is unchanged.
Layer-boundary status
This decision crosses the Sensors → Skill boundary by introducing a
GPU-resident contract. Per CLAUDE.md §6.1 that requires an ADR; this
file is that ADR. The boundary is not widened — Skills still
consume a SensorFrame, the only change is that handle becomes a
load-bearing field for one of the four SensorReaderBackend values.
The HAL, Reasoning, WAM, Safety, and Observability layers see no
change: HardwareRunner's tick loop passes the SensorFrame through
to Skill.step verbatim.
Consequences
Positive
- Latency: per-camera frame delivery drops from ~5–10 ms (
bytes(...)copy + JSON-safe base64 serialization on CPU) to ~50 µs (NvBufSurface struct read) on Jetson AGX-class hardware. - DeepStream compatibility: our pipelines emit and consume NvBufSurface, so a user can prepend
nvinferto any pipeline string without our code changing. - Skill freedom: Skills that want a numpy array can ignore the handle and let the runner fall back to the CPU caps form by setting
enable_nvmm: falsein the YAML.
Negative
- The handle is non-serializable. Trace capture for the NVMM path can only persist a hash / footprint, not the pixels themselves. The reproducibility-from-trace property (CLAUDE.md §1.8) is preserved by storing a downsampled CPU copy in the trace alongside the handle. This is a future-PR concern; flagged here for visibility.
- Skills opting into the NVMM path are bound to PyCUDA / torch.cuda. We accept this; the alternative (forcing a CPU copy at the boundary) defeats the purpose.
Mitigations
openral_runner.backends.gstreamer.cuda_context.cuda_context_state()is exposed foropenral doctorso misconfigured devices surface during onboarding rather than during a hot rollout.NvBufSurfaceHandleis a frozen Pydantic model — consumers cannot mutate the handle in-place, which keeps the lifetime invariant trivially auditable.
Alternatives considered
- Copy NVMM → CPU at the reader boundary (
cudaMemcpyinside_on_new_sampleto system memory, then deliverdata=bytes). Loses the headline latency benefit; equivalent to running on the CPU path. Rejected. - Use DLPack tensors directly at the
SensorFramelevel instead of an opaquehandleint. Cleaner consumer API but pulls a hard dependency on a DLPack-capable framework (torch / cupy / numpy ≥ 1.22) intoopenral_core. Rejected for layering reasons; the helper adapter (future PR) can wrap the handle in a DLPack capsule on demand. - Use Holoscan tensors. See ADR-0010 Amendment 2026-05-12 — Holoscan's tensor map is cleaner but the SDK adds ~1.5 GB and would require a 50–70% rewrite of the NVMM handoff. Deferred via the
SensorReaderBackend.HOLOSCANenum reservation.
Provenance
The nvbufsurface.py ctypes binding and cuda_context.py shared
context manager derive from work originally authored by Adrian Llopart
(adrianllopart@gmail.com) — see the module docstrings for explicit
re-licensing. Struct layout follows NVIDIA's publicly distributed
nvbufsurface.h (L4T Multimedia API, JetPack-bundled).