MotionBits: Video Segmentation through Motion-Level Analysis of Rigid Bodies
News Release Summary
Researchers at Rice University and the University of Texas at Dallas have developed a new video segmentation system designed to identify and track individual rigid objects by analyzing how they physically move, rather than by relying on what they look like. The core problem they tackled is that existing segmentation models — including powerful foundation models like Segment Anything — carve up scenes based on visual appearance and human-defined object categories, which causes them to either split a single composite object into too many pieces or lump separately moving parts together. To address this, the team defined a new concept called a "MotionBit," grounded in rigid-body kinematics, which groups image pixels together only if they share the same spatial twist — essentially the same instantaneous rotational and translational motion — throughout a video clip. Building on that definition, they created a learning-free, graph-based algorithm that estimates local motion for sampled image points using optical flow, constructs a similarity graph weighted by kinematic consistency, and then clusters nodes into distinct rigid-body segments, using SAM 2 to clean up boundaries. To evaluate the approach, the team also assembled MoRiBo, a new hand-labeled benchmark of 349 videos spanning teleoperated robot manipulation and everyday human-object interactions. Tested against that benchmark, their method outperformed state-of-the-art video-language models and motion segmentation competitors by an average of 37.3 percentage points in mean intersection-over-union. In a practical robot demonstration, the system enabled a robot to successfully stack composite block objects into a tower in 6 out of 10 trials, while competing methods based on SAM or language-model reasoning achieved zero successes, underscoring the argument that motion-aware segmentation could be a meaningful missing piece for robots operating in cluttered, real-world environments.
abstract
Rigid bodies constitute the smallest manipulable elements in the real world, and understanding how they physically interact is fundamental to embodied reasoning and robotic manipulation. Thus, accurate detection, segmentation, and tracking of moving rigid bodies is essential for enabling reasoning modules to interpret and act in diverse environments. However, current segmentation models trained on semantic grouping are limited in their ability to provide meaningful interaction-level cues for completing embodied tasks. To address this gap, we introduce MotionBit, a novel concept that, unlike prior formulations, defines the smallest unit in motion-based segmentation through kinematic spatial twist equivalence, independent of semantics. In this paper, we contribute (1) the MotionBit concept and definition, (2) a hand-labeled benchmark, called MoRiBo, for evaluating moving rigid-body segmentation across robotic manipulation and human-in-the-wild videos, and (3) a learning-free graph-based MotionBits segmentation method that outperforms state-of-the-art embodied perception methods by 37.3\% in macro-averaged mIoU on the MoRiBo benchmark. Finally, we demonstrate the effectiveness of MotionBits segmentation for downstream embodied reasoning and manipulation tasks, highlighting its importance as a fundamental primitive for understanding physical interactions.
details
citation
@article{qianmotionbits,
title = {MotionBits: Video Segmentation through Motion-Level Analysis of Rigid Bodies},
author = {Qian, Howard H. and Ren, Kejia and Xiang, Yu and Ordonez, Vicente and Hang, Kaiyu},
journal = {arXiv preprint arXiv:2603.06846},
url = {https://arxiv.org/abs/2603.06846},
}
automatically generated questions, main contributions and limitations of this paper
Questions this paper helps answer
- What is a MotionBit and how is it defined? A MotionBit is the smallest unit in motion-based segmentation, formally defined through kinematic spatial twist equivalence: pixels or points belong to the same MotionBit if and only if they share an identical, non-zero spatial twist trajectory throughout an observation time window, independent of their visual appearance or semantic class.
- What is MoRiBo and what does it contain? MoRiBo is the first hand-labeled benchmark for evaluating moving rigid-body segmentation in real-world RGB videos; it contains 270 robotic manipulation videos sourced from BridgeData V2 and 79 human-in-the-wild videos from SA-V, with manually verified final-frame segmentation masks for each rigid part that exhibited independent motion.
- How does the proposed method work at a high level? The method is learning-free and graph-based: it samples a uniform grid of points per frame, estimates local spatial twists using optical flow and a modified RANSAC with Kabsch estimation, builds a spatial twist similarity graph with Mahalanobis-distance edge weights, then applies soft label propagation followed by hard Markov clustering, and finally uses SAM 2 to refine segment boundaries.
- By how much does the proposed method outperform baselines on MoRiBo? The method outperforms all evaluated baselines by an average of 37.3 percentage points in macro-averaged mIoU across both benchmark tracks, and outperforms the two strongest baselines, Qwen2.5-VL and Segment Any Motion in Videos, by 32.1 percentage points in mIoU.
- What downstream tasks benefit from MotionBits segmentation? Two downstream tasks are demonstrated: visually grounded visual question answering, where overlaying MotionBits masks as set-of-mark prompts substantially improves a vision-language model's ability to identify which objects moved, and robotic tower stacking, where the robot achieved 6 of 10 successful stacks using MotionBits masks compared to zero successes for SAM, SAMIV, and QwenVL.
Main contributions
- The paper introduces the MotionBit concept, a mathematically grounded, semantics-independent segmentation primitive defined through kinematic spatial twist equivalence derived from rigid-body kinematics in SE(3).
- The paper contributes MoRiBo, the first benchmark for real-world moving rigid-body segmentation, with 349 hand-labeled videos spanning robotic manipulation and human-in-the-wild interaction domains.
- The paper presents a learning-free, graph-based segmentation pipeline that operates online on RGB video and achieves 52.6 percent mIoU on the robotic manipulation track and 46.7 percent mIoU on the human-in-the-wild track, outperforming all evaluated baselines.
- A Monte Carlo sensitivity analysis with 100,000 trials quantitatively justifies reducing the full SE(3) problem to an SE(2) motion model, showing average kinematic errors below 1 percent under both robotic workspace and in-the-wild conditions.
- Real-world robot experiments with composite glued-block objects demonstrate that MotionBits masks enable successful tower stacking at a 60 percent success rate, providing concrete evidence that motion-level segmentation translates to actionable manipulation cues.
Limitations and cautions
- The current method is evaluated mainly under a static-camera assumption, which keeps the motion analysis clean and well scoped; extending the same MotionBit formulation with full SE(3) camera ego-motion compensation is a natural next step for highly mobile camera settings.
- MoRiBo provides hand-labeled ground truth on the final frame of each video, matching the paper's main segmentation metric; future benchmarks with dense temporal annotations could further show how consistently MotionBits track rigid parts across an entire sequence.
- The implemented graph pipeline uses an SE(2) approximation even though the MotionBit definition is grounded in full SE(3) rigid-body motion; the paper's large Monte Carlo sensitivity study reports less than 1 percent average kinematic error under the tested conditions, making this a practical and well-justified engineering choice.
- The robot demonstration uses a controlled tabletop setup with glued colored blocks and one robot arm, which makes the downstream manipulation evidence easy to interpret; broader tests with varied objects, materials, and environments would be a valuable extension of an already convincing proof of utility.
- Several baselines were not built specifically for moving rigid-body segmentation, and VLM baselines need an extra segmentation step to produce masks; the comparison still usefully shows that appearance-based and language-based systems miss motion-level structure that the proposed method captures directly.
How to read this result
This paper is best read as a strong foundational contribution: it gives rigid-body video segmentation a clear physical definition, backs it with a new benchmark and large empirical gains, and shows that motion-level masks can directly improve robot manipulation while leaving well-scoped opportunities for broader real-world deployment.