
[Ai2 said Molmo 2 improves on its earlier models despite its compact size. | Source: Ai2]
Deep video understanding is key to building models that can interpret and act on sensor streams for robotics. However, most models today either lack video understanding capabilities or are locked behind proprietary systems that offer no transparency into their training data. Ai2 said it is giving researchers access to advanced video grounding, tracking, and multi-frame reasoning, all with open weights and data.
Molmo 2 can identify exactly where and when events occur, track multiple objects through complex scenes, and connect actions to frame-level timelines. The company said these capabilities support safer automation, more accurate real-world systems, and open research the global community can inspect, reproduce, and build upon.
Ai2 listed key capabilities:
Frame-level spatial and temporal grounding: Molmo 2 goes beyond description. It returns precise pixel coordinates, object positions, and timestamps for events across a video.
Robust multi-object tracking and counting: The model maintains consistent object identities across occlusions, scene changes, and long clips, enabling applications in robotics, inspection, transportation, and industrial automation.
Dense long-form video captioning and anomaly detection: Molmo 2 produces highly detailed, searchable descriptions and flags unusual events in long sequences.
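To make the grounding and tracking capabilities above concrete, a downstream consumer of frame-level output might represent each observation as a record pairing a stable object identity with pixel coordinates and a timestamp. The sketch below is a hypothetical data structure, not Ai2's actual API; the `Detection` class and all field names are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One grounded observation: which object, where in the frame, and when.

    Hypothetical schema illustrating frame-level spatial and temporal
    grounding; not Ai2's actual output format.
    """
    track_id: int  # stable identity across frames (survives occlusions)
    label: str     # object category, e.g. "forklift"
    cx: float      # pixel x-coordinate of the object's center
    cy: float      # pixel y-coordinate of the object's center
    t: float       # timestamp in seconds from the start of the clip

def count_unique_objects(detections: list[Detection]) -> int:
    """Count distinct tracked objects, not raw per-frame detections."""
    return len({d.track_id for d in detections})

# Two sightings of track 1 (before and after an occlusion) plus one
# sighting of track 2 count as two objects, not three detections.
dets = [
    Detection(1, "forklift", 320.0, 240.0, 0.5),
    Detection(2, "worker", 100.0, 200.0, 0.5),
    Detection(1, "forklift", 400.0, 250.0, 3.2),
]
print(count_unique_objects(dets))  # → 2
```

Keying counts on a persistent `track_id` rather than raw detections is what lets identity survive occlusions and scene changes, which is the distinction the tracking capability above describes.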

