HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
background 1
citation-polarity summary
fields
cs.CV 2
years
2026 2
verdicts
UNVERDICTED 2
roles
background 1
polarities
background 1
representative citing papers
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
citing papers explorer
- From Pixels to Words -- Towards Native One-Vision Models at Scale
NEO-ov is a native one-vision model that learns cross-frame and pixel-word correspondence end-to-end and narrows the gap to modular VLMs on multi-image, video, and spatial tasks.
- SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.