HaploVL: A single-transformer baseline for multi-modal understanding

LongVideoBench: A benchmark for long-context interleaved video-language understanding · 2024 · arXiv 2503.14694

2 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

From Pixels to Words -- Towards Native One-Vision Models at Scale

cs.CV · 2026-05-27 · unverdicted · novelty 6.0

NEO-ov is a native one-vision model that learns cross-frame and pixel-word correspondence end-to-end and narrows the gap to modular VLMs on multi-image, video, and spatial tasks.

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

cs.CV · 2026-05-12 · unverdicted · novelty 5.0

SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

citing papers explorer

Showing 2 of 2 citing papers.

From Pixels to Words -- Towards Native One-Vision Models at Scale · cs.CV · 2026-05-27 · unverdicted · none · ref 6
NEO-ov is a native one-vision model that learns cross-frame and pixel-word correspondence end-to-end and narrows the gap to modular VLMs on multi-image, video, and spatial tasks.
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture · cs.CV · 2026-05-12 · unverdicted · none · ref 151
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
