Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution

Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao · 2024 · arXiv 2409.12961

Baseline reference: 62% of classified citations from Pith papers use this work as a benchmark or comparison.

21 Pith papers cite it.

citation-role summary

baseline: 5 · background: 3

citation-polarity summary

baseline: 5 · background: 3

representative citing papers

LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

cs.CV · 2026-02-24 · unverdicted · novelty 7.0

LongVideo-R1 trains a reasoning agent on 33K trajectories to intelligently select informative video clips via iterative refinement and RL, achieving better accuracy-efficiency tradeoffs on long video QA benchmarks.

High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

cs.CV · 2025-07-08 · conditional · novelty 7.0

MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.

ViQ: Text-Aligned Visual Quantized Representations at Any Resolution

cs.CV · 2026-06-25 · unverdicted · novelty 6.0

ViQ is a new two-stage text-aligned quantization method for visual features supporting arbitrary resolutions that claims competitive multimodal performance with efficiency gains of 20-70%.

Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

Flat-Pack Bench is a new evaluation suite that shows state-of-the-art LVLMs perform poorly on nuanced spatio-temporal reasoning required for furniture assembly videos.

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.

VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding

cs.CV · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.

Streaming Video Instruction Tuning

cs.CV · 2025-12-24 · unverdicted · novelty 6.0

Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.

Detector-Empowered Video Large Language Model for Efficient Spatio-Temporal Grounding

cs.CV · 2025-12-07 · conditional · novelty 6.0

DEViL offloads spatial grounding to a detector via a distilled reference-semantic token and temporal consistency regularization, reaching 43.1% m_vIoU at 14.33 FPS on HC-STVG.

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

cs.CV · 2025-08-25 · unverdicted · novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

cs.CV · 2025-05-29 · unverdicted · novelty 6.0 · 2 refs

Spatial-MLLM adds a 3D spatial encoder initialized from a visual geometry model and space-aware frame sampling to MLLMs to improve spatial understanding and reasoning from purely 2D visual inputs.

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

cs.CV · 2025-05-26 · unverdicted · novelty 6.0

VLM-3R augments VLMs with implicit 3D tokens from monocular video via geometry encoding and 200K+ 3D reconstructive QA pairs, plus a new 138K-pair temporal benchmark, to support spatial and embodied reasoning.

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

cs.CV · 2025-04-14 · conditional · novelty 6.0

InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

cs.CV · 2025-01-06 · unverdicted · novelty 6.0

MotionBench is a new benchmark showing poor fine-grained motion understanding in VLMs and proposes TE Fusion to improve performance with higher frame rates.

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

cs.CV · 2024-12-06 · unverdicted · novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

VISD: Enhancing Video Reasoning via Structured Self-Distillation

cs.CV · 2026-05-07 · unverdicted · novelty 5.0 · 4 refs

VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.

3D-IDE: 3D Implicit Depth Emergent

cs.CV · 2026-03-28 · unverdicted · novelty 5.0

3D awareness emerges implicitly in MLLMs via self-supervised geometric constraints that create an information bottleneck, removing depth and pose dependencies at inference and cutting latency by 55%.

LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding

cs.CV · 2025-01-09 · unverdicted · novelty 5.0

LLaVA-Octopus introduces instruction-driven adaptive fusion of multiple visual projectors in a multimodal LLM to improve video understanding performance.

NVILA: Efficient Frontier Visual Language Models

cs.CV · 2024-12-05 · unverdicted · novelty 5.0

NVILA improves on VILA with a scale-then-compress visual token strategy and full-lifecycle efficiency optimizations, matching or exceeding leading VLMs on image and video benchmarks while reducing training cost 1.9-5.1x and latencies 1.2-2.8x.

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

cs.CV · 2026-05-25 · unverdicted · novelty 4.0

LLaVA-OV-2 uses codec-stream tokenization and a shared 3D RoPE to improve video, spatial, and tracking performance over Qwen3-VL-8B, while introducing the JumpScore benchmark for fine-grained motion localization.

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

cs.CV · 2025-01-22 · unverdicted · novelty 4.0

VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

Toward Native Multimodal Modeling: A Roadmap

cs.CV · 2026-05-25 · unverdicted · novelty 3.0

A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.

citing papers explorer

Showing 21 of 21 citing papers.

LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding cs.CV · 2026-02-24 · unverdicted · none · ref 29
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning cs.CV · 2025-07-08 · conditional · none · ref 22
ViQ: Text-Aligned Visual Quantized Representations at Any Resolution cs.CV · 2026-06-25 · unverdicted · none · ref 7
Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly cs.CV · 2026-05-20 · unverdicted · none · ref 17
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding cs.CV · 2026-05-18 · unverdicted · none · ref 45
VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding cs.CV · 2026-05-07 · unverdicted · none · ref 38 · 2 links
Streaming Video Instruction Tuning cs.CV · 2025-12-24 · unverdicted · none · ref 16
Detector-Empowered Video Large Language Model for Efficient Spatio-Temporal Grounding cs.CV · 2025-12-07 · conditional · none · ref 46
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency cs.CV · 2025-08-25 · unverdicted · none · ref 74
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence cs.CV · 2025-05-29 · unverdicted · none · ref 51 · 2 links
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction cs.CV · 2025-05-26 · unverdicted · none · ref 46
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models cs.CV · 2025-04-14 · conditional · none · ref 78
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models cs.CV · 2025-01-06 · unverdicted · none · ref 29
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling cs.CV · 2024-12-06 · unverdicted · none · ref 160
VISD: Enhancing Video Reasoning via Structured Self-Distillation cs.CV · 2026-05-07 · unverdicted · none · ref 27 · 4 links
3D-IDE: 3D Implicit Depth Emergent cs.CV · 2026-03-28 · unverdicted · none · ref 31
LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding cs.CV · 2025-01-09 · unverdicted · none · ref 45
NVILA: Efficient Frontier Visual Language Models cs.CV · 2024-12-05 · unverdicted · none · ref 63
LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence cs.CV · 2026-05-25 · unverdicted · none · ref 23
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding cs.CV · 2025-01-22 · unverdicted · none · ref 19
Toward Native Multimodal Modeling: A Roadmap cs.CV · 2026-05-25 · unverdicted · none · ref 211
