pith. sign in

hub Baseline reference

Oryx mllm: On- demand spatial-temporal understanding at arbitrary resolution

Baseline reference. 62% of citing Pith papers use this work as a benchmark or comparison.

21 Pith papers citing it
Baseline 62% of classified citations

hub tools

citation-role summary

baseline 5 background 3

citation-polarity summary

fields

cs.CV 21

representative citing papers

Streaming Video Instruction Tuning

cs.CV · 2025-12-24 · unverdicted · novelty 6.0

Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.

VISD: Enhancing Video Reasoning via Structured Self-Distillation

cs.CV · 2026-05-07 · unverdicted · novelty 5.0 · 4 refs

VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.

3D-IDE: 3D Implicit Depth Emergent

cs.CV · 2026-03-28 · unverdicted · novelty 5.0

3D awareness emerges implicitly in MLLMs via self-supervised geometric constraints that create an information bottleneck, removing depth and pose dependencies at inference and cutting latency by 55%.

NVILA: Efficient Frontier Visual Language Models

cs.CV · 2024-12-05 · unverdicted · novelty 5.0

NVILA improves on VILA with a scale-then-compress visual token strategy and full-lifecycle efficiency optimizations, matching or exceeding leading VLMs on image and video benchmarks while reducing training cost 1.9-5.1x and latencies 1.2-2.8x.

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

cs.CV · 2026-05-25 · unverdicted · novelty 4.0

LLaVA-OV-2 uses codec-stream tokenization and a shared 3D RoPE to improve video, spatial, and tracking performance over Qwen3-VL-8B, while introducing the JumpScore benchmark for fine-grained motion localization.

Toward Native Multimodal Modeling: A Roadmap

cs.CV · 2026-05-25 · unverdicted · novelty 3.0

A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.

citing papers explorer

Showing 21 of 21 citing papers.