Patch n’ pack: Navit, a vision trans- former for any aspect ratio and resolution

Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim Alabdulmohsin, Avital Oliver, Piotr Padlewski, Alexey Gritsenko, Mario Luˇci´c, and · 2023 · arXiv 2307.06304

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

read on arXiv browse 5 citing papers

citation-role summary

background 2 dataset 1

citation-polarity summary

background 2 use dataset 1

representative citing papers

The Last Visible Pixel: Probing Fine-Scale Perception in Vision-Language Models

cs.CV · 2026-06-05 · unverdicted · novelty 6.0

FineSightBench reveals VLMs perceive patterns down to 12px but show persistent failures in fine-scale reasoning such as numeracy and sequencing.

MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production

cs.DC · 2026-05-09 · unverdicted · novelty 6.0

MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.

MultiMedVision: Multi-Modal Medical Vision Framework

cs.CV · 2026-05-09 · unverdicted · novelty 5.0

A unified Sparse Vision Transformer learns joint 2D/3D medical image representations via self-supervision and achieves competitive AUROC on chest X-ray and CT benchmarks with 5x less data than modality-specific models.

Kimi K2.5: Visual Agentic Intelligence

cs.CL · 2026-02-02 · unverdicted · novelty 5.0

Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

cs.CV · 2025-01-22 · unverdicted · novelty 4.0

VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

citing papers explorer

Showing 5 of 5 citing papers.

The Last Visible Pixel: Probing Fine-Scale Perception in Vision-Language Models cs.CV · 2026-06-05 · unverdicted · none · ref 12
FineSightBench reveals VLMs perceive patterns down to 12px but show persistent failures in fine-scale reasoning such as numeracy and sequencing.
MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production cs.DC · 2026-05-09 · unverdicted · none · ref 11
MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.
MultiMedVision: Multi-Modal Medical Vision Framework cs.CV · 2026-05-09 · unverdicted · none · ref 4
A unified Sparse Vision Transformer learns joint 2D/3D medical image representations via self-supervision and achieves competitive AUROC on chest X-ray and CT benchmarks with 5x less data than modality-specific models.
Kimi K2.5: Visual Agentic Intelligence cs.CL · 2026-02-02 · unverdicted · none · ref 16
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding cs.CV · 2025-01-22 · unverdicted · none · ref 50
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

Patch n’ pack: Navit, a vision trans- former for any aspect ratio and resolution

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer