FineSightBench reveals VLMs perceive patterns down to 12px but show persistent failures in fine-scale reasoning such as numeracy and sequencing.
Patch n’ pack: Navit, a vision trans- former for any aspect ratio and resolution
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 5representative citing papers
MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.
A unified Sparse Vision Transformer learns joint 2D/3D medical image representations via self-supervision and achieves competitive AUROC on chest X-ray and CT benchmarks with 5x less data than modality-specific models.
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
citing papers explorer
-
The Last Visible Pixel: Probing Fine-Scale Perception in Vision-Language Models
FineSightBench reveals VLMs perceive patterns down to 12px but show persistent failures in fine-scale reasoning such as numeracy and sequencing.
-
MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production
MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.
-
MultiMedVision: Multi-Modal Medical Vision Framework
A unified Sparse Vision Transformer learns joint 2D/3D medical image representations via self-supervision and achieves competitive AUROC on chest X-ray and CT benchmarks with 5x less data than modality-specific models.
-
Kimi K2.5: Visual Agentic Intelligence
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
-
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.