CiF is a large new civil infrastructure segmentation dataset that shows zero-shot foundation models and domain-supervised models plateau at roughly 25% mAP, establishing infrastructure inspection as an open challenge for current visual AI.
hub Mixed citations
Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond
Mixed citation behavior. Most common role is background (40%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolution remote sensing benchmarks.
MMVIAD is the first multi-view continuous video dataset for industrial anomaly detection with four supported tasks, and the VISTA model improves average benchmark scores from 45.0 to 57.5 on unseen data while surpassing GPT-5.4.
RobotEQ is a new benchmark dataset and evaluation suite showing that current embodied AI models fall short on active social-norm compliance, especially spatial grounding, though RAG with external knowledge helps.
WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.
MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
OCRBench provides the largest evaluation suite yet for OCR capabilities in large multimodal models, revealing gaps in multilingual, handwritten, and mathematical text handling.
RAC adds ranking-aware group loss and clean-corrupted pairwise loss to RL post-training to boost both accuracy and calibration in multimodal reasoning without extra annotations.
An 8B MLLM reaches state-of-the-art efficiency and performance under 30B by combining a unified 3D resampler, joint document-text training, and hybrid RL for reasoning modes.
citing papers explorer
-
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.