Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park · 2024 · cs.CV · arXiv 2409.17146

Baseline reference. 53% of citing Pith papers (classified citations) use this work as a benchmark or comparison; 45 Pith papers cite it.

abstract

Today's most advanced vision-language models (VLMs) remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed VLMs into open ones. As a result, the community has been missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key contribution is a collection of new datasets called PixMo, including a dataset of highly detailed image captions for pre-training, a free-form image Q&A dataset for fine-tuning, and an innovative 2D pointing dataset, all collected without the use of external VLMs. The success of our approach relies on careful modeling choices, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets. Our best-in-class 72B model not only outperforms others in the class of open weight and data models, but also outperforms larger proprietary models including Claude 3.5 Sonnet, and Gemini 1.5 Pro and Flash, second only to GPT-4o based on both academic benchmarks and on a large human evaluation. Our model weights, new datasets, and source code are available at https://molmo.allenai.org/blog.

citation-role summary

background: 5 · baseline: 5 · dataset: 3 · method: 2

citation-polarity summary

background: 5 · baseline: 5 · use dataset: 3 · use method: 2

representative citing papers

Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension

cs.CV · 2026-02-10 · unverdicted · novelty 7.0

Visual Para-Thinker is the first parallel reasoning framework for MLLMs that uses visual partitioning strategies, Pa-Attention, and LPRoPE to extend test-time scaling benefits to visual comprehension tasks.

High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

cs.CV · 2025-07-08 · conditional · novelty 7.0

MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.

FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

cs.CV · 2025-04-14 · unverdicted · novelty 7.0

FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.

GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance

cs.CV · 2025-03-17 · unverdicted · novelty 7.0

GuideDog supplies 22K egocentric image-description pairs from 46 countries and an 818-sample QA benchmark showing that current multimodal models still struggle with depth perception and BLV-specific guidance rules.

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

cs.CV · 2024-12-31 · accept · novelty 7.0

OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.

Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement

cs.CV · 2024-11-22 · unverdicted · novelty 7.0

VideoRepair detects text-video misalignments via MLLM-generated questions and performs localized, region-preserving refinement to improve alignment in existing T2V diffusion models.

20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

cs.LG · 2026-05-12 · conditional · novelty 6.0 · 2 refs

Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.

PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

cs.AI · 2026-04-30 · unverdicted · novelty 6.0

PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.

Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

Entropy-gradient grounding uses model uncertainty to retrieve evidence regions in VLMs, improving performance on detail-critical and compositional tasks across multiple architectures.

ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

ABMamba uses Mamba-based linear-complexity processing plus a novel Aligned Hierarchical Bidirectional Scan to deliver competitive video captioning on VATEX and MSR-VTT at roughly 3x higher throughput than typical Transformer MLLMs.

A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model

cs.RO · 2026-04-07 · unverdicted · novelty 6.0

A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.

Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control

cs.RO · 2026-02-13 · unverdicted · novelty 6.0

Steerable VLAs trained on rich synthetic commands at subtask, motion, and pixel levels enable VLMs to steer robot behavior more effectively, outperforming prior hierarchical baselines on real-world manipulation and generalization tasks.

BOP-ASK: Object-Interaction Reasoning for Vision-Language Models

cs.CV · 2025-11-20 · unverdicted · novelty 6.0

BOP-ASK supplies 150k images and 33M QA pairs across six tasks to improve VLMs on precise 3D object interaction reasoning and spatial planning.

VisCoder2: Building Multi-Language Visualization Coding Agents

cs.SE · 2025-10-24 · unverdicted · novelty 6.0

VisCoder2 is a family of multi-language visualization coding models trained on the VisCode-Multi-679K dataset that reaches 82.4% execution pass rate at 32B scale and approaches GPT-4.1 performance with iterative self-debug.

InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

cs.RO · 2025-10-15 · unverdicted · novelty 6.0

InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

cs.CV · 2025-08-25 · unverdicted · novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.

Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation

cs.RO · 2025-08-19 · conditional · novelty 6.0

Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks without task-specific training.

When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models

cs.CV · 2025-07-18 · unverdicted · novelty 6.0

The work identifies a small set of attention heads in VLMs that mediate conflicts between parametric knowledge and visual input, and shows that intervening on them steers model behavior while attention patterns provide precise image-region attribution.

Grounded Reinforcement Learning for Visual Reasoning

cs.CV · 2025-05-29 · unverdicted · novelty 6.0

ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

cs.LG · 2025-04-22 · unverdicted · novelty 6.0

π_{0.5} is a VLA model that achieves long-horizon dexterous manipulation in entirely new homes through co-training on heterogeneous tasks and multi-source data including web and semantic predictions.

Perception Encoder: The best visual embeddings are not at the output of the network

cs.CV · 2025-04-17 · unverdicted · novelty 6.0

Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, and dense prediction after simple alignment.

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

cs.CV · 2025-04-14 · conditional · novelty 6.0

InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

SmolVLM: Redefining small and efficient multimodal models

cs.AI · 2025-04-07 · unverdicted · novelty 6.0

SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.

Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts

cs.LG · 2025-03-07 · conditional · novelty 6.0

Capacity-aware dropping techniques mitigate load imbalance in MoE inference, delivering up to 1.85x speedup with 0.2% or less performance change on models including Mixtral-8x7B.

citing papers explorer

Showing 45 of 45 citing papers.

Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension cs.CV · 2026-02-10 · unverdicted · none · ref 2 · internal anchor
Visual Para-Thinker is the first parallel reasoning framework for MLLMs that uses visual partitioning strategies, Pa-Attention, and LPRoPE to extend test-time scaling benefits to visual comprehension tasks.
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning cs.CV · 2025-07-08 · conditional · none · ref 7 · internal anchor
MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding cs.CV · 2025-04-14 · unverdicted · none · ref 14 · internal anchor
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.
GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance cs.CV · 2025-03-17 · unverdicted · none · ref 8 · internal anchor
GuideDog supplies 22K egocentric image-description pairs from 46 countries and an 818-sample QA benchmark showing that current multimodal models still struggle with depth perception and BLV-specific guidance rules.
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning cs.CV · 2024-12-31 · accept · none · ref 52 · internal anchor
OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement cs.CV · 2024-11-22 · unverdicted · none · ref 8 · internal anchor
VideoRepair detects text-video misalignments via MLLM-generated questions and performs localized, region-preserving refinement to improve alignment in existing T2V diffusion models.
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone cs.LG · 2026-05-12 · conditional · none · ref 12 · 2 links · internal anchor
Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.
PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations cs.AI · 2026-04-30 · unverdicted · none · ref 11 · internal anchor
PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models cs.CV · 2026-04-09 · unverdicted · none · ref 6 · internal anchor
Entropy-gradient grounding uses model uncertainty to retrieve evidence regions in VLMs, improving performance on detail-critical and compositional tasks across multiple architectures.
ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning cs.CV · 2026-04-09 · unverdicted · none · ref 14 · internal anchor
ABMamba uses Mamba-based linear-complexity processing plus a novel Aligned Hierarchical Bidirectional Scan to deliver competitive video captioning on VATEX and MSR-VTT at roughly 3x higher throughput than typical Transformer MLLMs.
A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model cs.RO · 2026-04-07 · unverdicted · none · ref 8 · internal anchor
A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.
Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control cs.RO · 2026-02-13 · unverdicted · none · ref 49 · internal anchor
Steerable VLAs trained on rich synthetic commands at subtask, motion, and pixel levels enable VLMs to steer robot behavior more effectively, outperforming prior hierarchical baselines on real-world manipulation and generalization tasks.
BOP-ASK: Object-Interaction Reasoning for Vision-Language Models cs.CV · 2025-11-20 · unverdicted · none · ref 9 · internal anchor
BOP-ASK supplies 150k images and 33M QA pairs across six tasks to improve VLMs on precise 3D object interaction reasoning and spatial planning.
VisCoder2: Building Multi-Language Visualization Coding Agents cs.SE · 2025-10-24 · unverdicted · none · ref 2 · internal anchor
VisCoder2 is a family of multi-language visualization coding models trained on the VisCode-Multi-679K dataset that reaches 82.4% execution pass rate at 32B scale and approaches GPT-4.1 performance with iterative self-debug.
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy cs.RO · 2025-10-15 · unverdicted · none · ref 11 · internal anchor
InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency cs.CV · 2025-08-25 · unverdicted · none · ref 30 · internal anchor
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation cs.RO · 2025-08-19 · conditional · none · ref 7 · internal anchor
Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks without task-specific training.
When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models cs.CV · 2025-07-18 · unverdicted · none · ref 6 · internal anchor
The work identifies a small set of attention heads in VLMs that mediate conflicts between parametric knowledge and visual input, and shows that intervening on them steers model behavior while attention patterns provide precise image-region attribution.
Grounded Reinforcement Learning for Visual Reasoning cs.CV · 2025-05-29 · unverdicted · none · ref 10 · internal anchor
ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization cs.LG · 2025-04-22 · unverdicted · none · ref 19 · internal anchor
π_{0.5} is a VLA model that achieves long-horizon dexterous manipulation in entirely new homes through co-training on heterogeneous tasks and multi-source data including web and semantic predictions.
Perception Encoder: The best visual embeddings are not at the output of the network cs.CV · 2025-04-17 · unverdicted · none · ref 25 · internal anchor
Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, and dense prediction after simple alignment.
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models cs.CV · 2025-04-14 · conditional · none · ref 31 · internal anchor
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
SmolVLM: Redefining small and efficient multimodal models cs.AI · 2025-04-07 · unverdicted · none · ref 10 · internal anchor
SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.
Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts cs.LG · 2025-03-07 · conditional · none · ref 7 · internal anchor
Capacity-aware dropping techniques mitigate load imbalance in MoE inference, delivering up to 1.85x speedup with 0.2% or less performance change on models including Mixtral-8x7B.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling cs.CV · 2024-12-06 · unverdicted · none · ref 54 · internal anchor
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction cs.CL · 2024-12-05 · conditional · none · ref 75 · internal anchor
Aguvis presents a pure vision-based framework for autonomous GUI agents using structured reasoning via inner monologue, a new multimodal dataset, and two-stage training to reach SOTA on offline and online benchmarks.
Pixtral 12B cs.CV · 2024-10-09 · unverdicted · none · ref 4 · internal anchor
Pixtral-12B is a 12B multimodal LLM with a custom vision encoder that ingests images at native resolution and aspect ratio, achieving leading benchmark results among open models while preserving text capabilities.
VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning cs.CV · 2026-05-27 · unverdicted · none · ref 12 · internal anchor
VCap pairs reference captions as witnesses with visual signals as adjudicators to deliver hypergeometric-precision rewards for RL in visual captioning, enabling an 8B model to outperform SOTA on benchmarks and improve weak-to-strong generalization.
Beyond Waypoints: Dual-Heatmap Grounding for Cross-Embodiment Semantic Navigation cs.RO · 2026-05-19 · unverdicted · none · ref 9 · internal anchor
A vision-language model outputs dual heatmaps for navigation affordance and facing to ground semantic instructions into executable free space, achieving higher affordance rates than waypoint regression across simulated robot embodiments.
Visibility-Aware Mobile Grasping in Dynamic Environments cs.RO · 2026-05-04 · unverdicted · none · ref 57 · 2 links · internal anchor
A visibility-aware mobile grasping system with iterative whole-body planning and behavior-tree subgoal generation achieves 68.8% success in unknown static and 58% in dynamic environments, outperforming a baseline by 22.8% and 18%.
UniMesh: Unifying 3D Mesh Understanding and Generation cs.CV · 2026-04-19 · unverdicted · none · ref 6 · internal anchor
UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.
Anthropogenic Regional Adaptation in Multimodal Vision-Language Model cs.AI · 2026-04-13 · unverdicted · none · ref 21 · internal anchor
Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective cs.RO · 2025-07-02 · unverdicted · none · ref 182 · internal anchor
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
Common Inpainted Objects In-N-Out of Context cs.CV · 2025-05-31 · unverdicted · none · ref 17 · internal anchor
COinCO is a new dataset of inpainted COCO images with in- and out-of-context objects, enabling context reasoning, object prediction from scenes, and improved fake image detection.
Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs cs.CV · 2025-03-04 · unverdicted · none · ref 11 · internal anchor
Modality-mutual attention (MMA) is introduced to replace causal attention in MLLMs, enabling mutual attention between image and text tokens and claiming SOTA results on 12 multimodal benchmarks with no extra parameters.
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs cs.CL · 2025-03-03 · unverdicted · none · ref 14 · internal anchor
Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding cs.CV · 2024-12-13 · accept · none · ref 23 · internal anchor
DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B activated parameters.
NVILA: Efficient Frontier Visual Language Models cs.CV · 2024-12-05 · unverdicted · none · ref 30 · internal anchor
NVILA improves on VILA with a scale-then-compress visual token strategy and full-lifecycle efficiency optimizations, matching or exceeding leading VLMs on image and video benchmarks while reducing training cost 1.9-5.1x and latencies 1.2-2.8x.
Mitigating Manifold Departure: Uncertainty-Aware Subspace Rectification for Trustworthy MLLM Decoding cs.LG · 2026-05-31 · unverdicted · none · ref 43 · internal anchor
MGAP constructs a language-prior subspace from blind hidden states via SVD and applies a consistency-aware gate to attenuate only the projected prior component in multimodal hidden states during decoding.
When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise cs.CV · 2026-05-06 · unverdicted · none · ref 8 · 2 links · internal anchor
Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt and preprocessing fixes providing only partial relief.
Seed1.5-VL Technical Report cs.CV · 2025-05-11 · unverdicted · none · ref 21 · internal anchor
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding cs.CV · 2025-01-22 · unverdicted · none · ref 35 · internal anchor
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
PaliGemma 2: A Family of Versatile VLMs for Transfer cs.CV · 2024-12-04 · unverdicted · none · ref 19 · internal anchor
PaliGemma 2 is a family of vision-language models that achieves state-of-the-art results on transfer tasks like table structure recognition and radiography report generation by combining SigLIP with Gemma 2 models at various sizes and resolutions.
TrajTok: Learning Trajectory Tokens enables better Video Understanding cs.CV · 2026-02-26 · unreviewed · ref 17 · internal anchor
FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks cs.AI · 2025-05-26 · unreviewed · ref 10 · internal anchor
