FineVision: Open Data Is All You Need

Amir Mahla; Andrés Marafioti; Aritra Roy Gosthipaty; Leandro von Werra; Luis Wiedmann; Orr Zohar; Rui Li; Thibaud Frere; Xiaohan Wang

arxiv: 2510.17269 · v2 · pith:IUHBZTYJnew · submitted 2025-10-20 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-21 21:11 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords vision-language models · open datasets · data curation · multimodal learning · dataset unification · VLM training · agentic tasks · data decontamination

The pith

A carefully cleaned open corpus of 24 million vision-language samples lets models outperform those trained on prior public mixtures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper assembles FineVision as a single collection of 24 million samples by pulling together more than 200 separate public sources. A semi-automated process handles the bulk work of ingesting and standardizing the data while human reviewers check that annotations are used correctly, formats are consistent, diversity is maintained, and safety issues are addressed. The same pipeline also cleans out duplicates and removes overlaps with common test benchmarks. When models are trained on this corpus they score higher than models trained on earlier open collections across a wide range of vision-language tests. The authors release both the data and the curation tools so others can build on the same approach.

Core claim

We introduce FineVision, a corpus of 24 million vision-language samples created by unifying more than 200 sources into 185 subsets. The unification relies on a semi-automated human-in-the-loop pipeline that performs bulk ingestion and schema mapping, followed by reviewer audits to confirm faithful annotation use, proper formatting, diversity, and safety. The workflow adds rigorous de-duplication within and across sources plus decontamination against 66 public benchmarks and extends to agentic and GUI tasks with a unified action space whose trajectories are validated for executability. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad,

What carries the argument

The semi-automated human-in-the-loop curation pipeline that ingests sources, maps schemas, audits outputs for annotation fidelity and safety, applies de-duplication and decontamination, and unifies agentic tasks into a shared action space.
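
To make the "map schemas, then audit" step concrete, here is a minimal sketch of what mapping one source into a shared record format and flagging problems for a reviewer might look like. Everything in it is an assumption for illustration: the `UnifiedSample` fields, the `map_vqa_record` helper, and the input keys `image`, `question`, and `answer` are hypothetical and are not taken from the paper or its released tools.

```python
from dataclasses import dataclass, field


@dataclass
class UnifiedSample:
    """Hypothetical unified record; field names are illustrative only."""
    source: str
    image_paths: list[str]
    turns: list[dict[str, str]] = field(default_factory=list)


def map_vqa_record(record: dict, source: str) -> UnifiedSample:
    """Map one record from a hypothetical VQA-style source into the unified schema."""
    return UnifiedSample(
        source=source,
        image_paths=[record["image"]],
        turns=[{
            "user": record["question"].strip(),
            "assistant": record["answer"].strip(),
        }],
    )


def audit(sample: UnifiedSample) -> list[str]:
    """Return the problems an automated check could surface for a human reviewer."""
    problems = []
    if not sample.image_paths:
        problems.append("no image")
    if not sample.turns:
        problems.append("no turns")
    for i, turn in enumerate(sample.turns):
        if not turn.get("user"):
            problems.append(f"turn {i}: empty user text")
        if not turn.get("assistant"):
            problems.append(f"turn {i}: empty assistant text")
    return problems
```

A check like `audit` only catches structural faults such as empty fields. Whether an annotation was consumed faithfully still needs the human review the paper describes.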

Load-bearing premise

Human reviewers catch and correct every significant problem in data mapping, formatting, diversity, safety, and annotation accuracy without missing issues or introducing new biases.

What would settle it

Training the same model architecture and schedule on FineVision and on a representative prior open mixture then evaluating both on the paper's broad test suite would show whether the reported performance advantage is present.

Original abstract

The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples - the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.
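
The abstract does not say how de-duplication and decontamination are implemented, so the following is only a sketch of the general idea under a simplifying assumption: two images count as the same if their bytes hash to the same SHA-256 digest. The function names and the `(sample_id, image_bytes)` input shape are hypothetical. Exact hashing misses resized or re-encoded copies, so a production pipeline would more plausibly rely on perceptual hashes or embedding similarity.

```python
import hashlib


def content_hash(image_bytes: bytes) -> str:
    """Exact-match fingerprint of an image (assumption: byte-identical means duplicate)."""
    return hashlib.sha256(image_bytes).hexdigest()


def dedup_and_decontaminate(samples, benchmark_images):
    """Drop benchmark overlaps first, then repeated training images.

    samples: iterable of (sample_id, image_bytes) pairs.
    benchmark_images: iterable of image bytes from held-out test sets.
    Returns (kept_ids, dropped), where dropped maps sample_id to a reason.
    """
    benchmark_hashes = {content_hash(b) for b in benchmark_images}
    seen = set()
    kept, dropped = [], {}
    for sample_id, image_bytes in samples:
        digest = content_hash(image_bytes)
        if digest in benchmark_hashes:
            dropped[sample_id] = "benchmark overlap"
        elif digest in seen:
            dropped[sample_id] = "duplicate"
        else:
            seen.add(digest)
            kept.append(sample_id)
    return kept, dropped
```

Recording a reason for every dropped sample keeps the filtering auditable, which matters when the removal counts are later used to argue that the corpus is clean.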

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FineVision delivers a large 24M-sample open VLM dataset through unified curation and decontamination, but the outperformance claims rest on spot-checks whose coverage and effectiveness are not quantified.

This paper assembles a 24-million-sample open corpus for vision-language models by merging more than 200 sources into 185 subsets, applying de-duplication and decontamination against 66 benchmarks, and running a semi-automated pipeline with human reviewers for quality checks. The authors claim that models trained on it beat those trained on existing open mixtures across evaluations, and they release the data plus tools.

Referee Report

1 major / 2 minor

Summary. The paper introduces FineVision, a curated open corpus of 24 million vision-language samples unified from over 200 sources into 185 subsets. A semi-automated human-in-the-loop pipeline handles bulk ingestion, schema mapping, reviewer audits and spot-checks for annotation fidelity, formatting, diversity and safety, followed by within- and cross-source de-duplication and decontamination against 66 public benchmarks. The work also unifies agentic/GUI tasks under a single action space with trajectory validation. Models trained on FineVision are reported to outperform those trained on existing open mixtures across a broad evaluation suite, attributing gains to scale, data hygiene and balanced automation with human oversight. The corpus and curation tools are released.

Significance. If the performance gains are robust and causally linked to the described curation rather than to confounding factors, FineVision would constitute a substantial public resource for data-centric VLM research. The release of both the 24M-sample corpus and the associated tooling directly supports reproducibility and follow-on work. The inclusion of agentic tasks with validated executable trajectories broadens the dataset's applicability beyond standard VQA and captioning.

major comments (1)

[Curation Pipeline] Curation Pipeline section: The manuscript states that reviewers audit mappings and spot-check outputs to verify faithful annotation consumption, formatting, diversity and safety, with targeted fixes applied. However, it reports neither the fraction of the 24 million samples (or of the 185 subsets) that received manual review, nor inter-reviewer agreement statistics, nor quantitative pre-/post-review error rates. Without these metrics it is difficult to determine whether residual contamination or bias remains at a level that could explain the observed performance differences rather than the claimed hygiene benefits.
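
As an illustration of the inter-reviewer agreement statistic this comment asks for, one common choice is Cohen's kappa, which corrects raw agreement between two reviewers for the agreement expected by chance. The sketch below is a generic implementation and not something the paper reports. The label values in the usage note are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two reviewers gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each reviewer's own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)
```

For example, with reviewer A giving `["ok", "ok", "bad", "bad"]` and reviewer B giving `["ok", "bad", "bad", "bad"]`, observed agreement is 0.75, chance agreement is 0.5, and kappa is 0.5.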

minor comments (2)

[Abstract] The abstract and methods should explicitly state whether the 185 subsets represent a complete unification of all 200+ sources or whether some sources contribute only partially after filtering.
[Results] Figure captions and table legends should include the exact number of evaluation benchmarks and the precise train/eval split sizes used for the reported outperformance comparisons.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address the single major comment point by point below and will update the manuscript accordingly to improve transparency on the curation process.

Referee: [Curation Pipeline] Curation Pipeline section: The manuscript states that reviewers audit mappings and spot-check outputs to verify faithful annotation consumption, formatting, diversity and safety, with targeted fixes applied. However, it reports neither the fraction of the 24 million samples (or of the 185 subsets) that received manual review, nor inter-reviewer agreement statistics, nor quantitative pre-/post-review error rates. Without these metrics it is difficult to determine whether residual contamination or bias remains at a level that could explain the observed performance differences rather than the claimed hygiene benefits.

Authors: We agree that additional quantitative details on the human review component would strengthen the description of the pipeline. The semi-automated design was chosen because exhaustive manual review of 24 million samples is infeasible; instead, every one of the 185 subsets received a full audit of its schema mapping, followed by spot-checks on a representative sample of instances from each subset (with issues triggering targeted re-ingestion and re-audit). In the revised manuscript we will report: (i) that all 185 subsets underwent mapping audits, (ii) the approximate fraction of samples per subset that received spot-checks, (iii) inter-reviewer agreement statistics for the subsets reviewed by multiple annotators, and (iv) available pre- and post-audit error rates extracted from our internal audit logs. These additions will help readers better assess the residual risk of contamination or bias.
revision: yes
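
A spot-check only bounds the error rate of a subset; it does not pin it down. As a hedged illustration of how the promised spot-check fractions and error counts could be turned into a statement about residual risk, the sketch below computes a Wilson score interval for the error rate, assuming the checked instances are a simple random sample of the subset. The function and its inputs are hypothetical and not part of the authors' tooling.

```python
import math


def wilson_interval(errors: int, checked: int, z: float = 1.96):
    """Wilson score interval for an error rate estimated from a spot-check.

    errors: number of checked instances found faulty.
    checked: number of instances inspected (assumed to be a random sample).
    z: normal quantile; 1.96 corresponds to roughly 95% confidence.
    """
    if checked <= 0 or not 0 <= errors <= checked:
        raise ValueError("need checked > 0 and 0 <= errors <= checked")
    p = errors / checked
    denom = 1 + z * z / checked
    centre = (p + z * z / (2 * checked)) / denom
    half = z * math.sqrt(p * (1 - p) / checked + z * z / (4 * checked * checked)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

Under these assumptions, finding zero errors in 100 checked instances still leaves an upper bound of about 3.7% on the subset's error rate, which is why the fraction of samples checked matters as much as the number of problems found.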

Circularity Check

0 steps flagged

No significant circularity; empirical dataset contribution with direct evaluations

The paper presents FineVision as a new curated dataset of 24M samples unified from >200 sources via a described semi-automated pipeline with human review, de-duplication, and decontamination. It reports that models trained on this corpus outperform those trained on prior open mixtures across evaluations. No equations, first-principles derivations, or statistical predictions appear in the provided text. The central claim rests on empirical training results rather than any reduction of outputs to fitted inputs, self-definitions, or self-citation chains. The curation workflow is presented as the contribution itself, with performance gains attributed to scale and hygiene without circular self-reference. This is a standard non-circular empirical data paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of human oversight in the curation pipeline and the assumption that the resulting data is measurably superior for model training.

axioms (1)

domain assumption: Human reviewers can reliably detect and correct schema mapping errors, annotation fidelity issues, and safety problems during spot-checks of large data subsets.
Invoked in the description of the human-in-the-loop auditing step.

pith-pipeline@v0.9.0 · 5746 in / 1180 out tokens · 47373 ms · 2026-05-21T21:11:36.637909+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "rigorous de-duplication within and across sources and decontamination against 66 public benchmarks"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
cs.CV 2026-04 unverdicted novelty 7.0

BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.

Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models
cs.CV 2026-03 unverdicted novelty 7.0

Q-Mask uses query-conditioned causal masks to separate text location from recognition in OCR VLMs, backed by a new benchmark and 26M-pair training dataset.

20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
cs.LG 2026-05 unverdicted novelty 6.0

Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.

20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
cs.LG 2026-05 conditional novelty 6.0

Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.

Bernini: Latent Semantic Planning for Video Diffusion
cs.CV 2026-05 unverdicted novelty 5.0

Bernini is a framework that uses an MLLM planner to output semantic representations for a DiT renderer to generate or edit videos, reporting SOTA benchmark performance.

OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models
cs.CL 2026-05 unverdicted novelty 5.0

OmniThoughtVis curates 1.8M multimodal CoT samples via teacher distillation, difficulty annotation, and tag-based sampling, yielding consistent gains on nine reasoning benchmarks and allowing 4B models to match or bea...

ZAYA1-VL-8B Technical Report
cs.CV 2026-05 unverdicted novelty 4.0

ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...