SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Alexey Gritsenko; Andreas Steiner; Basil Mustafa; Ibrahim Alabdulmohsin; Jeremiah Harmsen; Lucas Beyer; Michael Tschannen; Muhammad Ferjad Naeem; Nikhil Parthasarathy; Olivier Hénaff

arxiv: 2502.14786 · v1 · submitted 2025-02-20 · 💻 cs.CV · cs.AI

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, Xiaohua Zhai

Pith reviewed 2026-05-10 15:44 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords vision-language encoders · SigLIP · multilingual vision-language models · zero-shot classification · image-text retrieval · localization · dense prediction · self-supervised learning

The pith

SigLIP 2 encoders outperform the original SigLIP at every scale on core vision-language tasks and show large gains on localization and dense prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SigLIP 2, a family of multilingual vision-language encoders that extend the original SigLIP image-text objective with captioning-based pretraining, self-supervised losses such as self-distillation and masked prediction, and online data curation. These additions are combined into one training recipe, and the resulting models beat their SigLIP counterparts across model sizes in zero-shot classification, image-text retrieval, and transfer to vision-language models. The same recipe produces clear improvements on localization and dense feature tasks, supports multiple resolutions while preserving native aspect ratios, and uses a more diverse de-biased data mixture to strengthen multilingual performance and fairness. Checkpoints are released at four sizes from 86 million to 1 billion parameters so users can balance speed and accuracy.
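Self-distillation setups commonly keep a slowly moving "teacher" copy of the model whose outputs the "student" is trained to match. The sketch below shows one widespread way to maintain such a teacher, an exponential moving average (EMA) of the student weights. This is an assumption for illustration: the summary above does not say how SigLIP 2 builds its teacher, and `ema_update` and `momentum` are hypothetical names.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return new teacher weights as an EMA of the student weights.

    Illustrative only: weights are flat lists of floats, and the EMA
    teacher is an assumed design, not taken from the paper summary.
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must be in [0, 1]")
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same length")
    # Each teacher weight moves a small step toward the student weight.
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]


teacher = [0.0, 1.0]
student = [1.0, 1.0]
teacher = ema_update(teacher, student, momentum=0.9)
```

A momentum close to 1 makes the teacher change slowly, which is the usual reason for using an averaged teacher as a stable target.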

Core claim

SigLIP 2 models trained with the extended recipe that unifies captioning pretraining, self-supervised objectives, and online curation outperform prior SigLIP versions at all scales on zero-shot classification, image-text retrieval, and visual representation transfer for VLMs, while also delivering significant gains on localization and dense prediction tasks; multi-resolution variants preserve native aspect ratios and a de-biased diverse data mixture improves multilingual understanding and fairness.
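For readers unfamiliar with how zero-shot classification uses an image-text encoder, a minimal sketch follows: each class name is embedded as text, and the image is assigned to the class whose text embedding is most similar to the image embedding. The embeddings here are toy vectors and `zero_shot_classify` is a hypothetical helper, not an API from the SigLIP 2 release.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_classify(image_emb, class_text_embs):
    """Return the label whose text embedding best matches the image.

    class_text_embs maps a label to the embedding of a prompt for that
    label; in practice both embeddings would come from the encoder.
    """
    return max(class_text_embs,
               key=lambda label: cosine(image_emb, class_text_embs[label]))


# Toy embeddings standing in for real encoder outputs (assumed values).
image_emb = [1.0, 0.1]
class_text_embs = {"cat": [0.9, 0.2], "dog": [0.1, 1.0]}
predicted = zero_shot_classify(image_emb, class_text_embs)
```

No classifier is trained for the target labels, which is why the quality of the shared embedding space decides zero-shot accuracy.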

What carries the argument

The unified training recipe that adds captioning-based pretraining, self-supervised losses (self-distillation and masked prediction), and online data curation to the base SigLIP image-text objective, plus multi-resolution support and de-biasing on a diverse data mixture.
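The base objective that the recipe extends can be sketched as a pairwise sigmoid loss: every image-text pair in a batch is treated as an independent binary decision, positive when the image and text belong together and negative otherwise. The code below is a plain-Python sketch of that idea as described for the original SigLIP; the default `temperature` and `bias` values are assumptions for illustration, and the added captioning and self-supervised terms are not modeled.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_pairwise_loss(img_embs, txt_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of normalized embeddings.

    img_embs[i] and txt_embs[i] are assumed to be a matching pair;
    every other combination in the batch is treated as a negative.
    """
    n = len(img_embs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(img_embs[i], txt_embs[j]))
            logit = temperature * dot + bias
            label = 1.0 if i == j else -1.0
            total -= log_sigmoid(label * logit)
    return total / n


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = sigmoid_pairwise_loss(images, aligned_texts)
loss_swapped = sigmoid_pairwise_loss(images, swapped_texts)
```

With these toy inputs the aligned batch gets a lower loss than the swapped one, which is the behavior the objective is meant to reward.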

If this is right

Outperforms original SigLIP at every model scale on zero-shot classification and image-text retrieval.
Better visual representations for downstream vision-language models.
Substantial gains on localization and dense prediction benchmarks.
Multi-resolution models that keep native aspect ratios improve flexibility.
De-biased diverse training yields stronger multilingual results and fairness.
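To make the native-aspect-ratio point concrete, here is a hedged sketch of how a patch grid might be chosen so that an image keeps roughly its original proportions while staying within a fixed token budget. This is not the paper's algorithm; `patch_grid`, `patch`, and `max_patches` are hypothetical names and the rounding rule is an assumption.

```python
import math


def patch_grid(height, width, patch=16, max_patches=256):
    """Pick (rows, cols) of patches that roughly preserve height:width.

    The image is conceptually rescaled by one factor on both axes so
    that rows * cols does not exceed max_patches.
    """
    if height <= 0 or width <= 0 or patch <= 0 or max_patches <= 0:
        raise ValueError("all arguments must be positive")
    # Scale s such that (s*height/patch) * (s*width/patch) == max_patches.
    scale = math.sqrt(max_patches * patch * patch / (height * width))
    rows = min(max(1, math.floor(scale * height / patch)), max_patches)
    # Clamp columns so the budget holds even for extreme aspect ratios.
    cols = min(max(1, math.floor(scale * width / patch)), max_patches // rows)
    return rows, cols


rows, cols = patch_grid(480, 640)   # landscape image, more columns than rows
square = patch_grid(512, 512)       # square image, square grid
```

A fixed square resize would instead distort the 480x640 image; scaling both axes by the same factor is what keeps the proportions.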

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The localization and dense-feature improvements could make these encoders more useful for tasks like object detection or segmentation inside larger systems.
Releasing multiple sizes from 86M to 1B parameters lets practitioners match model capacity to available compute while keeping the same training benefits.
The de-biasing step may reduce cultural or linguistic skew in applications that serve global users, though its effect on other biases remains untested here.
Because the gains come from a modular recipe, similar combinations could be tested on other vision-language bases to check whether they transfer.
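The size trade-off mentioned above can be sketched as a simple lookup over the four released sizes. The parameter counts are the nominal figures from the abstract; the selection rule and the name `largest_within_budget` are illustrative assumptions, and real deployments would also weigh resolution, latency, and memory.

```python
# Nominal parameter counts as stated in the abstract.
CHECKPOINTS = [
    ("ViT-B", 86_000_000),
    ("L", 303_000_000),
    ("So400m", 400_000_000),
    ("g", 1_000_000_000),
]


def largest_within_budget(max_params):
    """Return the name of the largest checkpoint within max_params.

    Returns None when even the smallest checkpoint is over budget.
    """
    fitting = [(name, size) for name, size in CHECKPOINTS if size <= max_params]
    if not fitting:
        return None
    return max(fitting, key=lambda item: item[1])[0]


choice = largest_within_budget(350_000_000)
```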

Load-bearing premise

That the added captioning pretraining, self-supervised losses, and online curation combine without negative interactions or overfitting to the chosen data mixture, and that de-biasing improves fairness without hurting main performance.

What would settle it

Retraining the exact original SigLIP architecture and data with only the new combined recipe and checking whether zero-shot accuracy, retrieval scores, and localization metrics rise by the claimed margins without trade-offs.

Original abstract

We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SigLIP 2 folds captioning pretraining, self-distillation, masked prediction, and online curation into the original contrastive recipe and reports gains on zero-shot, retrieval, localization, and multilingual tasks across scales.

The core update is straightforward: they take the SigLIP contrastive baseline and add a set of previously published pieces (captioning pretraining, self-supervised losses, and online data curation), then train on a broader, de-biased mixture. The result is consistent lifts over the prior SigLIP models at every size from ViT-B to 1B, with the biggest reported improvements on localization and dense prediction. They also ship multi-resolution variants that keep native aspect ratios and release the checkpoints, which is immediately useful for anyone swapping encoders into VLMs or retrieval systems. The multilingual and fairness angle from the data mix is a practical addition rather than a side claim.

What the paper does cleanly is show that these pieces can be combined without obvious breakage and that the gains appear across standard benchmarks and transfer settings. The multi-scale release and the focus on dense features give it more immediate engineering value than a pure scaling paper.

The main soft spot is attribution. The abstract and results tie the improvements to the unified recipe, but the write-up does not yet isolate how much each added loss or curation step contributes versus simply using more or better data. Controls for total compute and data volume would make the causal story tighter, and the fairness claims would benefit from explicit before-and-after metrics on the core tasks. No internal contradictions jump out, and the work stays grounded in the prior SigLIP results rather than overclaiming novelty.

This is for groups that train or fine-tune vision-language models and want a stronger off-the-shelf encoder, especially if they care about localization or non-English performance. It is incremental engineering rather than a new paradigm, but the empirical pattern is clear enough to be worth checking. I would send it to peer review; the claims are testable and the released models let others verify quickly.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces SigLIP 2, a family of multilingual vision-language encoders extending the original SigLIP image-text objective with captioning-based pretraining, self-supervised losses (self-distillation and masked prediction), and online data curation. The central claim is that this unified recipe yields consistent outperformance over SigLIP baselines at all scales (ViT-B to 1B) on zero-shot classification, image-text retrieval, and VLM transfer tasks, plus substantial gains on localization and dense prediction. Additional variants support multiple resolutions while preserving native aspect ratios, and a more diverse de-biased data mixture improves multilingual understanding and fairness. Checkpoints are released at four sizes.

Significance. If the empirical results hold with proper controls, the work would provide a stronger, practical baseline for vision-language pretraining by showing additive benefits from combining established techniques. Improvements in localization/dense features and multilingual fairness address real limitations in current encoders, and the multi-scale releases enable cost-performance trade-offs. The approach of unifying prior methods into a single recipe could influence subsequent training pipelines, though its value depends on whether gains are attributable to the recipe rather than uncontrolled factors such as total compute or data volume.

major comments (1)

The abstract asserts consistent outperformance and localization gains but provides no quantitative results, ablation studies, or details on experimental controls (e.g., matched data volume, training steps, or resolution); this makes it impossible to assess whether the reported improvements are load-bearing for the central claim or could be explained by confounding factors.

minor comments (2)

Notation for the extended loss (captioning + self-supervised terms) should be defined explicitly, including weighting coefficients, to allow reproduction.
Clarify how online data curation interacts with the de-biasing mixture; any overlap or filtering steps should be described to avoid ambiguity in the data pipeline.
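As an illustration of the kind of explicit notation the first minor comment asks for, the extended objective could be written as a weighted sum of the base sigmoid loss and the added terms. The symbols and the additive form below are assumptions for illustration; the summary above does not give the paper's actual formulation or coefficient values.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{sigmoid}}
  + \lambda_{\text{cap}} \, \mathcal{L}_{\text{cap}}
  + \lambda_{\text{distill}} \, \mathcal{L}_{\text{distill}}
  + \lambda_{\text{mask}} \, \mathcal{L}_{\text{mask}}
```

Here the three \lambda coefficients stand for the loss weights listed as free parameters, and reporting their values (and any schedule over training) is what would make the recipe reproducible.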

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review. We address the single major comment below and have prepared revisions to strengthen the presentation of our results.

Referee: The abstract asserts consistent outperformance and localization gains but provides no quantitative results, ablation studies, or details on experimental controls (e.g., matched data volume, training steps, or resolution); this makes it impossible to assess whether the reported improvements are load-bearing for the central claim or could be explained by confounding factors.

Authors: We agree that the abstract, due to its length constraints, does not contain specific quantitative results, ablation details, or explicit statements on experimental controls. The full manuscript addresses these points through quantitative comparisons across multiple tables and figures, ablation studies in Section 4 that isolate the contribution of each added component (captioning, self-supervised losses, and data curation), and Section 3 which describes the training protocol with matched data volumes, step counts, and resolutions relative to the SigLIP baselines. To make this immediately visible, we will revise the abstract to include a small number of key performance deltas and a brief reference to the controlled experimental setup. These changes ensure the central claim can be evaluated without requiring the reader to consult the full text first.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical recipe evaluated on external benchmarks

The paper describes an empirical training recipe that extends the prior SigLIP objective with captioning pretraining, self-supervised losses, and online curation, then reports performance gains on standard zero-shot, retrieval, VLM transfer, localization, and dense-prediction benchmarks. No equations, uniqueness theorems, or first-principles derivations are present that could reduce a claimed result to a fitted parameter or self-referential definition. Self-citations to the original SigLIP work serve only as the baseline for comparison and do not carry the load of proving the new gains; those gains are measured against held-out test sets. The argument is therefore self-contained against external benchmarks and contains no circular steps.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

As an empirical scaling paper, the central claim rests on standard assumptions of deep learning optimization and data representativeness rather than new mathematical derivations.

free parameters (2)

loss weighting coefficients
Weights balancing the original contrastive loss with added captioning and self-supervised terms are chosen during training.
data mixture proportions
Proportions in the diverse multilingual data mixture including de-biasing are selected to achieve reported fairness gains.

axioms (2)

domain assumption: ViT-based encoder architecture behaves consistently under the added objectives
The paper assumes the base SigLIP architecture scales without modification when new losses are introduced.
domain assumption: Online data curation selects representative samples without introducing selection bias
Assumes the curation process improves quality without distorting the underlying data distribution.

pith-pipeline@v0.9.0 · 5571 in / 1351 out tokens · 69744 ms · 2026-05-10T15:44:04.883873+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Is Dimensionality a Barrier for Retrieval Models?
cs.LG 2026-05 unverdicted novelty 8.0

Dimension d = O(m^{-2} log n) nearly achieves the optimal margin m^rd(+∞, A) for retrieval embeddings, with matching lower bounds showing d = O(k log(n/k)) suffices and is necessary for m = Θ(k^{-1/2}) on k-sparse que...
On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models
cs.CR 2026-05 conditional novelty 8.0

Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-...
Representation Fréchet Loss for Visual Generation
cs.CV 2026-04 unverdicted novelty 8.0

Fréchet Distance optimized as FD-loss in representation space by decoupling population size from batch size improves generator quality, enables one-step generation from multi-step models, and motivates a multi-represe...
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
cs.CV 2026-01 unverdicted novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding
cs.CV 2026-01 unverdicted novelty 8.0

S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.
ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors
cs.CV 2025-12 unverdicted novelty 8.0

ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.
Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
cs.CV 2026-05 unverdicted novelty 7.0

ToolMerge decomposes queries into LLM-planned tool calls merged by boolean operators for long-video keyframe retrieval and introduces the M2M benchmark, showing competitive results with 5% gains on caption retrieval.
DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 7.0

DriveSpatial benchmark shows the best of 15 VLMs trails humans by 28.4 points on spatiotemporal driving tasks, with cognitive scene construction as the main failure mode.
DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders
cs.CV 2026-05 unverdicted novelty 7.0

DecQ uses detail-condensing queries on shallow and deep VFM features to improve both reconstruction PSNR and generative convergence/FID in RAEs without fine-tuning the encoder.
Structured Layout Priors for Robust Out-of-Distribution Visual Document Understanding
cs.CV 2026-05 conditional novelty 7.0

Injecting pre-computed layout priors from RT-DETR into VLM prompts raises markdown F1 from 0.37 to 0.92 on a 10k-page OOD benchmark and cuts infinite-loop failures across domains.
Vision Harnessing Agent for Open Ad-hoc Segmentation
cs.CV 2026-05 unverdicted novelty 7.0

VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.
Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
cs.CV 2026-05 unverdicted novelty 7.0

Mask prior drift and positional attention collapse cause failures in LDVLMs for long generations, fixed by training-free Mask Prior Suppression and Monotonic RoPE Scaling.
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
cs.CV 2026-05 unverdicted novelty 7.0

A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation
cs.CL 2026-05 unverdicted novelty 7.0

Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a tra...
CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models
cs.CV 2026-05 conditional novelty 7.0

LiteLVLM prunes visual tokens for pixel grounding by reversing CLIP visual-text similarity to retain referent region tokens, outperforming prior methods by over 5% with 22% speedup and 2.3x memory reduction without an...
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
cs.RO 2026-05 unverdicted novelty 7.0

MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference
cs.CV 2026-05 unverdicted novelty 7.0

VIP evolves text prompts using visual cues and saliency-aware aggregation inside dino.txt to deliver 1.4-8.4% higher mIoU on dense vision-language tasks with low overhead.
Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation
cs.RO 2026-05 conditional novelty 7.0

A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.
LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
cs.AI 2026-05 unverdicted novelty 7.0

LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
cs.CL 2026-05 unverdicted novelty 7.0

Jina-embeddings-v5-omni creates multimodal embeddings for text, image, audio, and video by freezing the text and media encoders and training only 0.35% of the weights via a VLM-style connector.
BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing
cs.CV 2026-05 unverdicted novelty 7.0

BRIDGE uses separate main and subject paths plus a discrete gate on positional embeddings to improve local edits with coarse masks, raising local SigLIP2-T from 0.39 to 0.50 on its benchmark.
Attention Transfer Is Not Universally Effective for Vision Transformers
cs.CV 2026-05 accept novelty 7.0

Attention transfer from ViT teachers succeeds for only 7 of 11 families and fails for the rest because of architectural mismatch between teacher and student.
Attributions All the Way Down? The Metagame of Interpretability
cs.LG 2026-05 unverdicted novelty 7.0

Defines meta-attributions as directional second-order Shapley values on attribution methods, proves hierarchical decomposition of attributions, and demonstrates applications in language models, vision-language encoder...
OpenGaFF: Open-Vocabulary Gaussian Feature Field with Codebook Attention
cs.CV 2026-05 unverdicted novelty 7.0

OpenGaFF combines a geometry-conditioned Gaussian Feature Field with codebook-guided attention to deliver more spatially coherent open-vocabulary 3D semantic segmentation than prior methods.
MolmoAct2: Action Reasoning Models for Real-world Deployment
cs.RO 2026-05 unverdicted novelty 7.0

MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.
Posterior Augmented Flow Matching
cs.CV 2026-05 unverdicted novelty 7.0

PAFM augments flow matching with an importance-sampled mixture over an approximate posterior of target completions, yielding an unbiased lower-variance estimator that improves FID by up to 3.4 on ImageNet and CC12M.
Differentially Private Contrastive Learning via Bounding Group-level Contribution
cs.CR 2026-04 unverdicted novelty 7.0

DP-GCL improves differentially private contrastive learning by bounding group-level contributions through batch partitioning and intra-group augmentation, delivering 5.6% higher image classification accuracy and 20.1%...
GramSR: Visual Feature Conditioning for Diffusion-Based Super-Resolution
cs.CV 2026-04 unverdicted novelty 7.0

GramSR uses DINOv3 visual features instead of text captions to condition a one-step diffusion model for super-resolution via sequential pixel, semantic, and texture LoRA modules.
StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition
cs.GR 2026-04 unverdicted novelty 7.0

StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.
RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking
cs.CV 2026-04 unverdicted novelty 7.0

RSRCC is a new 126k-question benchmark for fine-grained remote sensing change question-answering, constructed via a hierarchical semi-supervised pipeline with retrieval-augmented Best-of-N ranking.
Evaluating Remote Sensing Image Captions Beyond Metric Biases
cs.CV 2026-04 unverdicted novelty 7.0

Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA pe...
Hybrid Latent Reasoning with Decoupled Policy Optimization
cs.CV 2026-04 unverdicted novelty 7.0

HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.
Coevolving Representations in Joint Image-Feature Diffusion
cs.CV 2026-04 unverdicted novelty 7.0

CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample ...
Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes
cs.CV 2026-04 unverdicted novelty 7.0

Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.
Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models
cs.CV 2026-04 unverdicted novelty 7.0

Audio-Contrastive Preference Optimization (ACPO) mitigates audio hallucination in AVLMs via output-contrastive and input-contrastive objectives that enforce faithful audio grounding.
UNIGEOCLIP: Unified Geospatial Contrastive Learning
cs.CV 2026-04 unverdicted novelty 7.0

UNIGEOCLIP creates a unified embedding for aerial imagery, street views, elevation, text, and coordinates via all-to-all contrastive alignment plus a scaled lat-long encoder, outperforming single-modality and coordina...
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
cs.CV 2026-04 unverdicted novelty 7.0

A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
Bottleneck Tokens for Unified Multimodal Retrieval
cs.LG 2026-04 unverdicted novelty 7.0

Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.
RewardFlow: Generate Images by Optimizing What You Reward
cs.CV 2026-04 unverdicted novelty 7.0

RewardFlow unifies differentiable rewards including a new VQA-based one and uses a prompt-aware adaptive policy with Langevin dynamics to achieve state-of-the-art image editing and compositional generation.
InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding
cs.CV 2026-04 unverdicted novelty 7.0

InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.
Show Me the Infographic I Imagine: Intent-Aware Infographic Retrieval for Authoring Support
cs.IR 2026-04 unverdicted novelty 7.0

Presents a new retrieval system that enriches user queries with an intent taxonomy to improve matching of natural language descriptions to infographic designs and support authoring.
Personalizing Text-to-Image Generation to Individual Taste
cs.CV 2026-04 unverdicted novelty 7.0

PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
cs.CV 2026-04 conditional novelty 7.0

Delta tokens compress VFM feature differences into single tokens, enabling a lightweight generative world model that predicts diverse futures with far lower compute than existing approaches.
No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models
cs.CV 2026-03 unverdicted novelty 7.0

Concept-centric short captions and cross-modal attention pooling yield SOTA compositionality in contrastive V&L models without degrading zero-shot or retrieval performance.
TrajTok: Learning Trajectory Tokens enables better Video Understanding
cs.CV 2026-02 unverdicted novelty 7.0

TrajTok learns adaptive trajectory tokens for videos through a unified end-to-end segmenter, improving understanding performance and efficiency over patch-based or external-pipeline tokenizers.
Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
cs.CV 2026-01 unverdicted novelty 7.0

LocalDPO aligns text-to-video diffusion models with human preferences at the spatio-temporal region level by automatically generating localized preference pairs from corrupted real videos and applying a region-aware DPO loss.
Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
cs.CV 2026-01 unverdicted novelty 7.0

LocalDPO creates localized preference pairs from real videos by applying random spatio-temporal masks and restoring masked regions with the frozen base model, then applies region-restricted DPO loss to improve fidelit...
MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding
cs.CV 2025-12 conditional novelty 7.0

MMLandmarks supplies 197k aerial and 329k ground images plus text and GPS for 18,557 landmarks to benchmark multimodal geo-spatial understanding.
MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors
cs.CV 2025-12 unverdicted novelty 7.0

MoonSeg3R is the first method for online monocular 3D instance segmentation, achieving performance competitive with RGB-D systems by using CUT3R priors for geometric consistency and temporal query memory.
SoccerMaster: A Vision Foundation Model for Soccer Understanding
cs.CV 2025-12 unverdicted novelty 7.0

SoccerMaster is the first soccer-specific vision foundation model that unifies tasks from player detection to event classification via multi-task pretraining and outperforms task-specific models on downstream evaluations.
PowerCLIP: Powerset Alignment for Contrastive Pre-Training
cs.CV 2025-11 conditional novelty 7.0

PowerCLIP improves CLIP-style models by exhaustively aligning powersets of image regions to textual parse trees via efficient non-linear aggregators that approximate the full combinatorial loss.
TRANSPORTER: Transferring Visual Semantics from VLM Manifolds
cs.CV 2025-11 unverdicted novelty 7.0

TRANSPORTER generates videos from VLM logits using optimal transport to interpret model predictions on object attributes, actions, and scenes.
CardioBench: Do Echocardiography Foundation Models Generalize Beyond the Lab?
cs.CV 2025-10 unverdicted novelty 7.0

CardioBench is a new public benchmark that standardizes eight echocardiography datasets into four regression and five classification tasks to evaluate foundation model generalization.
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
cs.CV 2025-07 conditional novelty 7.0

MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
cs.CV 2025-06 unverdicted novelty 7.0

AVA-Bench evaluates vision foundation models by disentangling 14 atomic visual abilities with aligned training-test distributions to reveal precise ability fingerprints.
Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping
cs.CV 2025-05 unverdicted novelty 7.0

A contrastive multimodal framework augments satellite-audio datasets with vision-language model sound descriptions to learn shared soundscape concepts for zero-shot retrieval and synthesis.
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
cs.CV 2025-04 unverdicted novelty 7.0

FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperfor...
Cambrian-P: Pose-Grounded Video Understanding
cs.CV 2026-05 unverdicted novelty 6.0

Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.
Proxy-Based Approximation of Shapley and Banzhaf Interactions
cs.LG 2026-05 unverdicted novelty 6.0

ProxySHAP approximates higher-order Shapley and Banzhaf interactions via tree proxies plus residual correction and a polynomial-time interventional TreeSHAP generalization for tree ensembles.
Proxy-Based Approximation of Shapley and Banzhaf Interactions
cs.LG 2026-05 unverdicted novelty 6.0

ProxySHAP uses tree proxies plus residual correction to achieve state-of-the-art approximation of Shapley and Banzhaf interactions, with a polynomial-time exact method for tree ensembles.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · cited by 205 Pith papers · 12 internal anchors

[1]

I. Alabdulmohsin, X. Zhai, A. Kolesnikov, and L. Beyer. Getting vit in shape: Scaling laws for compute-optimal model design. In NeurIPS, 2023

work page 2023
[2]

I. Alabdulmohsin, X. Wang, A. P. Steiner, P. Goyal, A. D’Amour, and X. Zhai. Clip the bias: How useful is balancing data in multimodal learning? In ICLR, 2024

work page 2024
[3]

J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

A. Barbu, D. Mayo, J. Alverio, W. Luo, C. Wang, D. Gutfreund, J. Tenenbaum, and B. Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. NeurIPS, 2019

work page 2019
[5]

L. Beyer, O. J. Hénaff, A. Kolesnikov, X. Zhai, and A. v. d. Oord. Are we done with imagenet? arXiv:2006.07159, 2020

work page arXiv 2006
[6]

L. Beyer, P. Izmailov, A. Kolesnikov, M. Caron, S. Kornblith, X. Zhai, M. Minderer, M. Tschannen, I. Alabdulmohsin, and F. Pavetic. Flexivit: One model for all patch sizes. In CVPR, 2023

work page 2023
[7]

PaliGemma: A versatile 3B VLM for transfer

L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, T. Unterthiner, D. Keysers, S. Koppula, F. Liu, A. Grycner, A. Gritsenko, N. Houlsby, M. Kumar, K. Rong, ...

work page internal anchor Pith review arXiv 2024
[8]

H. Caesar, J. Uijlings, and V. Ferrari. Coco-stuff: Thing and stuff classes in context. In CVPR, 2018

work page 2018
[9]

M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. In CVPR, pages 9650–9660, 2021

work page 2021
[10]

X. Chen, X. Wang, S. Changpinyo, A. J. Piergiovanni, P. Padlewski, D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer, A. Kolesnikov, J. Puigcerver, N. Ding, K. Rong, H. Akbari, G. Mishra, L. Xue, A. Thapliyal, J. Bradbury, W. Kuo, M. Seyedhosseini, C. Jia, B. K. Ayan, C. Riquelme, A. Steiner, A. Angelova, X. Zhai, N. Houlsby, and R. Soricut. PaLI: A j...

work page internal anchor Pith review arXiv 2022
[11]

S. Cho, H. Shin, S. Hong, A. Arnab, P. H. Seo, and S. Kim. Cat-seg: Cost aggregation for open-vocabulary semantic segmentation. In CVPR, pages 4113–4123, 2024

work page 2024
[12]

M. Dehghani, B. Mustafa, J. Djolonga, J. Heek, M. Minderer, M. Caron, A. Steiner, J. Puigcerver, R. Geirhos, I. M. Alabdulmohsin, et al. Patch n’ pack: NaViT, a vision transformer for any aspect ratio and resolution. NeurIPS, 2024

work page 2024
[13]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009

work page 2009
[14]

J. Ding, N. Xue, G.-S. Xia, and D. Dai. Decoupling zero-shot semantic segmentation. In CVPR, pages 11583–11592, 2022

work page 2022
[15]

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021

work page 2021
[16]

T. Evans, N. Parthasarathy, H. Merzic, and O. J. Henaff. Data curation via joint example selection further accelerates multimodal learning. In NeurIPS Datasets and Benchmarks Track, 2024

work page 2024
[17]

M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2010

work page 2010
[18]

L. Fan, D. Krishnan, P. Isola, D. Katabi, and Y. Tian. Improving clip training with language rewrites. NeurIPS, pages 35544–35575, 2023

work page 2023
[19]

A. Fang, A. M. Jose, A. Jain, L. Schmidt, A. T. Toshev, and V. Shankar. Data filtering networks. In ICLR, 2024

work page 2024
[20]

E. Fini, M. Shukor, X. Li, P. Dufter, M. Klein, D. Haldimann, S. Aitharaju, V. G. T. da Costa, L. Béthune, Z. Gan, A. T. Toshev, M. Eichner, M. Nabi, Y. Yang, J. M. Susskind, and A. El-Nouby. Multimodal autoregressive pre-training of large vision encoders. arXiv:2411.14402, 2024

work page arXiv 2024
[21]

S. Y. Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. NeurIPS, 36, 2024

work page 2024
[22]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team. Gemma: Open models based on gemini research and technology. arXiv:2403.08295, 2024

work page internal anchor Pith review arXiv 2024
[23]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team. Gemma 2: Improving open language models at a practical size. arXiv:2408.00118, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Introduction to Cloud TPU

Google Cloud. Introduction to Cloud TPU. https://cloud.google.com/tpu/docs/intro-to-tpu, 20xx. Accessed: 2024-07-04.

work page 2024
[25]

A. Gupta, P. Dollar, and R. Girshick. Lvis: A dataset for large vocabulary instance segmentation. In CVPR, pages 5356–5364, 2019

work page 2019
[26]

T.-Y. Hsu, C. L. Giles, and T.-H. Huang. Scicap: Generating captions for scientific figures. arXiv:2110.11624, 2021

work page arXiv 2021
[27]

G. Ilharco, M. Wortsman, R. Wightman, C. Gordon, N. Carlini, R. Taori, A. Dave, V. Shankar, H. Namkoong, J. Miller, H. Hajishirzi, A. Farhadi, and L. Schmidt. OpenCLIP, 2021

work page 2021
[28]

C. Jia, Y. Yang, Y. Xia, Y. Chen, Z. Parekh, H. Pham, Q. V. Le, Y. Sung, Z. Li, and T. Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021

work page 2021
[29]

S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, Oct. 2014

work page 2014
[30]

W. Kuo, Y. Cui, X. Gu, A. Piergiovanni, and A. Angelova. Open-vocabulary object detection upon frozen vision and language models. In ICLR, 2023

work page 2023
[31]

Z. Lai, H. Zhang, B. Zhang, W. Wu, H. Bai, A. Timofeev, X. Du, Z. Gan, J. Shan, C.-N. Chuah, Y. Yang, and M. Cao. VeCLIP: Improving clip training via visual-enriched captions. arXiv:2310.07699, 2024

work page arXiv 2024
[32]

J. Li, D. Li, S. Savarese, and S. C. H. Hoi. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023

work page 2023
[33]

X. Li, Z. Wang, and C. Xie. Clipa-v2: Scaling clip training with 81.1% zero-shot imagenet accuracy within a $10,000 budget; an extra $4,000 unlocks 81.8% accuracy. arXiv:2306.15658, 2023

work page arXiv 2023
[34]

T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. arXiv:1405.0312, 2014

work page internal anchor Pith review arXiv 2014
[35]

H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In NeurIPS, 2023

work page 2023
[36]

S. Long, S. Qin, D. Panteleev, A. Bissacco, Y. Fujii, and M. Raptis. ICDAR 2023 competition on hierarchical text detection and recognition. In ICDAR, 2023

work page 2023
[37]

Decoupled Weight Decay Regularization

I. Loshchilov, F. Hutter, et al. Fixing weight decay regularization in adam. arXiv preprint arXiv:1711.05101, 5, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[38]

K.-K. Maninis, K. Chen, S. Ghosh, A. Karpur, K. Chen, Y. Xia, B. Cao, D. Salz, G. Han, J. Dlabal, et al. TIPS: Text-image pretraining with spatial awareness. In ICLR, 2025

work page 2025
[39]

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

B. McKinzie, Z. Gan, J. Fauconnier, S. Dodge, B. Zhang, P. Dufter, D. Shah, X. Du, F. Peng, F. Weers, A. Belyi, H. Zhang, K. Singh, D. Kang, A. Jain, H. Hè, M. Schwarzer, T. Gunter, X. Kong, A. Zhang, J. Wang, C. Wang, N. Du, T. Lei, S. Wiseman, G. Yin, M. Lee, Z. Wang, R. Pang, P. Grasch, A. Toshev, and Y. Yang. MM1: methods, analysis & insights from mul...

work page internal anchor Pith review arXiv 2024
[40]

M. Minderer, A. Gritsenko, A. Stone, M. Neumann, D. Weissenborn, A. Dosovitskiy, A. Mahendran, A. Arnab, M. Dehghani, Z. Shen, et al. Simple open-vocabulary object detection. In ECCV, pages 728–755, 2022

work page 2022
[41]

M. Minderer, A. A. Gritsenko, and N. Houlsby. Scaling open-vocabulary object detection. In NeurIPS, 2023

work page 2023
[42]

S. Mindermann, J. M. Brauner, M. T. Razzak, M. Sharma, A. Kirsch, W. Xu, B. Höltgen, A. N. Gomez, A. Morisot, S. Farquhar, et al. Prioritized training on points that are learnable, worth learning, and not yet learnt. In ICML, pages 15630–15649, 2022

work page 2022
[43]

R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014.

work page 2014
[44]

N. Mu, A. Kirillov, D. Wagner, and S. Xie. SLIP: Self-supervision meets language-image pre-training. In ECCV, pages 529–544, 2022

work page 2022
[45]

M. F. Naeem, Y. Xian, X. Zhai, L. Hoyer, L. Van Gool, and F. Tombari. SILC: Improving vision language pretraining with self-distillation. In ECCV, pages 38–55, 2024

work page 2024
[46]

T. Nguyen, S. Y. Gadre, G. Ilharco, S. Oh, and L. Schmidt. Improving multimodal datasets with image captioning. NeurIPS, 36, 2024

work page 2024
[47]

M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. Dinov2: Learning robust visual features without supervision. TMLR, 2024

work page 2024
[48]

Z. Peng, W. Wang, L. Dong, Y. Hao, S. Huang, S. Ma, and F. Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv:2306.14824, 2023

work page internal anchor Pith review arXiv 2023
[49]

A. Pouget, L. Beyer, E. Bugliarello, X. Wang, A. P. Steiner, X. Zhai, and I. Alabdulmohsin. No filter: Cultural and socioeconomic diversity in contrastive vision-language models. arXiv:2405.13777, 2024

work page arXiv 2024
[50]

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021

work page 2021
[51]

V. V. Ramaswamy, S. Y. Lin, D. Zhao, A. Adcock, L. van der Maaten, D. Ghadiyaram, and O. Russakovsky. Geode: a geographically diverse evaluation dataset for object recognition. NeurIPS, 36, 2024

work page 2024
[52]

R. Ranftl, A. Bochkovskiy, and V. Koltun. Vision transformers for dense prediction. In CVPR, pages 12179–12188, 2021

work page 2021
[53]

B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do imagenet classifiers generalize to imagenet? In ICML, pages 5389–5400, 2019

work page 2019
[54]

W. A. G. Rojas, S. Diamos, K. R. Kini, D. Kanter, V. J. Reddi, and C. Coleman. The dollar street dataset: Images representing the geographic and socioeconomic diversity of the world. In NeurIPS Datasets and Benchmarks Track, 2022

work page 2022
[55]

O. Sidorov, R. Hu, M. Rohrbach, and A. Singh. TextCaps: A dataset for image captioning with reading comprehension. In ECCV, 2020

work page 2020
[56]

A. Steiner, A. S. Pinto, M. Tschannen, D. Keysers, X. Wang, Y. Bitton, A. Gritsenko, M. Minderer, A. Sherbondy, S. Long, et al. Paligemma 2: A family of versatile vlms for transfer. arXiv:2412.03555, 2024

work page internal anchor Pith review arXiv 2024
[57]

Q. Sun, Y. Fang, L. Wu, X. Wang, and Y. Cao. EVA-CLIP: Improved training techniques for clip at scale. arXiv:2303.15389, 2023

work page internal anchor Pith review arXiv 2023
[58]

A. V. Thapliyal, J. Pont Tuset, X. Chen, and R. Soricut. Crossmodal-3600: A massively multilingual multimodal evaluation dataset. In EMNLP, 2022

work page 2022
[59]

S. Tong, E. Brown, P. Wu, S. Woo, M. Middepogu, S. C. Akula, J. Yang, S. Yang, A. Iyer, X. Pan, A. Wang, R. Fergus, Y. LeCun, and S. Xie. Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. arXiv:2406.16860, 2024

work page internal anchor Pith review arXiv 2024
[60]

M. Tschannen, M. Kumar, A. Steiner, X. Zhai, N. Houlsby, and L. Beyer. Image captioners are scalable vision learners too. In NeurIPS, 2023

work page 2023
[61]

V. Udandarao, N. Parthasarathy, M. F. Naeem, T. Evans, S. Albanie, F. Tombari, Y. Xian, A. Tonioni, and O. J. Hénaff. Active data curation effectively distills large-scale multimodal models. arXiv:2411.18674, 2024

work page arXiv 2024
[62]

B. Wan, M. Tschannen, Y. Xian, F. Pavetic, I. Alabdulmohsin, X. Wang, A. S. Pinto, A. Steiner, L. Beyer, and X. Zhai. LocCa: Visual pretraining with location-aware captioners. In NeurIPS, 2024.

work page 2024
[63]

B. Wang, G. Li, X. Zhou, Z. Chen, T. Grossman, and Y. Li. Screen2words: Automatic mobile ui summarization with multimodal learning. In Symposium on User Interface Software and Technology, 2021

work page 2021
[64]

Z. Wang, J. Yu, A. W. Yu, Z. Dai, Y. Tsvetkov, and Y. Cao. SimVLM: Simple visual language model pretraining with weak supervision. In ICLR, 2022

work page 2022
[65]

T. Weyand, A. Araujo, B. Cao, and J. Sim. Google landmarks dataset v2 - a large-scale benchmark for instance-level recognition and retrieval. In CVPR, pages 2575–2584, 2020

work page 2020
[66]

H. Xu, S. Xie, X. Tan, P.-Y. Huang, R. Howes, V. Sharma, S.-W. Li, G. Ghosh, L. Zettlemoyer, and C. Feichtenhofer. Demystifying clip data. In ICLR, 2024

work page 2024
[67]

J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu. CoCa: Contrastive captioners are image-text foundation models. TMLR, 2022

work page 2022
[68]

L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In ECCV, pages 69–85, 2016

work page 2016
[69]

X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer. Scaling vision transformers. CVPR, 2022

work page 2022
[70]

X. Zhai, X. Wang, B. Mustafa, A. Steiner, D. Keysers, A. Kolesnikov, and L. Beyer. Lit: Zero-shot transfer with locked-image text tuning. In CVPR, 2022

work page 2022
[71]

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023

work page 2023
[72]

Y. Zhao, A. Gu, R. Varma, L. Luo, C. Huang, M. Xu, L. Wright, H. Shojanazeri, M. Ott, S. Shleifer, A. Desmaison, C. Balioglu, P. Damania, B. Nguyen, G. Chauhan, Y. Hao, A. Mathews, and S. Li. Pytorch FSDP: experiences on scaling fully sharded data parallel. VLDB, 2023

work page 2023
[73]

B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ade20k dataset. In CVPR, 2017

work page 2017
[74]

B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba. Semantic understanding of scenes through the ade20k dataset. IJCV, 2019

work page 2019
[75]

J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong. Image BERT pre-training with online tokenizer. In ICLR, 2022.

work page 2022
