SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
Pith reviewed 2026-05-10 15:44 UTC · model grok-4.3
The pith
SigLIP 2 encoders outperform the original SigLIP at every scale on core vision-language tasks and show large gains on localization and dense prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SigLIP 2 models are trained with an extended recipe that unifies captioning-based pretraining, self-supervised objectives, and online data curation. They outperform their SigLIP counterparts at all model scales on zero-shot classification, image-text retrieval, and visual representation transfer for VLMs, and they deliver significant gains on localization and dense prediction tasks. Multi-resolution variants preserve the input's native aspect ratio, and a more diverse data mixture with de-biasing techniques improves multilingual understanding and fairness.
What carries the argument
The unified training recipe that adds captioning-based pretraining, self-supervised losses (self-distillation and masked prediction), and online data curation to the base SigLIP image-text objective, plus multi-resolution support and de-biasing on a diverse data mixture.
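The base image-text objective named above is, in the original SigLIP, a pairwise sigmoid loss: every image-text pair in a batch is scored independently as a match or a non-match. The following is a minimal pure-Python sketch of that base term only, under stated assumptions. The function names (`log_sigmoid`, `l2_normalize`, `sigmoid_pair_loss`) are illustrative, the temperature and bias are fixed constants here rather than learned parameters, and none of the added SigLIP 2 objectives (captioning, self-distillation, masked prediction) are modeled.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # Branch on the sign so that exp() never receives a large positive argument.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sigmoid_pair_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of matched (image, text) embeddings.

    Row i of image_embs is assumed to match row i of text_embs; every other
    pairing in the batch is treated as a negative. temperature and bias are
    illustrative constants (assumption), not values taken from this page.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            similarity = sum(a * b for a, b in zip(imgs[i], txts[j]))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0  # +1 for matched pairs, -1 otherwise
            total -= log_sigmoid(label * logit)
    return total / n  # average over the batch size


aligned = sigmoid_pair_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = sigmoid_pair_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Because each pair contributes its own binary term, the loss needs no normalization across the batch, which is the property that distinguishes it from a softmax-based contrastive loss.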
If this is right
- Outperforms original SigLIP at every model scale on zero-shot classification and image-text retrieval.
- Better visual representations for downstream vision-language models.
- Substantial gains on localization and dense prediction benchmarks.
- Multi-resolution models that keep native aspect ratios improve flexibility.
- De-biased diverse training yields stronger multilingual results and fairness.
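To make the aspect-ratio point in the list above concrete, here is a hedged sketch of one way a patch grid could be chosen so that an image keeps roughly its native proportions under a fixed token budget. This is an illustrative assumption, not the procedure used for the SigLIP 2 multi-resolution variants: the function name `patch_grid`, the patch size of 16, and the budget of 256 tokens are all hypothetical.

```python
import math


def patch_grid(height: int, width: int, patch: int = 16, max_tokens: int = 256):
    """Pick a (rows, cols) patch grid that roughly keeps the input aspect
    ratio while using at most max_tokens patches.

    Returns (rows, cols, (resized_height, resized_width)).
    All defaults are hypothetical values chosen for illustration.
    """
    if height <= 0 or width <= 0:
        raise ValueError("image dimensions must be positive")
    # Uniform scale s at which (s*H/patch) * (s*W/patch) equals max_tokens.
    scale = math.sqrt(max_tokens * patch * patch / (height * width))
    # Round down so the budget is respected; clamp for extreme aspect ratios.
    rows = min(max(1, math.floor(scale * height / patch)), max_tokens)
    cols = max(1, min(math.floor(scale * width / patch), max_tokens // rows))
    return rows, cols, (rows * patch, cols * patch)


rows, cols, size = patch_grid(480, 640)
print(rows, cols, size)
```

For a 480 by 640 input this yields a 13 by 18 grid (234 patches, resized to 208 by 288), whereas a fixed square grid would force the image into a 1:1 shape.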
Where Pith is reading between the lines
- The localization and dense-feature improvements could make these encoders more useful for tasks like object detection or segmentation inside larger systems.
- Releasing multiple sizes from 86M to 1B parameters lets practitioners match model capacity to available compute while keeping the same training benefits.
- The de-biasing step may reduce cultural or linguistic skew in applications that serve global users, though its effect on other biases remains untested here.
- Because the gains come from a modular recipe, similar combinations could be tested on other vision-language bases to check whether they transfer.
Load-bearing premise
That the added captioning pretraining, self-supervised losses, and online curation combine without negative interactions or overfitting to the chosen data mixture, and that de-biasing improves fairness without hurting main performance.
What would settle it
Retraining the exact original SigLIP architecture and data with only the new combined recipe and checking whether zero-shot accuracy, retrieval scores, and localization metrics rise by the claimed margins without trade-offs.
Original abstract
We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SigLIP 2, a family of multilingual vision-language encoders extending the original SigLIP image-text objective with captioning-based pretraining, self-supervised losses (self-distillation and masked prediction), and online data curation. The central claim is that this unified recipe yields consistent outperformance over SigLIP baselines at all scales (ViT-B at 86M to g at 1B) on zero-shot classification, image-text retrieval, and VLM transfer tasks, plus substantial gains on localization and dense prediction. Additional variants support multiple resolutions while preserving native aspect ratios, and a more diverse de-biased data mixture improves multilingual understanding and fairness. Checkpoints are released at four sizes.
Significance. If the empirical results hold with proper controls, the work would provide a stronger, practical baseline for vision-language pretraining by showing additive benefits from combining established techniques. Improvements in localization/dense features and multilingual fairness address real limitations in current encoders, and the multi-scale releases enable cost-performance trade-offs. The approach of unifying prior methods into a single recipe could influence subsequent training pipelines, though its value depends on whether gains are attributable to the recipe rather than uncontrolled factors such as total compute or data volume.
major comments (1)
- The abstract asserts consistent outperformance and localization gains but provides no quantitative results, ablation studies, or details on experimental controls (e.g., matched data volume, training steps, or resolution); this makes it impossible to assess whether the reported improvements are load-bearing for the central claim or could be explained by confounding factors.
minor comments (2)
- Notation for the extended loss (captioning + self-supervised terms) should be defined explicitly, including weighting coefficients, to allow reproduction.
- Clarify how online data curation interacts with the de-biasing mixture; any overlap or filtering steps should be described to avoid ambiguity in the data pipeline.
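As an illustration of the explicit notation the first minor comment asks for, the extended objective could be written as a weighted sum of its components. This is a sketch under stated assumptions: the weights $\lambda_{\text{cap}}$, $\lambda_{\text{distill}}$, and $\lambda_{\text{mask}}$ are hypothetical symbols, the abstract gives no values for them, and the split into four terms simply follows the components the abstract lists. The sigmoid term is the pairwise loss of the original SigLIP, with batch $B$, normalized image and text embeddings $\mathbf{x}_i$ and $\mathbf{y}_j$, temperature $t$, and bias $b$.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{sig}}
  + \lambda_{\text{cap}}\,\mathcal{L}_{\text{cap}}
  + \lambda_{\text{distill}}\,\mathcal{L}_{\text{distill}}
  + \lambda_{\text{mask}}\,\mathcal{L}_{\text{mask}},
\qquad
\mathcal{L}_{\text{sig}}
  = -\frac{1}{|B|}\sum_{i=1}^{|B|}\sum_{j=1}^{|B|}
    \log \sigma\!\big(z_{ij}\,(t\,\mathbf{x}_i \cdot \mathbf{y}_j + b)\big),
\qquad
z_{ij} =
  \begin{cases}
    +1 & \text{if } i = j,\\
    -1 & \text{otherwise.}
  \end{cases}
```

Stating the coefficients in this form would let a reader see at a glance which terms the ablations switch off.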
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. We address the single major comment below and have prepared revisions to strengthen the presentation of our results.
Point-by-point responses
-
Referee: The abstract asserts consistent outperformance and localization gains but provides no quantitative results, ablation studies, or details on experimental controls (e.g., matched data volume, training steps, or resolution); this makes it impossible to assess whether the reported improvements are load-bearing for the central claim or could be explained by confounding factors.
Authors: We agree that the abstract, due to its length constraints, does not contain specific quantitative results, ablation details, or explicit statements on experimental controls. The full manuscript addresses these points through quantitative comparisons across multiple tables and figures, ablation studies in Section 4 that isolate the contribution of each added component (captioning, self-supervised losses, and data curation), and Section 3 which describes the training protocol with matched data volumes, step counts, and resolutions relative to the SigLIP baselines. To make this immediately visible, we will revise the abstract to include a small number of key performance deltas and a brief reference to the controlled experimental setup. These changes ensure the central claim can be evaluated without requiring the reader to consult the full text first.
revision: yes
Circularity Check
No significant circularity; empirical recipe evaluated on external benchmarks
Full rationale
The paper describes an empirical training recipe that extends the prior SigLIP objective with captioning pretraining, self-supervised losses, and online curation, then reports performance gains on standard zero-shot, retrieval, VLM transfer, localization, and dense-prediction benchmarks. No equations, uniqueness theorems, or first-principles derivations are present that could reduce a claimed result to a fitted parameter or self-referential definition. Self-citations to the original SigLIP work serve only as the baseline for comparison and do not carry the load of proving the new gains; those gains are measured against held-out test sets. The argument is therefore self-contained against external benchmarks and contains no circular steps.
Axiom & Free-Parameter Ledger
free parameters (2)
- loss weighting coefficients
- data mixture proportions
axioms (2)
- domain assumption: ViT-based encoder architecture behaves consistently under the added objectives
- domain assumption: Online data curation selects representative samples without introducing selection bias
Forward citations
Cited by 60 Pith papers
-
Is Dimensionality a Barrier for Retrieval Models?
Dimension d = O(m^{-2} log n) nearly achieves the optimal margin m^rd(+∞, A) for retrieval embeddings, with matching lower bounds showing d = O(k log(n/k)) suffices and is necessary for m = Θ(k^{-1/2}) on k-sparse que...
-
On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models
Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-...
-
Representation Fréchet Loss for Visual Generation
Fréchet Distance optimized as FD-loss in representation space by decoupling population size from batch size improves generator quality, enables one-step generation from multi-step models, and motivates a multi-represe...
-
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
-
S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding
S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.
-
ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors
ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.
-
Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
ToolMerge decomposes queries into LLM-planned tool calls merged by boolean operators for long-video keyframe retrieval and introduces the M2M benchmark, showing competitive results with 5% gains on caption retrieval.
-
DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving
DriveSpatial benchmark shows the best of 15 VLMs trails humans by 28.4 points on spatiotemporal driving tasks, with cognitive scene construction as the main failure mode.
-
DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders
DecQ uses detail-condensing queries on shallow and deep VFM features to improve both reconstruction PSNR and generative convergence/FID in RAEs without fine-tuning the encoder.
-
Structured Layout Priors for Robust Out-of-Distribution Visual Document Understanding
Injecting pre-computed layout priors from RT-DETR into VLM prompts raises markdown F1 from 0.37 to 0.92 on a 10k-page OOD benchmark and cuts infinite-loop failures across domains.
-
Vision Harnessing Agent for Open Ad-hoc Segmentation
VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.
-
Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
Mask prior drift and positional attention collapse cause failures in LDVLMs for long generations, fixed by training-free Mask Prior Suppression and Monotonic RoPE Scaling.
-
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
-
Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation
Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a tra...
-
CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large VIsion-Language Models
LiteLVLM prunes visual tokens for pixel grounding by reversing CLIP visual-text similarity to retain referent region tokens, outperforming prior methods by over 5% with 22% speedup and 2.3x memory reduction without an...
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
-
VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference
VIP evolves text prompts using visual cues and saliency-aware aggregation inside dino.txt to deliver 1.4-8.4% higher mIoU on dense vision-language tasks with low overhead.
-
Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation
A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.
-
LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
-
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
Jina-embeddings-v5-omni creates multimodal embeddings for text, image, audio, and video by freezing the text and media encoders and training only 0.35% of the weights via a VLM-style connector.
-
BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing
BRIDGE uses separate main and subject paths plus a discrete gate on positional embeddings to improve local edits with coarse masks, raising local SigLIP2-T from 0.39 to 0.50 on its benchmark.
-
Attention Transfer Is Not Universally Effective for Vision Transformers
Attention transfer from ViT teachers succeeds for only 7 of 11 families and fails for the rest because of architectural mismatch between teacher and student.
-
Attributions All the Way Down? The Metagame of Interpretability
Defines meta-attributions as directional second-order Shapley values on attribution methods, proves hierarchical decomposition of attributions, and demonstrates applications in language models, vision-language encoder...
-
OpenGaFF: Open-Vocabulary Gaussian Feature Field with Codebook Attention
OpenGaFF combines a geometry-conditioned Gaussian Feature Field with codebook-guided attention to deliver more spatially coherent open-vocabulary 3D semantic segmentation than prior methods.
-
MolmoAct2: Action Reasoning Models for Real-world Deployment
MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.
-
Posterior Augmented Flow Matching
PAFM augments flow matching with an importance-sampled mixture over an approximate posterior of target completions, yielding an unbiased lower-variance estimator that improves FID by up to 3.4 on ImageNet and CC12M.
-
Differentially Private Contrastive Learning via Bounding Group-level Contribution
DP-GCL improves differentially private contrastive learning by bounding group-level contributions through batch partitioning and intra-group augmentation, delivering 5.6% higher image classification accuracy and 20.1%...
-
GramSR: Visual Feature Conditioning for Diffusion-Based Super-Resolution
GramSR uses DINOv3 visual features instead of text captions to condition a one-step diffusion model for super-resolution via sequential pixel, semantic, and texture LoRA modules.
-
StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition
StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.
-
RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking
RSRCC is a new 126k-question benchmark for fine-grained remote sensing change question-answering, constructed via a hierarchical semi-supervised pipeline with retrieval-augmented Best-of-N ranking.
-
Evaluating Remote Sensing Image Captions Beyond Metric Biases
Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA pe...
-
Hybrid Latent Reasoning with Decoupled Policy Optimization
HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.
-
Coevolving Representations in Joint Image-Feature Diffusion
CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample ...
-
Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes
Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.
-
Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models
Audio-Contrastive Preference Optimization (ACPO) mitigates audio hallucination in AVLMs via output-contrastive and input-contrastive objectives that enforce faithful audio grounding.
-
UNIGEOCLIP: Unified Geospatial Contrastive Learning
UNIGEOCLIP creates a unified embedding for aerial imagery, street views, elevation, text, and coordinates via all-to-all contrastive alignment plus a scaled lat-long encoder, outperforming single-modality and coordina...
-
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
-
Bottleneck Tokens for Unified Multimodal Retrieval
Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.
-
RewardFlow: Generate Images by Optimizing What You Reward
RewardFlow unifies differentiable rewards including a new VQA-based one and uses a prompt-aware adaptive policy with Langevin dynamics to achieve state-of-the-art image editing and compositional generation.
-
InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding
InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.
-
Show Me the Infographic I Imagine: Intent-Aware Infographic Retrieval for Authoring Support
Presents a new retrieval system that enriches user queries with an intent taxonomy to improve matching of natural language descriptions to infographic designs and support authoring.
-
Personalizing Text-to-Image Generation to Individual Taste
PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
-
A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
Delta tokens compress VFM feature differences into single tokens, enabling a lightweight generative world model that predicts diverse futures with far lower compute than existing approaches.
-
No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models
Concept-centric short captions and cross-modal attention pooling yield SOTA compositionality in contrastive V&L models without degrading zero-shot or retrieval performance.
-
TrajTok: Learning Trajectory Tokens enables better Video Understanding
TrajTok learns adaptive trajectory tokens for videos through a unified end-to-end segmenter, improving understanding performance and efficiency over patch-based or external-pipeline tokenizers.
-
Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
LocalDPO aligns text-to-video diffusion models with human preferences at the spatio-temporal region level by automatically generating localized preference pairs from corrupted real videos and applying a region-aware DPO loss.
-
Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
LocalDPO creates localized preference pairs from real videos by applying random spatio-temporal masks and restoring masked regions with the frozen base model, then applies region-restricted DPO loss to improve fidelit...
-
MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding
MMLandmarks supplies 197k aerial and 329k ground images plus text and GPS for 18,557 landmarks to benchmark multimodal geo-spatial understanding.
-
MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors
MoonSeg3R is the first method for online monocular 3D instance segmentation, achieving performance competitive with RGB-D systems by using CUT3R priors for geometric consistency and temporal query memory.
-
SoccerMaster: A Vision Foundation Model for Soccer Understanding
SoccerMaster is the first soccer-specific vision foundation model that unifies tasks from player detection to event classification via multi-task pretraining and outperforms task-specific models on downstream evaluations.
-
PowerCLIP: Powerset Alignment for Contrastive Pre-Training
PowerCLIP improves CLIP-style models by exhaustively aligning powersets of image regions to textual parse trees via efficient non-linear aggregators that approximate the full combinatorial loss.
-
TRANSPORTER: Transferring Visual Semantics from VLM Manifolds
TRANSPORTER generates videos from VLM logits using optimal transport to interpret model predictions on object attributes, actions, and scenes.
-
CardioBench: Do Echocardiography Foundation Models Generalize Beyond the Lab?
CardioBench is a new public benchmark that standardizes eight echocardiography datasets into four regression and five classification tasks to evaluate foundation model generalization.
-
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.
-
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
AVA-Bench evaluates vision foundation models by disentangling 14 atomic visual abilities with aligned training-test distributions to reveal precise ability fingerprints.
-
Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping
A contrastive multimodal framework augments satellite-audio datasets with vision-language model sound descriptions to learn shared soundscape concepts for zero-shot retrieval and synthesis.
-
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperfor...
-
Cambrian-P: Pose-Grounded Video Understanding
Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.
-
Proxy-Based Approximation of Shapley and Banzhaf Interactions
ProxySHAP approximates higher-order Shapley and Banzhaf interactions via tree proxies plus residual correction and a polynomial-time interventional TreeSHAP generalization for tree ensembles.
-
Proxy-Based Approximation of Shapley and Banzhaf Interactions
ProxySHAP uses tree proxies plus residual correction to achieve state-of-the-art approximation of Shapley and Banzhaf interactions, with a polynomial-time exact method for tree ensembles.
Reference graph
Works this paper leans on
-
[1]
I. Alabdulmohsin, X. Zhai, A. Kolesnikov, and L. Beyer. Getting vit in shape: Scaling laws for compute-optimal model design. In NeurIPS, 2023
work page 2023
-
[2]
I. Alabdulmohsin, X. Wang, A. P. Steiner, P. Goyal, A. D’Amour, and X. Zhai. Clip the bias: How useful is balancing data in multimodal learning? InICLR, 2024
work page 2024
-
[3]
J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P.Wang, J.Lin, C.Zhou, andJ.Zhou. Qwen- VL: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv:2308.12966, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [4]
-
[5]
Are we done with imagenet?arXiv preprint arXiv:2006.07159,
L. Beyer, O. J. Hénaff, A. Kolesnikov, X. Zhai, and A. v. d. Oord. Are we done with ima- genet? arXiv:2006.07159, 2020
- [6]
-
[7]
PaliGemma: A versatile 3B VLM for transfer
L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neu- mann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, T. Unterthiner, D. Keysers, 12 SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features S. Koppula, F. Liu, A. Grycner, A. Gritsenko, N. Houlsby, M. Kumar, K. Rong, ...
work page internal anchor Pith review arXiv 2024
- [8]
- [9]
-
[10]
X. Chen, X. Wang, S. Changpinyo, A. J. Piergiovanni, P. Padlewski, D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer, A. Kolesnikov, J. Puigcerver, N. Ding, K. Rong, H. Akbari, G. Mishra, L. Xue, A. Thapliyal, J. Bradbury, W. Kuo, M. Seyedhosseini, C. Jia, B. K. Ayan, C. Riquelme, A. Steiner, A. Angelova, X. Zhai, N. Houlsby, and R. Soricut. PaLI: A j...
work page internal anchor Pith review arXiv 2022
-
[11]
S.Cho, H.Shin, S.Hong, A.Arnab, P.H.Seo, and S. Kim. Cat-seg: Cost aggregation for open-vocabulary semantic segmentation. In CVPR, pages 4113–4123, 2024
work page 2024
-
[12]
M. Dehghani, B. Mustafa, J. Djolonga, J. Heek, M. Minderer, M. Caron, A. Steiner, J. Puigcerver, R. Geirhos, I. M. Alabdul- mohsin, et al. Patch n’pack: NaViT, a vi- sion transformer for any aspect ratio and resolution. NeurIPS, 2024
work page 2024
-
[13]
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hi- erarchical image database. InCVPR, pages 248–255, 2009
work page 2009
-
[14]
J. Ding, N. Xue, G.-S. Xia, and D. Dai. De- coupling zero-shot semantic segmentation. In CVPR, pages 11583–11592, 2022
work page 2022
-
[15]
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transform- ers for image recognition at scale. InICLR, 2021
work page 2021
- [16]
-
[17]
M. Everingham, L. Van Gool, C. K. Williams, J.Winn,andA.Zisserman. Thepascalvisual object classes (voc) challenge.IJCV, 2010
work page 2010
-
[18]
L. Fan, D. Krishnan, P. Isola, D. Katabi, and Y. Tian. Improving clip training with lan- guage rewrites. NeurIPS, pages 35544– 35575, 2023
work page 2023
-
[19]
A. Fang, A. M. Jose, A. Jain, L. Schmidt, A. T. Toshev, and V. Shankar. Data filtering networks. InICLR, 2024
work page 2024
-
[20]
E. Fini, M. Shukor, X. Li, P. Dufter, M. Klein, D. Haldimann, S. Aitharaju, V. G. T. da Costa, L. Béthune, Z. Gan, A. T. Toshev, M. Eichner, M. Nabi, Y. Yang, J. M. Susskind, and A. El-Nouby. Multimodal autoregres- sive pre-training of large vision encoders. arXiv:2411.14402, 2024
-
[21]
S. Y. Gadre, G. Ilharco, A. Fang, J. Hayase, G.Smyrnis, T.Nguyen, R.Marten, M.Worts- man, D. Ghosh, J. Zhang, et al. Datacomp: In search of the next generation of multi- modal datasets.NeurIPS, 36, 2024
work page 2024
-
[22]
Gemma: Open Models Based on Gemini Research and Technology
Gemma Team. Gemma: Open models based on gemini research and technology. arXiv:2403.08295, 2024
work page internal anchor Pith review arXiv 2024
-
[23]
Gemma 2: Improving Open Language Models at a Practical Size
Gemma Team. Gemma 2: Improving open language models at a practical size. arXiv:2408.00118, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[24]
Google Cloud. Introduction to Cloud TPU. https://cloud.google.com/ tpu/docs/intro-to-tpu, 20xx. Ac- cessed: 2024-07-04. 13 SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
work page 2024
- [25]
- [26]
-
[27]
G. Ilharco, M. Wortsman, R. Wightman, C. Gordon, N. Carlini, R. Taori, A. Dave, V. Shankar, H. Namkoong, J. Miller, H. Ha- jishirzi, A. Farhadi, and L. Schmidt. Open- CLIP, 2021
work page 2021
-
[28]
C. Jia, Y. Yang, Y. Xia, Y. Chen, Z. Parekh, H. Pham, Q. V. Le, Y. Sung, Z. Li, and T. Duerig. Scaling up visual and vision- language representation learning with noisy text supervision. InICML, 2021
work page 2021
-
[29]
S.Kazemzadeh,V.Ordonez,M.Matten,and T. Berg. ReferItGame: Referring to objects inphotographsofnaturalscenes. In EMNLP, Oct. 2014
work page 2014
-
[30]
W. Kuo, Y. Cui, X. Gu, A. Piergiovanni, and A. Angelova. Open-vocabulary object de- tection upon frozen vision and language models. InICLR, 2023
work page 2023
- [31]
-
[32]
J. Li, D. Li, S. Savarese, and S. C. H. Hoi. BLIP-2: bootstrapping language-image pre- training with frozen image encoders and large language models. InICML, 2023
work page 2023
- [33]
-
[34]
T. Lin, M. Maire, S. J. Belongie, L. D. Bour- dev, R. B. Girshick, J. Hays, P. Perona, D. Ra- manan, P. Doll’a r, and C. L. Zitnick. Mi- crosoft COCO: common objects in context. arXiv:1405.0312, 2014
work page internal anchor Pith review arXiv 2014
-
[35]
H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. InNeurIPS, 2023
work page 2023
-
[36]
S. Long, S. Qin, D. Panteleev, A. Bissacco, Y. Fujii, and M. Raptis. ICDAR 2023 com- petition on hierarchical text detection and recognition. InICDAR, 2023
work page 2023
-
[37]
Decoupled Weight Decay Regularization
I. Loshchilov, F. Hutter, et al. Fixing weight decayregularizationinadam. arXivpreprint arXiv:1711.05101, 5, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[38]
K.-K. Maninis, K. Chen, S. Ghosh, A. Karpur, K. Chen, Y. Xia, B. Cao, D. Salz, G. Han, J.Dlabal,etal. TIPS:Text-imagepretraining with spatial awareness. InICLR, 2025
work page 2025
-
[39]
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
B.McKinzie, Z.Gan, J.Fauconnier, S.Dodge, B. Zhang, P. Dufter, D. Shah, X. Du, F. Peng, F. Weers, A. Belyi, H. Zhang, K. Singh, D. Kang, A. Jain, H. Hè, M. Schwarzer, T. Gunter, X. Kong, A. Zhang, J. Wang, C. Wang, N. Du, T. Lei, S. Wiseman, G. Yin, M. Lee, Z. Wang, R. Pang, P. Grasch, A. To- shev, and Y. Yang. MM1: methods, anal- ysis & insights from mul...
-
[40]
M. Minderer, A. Gritsenko, A. Stone, M. Neumann, D. Weissenborn, A. Dosovitskiy, A. Mahendran, A. Arnab, M. Dehghani, Z. Shen, et al. Simple open-vocabulary object detection. In ECCV, pages 728–755, 2022
-
[41]
M. Minderer, A. A. Gritsenko, and N. Houlsby. Scaling open-vocabulary object detection. In NeurIPS, 2023
- [42]
-
[43]
R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014.
-
[44]
N. Mu, A. Kirillov, D. Wagner, and S. Xie. SLIP: Self-supervision meets language-image pre-training. In ECCV, pages 529–544, 2022
-
[45]
M. F. Naeem, Y. Xian, X. Zhai, L. Hoyer, L. Van Gool, and F. Tombari. SILC: Improving vision language pretraining with self-distillation. In ECCV, pages 38–55, 2024
- [46]
- [47]
-
[48]
Z. Peng, W. Wang, L. Dong, Y. Hao, S. Huang, S. Ma, and F. Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv:2306.14824, 2023
- [49]
-
[50]
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021
-
[51]
V. V. Ramaswamy, S. Y. Lin, D. Zhao, A. Adcock, L. van der Maaten, D. Ghadiyaram, and O. Russakovsky. Geode: a geographically diverse evaluation dataset for object recognition. NeurIPS, 36, 2024
- [52]
- [53]
-
[54]
W. A. G. Rojas, S. Diamos, K. R. Kini, D. Kanter, V. J. Reddi, and C. Coleman. The dollar street dataset: Images representing the geographic and socioeconomic diversity of the world. In NeurIPS Datasets and Benchmarks Track, 2022
-
[55]
O. Sidorov, R. Hu, M. Rohrbach, and A. Singh. TextCaps: A dataset for image captioning with reading comprehension. In ECCV, 2020
-
[56]
A. Steiner, A. S. Pinto, M. Tschannen, D. Keysers, X. Wang, Y. Bitton, A. Gritsenko, M. Minderer, A. Sherbondy, S. Long, et al. Paligemma 2: A family of versatile vlms for transfer. arXiv:2412.03555, 2024
-
[57]
Q. Sun, Y. Fang, L. Wu, X. Wang, and Y. Cao. EVA-CLIP: Improved training techniques for clip at scale. arXiv:2303.15389, 2023
-
[58]
A. V. Thapliyal, J. Pont Tuset, X. Chen, and R. Soricut. Crossmodal-3600: A massively multilingual multimodal evaluation dataset. In EMNLP, 2022
-
[59]
S. Tong, E. Brown, P. Wu, S. Woo, M. Middepogu, S. C. Akula, J. Yang, S. Yang, A. Iyer, X. Pan, A. Wang, R. Fergus, Y. LeCun, and S. Xie. Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. arXiv:2406.16860, 2024
-
[60]
M. Tschannen, M. Kumar, A. Steiner, X. Zhai, N. Houlsby, and L. Beyer. Image captioners are scalable vision learners too. In NeurIPS, 2023
-
[61]
V. Udandarao, N. Parthasarathy, M. F. Naeem, T. Evans, S. Albanie, F. Tombari, Y. Xian, A. Tonioni, and O. J. Hénaff. Active data curation effectively distills large-scale multimodal models. arXiv:2411.18674, 2024
-
[62]
B. Wan, M. Tschannen, Y. Xian, F. Pavetic, I. Alabdulmohsin, X. Wang, A. S. Pinto, A. Steiner, L. Beyer, and X. Zhai. LocCa: Visual pretraining with location-aware captioners. In NeurIPS, 2024.
-
[63]
B. Wang, G. Li, X. Zhou, Z. Chen, T. Grossman, and Y. Li. Screen2words: Automatic mobile ui summarization with multimodal learning. In Symposium on User Interface Software and Technology, 2021
-
[64]
Z. Wang, J. Yu, A. W. Yu, Z. Dai, Y. Tsvetkov, and Y. Cao. SimVLM: Simple visual language model pretraining with weak supervision. In ICLR, 2022
- [65]
-
[66]
H. Xu, S. Xie, X. Tan, P.-Y. Huang, R. Howes, V. Sharma, S.-W. Li, G. Ghosh, L. Zettlemoyer, and C. Feichtenhofer. Demystifying clip data. In ICLR, 2024
-
[67]
J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu. CoCa: Contrastive captioners are image-text foundation models. TMLR, 2022
-
[68]
L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In ECCV, pages 69–85, 2016
-
[69]
X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer. Scaling vision transformers. CVPR, 2022
-
[70]
X. Zhai, X. Wang, B. Mustafa, A. Steiner, D. Keysers, A. Kolesnikov, and L. Beyer. Lit: Zero-shot transfer with locked-image text tuning. In CVPR, 2022
-
[71]
X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023
-
[72]
Y. Zhao, A. Gu, R. Varma, L. Luo, C. Huang, M. Xu, L. Wright, H. Shojanazeri, M. Ott, S. Shleifer, A. Desmaison, C. Balioglu, P. Damania, B. Nguyen, G. Chauhan, Y. Hao, A. Mathews, and S. Li. Pytorch FSDP: experiences on scaling fully sharded data parallel. VLDB, 2023
-
[73]
B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ade20k dataset. In CVPR, 2017
-
[74]
B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba. Semantic understanding of scenes through the ade20k dataset. IJCV, 2019
-
[75]
J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong. Image BERT pre-training with online tokenizer. In ICLR, 2022.
Appendix A. Full PaliGemma results
Large 224/256px So400m/14 224px So400m 384px SigLIP AIMv2 SigLIP2 SigL...