pith. machine review for the scientific record.

arxiv: 2412.03555 · v1 · submitted 2024-12-04 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

PaliGemma 2: A Family of Versatile VLMs for Transfer

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 09:10 UTC · model grok-4.3

classification 💻 cs.CV
keywords PaliGemma 2 · Vision-Language Models · Transfer Learning · OCR Tasks · Multi-resolution Training · Gemma 2 · SigLIP · Fine-tuning

The pith

PaliGemma 2 pairs Gemma 2 language models with SigLIP encoders and trains them at multiple resolutions to achieve strong transfer on OCR and captioning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PaliGemma 2 upgrades the original PaliGemma by combining the SigLIP-So400m vision encoder with the full Gemma 2 language model family, from 2B to 27B parameters. The models receive multi-stage training at 224, 448, and 896 pixel resolutions to acquire broad knowledge that supports fine-tuning on new tasks. Evaluation expands to table structure recognition, molecular structure recognition, music score recognition, long fine-grained captioning, and radiography report generation, where the models reach state-of-the-art results. The family also supports analysis of how model size, resolution, and task type influence transfer performance. The results matter because the open models offer practical starting points for adapting vision-language systems to specialized document and imaging applications.

Core claim

PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning. The resulting family of base models covering different model sizes and resolutions allows us to investigate factors impacting transfer performance (such as learning rate) and to analyze the interplay between the type of task, model size, and resolution.

What carries the argument

Gemma 2 language models paired with the SigLIP-So400m vision encoder and trained in multiple stages at three resolutions; this combination builds broad, transferable knowledge for fine-tuning.
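As a point of reference, here is a minimal inference sketch against the released checkpoints. It assumes the Hugging Face transformers PaliGemma integration and the public naming scheme for the nine size-by-resolution variants (google/paligemma2-{3b,10b,28b}-pt-{224,448,896}); the image path and prompt are placeholders, and exact checkpoint ids should be verified on the model hub.

```python
# Minimal inference sketch, assuming the Hugging Face `transformers`
# PaliGemma integration and the public PaliGemma 2 checkpoint naming.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# One of nine (size, resolution) variants: 3b/10b/28b at 224/448/896.
checkpoint = "google/paligemma2-3b-pt-224"
processor = AutoProcessor.from_pretrained(checkpoint)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("example.png")  # placeholder local image
# PaliGemma models are steered by short task prefixes such as "caption en"
# or "ocr"; depending on the transformers version, an explicit "<image>"
# token may need to be prepended to the prompt.
inputs = processor(text="caption en", images=image, return_tensors="pt").to(model.device)

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = generated[0][inputs["input_ids"].shape[1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))
```

The call pattern is identical across the family; only the checkpoint id changes, which is what makes the paper's size-versus-resolution comparisons straightforward to rerun at inference time.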

If this is right

  • Larger model sizes paired with higher input resolutions improve accuracy on fine-grained OCR tasks such as table and molecular structure recognition.
  • The same base models support effective fine-tuning for long-form captioning and medical report generation without task-specific architectural changes.
  • Varying model size and resolution reveals clear trade-offs that guide selection of the right model for a given task type and compute budget (a toy version of this selection logic is sketched just after this list).
  • Expansion of benchmarks to music scores and molecular diagrams shows the family transfers to domains outside conventional natural-image captioning.
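The trade-off bullet can be made concrete with a toy selection helper. Everything below is illustrative: the task categories and heuristics are assumptions layered on the paper's qualitative claims, not thresholds the authors publish.

```python
# Illustrative only: map a task type and compute budget to one of the nine
# PaliGemma 2 variants. The heuristics are invented for this sketch.
def pick_variant(task: str, budget: str) -> str:
    """Fine-grained OCR-style tasks tend to reward input resolution,
    while knowledge-heavy tasks tend to reward language-model scale."""
    fine_grained = task in {"table_structure", "molecule", "music_score", "dense_ocr"}
    resolution = 896 if fine_grained else 448 if task == "long_caption" else 224
    size = {"low": "3b", "medium": "10b", "high": "28b"}[budget]
    return f"google/paligemma2-{size}-pt-{resolution}"

print(pick_variant("table_structure", "low"))  # google/paligemma2-3b-pt-896
print(pick_variant("vqa", "high"))             # google/paligemma2-28b-pt-224
```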

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The multi-resolution training may confer robustness to inputs whose native detail level varies widely, an advantage not yet quantified against fixed-resolution baselines.
  • These models could serve as efficient starting points for domain adaptation in fields such as historical document analysis or scientific imaging pipelines.
  • The music-score results hint at latent capabilities for structured visual sequences that might extend to other ordered domains like chemical diagrams or circuit schematics.
  • Future work could test whether the same training recipe transfers to video or multi-frame inputs by leveraging the resolution flexibility already present.

Load-bearing premise

Multi-stage training at multiple resolutions equips the models with broad transferable knowledge that simpler single-stage or single-resolution training cannot match.

What would settle it

A controlled comparison in which a single-resolution single-stage model of matching size achieves equal or better accuracy on table structure recognition, molecular recognition, music score recognition, and radiography report generation would falsify the necessity of the multi-stage multi-resolution protocol.
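As a protocol, that comparison could look like the harness below. The function finetune_and_eval is a hypothetical stand-in that returns placeholder scores so the skeleton runs; a real study would fine-tune each checkpoint on the task's training split and report held-out metrics.

```python
# Skeleton of the controlled comparison described above; the scores are
# placeholders, not results from the paper.
import random

TASKS = ["table_structure", "molecule_recognition", "music_score", "radiography_report"]

def finetune_and_eval(checkpoint: str, task: str) -> float:
    """Hypothetical stand-in: a real implementation would fine-tune
    `checkpoint` on the task and return test accuracy."""
    return random.Random(f"{checkpoint}/{task}").uniform(0.5, 0.9)

multi_stage = "paligemma2-3b-pt-896"  # multi-stage, multi-resolution recipe
single_stage = "control-3b-896"       # hypothetical matched-size, single-stage control

for task in TASKS:
    a = finetune_and_eval(multi_stage, task)
    b = finetune_and_eval(single_stage, task)
    verdict = "premise holds" if a > b else "premise falsified here"
    print(f"{task:22s} multi-stage {a:.3f} vs single-stage {b:.3f} -> {verdict}")
```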

read the original abstract

PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning. The resulting family of base models covering different model sizes and resolutions allows us to investigate factors impacting transfer performance (such as learning rate) and to analyze the interplay between the type of task, model size, and resolution. We further increase the number and breadth of transfer tasks beyond the scope of PaliGemma including different OCR-related tasks such as table structure recognition, molecular structure recognition, music score recognition, as well as long fine-grained captioning and radiography report generation, on which PaliGemma 2 obtains state-of-the-art results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PaliGemma 2, an upgraded family of open vision-language models that pairs the SigLIP-So400m vision encoder with the full range of Gemma 2 language models (2B to 27B parameters). The models are trained in multiple stages at three resolutions (224 px, 448 px, 896 px) to equip them with broad transferable knowledge; the resulting family is then evaluated on an expanded set of transfer tasks that includes OCR-related problems (table structure recognition, molecular structure recognition, music score recognition) as well as long fine-grained captioning and radiography report generation, where state-of-the-art results are reported.

Significance. If the empirical claims hold after proper controls, the work supplies a publicly available, multi-scale VLM family that enables systematic study of how model size, input resolution, and staged training affect downstream transfer. The extension of the task suite to specialized OCR and medical domains is a concrete contribution that can serve as a benchmark for future transfer research.

major comments (2)
  1. [Training and Results sections] The central claim that the multi-stage, multi-resolution training schedule equips the models with broad transferable knowledge is not supported by controlled ablations. No quantitative comparison is presented between the full regimen and simpler single-stage or single-resolution baselines, leaving open the possibility that reported gains are driven primarily by scale, the SigLIP encoder, or task-specific fine-tuning rather than the staged schedule itself.
  2. [Experimental Results] SOTA claims on table structure recognition, molecular structure recognition, music score recognition, and radiography report generation are stated without accompanying error bars, statistical significance tests, or exhaustive baseline tables that would allow readers to verify the magnitude and reliability of the improvements.
minor comments (2)
  1. [Abstract] The abstract states that the family 'allows us to investigate factors impacting transfer performance (such as learning rate)' yet does not summarize the concrete findings of that investigation; a short paragraph or table in the main text would clarify which factors were actually quantified.
  2. [Model Architecture] Notation for the three training resolutions and the exact parameter counts of the 2B–27B variants should be introduced consistently in the model description section to avoid ambiguity when results are later broken down by size and resolution.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address the major comments point by point below and describe the revisions we will incorporate.

read point-by-point responses
  1. Referee: [Training and Results sections] The central claim that the multi-stage, multi-resolution training schedule equips the models with broad transferable knowledge is not supported by controlled ablations. No quantitative comparison is presented between the full regimen and simpler single-stage or single-resolution baselines, leaving open the possibility that reported gains are driven primarily by scale, the SigLIP encoder, or task-specific fine-tuning rather than the staged schedule itself.

    Authors: We agree that the manuscript would be strengthened by explicit ablations isolating the contribution of the multi-stage, multi-resolution schedule. The current work builds directly on the PaliGemma training recipe and emphasizes the resulting model family’s transfer performance across scales and resolutions, but does not include head-to-head comparisons against single-stage or single-resolution variants. We will add controlled ablation experiments on a representative subset of transfer tasks (including at least one OCR-related task and one captioning task) to quantify the incremental benefit of the staged schedule versus simpler baselines. These results will be reported in a new subsection of the Training section. revision: yes

  2. Referee: [Experimental Results] SOTA claims on table structure recognition, molecular structure recognition, music score recognition, and radiography report generation are stated without accompanying error bars, statistical significance tests, or exhaustive baseline tables that would allow readers to verify the magnitude and reliability of the improvements.

    Authors: We acknowledge that the current presentation of SOTA results lacks sufficient statistical detail. In the revised manuscript we will (i) report standard deviations or error bars for all tasks where multiple independent runs were performed, (ii) add pairwise statistical significance tests (e.g., paired t-tests or Wilcoxon tests) against the strongest baselines where feasible, and (iii) expand the baseline tables to include additional published methods and more granular metrics. For the most compute-intensive tasks where only single runs are available, we will explicitly note this limitation and provide any available confidence-interval estimates derived from internal validation splits. revision: yes
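For point (ii), a minimal example of the paired tests the rebuttal proposes, using scipy; the per-example scores below are fabricated for illustration.

```python
# Paired significance tests on per-example scores of a model vs. the
# strongest baseline for one task. The score arrays are made-up data.
from scipy import stats

model_scores    = [0.91, 0.84, 0.88, 0.79, 0.93, 0.87, 0.90, 0.82]
baseline_scores = [0.88, 0.85, 0.83, 0.78, 0.90, 0.84, 0.86, 0.83]

t_stat, t_p = stats.ttest_rel(model_scores, baseline_scores)
w_stat, w_p = stats.wilcoxon(model_scores, baseline_scores)
print(f"paired t-test: t={t_stat:.3f}, p={t_p:.4f}")
print(f"Wilcoxon:      W={w_stat:.1f}, p={w_p:.4f}")
```

With only single runs on the most compute-intensive tasks, such tests are unavailable, which is exactly why the rebuttal commits to flagging those cells explicitly.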

Circularity Check

0 steps flagged

Empirical model release with no derivation chain or fitted predictions

full rationale

The paper describes training PaliGemma 2 by combining a SigLIP vision encoder with Gemma 2 language models, performing multi-stage multi-resolution training, and then measuring transfer performance on held-out tasks including OCR variants. No equations, uniqueness theorems, ansatzes, or predictions are claimed; results are reported as direct empirical measurements against external benchmarks, and no step reduces outputs back to their own inputs, so there is no circularity by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that combining the prior SigLIP encoder with Gemma 2 and multi-resolution training produces broad transfer capability; no new entities or free parameters are explicitly introduced beyond standard model sizes and resolutions.

free parameters (2)
  • model sizes (2B-27B)
    Selected from the Gemma 2 family; values are inherited rather than fitted in this work.
  • training resolutions (224/448/896 px)
    Chosen for the multi-stage training schedule.
axioms (1)
  • domain assumption: SigLIP-So400m vision encoder remains effective when paired with larger Gemma 2 language models
    Inherited from the original PaliGemma without new justification in the abstract.

pith-pipeline@v0.9.0 · 5572 in / 1263 out tokens · 45113 ms · 2026-05-15T09:10:03.840284+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

    cs.CV 2026-05 unverdicted novelty 8.0

    TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

  2. LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 7.0

    LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.

  3. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  4. RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

    cs.CV 2026-05 unverdicted novelty 6.0

    RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...

  5. SurgCheck: Do Vision-Language Models Really Look at Images in Surgical VQA?

    cs.CV 2026-05 unverdicted novelty 6.0

    SurgCheck benchmark reveals that vision-language models for surgical VQA often depend on linguistic shortcuts rather than visual reasoning, shown by consistent performance drops on less-biased questions.

  6. Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery

    cs.RO 2026-05 unverdicted novelty 6.0

    Sentinel-VLA introduces a metacognitive VLA model with a sentinel module for real-time status monitoring, dynamic reasoning, and error recovery, plus a self-evolving continual learning method, raising real-world task ...

  7. Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.

  8. Boosting Visual Instruction Tuning with Self-Supervised Guidance

    cs.CV 2026-04 unverdicted novelty 6.0

    Mixing 3-10% of visually grounded self-supervised instructions into visual instruction tuning consistently boosts MLLM performance on vision-centric benchmarks.

  9. When Meaning Isn't Literal: Exploring Idiomatic Meaning Across Languages and Modalities

    cs.CL 2026-04 unverdicted novelty 6.0

    Presents the Mediom multilingual multimodal idiom corpus and the HIDE hinting-based framework to benchmark and improve AI comprehension of figurative meanings across languages.

  10. AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement

    cs.RO 2026-04 unverdicted novelty 6.0

    AnySlot decouples language grounding from low-level control by inserting an explicit visual goal image, yielding better zero-shot performance on precise slot placement tasks than flat VLA policies.

  11. VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

    cs.RO 2026-04 unverdicted novelty 6.0

    VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.

  12. InVitroVision: a Multi-Modal AI Model for Automated Description of Embryo Development using Natural Language

    cs.AI 2026-04 unverdicted novelty 5.0

    InVitroVision, a fine-tuned PaliGemma-2 model, generates natural language descriptions of embryo development and outperforms ChatGPT 5.2 and base models on a public time-lapse dataset.

  13. EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    EgoMotion decouples reasoning from motion synthesis in egocentric vision-language tasks by mapping inputs to motion primitives via VLM then using diffusion to produce grounded and coherent 3D trajectories.

  14. ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning

    cs.RO 2026-04 unverdicted novelty 5.0

    ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.

  15. HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

    cs.CV 2026-04 unverdicted novelty 5.0

    HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.

  16. HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

    cs.CV 2026-04 unverdicted novelty 5.0

    HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.

  17. Anthropogenic Regional Adaptation in Multimodal Vision-Language Model

    cs.AI 2026-04 unverdicted novelty 5.0

    Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.

  18. SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    cs.RO 2025-01 unverdicted novelty 5.0

    SpatialVLA adds 3D-aware position encoding and adaptive discretized action grids to visual-language-action models, enabling strong zero-shot performance and fine-tuning on new robot setups after pre-training on 1.1 mi...

  19. OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL

    cs.RO 2026-04 unverdicted novelty 4.0

    OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.

  20. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    cs.CV 2025-02 unverdicted novelty 4.0

    SigLIP 2 models trained with a unified recipe of captioning, self-supervised losses, and curated diverse data outperform prior SigLIP versions on classification, retrieval, localization, dense prediction, and multilin...

Reference graph

Works this paper leans on

113 extracted references · 113 canonical work pages · cited by 19 Pith papers · 10 internal anchors

  1. [1]

    TallyQA: Answering complex counting questions

    M. Acharya, K. Kafle, and C. Kanan. TallyQA: Answering complex counting questions. In AAAI, 2019

  2. [2]

    NoCaps: Novel object captioning at scale

    H. Agrawal, K. Desai, Y. Wang, X. Chen, R. Jain, M. Johnson, D. Batra, D. Parikh, S. Lee, and P. Anderson. NoCaps: Novel object captioning at scale. In ICCV, 2019

  3. [3]

    Getting ViT in shape: Scaling laws for compute-optimal model design

    I. Alabdulmohsin, X. Zhai, A. Kolesnikov, and L. Beyer. Getting ViT in shape: Scaling laws for compute-optimal model design. In NeurIPS, 2023

  4. [4]

    Flamingo: a visual language model for few-shot learning

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan. Flamingo: a visual language model f...

  5. [5]

    J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv:2308.12966, 2023

  6. [6]

    Neural Combinatorial Optimization with Reinforcement Learning

    I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio. Neural combinatorial optimization with reinforcement learning. arXiv:1611.09940, 2016

  7. [7]

    Improving image generation with better captions

    J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, et al. Improving image generation with better captions. Technical Report, 2023

  8. [8]

    Big vision

    L. Beyer, X. Zhai, and A. Kolesnikov. Big vision. https://github.com/google-research/big_vision, 2022

  9. [9]

    PaliGemma: A versatile 3B VLM for transfer

    L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, T. Unterthiner, D. Keysers, S. Koppula, F. Liu, A. Grycner, A. Gritsenko, N. Houlsby, M. Kumar, K. Rong, J. Eisenschlos, R. Kabra, M. Bauer, M. Bošnjak, X. Chen, M. Minderer, P. Voigtlaender, I. Bica, I. Balazevic, J. Puigcer...

  10. [10]

    A. F. Biten, R. Tito, A. Mafla, L. Gomez, M. Rusinol, C. Jawahar, E. Valveny, and D. Karatzas. Scene text visual question answering. In ICCV, Oct. 2019

  11. [11]

    All you may need for VQA are image captions

    S. Changpinyo, D. Kukliansy, I. Szpektor, X. Chen, N. Ding, and R. Soricut. All you may need for VQA are image captions. In NAACL, 2022

  12. [12]

    D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL, 2011

  13. [13]

    T. Chen, S. Saxena, L. Li, D. J. Fleet, and G. E. Hinton. Pix2seq: A language modeling framework for object detection. In ICLR, 2022

  14. [14]

    X. Chen, X. Wang, S. Changpinyo, A. J. Piergiovanni, P. Padlewski, D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer, A. Kolesnikov, J. Puigcerver, N. Ding, K. Rong, H. Akbari, G. Mishra, L. Xue, A. Thapliyal, J. Bradbury, W. Kuo, M. Seyedhosseini, C. Jia, B. K. Ayan, C. Riquelme, A. Steiner, A. Angelova, X. Zhai, N. Houlsby, and R. Soricut. PaLI: A j...

  15. [15]

    X. Chen, X. Wang, L. Beyer, A. Kolesnikov, J. Wu, P. Voigtlaender, B. Mustafa, S. Goodman, I. Alabdulmohsin, P. Padlewski, D. Salz, X. Xiong, D. Vlasic, F. Pavetic, K. Rong, T. Yu, D. Keysers, X. Zhai, and R. Soricut. PaLI-3 vision language models: Smaller, faster, stronger. arXiv:2310.09199, 2023

  16. [16]

    X. Chen, J. Djolonga, P. Padlewski, B. Mustafa, S. Changpinyo, J. Wu, C. R. Ruiz, S. Goodman, X. Wang, Y. Tay, S. Shakeri, M. Dehghani, D. Salz, M. Lucic, M. Tschannen, A. Nagrani, H. Hu, M. Joshi, B. Pang, C. Montgomery, P. Pietrzyk, M. Ritter, A. J. Piergiovanni, M. Minderer, F. Pavetic, A. Waters, G. Li, I. Alabdulmohsin, L. Beyer, J. Amelot, K. Lee, A. P. Ste...

  17. [17]

    C. K. Ch’ng and C. S. Chan. Total-Text: A comprehensive dataset for scene text detection and recognition. In ICDAR, 2017

  18. [18]

    W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv:2305.06500, 2023

  19. [19]

    Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    M. Deitke, C. Clark, S. Lee, R. Tripathi, Y. Yang, J. S. Park, M. Salehi, N. Muennighoff, K. Lo, L. Soldaini, et al. Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models. arXiv:2409.17146, 2024

  20. [20]

    VirTex: Learning visual representations from textual annotations

    K. Desai and J. Johnson. VirTex: Learning visual representations from textual annotations. In CVPR, 2021

  21. [21]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team. Gemma: Open models based on Gemini research and technology. arXiv:2403.08295, 2024

  22. [22]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team. Gemma 2: Improving open language models at a practical size. arXiv:2408.00118, 2024

  23. [23]

    A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation, 101(23), 2000

  24. [24]

    Introduction to Cloud TPU

    Google Cloud. Introduction to Cloud TPU. https://cloud.google.com/tpu/docs/intro-to-tpu, 20xx. Accessed: 2024-07-04

  25. [25]

    Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering

    Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In CVPR, 2017

  26. [26]

    VizWiz Grand Challenge: Answering visual questions from blind people

    D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham. VizWiz Grand Challenge: Answering visual questions from blind people. In CVPR, 2018

  27. [27]

    T.-Y. Hsu, C. L. Giles, and T.-H. Huang. SciCap: Generating captions for scientific figures. arXiv:2110.11624, 2021

  28. [28]

    Improving table structure recognition with visual-alignment sequential coordinate modeling

    Y. Huang, N. Lu, D. Chen, Y. Li, Z. Xie, S. Zhu, L. Gao, and W. Peng. Improving table structure recognition with visual-alignment sequential coordinate modeling. In CVPR, 2023

  29. [29]

    GQA: A new dataset for real-world visual reasoning and compositional question answering

    D. Hudson and C. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. CVPR, 2019

  30. [30]

    S. Jain, A. Agrawal, A. Saporta, S. Truong, T. Bui, P. Chambon, Y. Zhang, M. P. Lungren, A. Y. Ng, C. Langlotz, et al. RadGraph: Extracting clinical entities and relations from radiology reports. In NeurIPS Datasets and Benchmarks Track, 2022

  31. [31]

    C. Jia, Y. Yang, Y. Xia, Y. Chen, Z. Parekh, H. Pham, Q. V. Le, Y. Sung, Z. Li, and T. Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021

  32. [32]

    Ultralytics YOLO

    G. Jocher, J. Qiu, and A. Chaurasia. Ultralytics YOLO, 2023. URL https://github.com/ultralytics/ultralytics

  33. [33]

    A. E. Johnson, T. J. Pollard, S. J. Berkowitz, N. R. Greenbaum, M. P. Lungren, C.-Y. Deng, R. G. Mark, and S. Horng. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(1):317, 2019

  34. [34]

    O. F. Kar, A. Tonioni, P. Poklukar, A. Kulshrestha, A. Zamir, and F. Tombari. BRAVE: Broadening the visual encoding of vision-language models. arXiv:2404.07204, 2024

  35. [35]

    Prismatic VLMs: Investigating the design space of visually-conditioned language models

    S. Karamcheti, S. Nair, A. Balakrishna, P. Liang, T. Kollar, and D. Sadigh. Prismatic VLMs: Investigating the design space of visually-conditioned language models. arXiv:2402.07865, 2024

  36. [36]

    ICDAR 2015 competition on robust reading

    D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. K. Ghosh, A. D. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny. ICDAR 2015 competition on robust reading. In ICDAR, 2015

  37. [37]

    FairFace: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation

    K. Karkkainen and J. Joo. FairFace: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation. In WACV, 2021

  38. [38]

    Multi-cell decoder and mutual learning for table structure and character recognition

    T. Kawakatsu. Multi-cell decoder and mutual learning for table structure and character recognition. In ICDAR, 2024

  39. [39]

    ReferItGame: Referring to objects in photographs of natural scenes

    S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, Oct. 2014

  40. [40]

    A diagram is worth a dozen images

    A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, and A. Farhadi. A diagram is worth a dozen images. In ECCV, 2016

  41. [41]

    S. Kim, P. A. Thiessen, E. E. Bolton, J. Chen, G. Fu, A. Gindulyte, L. Han, J. He, S. He, B. A. Shoemaker, et al. PubChem substance and compound databases. Nucleic Acids Research, 44(D1):D1202–D1213, 2016

  42. [42]

    D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2017

  43. [43]

    Dense-captioning events in videos

    R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles. Dense-captioning events in videos. In ICCV, 2017

  44. [44]

    Open Images V5 text annotation and yet another mask text spotter

    I. Krylov, S. Nosov, and V. Sovrasov. Open Images V5 text annotation and yet another mask text spotter. In ACCV, 2021

  45. [45]

    What matters when building vision-language models?

    H. Laurençon, L. Tronchon, M. Cord, and V. Sanh. What matters when building vision-language models? arXiv:2405.02246, 2024

  46. [46]

    A. Lees, V. Q. Tran, Y. Tay, J. Sorensen, J. Gupta, D. Metzler, and L. Vasserman. A new generation of Perspective API: Efficient multilingual character-level transformers. arXiv:2202.11176, 2022

  47. [47]

    B. Li, H. Zhang, K. Zhang, D. Guo, Y. Zhang, R. Zhang, F. Li, Z. Liu, and C. Li. LLaVA-NeXT: What else influences visual instruction tuning beyond data?, May 2024. URL https://llava-vl.github.io/blog/2024-05-25-llava-next-ablations/

  48. [48]

    J. Li, D. Li, S. Savarese, and S. C. H. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023

  49. [49]

    Y. Li, G. Li, L. He, J. Zheng, H. Li, and Z. Guan. Widget Captioning: Generating natural language description for mobile user interface elements. In EMNLP, 2020

  50. [50]

    Y. Li, H. Mao, R. Girshick, and K. He. Exploring plain vision transformer backbones for object detection. In ECCV, 2022

  51. [51]

    T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. arXiv:1405.0312, 2014

  52. [52]

    F. Liu, E. Bugliarello, E. M. Ponti, S. Reddy, N. Collier, and D. Elliott. Visually grounded reasoning across languages and cultures. In EMNLP, Nov. 2021

  53. [53]

    F. Liu, G. E. T. Emerson, and N. Collier. Visual spatial reasoning. TACL, 11:635–651, 2023

  54. [54]

    H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In NeurIPS, 2023

  55. [55]

    RSVQA: Visual question answering for remote sensing data

    S. Lobry, D. Marcos, J. Murray, and D. Tuia. RSVQA: Visual question answering for remote sensing data. IEEE Trans. on Geoscience and Remote Sensing, 58(12), Dec. 2020

  56. [56]

    S. Long, S. Qin, D. Panteleev, A. Bissacco, Y. Fujii, and M. Raptis. Towards end-to-end unified scene text detection and layout analysis. In CVPR, 2022

  57. [57]

    S. Long, S. Qin, D. Panteleev, A. Bissacco, Y. Fujii, and M. Raptis. ICDAR 2023 competition on hierarchical text detection and recognition. In ICDAR, 2023

  58. [58]

    S. Long, S. Qin, Y. Fujii, A. Bissacco, and M. Raptis. Hierarchical text spotter for joint text spotting and layout analysis. In WACV, 2024

  59. [59]

    P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022

  60. [60]

    N. T. Ly and A. Takasu. An end-to-end multi-task learning model for image-based table recognition. arXiv:2303.08648, 2023

  61. [61]

    J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016

  62. [62]

    OK-VQA: A visual question answering benchmark requiring external knowledge

    K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In CVPR, 2019

  63. [63]

    ChartQA: A benchmark for question answering about charts with visual and logical reasoning

    A. Masry, X. L. Do, J. Q. Tan, S. Joty, and E. Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In ACL, May 2022

  64. [64]

    DocVQA: A dataset for VQA on document images

    M. Mathew, D. Karatzas, R. Manmatha, and C. V. Jawahar. DocVQA: A dataset for VQA on document images. arXiv:2007.00398, 2020

  65. [65]

    InfographicVQA

    M. Mathew, V. Bagal, R. Tito, D. Karatzas, E. Valveny, and C. V. Jawahar. InfographicVQA. In WACV, 2022

  66. [66]

    MM1: Methods, analysis & insights from multimodal LLM pre-training

    B. McKinzie, Z. Gan, J. Fauconnier, S. Dodge, B. Zhang, P. Dufter, D. Shah, X. Du, F. Peng, F. Weers, A. Belyi, H. Zhang, K. Singh, D. Kang, A. Jain, H. Hè, M. Schwarzer, T. Gunter, X. Kong, A. Zhang, J. Wang, C. Wang, N. Du, T. Lei, S. Wiseman, G. Yin, M. Lee, Z. Wang, R. Pang, P. Grasch, A. Toshev, and Y. Yang. MM1: Methods, analysis & insights from multi...

  67. [67]

    OCR-VQA: Visual question answering by reading text in images

    A. Mishra, S. Shekhar, A. K. Singh, and A. Chakraborty. OCR-VQA: Visual question answering by reading text in images. In ICDAR, 2019

  68. [68]

    ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification - RRC-MLT

    N. Nayef, F. Yin, I. Bizid, H. Choi, Y. Feng, D. Karatzas, Z. Luo, U. Pal, C. Rigaud, J. Chazalon, et al. ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification - RRC-MLT. In ICDAR, 2017

  69. [69]

    Y. Onoe, S. Rane, Z. Berger, Y. Bitton, J. Cho, R. Garg, A. Ku, Z. Parekh, J. Pont-Tuset, G. Tanzer, S. Wang, and J. Baldridge. DOCCI: Descriptions of Connected and Contrasting Images. In ECCV, 2024

  70. [70]

    H. Pang. YOLO-DocLayNet, Jan. URL https://github.com/ppaanngggg/yolo-doclaynet

  72. [72]

    Indigo: Universal cheminformatics API

    D. Pavlov, M. Rybalkin, B. Karulin, M. Kozhevnikov, A. Savelyev, and A. Churinov. Indigo: Universal cheminformatics API. Journal of Cheminformatics, 3(Suppl 1):P4, 2011

  73. [73]

    Z. Peng, W. Wang, L. Dong, Y. Hao, S. Huang, S. Ma, and F. Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv:2306.14824, 2023

  74. [74]

    xGQA: Cross-lingual visual question answering

    J. Pfeiffer, G. Geigle, A. Kamath, J.-M. Steitz, S. Roth, I. Vulić, and I. Gurevych. xGQA: Cross-lingual visual question answering. In ACL, 2022

  75. [75]

    DocLayNet: A large human-annotated dataset for document-layout segmentation

    B. Pfitzmann, C. Auer, M. Dolfi, A. S. Nassar, and P. Staar. DocLayNet: A large human-annotated dataset for document-layout segmentation. In SIGKDD, 2022

  76. [76]

    Pre-training image-language transformers for open-vocabulary tasks

    A. Piergiovanni, W. Kuo, and A. Angelova. Pre-training image-language transformers for open-vocabulary tasks. arXiv:2209.04372, 2022

  77. [77]

    Y. Qian, J. Guo, Z. Tu, Z. Li, C. W. Coley, and R. Barzilay. MolScribe: Robust molecular structure recognition with image-to-graph generation. J. Chem. Inf. Model., 63(7), 2023

  78. [78]

    Learning transferable visual models from natural language supervision

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021

  79. [79]

    Measuring attribution in natural language generation models

    H. Rashkin, V. Nikolaev, M. Lamm, L. Aroyo, M. Collins, D. Das, S. Petrov, G. S. Tomar, I. Turc, and D. Reitter. Measuring attribution in natural language generation models. Computational Linguistics, 49(4):777–840, 2023

  80. [80]

    End-to-end optical music recognition for pianoform sheet music

    A. Ríos-Vila, D. Rizo, J. M. Iñesta, and J. Calvo-Zaragoza. End-to-end optical music recognition for pianoform sheet music. IJDAR, 26(3):347–362, 2023

Showing first 80 references.