pith. machine review for the scientific record.

arxiv: 2412.03555 · v1 · submitted 2024-12-04 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

PaliGemma 2: A Family of Versatile VLMs for Transfer

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 09:10 UTC · model grok-4.3

classification 💻 cs.CV
keywords PaliGemma 2 · Vision-Language Models · Transfer Learning · OCR Tasks · Multi-resolution Training · Gemma 2 · SigLIP · Fine-tuning

The pith

PaliGemma 2 pairs Gemma 2 language models with SigLIP encoders and trains them at multiple resolutions to achieve strong transfer on OCR and captioning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PaliGemma 2 upgrades the original PaliGemma by combining the SigLIP-So400m vision encoder with the full Gemma 2 language model family, from 2B to 27B parameters. The models receive multi-stage training at 224, 448, and 896 pixel resolutions to acquire broad knowledge that supports fine-tuning on new tasks. Evaluation expands to table structure recognition, molecular structure recognition, music score recognition, long fine-grained captioning, and radiography report generation, where the models reach state-of-the-art results. The family also supports analysis of how model size, resolution, and task type influence transfer performance. The results matter because the open models offer practical starting points for adapting vision-language systems to specialized document and imaging applications.

Core claim

PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning. The resulting family of base models covering different model sizes and resolutions allows us to investigate factors impacting transfer performance (such as learning rate) and to analyze the interplay between the type of task, model size, and resolution.

What carries the argument

Gemma 2 language models paired with the SigLIP-So400m vision encoder and trained in multiple stages at three resolutions; this combination builds broad, transferable knowledge for fine-tuning.
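As a point of reference, here is a minimal inference sketch against the released checkpoints. It assumes the Hugging Face transformers PaliGemma integration and the public naming scheme for the nine size-by-resolution variants (google/paligemma2-{3b,10b,28b}-pt-{224,448,896}); the image path and prompt are placeholders, and exact checkpoint ids should be verified on the model hub.

```python
# Minimal inference sketch, assuming the Hugging Face `transformers`
# PaliGemma integration and the public PaliGemma 2 checkpoint naming.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# One of nine (size, resolution) variants: 3b/10b/28b at 224/448/896.
checkpoint = "google/paligemma2-3b-pt-224"
processor = AutoProcessor.from_pretrained(checkpoint)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("example.png")  # placeholder local image
# PaliGemma models are steered by short task prefixes such as "caption en"
# or "ocr"; depending on the transformers version, an explicit "<image>"
# token may need to be prepended to the prompt.
inputs = processor(text="caption en", images=image, return_tensors="pt").to(model.device)

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = generated[0][inputs["input_ids"].shape[1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))
```

The call pattern is identical across the family; only the checkpoint id changes, which is what makes the paper's size-versus-resolution comparisons straightforward to rerun at inference time.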

If this is right

  • Larger model sizes paired with higher input resolutions improve accuracy on fine-grained OCR tasks such as table and molecular structure recognition.
  • The same base models support effective fine-tuning for long-form captioning and medical report generation without task-specific architectural changes.
  • Varying model size and resolution reveals clear trade-offs that guide selection of the right model for a given task type and compute budget (a toy version of this selection logic is sketched just after this list).
  • Expansion of benchmarks to music scores and molecular diagrams shows the family transfers to domains outside conventional natural-image captioning.
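The trade-off bullet can be made concrete with a toy selection helper. Everything below is illustrative: the task categories and heuristics are assumptions layered on the paper's qualitative claims, not thresholds the authors publish.

```python
# Illustrative only: map a task type and compute budget to one of the nine
# PaliGemma 2 variants. The heuristics are invented for this sketch.
def pick_variant(task: str, budget: str) -> str:
    """Fine-grained OCR-style tasks tend to reward input resolution,
    while knowledge-heavy tasks tend to reward language-model scale."""
    fine_grained = task in {"table_structure", "molecule", "music_score", "dense_ocr"}
    resolution = 896 if fine_grained else 448 if task == "long_caption" else 224
    size = {"low": "3b", "medium": "10b", "high": "28b"}[budget]
    return f"google/paligemma2-{size}-pt-{resolution}"

print(pick_variant("table_structure", "low"))  # google/paligemma2-3b-pt-896
print(pick_variant("vqa", "high"))             # google/paligemma2-28b-pt-224
```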

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The multi-resolution training may confer robustness to inputs whose native detail level varies widely, an advantage not yet quantified against fixed-resolution baselines.
  • These models could serve as efficient starting points for domain adaptation in fields such as historical document analysis or scientific imaging pipelines.
  • The music-score results hint at latent capabilities for structured visual sequences that might extend to other ordered domains like chemical diagrams or circuit schematics.
  • Future work could test whether the same training recipe transfers to video or multi-frame inputs by leveraging the resolution flexibility already present.

Load-bearing premise

Multi-stage training at multiple resolutions equips the models with broad transferable knowledge that simpler single-stage or single-resolution training cannot match.

What would settle it

A controlled comparison in which a single-resolution single-stage model of matching size achieves equal or better accuracy on table structure recognition, molecular recognition, music score recognition, and radiography report generation would falsify the necessity of the multi-stage multi-resolution protocol.
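As a protocol, that comparison could look like the harness below. The function finetune_and_eval is a hypothetical stand-in that returns placeholder scores so the skeleton runs; a real study would fine-tune each checkpoint on the task's training split and report held-out metrics.

```python
# Skeleton of the controlled comparison described above; the scores are
# placeholders, not results from the paper.
import random

TASKS = ["table_structure", "molecule_recognition", "music_score", "radiography_report"]

def finetune_and_eval(checkpoint: str, task: str) -> float:
    """Hypothetical stand-in: a real implementation would fine-tune
    `checkpoint` on the task and return test accuracy."""
    return random.Random(f"{checkpoint}/{task}").uniform(0.5, 0.9)

multi_stage = "paligemma2-3b-pt-896"  # multi-stage, multi-resolution recipe
single_stage = "control-3b-896"       # hypothetical matched-size, single-stage control

for task in TASKS:
    a = finetune_and_eval(multi_stage, task)
    b = finetune_and_eval(single_stage, task)
    verdict = "premise holds" if a > b else "premise falsified here"
    print(f"{task:22s} multi-stage {a:.3f} vs single-stage {b:.3f} -> {verdict}")
```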

read the original abstract

PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning. The resulting family of base models covering different model sizes and resolutions allows us to investigate factors impacting transfer performance (such as learning rate) and to analyze the interplay between the type of task, model size, and resolution. We further increase the number and breadth of transfer tasks beyond the scope of PaliGemma including different OCR-related tasks such as table structure recognition, molecular structure recognition, music score recognition, as well as long fine-grained captioning and radiography report generation, on which PaliGemma 2 obtains state-of-the-art results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PaliGemma 2, an upgraded family of open vision-language models that pairs the SigLIP-So400m vision encoder with the full range of Gemma 2 language models (2B to 27B parameters). The models are trained in multiple stages at three resolutions (224 px, 448 px, 896 px) to equip them with broad transferable knowledge; the resulting family is then evaluated on an expanded set of transfer tasks that includes OCR-related problems (table structure recognition, molecular structure recognition, music score recognition) as well as long fine-grained captioning and radiography report generation, where state-of-the-art results are reported.

Significance. If the empirical claims hold after proper controls, the work supplies a publicly available, multi-scale VLM family that enables systematic study of how model size, input resolution, and staged training affect downstream transfer. The extension of the task suite to specialized OCR and medical domains is a concrete contribution that can serve as a benchmark for future transfer research.

major comments (2)
  1. [Training and Results sections] The central claim that the multi-stage, multi-resolution training schedule equips the models with broad transferable knowledge is not supported by controlled ablations. No quantitative comparison is presented between the full regimen and simpler single-stage or single-resolution baselines, leaving open the possibility that reported gains are driven primarily by scale, the SigLIP encoder, or task-specific fine-tuning rather than the staged schedule itself.
  2. [Experimental Results] SOTA claims on table structure recognition, molecular structure recognition, music score recognition, and radiography report generation are stated without accompanying error bars, statistical significance tests, or exhaustive baseline tables that would allow readers to verify the magnitude and reliability of the improvements.
minor comments (2)
  1. [Abstract] The abstract states that the family 'allows us to investigate factors impacting transfer performance (such as learning rate)' yet does not summarize the concrete findings of that investigation; a short paragraph or table in the main text would clarify which factors were actually quantified.
  2. [Model Architecture] Notation for the three training resolutions and the exact parameter counts of the 2B–27B variants should be introduced consistently in the model description section to avoid ambiguity when results are later broken down by size and resolution.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address the major comments point by point below and describe the revisions we will incorporate.

read point-by-point responses
  1. Referee: [Training and Results sections] The central claim that the multi-stage, multi-resolution training schedule equips the models with broad transferable knowledge is not supported by controlled ablations. No quantitative comparison is presented between the full regimen and simpler single-stage or single-resolution baselines, leaving open the possibility that reported gains are driven primarily by scale, the SigLIP encoder, or task-specific fine-tuning rather than the staged schedule itself.

    Authors: We agree that the manuscript would be strengthened by explicit ablations isolating the contribution of the multi-stage, multi-resolution schedule. The current work builds directly on the PaliGemma training recipe and emphasizes the resulting model family’s transfer performance across scales and resolutions, but does not include head-to-head comparisons against single-stage or single-resolution variants. We will add controlled ablation experiments on a representative subset of transfer tasks (including at least one OCR-related task and one captioning task) to quantify the incremental benefit of the staged schedule versus simpler baselines. These results will be reported in a new subsection of the Training section. revision: yes

  2. Referee: [Experimental Results] SOTA claims on table structure recognition, molecular structure recognition, music score recognition, and radiography report generation are stated without accompanying error bars, statistical significance tests, or exhaustive baseline tables that would allow readers to verify the magnitude and reliability of the improvements.

    Authors: We acknowledge that the current presentation of SOTA results lacks sufficient statistical detail. In the revised manuscript we will (i) report standard deviations or error bars for all tasks where multiple independent runs were performed, (ii) add pairwise statistical significance tests (e.g., paired t-tests or Wilcoxon tests) against the strongest baselines where feasible, and (iii) expand the baseline tables to include additional published methods and more granular metrics. For the most compute-intensive tasks where only single runs are available, we will explicitly note this limitation and provide any available confidence-interval estimates derived from internal validation splits. revision: yes
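For point (ii), a minimal example of the paired tests the rebuttal proposes, using scipy; the per-example scores below are fabricated for illustration.

```python
# Paired significance tests on per-example scores of a model vs. the
# strongest baseline for one task. The score arrays are made-up data.
from scipy import stats

model_scores    = [0.91, 0.84, 0.88, 0.79, 0.93, 0.87, 0.90, 0.82]
baseline_scores = [0.88, 0.85, 0.83, 0.78, 0.90, 0.84, 0.86, 0.83]

t_stat, t_p = stats.ttest_rel(model_scores, baseline_scores)
w_stat, w_p = stats.wilcoxon(model_scores, baseline_scores)
print(f"paired t-test: t={t_stat:.3f}, p={t_p:.4f}")
print(f"Wilcoxon:      W={w_stat:.1f}, p={w_p:.4f}")
```

With only single runs on the most compute-intensive tasks, such tests are unavailable, which is exactly why the rebuttal commits to flagging those cells explicitly.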

Circularity Check

0 steps flagged

Empirical model release with no derivation chain or fitted predictions

full rationale

The paper describes training PaliGemma 2 by combining a SigLIP vision encoder with Gemma 2 language models, performing multi-stage multi-resolution training, and then measuring transfer performance on held-out tasks including OCR variants. No equations, uniqueness theorems, ansatzes, or predictions are claimed; results are reported as direct empirical measurements against external benchmarks, and no step reduces outputs back to their own inputs, so there is no circularity by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that combining the prior SigLIP encoder with Gemma 2 and multi-resolution training produces broad transfer capability; no new entities or free parameters are explicitly introduced beyond standard model sizes and resolutions.

free parameters (2)
  • model sizes (2B-27B)
    Selected from the Gemma 2 family; values are inherited rather than fitted in this work.
  • training resolutions (224/448/896 px)
    Chosen for the multi-stage training schedule.
axioms (1)
  • domain assumption: SigLIP-So400m vision encoder remains effective when paired with larger Gemma 2 language models
    Inherited from the original PaliGemma without new justification in the abstract.

pith-pipeline@v0.9.0 · 5572 in / 1263 out tokens · 45113 ms · 2026-05-15T09:10:03.840284+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

    cs.CV 2026-05 unverdicted novelty 8.0

    TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

  2. LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 7.0

    LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.

  3. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  4. RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

    cs.CV 2026-05 unverdicted novelty 6.0

    RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...

  5. SurgCheck: Do Vision-Language Models Really Look at Images in Surgical VQA?

    cs.CV 2026-05 unverdicted novelty 6.0

    SurgCheck benchmark reveals that vision-language models for surgical VQA often depend on linguistic shortcuts rather than visual reasoning, shown by consistent performance drops on less-biased questions.

  6. Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery

    cs.RO 2026-05 unverdicted novelty 6.0

    Sentinel-VLA introduces a metacognitive VLA model with a sentinel module for real-time status monitoring, dynamic reasoning, and error recovery, plus a self-evolving continual learning method, raising real-world task ...

  7. Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.

  8. Boosting Visual Instruction Tuning with Self-Supervised Guidance

    cs.CV 2026-04 unverdicted novelty 6.0

    Mixing 3-10% of visually grounded self-supervised instructions into visual instruction tuning consistently boosts MLLM performance on vision-centric benchmarks.

  9. When Meaning Isn't Literal: Exploring Idiomatic Meaning Across Languages and Modalities

    cs.CL 2026-04 unverdicted novelty 6.0

    Presents the Mediom multilingual multimodal idiom corpus and the HIDE hinting-based framework to benchmark and improve AI comprehension of figurative meanings across languages.

  10. AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement

    cs.RO 2026-04 unverdicted novelty 6.0

    AnySlot decouples language grounding from low-level control by inserting an explicit visual goal image, yielding better zero-shot performance on precise slot placement tasks than flat VLA policies.

  11. VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

    cs.RO 2026-04 unverdicted novelty 6.0

    VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.

  12. InVitroVision: a Multi-Modal AI Model for Automated Description of Embryo Development using Natural Language

    cs.AI 2026-04 unverdicted novelty 5.0

    InVitroVision, a fine-tuned PaliGemma-2 model, generates natural language descriptions of embryo development and outperforms ChatGPT 5.2 and base models on a public time-lapse dataset.

  13. EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    EgoMotion decouples reasoning from motion synthesis in egocentric vision-language tasks by mapping inputs to motion primitives via VLM then using diffusion to produce grounded and coherent 3D trajectories.

  14. ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning

    cs.RO 2026-04 unverdicted novelty 5.0

    ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.

  15. HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

    cs.CV 2026-04 unverdicted novelty 5.0

    HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.

  16. HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

    cs.CV 2026-04 unverdicted novelty 5.0

    HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.

  17. Anthropogenic Regional Adaptation in Multimodal Vision-Language Model

    cs.AI 2026-04 unverdicted novelty 5.0

    Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.

  18. SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    cs.RO 2025-01 unverdicted novelty 5.0

    SpatialVLA adds 3D-aware position encoding and adaptive discretized action grids to visual-language-action models, enabling strong zero-shot performance and fine-tuning on new robot setups after pre-training on 1.1 mi...

  19. OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL

    cs.RO 2026-04 unverdicted novelty 4.0

    OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.

  20. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    cs.CV 2025-02 unverdicted novelty 4.0

    SigLIP 2 models trained with a unified recipe of captioning, self-supervised losses, and curated diverse data outperform prior SigLIP versions on classification, retrieval, localization, dense prediction, and multilin...

Reference graph

Works this paper leans on

113 extracted references · 113 canonical work pages · cited by 19 Pith papers · 10 internal anchors

  1. [1]

    TallyQA: Answering complex counting questions

    M. Acharya, K. Kafle, and C. Kanan. TallyQA: Answering complex counting questions. In AAAI, 2019

  2. [2]

    NoCaps: Novel object captioning at scale

    H. Agrawal, K. Desai, Y. Wang, X. Chen, R. Jain, M. Johnson, D. Batra, D. Parikh, S. Lee, and P. Anderson. NoCaps: Novel object captioning at scale. In ICCV, 2019

  3. [3]

    Getting ViT in shape: Scaling laws for compute-optimal model design

    I. Alabdulmohsin, X. Zhai, A. Kolesnikov, and L. Beyer. Getting ViT in shape: Scaling laws for compute-optimal model design. In NeurIPS, 2023

  4. [4]

    Flamingo: a visual language model for few-shot learning

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan. Flamingo: a visual language model f...

  5. [5]

    J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv:2308.12966, 2023

  6. [6]

    Neural Combinatorial Optimization with Reinforcement Learning

    I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio. Neural combinatorial optimization with reinforcement learning. arXiv:1611.09940, 2016

  7. [7]

    Improving image generation with better captions

    J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, et al. Improving image generation with better captions. Technical Report, 2023

  8. [8]

    Big vision

    L. Beyer, X. Zhai, and A. Kolesnikov. Big vision. https://github.com/google-research/big_vision, 2022

  9. [9]

    PaliGemma: A versatile 3B VLM for transfer

    L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, T. Unterthiner, D. Keysers, S. Koppula, F. Liu, A. Grycner, A. Gritsenko, N. Houlsby, M. Kumar, K. Rong, J. Eisenschlos, R. Kabra, M. Bauer, M. Bošnjak, X. Chen, M. Minderer, P. Voigtlaender, I. Bica, I. Balazevic, J. Puigcer...

  10. [10]

    A. F. Biten, R. Tito, A. Mafla, L. Gomez, M. Rusinol, C. Jawahar, E. Valveny, and D. Karatzas. Scene text visual question answering. In ICCV, Oct. 2019

  11. [11]

    All you may need for VQA are image captions

    S. Changpinyo, D. Kukliansy, I. Szpektor, X. Chen, N. Ding, and R. Soricut. All you may need for VQA are image captions. In NAACL, 2022

  12. [12]

    D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL, 2011

  13. [13]

    T. Chen, S. Saxena, L. Li, D. J. Fleet, and G. E. Hinton. Pix2seq: A language modeling framework for object detection. In ICLR, 2022

  14. [14]

    X. Chen, X. Wang, S. Changpinyo, A. J. Piergiovanni, P. Padlewski, D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer, A. Kolesnikov, J. Puigcerver, N. Ding, K. Rong, H. Akbari, G. Mishra, L. Xue, A. Thapliyal, J. Bradbury, W. Kuo, M. Seyedhosseini, C. Jia, B. K. Ayan, C. Riquelme, A. Steiner, A. Angelova, X. Zhai, N. Houlsby, and R. Soricut. PaLI: A j...

  15. [15]

    X. Chen, X. Wang, L. Beyer, A. Kolesnikov, J. Wu, P. Voigtlaender, B. Mustafa, S. Goodman, I. Alabdulmohsin, P. Padlewski, D. Salz, X. Xiong, D. Vlasic, F. Pavetic, K. Rong, T. Yu, D. Keysers, X. Zhai, and R. Soricut. PaLI-3 vision language models: Smaller, faster, stronger. arXiv:2310.09199, 2023

  16. [16]

    X. Chen, J. Djolonga, P. Padlewski, B. Mustafa, S. Changpinyo, J. Wu, C. R. Ruiz, S. Goodman, X. Wang, Y. Tay, S. Shakeri, M. Dehghani, D. Salz, M. Lucic, M. Tschannen, A. Nagrani, H. Hu, M. Joshi, B. Pang, C. Montgomery, P. Pietrzyk, M. Ritter, A. J. Piergiovanni, M. Minderer, F. Pavetic, A. Waters, G. Li, I. Alabdulmohsin, L. Beyer, J. Amelot, K. Lee, A. P. Ste...

  17. [17]

    C. K. Ch’ng and C. S. Chan. Total-Text: A comprehensive dataset for scene text detection and recognition. In ICDAR, 2017

  18. [18]

    W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv:2305.06500, 2023

  19. [19]

    Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    M. Deitke, C. Clark, S. Lee, R. Tripathi, Y. Yang, J. S. Park, M. Salehi, N. Muennighoff, K. Lo, L. Soldaini, et al. Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models. arXiv:2409.17146, 2024

  20. [20]

    VirTex: Learning visual representations from textual annotations

    K. Desai and J. Johnson. VirTex: Learning visual representations from textual annotations. In CVPR, 2021

  21. [21]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team. Gemma: Open models based on Gemini research and technology. arXiv:2403.08295, 2024

  22. [22]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team. Gemma 2: Improving open language models at a practical size. arXiv:2408.00118, 2024

  23. [23]

    A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation, 101(23), 2000

  24. [24]

    Introduction to Cloud TPU

    Google Cloud. Introduction to Cloud TPU. https://cloud.google.com/tpu/docs/intro-to-tpu, 20xx. Accessed: 2024-07-04

  25. [25]

    Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering

    Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In CVPR, 2017

  26. [26]

    VizWiz Grand Challenge: Answering visual questions from blind people

    D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham. VizWiz Grand Challenge: Answering visual questions from blind people. In CVPR, 2018

  27. [27]

    T.-Y. Hsu, C. L. Giles, and T.-H. Huang. SciCap: Generating captions for scientific figures. arXiv:2110.11624, 2021

  28. [28]

    Improving table structure recognition with visual-alignment sequential coordinate modeling

    Y. Huang, N. Lu, D. Chen, Y. Li, Z. Xie, S. Zhu, L. Gao, and W. Peng. Improving table structure recognition with visual-alignment sequential coordinate modeling. In CVPR, 2023

  29. [29]

    GQA: A new dataset for real-world visual reasoning and compositional question answering

    D. Hudson and C. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. CVPR, 2019

  30. [30]

    S. Jain, A. Agrawal, A. Saporta, S. Truong, T. Bui, P. Chambon, Y. Zhang, M. P. Lungren, A. Y. Ng, C. Langlotz, et al. RadGraph: Extracting clinical entities and relations from radiology reports. In NeurIPS Datasets and Benchmarks Track, 2022

  31. [31]

    C. Jia, Y. Yang, Y. Xia, Y. Chen, Z. Parekh, H. Pham, Q. V. Le, Y. Sung, Z. Li, and T. Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021

  32. [32]

    Ultralytics YOLO

    G. Jocher, J. Qiu, and A. Chaurasia. Ultralytics YOLO, 2023. URL https://github.com/ultralytics/ultralytics

  33. [33]

    A. E. Johnson, T. J. Pollard, S. J. Berkowitz, N. R. Greenbaum, M. P. Lungren, C.-Y. Deng, R. G. Mark, and S. Horng. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(1):317, 2019

  34. [34]

    O. F. Kar, A. Tonioni, P. Poklukar, A. Kulshrestha, A. Zamir, and F. Tombari. BRAVE: Broadening the visual encoding of vision-language models. arXiv:2404.07204, 2024

  35. [35]

    Prismatic VLMs: Investigating the design space of visually-conditioned language models

    S. Karamcheti, S. Nair, A. Balakrishna, P. Liang, T. Kollar, and D. Sadigh. Prismatic VLMs: Investigating the design space of visually-conditioned language models. arXiv:2402.07865, 2024

  36. [36]

    ICDAR 2015 competition on robust reading

    D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. K. Ghosh, A. D. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny. ICDAR 2015 competition on robust reading. In ICDAR, 2015

  37. [37]

    FairFace: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation

    K. Karkkainen and J. Joo. FairFace: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation. In WACV, 2021

  38. [38]

    Multi-cell decoder and mutual learning for table structure and character recognition

    T. Kawakatsu. Multi-cell decoder and mutual learning for table structure and character recognition. In ICDAR, 2024

  39. [39]

    ReferItGame: Referring to objects in photographs of natural scenes

    S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, Oct. 2014

  40. [40]

    A diagram is worth a dozen images

    A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, and A. Farhadi. A diagram is worth a dozen images. In ECCV, 2016

  41. [41]

    S. Kim, P. A. Thiessen, E. E. Bolton, J. Chen, G. Fu, A. Gindulyte, L. Han, J. He, S. He, B. A. Shoemaker, et al. PubChem substance and compound databases. Nucleic Acids Research, 44(D1):D1202–D1213, 2016

  42. [42]

    D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2017

  43. [43]

    Dense-captioning events in videos

    R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles. Dense-captioning events in videos. In ICCV, 2017

  44. [44]

    Open Images V5 text annotation and yet another mask text spotter

    I. Krylov, S. Nosov, and V. Sovrasov. Open Images V5 text annotation and yet another mask text spotter. In ACCV, 2021

  45. [45]

    What matters when building vision-language models?

    H. Laurençon, L. Tronchon, M. Cord, and V. Sanh. What matters when building vision-language models? arXiv:2405.02246, 2024

  46. [46]

    A. Lees, V. Q. Tran, Y. Tay, J. Sorensen, J. Gupta, D. Metzler, and L. Vasserman. A new generation of Perspective API: Efficient multilingual character-level transformers. arXiv:2202.11176, 2022

  47. [47]

    B. Li, H. Zhang, K. Zhang, D. Guo, Y. Zhang, R. Zhang, F. Li, Z. Liu, and C. Li. LLaVA-NeXT: What else influences visual instruction tuning beyond data?, May 2024. URL https://llava-vl.github.io/blog/2024-05-25-llava-next-ablations/

  48. [48]

    J. Li, D. Li, S. Savarese, and S. C. H. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023

  49. [49]

    Y. Li, G. Li, L. He, J. Zheng, H. Li, and Z. Guan. Widget Captioning: Generating natural language description for mobile user interface elements. In EMNLP, 2020

  50. [50]

    Y. Li, H. Mao, R. Girshick, and K. He. Exploring plain vision transformer backbones for object detection. In ECCV, 2022

  51. [51]

    T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. arXiv:1405.0312, 2014

  52. [52]

    F. Liu, E. Bugliarello, E. M. Ponti, S. Reddy, N. Collier, and D. Elliott. Visually grounded reasoning across languages and cultures. In EMNLP, Nov. 2021

  53. [53]

    F. Liu, G. E. T. Emerson, and N. Collier. Visual spatial reasoning. TACL, 11:635–651, 2023

  54. [54]

    H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In NeurIPS, 2023

  55. [55]

    RSVQA: Visual question answering for remote sensing data

    S. Lobry, D. Marcos, J. Murray, and D. Tuia. RSVQA: Visual question answering for remote sensing data. IEEE Trans. on Geoscience and Remote Sensing, 58(12), Dec. 2020

  56. [56]

    S. Long, S. Qin, D. Panteleev, A. Bissacco, Y. Fujii, and M. Raptis. Towards end-to-end unified scene text detection and layout analysis. In CVPR, 2022

  57. [57]

    S. Long, S. Qin, D. Panteleev, A. Bissacco, Y. Fujii, and M. Raptis. ICDAR 2023 competition on hierarchical text detection and recognition. In ICDAR, 2023

  58. [58]

    S. Long, S. Qin, Y. Fujii, A. Bissacco, and M. Raptis. Hierarchical text spotter for joint text spotting and layout analysis. In WACV, 2024

  59. [59]

    P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022

  60. [60]

    N. T. Ly and A. Takasu. An end-to-end multi-task learning model for image-based table recognition. arXiv:2303.08648, 2023

  61. [61]

    J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016

  62. [62]

    OK-VQA: A visual question answering benchmark requiring external knowledge

    K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In CVPR, 2019

  63. [63]

    ChartQA: A benchmark for question answering about charts with visual and logical reasoning

    A. Masry, X. L. Do, J. Q. Tan, S. Joty, and E. Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In ACL, May 2022

  64. [64]

    DocVQA: A dataset for VQA on document images

    M. Mathew, D. Karatzas, R. Manmatha, and C. V. Jawahar. DocVQA: A dataset for VQA on document images. arXiv:2007.00398, 2020

  65. [65]

    InfographicVQA

    M. Mathew, V. Bagal, R. Tito, D. Karatzas, E. Valveny, and C. V. Jawahar. InfographicVQA. In WACV, 2022

  66. [66]

    MM1: Methods, analysis & insights from multimodal LLM pre-training

    B. McKinzie, Z. Gan, J. Fauconnier, S. Dodge, B. Zhang, P. Dufter, D. Shah, X. Du, F. Peng, F. Weers, A. Belyi, H. Zhang, K. Singh, D. Kang, A. Jain, H. Hè, M. Schwarzer, T. Gunter, X. Kong, A. Zhang, J. Wang, C. Wang, N. Du, T. Lei, S. Wiseman, G. Yin, M. Lee, Z. Wang, R. Pang, P. Grasch, A. Toshev, and Y. Yang. MM1: Methods, analysis & insights from multi...

  67. [67]

    OCR-VQA: Visual question answering by reading text in images

    A. Mishra, S. Shekhar, A. K. Singh, and A. Chakraborty. OCR-VQA: Visual question answering by reading text in images. In ICDAR, 2019

  68. [68]

    ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification - RRC-MLT

    N. Nayef, F. Yin, I. Bizid, H. Choi, Y. Feng, D. Karatzas, Z. Luo, U. Pal, C. Rigaud, J. Chazalon, et al. ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification - RRC-MLT. In ICDAR, 2017

  69. [69]

    Y. Onoe, S. Rane, Z. Berger, Y. Bitton, J. Cho, R. Garg, A. Ku, Z. Parekh, J. Pont-Tuset, G. Tanzer, S. Wang, and J. Baldridge. DOCCI: Descriptions of Connected and Contrasting Images. In ECCV, 2024

  70. [70]

    H. Pang. YOLO-DocLayNet, Jan. URL https://github.com/ppaanngggg/yolo-doclaynet

  72. [72]

    Indigo: Universal cheminformatics API

    D. Pavlov, M. Rybalkin, B. Karulin, M. Kozhevnikov, A. Savelyev, and A. Churinov. Indigo: Universal cheminformatics API. Journal of Cheminformatics, 3(Suppl 1):P4, 2011

  73. [73]

    Z. Peng, W. Wang, L. Dong, Y. Hao, S. Huang, S. Ma, and F. Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv:2306.14824, 2023

  74. [74]

    xGQA: Cross-lingual visual question answering

    J. Pfeiffer, G. Geigle, A. Kamath, J.-M. Steitz, S. Roth, I. Vulić, and I. Gurevych. xGQA: Cross-lingual visual question answering. In ACL, 2022

  75. [75]

    DocLayNet: A large human-annotated dataset for document-layout segmentation

    B. Pfitzmann, C. Auer, M. Dolfi, A. S. Nassar, and P. Staar. DocLayNet: A large human-annotated dataset for document-layout segmentation. In SIGKDD, 2022

  76. [76]

    Pre-training image-language transformers for open-vocabulary tasks

    A. Piergiovanni, W. Kuo, and A. Angelova. Pre-training image-language transformers for open-vocabulary tasks. arXiv:2209.04372, 2022

  77. [77]

    Y. Qian, J. Guo, Z. Tu, Z. Li, C. W. Coley, and R. Barzilay. MolScribe: Robust molecular structure recognition with image-to-graph generation. J. Chem. Inf. Model., 63(7), 2023

  78. [78]

    Learning transferable visual models from natural language supervision

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021

  79. [79]

    Measuring attribution in natural language generation models

    H. Rashkin, V. Nikolaev, M. Lamm, L. Aroyo, M. Collins, D. Das, S. Petrov, G. S. Tomar, I. Turc, and D. Reitter. Measuring attribution in natural language generation models. Computational Linguistics, 49(4):777–840, 2023

  80. [80]

    End-to-end optical music recognition for pianoform sheet music

    A. Ríos-Vila, D. Rizo, J. M. Iñesta, and J. Calvo-Zaragoza. End-to-end optical music recognition for pianoform sheet music. IJDAR, 26(3):347–362, 2023

Showing first 80 references.