PaliGemma 2: A Family of Versatile VLMs for Transfer
Pith reviewed 2026-05-15 09:10 UTC · model grok-4.3
The pith
PaliGemma 2 pairs Gemma 2 language models with SigLIP encoders and trains them at multiple resolutions to achieve strong transfer on OCR and captioning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning. The resulting family of base models covering different model sizes and resolutions allows us to investigate factors impacting transfer performance (such as learning rate) and to analyze the interplay between the type of task, model size, and resolution. We further increase the number and breadth of transfer tasks beyond the scope of PaliGemma, including OCR-related tasks such as table structure recognition, molecular structure recognition, and music score recognition, as well as long fine-grained captioning and radiography report generation, on which PaliGemma 2 obtains state-of-the-art results.
What carries the argument
Multi-stage training at three resolutions of Gemma 2 language models paired with the SigLIP-So400m vision encoder, which builds broad transferable knowledge for fine-tuning.
If this is right
- Larger model sizes paired with higher input resolutions improve accuracy on fine-grained OCR tasks such as table and molecular structure recognition.
- The same base models support effective fine-tuning for long-form captioning and medical report generation without task-specific architectural changes.
- Varying model size and resolution reveals clear trade-offs that guide selection of the right model for a given task type and compute budget.
- Expansion of benchmarks to music scores and molecular diagrams shows the family transfers to domains outside conventional natural-image captioning.
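The task/size/resolution trade-offs listed above can be made concrete with a toy selection helper. Everything below is an illustrative assumption rather than a prescription from the paper: the heuristics are invented, the `pick_variant` function is hypothetical, and the 9B middle size is assumed from the Gemma 2 family (the paper names only the 2B and 27B endpoints).

```python
# Toy illustration of the size/resolution trade-off; all heuristics here
# are invented for illustration, not findings from the paper.

def pick_variant(task: str, budget: str) -> tuple[str, int]:
    """Map a task type and compute budget to a (Gemma 2 LM size, resolution) pair."""
    # Fine-grained OCR (tables, molecules, music scores) leans on input resolution.
    needs_detail = task in {"table_ocr", "molecule_ocr", "music_ocr"}
    # Long captioning and report generation lean on language-model capacity.
    needs_language = task in {"captioning", "radiology_report"}
    if budget == "low":
        return ("2B", 448 if needs_detail else 224)
    if budget == "medium":
        return ("9B", 896 if needs_detail else 448)
    if needs_detail:
        return ("27B", 896)
    return ("27B", 448 if needs_language else 224)
```

The point of the sketch is only that the two axes trade off differently: detail-bound tasks spend budget on resolution first, language-bound tasks on model size first.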
Where Pith is reading between the lines
- The multi-resolution training may confer robustness to inputs whose native detail level varies widely, an advantage not yet quantified against fixed-resolution baselines.
- These models could serve as efficient starting points for domain adaptation in fields such as historical document analysis or scientific imaging pipelines.
- The music-score results hint at latent capabilities for structured visual sequences that might extend to other ordered domains like chemical diagrams or circuit schematics.
- Future work could test whether the same training recipe transfers to video or multi-frame inputs by leveraging the resolution flexibility already present.
Load-bearing premise
Multi-stage training at multiple resolutions equips the models with broad transferable knowledge that simpler single-stage or single-resolution training cannot match.
What would settle it
A controlled comparison in which a single-resolution single-stage model of matching size achieves equal or better accuracy on table structure recognition, molecular recognition, music score recognition, and radiography report generation would falsify the necessity of the multi-stage multi-resolution protocol.
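Such a controlled comparison reduces to a paired test over shared benchmarks. As a sketch of how the outcome could be judged, the code below runs an exact two-sided sign test on paired per-task scores; the score lists are invented for illustration and do not come from the paper.

```python
from math import comb

def sign_test_p(scores_a, scores_b):
    """Exact two-sided sign test on paired scores; ties are dropped."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    wins = sum(d > 0 for d in diffs)
    k = max(wins, n - wins)
    # Two-sided tail probability under the null (each sign equally likely).
    p = 2 * sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(p, 1.0)

# Hypothetical per-task accuracies: multi-stage multi-resolution model
# vs. a single-stage single-resolution baseline of matching size.
multi  = [0.91, 0.84, 0.78, 0.88, 0.73, 0.95, 0.81, 0.69, 0.86, 0.90]
single = [0.89, 0.80, 0.79, 0.84, 0.70, 0.93, 0.78, 0.66, 0.83, 0.87]
p = sign_test_p(multi, single)  # multi wins 9 of 10 pairs
```

If the baseline matched or beat the full regimen across such paired comparisons, the necessity claim would fail regardless of the test used; the sign test is merely the simplest distribution-free choice.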
Original abstract
PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning. The resulting family of base models covering different model sizes and resolutions allows us to investigate factors impacting transfer performance (such as learning rate) and to analyze the interplay between the type of task, model size, and resolution. We further increase the number and breadth of transfer tasks beyond the scope of PaliGemma including different OCR-related tasks such as table structure recognition, molecular structure recognition, music score recognition, as well as long fine-grained captioning and radiography report generation, on which PaliGemma 2 obtains state-of-the-art results.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PaliGemma 2, an upgraded family of open vision-language models that pairs the SigLIP-So400m vision encoder with the full range of Gemma 2 language models (2B to 27B parameters). The models are trained in multiple stages at three resolutions (224 px, 448 px, 896 px) to equip them with broad transferable knowledge; the resulting family is then evaluated on an expanded set of transfer tasks that includes OCR-related problems (table structure recognition, molecular structure recognition, music score recognition) as well as long fine-grained captioning and radiography report generation, where state-of-the-art results are reported.
Significance. If the empirical claims hold after proper controls, the work supplies a publicly available, multi-scale VLM family that enables systematic study of how model size, input resolution, and staged training affect downstream transfer. The extension of the task suite to specialized OCR and medical domains is a concrete contribution that can serve as a benchmark for future transfer research.
major comments (2)
- [Training and Results sections] The central claim that the multi-stage, multi-resolution training schedule equips the models with broad transferable knowledge is not supported by controlled ablations. No quantitative comparison is presented between the full regimen and simpler single-stage or single-resolution baselines, leaving open the possibility that reported gains are driven primarily by scale, the SigLIP encoder, or task-specific fine-tuning rather than the staged schedule itself.
- [Experimental Results] SOTA claims on table structure recognition, molecular structure recognition, music score recognition, and radiography report generation are stated without accompanying error bars, statistical significance tests, or exhaustive baseline tables that would allow readers to verify the magnitude and reliability of the improvements.
minor comments (2)
- [Abstract] The abstract states that the family 'allows us to investigate factors impacting transfer performance (such as learning rate)' yet does not summarize the concrete findings of that investigation; a short paragraph or table in the main text would clarify which factors were actually quantified.
- [Model Architecture] Notation for the three training resolutions and the exact parameter counts of the 2B–27B variants should be introduced consistently in the model description section to avoid ambiguity when results are later broken down by size and resolution.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. We address the major comments point by point below and describe the revisions we will incorporate.
Point-by-point responses
Referee: [Training and Results sections] The central claim that the multi-stage, multi-resolution training schedule equips the models with broad transferable knowledge is not supported by controlled ablations. No quantitative comparison is presented between the full regimen and simpler single-stage or single-resolution baselines, leaving open the possibility that reported gains are driven primarily by scale, the SigLIP encoder, or task-specific fine-tuning rather than the staged schedule itself.
Authors: We agree that the manuscript would be strengthened by explicit ablations isolating the contribution of the multi-stage, multi-resolution schedule. The current work builds directly on the PaliGemma training recipe and emphasizes the resulting model family’s transfer performance across scales and resolutions, but does not include head-to-head comparisons against single-stage or single-resolution variants. We will add controlled ablation experiments on a representative subset of transfer tasks (including at least one OCR-related task and one captioning task) to quantify the incremental benefit of the staged schedule versus simpler baselines. These results will be reported in a new subsection of the Training section. revision: yes
Referee: [Experimental Results] SOTA claims on table structure recognition, molecular structure recognition, music score recognition, and radiography report generation are stated without accompanying error bars, statistical significance tests, or exhaustive baseline tables that would allow readers to verify the magnitude and reliability of the improvements.
Authors: We acknowledge that the current presentation of SOTA results lacks sufficient statistical detail. In the revised manuscript we will (i) report standard deviations or error bars for all tasks where multiple independent runs were performed, (ii) add pairwise statistical significance tests (e.g., paired t-tests or Wilcoxon tests) against the strongest baselines where feasible, and (iii) expand the baseline tables to include additional published methods and more granular metrics. For the most compute-intensive tasks where only single runs are available, we will explicitly note this limitation and provide any available confidence-interval estimates derived from internal validation splits. revision: yes
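For intuition, the single-run confidence intervals mentioned above could be derived with a percentile bootstrap over per-example scores. The sketch below is a generic illustration with invented numbers, not the authors' actual procedure.

```python
import random

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-example scores."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical per-example scores from one fine-tuning run.
scores = [0.82, 0.91, 0.77, 0.88, 0.95, 0.71, 0.86, 0.90, 0.79, 0.84]
lo, hi = bootstrap_ci(scores)
```

A percentile bootstrap needs no distributional assumptions, which makes it a reasonable fallback when only one training run is affordable; it cannot, however, capture run-to-run variance from different random seeds.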
Circularity Check
Empirical model release with no derivation chain or fitted predictions
Full rationale
The paper describes training PaliGemma 2 by combining a SigLIP vision encoder with Gemma 2 language models, performing multi-stage multi-resolution training, and then measuring transfer performance on held-out tasks, including OCR variants. No equations, uniqueness theorems, ansatzes, or fitted predictions are claimed; results are reported as direct empirical measurements against external benchmarks, so the outputs do not reduce to the inputs by construction and no circularity arises.
Axiom & Free-Parameter Ledger
free parameters (2)
- model sizes (2B-27B)
- training resolutions (224/448/896 px)
axioms (1)
- domain assumption: the SigLIP-So400m vision encoder remains effective when paired with larger Gemma 2 language models
Forward citations
Cited by 20 Pith papers
- TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
  TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
- LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
  LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
- Being-H0.7: A Latent World-Action Model from Egocentric Videos
  Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
- RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology
  RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...
- SurgCheck: Do Vision-Language Models Really Look at Images in Surgical VQA?
  SurgCheck benchmark reveals that vision-language models for surgical VQA often depend on linguistic shortcuts rather than visual reasoning, shown by consistent performance drops on less-biased questions.
- Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery
  Sentinel-VLA introduces a metacognitive VLA model with a sentinel module for real-time status monitoring, dynamic reasoning, and error recovery, plus a self-evolving continual learning method, raising real-world task ...
- Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation
  MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.
- Boosting Visual Instruction Tuning with Self-Supervised Guidance
  Mixing 3-10% of visually grounded self-supervised instructions into visual instruction tuning consistently boosts MLLM performance on vision-centric benchmarks.
- When Meaning Isn't Literal: Exploring Idiomatic Meaning Across Languages and Modalities
  Presents the Mediom multilingual multimodal idiom corpus and the HIDE hinting-based framework to benchmark and improve AI comprehension of figurative meanings across languages.
- AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement
  AnySlot decouples language grounding from low-level control by inserting an explicit visual goal image, yielding better zero-shot performance on precise slot placement tasks than flat VLA policies.
- VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis
  VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
- InVitroVision: a Multi-Modal AI Model for Automated Description of Embryo Development using Natural Language
  InVitroVision, a fine-tuned PaliGemma-2 model, generates natural language descriptions of embryo development and outperforms ChatGPT 5.2 and base models on a public time-lapse dataset.
- EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation
  EgoMotion decouples reasoning from motion synthesis in egocentric vision-language tasks by mapping inputs to motion primitives via VLM then using diffusion to produce grounded and coherent 3D trajectories.
- ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning
  ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.
- HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
  HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.
- HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
  HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.
- Anthropogenic Regional Adaptation in Multimodal Vision-Language Model
  Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.
- SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
  SpatialVLA adds 3D-aware position encoding and adaptive discretized action grids to visual-language-action models, enabling strong zero-shot performance and fine-tuning on new robot setups after pre-training on 1.1 mi...
- OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL
  OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.
- SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
  SigLIP 2 models trained with a unified recipe of captioning, self-supervised losses, and curated diverse data outperform prior SigLIP versions on classification, retrieval, localization, dense prediction, and multilin...