pith. machine review for the scientific record.

arxiv: 2302.14045 · v2 · submitted 2023-02-27 · 💻 cs.CL · cs.CV

Recognition: 3 theorem links

Language Is Not All You Need: Aligning Perception with Language Models

Pith reviewed 2026-05-15 18:28 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords multimodal large language models · zero-shot multimodal learning · few-shot in-context learning · cross-modal transfer · image captioning · visual question answering · OCR-free document understanding

The pith

Kosmos-1 learns perception and language jointly from web-scale interleaved text and images, then performs zero-shot and few-shot tasks across modalities without any finetuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Kosmos-1 as a single transformer-based model trained from scratch on massive collections of mixed text-image sequences, image captions, and plain text. It shows that this training produces a model capable of language understanding and generation, direct image input for OCR-free document tasks, image captioning, visual question answering, multimodal dialogue, and even image classification when the class is specified only in text instructions. The central argument is that broad exposure to aligned multimodal data enables in-context learning and cross-modal knowledge transfer, so the same weights support both language-only and vision-language work. A new Raven-style IQ test dataset is introduced to measure nonverbal reasoning in such models.

Core claim

Kosmos-1 is a Multimodal Large Language Model trained on web-scale multimodal corpora containing arbitrarily interleaved text and images, image-caption pairs, and text data; the resulting model achieves strong zero-shot and few-shot performance on language tasks, perception-language tasks, and vision tasks specified via text, with no gradient updates or task-specific finetuning required.

What carries the argument

Training a transformer on arbitrarily interleaved text-image sequences so that the same parameters support in-context learning across modalities.
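
As a rough illustration of what "arbitrarily interleaved" means at the input level, the sketch below flattens a document of mixed text and image segments into one sequence that a decoder-only transformer could consume. The `<s>`, `</s>`, `<image>`, and `</image>` markers, the whitespace tokenizer, the fixed number of embedding slots per image, and the text-only loss mask are all assumptions made for illustration; the summary above does not specify them.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Text:
    content: str


@dataclass
class Image:
    path: str  # hypothetical reference to an image file


Segment = Union[Text, Image]

# Assumed number of embedding positions reserved per image (illustrative only).
SLOTS_PER_IMAGE = 4


def flatten(segments: List[Segment]) -> List[str]:
    """Flatten interleaved text/image segments into a single token sequence."""
    tokens = ["<s>"]
    for seg in segments:
        if isinstance(seg, Text):
            # Whitespace split stands in for a real subword tokenizer.
            tokens.extend(seg.content.split())
        else:
            # Placeholders mark where image embeddings would be spliced in.
            tokens.append("<image>")
            tokens.extend(
                f"<img_slot:{seg.path}:{i}>" for i in range(SLOTS_PER_IMAGE)
            )
            tokens.append("</image>")
    tokens.append("</s>")
    return tokens


def loss_mask(tokens: List[str]) -> List[int]:
    """Assumed mask: 1 for discrete tokens, 0 for image embedding slots."""
    return [0 if t.startswith("<img_slot:") else 1 for t in tokens]
```

Because text and image positions share one sequence, a few-shot prompt is just a longer list of segments: example image, example answer, query image, and so on.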

If this is right

  • Knowledge transfers in both directions: language pretraining improves multimodal performance and multimodal training improves language performance.
  • Document images can be fed directly for OCR-free NLP tasks such as question answering or summarization.
  • Image recognition can be performed by supplying only a textual description of the desired classes.
  • A single set of weights can handle multimodal dialogue that mixes text and images in the same conversation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach suggests that separate vision encoders may become unnecessary if interleaved training data is large enough.
  • Similar training recipes could be tested on video or audio sequences to check whether the same model architecture scales to additional modalities.
  • The Raven IQ dataset provides a concrete way to compare nonverbal reasoning across future multimodal models without relying on language mediation.

Load-bearing premise

Web-scale multimodal data already contains enough aligned signal that one model can acquire general cross-modal capabilities that transfer to new tasks without any adaptation.

What would settle it

Kosmos-1 scores no higher than a text-only language model on visual question answering when images are provided as input.

original abstract

A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language. In addition, we introduce a dataset of Raven IQ test, which diagnoses the nonverbal reasoning capability of MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces Kosmos-1, a multimodal large language model trained from scratch on web-scale corpora of interleaved text-image data, image-caption pairs, and text. It claims strong zero-shot and few-shot performance (via in-context learning and instruction following) across language understanding/generation, OCR-free document tasks, multimodal dialogue, image captioning, VQA, and vision tasks such as instruction-based image recognition, without any gradient updates or finetuning. The work also reports cross-modal transfer benefits and introduces a new Raven-style IQ test dataset to diagnose nonverbal reasoning in MLLMs.

Significance. If the empirical claims hold after addressing evaluation details, the results would demonstrate that web-scale aligned multimodal pretraining can produce general cross-modal capabilities that transfer to held-out tasks, supporting the broader thesis that perception-language alignment is a key step toward AGI-like convergence of modalities. The new Raven IQ dataset adds a useful diagnostic for nonverbal reasoning that is not language-mediated.

major comments (3)
  1. [§4 and §5] §4 (Experimental Setup) and §5 (Results): The central performance claims (e.g., zero-shot VQA, captioning, OCR-free NLP) are presented without explicit quantitative tables showing exact metrics, standard baselines (Flamingo, BLIP-2, etc.), error bars, or data exclusion criteria. This makes it impossible to judge whether the reported gains reflect genuine generalization or post-hoc prompt selection.
  2. [§3.2 and §5.3] §3.2 (Training Data) and §5.3 (Cross-modal Transfer): No decontamination statistics or overlap analysis are provided between the web-scale training corpora and common evaluation benchmarks (VQA v2, COCO, Raven matrices). Given that web data frequently contains near-duplicates of these benchmarks, the zero-shot transfer claims rest on an untested assumption that performance arises from alignment rather than memorization.
  3. [§6] §6 (Raven IQ Dataset): The new dataset is introduced as a diagnostic for nonverbal reasoning, but the paper provides no details on construction protocol, human validation, or controls for language leakage (e.g., textual descriptions of matrices). This is load-bearing for the claim that MLLMs can be evaluated on purely perceptual reasoning.
minor comments (3)
  1. [Figure 1 and §3.1] Figure 1 and §3.1: The model architecture diagram and description of the visual encoder + LLM integration use inconsistent notation for the special tokens (e.g., <image> vs. [IMG]); standardize and add a precise tokenization equation.
  2. [§5] Throughout §5: All reported numbers should include the exact prompt templates used and the number of in-context examples; several tables omit these details, reducing reproducibility.
  3. [Related Work] Related Work: Add explicit comparison to concurrent MLLMs (Flamingo, PaLM-E) in the introduction and results tables rather than only in passing.
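
Major comment 1 asks for error bars. One common way to attach uncertainty to a single-run evaluation is a percentile bootstrap over the evaluation examples, sketched below. `bootstrap_ci` is a hypothetical helper and its defaults (2000 resamples, a 95% interval) are illustrative choices, not something the paper or the report prescribes. The interval reflects sampling variability of the evaluation set only; it says nothing about variance across prompts or training runs.

```python
import random
from typing import List, Tuple


def bootstrap_ci(
    correct: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) from per-example 0/1 correctness."""
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample the evaluation set with replacement and rescore it.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(correct) / n, lower, upper
```

Calling `bootstrap_ci(per_example_scores)` on the 0/1 outcomes of one benchmark run yields the point estimate together with an interval that can be reported next to it in a results table.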

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and describe the revisions we will make to strengthen the manuscript.

point-by-point responses
  1. Referee: [§4 and §5] §4 (Experimental Setup) and §5 (Results): The central performance claims (e.g., zero-shot VQA, captioning, OCR-free NLP) are presented without explicit quantitative tables showing exact metrics, standard baselines (Flamingo, BLIP-2, etc.), error bars, or data exclusion criteria. This makes it impossible to judge whether the reported gains reflect genuine generalization or post-hoc prompt selection.

    Authors: We agree that the results section would benefit from clearer quantitative presentation. In the revised manuscript we will add comprehensive tables in §5 that report exact metrics for every task, direct comparisons to standard baselines including Flamingo and BLIP-2, and explicit statements of evaluation protocols and any data exclusion criteria. Because the evaluations are single-run zero-shot and few-shot settings, we will note the absence of error bars and describe our prompt selection procedure to address concerns about post-hoc tuning.
    revision: yes

  2. Referee: [§3.2 and §5.3] §3.2 (Training Data) and §5.3 (Cross-modal Transfer): No decontamination statistics or overlap analysis are provided between the web-scale training corpora and common evaluation benchmarks (VQA v2, COCO, Raven matrices). Given that web data frequently contains near-duplicates of these benchmarks, the zero-shot transfer claims rest on an untested assumption that performance arises from alignment rather than memorization.

    Authors: We acknowledge the importance of decontamination for zero-shot claims. Given the scale of the training corpora, exhaustive overlap analysis is computationally prohibitive; however, we will add a new subsection in §3.2 describing the filtering steps we applied to remove known benchmark duplicates and will report any available overlap statistics. We will also clarify that the Raven dataset is newly constructed and therefore free of training overlap. While we cannot provide a complete decontamination audit, the cross-modal transfer results and performance on held-out tasks support generalization beyond simple memorization.
    revision: partial

  3. Referee: [§6] §6 (Raven IQ Dataset): The new dataset is introduced as a diagnostic for nonverbal reasoning, but the paper provides no details on construction protocol, human validation, or controls for language leakage (e.g., textual descriptions of matrices). This is load-bearing for the claim that MLLMs can be evaluated on purely perceptual reasoning.

    Authors: We thank the referee for highlighting this gap. In the revised §6 we will provide a detailed construction protocol, including how the matrices were procedurally generated, the human validation process used to ensure quality and correctness, and explicit controls against language leakage (e.g., matrices are presented purely visually with no accompanying textual descriptions during evaluation).
    revision: yes
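
The overlap analysis discussed in response 2 is often approximated with word n-gram matching between benchmark items and training text. The sketch below shows one such approximation under stated assumptions: the 8-word window, the lowercase word normalization, and the function names are illustrative choices, not the authors' filtering procedure. It covers text fields only (questions, captions); image near-duplicates would need a separate perceptual-hash or embedding comparison.

```python
import re
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Lowercased word n-grams of a string (empty if shorter than n words)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(training_docs: Iterable[str], n: int = 8) -> Set[NGram]:
    """Collect every n-gram seen in the training text."""
    index: Set[NGram] = set()
    for doc in training_docs:
        index |= ngrams(doc, n)
    return index


def contaminated(items: List[str], index: Set[NGram], n: int = 8) -> List[int]:
    """Indices of benchmark items sharing at least one n-gram with training."""
    return [i for i, item in enumerate(items) if ngrams(item, n) & index]


def contamination_rate(items: List[str], index: Set[NGram], n: int = 8) -> float:
    """Fraction of benchmark items flagged as overlapping."""
    return len(contaminated(items, index, n)) / len(items) if items else 0.0
```

At web scale the in-memory set would typically be replaced by a hashed or sharded index, which is an engineering cost but not an exhaustive pairwise comparison. Reporting benchmark scores separately for flagged and unflagged items is one way to show whether memorization drives the results.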

Circularity Check

0 steps flagged

No circularity in derivation or claims

full rationale

The paper introduces Kosmos-1 via training on web-scale multimodal corpora and reports empirical zero-shot/few-shot results on external benchmarks (VQA, captioning, OCR-free NLP, Raven IQ). No equations, derivations, or load-bearing steps are present that reduce by construction to fitted inputs, self-definitions, or self-citation chains. All performance claims rest on held-out task evaluations rather than internal reparameterization or renamed patterns. The central premise of cross-modal transfer is an empirical observation, not a mathematical identity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that large-scale web data supplies aligned multimodal signal sufficient for generalization; standard transformer architecture and optimization assumptions are inherited from prior LLM work.

free parameters (1)
  • model scale and training hyperparameters
    Architecture size, learning rate schedule, and data mixture ratios are chosen to fit available compute and are not derived from first principles.
axioms (1)
  • domain assumption: Web-scale interleaved text-image corpora contain sufficient cross-modal alignments to induce general perception-language capabilities.
    Invoked in the training description to justify zero-shot transfer.

pith-pipeline@v0.9.0 · 5601 in / 1260 out tokens · 59454 ms · 2026-05-15T18:28:25.888757+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning.

  • IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced · unclear

    Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions.

  • IndisputableMonolith.Foundation.DimensionForcing · dimension_forced · unclear

    In addition, we introduce a dataset of Raven IQ test, which diagnoses the nonverbal reasoning capability of MLLMs.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    cs.CV 2024-07 unverdicted novelty 7.0

    LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...

  2. 3D-VLA: A 3D Vision-Language-Action Generative World Model

    cs.CV 2024-03 unverdicted novelty 7.0

    3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.

  3. VideoChat: Chat-Centric Video Understanding

    cs.CV 2023-05 conditional novelty 7.0

    VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.

  4. Visual Instruction Tuning

    cs.CV 2023-04 unverdicted novelty 7.0

    LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

  5. LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    cs.CV 2023-03 conditional novelty 7.0

    LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

  6. Retentive Network: A Successor to Transformer for Large Language Models

    cs.CL 2023-07 unverdicted novelty 6.0

    RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.

  7. InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

    cs.CV 2023-07 unverdicted novelty 6.0

    InternVid supplies 7M videos and LLM captions to train ViCLIP, which reaches leading zero-shot action recognition and competitive retrieval performance.

  8. Kosmos-2: Grounding Multimodal Large Language Models to the World

    cs.CL 2023-06 unverdicted novelty 6.0

    Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.

  9. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    cs.CV 2023-06 unverdicted novelty 6.0

    MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.

  10. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    cs.CV 2023-04 conditional novelty 6.0

    MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, cr...

  11. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    cs.CL 2023-03 unverdicted novelty 6.0

    HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.

  12. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    cs.CV 2023-03 unverdicted novelty 6.0

    MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.

  13. Qwen2.5-Omni Technical Report

    cs.CL 2025-03 conditional novelty 5.0

    Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text perfo...

  14. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    cs.CL 2023-11 unverdicted novelty 5.0

    The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

  15. The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

    cs.CV 2023-09 conditional novelty 4.0

    GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.

  16. The Rise and Potential of Large Language Model Based Agents: A Survey

    cs.AI 2023-09 accept novelty 4.0

    The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.

  17. OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    cs.CV 2023-08 unverdicted novelty 4.0

    OpenFlamingo provides open-source autoregressive vision-language models that achieve 80-89% of Flamingo performance on seven vision-language datasets.

  18. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

  19. A Survey of Large Language Models

    cs.CL 2023-03 accept novelty 3.0

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
