arXiv: 2302.14045 · v2 · submitted 2023-02-27 · cs.CL · cs.CV

Language Is Not All You Need: Aligning Perception with Language Models

Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, Furu Wei

Pith reviewed 2026-05-15 18:28 UTC · model grok-4.3

keywords: multimodal large language models · zero-shot multimodal learning · few-shot in-context learning · cross-modal transfer · image captioning · visual question answering · OCR-free document understanding

The pith

Kosmos-1 learns perception and language jointly from web-scale interleaved text and images, then performs zero-shot and few-shot tasks across modalities without any finetuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Kosmos-1 as a single transformer-based model trained from scratch on massive collections of mixed text-image sequences, image captions, and plain text. It shows that this training produces a model capable of language understanding and generation, direct image input for OCR-free document tasks, image captioning, visual question answering, multimodal dialogue, and even image classification when the class is specified only in text instructions. The central argument is that broad exposure to aligned multimodal data enables in-context learning and cross-modal knowledge transfer, so the same weights support both language-only and vision-language work. A new Raven-style IQ test dataset is introduced to measure nonverbal reasoning in such models.

Core claim

Kosmos-1 is a Multimodal Large Language Model trained on web-scale multimodal corpora containing arbitrarily interleaved text and images, image-caption pairs, and text data; the resulting model achieves strong zero-shot and few-shot performance on language tasks, perception-language tasks, and vision tasks specified via text, with no gradient updates or task-specific finetuning required.

What carries the argument

Training a transformer on arbitrarily interleaved text-image sequences so that the same parameters support in-context learning across modalities.
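
A minimal sketch of what "arbitrarily interleaved" input could look like once flattened into a single sequence. The `<s>`, `</s>`, `<image>` and `</image>` markers, the `Image` class, and the `flatten` helper are illustrative assumptions, not the paper's tokenizer; a real model would fill each image slot with encoder embeddings rather than a string placeholder.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Image:
    """Hypothetical stand-in for an image; a real pipeline would hold pixels."""
    name: str


Segment = Union[str, Image]


def flatten(segments: List[Segment]) -> List[str]:
    """Flatten interleaved text/image segments into one token list.

    Text is split on whitespace (a toy tokenizer); each image becomes a
    bracketed placeholder slot that an image encoder would later fill.
    """
    tokens = ["<s>"]
    for seg in segments:
        if isinstance(seg, Image):
            tokens += ["<image>", f"[embedding:{seg.name}]", "</image>"]
        else:
            tokens += seg.split()
    tokens.append("</s>")
    return tokens


doc = ["A photo of", Image("cat.jpg"), "sitting on a mat."]
print(flatten(doc))
```

Because text and image slots share one sequence, the same next-token objective can in principle cover both, which is the property the argument depends on.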

If this is right

Knowledge transfers in both directions: language pretraining improves multimodal performance and multimodal training improves language performance.
Document images can be fed directly for OCR-free NLP tasks such as question answering or summarization.
Image recognition can be performed by supplying only a textual description of the desired classes.
A single set of weights can handle multimodal dialogue that mixes text and images in the same conversation.
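
One way to picture recognition from textual descriptions: score each class description against the image and keep the best one. This is a sketch under stated assumptions; `score` is a hypothetical hook standing in for a model's likelihood of the text given the image, and the toy scorer below exists only so the example runs.

```python
from typing import Callable, Dict


def classify_by_description(
    image_id: str,
    descriptions: Dict[str, str],
    score: Callable[[str, str], float],
) -> str:
    """Return the class whose textual description scores highest for the image.

    `score(image_id, text)` is a hypothetical hook, not a real model API.
    """
    return max(descriptions, key=lambda label: score(image_id, descriptions[label]))


# Toy scorer for demonstration only: counts words shared with a fake caption.
FAKE_CAPTIONS = {"img_001": "small bird with a red crest and black face"}


def toy_score(image_id: str, text: str) -> float:
    caption_words = set(FAKE_CAPTIONS[image_id].split())
    return float(len(caption_words & set(text.lower().split())))


descriptions = {
    "cardinal": "a bird with a red crest and a black face",
    "blue jay": "a bird with blue wings and a white chest",
}
print(classify_by_description("img_001", descriptions, toy_score))
```

The class set lives entirely in the `descriptions` dictionary, so adding or changing a class needs no retraining, only new text.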

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach suggests that separate vision encoders may become unnecessary if interleaved training data is large enough.
Similar training recipes could be tested on video or audio sequences to check whether the same model architecture scales to additional modalities.
The Raven IQ dataset provides a concrete way to compare nonverbal reasoning across future multimodal models without relying on language mediation.

Load-bearing premise

Web-scale multimodal data already contains enough aligned signal that one model can acquire general cross-modal capabilities that transfer to new tasks without any adaptation.

What would settle it

The claim would fail if Kosmos-1 scored no higher than a text-only language model on visual question answering when images are provided as input.
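
The test above can be sketched as a direct accuracy comparison. The function names, the exact-match scoring rule, and the toy predictions are assumptions for illustration; real VQA benchmarks use their own answer normalization and scoring.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Exact-match accuracy after trimming whitespace and lowercasing."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal length")
    correct = sum(
        p.strip().lower() == a.strip().lower() for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


def image_input_helps(
    multimodal_preds: Sequence[str],
    text_only_preds: Sequence[str],
    answers: Sequence[str],
    margin: float = 0.0,
) -> bool:
    """True if the multimodal model beats the text-only baseline by more than `margin`."""
    gap = accuracy(multimodal_preds, answers) - accuracy(text_only_preds, answers)
    return gap > margin


# Toy data, invented for the example.
answers = ["red", "two", "dog", "yes"]
multimodal = ["red", "two", "cat", "yes"]
text_only = ["blue", "two", "dog", "no"]
print(image_input_helps(multimodal, text_only, answers))
```

A nonzero `margin` is worth choosing in advance, since a small gap on a small test set may not be meaningful.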

Original abstract

A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language. In addition, we introduce a dataset of Raven IQ test, which diagnoses the nonverbal reasoning capability of MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Kosmos-1 shows that web-scale interleaved multimodal training can produce usable zero-shot transfer across language, vision, and reasoning tasks, but the claims rest on unquantified results and unaddressed benchmark overlap.

Kosmos-1 is trained from scratch on a mix of arbitrarily interleaved text-image data, image-caption pairs, and plain text. It then handles zero-shot and few-shot prompts on language understanding, OCR-free document tasks, VQA, captioning, multimodal dialogue, and even image recognition when the class is described in text. The authors also release a Raven-style IQ dataset to probe nonverbal reasoning in these models.

The cross-modal transfer experiments are the clearest addition: language pretraining helps multimodal performance, and multimodal data helps language tasks in return. That setup is cleaner than the usual pipeline of a separate vision tower plus an LLM. The Raven dataset itself is new and could be picked up by others testing reasoning.

The main weakness is that the abstract and summary give no numbers, baselines, or error bars, so it is impossible to judge whether the results move the needle or simply reflect scale. Web-scale corpora are known to contain near-duplicates of standard VQA and captioning test sets; without overlap statistics or decontamination steps, the zero-shot numbers could be driven by memorization rather than genuine alignment. The provided text is also thin on training details, namely how interleaved data is sampled and how long the model is trained.

This paper is aimed at groups that are already scaling LLMs and want to add perception without new architectures. Readers who need concrete numbers or tight controls on data leakage will find it frustrating until the full evaluation section is checked. It is worth sending to peer review because the model, the interleaved training mixture, and the Raven dataset are concrete new artifacts; a referee can ask for the missing metrics and contamination analysis without rejecting the core direction outright.

Referee Report

3 major / 3 minor

Summary. The paper introduces Kosmos-1, a multimodal large language model trained from scratch on web-scale corpora of interleaved text-image data, image-caption pairs, and text. It claims strong zero-shot and few-shot performance (via in-context learning and instruction following) across language understanding/generation, OCR-free document tasks, multimodal dialogue, image captioning, VQA, and vision tasks such as instruction-based image recognition, without any gradient updates or finetuning. The work also reports cross-modal transfer benefits and introduces a new Raven-style IQ test dataset to diagnose nonverbal reasoning in MLLMs.

Significance. If the empirical claims hold after addressing evaluation details, the results would demonstrate that web-scale aligned multimodal pretraining can produce general cross-modal capabilities that transfer to held-out tasks, supporting the broader thesis that perception-language alignment is a key step toward AGI-like convergence of modalities. The new Raven IQ dataset adds a useful diagnostic for nonverbal reasoning that is not language-mediated.

major comments (3)

[§4 and §5] §4 (Experimental Setup) and §5 (Results): The central performance claims (e.g., zero-shot VQA, captioning, OCR-free NLP) are presented without explicit quantitative tables showing exact metrics, standard baselines (Flamingo, BLIP-2, etc.), error bars, or data exclusion criteria. This makes it impossible to judge whether the reported gains reflect genuine generalization or post-hoc prompt selection.
[§3.2 and §5.3] §3.2 (Training Data) and §5.3 (Cross-modal Transfer): No decontamination statistics or overlap analysis are provided between the web-scale training corpora and common evaluation benchmarks (VQA v2, COCO, Raven matrices). Given that web data frequently contains near-duplicates of these benchmarks, the zero-shot transfer claims rest on an untested assumption that performance arises from alignment rather than memorization.
[§6] §6 (Raven IQ Dataset): The new dataset is introduced as a diagnostic for nonverbal reasoning, but the paper provides no details on construction protocol, human validation, or controls for language leakage (e.g., textual descriptions of matrices). This is load-bearing for the claim that MLLMs can be evaluated on purely perceptual reasoning.
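
The overlap analysis requested in the second major comment is often approximated with an n-gram check between benchmark items and training documents. The sketch below is a text-only illustration under stated assumptions: the helper names, the whitespace tokenizer, and the window size are all hypothetical, and image near-duplicates would need a separate method such as perceptual hashing.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams; empty if the text has fewer than n words."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(benchmark_item: str, training_docs: Iterable[str], n: int = 8) -> bool:
    """True if any n-gram of the benchmark item also occurs in a training document."""
    target = ngrams(benchmark_item, n)
    return any(target & ngrams(doc, n) for doc in training_docs)


# Toy corpus and benchmark, invented for the example.
training_docs: List[str] = [
    "on the web someone asked what color is the fire hydrant in the picture",
    "unrelated text about cooking pasta",
]
benchmark = [
    "what color is the fire hydrant",
    "how many giraffes are standing near the tree",
]
flags = [is_contaminated(item, training_docs, n=4) for item in benchmark]
print(flags, sum(flags) / len(flags))
```

Reporting the flagged fraction per benchmark, plus scores on the unflagged subset, would let a reader separate memorization from transfer.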

minor comments (3)

[Figure 1 and §3.1] Figure 1 and §3.1: The model architecture diagram and description of the visual encoder + LLM integration use inconsistent notation for the special tokens (e.g., <image> vs. [IMG]); standardize and add a precise tokenization equation.
[§5] Throughout §5: All reported numbers should include the exact prompt templates used and the number of in-context examples; several tables omit these details, reducing reproducibility.
[Related Work] Related Work: Add explicit comparison to concurrent MLLMs (Flamingo, PaLM-E) in the introduction and results tables rather than only in passing.
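
The reproducibility request in the second minor comment amounts to publishing the prompt template and the shot count alongside each number. A minimal sketch, assuming a hypothetical question-answer template and omitting image slots for brevity:

```python
from typing import List, Tuple

# Hypothetical template; the paper's actual prompts are not given in this summary.
TEMPLATE = "Question: {question} Answer: {answer}"


def build_prompt(examples: List[Tuple[str, str]], query: str, k: int) -> str:
    """Build a k-shot prompt; k=0 yields a zero-shot prompt with only the query."""
    if k < 0 or k > len(examples):
        raise ValueError("k must be between 0 and the number of available examples")
    lines = [TEMPLATE.format(question=q, answer=a) for q, a in examples[:k]]
    # The final line leaves the answer blank for the model to complete.
    lines.append(TEMPLATE.format(question=query, answer="").rstrip())
    return "\n".join(lines)


examples = [
    ("What color is the bus?", "yellow"),
    ("How many dogs are there?", "two"),
]
print(build_prompt(examples, "What is the man holding?", k=2))
```

Logging `TEMPLATE` and `k` with every reported score makes it possible to rerun an evaluation exactly.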

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and describe the revisions we will make to strengthen the manuscript.

Referee: [§4 and §5] §4 (Experimental Setup) and §5 (Results): The central performance claims (e.g., zero-shot VQA, captioning, OCR-free NLP) are presented without explicit quantitative tables showing exact metrics, standard baselines (Flamingo, BLIP-2, etc.), error bars, or data exclusion criteria. This makes it impossible to judge whether the reported gains reflect genuine generalization or post-hoc prompt selection.

Authors: We agree that the results section would benefit from clearer quantitative presentation. In the revised manuscript we will add comprehensive tables in §5 that report exact metrics for every task, direct comparisons to standard baselines including Flamingo and BLIP-2, and explicit statements of evaluation protocols and any data exclusion criteria. Because the evaluations are single-run zero-shot and few-shot settings, we will note the absence of error bars and describe our prompt selection procedure to address concerns about post-hoc tuning.
revision: yes
Referee: [§3.2 and §5.3] §3.2 (Training Data) and §5.3 (Cross-modal Transfer): No decontamination statistics or overlap analysis are provided between the web-scale training corpora and common evaluation benchmarks (VQA v2, COCO, Raven matrices). Given that web data frequently contains near-duplicates of these benchmarks, the zero-shot transfer claims rest on an untested assumption that performance arises from alignment rather than memorization.

Authors: We acknowledge the importance of decontamination for zero-shot claims. Given the scale of the training corpora, exhaustive overlap analysis is computationally prohibitive; however, we will add a new subsection in §3.2 describing the filtering steps we applied to remove known benchmark duplicates and will report any available overlap statistics. We will also clarify that the Raven dataset is newly constructed and therefore free of training overlap. While we cannot provide a complete decontamination audit, the cross-modal transfer results and performance on held-out tasks support generalization beyond simple memorization.
revision: partial
Referee: [§6] §6 (Raven IQ Dataset): The new dataset is introduced as a diagnostic for nonverbal reasoning, but the paper provides no details on construction protocol, human validation, or controls for language leakage (e.g., textual descriptions of matrices). This is load-bearing for the claim that MLLMs can be evaluated on purely perceptual reasoning.

Authors: We thank the referee for highlighting this gap. In the revised §6 we will provide a detailed construction protocol, including how the matrices were procedurally generated, the human validation process used to ensure quality and correctness, and explicit controls against language leakage (e.g., matrices are presented purely visually with no accompanying textual descriptions during evaluation).
revision: yes
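
On the error-bar point in the first exchange: a single evaluation run can still yield an interval by resampling test examples. The sketch below uses a percentile bootstrap, which is a standard technique swapped in here for illustration and not something the paper or rebuttal describes; the function name, defaults, and toy outcomes are assumptions.

```python
import random
from typing import List, Tuple


def bootstrap_ci(
    correct: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for accuracy from per-example 0/1 outcomes."""
    if not correct:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)
    n = len(correct)
    # Resample the test set with replacement and record each resample's accuracy.
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy outcomes: 70 correct answers out of 100, invented for the example.
outcomes = [1] * 70 + [0] * 30
print(bootstrap_ci(outcomes))
```

This captures sampling noise in the test set only; it says nothing about variance from prompt choice or in-context example order.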

Circularity Check

0 steps flagged

No circularity in derivation or claims

The paper introduces Kosmos-1 via training on web-scale multimodal corpora and reports empirical zero-shot/few-shot results on external benchmarks (VQA, captioning, OCR-free NLP, Raven IQ). No equations, derivations, or load-bearing steps are present that reduce by construction to fitted inputs, self-definitions, or self-citation chains. All performance claims rest on held-out task evaluations rather than internal reparameterization or renamed patterns. The central premise of cross-modal transfer is an empirical observation, not a mathematical identity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that large-scale web data supplies aligned multimodal signal sufficient for generalization; standard transformer architecture and optimization assumptions are inherited from prior LLM work.

free parameters (1)

model scale and training hyperparameters
Architecture size, learning rate schedule, and data mixture ratios are chosen to fit available compute and are not derived from first principles.

axioms (1)

domain assumption: Web-scale interleaved text-image corpora contain sufficient cross-modal alignments to induce general perception-language capabilities.
Invoked in the training description to justify zero-shot transfer.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel unclear

We train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning.
IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced unclear

Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions.
IndisputableMonolith.Foundation.DimensionForcing dimension_forced unclear

In addition, we introduce a dataset of Raven IQ test, which diagnoses the nonverbal reasoning capability of MLLMs.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
cs.CV 2024-07 unverdicted novelty 7.0

LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...
3D-VLA: A 3D Vision-Language-Action Generative World Model
cs.CV 2024-03 unverdicted novelty 7.0

3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
VideoChat: Chat-Centric Video Understanding
cs.CV 2023-05 conditional novelty 7.0

VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
Visual Instruction Tuning
cs.CV 2023-04 unverdicted novelty 7.0

LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
cs.CV 2023-03 conditional novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
Retentive Network: A Successor to Transformer for Large Language Models
cs.CL 2023-07 unverdicted novelty 6.0

RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
cs.CV 2023-07 unverdicted novelty 6.0

InternVid supplies 7M videos and LLM captions to train ViCLIP, which reaches leading zero-shot action recognition and competitive retrieval performance.
Kosmos-2: Grounding Multimodal Large Language Models to the World
cs.CL 2023-06 unverdicted novelty 6.0

Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
cs.CV 2023-06 unverdicted novelty 6.0

MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
cs.CV 2023-04 conditional novelty 6.0

MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, cr...
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
cs.CL 2023-03 unverdicted novelty 6.0

HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
cs.CV 2023-03 unverdicted novelty 6.0

MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.
Qwen2.5-Omni Technical Report
cs.CL 2025-03 conditional novelty 5.0

Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text perfo...
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
cs.CL 2023-11 unverdicted novelty 5.0

The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
cs.CV 2023-09 conditional novelty 4.0

GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.
The Rise and Potential of Large Language Model Based Agents: A Survey
cs.AI 2023-09 accept novelty 4.0

The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
cs.CV 2023-08 unverdicted novelty 4.0

OpenFlamingo provides open-source autoregressive vision-language models that achieve 80-89% of Flamingo performance on seven vision-language datasets.
Large Language Models: A Survey
cs.CL 2024-02 accept novelty 3.0

The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
A Survey of Large Language Models
cs.CL 2023-03 accept novelty 3.0

This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
