Pith · machine review for the scientific record

arxiv: 2305.03726 · v2 · submitted 2023-05-05 · 💻 cs.CV · cs.CL

Recognition: 2 theorem links · Lean Theorem

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:40 UTC · model grok-4.3

classification: 💻 cs.CV · cs.CL
keywords: Otter model · MIMIC-IT dataset · multi-modal instruction tuning · in-context learning · Flamingo architecture · video understanding · visual assistants · multi-image reasoning

The pith

Otter improves multi-modal instruction following by training on in-context examples from both text and images or videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Otter, a model that extends Flamingo by incorporating textual and visual in-context examples during instruction tuning. This setup uses the MIMIC-IT dataset of over three million multi-modal instruction-response pairs to train the model for general-purpose visual assistance. The central idea is that providing diverse in-context examples across images and videos helps the model converge faster and generalize to new tasks involving multiple images or dynamic video. If this holds, models could handle complex visual queries more reliably without relying solely on single-instruction training. The work targets the gap where prior models did not systematically use multi-modal contexts for better instruction adherence.

Core claim

Otter builds on the Flamingo Perceiver architecture and is instruction-tuned on the MIMIC-IT dataset, which supplies diverse in-context examples across text, multiple images, and video; this training substantially improves convergence speed and generalization on tasks that require understanding complex video content and multi-image scenarios.

What carries the argument

The MIMIC-IT dataset and its per-entry curation of in-context multi-modal examples, which the model uses during instruction tuning to process combined text, image, and video inputs.
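
To make the in-context structure concrete, below is a minimal sketch of what a MIMIC-IT-style training sample might look like once flattened into the interleaved stream a Flamingo-style model consumes. The field names, the "<image>" placeholder token, and the prompt template are illustrative assumptions, not the dataset's actual schema or Otter's released format.

```python
# Illustrative only: field names, the "<image>" placeholder, and the prompt
# template are assumptions, not the actual MIMIC-IT schema.
from dataclasses import dataclass
from typing import List

@dataclass
class ContextExample:
    image_ids: List[str]   # context images or sampled video frames
    instruction: str       # in-context instruction
    response: str          # in-context answer the model can imitate

@dataclass
class InContextSample:
    context: List[ContextExample]  # zero or more worked examples
    query_image_ids: List[str]     # visuals for the target query
    query_instruction: str         # instruction the model must answer
    target_response: str           # supervision applies to this span only

def render(sample: InContextSample) -> str:
    """Flatten a sample into one interleaved text stream; "<image>" marks
    where visual tokens would be spliced in by the vision pathway."""
    parts = []
    for ex in sample.context:
        parts.append("<image>" * len(ex.image_ids))
        parts.append(f"User: {ex.instruction} Assistant: {ex.response}")
    parts.append("<image>" * len(sample.query_image_ids))
    parts.append(f"User: {sample.query_instruction} Assistant: {sample.target_response}")
    return " ".join(parts)
```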

If this is right

  • Otter processes instructions that combine text with multiple images or video clips without separate handling steps.
  • Training convergence accelerates when in-context examples accompany each instruction.
  • Performance rises on video understanding and multi-image reasoning benchmarks.
  • The model supports a broader range of real-world visual assistant queries than single-instruction baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar in-context tuning could be applied to other base multi-modal architectures to test if the gains transfer.
  • Dataset curation that emphasizes scenario diversity may reduce the total volume of data needed for strong multi-modal performance.
  • Future work could explore whether the same approach scales to longer video sequences or more images per context.
  • This points toward instruction tuning that treats context as a first-class input rather than an optional add-on.

Load-bearing premise

The assumption that MIMIC-IT's diverse in-context examples produce real generalization gains instead of improvements limited to the dataset's own scenarios.

What would settle it

Measure whether a version of Otter trained without in-context examples matches or exceeds the full model's accuracy on a held-out set of multi-image and video instruction tasks drawn from sources outside MIMIC-IT.
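
A minimal sketch of that comparison, assuming both trained variants expose the same generate interface and the held-out tasks carry reference answers; the model objects, task fields, and matching rule are placeholders, not an existing evaluation harness.

```python
# Hypothetical ablation harness: compare a model tuned with in-context
# examples against one tuned without them on held-out multi-image/video
# instruction tasks drawn from outside MIMIC-IT. All names are placeholders.

def accuracy(model, tasks):
    """Fraction of tasks whose generated answer matches the reference."""
    hits = 0
    for task in tasks:
        prediction = model.generate(task["visuals"], task["instruction"])
        hits += int(prediction.strip().lower() == task["answer"].strip().lower())
    return hits / len(tasks)

def settle_it(otter_full, otter_no_context, held_out_tasks):
    """The claim survives only if the in-context-tuned model clearly wins."""
    return {
        "with_in_context": accuracy(otter_full, held_out_tasks),
        "without_in_context": accuracy(otter_no_context, held_out_tasks),
    }
```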

read the original abstract

Recent advances in Large Multimodal Models (LMMs) have unveiled great potential as visual assistants. However, most existing works focus on responding to individual instructions or using previous dialogues for contextual understanding. There is little discussion on employing both images and text as in-context examples to enhance the instruction following capability. To bridge this gap, we introduce the Otter model to leverage both textual and visual in-context examples for instruction tuning. Specifically, Otter builds upon Flamingo with Perceiver architecture, and has been instruction tuned for general purpose multi-modal assistant. Otter seamlessly processes multi-modal inputs, supporting modalities including text, multiple images, and dynamic video content. To support the training of Otter, we present the MIMIC-IT (MultI-Modal In-Context Instruction Tuning) dataset, which encompasses over 3 million multi-modal instruction-response pairs, including approximately 2.2 million unique instructions across a broad spectrum of images and videos. MIMIC-IT has been carefully curated to feature a diverse array of in-context examples for each entry. Comprehensive evaluations suggest that instruction tuning with these in-context examples substantially enhances model convergence and generalization capabilities. Notably, the extensive scenario coverage provided by the MIMIC-IT dataset empowers the Otter model to excel in tasks involving complex video and multi-image understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Otter model, built on the Flamingo Perceiver architecture, and the MIMIC-IT dataset of over 3 million multi-modal instruction-response pairs featuring diverse in-context examples across images and videos. It claims that instruction tuning with these in-context examples substantially enhances convergence and generalization, enabling strong performance on complex video and multi-image tasks.

Significance. If the empirical claims hold with proper validation, the work would advance multi-modal in-context learning by releasing a large-scale dataset with broad scenario coverage and demonstrating a model that handles variable multi-image and video inputs. The dataset curation and architectural integration represent potentially useful resources for general-purpose visual assistants.

major comments (3)
  1. [Abstract] The assertion that 'instruction tuning with these in-context examples substantially enhances model convergence and generalization capabilities' is unsupported by any quantitative metrics, baselines (e.g., Flamingo), error bars, or evaluation details, so the data-to-claim link cannot be assessed.
  2. [Experiments] No ablation isolates the contribution of in-context structure (multiple images/videos per instruction) versus equivalent data volume or scenario coverage in single-instruction format, leaving dataset scale as a plausible alternative explanation for any gains.
  3. [Model Architecture] Details are missing on how variable-length visual sequences from multiple images and dynamic videos are tokenized, embedded, and attended within the existing Flamingo Perceiver without hidden limitations or modifications.
minor comments (2)
  1. [Experiments] Add explicit comparison tables against Flamingo and other LMM baselines with standard metrics (e.g., accuracy, CIDEr) and statistical significance tests.
  2. [Method] Clarify notation for in-context example formatting and provide pseudocode for how multi-modal sequences are constructed during training.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications from the manuscript and indicating where revisions will be made to strengthen the presentation of results and technical details.

read point-by-point responses
  1. Referee: [Abstract] The assertion that 'instruction tuning with these in-context examples substantially enhances model convergence and generalization capabilities' is unsupported by any quantitative metrics, baselines (e.g., Flamingo), error bars, or evaluation details, so the data-to-claim link cannot be assessed.

    Authors: We appreciate this observation. The manuscript's Section 4 reports quantitative evaluations across multiple benchmarks, including direct comparisons to the base Flamingo model, with performance metrics on video and multi-image tasks and error bars in the tables. The abstract summarizes these findings at a high level. To make the link between data and claims explicit, we will revise the abstract to include specific quantitative improvements (e.g., accuracy gains on key tasks) supported by the experimental results. revision: yes

  2. Referee: [Experiments] No ablation isolates the contribution of in-context structure (multiple images/videos per instruction) versus equivalent data volume or scenario coverage in single-instruction format, leaving dataset scale as a plausible alternative explanation for any gains.

    Authors: This is a fair critique. The current experiments demonstrate gains from instruction tuning on MIMIC-IT but do not include an ablation that holds total data volume fixed while varying only the in-context (multi-image/video) structure versus single-instruction format. Creating a matched single-instruction control set at the same scale would require substantial additional curation. We will add a dedicated paragraph in the Experiments section acknowledging this limitation, discussing why the in-context design is central to the dataset, and noting it as an avenue for future work. revision: partial

  3. Referee: [Model Architecture] Details are missing on how variable-length visual sequences from multiple images and dynamic videos are tokenized, embedded, and attended within the existing Flamingo Perceiver without hidden limitations or modifications.

    Authors: We thank the referee for noting this gap. Otter uses the Flamingo Perceiver architecture without modifications to its core resampler or attention layers. Multiple images are encoded independently by the vision encoder and their visual tokens are concatenated into a single sequence; videos are processed by uniformly sampling frames and treating the resulting sequence as additional visual tokens. The Perceiver resampler then maps the variable-length visual sequence to a fixed set of latents for cross-attention with the language model. We will expand the Model Architecture section with a new subsection that explicitly describes the input preprocessing, tokenization, embedding, and attention flow for these variable-length cases. revision: yes
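
A minimal PyTorch-style sketch of the data flow this response describes: images encoded independently and concatenated, videos uniformly frame-sampled, and a Perceiver-style resampler mapping the variable-length visual token sequence to a fixed set of latents for cross-attention. Module names, dimensions, and the sampling policy are assumptions, not the released Otter code.

```python
# Sketch of the described flow under stated assumptions, not Otter's code.
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Maps a variable-length visual token sequence to a fixed set of latents
    via cross-attention (stand-in for Flamingo's Perceiver resampler)."""
    def __init__(self, dim=1024, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens):             # (B, N_visual, dim), N varies
        b = visual_tokens.size(0)
        latents = self.latents.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.attn(latents, visual_tokens, visual_tokens)
        return out                                 # (B, num_latents, dim) fixed

def encode_inputs(vision_encoder, resampler, images=None, video=None, num_frames=8):
    """Images are encoded independently and their tokens concatenated; videos
    are uniformly frame-sampled and treated as additional visual tokens."""
    token_chunks = []
    if images is not None:                         # (B, K, C, H, W)
        b = images.size(0)
        feats = vision_encoder(images.flatten(0, 1))       # (B*K, P, dim)
        token_chunks.append(feats.reshape(b, -1, feats.size(-1)))
    if video is not None:                          # (B, T, C, H, W)
        idx = torch.linspace(0, video.size(1) - 1, num_frames).long()
        frames = video[:, idx]
        b = frames.size(0)
        feats = vision_encoder(frames.flatten(0, 1))       # (B*F, P, dim)
        token_chunks.append(feats.reshape(b, -1, feats.size(-1)))
    assert token_chunks, "provide images and/or video"
    visual_tokens = torch.cat(token_chunks, dim=1)         # variable length
    return resampler(visual_tokens)                # fixed latents for cross-attn
```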

Circularity Check

0 steps flagged

No circularity: claims rest on new dataset and standard training

full rationale

The paper introduces Otter by extending the external Flamingo Perceiver architecture and trains it on the newly curated MIMIC-IT dataset containing multi-modal in-context examples. The central claim—that in-context instruction tuning enhances convergence and generalization—is presented as an empirical outcome of this training and subsequent evaluations, with no equations, fitted parameters, or self-citations that reduce the result to a definitional loop or renamed input. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the derivation chain; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim depends on the effectiveness of the newly curated MIMIC-IT dataset and the compatibility of the Flamingo base model with added in-context visual examples; these rest on unstated assumptions about data diversity and architecture suitability.

free parameters (2)
  • Number and selection of in-context examples per training instance
    Chosen during dataset curation and training but not quantified in the abstract.
  • Instruction tuning hyperparameters
    Standard training settings required for convergence but unspecified.
axioms (1)
  • domain assumption: The Perceiver architecture in the Flamingo base model enables seamless handling of text, multiple images, and dynamic video inputs.
    Invoked when stating that Otter processes multi-modal inputs without additional architectural changes.

pith-pipeline@v0.9.0 · 5579 in / 1213 out tokens · 31120 ms · 2026-05-15T02:40:09.150748+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  2. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  3. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  4. AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.

  5. Why Multimodal In-Context Learning Lags Behind? Unveiling the Inner Mechanisms and Bottlenecks

    cs.CV 2026-04 unverdicted novelty 7.0

    Multimodal ICL lags text-only ICL in few-shot settings due to weak cross-modal reasoning alignment and unreliable task mapping transfer, with an inference-stage method proposed to strengthen transfer.

  6. MLVU: Benchmarking Multi-task Long Video Understanding

    cs.CV 2024-06 conditional novelty 7.0

    MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

  7. SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    cs.CL 2023-07 unverdicted novelty 7.0

    SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.

  8. Evaluating Object Hallucination in Large Vision-Language Models

    cs.CV 2023-05 accept novelty 7.0

    Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.

  9. LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    cs.CV 2023-03 conditional novelty 7.0

    LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

  10. SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.

  11. R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs

    cs.CV 2026-04 conditional novelty 6.0

    R-CoV is a six-step region-aware chain-of-verification technique that elicits coordinate and description outputs from LVLMs themselves to detect and reduce object hallucinations without external models or retraining.

  12. CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.

  13. Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM

    cs.CV 2026-03 unverdicted novelty 6.0

    Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five...

  14. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    cs.CV 2023-11 conditional novelty 6.0

    A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.

  15. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    cs.CV 2023-11 unverdicted novelty 6.0

    Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

  16. Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    cs.CV 2023-06 accept novelty 6.0

    A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.

  17. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    cs.CV 2023-06 unverdicted novelty 6.0

    MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.

  18. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  19. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  20. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

  21. OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    cs.CV 2023-08 unverdicted novelty 4.0

    OpenFlamingo provides open-source autoregressive vision-language models that achieve 80-89% of Flamingo performance on seven vision-language datasets.

Reference graph

Works this paper leans on

104 extracted references · 104 canonical work pages · cited by 21 Pith papers · 28 internal anchors

  1. [1]

    https://commoncrawl.org/

    Common crawl. https://commoncrawl.org/. Accessed: 2023-11-

  2. [2]

    What learning algorithm is in-context learning? investigations with linear models

    Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? investigations with linear models. arXiv preprint arXiv:2211.15661 ,

  3. [3]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022. 1

  4. [4]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022. 3

  5. [5]

    Vqa: Visual question answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision , pages 2425–2433, 2015. 8

  6. [7]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint...

  7. [8]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966 , 2023. 1, 6

  8. [9]

    Introducing our multimodal models, 2023

    Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sa˘ gnak Ta¸ sırlar. Introducing our multimodal models, 2023. 12

  9. [10]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems , 33:1877–1901, 2020. 1

  10. [11]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems (NeuIPS), 33:1877–1901, 2020. 2, 3

  11. [12]

    Activitynet: A large-scale video benchmark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the ieee conference on computer vision and pattern recognition , pages 961–970, 2015. 5

  12. [13]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 3558– 3568, 2021. 3

  13. [14]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 3558– 3568, 2021. 5

  14. [15]

    Scanrefer: 3d object localization in rgb-d scans using natural language

    Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. 16th European Conference on Computer Vision (ECCV) ,

  15. [16]

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195 , 2023. 1

  16. [17]

    Making your first choice: To address cold start problem in vision active learning

    Liangyu Chen, Yutong Bai, Siyu Huang, Yongyi Lu, Bihan Wen, Alan L Yuille, and Zongwei Zhou. Making your first choice: To address cold start problem in vision active learning. arXiv preprint arXiv:2210.02442, 2022. 7

  17. [18]

    Sharegpt4v: Improving large multi-modal models with better captions

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. In European Conference on Computer Vision, pages 370–387. Springer, 2024. 3

  18. [19]

    Pali-x: On scaling up a multilingual vision and language model

    Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Good- man, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565 , 2023. 1

  19. [20]

    Pali-3 vision language models: Smaller, faster, stronger, 2023

    Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, and Radu Soricut. Pali-3 vision language models: Smaller, faster, stronger, 2023. 1

  20. [21]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015. 9

  21. [22]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. 1

  22. [23]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. 2

  23. [24]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 , 2022. 1, 2

  24. [25]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly- annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 5828–5839, 2017. 4

  25. [26]

    Why can gpt learn in-context? language models secretly perform gradient descent as meta optimizers

    Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Zhifang Sui, and Furu Wei. Why can gpt learn in-context? language models secretly perform gradient descent as meta optimizers. arXiv preprint arXiv:2212.10559, 2022. 2

  26. [28]

    Instructblip: Towards general-purpose vision-language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. H. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR, abs/2305.06500, 2023. 1, 3

  27. [29]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei- Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 248–255. Ieee, 2009. 8

  28. [30]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multi- modal large language models. arXiv preprint arXiv:2306.13394 ,

  29. [31]

    LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010 , 2023. 1

  30. [32]

    What can transformers learn in-context? a case study of simple function classes

    Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? a case study of simple function classes. Advances in Neural Information Processing Systems , 35:30583–30598, 2022. 2

  31. [33]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 6904–6913, 2017. 3

  32. [34]

    Unnatural instructions: Tuning language models with (almost) no human labor

    Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689, 2022. 3

  33. [35]

    Visual storytelling

    Ting-Hao K. Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Aishwarya Agrawal, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. Visual storytelling. In 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2016) , 2016. 4

  34. [36]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 6700–6709, 2019. 8

  35. [37]

    Face Classification and Detection

    Wondong Hyeon. Face Classification and Detection. https:// github.com/wondonghyeon/face-classification, 2023. Accessed: 2023-11-25. 5

  36. [38]

    Learning to describe differences between pairs of similar images

    Harsh Jhamtani and Taylor Berg-Kirkpatrick. Learning to describe differences between pairs of similar images. In EMNLP, pages 4024–4034. Association for Computational Linguistics, 2018. 4

  37. [39]

    Referitgame: Referring to objects in photographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) , pages 787–798, 2014. 10

  38. [40]

    A hierarchical approach for generating descriptive image paragraphs

    Jonathan Krause, Justin Johnson, Ranjay Krishna, and Li Fei- Fei. A hierarchical approach for generating descriptive image paragraphs. In Computer Vision and Patterm Recognition (CVPR) ,

  39. [41]

    Dense-captioning events in videos

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision , pages 706–715,

  40. [42]

    Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. corr abs/1602.07332, 2016. 8

  41. [43]

    Visual genome: Connecting language and vision using crowdsourced dense image annotations

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision , 123:32–73, 2017. 3

  42. [44]

    Visual genome: Connecting language and vision using crowdsourced dense image annotations

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision , 123:32–73, 2017. 5

  43. [45]

    Obelics: An open web-scale filtered dataset of interleaved image-text documents

    Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks T rack, 2023. 1, 6

  44. [46]

    Tvr: A large- scale dataset for video-subtitle moment retrieval

    Jie Lei, Licheng Yu, Tamara L Berg, and Mohit Bansal. Tvr: A large- scale dataset for video-subtitle moment retrieval. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16 , pages 447–463. Springer,

  45. [47]

    Otterhd: A high-resolution multi-modality model

    Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, and Ziwei Liu. Otterhd: A high-resolution multi-modality model. arXiv preprint arXiv:2311.04219 , 2023. 1, 12

  46. [48]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597,

  47. [49]

    VideoChat: Chat-Centric Video Understanding

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat- centric video understanding. arXiv preprint arXiv:2305.06355 , 2023. 9

  48. [50]

    M3it: A large-scale dataset towards multi- modal multilingual instruction tuning

    Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, et al. M3it: A large-scale dataset towards multi-modal multilingual instruction tuning. arXiv preprint arXiv:2306.04387 , 2023. 5, 8

  49. [51]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision- language models. arXiv preprint arXiv:2305.10355 , 2023. 9

  50. [52]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision– ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13 , pages 740–755. Springer, 2014. 3, 4, 5, 10

  51. [53]

    Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565 , 2023. 3

  52. [54]

    Mitigating hallucination in large multi-modal models via robust instruction tuning, 2023

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Mitigating hallucination in large multi-modal models via robust instruction tuning, 2023. 5

  53. [55]

    Improved baselines with visual instruction tuning, 2023

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023. 6, 8

  54. [57]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485 , 2023. 1

  55. [58]

    Prismer: A vision-language model with an ensemble of experts

    Shikun Liu, Linxi Fan, Edward Johns, Zhiding Yu, Chaowei Xiao, and Anima Anandkumar. Prismer: A vision-language model with an ensemble of experts. arXiv preprint arXiv:2303.02506 , 2023. 1

  56. [59]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281 , 2023. 9, 12

  57. [60]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reason- ing of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023. 9

  58. [61]

    Ok-vqa: A visual question answering benchmark requiring external knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition , pages 3195– 3204, 2019. 8

  59. [62]

    Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

    Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837, 2022. 2

  60. [63]

    Ocr-vqa: Visual question answering by reading text in images

    Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, 2019. 8

  61. [64]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. ArXiv 2306.05424, 2023. 3, 5, 9

  62. [65]

    Gpt-4 technical report

    OpenAI. Gpt-4 technical report. https://openai.com/research/ gpt-4, 2023. 1, 2, 3

  63. [66]

    Im2text: Describing images using 1 million captioned photographs

    Vicente Ordonez, Girish Kulkarni, and Tamara Berg. Im2text: De- scribing images using 1 million captioned photographs. Advances in neural information processing systems , 24, 2011. 3

  64. [67]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems , 35:27730–27744, 2022. 1

  65. [68]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems , 35:27730–27744, 2022. 2, 3

  66. [69]

    Instruction Tuning with GPT-4

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277, 2023. 1

  67. [70]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning , pages 8748–8763. PMLR, 2021. 3

  68. [71]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML) , pages 8748–8763. PMLR,

  69. [72]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. 1

  70. [73]

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lin- tang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training en- ables zero-shot task generalization. arXiv preprint arXiv:2110.08207,

  72. [75]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems , 35:25278–25294,

  73. [76]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021. 3

  74. [77]

    A-okvqa: A benchmark for visual question answering using world knowledge

    Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Ken- neth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision , pages 146–162. Springer, 2022. 8

  75. [78]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 2556–2565, 2018. 3

  76. [79]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 8317–8326, 2019. 8

  77. [80]

    Vistext: A benchmark for semantically rich chart captioning

    Benny J Tang, Angie Boggust, and Arvind Satyanarayan. Vistext: A benchmark for semantically rich chart captioning. arXiv preprint arXiv:2307.05356, 2023. 5

  78. [81]

    Hashimoto

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023. 1

  79. [82]

    Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023

    MosaicML NLP Team. Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023. Accessed: 2023-05-05. 3

  80. [84]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Na- man Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 3

Showing first 80 references.