pith. machine review for the scientific record.

arxiv: 2204.00598 · v2 · submitted 2022-04-01 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

Recognition: 1 theorem link

· Lean Theorem

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 09:46 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.LG
keywords Socratic Models · zero-shot composition · multimodal prompting · foundation models · model chaining · egocentric video QA · robot planning · assistive dialogue

The pith

Pretrained models can be composed zero-shot through multimodal prompting to exchange information and gain new multimodal capabilities without finetuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that separately trained foundation models, such as vision-language models and pure language models, hold complementary forms of commonsense knowledge because their training data barely overlap. By framing one model as a prompt generator for another, Socratic Models let these systems exchange information in a chain of zero-shot queries, producing joint behavior on tasks that neither model was explicitly trained to handle. This modular prompting approach matches existing zero-shot baselines on image captioning and video retrieval while opening new uses such as free-form question answering on egocentric video and robot planning that interfaces with external databases. A reader cares because the method suggests that large pretrained models can be reused as modular components rather than retrained for each new multimodal application.

Core claim

Socratic Models (SMs) form a modular framework in which multiple pretrained models may be composed zero-shot via multimodal-informed prompting to exchange information with each other and capture new multimodal capabilities, without requiring finetuning. With minimal engineering, SMs are competitive with state-of-the-art zero-shot image captioning and video-to-text retrieval, and they enable new applications such as answering free-form questions about egocentric video, engaging in multimodal assistive dialogue by interfacing with external APIs, and supporting robot perception and planning.

What carries the argument

Socratic Models: a modular framework that composes pretrained models zero-shot through multimodal-informed prompting so they exchange information across domains.
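To make the composition concrete, here is a minimal sketch of the pattern, in the spirit of the framework rather than the authors' released code: a VLM turns an image into a few language facts, which are pasted into a prompt for a text-only LM. The model choices (openai/clip-vit-base-patch32, gpt2 as a stand-in for a large LM), the candidate phrase lists, and the prompt template are illustrative assumptions.

```python
# Minimal sketch of VLM -> LM composition via multimodal-informed prompting.
# Model names, candidate phrases, and the prompt template are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, pipeline

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
lm = pipeline("text-generation", model="gpt2")  # stand-in for a large LM

def visual_facts(image: Image.Image, candidates: list[str], k: int = 3) -> list[str]:
    """Rank candidate phrases against the image with CLIP and keep the top k."""
    inputs = clip_proc(text=candidates, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = clip(**inputs).logits_per_image[0]      # one score per candidate
    top = scores.topk(min(k, len(candidates))).indices.tolist()
    return [candidates[i] for i in top]

def socratic_answer(image: Image.Image, question: str) -> str:
    """The multimodal-informed prompting step: VLM outputs become LM prompt text."""
    place = visual_facts(image, ["a kitchen", "an office", "a park", "a workshop"], k=1)[0]
    things = visual_facts(image, ["a knife", "a laptop", "a bicycle", "a saucepan"], k=2)
    prompt = f"I am in {place}. I can see {', '.join(things)}.\nQ: {question}\nA:"
    out = lm(prompt, max_new_tokens=30)[0]["generated_text"]
    return out[len(prompt):].strip()

# answer = socratic_answer(Image.open("frame.jpg"), "What am I most likely doing?")
```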

If this is right

  • Competitive performance with state-of-the-art zero-shot image captioning and video-to-text retrieval is achieved.
  • Free-form questions about egocentric video can be answered by chaining vision and language models (a sketch of this chaining follows this list).
  • Multimodal assistive dialogue becomes possible by letting the composed system call external APIs and databases.
  • Robot perception and planning tasks can be handled through the same prompting-based composition.
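As a hedged illustration of the egocentric-video case flagged above, the same pattern can be stretched over time: a VLM summarizes sampled frames into a text log that a text-only LM then reasons over. The `visual_facts` and `lm` helpers are the hypothetical ones from the previous sketch; the activity vocabulary and log format are assumptions, not the paper's pipeline.

```python
# Hedged sketch of egocentric video QA via a language "log" built from frame-level facts.
from PIL import Image

ACTIVITIES = ["chopping vegetables", "washing dishes", "reading a recipe",
              "typing on a laptop", "pouring coffee"]

def video_language_log(frames: list[Image.Image], fps: float = 1.0) -> str:
    """One line of text per sampled frame, using the earlier visual_facts helper."""
    lines = []
    for i, frame in enumerate(frames):
        activity = visual_facts(frame, ACTIVITIES, k=1)[0]
        lines.append(f"{i / fps:.0f}s: the camera wearer is {activity}.")
    return "\n".join(lines)

def video_qa(frames: list[Image.Image], question: str) -> str:
    """Let the text-only LM answer questions over the time-stamped log."""
    log = video_language_log(frames)
    prompt = f"Egocentric video log:\n{log}\n\nQ: {question}\nA:"
    out = lm(prompt, max_new_tokens=40)[0]["generated_text"]
    return out[len(prompt):].strip()

# video_qa(sampled_frames, "What did I do right before pouring coffee?")
```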

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same prompting composition could extend to other modality pairs such as audio-language or tactile-language without new training runs.
  • Error accumulation across long prompting chains might limit reliability on complex multi-step tasks.
  • The framework suggests a route to more modular AI systems in which new capabilities are added by swapping one component model rather than retraining the whole system.

Load-bearing premise

That distinct capabilities stored in separately trained foundation models can be reliably accessed and combined through prompting alone, without finetuning or task-specific adaptation.

What would settle it

A controlled test in which a Socratic Model chain is given a multimodal query that requires both visual recognition and symbolic reasoning, yet produces answers no better than the individual models used in isolation.
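Below is a minimal sketch of how that controlled test could be scored, using the hypothetical `socratic_answer`, `visual_facts`, and `lm` helpers from the sketches above; the exact-match scoring rule and the distractor candidates are assumptions, not something the paper specifies.

```python
# Hedged sketch of the falsification test: compare the composed chain against each
# component used alone on queries that need both visual recognition and reasoning.
from PIL import Image

def exact_match(prediction: str, gold: str) -> bool:
    return gold.strip().lower() in prediction.strip().lower()

def controlled_test(examples: list[tuple[Image.Image, str, str]]) -> dict[str, float]:
    """examples: (image, question, gold answer) triples."""
    hits = {"chain": 0, "lm_only": 0, "vlm_only": 0}
    for image, question, gold in examples:
        # Full chain: VLM facts feed the LM.
        hits["chain"] += exact_match(socratic_answer(image, question), gold)
        # LM alone: no visual grounding at all.
        blind = lm(f"Q: {question}\nA:", max_new_tokens=30)[0]["generated_text"]
        hits["lm_only"] += exact_match(blind, gold)
        # VLM alone: image-text matching over candidate answers, no reasoning step.
        candidates = [gold, "an unrelated distractor", "another unrelated distractor"]
        hits["vlm_only"] += exact_match(visual_facts(image, candidates, k=1)[0], gold)
    n = len(examples)
    return {name: count / n for name, count in hits.items()}

# The load-bearing premise fails if the chain's accuracy is not clearly above both ablations.
```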

Original abstract

Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT questions, code). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this diversity is symbiotic, and can be leveraged through Socratic Models (SMs): a modular framework in which multiple pretrained models may be composed zero-shot i.e., via multimodal-informed prompting, to exchange information with each other and capture new multimodal capabilities, without requiring finetuning. With minimal engineering, SMs are not only competitive with state-of-the-art zero-shot image captioning and video-to-text retrieval, but also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes) by interfacing with external APIs and databases (e.g., web search), and (iii) robot perception and planning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Socratic Models (SMs), a modular framework for composing multiple pretrained foundation models (VLMs, LMs, etc.) zero-shot via multimodal-informed prompting. This enables information exchange across models to capture new multimodal capabilities without finetuning. The work reports competitive performance on zero-shot image captioning and video-to-text retrieval, plus new applications in egocentric video QA, multimodal assistive dialogue with external APIs, and robot perception/planning.

Significance. If the results hold under rigorous controls, the work is significant for showing that complementary knowledge stored in separately trained foundation models can be combined through prompting to enable new tasks with minimal engineering. This modular approach could reduce the need for task-specific finetuning and support rapid prototyping in robotics, video understanding, and assistive systems.

major comments (2)
  1. [§4] §4 (Experiments): The abstract states competitive results on captioning and retrieval, but the manuscript provides no full baseline tables, statistical significance tests, or error analysis for the zero-shot composition claim; this is load-bearing because the central assertion of reliable exchange without finetuning cannot be verified from the reported metrics alone.
  2. [§3] §3 (Method): The framework assumes text prompts suffice to transfer visual information (e.g., from VLM detections to LM planning), yet no quantitative bound or ablation on information loss (spatial/temporal/relational details) is provided; this directly affects the zero-shot property and the new applications such as egocentric video QA.
minor comments (2)
  1. The phrase 'minimal engineering' in the abstract is used without concrete examples of prompt templates or API interfaces in the main text.
  2. Figure captions and method diagrams would benefit from explicit notation for the prompting flow between models.

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We provide point-by-point responses to the major comments and outline the revisions to be made in the updated manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments): The abstract states competitive results on captioning and retrieval, but the manuscript provides no full baseline tables, statistical significance tests, or error analysis for the zero-shot composition claim; this is load-bearing because the central assertion of reliable exchange without finetuning cannot be verified from the reported metrics alone.

    Authors: We acknowledge the need for more rigorous experimental validation. In the revised version, we will include full baseline tables with additional zero-shot methods, conduct statistical significance tests (e.g., using McNemar's test for classification-like metrics or bootstrap for others), and provide an error analysis highlighting where the multimodal composition excels or falls short. This will better substantiate the zero-shot capabilities (a sketch of such a paired bootstrap comparison follows these responses). revision: yes

  2. Referee: [§3] §3 (Method): The framework assumes text prompts suffice to transfer visual information (e.g., from VLM detections to LM planning), yet no quantitative bound or ablation on information loss (spatial/temporal/relational details) is provided; this directly affects the zero-shot property and the new applications such as egocentric video QA.

    Authors: We agree that ablations on information transfer are valuable. We will add experiments ablating the prompt content and VLM output types to measure effects on task performance. However, a general quantitative bound on information loss is not feasible without further assumptions on the models' internal representations, as the transfer is through natural language which is inherently lossy for visual details. revision: partial
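One way the proposed paired bootstrap could look in practice, sketched under the assumption that per-example metric scores (e.g., CIDEr per caption) are available for the Socratic chain and a zero-shot baseline; this is an editorial illustration, not the authors' evaluation code.

```python
# Hedged sketch of a paired bootstrap significance test over per-example metric scores.
import numpy as np

def paired_bootstrap(scores_sm: np.ndarray, scores_baseline: np.ndarray,
                     n_resamples: int = 10_000, seed: int = 0) -> float:
    """Fraction of resamples in which the baseline is at least as good as the chain;
    a small value suggests the observed gap is not just resampling noise."""
    rng = np.random.default_rng(seed)
    diffs = scores_sm - scores_baseline                       # per-example metric gaps
    idx = rng.integers(0, len(diffs), size=(n_resamples, len(diffs)))
    return float((diffs[idx].mean(axis=1) <= 0).mean())

# p = paired_bootstrap(cider_per_image_sm, cider_per_image_baseline)
```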

standing simulated objections (not resolved)
  • A general theoretical quantitative bound on information loss in the text-based transfer between models.

Circularity Check

0 steps flagged

No significant circularity: empirical composition of pretrained models via prompting

Full rationale

The paper introduces Socratic Models as a modular framework for zero-shot composition of existing foundation models (VLMs, LMs) through multimodal-informed prompting. No equations, fitted parameters, or derivations are present that reduce outputs to inputs by construction. The central claim rests on empirical demonstrations of new capabilities (egocentric QA, assistive dialogue, robot planning) rather than self-definitional steps, self-citation load-bearing premises, or renamed known results. Self-citations to prior model work are standard and non-circular per the guidelines, as the framework itself adds no fitted or definitional reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that separately trained models hold complementary knowledge that prompting can surface; no free parameters, new entities, or ad-hoc axioms are introduced beyond standard use of pretrained models.

axioms (1)
  • domain assumption: Large pretrained models exhibit distinct capabilities depending on the domain of data they are trained on.
    Directly stated in the opening of the abstract as the premise enabling symbiotic composition.

pith-pipeline@v0.9.0 · 5573 in / 1166 out tokens · 28809 ms · 2026-05-16T09:46:44.327244+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

    cs.CL 2023-04 conditional novelty 8.0

    API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.

  2. Code as Policies: Language Model Programs for Embodied Control

    cs.RO 2022-09 accept novelty 8.0

    Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.

  3. GAIA: a benchmark for General AI Assistants

    cs.CL 2023-11 unverdicted novelty 7.0

    GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

  4. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    cs.RO 2023-07 unverdicted novelty 7.0

    VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.

  5. Voyager: An Open-Ended Embodied Agent with Large Language Models

    cs.AI 2023-05 unverdicted novelty 7.0

    Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...

  6. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    cs.CV 2023-03 accept novelty 7.0

    Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.

  7. Flamingo: a Visual Language Model for Few-Shot Learning

    cs.CV 2022-04 unverdicted novelty 7.0

    Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

  8. Building a Precise Video Language with Human-AI Oversight

    cs.CV 2026-04 unverdicted novelty 6.0

    CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video gene...

  9. Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs

    cs.CV 2026-04 unverdicted novelty 6.0

    Perception Programs rewrite dense visual tool outputs into language-native summaries, boosting MLLM accuracy by 15-45% absolute on BLINK perception tasks and setting new state-of-the-art results.

  10. Demystifying CLIP Data

    cs.CV 2023-09 accept novelty 6.0

    MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.

  11. Improving Factuality and Reasoning in Language Models through Multiagent Debate

    cs.CL 2023-05 unverdicted novelty 6.0

    Multiagent debate among LLMs improves mathematical reasoning, strategic reasoning, and factual accuracy while reducing hallucinations.

  12. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    cs.CV 2023-03 unverdicted novelty 6.0

    MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.

  13. Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents

    cs.AI 2023-02 conditional novelty 6.0

    DEPS combines LLM-based interactive planning with a trainable goal selector to create a zero-shot multi-task agent that completes 70+ Minecraft tasks and nearly doubles prior performance.

  14. Inner Monologue: Embodied Reasoning through Planning with Language Models

    cs.RO 2022-07 unverdicted novelty 6.0

    LLMs form an inner monologue from closed-loop language feedback to improve high-level instruction completion in simulated and real robotic rearrangement and kitchen manipulation tasks.

  15. Emergent Abilities of Large Language Models

    cs.CL 2022-06 unverdicted novelty 6.0

    Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

  16. CoCa: Contrastive Captioners are Image-Text Foundation Models

    cs.CV 2022-05 accept novelty 6.0

    CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.

  17. From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

    cs.CV 2026-05 unverdicted novelty 5.0

    SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.

  18. A Survey on Multimodal Large Language Models

    cs.CV 2023-06 accept novelty 3.0

    This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

Reference graph

Works this paper leans on

142 extracted references · 142 canonical work pages · cited by 18 Pith papers · 15 internal anchors
