pith. machine review for the scientific record.

arxiv: 2204.00598 · v2 · submitted 2022-04-01 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

Recognition: 1 theorem link

· Lean Theorem

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 09:46 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.LG
keywords Socratic Models · zero-shot composition · multimodal prompting · foundation models · model chaining · egocentric video QA · robot planning · assistive dialogue

The pith

Pretrained models can be composed zero-shot through multimodal prompting to exchange information and gain new multimodal capabilities without finetuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that separately trained foundation models, such as vision-language models and pure language models, hold complementary forms of commonsense knowledge because their training data barely overlap. By framing one model as a prompt generator for another, Socratic Models let these systems exchange information in a chain of zero-shot queries, producing joint behavior on tasks that neither model was explicitly trained to handle. This modular prompting approach matches existing zero-shot baselines on image captioning and video retrieval while opening new uses such as free-form question answering on egocentric video and robot planning that interfaces with external databases. A reader cares because the method suggests that large pretrained models can be reused as modular components rather than retrained for each new multimodal application.

Core claim

Socratic Models (SMs) form a modular framework in which multiple pretrained models may be composed zero-shot via multimodal-informed prompting to exchange information with each other and capture new multimodal capabilities, without requiring finetuning. With minimal engineering, SMs are competitive with state-of-the-art zero-shot image captioning and video-to-text retrieval, and they enable new applications such as answering free-form questions about egocentric video, engaging in multimodal assistive dialogue by interfacing with external APIs, and supporting robot perception and planning.

What carries the argument

Socratic Models: a modular framework that composes pretrained models zero-shot through multimodal-informed prompting so they exchange information across domains.
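To make the composition concrete, here is a minimal sketch of the pattern, in the spirit of the framework rather than the authors' released code: a VLM turns an image into a few language facts, which are pasted into a prompt for a text-only LM. The model choices (openai/clip-vit-base-patch32, gpt2 as a stand-in for a large LM), the candidate phrase lists, and the prompt template are illustrative assumptions.

```python
# Minimal sketch of VLM -> LM composition via multimodal-informed prompting.
# Model names, candidate phrases, and the prompt template are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, pipeline

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
lm = pipeline("text-generation", model="gpt2")  # stand-in for a large LM

def visual_facts(image: Image.Image, candidates: list[str], k: int = 3) -> list[str]:
    """Rank candidate phrases against the image with CLIP and keep the top k."""
    inputs = clip_proc(text=candidates, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = clip(**inputs).logits_per_image[0]      # one score per candidate
    top = scores.topk(min(k, len(candidates))).indices.tolist()
    return [candidates[i] for i in top]

def socratic_answer(image: Image.Image, question: str) -> str:
    """The multimodal-informed prompting step: VLM outputs become LM prompt text."""
    place = visual_facts(image, ["a kitchen", "an office", "a park", "a workshop"], k=1)[0]
    things = visual_facts(image, ["a knife", "a laptop", "a bicycle", "a saucepan"], k=2)
    prompt = f"I am in {place}. I can see {', '.join(things)}.\nQ: {question}\nA:"
    out = lm(prompt, max_new_tokens=30)[0]["generated_text"]
    return out[len(prompt):].strip()

# answer = socratic_answer(Image.open("frame.jpg"), "What am I most likely doing?")
```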

If this is right

  • Competitive performance with state-of-the-art zero-shot image captioning and video-to-text retrieval is achieved.
  • Free-form questions about egocentric video can be answered by chaining vision and language models (a sketch of this chaining follows this list).
  • Multimodal assistive dialogue becomes possible by letting the composed system call external APIs and databases.
  • Robot perception and planning tasks can be handled through the same prompting-based composition.
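As a hedged illustration of the egocentric-video case flagged above, the same pattern can be stretched over time: a VLM summarizes sampled frames into a text log that a text-only LM then reasons over. The `visual_facts` and `lm` helpers are the hypothetical ones from the previous sketch; the activity vocabulary and log format are assumptions, not the paper's pipeline.

```python
# Hedged sketch of egocentric video QA via a language "log" built from frame-level facts.
from PIL import Image

ACTIVITIES = ["chopping vegetables", "washing dishes", "reading a recipe",
              "typing on a laptop", "pouring coffee"]

def video_language_log(frames: list[Image.Image], fps: float = 1.0) -> str:
    """One line of text per sampled frame, using the earlier visual_facts helper."""
    lines = []
    for i, frame in enumerate(frames):
        activity = visual_facts(frame, ACTIVITIES, k=1)[0]
        lines.append(f"{i / fps:.0f}s: the camera wearer is {activity}.")
    return "\n".join(lines)

def video_qa(frames: list[Image.Image], question: str) -> str:
    """Let the text-only LM answer questions over the time-stamped log."""
    log = video_language_log(frames)
    prompt = f"Egocentric video log:\n{log}\n\nQ: {question}\nA:"
    out = lm(prompt, max_new_tokens=40)[0]["generated_text"]
    return out[len(prompt):].strip()

# video_qa(sampled_frames, "What did I do right before pouring coffee?")
```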

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same prompting composition could extend to other modality pairs such as audio-language or tactile-language without new training runs.
  • Error accumulation across long prompting chains might limit reliability on complex multi-step tasks.
  • The framework suggests a route to more modular AI systems in which new capabilities are added by swapping one component model rather than retraining the whole system.

Load-bearing premise

That distinct capabilities stored in separately trained foundation models can be reliably accessed and combined through prompting alone, without finetuning or task-specific adaptation.

What would settle it

A controlled test in which a Socratic Model chain is given a multimodal query that requires both visual recognition and symbolic reasoning, yet produces answers no better than the individual models used in isolation.
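Below is a minimal sketch of how that controlled test could be scored, using the hypothetical `socratic_answer`, `visual_facts`, and `lm` helpers from the sketches above; the exact-match scoring rule and the distractor candidates are assumptions, not something the paper specifies.

```python
# Hedged sketch of the falsification test: compare the composed chain against each
# component used alone on queries that need both visual recognition and reasoning.
from PIL import Image

def exact_match(prediction: str, gold: str) -> bool:
    return gold.strip().lower() in prediction.strip().lower()

def controlled_test(examples: list[tuple[Image.Image, str, str]]) -> dict[str, float]:
    """examples: (image, question, gold answer) triples."""
    hits = {"chain": 0, "lm_only": 0, "vlm_only": 0}
    for image, question, gold in examples:
        # Full chain: VLM facts feed the LM.
        hits["chain"] += exact_match(socratic_answer(image, question), gold)
        # LM alone: no visual grounding at all.
        blind = lm(f"Q: {question}\nA:", max_new_tokens=30)[0]["generated_text"]
        hits["lm_only"] += exact_match(blind, gold)
        # VLM alone: image-text matching over candidate answers, no reasoning step.
        candidates = [gold, "an unrelated distractor", "another unrelated distractor"]
        hits["vlm_only"] += exact_match(visual_facts(image, candidates, k=1)[0], gold)
    n = len(examples)
    return {name: count / n for name, count in hits.items()}

# The load-bearing premise fails if the chain's accuracy is not clearly above both ablations.
```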

Original abstract

Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT questions, code). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this diversity is symbiotic, and can be leveraged through Socratic Models (SMs): a modular framework in which multiple pretrained models may be composed zero-shot i.e., via multimodal-informed prompting, to exchange information with each other and capture new multimodal capabilities, without requiring finetuning. With minimal engineering, SMs are not only competitive with state-of-the-art zero-shot image captioning and video-to-text retrieval, but also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes) by interfacing with external APIs and databases (e.g., web search), and (iii) robot perception and planning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Socratic Models (SMs), a modular framework for composing multiple pretrained foundation models (VLMs, LMs, etc.) zero-shot via multimodal-informed prompting. This enables information exchange across models to capture new multimodal capabilities without finetuning. The work reports competitive performance on zero-shot image captioning and video-to-text retrieval, plus new applications in egocentric video QA, multimodal assistive dialogue with external APIs, and robot perception/planning.

Significance. If the results hold under rigorous controls, the work is significant for showing that complementary knowledge stored in separately trained foundation models can be combined through prompting to enable new tasks with minimal engineering. This modular approach could reduce the need for task-specific finetuning and support rapid prototyping in robotics, video understanding, and assistive systems.

major comments (2)
  1. [§4] §4 (Experiments): The abstract states competitive results on captioning and retrieval, but the manuscript provides no full baseline tables, statistical significance tests, or error analysis for the zero-shot composition claim; this is load-bearing because the central assertion of reliable exchange without finetuning cannot be verified from the reported metrics alone.
  2. [§3] §3 (Method): The framework assumes text prompts suffice to transfer visual information (e.g., from VLM detections to LM planning), yet no quantitative bound or ablation on information loss (spatial/temporal/relational details) is provided; this directly affects the zero-shot property and the new applications such as egocentric video QA.
minor comments (2)
  1. The phrase 'minimal engineering' in the abstract is used without concrete examples of prompt templates or API interfaces in the main text.
  2. Figure captions and method diagrams would benefit from explicit notation for the prompting flow between models.

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We provide point-by-point responses to the major comments and outline the revisions to be made in the updated manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments): The abstract states competitive results on captioning and retrieval, but the manuscript provides no full baseline tables, statistical significance tests, or error analysis for the zero-shot composition claim; this is load-bearing because the central assertion of reliable exchange without finetuning cannot be verified from the reported metrics alone.

    Authors: We acknowledge the need for more rigorous experimental validation. In the revised version, we will include full baseline tables with additional zero-shot methods, conduct statistical significance tests (e.g., using McNemar's test for classification-like metrics or bootstrap for others), and provide an error analysis highlighting where the multimodal composition excels or falls short. This will better substantiate the zero-shot capabilities (a sketch of such a paired bootstrap comparison follows these responses). revision: yes

  2. Referee: [§3] §3 (Method): The framework assumes text prompts suffice to transfer visual information (e.g., from VLM detections to LM planning), yet no quantitative bound or ablation on information loss (spatial/temporal/relational details) is provided; this directly affects the zero-shot property and the new applications such as egocentric video QA.

    Authors: We agree that ablations on information transfer are valuable. We will add experiments ablating the prompt content and VLM output types to measure effects on task performance. However, a general quantitative bound on information loss is not feasible without further assumptions on the models' internal representations, as the transfer is through natural language which is inherently lossy for visual details. revision: partial
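One way the proposed paired bootstrap could look in practice, sketched under the assumption that per-example metric scores (e.g., CIDEr per caption) are available for the Socratic chain and a zero-shot baseline; this is an editorial illustration, not the authors' evaluation code.

```python
# Hedged sketch of a paired bootstrap significance test over per-example metric scores.
import numpy as np

def paired_bootstrap(scores_sm: np.ndarray, scores_baseline: np.ndarray,
                     n_resamples: int = 10_000, seed: int = 0) -> float:
    """Fraction of resamples in which the baseline is at least as good as the chain;
    a small value suggests the observed gap is not just resampling noise."""
    rng = np.random.default_rng(seed)
    diffs = scores_sm - scores_baseline                       # per-example metric gaps
    idx = rng.integers(0, len(diffs), size=(n_resamples, len(diffs)))
    return float((diffs[idx].mean(axis=1) <= 0).mean())

# p = paired_bootstrap(cider_per_image_sm, cider_per_image_baseline)
```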

standing simulated objections (not resolved)
  • A general theoretical quantitative bound on information loss in the text-based transfer between models.

Circularity Check

0 steps flagged

No significant circularity: empirical composition of pretrained models via prompting

Full rationale

The paper introduces Socratic Models as a modular framework for zero-shot composition of existing foundation models (VLMs, LMs) through multimodal-informed prompting. No equations, fitted parameters, or derivations are present that reduce outputs to inputs by construction. The central claim rests on empirical demonstrations of new capabilities (egocentric QA, assistive dialogue, robot planning) rather than self-definitional steps, self-citation load-bearing premises, or renamed known results. Self-citations to prior model work are standard and non-circular per the guidelines, as the framework itself adds no fitted or definitional reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that separately trained models hold complementary knowledge that prompting can surface; no free parameters, new entities, or ad-hoc axioms are introduced beyond standard use of pretrained models.

axioms (1)
  • domain assumption: Large pretrained models exhibit distinct capabilities depending on the domain of data they are trained on.
    Directly stated in the opening of the abstract as the premise enabling symbiotic composition.

pith-pipeline@v0.9.0 · 5573 in / 1166 out tokens · 28809 ms · 2026-05-16T09:46:44.327244+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

    cs.CL 2023-04 conditional novelty 8.0

    API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.

  2. Code as Policies: Language Model Programs for Embodied Control

    cs.RO 2022-09 accept novelty 8.0

    Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.

  3. GAIA: a benchmark for General AI Assistants

    cs.CL 2023-11 unverdicted novelty 7.0

    GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

  4. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    cs.RO 2023-07 unverdicted novelty 7.0

    VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.

  5. Voyager: An Open-Ended Embodied Agent with Large Language Models

    cs.AI 2023-05 unverdicted novelty 7.0

    Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...

  6. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    cs.CV 2023-03 accept novelty 7.0

    Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.

  7. Flamingo: a Visual Language Model for Few-Shot Learning

    cs.CV 2022-04 unverdicted novelty 7.0

    Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

  8. Building a Precise Video Language with Human-AI Oversight

    cs.CV 2026-04 unverdicted novelty 6.0

    CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video gene...

  9. Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs

    cs.CV 2026-04 unverdicted novelty 6.0

    Perception Programs rewrite dense visual tool outputs into language-native summaries, boosting MLLM accuracy by 15-45% absolute on BLINK perception tasks and setting new state-of-the-art results.

  10. Demystifying CLIP Data

    cs.CV 2023-09 accept novelty 6.0

    MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.

  11. Improving Factuality and Reasoning in Language Models through Multiagent Debate

    cs.CL 2023-05 unverdicted novelty 6.0

    Multiagent debate among LLMs improves mathematical reasoning, strategic reasoning, and factual accuracy while reducing hallucinations.

  12. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    cs.CV 2023-03 unverdicted novelty 6.0

    MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.

  13. Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents

    cs.AI 2023-02 conditional novelty 6.0

    DEPS combines LLM-based interactive planning with a trainable goal selector to create a zero-shot multi-task agent that completes 70+ Minecraft tasks and nearly doubles prior performance.

  14. Inner Monologue: Embodied Reasoning through Planning with Language Models

    cs.RO 2022-07 unverdicted novelty 6.0

    LLMs form an inner monologue from closed-loop language feedback to improve high-level instruction completion in simulated and real robotic rearrangement and kitchen manipulation tasks.

  15. Emergent Abilities of Large Language Models

    cs.CL 2022-06 unverdicted novelty 6.0

    Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

  16. CoCa: Contrastive Captioners are Image-Text Foundation Models

    cs.CV 2022-05 accept novelty 6.0

    CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.

  17. From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

    cs.CV 2026-05 unverdicted novelty 5.0

    SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.

  18. A Survey on Multimodal Large Language Models

    cs.CV 2023-06 accept novelty 3.0

    This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

Reference graph

Works this paper leans on

142 extracted references · 142 canonical work pages · cited by 18 Pith papers · 15 internal anchors
