arxiv: 2501.01957 · v4 · pith:DANAGOQMnew · submitted 2025-01-03 · 💻 cs.CV · cs.SD · eess.AS

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, Long Ma, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He

Pith reviewed 2026-05-17 21:03 UTC · model grok-4.3

classification 💻 cs.CV · cs.SD · eess.AS

keywords multimodal large language models, vision-speech interaction, real-time multimodal dialogue, multi-stage training, end-to-end speech response, VITA-1.5

The pith

A multi-stage training method allows large language models to handle vision and speech together for near real-time interaction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors present a multi-stage training methodology to progressively train an LLM on visual and speech information. This creates a model that supports fluent interactions involving both seeing and speaking. The approach avoids the need for separate speech processing modules to keep responses fast. A reader would care if this leads to AI systems that feel more natural in everyday use by combining modalities efficiently.

Core claim

The paper's core discovery is that through a carefully designed multi-stage training methodology, an LLM can be trained to understand both visual and speech information. This not only preserves strong vision-language capacity but also enables efficient speech-to-speech dialogue without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. Benchmarks confirm strong capabilities in image, video, and speech tasks.

What carries the argument

multi-stage training methodology for progressively integrating visual and speech understanding

Load-bearing premise

The multi-stage training can be balanced so that speech capabilities are added without degrading the pre-existing vision-language capacity.

What would settle it

Observing whether vision-language benchmark scores decrease after the speech training stages or whether the model's response latency remains higher than expected for real-time use.
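One way to run the first half of that check is to score the same vision benchmarks at the checkpoint just before the speech stages and again at the final checkpoint, then flag any benchmark that drops by more than a chosen tolerance. The Python sketch below illustrates the idea only. The function name `vision_regressions`, the benchmark names, the scores, and the 1-point tolerance are all hypothetical placeholders, not values or code from the paper.

```python
def vision_regressions(before, after, tolerance=1.0):
    """Return benchmarks whose score dropped by more than `tolerance` points."""
    flagged = {}
    for name, score_before in before.items():
        if name not in after:
            continue  # benchmark was not re-run at the later checkpoint
        delta = after[name] - score_before
        if delta < -tolerance:
            flagged[name] = round(delta, 2)
    return flagged


# Placeholder scores for illustration only (not results from the paper).
before_speech = {"image_bench": 70.0, "video_bench": 55.0}
after_speech = {"image_bench": 69.5, "video_bench": 52.0}

print(vision_regressions(before_speech, after_speech))
```

With these placeholder inputs only `video_bench` is flagged, because its 3-point drop exceeds the tolerance while the 0.5-point drop on `image_bench` does not.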

Original abstract

Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to the fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains an LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities, enabling near real-time vision and speech interaction. Code has been released at https://github.com/VITA-MLLM/VITA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VITA-1.5, a multimodal LLM trained via a carefully designed multi-stage progressive methodology to integrate vision and speech modalities into a single model. It claims this enables fluent end-to-end vision and speech interaction in near real-time without separate ASR or TTS modules, while preserving strong vision-language capacity, and demonstrates competitive results against SOTA models on image, video, and speech benchmarks.

Significance. If the multi-stage training successfully adds speech capabilities without degrading prior vision performance, the result would advance unified multimodal dialogue systems and reduce reliance on cascaded pipelines. The public code release at the provided GitHub link is a clear strength for reproducibility and further research.

major comments (2)

[Training Methodology] Training Methodology section: The central claim that progressive training balances speech integration while fully preserving vision-language capacity lacks direct empirical support; no before/after checkpoint comparisons or ablations on vision benchmarks (e.g., after speech stages) are reported, leaving open whether degradation occurred and was recovered or if final scores reflect a different trade-off.
[Experiments] Experiments section: Benchmark results for image/video tasks are presented as competitive but without error bars, exact data splits, or ablations on stage ordering, which moderately weakens support for the preserved vision capacity claim as noted in the soundness assessment.

minor comments (2)

[Abstract] The abstract could more explicitly list the specific benchmarks and metrics used for the image, video, and speech comparisons to improve clarity.
Notation for training stages and data mixing ratios could be formalized with a table or diagram for easier reference across sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for their thoughtful and constructive comments on our manuscript. We have carefully considered each point and provide detailed responses below. Where appropriate, we will revise the manuscript to incorporate additional analyses and clarifications to address the concerns raised.

Referee: [Training Methodology] Training Methodology section: The central claim that progressive training balances speech integration while fully preserving vision-language capacity lacks direct empirical support; no before/after checkpoint comparisons or ablations on vision benchmarks (e.g., after speech stages) are reported, leaving open whether degradation occurred and was recovered or if final scores reflect a different trade-off.

Authors: We acknowledge the referee's point that direct empirical evidence through before-and-after comparisons would strengthen our claims regarding the preservation of vision-language capabilities. Our multi-stage training is designed to progressively integrate modalities while minimizing interference, and the final model's competitive performance on vision benchmarks supports this. However, to provide more direct support, in the revised manuscript we will add ablation studies showing vision benchmark results at intermediate checkpoints before and after the speech training stages. revision: yes
Referee: [Experiments] Experiments section: Benchmark results for image/video tasks are presented as competitive but without error bars, exact data splits, or ablations on stage ordering, which moderately weakens support for the preserved vision capacity claim as noted in the soundness assessment.

Authors: We appreciate this feedback. To enhance the robustness of our experimental results, we will include error bars for the reported metrics where multiple runs are available, provide precise details on the data splits used for each benchmark, and conduct additional ablations to evaluate the impact of different stage orderings on the final performance. These additions will be incorporated into the revised version of the paper. revision: yes
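The error bars mentioned in this response are typically reported as a mean and sample standard deviation over repeated runs. A minimal Python sketch of that summary follows. The helper name `summarize_runs` and the three scores are placeholders chosen for illustration, and nothing here reflects how the authors actually compute or plan to report their metrics.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) for scores from repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(scores), stdev(scores)


# Placeholder scores from three hypothetical runs (not results from the paper).
runs = [60.0, 62.0, 64.0]
avg, spread = summarize_runs(runs)
print(f"{avg:.1f} ± {spread:.1f}")
```

A single run cannot produce an error bar, which is why the helper raises rather than returning a spread of zero.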

Circularity Check

0 steps flagged

No circularity: claims rest on external benchmark comparisons

The paper describes a multi-stage training methodology for adding speech capabilities to a vision-language LLM while aiming to preserve prior performance. Central claims are validated through direct comparisons to state-of-the-art models on separate image, video, and speech benchmarks. No mathematical derivations, equations, or self-referential definitions appear in the provided text that would reduce any prediction or result to fitted parameters or inputs by construction. The work is self-contained against external benchmarks rather than internal tautologies, consistent with a score of 0.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of multimodal LLM training plus empirical choices for stage ordering and data mixing ratios.

free parameters (1)

Stage-wise learning rates and data mixing ratios
Chosen during development to balance vision retention against speech acquisition.
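To make concrete what such a free parameter looks like, the Python sketch below writes a staged schedule as data, with a learning rate and a data mixing ratio per stage, and draws training examples according to the mix. Every name and number in it is an assumption made for illustration: the stage names, learning rates, source names, and weights are not taken from the paper or its released code.

```python
import random
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    learning_rate: float
    mix: dict  # data source name -> sampling weight


def sample_sources(stage, n, seed=0):
    """Draw n data-source names in proportion to the stage's mixing weights."""
    rng = random.Random(seed)
    sources = list(stage.mix)
    weights = [stage.mix[s] for s in sources]
    return rng.choices(sources, weights=weights, k=n)


# Hypothetical two-stage schedule; all values are placeholders.
schedule = [
    Stage("vision_language", 1e-4, {"image_text": 1.0}),
    Stage("speech_added", 5e-5, {"image_text": 0.5, "speech_text": 0.5}),
]

for stage in schedule:
    batch = sample_sources(stage, n=8)
    print(stage.name, stage.learning_rate, batch)
```

Keeping vision data in the mix of the later stage is one common way to limit forgetting. Whether the authors do this, and in what proportion, is exactly the kind of detail the ledger entry says was tuned during development.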

axioms (1)

Domain assumption: Staged training on multimodal data avoids catastrophic forgetting of vision-language capabilities.
Invoked when describing the progressive training methodology that first reinforces vision-language then adds speech.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing
cs.CV 2026-05 unverdicted novelty 8.0

VEBENCH is the first benchmark evaluating LMMs on video editing technique recognition and operation simulation using 3.9K videos and 3,080 QA pairs, revealing a large performance gap to humans.
How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
cs.CL 2026-05 unverdicted novelty 7.0

Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of w...
VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing
cs.CV 2026-05 unverdicted novelty 7.0

VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.
TiCo: Time-Controllable Spoken Dialogue Model
cs.CL 2026-03 unverdicted novelty 7.0

TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.
From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents
cs.CV 2026-03 unverdicted novelty 7.0

MM-Mem distills video input through a hierarchical memory of sensory buffer, episodic stream, and symbolic schema, optimized by a semantic information bottleneck and SIB-GRPO, to achieve SOTA on long-horizon video benchmarks.
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
cs.CV 2025-12 unverdicted novelty 7.0

AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.
StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos
cs.CV 2025-12 unverdicted novelty 7.0

StreamGaze is a new benchmark and QA generation pipeline that measures how well MLLMs leverage gaze trajectories for temporal reasoning and proactive intention prediction in streaming egocentric videos.
Game-Time: Evaluating Temporal Dynamics in Spoken Language Models
eess.AS 2025-09 unverdicted novelty 7.0

Game-Time Benchmark shows spoken language models handle basic tasks but degrade sharply under temporal constraints like tempo adherence and synchronized responses.
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
cs.CV 2025-02 unverdicted novelty 7.0

WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification
cs.CL 2026-05 unverdicted novelty 6.0

DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.
EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness
cs.CV 2026-05 unverdicted novelty 6.0

EmoMM benchmark reveals Video Contribution Collapse in MLLMs for emotion recognition under modality conflict and missingness, mitigated by CHASE head-level attention steering.
GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking
cs.SD 2026-04 unverdicted novelty 6.0

GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on fo...
Small Vision-Language Models are Smart Compressors for Long Video Understanding
cs.CV 2026-04 unverdicted novelty 6.0

Tempo uses a 6B SVLM as a local temporal compressor with training-free adaptive token allocation to achieve SOTA long-video understanding at 0.5-16 tokens per frame, scoring 52.3 on 4101s LVBench under 8K budget.
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
cs.CV 2026-04 unverdicted novelty 6.0

Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.
FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection
cs.SD 2026-04 unverdicted novelty 6.0

FastTurn unifies acoustic features and streaming CTC decoding for low-latency, robust turn detection in full-duplex dialogue systems and releases a realistic human-dialogue test set.
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
cs.CV 2025-11 conditional novelty 6.0

OmniZip introduces an audio-guided dynamic token compression framework that achieves 3.42X inference speedup and 1.4X memory reduction for omnimodal LLMs without any training.
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
cs.CV 2024-12 unverdicted novelty 6.0

VideoChat-Flash applies hierarchical video token compression to achieve ~50x reduction in context length for long videos while maintaining near-original performance on long-context benchmarks.
ERASE: Eliminating Redundant Visual Tokens via Adaptive Two-Stage Token Pruning
cs.CV 2026-05 unverdicted novelty 5.0

ERASE prunes 85% of vision tokens in Qwen2.5-VL-7B while retaining 89.46% accuracy, outperforming prior methods that retain only 78.1%.
Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge
eess.AS 2026-04 unverdicted novelty 5.0

A new HumDial-FDBench benchmark and real human-recorded dual-channel dataset are released to assess full-duplex dialogue systems on interruptions and conversational flow.
Valley3: Scaling Omni Foundation Models for E-commerce
cs.AI 2026-05 unverdicted novelty 4.0

Valley3 is an omni MLLM for e-commerce that uses a four-stage pre-training pipeline plus post-training for controllable reasoning and agentic search, outperforming baselines on e-commerce benchmarks while staying comp...

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 19 Pith papers · 28 internal anchors

[1]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.arXiv preprint arXiv:2304.08485, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Anygpt: Unified multimodal llm with discrete sequence modeling.arXiv preprint arXiv:2402.12226,

Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, et al. Anygpt: Unified multimodal llm with discrete sequence modeling.arXiv preprint arXiv:2402.12226, 2024

work page arXiv 2024
[3]

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding.arXiv preprint arXiv:2306.02858, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Speechact: Towards generating whole-body motion from speech.IEEE TVCG, 2025

Jinsong Zhang, Minjie Zhu, Yuxiang Zhang, Zerong Zheng, Yebin Liu, and Kun Li. Speechact: Towards generating whole-body motion from speech.IEEE TVCG, 2025

work page 2025
[5]

video-salmonn: Speech-enhanced audio-visual large language models.arXiv preprint arXiv:2406.15704,

Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, and Chao Zhang. video-salmonn: Speech-enhanced audio-visual large language models.arXiv preprint arXiv:2406.15704, 2024

work page arXiv 2024
[6]

Audio-visual llm for video understanding,

Fangxun Shu, Lei Zhang, Hao Jiang, and Cihang Xie. Audio-visual llm for video understanding. arXiv preprint arXiv:2312.06720, 2023

work page arXiv 2023
[7]

Dycrowd: Towards dynamic crowd reconstruction from a large-scene video.IEEE TPAMI, 2025

Hao Wen, Hongbo Kang, Jian Ma, Jing Huang, Yuanwang Yang, Haozhe Lin, Yu-Kun Lai, and Kun Li. Dycrowd: Towards dynamic crowd reconstruction from a large-scene video.IEEE TPAMI, 2025

work page 2025
[8]

Empow- ering llms with pseudo-untrimmed videos for audio-visual temporal understanding

Yunlong Tang, Daiki Shimada, Jing Bi, Mingqian Feng, Hang Hua, and Chenliang Xu. Empow- ering llms with pseudo-untrimmed videos for audio-visual temporal understanding. InAAAI, 2025

work page 2025
[9]

An overview of large ai models and their applications.Visual Intelligence, 2024

Xiaoguang Tu, Zhi He, Yi Huang, Zhi-Hao Zhang, Ming Yang, and Jian Zhao. An overview of large ai models and their applications.Visual Intelligence, 2024

work page 2024
[10]

Improving multimodal speech recognition by data augmentation and speech representations

Dan Oneat, ˘a and Horia Cucu. Improving multimodal speech recognition by data augmentation and speech representations. InCVPR, 2022

work page 2022
[11]

Speech-to-text and text-to-speech recognition using deep learning

V Madhusudhana Reddy, T Vaishnavi, and K Pavan Kumar. Speech-to-text and text-to-speech recognition using deep learning. InICECAA. IEEE, 2023

work page 2023
[12]

Vita: Towards open-source interactive omni multimodal llm.arXiv preprint arXiv:2408.05211, 2024

Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Meng Zhao, Yifan Zhang, Shaoqi Dong, Xiong Wang, Di Yin, Long Ma, et al. Vita: Towards open-source interactive omni multimodal llm.arXiv preprint arXiv:2408.05211, 2024

work page arXiv 2024
[13]

SpeechGPT : E mpowering large language models with intrinsic cross-modal conversational abilities

Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities.arXiv preprint arXiv:2305.11000, 2023

work page arXiv 2023
[14]

Geometry-guided dense perspective network for speech-driven facial animation

Jingying Liu, Binyuan Hui, Kun Li, Yunke Liu, Yu-Kun Lai, Yuxiang Zhang, Yebin Liu, and Jingyu Yang. Geometry-guided dense perspective network for speech-driven facial animation. IEEE TVCG, 2021

work page 2021
[15]

Logavatar: Local gaussian splatting for human avatar modeling from monocular video.CAD, 2025

Jinsong Zhang, Xiongzheng Li, Hailong Jia, Jin Li, Zhuo Su, Guidong Wang, and Kun Li. Logavatar: Local gaussian splatting for human avatar modeling from monocular video.CAD, 2025

work page 2025
[16]

Gpt-4 technical report

OpenAI. Gpt-4 technical report. 2023

work page 2023
[17]

Language models are few-shot learners.NeurIPS, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.NeurIPS, 2020

work page 2020
[18]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023. 10

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

Stanford alpaca: An instruction-following llama model, 2023

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Stanford alpaca: An instruction-following llama model, 2023

work page 2023
[21]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.See https://vicuna

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.See https://vicuna. lmsys. org (accessed 14 April 2023), 2023

work page 2023
[22]

Mistral 7B

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b.arXiv preprint arXiv:2310.06825, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms.arXiv preprint arXiv:2406.16860, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models.arXiv preprint arXiv:2403.18814, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies.arXiv preprint arXiv:2404.06395, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, et al. Deepseek-vl: towards real-world vision-language understanding.arXiv preprint arXiv:2403.05525, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Beyond llava-hd: Diving into high-resolution large multimodal models.arXiv preprint arXiv:2406.08487,

Yi-Fan Zhang, Qingsong Wen, Chaoyou Fu, Xue Wang, Zhang Zhang, Liang Wang, and Rong Jin. Beyond llava-hd: Diving into high-resolution large multimodal models.arXiv preprint arXiv:2406.08487, 2024

work page arXiv 2024
[29]

Long Context Transfer from Language to Vision

Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection.arXiv preprint arXiv:2311.10122, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

E- chat: Emotion-sensitive spoken dialogue system with large language models.arXiv preprint arXiv:2401.00475, 2023

Hongfei Xue, Yuhao Liang, Bingshen Mu, Shiliang Zhang, Qian Chen, and Lei Xie. E- chat: Emotion-sensitive spoken dialogue system with large language models.arXiv preprint arXiv:2401.00475, 2023

work page arXiv 2023
[32]

Paralinguistics-enhanced large language modeling of spoken dialogue.arXiv preprint arXiv:2312.15316, 2023

Guan-Ting Lin, Prashanth Gurunath Shivakumar, Ankur Gandhe, Chao-Han Huck Yang, Yile Gu, Shalini Ghosh, Andreas Stolcke, Hung-yi Lee, and Ivan Bulyko. Paralinguistics-enhanced large language modeling of spoken dialogue.arXiv preprint arXiv:2312.15316, 2023

work page arXiv 2023
[33]

OpenAI.https://openai.com/index/hello-gpt-4o/, 2024

work page 2024
[35]

, Guo, S

Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, and Yang Feng. Llama-omni: Seamless speech interaction with large language models.arXiv preprint arXiv:2409.06666, 2024. 11

work page arXiv 2024
[36]

Moshi: a speech-text foundation model for real-time dialogue

Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue.arXiv preprint arXiv:2410.00037, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Hello gpt-4o

OpenAI. Hello gpt-4o. 2023

work page 2023
[38]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.arXiv preprint arXiv:2404.16821, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm

Xiong Wang, Yangze Li, Chaoyou Fu, Lei Xie, Ke Li, Xing Sun, and Long Ma. Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm.arXiv preprint arXiv:2411.00774, 2024

work page arXiv 2024
[41]

Fewer-token neural speech codec with time-invariant codes

Yong Ren, Tao Wang, Jiangyan Yi, Le Xu, Jianhua Tao, Chu Yuan Zhang, and Junzuo Zhou. Fewer-token neural speech codec with time-invariant codes. InICASSP. IEEE, 2024

work page 2024
[42]

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions.arXiv preprint arXiv:2311.12793, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Allava: Harnessing gpt4v- synthesized data for a lite vision-language model, 2024

Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. Allava: Harnessing gpt4v- synthesized data for a lite vision-language model, 2024

work page 2024
[44]

To see is to believe: Prompting gpt-4v for better visual instruction tuning

Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, and Yu-Gang Jiang. To see is to believe: Prompting gpt-4v for better visual instruction tuning.arXiv preprint arXiv:2311.07574, 2023

work page arXiv 2023
[45]

Learn to explain: Multimodal reasoning via thought chains for science question answering.NeurIPS, 2022

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering.NeurIPS, 2022

work page 2022
[46]

Chatqa: Surpassing gpt-4 on conversational qa and rag.arXiv preprint arXiv:2401.10225, 2024

Zihan Liu, Wei Ping, Rajarshi Roy, Peng Xu, Chankyu Lee, Mohammad Shoeybi, and Bryan Catanzaro. Chatqa: Surpassing gpt-4 on conversational qa and rag.arXiv preprint arXiv:2401.10225, 2024

work page arXiv 2024
[47]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[48]

Anytext: Multilingual visual text generation and editing.arXiv preprint arXiv:2311.03054, 2023

Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, and Xuansong Xie. Anytext: Multilingual visual text generation and editing.arXiv preprint arXiv:2311.03054, 2023

work page arXiv 2023
[49]

Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model.arXiv preprint arXiv:2310.05126, 2023

Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Jun- feng Tian, Qi Qian, Ji Zhang, et al. Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model.arXiv preprint arXiv:2310.05126, 2023

work page arXiv 2023
[50]

Sharegemini: Scaling up video caption data for multimodal large language models, June 2024.https://github.com/Share14/ShareGemini

Share. Sharegemini: Scaling up video caption data for multimodal large language models, June 2024.https://github.com/Share14/ShareGemini

work page 2024
[51]

Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark.NeurIPS, 2022

Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Niu Minzhe, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei Zhang, Xin Jiang, et al. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark.NeurIPS, 2022

work page 2022
[52]

Laion- 5b: An open large-scale dataset for training next generation image-text models.NeurIPS, 2022

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion- 5b: An open large-scale dataset for training next generation image-text models.NeurIPS, 2022. 12

work page 2022
[53]

Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. InCVPR, 2021

[54]

Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, 2006

[55]

Vila: On pre-training for visual language models, 2023

Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models, 2023

[56]

Llava-next: Stronger llms supercharge multimodal capabilities in the wild, May 2024

Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. Llava-next: Stronger llms supercharge multimodal capabilities in the wild, May 2024

[57]

CogVLM2: Visual Language Models for Image and Video Understanding

Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, et al. Cogvlm2: Visual language models for image and video understanding. arXiv preprint arXiv:2408.16500, 2024

[58]

InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan, Hang Yan, et al. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023

[59]

Ovis: Structural embedding alignment for multimodal large language model. arXiv preprint arXiv:2405.20797, 2024

Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. Ovis: Structural embedding alignment for multimodal large language model. arXiv preprint arXiv:2405.20797, 2024

[60]

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023

[61]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023

[62]

MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023

[63]

Are We on the Right Way for Evaluating Large Vision-Language Models?

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330, 2024

[64]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024

[65]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023

[66]

Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In CVPR, 2024

[67]

Ai2d-rst: A multimodal corpus of 1000 primary school science diagrams. Language Resources and Evaluation, 2021

Tuomo Hiippala, Malihe Alikhani, Jonas Haverinen, Timo Kalliokoski, Evanfiya Logacheva, Serafina Orekhova, Aino Tuomainen, Matthew Stone, and John A Bateman. Ai2d-rst: A multimodal corpus of 1000 primary school science diagrams. Language Resources and Evaluation, 2021

[68]

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

Yuliang Liu, Zhang Li, Biao Yang, Chunyuan Li, Xucheng Yin, Cheng-lin Liu, Lianwen Jin, and Xiang Bai. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2023

[69]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023

[70]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024

[71]

Mvbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In CVPR, 2024

[72]

TempCompass: Do Video LLMs Really Understand Videos?

Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? arXiv preprint arXiv:2403.00476, 2024

[73]

wav2vec 2.0: A framework for self-supervised learning of speech representations. NeurIPS, 2020

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. NeurIPS, 2020

[74]

Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities. arXiv preprint arXiv:2410.11190, 2024

Zhifei Xie and Changqiao Wu. Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities. arXiv preprint arXiv:2410.11190, 2024

[75]

Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline

Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng. Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline. In O-COCOSDA. IEEE, 2017

[76]

Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio

Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, et al. Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio. arXiv preprint arXiv:2106.06909, 2021

[77]

Wenetspeech: A 10000+ hours multi-domain mandarin corpus for speech recognition

Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, et al. Wenetspeech: A 10000+ hours multi-domain mandarin corpus for speech recognition. In ICASSP, 2022

[78]

Librispeech: an asr corpus based on public domain audio books

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In ICASSP, 2015
