pith. machine review for the scientific record.

arxiv: 2503.20215 · v1 · submitted 2025-03-26 · 💻 cs.CL · cs.CV · cs.SD · eess.AS

Recognition: 1 theorem link

Qwen2.5-Omni Technical Report

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 17:50 UTC · model grok-4.3

classification 💻 cs.CL · cs.CV · cs.SD · eess.AS
keywords multimodal model · streaming speech generation · end-to-end architecture · Thinker-Talker · TMRoPE · Omni-Bench · speech instruction following · Qwen2.5-Omni

The pith

Qwen2.5-Omni processes text, images, audio and video inputs while generating text and streaming speech in one end-to-end system.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The report describes a multimodal model that accepts diverse inputs and produces both text answers and natural-sounding speech responses without separate pipelines. A block-wise encoder handles streaming inputs, while a new position embedding aligns video frames with audio timestamps. The core design splits generation into a Thinker that produces text and a Talker that turns the Thinker's internal states directly into audio tokens. This arrangement lets the model reach state-of-the-art scores on multimodal tests and keep speech-instruction performance close to its text-instruction scores on standard language benchmarks.
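The timestamp alignment described above can be illustrated in plain Python. This is an editorial sketch, not the paper's algorithm: the chunk length, helper name, and token format are all assumptions.

```python
# Editorial sketch of timestamp-based interleaving of audio and video
# tokens, in the spirit of TMRoPE's sequential arrangement. The 2-second
# chunk length and token format are assumptions, not from the report.

def interleave_by_time(video_frames, audio_frames, chunk_s=2.0):
    """Merge (timestamp, token) streams into one sequence, chunked by
    time so that co-occurring video and audio tokens stay adjacent."""
    merged = []
    t = 0.0
    end = max((ts for ts, _ in video_frames + audio_frames), default=0.0)
    while t <= end:
        # Within each chunk, emit video tokens first, then audio tokens.
        merged += [tok for ts, tok in video_frames if t <= ts < t + chunk_s]
        merged += [tok for ts, tok in audio_frames if t <= ts < t + chunk_s]
        t += chunk_s
    return merged

video = [(0.0, "v0"), (1.0, "v1"), (2.0, "v2"), (3.0, "v3")]
audio = [(0.5, "a0"), (1.5, "a1"), (2.5, "a2"), (3.5, "a3")]
print(interleave_by_time(video, audio))
# ['v0', 'v1', 'a0', 'a1', 'v2', 'v3', 'a2', 'a3']
```

The point of the chunking is that tokens from the same time span sit next to each other in the sequence, so the position embedding can assign them aligned timestamps.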

Core claim

Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Its end-to-end speech instruction following is comparable to its text capabilities on MMLU and GSM8K, and its streaming Talker outperforms most existing alternatives in robustness and naturalness.

What carries the argument

The Thinker-Talker architecture, in which the Thinker operates as a language model for text generation while the Talker directly consumes the Thinker's hidden representations to autoregressively produce audio tokens, together with Time-aligned Multimodal RoPE that interleaves audio and video for synchronized timestamps.
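The hidden-state handoff can be sketched in a few lines of plain Python. Everything here is a toy stand-in: the `thinker` and `talker` functions, the list-of-floats "hidden state", and the modulus vocabulary size are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the Thinker-Talker split: the Talker maps the
# Thinker's hidden states (not its decoded text) to audio tokens, so
# both outputs can stream from one forward pass. All names and the toy
# hidden representation are editorial assumptions.

def thinker(input_tokens):
    """Stand-in LLM: returns (text_tokens, hidden_states), one hidden
    vector per position."""
    hidden = [[float(len(tok)), float(i)] for i, tok in enumerate(input_tokens)]
    text = [tok.upper() for tok in input_tokens]  # placeholder "generation"
    return text, hidden

def talker(hidden_states):
    """Stand-in speech head: quantizes hidden states into audio token
    ids drawn from a toy codebook of size 4096."""
    return [int(sum(h)) % 4096 for h in hidden_states]

text_out, hidden = thinker(["hello", "world"])
audio_tokens = talker(hidden)  # consumes hidden states directly
print(text_out, audio_tokens)
```

The design choice the sketch isolates: because the Talker reads representations rather than decoded text, speech generation does not have to wait for, or interfere with, the text decoding path.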

If this is right

  • Text and speech can be produced at the same time because the Talker reads the Thinker's states directly.
  • Video and audio inputs stay time-aligned through sequential interleaving and the new position embedding.
  • Streaming speech decoding uses a sliding-window diffusion transformer that limits the initial delay.
  • The single model matches or exceeds the performance of prior separate audio and vision systems on shared benchmarks.
  • End-to-end training of both components becomes possible without modality-specific post-processing stages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hidden-state handoff could extend to additional output modalities such as video or code if the Talker module is swapped.
  • Real-time conversational systems would gain lower latency because one forward pass supplies both text and speech tokens.
  • Training data requirements might decrease if the shared Thinker representations transfer across modalities more efficiently than isolated encoders.
  • Deployment on edge devices could simplify because only one set of weights needs quantization and serving.

Load-bearing premise

The Thinker-Talker split and TMRoPE fully remove interference between modalities and timestamp misalignment without hidden costs in training stability or generalization that only show up on wider or out-of-distribution tests.

What would settle it

A side-by-side evaluation on out-of-distribution multimodal tasks where the unified model's accuracy falls noticeably below that of separately trained modality specialists would show the interference-avoidance claim does not hold.

read the original abstract

In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni is comparable with the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni's performance in end-to-end speech instruction following is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Qwen2.5-Omni, an end-to-end multimodal model that processes text, image, audio, and video inputs while generating text and streaming natural speech. It employs block-wise encoders for audio/visual streams, interleaves audio and video with the proposed TMRoPE position embedding to align timestamps, and uses a Thinker-Talker architecture in which the Thinker (an LLM) produces text and hidden states that feed a dual-track autoregressive Talker for audio tokens. A sliding-window DiT decoder enables low-latency streaming speech. The report claims the model matches similarly sized Qwen2.5-VL, outperforms Qwen2-Audio, reaches SOTA on Omni-Bench, shows speech instruction-following performance comparable to text on MMLU and GSM8K, and delivers more robust and natural streaming speech than prior alternatives.

Significance. If the performance numbers hold and can be attributed to the architectural choices, the work would be significant for advancing unified multimodal models that support real-time streaming generation. The Thinker-Talker decoupling and TMRoPE alignment mechanism address practical challenges in modality interference and temporal synchronization, providing a concrete design pattern that could be adopted or extended by the community. The end-to-end training claim and the streaming DiT component are also useful reference points for latency-sensitive applications.

major comments (3)
  1. [Abstract / §3 (Architecture)] Abstract and architecture description: The central claims that the Thinker-Talker split fully eliminates interference between text and speech modalities and that TMRoPE resolves timestamp misalignment rest on the assertion that these mechanisms succeed without hidden costs; however, the manuscript provides no ablation tables, stability metrics, or OOD evaluations that isolate their contributions versus scale or data effects.
  2. [Abstract / Results section] Results claims: The SOTA performance on Omni-Bench and the statement that end-to-end speech instruction following matches text performance on MMLU/GSM8K are reported without error bars, multiple runs, or explicit controls for data contamination, making it difficult to verify that the gains derive from the proposed block-wise encoders, interleaved sequencing, and dual-track Talker rather than training data or model size.
  3. [Abstract / Talker description] Streaming Talker evaluation: The claim that the sliding-window DiT Talker outperforms existing streaming and non-streaming alternatives in robustness and naturalness lacks quantitative latency measurements, robustness tests on out-of-distribution audio, or direct comparisons that control for the Thinker hidden-state input quality.
minor comments (2)
  1. [§3.1] The mathematical definition of TMRoPE (Time-aligned Multimodal RoPE) is described only in prose; adding an explicit equation would improve reproducibility and allow readers to verify the timestamp alignment logic.
  2. [Abstract / Evaluation] Benchmark names such as Omni-Bench are used without a brief definition or citation in the main text; a short footnote or reference would aid readers unfamiliar with the suite.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, acknowledging where additional evidence is needed and outlining targeted revisions to the manuscript.

read point-by-point responses
  1. Referee: [Abstract / §3 (Architecture)] Abstract and architecture description: The central claims that the Thinker-Talker split fully eliminates interference between text and speech modalities and that TMRoPE resolves timestamp misalignment rest on the assertion that these mechanisms succeed without hidden costs; however, the manuscript provides no ablation tables, stability metrics, or OOD evaluations that isolate their contributions versus scale or data effects.

    Authors: We agree that explicit ablations would strengthen the attribution of gains to the Thinker-Talker decoupling and TMRoPE. The current results rely on end-to-end comparisons against Qwen2.5-VL and Qwen2-Audio. In the revised manuscript we will add ablation tables that disable the dual-track Talker (forcing joint text-speech generation) and remove TMRoPE (replacing it with standard RoPE), reporting effects on modality interference, timestamp alignment accuracy, and downstream benchmark scores. We will also include training stability metrics (loss variance across seeds) and a small OOD test set for temporal misalignment. revision: yes

  2. Referee: [Abstract / Results section] Results claims: The SOTA performance on Omni-Bench and the statement that end-to-end speech instruction following matches text performance on MMLU/GSM8K are reported without error bars, multiple runs, or explicit controls for data contamination, making it difficult to verify that the gains derive from the proposed block-wise encoders, interleaved sequencing, and dual-track Talker rather than training data or model size.

    Authors: We acknowledge the value of statistical reporting. All numbers are from single training runs given the scale of end-to-end multimodal training. In revision we will report inference-time variance (multiple decoding seeds) with error bars on MMLU, GSM8K, and Omni-Bench. We will also add a paragraph detailing our data decontamination pipeline (exact overlap checks against benchmark test sets) and note that full multi-run training ablations are computationally prohibitive. These changes clarify the evaluation protocol without altering the reported point estimates. revision: partial

  3. Referee: [Abstract / Talker description] Streaming Talker evaluation: The claim that the sliding-window DiT Talker outperforms existing streaming and non-streaming alternatives in robustness and naturalness lacks quantitative latency measurements, robustness tests on out-of-distribution audio, or direct comparisons that control for the Thinker hidden-state input quality.

    Authors: We will expand the Talker evaluation section with concrete latency metrics (initial package delay, real-time factor, and end-to-end latency under streaming conditions). We will add robustness results on an OOD audio test set (noisy, accented, and code-switched samples) and include controlled comparisons that feed identical Thinker hidden states to both our sliding-window DiT and baseline decoders. These quantitative additions will be placed in a new subsection of the results. revision: yes
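The promised latency metrics are straightforward to define operationally. A toy sketch, with a fake chunk generator standing in for the model (all names and timings are illustrative, not measurements from the paper):

```python
import time

# Toy sketch of the two latency metrics named in the response: initial
# package delay (wall-clock time to the first audio chunk) and
# real-time factor, RTF (generation time / duration of audio produced).
# The generator and its timings are stand-ins, not real model behavior.

def fake_streaming_tts(chunk_audio_s=0.5, compute_per_chunk_s=0.01):
    for _ in range(4):                   # pretend: 4 audio chunks
        time.sleep(compute_per_chunk_s)  # stand-in for model compute
        yield chunk_audio_s              # seconds of audio in this chunk

def measure(stream):
    start = time.perf_counter()
    first_delay, audio_s = None, 0.0
    for chunk_s in stream:
        if first_delay is None:
            first_delay = time.perf_counter() - start
        audio_s += chunk_s
    rtf = (time.perf_counter() - start) / audio_s
    return first_delay, rtf

delay, rtf = measure(fake_streaming_tts())
print(f"initial delay ~ {delay:.3f}s, RTF ~ {rtf:.3f}")
```

An RTF below 1.0 means audio is produced faster than it plays back, which is the condition for sustained streaming.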

Circularity Check

0 steps flagged

No circularity: empirical benchmark claims with no derivations or self-referential predictions

full rationale

The paper is a technical report presenting Qwen2.5-Omni's architecture (block-wise encoders, TMRoPE, Thinker-Talker split, sliding-window DiT) and its observed performance on benchmarks like Omni-Bench, MMLU, and GSM8K. No equations, first-principles derivations, or 'predictions' are claimed that could reduce to fitted inputs or self-citations by construction. Self-references to prior Qwen models (e.g., Qwen2.5-VL, Qwen2-Audio) are standard comparisons and not load-bearing for the new elements, which are validated directly via empirical results rather than internal definitions. The central claims rest on external benchmark evaluations, so the report is grounded in external data without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard transformer assumptions plus the empirical effectiveness of the proposed components; no new physical or mathematical axioms are introduced.

axioms (1)
  • standard math Standard transformer attention and autoregressive generation assumptions hold for the interleaved multimodal inputs.
    Invoked implicitly when describing block-wise encoders and Thinker-Talker training.

pith-pipeline@v0.9.0 · 5679 in / 1327 out tokens · 33331 ms · 2026-05-10T17:50:03.825098+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

    cs.SD 2026-05 unverdicted novelty 8.0

    Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

  2. TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

    cs.CV 2026-05 unverdicted novelty 8.0

    TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

  3. HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

    cs.SD 2026-04 unverdicted novelty 8.0

    HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-lang...

  4. Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

    cs.CR 2026-04 conditional novelty 8.0

    Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.

  5. DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

    cs.AI 2026-04 unverdicted novelty 8.0

    DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

  6. Do Audio-Visual Large Language Models Really See and Hear?

    cs.AI 2026-04 unverdicted novelty 8.0

    AVLLMs encode audio semantics in middle layers but suppress them in final text outputs when audio conflicts with vision, due to training that largely inherits from vision-language base models.

  7. TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning

    cs.CV 2026-05 unverdicted novelty 7.0

    TB-AVA uses text as a semantic anchor with a new Text-Bridged Audio-Visual Adapter and Gated Semantic Modulation to achieve state-of-the-art results on audio-visual benchmarks through parameter-efficient fine-tuning.

  8. AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

    cs.SD 2026-05 unverdicted novelty 7.0

    AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.

  9. How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

    cs.CL 2026-05 unverdicted novelty 7.0

    Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of w...

  10. Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization

    cs.CV 2026-05 unverdicted novelty 7.0

    Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.

  11. jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

    cs.CL 2026-05 unverdicted novelty 7.0

    Jina-embeddings-v5-omni creates multimodal embeddings for text, image, audio, and video by freezing the text and media encoders and training only 0.35% of the weights via a VLM-style connector.

  12. VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

    cs.CL 2026-05 unverdicted novelty 7.0

    VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

  13. Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

    cs.CL 2026-05 unverdicted novelty 7.0

    TextPro-SLM minimizes the speech-text modality gap from the input side via a prosody-aware unified encoder, delivering the lowest gap and strong performance at 3B/7B scales with only ~1000 hours of audio.

  14. Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

    cs.CR 2026-05 conditional novelty 7.0

    Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.

  15. TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation

    cs.SD 2026-05 unverdicted novelty 7.0

    TMD-Bench is a multi-level benchmark that measures music-dance co-generation quality including beat-level rhythmic synchronization, supported by a new dataset and Music Captioner, and shows commercial models lag in rh...

  16. Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

    eess.AS 2026-04 unverdicted novelty 7.0

    Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- a...

  17. Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach

    cs.LG 2026-04 unverdicted novelty 7.0

    ProjRes achieves near-100% accuracy in membership inference on FedLLMs by measuring projection residuals of hidden embeddings on gradient subspaces, outperforming prior methods by up to 75.75% even under differential privacy.

  18. SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.

  19. ATIR: Towards Audio-Text Interleaved Contextual Retrieval

    cs.SD 2026-04 unverdicted novelty 7.0

    Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.

  20. ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety

    cs.CR 2026-04 unverdicted novelty 7.0

    ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisone...

  21. Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions

    cs.CV 2026-04 conditional novelty 7.0

    Creates the first egocentric screen-view movie emotion benchmark and demonstrates that cinematic models drop sharply in Macro-F1 on realistic robot-like viewing conditions while domain-specific training improves robustness.

  22. From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

    cs.AI 2026-04 unverdicted novelty 7.0

    ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.

  23. Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.

  24. Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Audio-Contrastive Preference Optimization (ACPO) mitigates audio hallucination in AVLMs via output-contrastive and input-contrastive objectives that enforce faithful audio grounding.

  25. Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.

  26. HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models

    eess.AS 2026-04 unverdicted novelty 7.0

    HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...

  27. TiCo: Time-Controllable Spoken Dialogue Model

    cs.CL 2026-03 unverdicted novelty 7.0

    TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.

  28. Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

    cs.SD 2025-07 unverdicted novelty 7.0

    Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.

  29. FSD50K-Solo: Automated Curation of Single-Source Sound Events

    eess.AS 2026-05 conditional novelty 6.0

    The authors present a scalable curation method that combines diffusion-based mixture synthesis with a discriminative classifier to automatically extract single-source sound events from FSD50K and release the cleaned F...

  30. SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning

    cs.CV 2026-05 unverdicted novelty 6.0

    SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.

  31. Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model

    eess.AS 2026-05 unverdicted novelty 6.0

    A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.

  32. Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    cs.MM 2026-05 unverdicted novelty 6.0

    Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.

  33. Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    ContextGuard prunes 55% of tokens in Qwen2.5-Omni 7B while matching full performance on five of six audio-visual benchmarks by preserving audio-irrecoverable visual context.

  34. TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning

    cs.CV 2026-05 unverdicted novelty 6.0

    TB-AVA uses text-mediated gated semantic modulation to enable efficient audio-visual alignment, achieving state-of-the-art results on AVE, AVS, and AVVP benchmarks.

  35. Probing Cross-modal Information Hubs in Audio-Visual LLMs

    cs.AI 2026-05 unverdicted novelty 6.0

    AVLLMs encode integrated audio-visual information primarily in specialized cross-modal sink tokens, which enables a training-free hallucination mitigation approach.

  36. Probing Cross-modal Information Hubs in Audio-Visual LLMs

    cs.AI 2026-05 unverdicted novelty 6.0

    AVLLMs store integrated audio-visual information mainly in a distinct subset of sink tokens called cross-modal sink tokens, which can be leveraged for training-free hallucination mitigation.

  37. Accelerating Compound LLM Training Workloads with Maestro

    cs.DC 2026-05 unverdicted novelty 6.0

    Maestro accelerates compound LLM training via section graphs for per-component configuration and wavefront scheduling for dynamic execution, reducing GPU consumption by ~40% in real deployments.

  38. jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

    cs.CL 2026-05 unverdicted novelty 6.0

    GELATO extends frozen text embedding models with locked image and audio encoders, training minimal connectors to produce a single semantic embedding space for text, image, audio, and video while keeping original text ...

  39. KARMA-MV: A Benchmark for Causal Question Answering on Music Videos

    cs.CV 2026-05 unverdicted novelty 6.0

    KARMA-MV is a new benchmark showing that causal knowledge graphs improve VLMs on causal audio-visual reasoning in music videos.

  40. EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness

    cs.CV 2026-05 unverdicted novelty 6.0

    EmoMM benchmark reveals Video Contribution Collapse in MLLMs for emotion recognition under modality conflict and missingness, mitigated by CHASE head-level attention steering.

  41. All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

    cs.SD 2026-04 unverdicted novelty 6.0

    Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.

  42. Exploring Audio Hallucination in Egocentric Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    AV-LLMs hallucinate audio from visuals in egocentric videos, scoring only 27.3% accuracy on foreground sounds and 39.5% on background sounds in a 1000-question evaluation.

  43. HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models

    cs.SD 2026-04 unverdicted novelty 6.0

    HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.

  44. DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models

    eess.AS 2026-04 unverdicted novelty 6.0

    DM-ASR reformulates multi-speaker ASR as multi-turn dialogue generation conditioned on diarization results, achieving competitive benchmark performance with relatively small models and limited data.

  45. MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    MMControl adds multi-modal controls for identity, timbre, pose, and layout to unified audio-video diffusion models via dual-stream injection and adjustable guidance scaling.

  46. Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval

    cs.SD 2026-04 unverdicted novelty 6.0

    Omni-Embed-Audio uses multimodal LLMs to match CLAP on standard audio retrieval while improving text-to-text retrieval by 22% relative and hard negative discrimination by 4.3 points HNSR@10 on user-intent queries.

  47. Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages

    cs.CL 2026-04 conditional novelty 6.0

    Phoneme-level analysis of ASR on Archi and Rutul shows data scarcity explains recognition errors better than phonological complexity, with language-specific adaptations improving wav2vec2 performance.

  48. VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech

    eess.AS 2026-04 unverdicted novelty 6.0

    VIBE evaluates generative biases in large audio-language models with real-world speech and open-ended tasks, showing that gender cues produce larger distributional shifts than accent cues across 11 tested models.

  49. Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.

  50. AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers

    cs.CV 2026-04 unverdicted novelty 6.0

    AVRT transfers reasoning to audio-visual models by distilling traces from single-modality teachers via LLM merger followed by SFT cold-start and RL, achieving SOTA on OmniBench, DailyOmni, and MMAR with 3B/7B models.

  51. RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    RaTA-Tool retrieves suitable external tools for multimodal queries by matching generated task descriptions against tool metadata, supported by a new Hugging Face-derived dataset and DPO optimization.

  52. Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization

    eess.AS 2026-04 unverdicted novelty 6.0

    A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on...

  53. POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.

  54. Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation

    cs.CL 2026-04 unverdicted novelty 6.0

    SA-SLM uses variational information bottleneck for intent-aware bridging and self-criticism for realization-aware alignment to close the semantic-acoustic gap, outperforming open-source models and nearing GPT-4o-Audio...

  55. EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    EdgeRazor delivers 1.58-1.88 bit quantized LLMs that outperform 2-3 bit baselines by up to 11.3 points while using 4-10x less training compute than leading QAT methods.

  56. GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking

    cs.SD 2026-04 unverdicted novelty 6.0

    GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on fo...

  57. Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs

    cs.SD 2026-04 unverdicted novelty 6.0

    NAICL reduces hallucination rates in ALLMs from 26.53% to 16.98% via noise priors in context and introduces the Clotho-1K benchmark with four hallucination types.

  58. OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models

    cs.AI 2026-05 unverdicted novelty 5.0

    OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.

  59. Speech-based Psychological Crisis Assessment using LLMs

    cs.CL 2026-05 unverdicted novelty 5.0

    LLM system with paralinguistic cue injection and auxiliary reasoning training reaches 0.802 macro F1 and 0.805 accuracy on three-class speech-based crisis level classification under 5-fold cross-validation.

  60. EmoS: A High-Fidelity Multimodal Benchmark for Fine-grained Streaming Emotional Understanding

    cs.CL 2026-05 unverdicted novelty 5.0

    EmoS is a new high-fidelity benchmark for fine-grained streaming emotional understanding that produces measurable gains when used to fine-tune multimodal large language models.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 72 Pith papers · 20 internal anchors

  1. [1]

    Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, et al. Seed-tts: A family of high-quality versatile speech generation models. arXiv preprint arXiv:2406.02430,

  2. [2]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models. CoRR, abs/2108.07732,

  3. [3]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X...

  4. [4]

    Seed-asr: Understanding diverse speech and contexts with llm-based speech recognition

    Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, et al. Seed-asr: Understanding diverse speech and contexts with llm-based speech recognition. arXiv preprint arXiv:2407.04675,

  5. [5]

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv:2403.20330, 2024a.

  6. [6]

    WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

    Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, and Furu Wei. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE ...

  7. [7]

    VoiceBench: Benchmarking LLM-based Voice Assistants

    Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T. Tan, and Haizhou Li. Voicebench: Benchmarking llm-based voice assistants. arXiv preprint arXiv:2410.17196, 2024b.

  8. [8]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv:2305.06500,

  9. [9]

    SpeechVerse: A Large-Scale Generalizable Audio Language Model

    Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, David Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, et al. Speechverse: A large-scale generalizable audio language model. arXiv preprint arXiv:2405.08295,

  10. [10]

    LP-MusicCaps: LLM-Based Pseudo Music Captioning

    SeungHeon Doh, Keunwoo Choi, Jongpil Lee, and Juhan Nam. Lp-musiccaps: Llm-based pseudo music captioning. arXiv preprint arXiv:2307.16372,

  11. [11]

    CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

    Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, et al. Cosyvoice 2: Scalable streaming speech synthesis with large language models. arXiv preprint arXiv:2412.10117,

  12. [12]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany...

  13. [13]

    E2 tts: Embarrassingly easy fully non-autoregressive zero-shot tts

    Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, et al. E2 tts: Embarrassingly easy fully non-autoregressive zero-shot tts. In 2024 IEEE Spoken Language Technology Workshop (SLT), pp. 682–689. IEEE,

  14. [14]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv:2306.13394,

  15. [15]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv:2405.21075, 2024a. Ling Fu, Biao Yang, Zhebin Kuang, Jiajun Song, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, ...

  16. [16]

    Are We Done with MMLU?

    Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we done with mmlu? CoRR, abs/2406.04127,

  17. [17]

    AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

    Kaixiong Gong, Kaituo Feng, Bohao Li, Yibing Wang, Mofan Cheng, Shijia Yang, Jiaming Han, Benyou Wang, Yutong Bai, Zhuoran Yang, et al. Av-odyssey bench: Can your multimodal llms really understand audio-visual information? arXiv preprint arXiv:2412.02611,

  18. [18]

    MERaLiON-AudioLLM: Technical Report

    Yingxu He, Zhuohan Liu, Shuo Sun, Bin Wang, Wenyu Zhang, Xunlong Zou, Nancy F. Chen, and Ai Ti Aw. Meralion-audiollm: Technical report. arXiv preprint arXiv:2412.09818,

  19. [19]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In ICLR. OpenReview.net, 2021a.

  20. [20]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. CoRR, abs/2403.07974,

  21. [21]

    ReferItGame: Referring to objects in photographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In Alessandro Moschitti, Bo Pang, and Walter Daelemans (eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 787– 798, Doha, Qatar, October

  22. [22]

    A Diagram Is Worth a Dozen Images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV,

  23. [23]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv:2301.12597,

  24. [24]

    Baichuan-Omni-1.5 Technical Report

    Yadong Li, Jun Liu, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, et al. Baichuan-omni-1.5 technical report. arXiv preprint arXiv:2501.15368,

  25. [25]

    OmniBench: Towards the Future of Universal Omni-Language Models

    Yizhi Li, Ge Zhang, Yinghao Ma, Ruibin Yuan, Kang Zhu, Hangyu Guo, Yiming Liang, Jiaheng Liu, Zekun Wang, Jian Yang, et al. Omnibench: Towards the future of universal omni-language models. arXiv preprint arXiv:2409.15272, 2024b.

  26. [26]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv:2310.03744, 2023a.

  27. [27]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Bo Li, Yuanhan Zhang, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player? arXiv:2307.06281, 2023c.

  28. [28]

    ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv:2203.10244,

  29. [29]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. CoRR, abs/2303.08774,

  30. [30]

    Robust Speech Recognition via Large-Scale Weak Supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA,

  31. [31]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. CoRR, abs/2311.12022,

  32. [32]

    Towards VQA Models That Can Read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In CVPR,

  33. [33]

    video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

    Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, and Chao Zhang. video-salmonn: Speech-enhanced audio-visual large language models. arXiv preprint arXiv:2406.15704,

  34. [34]

    SALMONN: towards generic hearing abilities for large language models

    Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. SALMONN: towards generic hearing abilities for large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024,

  35. [35]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530,

  36. [36]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv:2302.13971, 2023a. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soum...

  37. [37]

    On decoder-only architecture for speech-to-text and large language model integration

    Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, and Yu Wu. On decoder-only architecture for speech-to-text and large language model integration. abs/2307.03917,

  38. [38]

    Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming

    Zhifei Xie and Changqiao Wu. Mini-omni: Language models can hear, talk while thinking in streaming. arXiv preprint arXiv:2408.16725,

  39. [39]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv:2407.10671, 2024a.

  40. [40]

    MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv:2311.16502,

  41. [41]

    MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Ming Yin, Botao Yu, Ge Zhang, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. arXiv preprint arXiv:2409.02813,

  42. [42]

    AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

    Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, et al. Anygpt: Unified multimodal llm with discrete sequence modeling. arXiv preprint arXiv:2402.12226,

  43. [43]

    MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios That Are Difficult for Humans?

    Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans? arXiv preprint arXiv:2408.13257,

  44. [44]

    Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

    Zhisheng Zhong, Chengyao Wang, Yuqi Liu, Senqiao Yang, Longxiang Tang, Yuechen Zhang, Jingyao Li, Tianyuan Qu, Yanwei Li, Yukang Chen, et al. Lyra: An efficient and speech-centric framework for omni-cognition. arXiv preprint arXiv:2412.09501,

  45. [45]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv:2304.10592,