video-SALMONN: Speech-enhanced audio-visual large language models. arXiv preprint arXiv:2406.15704

14 Pith papers cite this work. Polarity classification is still indexing.

citation summary

years: 2026 (6), 2025 (8)

roles: background (3)

polarities: background (3)

representative citing papers

Do Audio-Visual Large Language Models Really See and Hear?

cs.AI · 2026-04-03 · unverdicted · novelty 8.0

AVLLMs encode audio semantics in middle layers but suppress them in final text outputs when audio conflicts with vision, due to training that largely inherits from vision-language base models.

V-LynX: Token Interface Alignment for Video+X LLMs

cs.CV · 2026-05-30 · unverdicted · novelty 6.0

V-LynX integrates novel modalities into frozen Video LLMs by aligning to an internalized continuous token manifold using unpaired unimodal data and attention/statistical matching.

Qwen2.5-Omni Technical Report

cs.CL · 2025-03-26 · conditional · novelty 5.0

Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text performance on reasoning benchmarks.
