archive
Every paper Pith has read. Search by title, abstract, or pith.
221 papers in cs.MM
-
SpeakerLLM turns speaker verification into natural-language reasoning
SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning
-
Two-stage model fuses radar and satellite for sharper rain forecasts
VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting
-
RC metrics align object removal scores with human perception
PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media
-
Multi-agent system resolves multimedia claims into editable reports
Contestable Multi-Agent Debate with Arena-based Argumentative Computation for Multimedia Verification
-
Delta Forcing curbs drift in interactive video generation
Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation
-
Few channels control entire DiT image generation
Few Channels Draw The Whole Picture: Revealing Massive Activations in Diffusion Transformers
-
Backbone knowledge alone fools frozen deepfake detectors
Backbone is All You Need: Assessing Vulnerabilities of Frozen Foundation Models in Synthetic Image Forensics
-
Synthetic dataset benchmarks AI for swim coaching
Synthesizing the Expert: A Validated Multimodal Dataset for Trustworthy AI-Assisted Swimming Coaching
-
3B omni-model matches 30B on debiased benchmarks
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
-
ZipRerank matches top multimodal rerankers at 10x lower latency
Very Efficient Listwise Multimodal Reranking for Long Documents
-
Critic and generator agents iteratively refine research outlines
AgentDisCo: Towards Disentanglement and Collaboration in Open-ended Deep Research Agents
-
Adaptive path choice lifts unified multimodal reasoning
UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning
-
Unified transformer generates images from raw pixels without VAEs
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
-
Targeted head boost cuts hallucinations in vision-language models
Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination
-
New benchmark links social posts to fact-check evidence for model testing
RW-Post: Auditable Evidence-Grounded Multimodal Fact-Checking in the Wild
-
User queries alter video retrieval model behavior
FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries
-
Tube-structured packets speed and stabilize video recovery in semantic HARQ
Tube-Structured Incremental Semantic HARQ for Generative Video Receivers
-
Multi-scale supervision cuts pose errors in sign animation
KAN Text to Vision? The Exploration of Kolmogorov-Arnold Networks for Multi-Scale Sequence-Based Pose Animation from Sign Language Notation
-
Multi-layer CLIP similarities predict machine image preferences
ML-CLIPSim: Multi-Layer CLIP Similarity for Machine-Oriented Image Quality
-
Dual pathways fix conflicts in text-video-audio intent recognition
Mitigating Multimodal Inconsistency via Cognitive Dual-Pathway Reasoning for Intent Recognition
-
Invariant relations to known prototypes turn GCD into reliable pattern matching
Relational Retrieval: Leveraging Known-Novel Interactions for Generalized Category Discovery
-
Three-agent system lifts VLM accuracy on few-shot time series tasks
Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning
-
Home activity benchmark shows AI question-answering gaps
HOME-KGQA: A Benchmark Dataset for Multimodal Knowledge Graph Question Answering on Household Daily Activities
-
Color-adaptive scheme raises 3D Gaussian streaming quality 5-20 dB
CAGS: Color-Adaptive Volumetric Video Streaming with Dynamic 3D Gaussian Splatting
-
Gaussian splatting relights VP scenes by sampling LED backgrounds directly
Relightable Gaussian Splatting for Virtual Production Using Image-Based Illumination
-
Neural network adapts frame rate and resolution for better streamed graphics
Streaming of rendered content with adaptive frame rate and resolution
-
Edge offloading and pruning cut multi-condition T2I latency by 25%
Accelerating Multi-Condition T2I Generation via Adaptive Condition Offloading and Pruning
-
Unison aligns motion, speech and sound in video generation
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
-
Uni-modal focus sharpens weakly supervised AVVP
EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing
-
Thin clients stream interactive 3D Gaussian Splatting over HTTP/3
Thin-Client Interactive Gaussian Adaptive Streaming over HTTP/3
-
Anisotropic correction fixes modality gaps for unpaired training
Anisotropic Modality Align
-
Multimedia benchmark shows how file-access method guides terminal-agent workflows
MMTB: Evaluating Terminal Agents on Multimedia-File Tasks
-
Decomposed stages yield better chord variety and rules compliance
A Decomposed Retrieval-Edit-Rerank Framework for Chord Generation
-
Deleted Honeywell surveillance videos remain recoverable
Forensic analysis of video data deletion and recovery in Honeywell surveillance file system
-
Semantic codebook creates style-matched co-speech gestures
PersonaGest: Personalized Co-Speech Gesture Generation with Semantic-Guided Hierarchical Motion Representation
-
Audio-video models fail to keep physics consistent in transitions
Do Joint Audio-Video Generation Models Understand Physics?
-
MIST benchmark shows LLMs lag on voice IoT tasks
MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
-
Benchmark shows little progress in multimodal domain generalization
Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study
-
Neural codec with FFT encoder outperforms tokenizers on sensors
LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation
-
Contrastive and uncertainty methods improve emotion recognition
Modality-Aware Contrastive and Uncertainty-Regularized Emotion Recognition
-
Holmes applies hierarchical evidential learning to partially relevant video retrieval
Revisiting Uncertainty: On Evidential Learning for Partially Relevant Video Retrieval
-
LLM and RL coupling with VR feedback creates adaptive 3D scenes
Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling
-
Dual paths learn when to fuse or drop modalities in emotion recognition
To Fuse or to Drop? Dual-Path Learning for Resolving Modality Conflicts in Multimodal Emotion Recognition
-
0.1B omni model reaches 0.09 CER in speech-text consistency
MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model
-
Conformal loop self-calibrates multimodal models on low-quality data
Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration
-
Imitation learning splits music colors across multiple stage lights
Stage Light is Sequence$^2$: Multi-Light Control via Imitation Learning
-
Aesthetic features lift AI music preference prediction on unseen generators
APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music
-
Dual-system refines scores to boost self-supervised forgery detectors
Enhancing Self-Supervised Talking Head Forgery Detection via a Training-Free Dual-System Framework