archive
Every paper Pith has read. Search by title, abstract, or pith.
221 papers in cs.MM
-
SpeakerLLM turns speaker verification into natural-language reasoning
SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning
-
Two-stage model fuses radar and satellite for sharper rain forecasts
VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting
-
RC metrics align object removal scores with human perception
PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media
-
Multi-agent system resolves multimedia claims into editable reports
Contestable Multi-Agent Debate with Arena-based Argumentative Computation for Multimedia Verification
-
Delta Forcing curbs drift in interactive video generation
Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation
-
Few channels control entire DiT image generation
Few Channels Draw The Whole Picture: Revealing Massive Activations in Diffusion Transformers
-
Backbone knowledge alone fools frozen deepfake detectors
Backbone is All You Need: Assessing Vulnerabilities of Frozen Foundation Models in Synthetic Image Forensics
-
Synthetic dataset benchmarks AI for swim coaching
Synthesizing the Expert: A Validated Multimodal Dataset for Trustworthy AI-Assisted Swimming Coaching
-
3B omni-model matches 30B on debiased benchmarks
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
-
ZipRerank matches top multimodal rerankers at 10x lower latency
Very Efficient Listwise Multimodal Reranking for Long Documents
-
Critic and generator agents iteratively refine research outlines
AgentDisCo: Towards Disentanglement and Collaboration in Open-ended Deep Research Agents
-
Adaptive path choice lifts unified multimodal reasoning
UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning
-
Unified transformer generates images from raw pixels without VAEs
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
-
Targeted head boost cuts hallucinations in vision-language models
Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination
-
New benchmark links social posts to fact-check evidence for model testing
RW-Post: Auditable Evidence-Grounded Multimodal Fact-Checking in the Wild
-
User queries alter video retrieval model behavior
FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries
-
Tube-structured packets speed and stabilize video recovery in semantic HARQ
Tube-Structured Incremental Semantic HARQ for Generative Video Receivers
-
Multi-scale supervision cuts pose errors in sign animation
KAN Text to Vision? The Exploration of Kolmogorov-Arnold Networks for Multi-Scale Sequence-Based Pose Animation from Sign Language Notation
-
Multi-layer CLIP similarities predict machine image preferences
ML-CLIPSim: Multi-Layer CLIP Similarity for Machine-Oriented Image Quality
-
Dual pathways fix conflicts in text-video-audio intent recognition
Mitigating Multimodal Inconsistency via Cognitive Dual-Pathway Reasoning for Intent Recognition
-
Invariant relations to known prototypes turn GCD into reliable pattern matching
Relational Retrieval: Leveraging Known-Novel Interactions for Generalized Category Discovery
-
Three-agent system lifts VLM accuracy on few-shot time series tasks
Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning
-
Home activity benchmark shows AI question-answering gaps
HOME-KGQA: A Benchmark Dataset for Multimodal Knowledge Graph Question Answering on Household Daily Activities
-
Color-adaptive scheme raises 3D Gaussian streaming quality 5-20 dB
CAGS: Color-Adaptive Volumetric Video Streaming with Dynamic 3D Gaussian Splatting
-
Gaussian splatting relights VP scenes by sampling LED backgrounds directly
Relightable Gaussian Splatting for Virtual Production Using Image-Based Illumination
-
Neural network adapts frame rate and resolution for better streamed graphics
Streaming of rendered content with adaptive frame rate and resolution
-
Edge offloading and pruning cut multi-condition T2I latency by 25%
Accelerating Multi-Condition T2I Generation via Adaptive Condition Offloading and Pruning
-
Unison aligns motion, speech and sound in video generation
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
-
Uni-modal focus sharpens weakly supervised AVVP
EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing
-
Thin clients stream interactive 3D Gaussian Splatting over HTTP/3
Thin-Client Interactive Gaussian Adaptive Streaming over HTTP/3
-
Anisotropic correction fixes modality gaps for unpaired training
Anisotropic Modality Align
-
Multimedia benchmark shows how file-access method guides terminal-agent workflows
MMTB: Evaluating Terminal Agents on Multimedia-File Tasks
-
Decomposed stages yield better chord variety and rules compliance
A Decomposed Retrieval-Edit-Rerank Framework for Chord Generation
-
Deleted Honeywell surveillance videos remain recoverable
Forensic analysis of video data deletion and recovery in Honeywell surveillance file system
-
Semantic codebook creates style-matched co-speech gestures
PersonaGest: Personalized Co-Speech Gesture Generation with Semantic-Guided Hierarchical Motion Representation
-
Audio-video models fail to keep physics consistent in transitions
Do Joint Audio-Video Generation Models Understand Physics?
-
MIST benchmark shows LLMs lag on voice IoT tasks
MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
-
Benchmark shows little progress in multimodal domain generalization
Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study
-
Neural codec with FFT encoder outperforms tokenizers on sensors
LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation
-
Contrastive and uncertainty methods improve emotion recognition
Modality-Aware Contrastive and Uncertainty-Regularized Emotion Recognition
-
Holmes applies hierarchical evidential learning to partially relevant video retrieval
Revisiting Uncertainty: On Evidential Learning for Partially Relevant Video Retrieval
-
LLM and RL coupling with VR feedback creates adaptive 3D scenes
Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling
-
Dual paths learn when to fuse or drop modalities in emotion recognition
To Fuse or to Drop? Dual-Path Learning for Resolving Modality Conflicts in Multimodal Emotion Recognition
-
0.1B omni model reaches 0.09 CER in speech-text consistency
MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model
-
Conformal loop self-calibrates multimodal models on low-quality data
Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration
-
Imitation learning splits music colors across multiple stage lights
Stage Light is Sequence$^2$: Multi-Light Control via Imitation Learning
-
Aesthetic features lift AI music preference prediction on unseen generators
APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music
-
Dual-system refines scores to boost self-supervised forgery detectors
Enhancing Self-Supervised Talking Head Forgery Detection via a Training-Free Dual-System Framework