pith. machine review for the scientific record.

archive

Every paper Pith has read. Search by title, abstract, or pith.

221 papers in cs.MM · page 1

  1. cs.SD 2026-05-14 reviewed
    SpeakerLLM turns speaker verification into natural-language reasoning

    SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning

    Ha-Jin Yu +4

  2. cs.CV 2026-05-14 reviewed
    Two-stage model fuses radar and satellite for sharper rain forecasts

    VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting

    Boyu Liu +8

  3. cs.CV 2026-05-14 reviewed
    RC metrics align object removal scores with human perception

    PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media

    Daiguo Zhou +8

  4. cs.MM 2026-05-14 reviewed
    Multi-agent system resolves multimedia claims into editable reports

    Contestable Multi-Agent Debate with Arena-based Argumentative Computation for Multimedia Verification

    Hoang-Loc Cao +5

  5. cs.CV 2026-05-14 reviewed
    Delta Forcing curbs drift in interactive video generation

    Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation

    Dongman Lee +6

  6. cs.CV 2026-05-13 reviewed
    Few channels control entire DiT image generation

    Few Channels Draw The Whole Picture: Revealing Massive Activations in Diffusion Transformers

    Davide Bucciarelli +4

  7. cs.CV 2026-05-13 reviewed
    Backbone knowledge alone fools frozen deepfake detectors

    Backbone is All You Need: Assessing Vulnerabilities of Frozen Foundation Models in Synthetic Image Forensics

    Andrea Montibeller +3

  8. cs.MA 2026-05-12 reviewed
    Synthetic dataset benchmarks AI for swim coaching

    Synthesizing the Expert: A Validated Multimodal Dataset for Trustworthy AI-Assisted Swimming Coaching

    Ahmad Al-Kabbany +1

  9. cs.MM 2026-05-12 reviewed
    3B omni model matches 30B on clean benchmarks

    Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    Che Liu +6

  10. cs.MM 2026-05-12 reviewed
    3B omni model matches 30B on debiased benchmarks

    Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    Che Liu +6

  11. cs.IR 2026-05-12 reviewed
    ZipRerank matches top multimodal rerankers at 10x lower latency

    Very Efficient Listwise Multimodal Reranking for Long Documents

    Lawrence B. Hsieh +2

  12. cs.IR 2026-05-12 reviewed
    Critic and generator agents iteratively refine research outlines

    AgentDisCo: Towards Disentanglement and Collaboration in Open-ended Deep Research Agents

    Jiarui Jin +4

  13. cs.MM 2026-05-12 reviewed
    Adaptive path choice lifts unified multimodal reasoning

    UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning

    Hayes Bai +4

  14. cs.CV 2026-05-11 reviewed
    Unified transformer generates images from raw pixels without VAEs

    HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

    Chengmin Gao +24

  15. cs.MM 2026-05-11 reviewed
    Targeted head boost cuts hallucinations in vision-language models

    Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

    Guodong Du +6

  16. cs.MM 2026-05-11 reviewed
    New benchmark links social posts to fact-check evidence for model testing

    RW-Post: Auditable Evidence-Grounded Multimodal Fact-Checking in the Wild

    Danni Xu +3

  17. cs.MM 2026-05-11 reviewed
    Benchmark shows AI models struggle with evidence in multimodal fact-checking

    RW-Post: Auditable Evidence-Grounded Multimodal Fact-Checking in the Wild

    Danni Xu +3

  18. cs.MM 2026-05-11 reviewed
    User queries alter video retrieval model behavior

    FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries

    Bohan Zeng +6

  19. eess.IV 2026-05-11 reviewed
    Tube packages stabilize video recovery faster in semantic HARQ

    Tube-Structured Incremental Semantic HARQ for Generative Video Receivers

    Runxin Zhang +2

  20. cs.CV 2026-05-10 reviewed
    Multi-scale supervision cuts pose errors in sign animation

    KAN Text to Vision? The Exploration of Kolmogorov-Arnold Networks for Multi-Scale Sequence-Based Pose Animation from Sign Language Notation

    Guanyi Du +3

  21. eess.IV 2026-05-10 reviewed
    Multi-layer CLIP similarities predict machine image preferences

    ML-CLIPSim: Multi-Layer CLIP Similarity for Machine-Oriented Image Quality

    Feng Ding +5

  22. cs.MM 2026-05-10 reviewed
    Dual pathways fix conflicts in text-video-audio intent recognition

    Mitigating Multimodal Inconsistency via Cognitive Dual-Pathway Reasoning for Intent Recognition

    Kai Gao +4

  23. cs.CV 2026-05-10 reviewed
    Invariant relations to known prototypes turn GCD into reliable pattern matching

    Relational Retrieval: Leveraging Known-Novel Interactions for Generalized Category Discovery

    Chunqi Guo +3

  24. cs.AI 2026-05-10 reviewed
    Three-agent system lifts VLM accuracy on few-shot time series tasks

    Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning

    Boxin Li +9

  25. cs.CL 2026-05-10 reviewed
    Home activity benchmark shows AI question-answering gaps

    HOME-KGQA: A Benchmark Dataset for Multimodal Knowledge Graph Question Answering on Household Daily Activities

    Aoi Ohta +7

  26. cs.GR 2026-05-10 reviewed
    Color-adaptive scheme raises 3D Gaussian streaming quality 5-20 dB

    CAGS: Color-Adaptive Volumetric Video Streaming with Dynamic 3D Gaussian Splatting

    Cong Zhang +9

  27. cs.CV 2026-05-09 reviewed
    Gaussian splatting relights VP scenes by sampling LED backgrounds directly

    Relightable Gaussian Splatting for Virtual Production Using Image-Based Illumination

    Adrian Azzarelli +3

  28. eess.IV 2026-05-09 reviewed
    Neural network adapts frame rate and resolution for better streamed graphics

    Streaming of rendered content with adaptive frame rate and resolution

    Joseph G. March +2

  29. cs.MM 2026-05-09 reviewed
    Edge offloading and pruning cut multi-condition T2I latency by 25%

    Accelerating Multi-Condition T2I Generation via Adaptive Condition Offloading and Pruning

    Chongbin Yi +4

  30. cs.CV 2026-05-09 reviewed
    Unison aligns motion, speech and sound in video generation

    Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation

    Chi Zhang +8

  31. cs.CV 2026-05-09 reviewed
    Uni-modal focus sharpens weakly supervised AVVP

    EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing

    Huilai Li +5

  32. eess.IV 2026-05-09 reviewed
    Thin clients stream interactive 3D Gaussian Splatting over HTTP/3

    Thin-Client Interactive Gaussian Adaptive Streaming over HTTP/3

    Cheng-Hsin Hsu +6

  33. cs.MM 2026-05-08 reviewed
    Anisotropic correction fixes modality gaps for unpaired training

    Anisotropic Modality Align

    Chengwei Qin +10

  34. cs.MM 2026-05-08 reviewed
    Multimedia benchmark shows access method guides terminal agent workflows

    MMTB: Evaluating Terminal Agents on Multimedia-File Tasks

    Chiyeong Heo +6

  35. cs.SD 2026-05-08 reviewed
    Decomposed stages yield better chord variety and rule compliance

    A Decomposed Retrieval-Edit-Rerank Framework for Chord Generation

    Anqi Huang +3

  36. cs.CR 2026-05-08 reviewed
    Deleted Honeywell surveillance videos remain recoverable

    Forensic analysis of video data deletion and recovery in Honeywell surveillance file system

    Jinhee Yoon +1

  37. cs.GR 2026-05-08 reviewed
    Semantic codebook creates style-matched co-speech gestures

    PersonaGest: Personalized Co-Speech Gesture Generation with Semantic-Guided Hierarchical Motion Representation

    Junchuan Zhao +2

  38. cs.SD 2026-05-08 reviewed
    Audio-video models fail to keep physics consistent in transitions

    Do Joint Audio-Video Generation Models Understand Physics?

    Chenming Ge +10

  39. cs.CL 2026-05-07 reviewed
    MIST benchmark shows LLMs lag on voice IoT tasks

    MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

    Alexandros Papangelis +5

  40. cs.CV 2026-05-07 reviewed
    Benchmark shows little progress in multimodal domain generalization

    Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study

    Eleni Chatzi +5

  41. eess.IV 2026-05-07 reviewed
    Neural codec with FFT encoder outperforms tokenizers on sensors

    LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation

    Dan Jacobellis +1

  42. cs.MM 2026-05-07 reviewed
    Contrastive and uncertainty methods improve emotion recognition

    Modality-Aware Contrastive and Uncertainty-Regularized Emotion Recognition

    Fuji Ren +4

  43. cs.CV 2026-05-07 reviewed
    The paper introduces Holmes, a hierarchical evidential learning method for retrieving…

    Revisiting Uncertainty: On Evidential Learning for Partially Relevant Video Retrieval

    Jinpeng Wang +7

  44. cs.CV 2026-05-07 reviewed
    LLM and RL coupling with VR feedback creates adaptive 3D scenes

    Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling

    Anh H. Vo +4

  45. cs.MM 2026-05-06 reviewed
    Dual paths learn when to fuse or drop modalities in emotion recognition

    To Fuse or to Drop? Dual-Path Learning for Resolving Modality Conflicts in Multimodal Emotion Recognition

    Erik Cambria +7

  46. cs.SD 2026-05-05 reviewed
    0.1B omni model reaches 0.09 CER in speech-text consistency

    MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

    Jingyao Gong

  47. cs.CV 2026-05-05 reviewed
    Conformal loop self-calibrates multimodal models on noisy data

    Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration

    Disen Hu +7

  48. cs.MM 2026-05-05 reviewed
    Imitation learning splits music colors across multiple stage lights

    Stage Light is Sequence$^2$: Multi-Light Control via Imitation Learning

    Dian Jin +3

  49. cs.SD 2026-05-05 reviewed
    Aesthetic features lift AI music preference prediction on unseen generators

    APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music

    Dorien Herremans +1

  50. cs.CV 2026-05-05 reviewed
    Dual-system refines scores to boost self-supervised forgery detectors

    Enhancing Self-Supervised Talking Head Forgery Detection via a Training-Free Dual-System Framework

    Jiwei Wei +7