AffectCodec applies block-diagonal projections in residual FSQ to explicitly allocate bits to emotion and acoustic subspaces, combined with emotion conditioning, yielding better emotion preservation at low bitrates with competitive acoustic quality.
Iemocap: Interactive emotional dyadic motion capture database.Language resources and evaluation, 42(4):335–359, 2008
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 4verdicts
UNVERDICTED 4roles
dataset 1polarities
use dataset 1representative citing papers
VISAFF is a tuning-free speaker-centered visual affective feature learning framework for emotion recognition in conversation that guides frozen VLMs to active speakers and uses reliability-guided complementation from textual and acoustic modalities to achieve competitive performance.
Controlla learns identity and attribute factors from multimodal inputs and aligns them with graph priors using graph-constrained optimal transport to enforce consistent attribute trajectories while preserving reference identity.
TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.
citing papers explorer
-
AffectCodec: Emotion-Preserving Neural Speech Codec with Block-Diagonal Residual FSQ
AffectCodec applies block-diagonal projections in residual FSQ to explicitly allocate bits to emotion and acoustic subspaces, combined with emotion conditioning, yielding better emotion preservation at low bitrates with competitive acoustic quality.
-
VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation
VISAFF is a tuning-free speaker-centered visual affective feature learning framework for emotion recognition in conversation that guides frozen VLMs to active speakers and uses reliability-guided complementation from textual and acoustic modalities to achieve competitive performance.
-
Controlla: Learning Controllability via Graph-Constrained Latent Geometry
Controlla learns identity and attribute factors from multimodal inputs and aligns them with graph priors using graph-constrained optimal transport to enforce consistent attribute trajectories while preserving reference identity.
-
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.