Ur-funny: A multimodal language dataset for understanding humor
5 Pith papers cite this work. Polarity classification is still indexing.
polarities: background 3
representative citing papers
CDPR uses an intuition pathway for cross-modal consensus and a reasoning pathway for quantifying and mitigating inconsistencies to improve multimodal intent recognition.
CUCI-Net abstracts context-utterance dependency into an interpretation cue that combines local modality signals with global context and feeds it into the final multimodal interaction for context-conditioned predictions.
A literature survey on abstract concept recognition in videos that catalogs prior tasks and datasets while advocating for foundation models and reuse of decades of community experience.
citing papers explorer
- Toward Structural Multimodal Representations: Specialization, Selection, and Sparsification via Mixture-of-Experts
S3 decomposes multimodal data into selectable semantic experts, routes them adaptively, and sparsifies to achieve higher accuracy on MultiBench benchmarks with peak performance at intermediate sparsity levels.
- Mitigating Multimodal Inconsistency via Cognitive Dual-Pathway Reasoning for Intent Recognition
CDPR uses an intuition pathway for cross-modal consensus and a reasoning pathway for quantifying and mitigating inconsistencies to improve multimodal intent recognition.
- Beyond Isolated Utterances: Cue-Guided Interaction for Context-Dependent Conversational Multimodal Understanding
CUCI-Net abstracts context-utterance dependency into an interpretation cue that combines local modality signals with global context and feeds it into the final multimodal interaction for context-conditioned predictions.
- Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding
A literature survey on abstract concept recognition in videos that catalogs prior tasks and datasets while advocating for foundation models and reuse of decades of community experience.
- Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models