Gated Multimodal Units for Information Fusion
This paper presents a novel model for multimodal learning based on gated neural networks. The Gated Multimodal Unit (GMU) is intended to be used as an internal unit in a neural network architecture whose purpose is to find an intermediate representation based on a combination of data from different modalities. The GMU uses multiplicative gates to learn how much each modality influences the activation of the unit. It was evaluated on multilabel movie genre classification using each movie's plot and poster. The GMU improved the macro F-score over single-modality approaches and outperformed other fusion strategies, including mixture-of-experts models. Along with this work, the MM-IMDb dataset is released, which, to the best of our knowledge, is the largest publicly available multimodal dataset for movie genre prediction.
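The abstract describes the GMU only in words. Below is a minimal pure-Python sketch of a forward pass for the two-modality case (e.g. poster features `x_v` and plot features `x_t`), assuming the bimodal formulation usually associated with this paper: each modality goes through its own tanh projection, and a sigmoid gate computed from the concatenated inputs mixes the two element-wise. Bias terms are omitted for brevity, and the names `gmu_forward`, `w_v`, `w_t`, and `w_z` are illustrative, not taken from the authors' code.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def sigmoid(a: float) -> float:
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def gmu_forward(x_v: Vector, x_t: Vector,
                w_v: Matrix, w_t: Matrix, w_z: Matrix) -> Tuple[Vector, Vector]:
    """Bimodal gated multimodal unit (sketch; biases omitted by assumption).

    h_v = tanh(W_v x_v), h_t = tanh(W_t x_t)
    z   = sigmoid(W_z [x_v; x_t])
    h   = z * h_v + (1 - z) * h_t   (element-wise)
    """
    h_v = [math.tanh(a) for a in matvec(w_v, x_v)]      # first-modality branch
    h_t = [math.tanh(a) for a in matvec(w_t, x_t)]      # second-modality branch
    z = [sigmoid(a) for a in matvec(w_z, x_v + x_t)]    # gate sees both inputs
    h = [zi * hv + (1.0 - zi) * ht for zi, hv, ht in zip(z, h_v, h_t)]
    return h, z


if __name__ == "__main__":
    x_v = [1.0, 0.0, -1.0]              # toy 3-d "poster" features
    x_t = [0.5, 2.0]                    # toy 2-d "plot" features
    w_v = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
    w_t = [[0.0, 1.0], [1.0, 0.0]]
    w_z = [[0.0] * 5, [0.0] * 5]        # zero gate weights -> z = 0.5
    h, z = gmu_forward(x_v, x_t, w_v, w_t, w_z)
    print(h, z)
```

With all-zero gate weights every gate value is 0.5, so the unit simply averages the two modality projections; larger gate weights push `z` toward 0 or 1 so that one modality dominates a given hidden unit, which is the multiplicative gating the abstract refers to. Because `h` is a convex combination of tanh outputs, each component stays in (-1, 1).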
Forward citations
Cited by 5 Pith papers
- DecepGPT: Schema-Driven Deception Detection with Multicultural Datasets and Robust Multimodal Learning
  A new 1695-sample multicultural dataset plus two modules for stable multimodal fusion and modality consistency yield state-of-the-art deception detection with cross-cultural transfer.
- Learning Multi-Relational Graph Representations for DNA Methylation-Based Biological Age Estimation
  RelAge-GNN models relationships among CpG sites via co-methylation, genomic location, and gene association graphs to estimate biological age more accurately than prior methods.
- EduGage: Methods and Dataset for Sensor-Based Momentary Assessment of Engagement in Self-Guided Video Learning
  EduGage releases a multimodal sensor dataset, collected from 16 participants, and models for estimating learner engagement in self-guided video learning, reporting an MAE of 0.81 and outperforming baselines.
- CGCMA: Conditionally-Gated Cross-Modal Attention for Event-Conditioned Asynchronous Fusion
  CGCMA separates text-conditioned grounding from lag-aware trust gating to fuse asynchronous price and web data, yielding the highest Sharpe ratio (+0.449) on a new crypto news corpus.
- Lightweight Stylistic Consistency Profiling: Robust Detection of LLM-Generated Textual Content for Multimedia Moderation
  LiSCP detects LLM-generated text via stylistic consistency profiling across paraphrased variants and reports up to 11.79% better cross-domain accuracy, plus robustness to adversarial attacks.