Gated Multimodal Units for Information Fusion

John Arevalo, Thamar Solorio, Manuel Montes-y-Gómez, Fabio A. González

classification stat.ML cs.LG

keywords multimodal gated unit dataset fusion genre modalities model

This paper presents a novel model for multimodal learning based on gated neural networks. The Gated Multimodal Unit (GMU) model is intended to be used as an internal unit in a neural network architecture whose purpose is to find an intermediate representation based on a combination of data from different modalities. The GMU uses multiplicative gates to learn how each modality influences the activation of the unit. It was evaluated in a multilabel scenario for genre classification of movies using the plot and the poster. The GMU improved on the macro F-score of single-modality approaches and outperformed other fusion strategies, including mixture-of-experts models. Along with this work, the MM-IMDb dataset is released, which, to the best of our knowledge, is the largest publicly available multimodal dataset for genre prediction on movies.
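
The abstract describes the gating idea only in words. As a rough illustration, here is a minimal NumPy sketch of a two-modality gated unit. It assumes tanh features per modality and a sigmoid gate computed from the concatenated inputs. The function name `gmu_forward`, the weight names `W_v`, `W_t`, `W_z`, the dimensions, and the random weights are all illustrative assumptions, not values taken from this page.

```python
import numpy as np


def sigmoid(a):
    """Logistic function, maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))


def gmu_forward(x_v, x_t, W_v, W_t, W_z):
    """Fuse a visual vector x_v and a text vector x_t with a multiplicative gate.

    Assumed form (a sketch, not a confirmed reference implementation):
      h_v = tanh(W_v x_v), h_t = tanh(W_t x_t)
      z   = sigmoid(W_z [x_v; x_t])
      h   = z * h_v + (1 - z) * h_t
    """
    h_v = np.tanh(W_v @ x_v)                        # visual hidden features
    h_t = np.tanh(W_t @ x_t)                        # text hidden features
    z = sigmoid(W_z @ np.concatenate([x_v, x_t]))   # per-unit gate in (0, 1)
    h = z * h_v + (1.0 - z) * h_t                   # gated combination
    return h, z


# Hypothetical sizes and random weights, for demonstration only.
rng = np.random.default_rng(0)
d_v, d_t, d_h = 8, 6, 4
x_v = rng.normal(size=d_v)
x_t = rng.normal(size=d_t)
W_v = rng.normal(size=(d_h, d_v))
W_t = rng.normal(size=(d_h, d_t))
W_z = rng.normal(size=(d_h, d_v + d_t))

h, z = gmu_forward(x_v, x_t, W_v, W_t, W_z)
```

Because each gate value lies strictly between 0 and 1, every output unit is a convex combination of the two modality features, so a gate near 1 lets the visual feature dominate that unit and a gate near 0 lets the text feature dominate.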

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DecepGPT: Schema-Driven Deception Detection with Multicultural Datasets and Robust Multimodal Learning
cs.CV 2026-03 unverdicted novelty 7.0

A new 1695-sample multicultural dataset plus two modules for stable multimodal fusion and modality consistency yield state-of-the-art deception detection with cross-cultural transfer.

Learning Multi-Relational Graph Representations for DNA Methylation-Based Biological Age Estimation
cs.LG 2026-05 unverdicted novelty 6.0

RelAge-GNN models relationships among CpG sites via co-methylation, genomic location, and gene association graphs to estimate biological age more accurately than prior methods.

EduGage: Methods and Dataset for Sensor-Based Momentary Assessment of Engagement in Self-Guided Video Learning
cs.HC 2026-05 unverdicted novelty 6.0

EduGage releases a multimodal sensor dataset and models for estimating learner engagement in self-guided video learning, reporting MAE of 0.81 and outperforming baselines with 16 participants.

CGCMA: Conditionally-Gated Cross-Modal Attention for Event-Conditioned Asynchronous Fusion
cs.LG 2026-04 unverdicted novelty 6.0

CGCMA separates text-conditioned grounding from lag-aware trust gating to fuse asynchronous price and web data, yielding the highest Sharpe ratio of +0.449 on a new crypto news corpus.

Lightweight Stylistic Consistency Profiling: Robust Detection of LLM-Generated Textual Content for Multimedia Moderation
cs.CL 2026-05 unverdicted novelty 4.0

LiSCP detects LLM-generated text via stylistic consistency profiling across paraphrased variants and reports up to 11.79% better cross-domain accuracy plus robustness to adversarial attacks.