Multimodal Transformer for Unaligned Multimodal Language Sequences

J. Zico Kolter; Louis-Philippe Morency; Paul Pu Liang; Ruslan Salakhutdinov; Shaojie Bai; Yao-Hung Hubert Tsai

arxiv: 1906.00295 · v1 · pith:S7MUHONQnew · submitted 2019-06-01 · cs.CL

Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, Ruslan Salakhutdinov

classification cs.CL

keywords multimodal, language, crossmodal, data, sequences, across, attention, human

Human language is often multimodal, comprising a mixture of natural language, facial gestures, and acoustic behaviors. However, modeling such multimodal human language time-series data poses two major challenges: 1) inherent data non-alignment due to variable sampling rates for the sequences from each modality; and 2) long-range dependencies between elements across modalities. In this paper, we introduce the Multimodal Transformer (MulT) to generically address the above issues in an end-to-end manner without explicitly aligning the data. At the heart of our model is the directional pairwise crossmodal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapts streams from one modality to another. Comprehensive experiments on both aligned and non-aligned multimodal time-series show that our model outperforms state-of-the-art methods by a large margin. In addition, empirical analysis suggests that the proposed crossmodal attention mechanism in MulT is able to capture correlated crossmodal signals.
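
The directional crossmodal attention described in the abstract can be illustrated with a minimal sketch, assuming standard single-head scaled dot-product attention in which queries come from the target modality and keys and values come from the source modality. The function names, toy dimensions, and random projection matrices below are illustrative assumptions and are not taken from the paper; the sketch also omits positional embeddings, multiple heads, residual connections, and layer normalization. Its only purpose is to show that two sequences of different lengths can interact without being aligned first.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def crossmodal_attention(target, source, w_q, w_k, w_v):
    """Attend from `target` (queries) to `source` (keys and values).

    target: T_target x d_target, source: T_source x d_source.
    The two sequence lengths may differ; no alignment is required.
    """
    q = matmul(target, w_q)  # T_target x d_k
    k = matmul(source, w_k)  # T_source x d_k
    v = matmul(source, w_v)  # T_source x d_v
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # d_k x T_source
    scores = matmul(q, k_t)  # T_target x T_source
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights  # output is T_target x d_v


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or real features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
text = random_matrix(5, 4, rng)    # toy target stream: 5 steps, 4 features
audio = random_matrix(12, 3, rng)  # toy source stream: 12 steps, 3 features
w_q = random_matrix(4, 8, rng)     # projects target features to d_k = 8
w_k = random_matrix(3, 8, rng)     # projects source features to d_k = 8
w_v = random_matrix(3, 8, rng)     # projects source features to d_v = 8

out, weights = crossmodal_attention(text, audio, w_q, w_k, w_v)
print(len(out), len(out[0]))          # 5 8
print(len(weights), len(weights[0]))  # 5 12
```

The output keeps the length of the target stream (5 steps) while every step is a weighted mixture of all 12 source steps, which is one way to read "latently adapts streams from one modality to another". Swapping the roles of `text` and `audio` gives the opposite direction, which is why the attention is described as directional and pairwise.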

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Comparison of Fusion Techniques for Multi-Modal Human Activity Recognition on the HARMES Dataset
cs.LG · 2026-06 · conditional · novelty 6.0
Gated Multi-modal Fusion reaches 0.82 macro F1 on HARMES, beating the concatenation baseline of 0.76 by 6 points under leave-one-participant-out evaluation.

Synthetic Homes: A Multimodal Generative AI Pipeline for Residential Building Data Generation under Data Scarcity
cs.AI · 2025-09 · unverdicted · novelty 5.0
A modular multimodal generative AI framework produces synthetic residential building data from public sources, with reported overlaps exceeding 65% against a national reference dataset.

A Conflict-Aware Penalty and Statistical Loss Framework for Balancing Modalities and Enhancing Stability in Multimodal Sentiment Analysis
cs.AI · 2026-05 · unverdicted · novelty 3.0
Introduces CP and SL to balance modalities and stabilize training in MSA, reporting SOTA results on CMU-MOSI with component ablations.