MATRIX: Mask Track Alignment for Interaction-aware Video Generation
Pith reviewed 2026-05-18 08:49 UTC · model grok-4.3
The pith
Regularizing attention in specific layers of video diffusion transformers with multi-instance mask tracks improves interaction modeling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We curate MATRIX-11K, a video dataset with interaction-aware captions and multi-instance mask tracks. Using this dataset, we conduct a systematic analysis that formalizes two perspectives of video DiTs: semantic grounding, via video-to-text attention, which evaluates whether noun and verb tokens capture instances and their relations; and semantic propagation, via video-to-video attention, which assesses whether instance bindings persist across frames. We find both effects concentrate in a small subset of interaction-dominant layers. Motivated by this, we introduce MATRIX, a simple and effective regularization that aligns attention in specific layers of video DiTs with multi-instance mask tracks from the MATRIX-11K dataset, enhancing both grounding and propagation.
What carries the argument
MATRIX regularization, which aligns attention maps in interaction-dominant layers of video DiTs to multi-instance mask tracks from the MATRIX-11K dataset.
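The alignment idea can be pictured with a minimal sketch. This page does not state the form of the paper's loss, so the cross-entropy used below, the function names `alignment_loss` and `layer_regularizer`, and the flattened token layout are all assumptions made for illustration, not the authors' implementation.

```python
import math

# Illustrative only: the cross-entropy form and every name in this block are
# assumptions, not the loss or API used in the paper.

def alignment_loss(attention, mask, eps=1e-8):
    """Cross-entropy between a normalized attention map and a normalized mask.

    attention: non-negative attention weights of one text token over all video
               tokens (frames and spatial positions flattened into one list).
    mask:      binary mask track of the instance that text token refers to,
               flattened the same way.
    """
    if len(attention) != len(mask):
        raise ValueError("attention and mask must cover the same video tokens")
    attn_total = sum(attention)
    mask_total = sum(mask)
    if mask_total == 0:
        return 0.0  # instance never visible: nothing to align against
    loss = 0.0
    for a, m in zip(attention, mask):
        p = a / (attn_total + eps)  # attention as a distribution over tokens
        q = m / mask_total          # mask as a uniform distribution over the instance
        if q > 0:
            loss -= q * math.log(p + eps)
    return loss


def layer_regularizer(attn_by_layer, masks, layers):
    """Average alignment_loss over the chosen layers and all instances.

    attn_by_layer: dict mapping a layer index to a list of attention vectors,
                   one per instance (hypothetical layout).
    masks:         list of mask tracks, one per instance.
    layers:        layer indices to regularize; which layers count as
                   interaction-dominant is decided elsewhere.
    """
    terms = [
        alignment_loss(attn_by_layer[layer][i], masks[i])
        for layer in layers
        for i in range(len(masks))
    ]
    return sum(terms) / len(terms) if terms else 0.0
```

Attention that puts all of its mass inside the mask gives the smallest value, and mass that falls outside the mask is penalized. Restricting `layers` to a few indices mirrors the claim that only a small subset of layers is regularized.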
If this is right
- Generated videos show higher fidelity to specified interactions between multiple instances or subjects.
- Semantic alignment between the video output and the input text prompt increases.
- Temporal drift of objects and introduction of hallucinated elements across frames decreases.
- The gains come from changes limited to a few layers rather than full model retraining.
Where Pith is reading between the lines
- The same layer-analysis and mask-track alignment steps could be repeated on other video diffusion architectures to check if they have analogous interaction-dominant layers.
- The MATRIX-11K dataset supplies paired mask tracks that could be used to train or fine-tune models beyond the regularization approach shown here.
- If the alignment reduces cumulative drift, applying it to longer video sequences might produce more stable outputs over extended durations.
Load-bearing premise
The identified interaction-dominant layers are the correct small subset to regularize, and forcing attention alignment with mask tracks there will improve interaction modeling in a generalizable way without degrading other generation qualities.
What would settle it
Running the same generation prompts with and without MATRIX regularization on the InterGenEval protocol and finding no measurable gain in interaction fidelity scores or no reduction in measured drift and hallucination rates would falsify the claim.
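The comparison described here is a paired one: the same prompts are scored once per condition. A minimal sketch of how such a check could be run follows, assuming each prompt yields a single numeric score where higher is better. The function name `paired_permutation_pvalue` and the choice of a sign-flip permutation test are illustrative assumptions; nothing on this page says how InterGenEval scores are structured or how the authors test significance.

```python
import random

# Illustrative only: a generic paired sign-flip permutation test, not a
# procedure taken from the paper or from the InterGenEval protocol.

def paired_permutation_pvalue(baseline, treated, n_resamples=10000, seed=0):
    """Return (mean gain, one-sided p-value) for paired per-prompt scores.

    baseline: scores without the regularization, one per prompt.
    treated:  scores with the regularization, same prompts in the same order.
    The p-value estimates how often a gain at least this large would appear
    if the sign of each per-prompt difference were arbitrary.
    """
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [t - b for b, t in zip(baseline, treated)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Randomly flip the sign of each paired difference.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if sum(flipped) / len(flipped) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_resamples + 1)
```

A mean gain near zero, or a large p-value, on interaction fidelity would correspond to the falsifying outcome described above. For drift and hallucination rates, where lower is better, the two arguments would be swapped.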
Original abstract
Video DiTs have advanced video generation, yet they still struggle to model multi-instance or subject-object interactions. This raises a key question: How do these models internally represent interactions? To answer this, we curate MATRIX-11K, a video dataset with interaction-aware captions and multi-instance mask tracks. Using this dataset, we conduct a systematic analysis that formalizes two perspectives of video DiTs: semantic grounding, via video-to-text attention, which evaluates whether noun and verb tokens capture instances and their relations; and semantic propagation, via video-to-video attention, which assesses whether instance bindings persist across frames. We find both effects concentrate in a small subset of interaction-dominant layers. Motivated by this, we introduce MATRIX, a simple and effective regularization that aligns attention in specific layers of video DiTs with multi-instance mask tracks from the MATRIX-11K dataset, enhancing both grounding and propagation. We further propose InterGenEval, an evaluation protocol for interaction-aware video generation. In experiments, MATRIX improves both interaction fidelity and semantic alignment while reducing drift and hallucination. Extensive ablations validate our design choices. Codes and weights will be released.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript curates MATRIX-11K, a dataset of videos with interaction-aware captions and multi-instance mask tracks. It performs a systematic analysis of video DiTs that formalizes semantic grounding (video-to-text attention on nouns/verbs) and semantic propagation (video-to-video attention preserving instance bindings across frames), finding both concentrate in a small subset of interaction-dominant layers. Motivated by this, it introduces MATRIX, a regularization that aligns attention in those layers with the mask tracks, plus the InterGenEval protocol. Experiments and ablations claim that MATRIX improves interaction fidelity and semantic alignment while reducing drift and hallucination.
Significance. If the central claims hold, the work offers a concrete, interpretable intervention for a persistent weakness in video generation models. The MATRIX-11K dataset and InterGenEval protocol are reusable contributions. The explicit link between internal attention analysis and a targeted loss is a strength that could support more controllable generation pipelines.
major comments (1)
- [§3 (Systematic Analysis)] The identification of interaction-dominant layers rests on attention metrics for grounding and propagation. The manuscript does not report an ablation that applies the identical alignment loss to non-dominant layers and compares the resulting interaction metrics. Without this comparison, it remains unclear whether the observed gains are specific to the identified layers or arise from any attention regularization; this directly affects the motivation for the targeted intervention and the claim of improved generalizability without quality degradation.
minor comments (2)
- [Abstract] Key results (e.g., InterGenEval scores or drift/hallucination rates) are stated only qualitatively; adding one or two headline numbers would make the claims easier to assess at a glance.
- [§4 (Method)] The precise form of the alignment loss (e.g., whether it is a direct MSE on attention maps or a contrastive term) and the procedure for extracting mask tracks from MATRIX-11K should be given explicitly, ideally with a short equation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the contributions in our manuscript. We address the major comment point by point below.
Point-by-point responses
- Referee: [§3 (Systematic Analysis)] The identification of interaction-dominant layers rests on attention metrics for grounding and propagation. The manuscript does not report an ablation that applies the identical alignment loss to non-dominant layers and compares the resulting interaction metrics. Without this comparison, it remains unclear whether the observed gains are specific to the identified layers or arise from any attention regularization; this directly affects the motivation for the targeted intervention and the claim of improved generalizability without quality degradation.
Authors: We thank the referee for highlighting this point, which directly tests the specificity of our layer selection. Section 3 formalizes semantic grounding and propagation via attention metrics and shows these effects concentrate in a small subset of layers; this concentration, rather than a generic regularization effect, motivated the targeted MATRIX loss. We agree that an explicit ablation applying the identical loss to non-dominant layers would strengthen the evidence. We have conducted this experiment for the revision. Results show that alignment on non-dominant layers yields negligible gains in interaction metrics (InterGenEval) and can introduce minor quality degradation, whereas the targeted layers produce the reported improvements in fidelity, reduced drift, and semantic alignment without such side effects. These findings confirm the layer-specific nature of the intervention. We will incorporate the new ablation results, quantitative comparisons, and updated discussion into the revised manuscript.
Revision: yes
Circularity Check
No significant circularity; derivation uses externally curated dataset for empirical analysis and targeted regularization
Full rationale
The paper curates MATRIX-11K as a new external dataset of interaction-aware videos with mask tracks, then uses it for a systematic empirical analysis to identify a small subset of interaction-dominant layers via video-to-text and video-to-video attention metrics. The MATRIX regularization is then applied specifically to those layers to align attention with the mask tracks. This chain does not reduce by construction to self-definition, to fitted parameters renamed as predictions, or to load-bearing self-citation; the layer selection is an independent observation from the dataset, and improvements are validated via separate experiments and ablations rather than being forced by the inputs. The approach remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- Selection of interaction-dominant layers
- Regularization strength hyperparameter
axioms (1)
- Domain assumption: Video-to-text and video-to-video attention patterns in DiTs can be meaningfully aligned with external multi-instance mask tracks to improve semantic grounding and propagation.
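To make the two free parameters concrete, one plausible shape for the training objective is sketched below. The names SGA (Semantic Grounding Alignment) and SPA (Semantic Propagation Alignment) come from the paper passage quoted elsewhere on this page; the symbols $\mathcal{L}_{\text{diff}}$, $\lambda$, and $\mathcal{S}$, the use of a single shared weight, and the application of both terms to the same layer set are assumptions for illustration, not the paper's stated formulation.

```latex
% Assumed form: diffusion loss plus a weighted alignment term over selected layers.
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{diff}}
  + \lambda \sum_{\ell \in \mathcal{S}}
    \left( \mathcal{L}_{\text{SGA}}^{(\ell)} + \mathcal{L}_{\text{SPA}}^{(\ell)} \right)
```

Under this reading, $\mathcal{S}$ is the selected set of interaction-dominant layers (the first free parameter) and $\lambda$ is the regularization strength (the second).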
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
We find both effects concentrate in a small subset of interaction-dominant layers... introduce MATRIX, a simple and effective regularization that aligns attention in specific layers... via Semantic Grounding Alignment (SGA) loss... and Semantic Propagation Alignment (SPA) loss
- IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Layer Dominance... success gap is large and positive while the failure gap is large and negative... interaction-dominant layers
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos
CoMoVi co-generates 3D human motions and 2D videos synchronously in a single diffusion denoising loop using 3D-to-2D projection and dual-branch diffusion with 3D-2D cross attentions.