arXiv preprint arXiv:2110.04260

Zuo, S · 2021 · arXiv 2110.04260

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

cs.LG · 2023-09-25 · accept · novelty 6.0

DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

MEPA: Multi-Scale Representation Alignment for Visual Autoregressive Modeling with Mixture of Experts

cs.CV · 2026-07-01 · unverdicted · novelty 5.0

MEPA adds token-routed MoE and residual self-supervised feature alignment to VAR models, reporting better FID on ImageNet 256x256 with half the training epochs and fewer parameters than dense baselines.

citing papers explorer

Showing 2 of 2 citing papers.

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models · cs.LG · 2023-09-25 · accept · none · ref 167
MEPA: Multi-Scale Representation Alignment for Visual Autoregressive Modeling with Mixture of Experts · cs.CV · 2026-07-01 · unverdicted · none · ref 66
