Scaling Diffusion Transformers to 16 Billion Parameters

Scaling Diffusion Transformers to 16 Billion Parameters · 2024 · arXiv 2407.11633

14 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · method: 1

citation-polarity summary

background: 2 · use method: 1

representative citing papers

DiT-Reward: Generative Representations for Text-to-Image Reward Modeling

cs.LG · 2026-06-22 · unverdicted · novelty 7.0

DiT-Reward converts pretrained DiT models into reward predictors that outperform HPSv3 on four benchmarks while providing 1.65x inference speedup.

From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation

cs.CV · 2026-05-09 · unverdicted · novelty 7.0

A kinematic-to-visual lifting paradigm combined with hierarchically routed control generates action-conditioned surgical videos with better faithfulness, fidelity, and efficiency.

Large Spikes in Stochastic Gradient Descent: A Large-Deviations View

cs.LG · 2026-03-10 · unverdicted · novelty 7.0

Large loss spikes in SGD are polynomially likely and serve as the dominant mechanism for escaping sharp minima toward flatter solutions in the NTK regime.

InstructMoLE: Instruction-Guided Mixture of Low-rank Experts for Multi-Conditional Image Generation

cs.CV · 2025-12-25 · unverdicted · novelty 7.0

InstructMoLE replaces per-token routing with instruction-guided global routing for mixture-of-low-rank-experts in diffusion transformers and adds an output-space orthogonality loss to improve multi-conditional image generation.

Amplifying Membership Signal Through Chained Regeneration

cs.LG · 2026-06-30 · unverdicted · novelty 6.0

MADreMIA amplifies membership inference signals by showing that memorized samples maintain higher coherence and slower degradation in chained regeneration trajectories than non-members.

WiSP: A Working-Set View of Mixture-of-Experts Serving on Extremely Low-Resource Hardware

cs.LG · 2026-06-20 · unverdicted · novelty 6.0

WiSP achieves up to 1.95x decode throughput on low-resource MoE serving by dynamically paging reused experts and using MV-WSA to allocate VRAM between experts and KV cache, with the offline policy performing well on both prefill and decode.

MoECodec: Image Compression for joint human and machine perception via Mixture-of-Experts

eess.IV · 2026-06-19 · unverdicted · novelty 6.0

MoECodec replaces FFN layers with token-wise MoE plus stable routing and GShMLP experts to support multiple downstream tasks in a single image compression model.

CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.

VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

VersaVogue unifies garment generation and virtual dressing via trait-routing attention with mixture-of-experts and an automated multi-perspective preference optimization pipeline that uses DPO without human labels.

TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models

cs.CV · 2025-02-10 · unverdicted · novelty 6.0

TripoSG generates high-fidelity 3D meshes from input images via a large-scale rectified flow transformer and hybrid-trained 3D VAE on a custom 2-million-sample dataset, claiming state-of-the-art fidelity and generalization.

GenEraser: Generalizable Video Object Removal via Balanced Text-Mask Guidance and Decoupled Locator-Preserver

cs.CV · 2026-05-28 · unverdicted · novelty 5.0

GenEraser proposes MC-MoE with bipartite text guidance, LD-CFG fusion, and a decoupled locator-preserver architecture for generalizable video object and effect removal, claiming 2.16 dB and 1.44 dB gains on ROSE and VOR-Eval benchmarks.

Sparse Mixture-of-Experts Routing in Visual Diffusion Transformers: Diagnosis, Boundary Calibration and Evolutionary Roadmap from Routing Collapse to Selective Deadlock

cs.CV · 2026-05-12 · unverdicted · novelty 5.0

Diagnoses five failure modes in Token-Choice MoE routing for visual diffusion transformers and proposes the Functional Redundancy Hypothesis to explain selective deadlock.

FocusDiT: Masking Queries in Diffusion Transformers for Fine-grained Image Generation

cs.CV · 2026-06-01 · unverdicted · novelty 4.0

FocusDiT masks non-critical query tokens before they enter the FFN in DiT models, directing capacity toward complex visual details and reporting improved text-to-image results.

Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

cs.CV · 2026-05-04 · unverdicted · novelty 4.0

Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.

citing papers explorer

Showing 4 of 4 citing papers after filters.

DiT-Reward: Generative Representations for Text-to-Image Reward Modeling

cs.LG · 2026-06-22 · unverdicted · none · ref 87

DiT-Reward converts pretrained DiT models into reward predictors that outperform HPSv3 on four benchmarks while providing 1.65x inference speedup.

Large Spikes in Stochastic Gradient Descent: A Large-Deviations View

cs.LG · 2026-03-10 · unverdicted · none · ref 19

Large loss spikes in SGD are polynomially likely and serve as the dominant mechanism for escaping sharp minima toward flatter solutions in the NTK regime.

Amplifying Membership Signal Through Chained Regeneration

cs.LG · 2026-06-30 · unverdicted · none · ref 13

MADreMIA amplifies membership inference signals by showing that memorized samples maintain higher coherence and slower degradation in chained regeneration trajectories than non-members.

WiSP: A Working-Set View of Mixture-of-Experts Serving on Extremely Low-Resource Hardware

cs.LG · 2026-06-20 · unverdicted · none · ref 4

WiSP achieves up to 1.95x decode throughput on low-resource MoE serving by dynamically paging reused experts and using MV-WSA to allocate VRAM between experts and KV cache, with the offline policy performing well on both prefill and decode.
