Below, we detail the dataset composition, hardware configuration, and hyperparameters utilized to train the models

A Training Details To train the Orthrus dual-view architecture, we employ a highly optimized distillation pipeline that isolates the diffusion head while keeping the autoregressive (AR) backbone strictly frozen · 2025

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

browse 1 citing papers

representative citing papers

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Orthrus unifies autoregressive LLMs and diffusion models via shared KV cache and consensus to enable up to 7.8x parallel token generation speedup with O(1) memory overhead and lossless results.

citing papers explorer

Showing 1 of 1 citing paper.

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion cs.LG · 2026-05-12 · unverdicted · none · ref 25
Orthrus unifies autoregressive LLMs and diffusion models via shared KV cache and consensus to enable up to 7.8x parallel token generation speedup with O(1) memory overhead and lossless results.

Below, we detail the dataset composition, hardware configuration, and hyperparameters utilized to train the models

fields

years

verdicts

representative citing papers

citing papers explorer