Training transformers with enforced Lipschitz constants

2025 · arXiv 2507.13338

4 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

DARE: Diffusion Language Model Activation Reuse for Efficient Inference

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

DARE reuses up to 87% of attention activations in diffusion LLMs through KV caching and output reuse, delivering 1.2x per-layer latency gains with average performance drops of 1.2-2.0%.

The Geometric Alignment Tax: Tokenization vs. Continuous Geometry in Scientific Foundation Models

cs.LG · 2026-04-05 · unverdicted · novelty 6.0

Discrete tokenization in scientific foundation models imposes a geometric alignment tax that distorts continuous manifolds, with continuous heads reducing distortion by up to 8.5x and exposing three failure regimes in 14 biological models.

Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives

math.OC · 2026-05-12 · unverdicted · novelty 5.0

Proximal stochastic spectral preconditioning converges for nonconvex constrained objectives under heavy-tailed noise, with a variance-reduced version achieving faster rates and a refined analysis of Muon iterations.

Rate-Distortion Optimization for Transformer Inference

cs.LG · 2026-01-29 · unverdicted · novelty 5.0

A rate-distortion framework for lossy compression of transformer representations yields substantial bitrate savings on language tasks while preserving accuracy, with observed rates aligning to derived information-theoretic bounds.

citing papers explorer

DARE: Diffusion Language Model Activation Reuse for Efficient Inference

cs.LG · 2026-05-01 · unverdicted · none · ref 21

The Geometric Alignment Tax: Tokenization vs. Continuous Geometry in Scientific Foundation Models

cs.LG · 2026-04-05 · unverdicted · none · ref 31

Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives

math.OC · 2026-05-12 · unverdicted · none · ref 48

Rate-Distortion Optimization for Transformer Inference

cs.LG · 2026-01-29 · unverdicted · none · ref 64
