Under a polynomial context-truncation sensitivity assumption, suffix-only KV cache policies require per-token memory scaling as Θ(ε^{-1/α}) to achieve distortion ε.
Formal algorithms for transformers
9 Pith papers cite this work. Polarity classification is still indexing.
years
2026 9verdicts
UNVERDICTED 9representative citing papers
Derives transformer-like dual-filter inference layers from first-principles optimal control on nonlinear discrete and linear Gaussian sequence models.
A categorical algebra for deep learning that formalizes broadcasting via axis-stride and array-broadcasted categories and supplies matching Python and TypeScript implementations.
Temporal diversity in task distribution during training increases generalization bias over memorization in transformers for in-context linear regression.
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
AIR augments activation-aware SVD compression of LLMs with an influence metric and a closed-form ALS update, claiming >18% perplexity improvement at 60% parameter retention and 90% less calibration data than SVD-LLM(W).
Media sentiment indicators from Canadian news, when added to a New Keynesian model with endogenous central-bank response, improve out-of-sample forecasts and account for part of monetary-policy propagation to output and prices.
The survey groups attention-based GNNs into three stages—graph recurrent attention networks, graph attention networks, and graph transformers—while reviewing architectures and future directions.
Lecture notes that treat statistical physics as probability theory and connect Ising models, spin glasses, and renormalization group ideas to Hopfield networks, restricted Boltzmann machines, and large language models.
citing papers explorer
-
Polynomial Context-Truncation Sensitivity in Autoregressive Language Models: Sequential Wyner-Ziv Bounds for KV Cache Compression
Under a polynomial context-truncation sensitivity assumption, suffix-only KV cache policies require per-token memory scaling as Θ(ε^{-1/α}) to achieve distortion ε.
-
Transformer-like Inference from Optimal Control
Derives transformer-like dual-filter inference layers from first-principles optimal control on nonlinear discrete and linear Gaussian sequence models.
-
Weaves, Wires, and Morphisms: Formalizing and Implementing the Algebra of Deep Learning
A categorical algebra for deep learning that formalizes broadcasting via axis-stride and array-broadcasted categories and supplies matching Python and TypeScript implementations.
-
Temporal Task Diversity: Inductive Biases Under Non-Stationarity in Synthetic Sequence Modelling
Temporal diversity in task distribution during training increases generalization bias over memorization in transformers for in-context linear regression.
-
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
-
Activation- and Influence-Aware Ranks (AIR): Function-Preserving SVD Compression for LLMs
AIR augments activation-aware SVD compression of LLMs with an influence metric and a closed-form ALS update, claiming >18% perplexity improvement at 60% parameter retention and 90% less calibration data than SVD-LLM(W).
-
Monetary Policy in the Media Spotlight: Sentiments, Signals, and Economic Impact
Media sentiment indicators from Canadian news, when added to a New Keynesian model with endogenous central-bank response, improve out-of-sample forecasts and account for part of monetary-policy propagation to output and prices.
-
Attention-based graph neural networks: a survey
The survey groups attention-based GNNs into three stages—graph recurrent attention networks, graph attention networks, and graph transformers—while reviewing architectures and future directions.
-
Lecture Notes on Statistical Physics and Neural Networks
Lecture notes that treat statistical physics as probability theory and connect Ising models, spin glasses, and renormalization group ideas to Hopfield networks, restricted Boltzmann machines, and large language models.