Attention Is All You Need

Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N

21 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Online Learning-to-Defer with Varying Experts

stat.ML · 2026-05-12 · unverdicted · novelty 8.0

Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

Convergent Stochastic Training of Attention and Understanding LoRA

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Attention and LoRA regression losses induce Poincaré inequalities under mild regularization, so SGD-mimicking SDEs converge to minimizers with no assumptions on data or model size.

Deep Reasoning in General Purpose Agents via Structured Meta-Cognition

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.

OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on four benchmarks.

Key-Value Means: Transformers with Expandable Block-Recurrent Compressed Memory

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

KVM is a novel block-recurrent compressed memory for attention that unifies expandable transformer context with linear RNN efficiency, enabling competitive long-context performance with released code and models.

Sinkhorn Treatment Effects: A Causal Optimal Transport Measure

stat.ML · 2026-05-08 · unverdicted · novelty 7.0

The Sinkhorn treatment effect is a new entropic optimal transport measure of divergence between counterfactual distributions that admits first- and second-order pathwise differentiability, debiased estimators, and asymptotically valid tests for distributional treatment effects.

Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits

math.OC · 2026-05-08 · unverdicted · novelty 7.0

Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.

Don't Retrain, Align: Adapting Autoregressive LMs to Diffusion LMs via Representation Alignment

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

Layer-wise representation alignment lets diffusion language models reuse semantic structures from frozen autoregressive models, accelerating training by up to 4x without architectural changes beyond the attention mask.

Steering Language Models With Activation Engineering

cs.CL · 2023-08-20 · unverdicted · novelty 7.0

Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.

Prefix-Tuning: Optimizing Continuous Prompts for Generation

cs.CL · 2021-01-01 · conditional · novelty 7.0

Prefix-tuning matches or exceeds fine-tuning on NLG tasks by optimizing a continuous prefix using 0.1% of parameters while keeping the LM frozen.

Weighted Rules under the Stable Model Semantics

cs.AI · 2026-05-10 · unverdicted · novelty 6.0

Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.

RT-Transformer: The Transformer Block as a Spherical State Estimator

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.

PnP-Corrector: A Universal Correction Framework for Coupled Spatiotemporal Forecasting

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

PnP-Corrector decouples physics simulation from error correction via a plug-and-play agent, cutting error by 29% in 300-day global ocean-atmosphere forecasts.

The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

Expanded recall in LLM agents erodes cooperative intent in multi-agent social dilemmas, observed in 18 of 28 model-game settings.

Colorful-Noise: Training-Free Low-Frequency Noise Manipulation for Color-Based Conditional Image Generation

cs.CV · 2026-05-01 · unverdicted · novelty 6.0

A training-free technique manipulates low-frequency noise in diffusion models to control image color and structure using low-frequency priors.

Muon is Scalable for LLM Training

cs.LG · 2025-02-24 · unverdicted · novelty 6.0

Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.

Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction

math.OC · 2026-05-09 · unverdicted · novelty 5.0

Rennala MVR improves time complexity over Rennala SGD for smooth nonconvex stochastic optimization in heterogeneous parallel systems under a mean-squared smoothness assumption.

Towards General Text Embeddings with Multi-stage Contrastive Learning

cs.CL · 2023-08-07 · unverdicted · novelty 5.0

GTE_base is a compact text embedding model using multi-stage contrastive learning on diverse data that outperforms OpenAI's API and 10x larger models on massive benchmarks and works for code as text.

Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models

cs.CL · 2026-05-12 · unverdicted · novelty 4.0

Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.

YEZE at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization via Heterogeneous Ensembling

cs.CL · 2026-05-07 · unverdicted · novelty 2.0

A heterogeneous ensemble of XLM-RoBERTa-large and mDeBERTa-v3-base with independent task modeling and class weighting is reported as effective for multilingual, multicultural, and multievent online polarization detection.

citing papers explorer

Showing 21 of 21 citing papers.

Online Learning-to-Defer with Varying Experts stat.ML · 2026-05-12 · unverdicted · none · ref 129
EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding cs.CV · 2026-05-11 · unverdicted · none · ref 1
Convergent Stochastic Training of Attention and Understanding LoRA cs.LG · 2026-05-08 · unverdicted · none · ref 33
Deep Reasoning in General Purpose Agents via Structured Meta-Cognition cs.CL · 2026-05-12 · unverdicted · none · ref 84
OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents cs.AI · 2026-05-11 · unverdicted · none · ref 1
Key-Value Means: Transformers with Expandable Block-Recurrent Compressed Memory cs.LG · 2026-05-11 · unverdicted · none · ref 1
Sinkhorn Treatment Effects: A Causal Optimal Transport Measure stat.ML · 2026-05-08 · unverdicted · none · ref 33
Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits math.OC · 2026-05-08 · unverdicted · none · ref 1
Don't Retrain, Align: Adapting Autoregressive LMs to Diffusion LMs via Representation Alignment cs.LG · 2026-05-07 · unverdicted · none · ref 1
Steering Language Models With Activation Engineering cs.CL · 2023-08-20 · unverdicted · none · ref 103
Prefix-Tuning: Optimizing Continuous Prompts for Generation cs.CL · 2021-01-01 · conditional · none · ref 91
Weighted Rules under the Stable Model Semantics cs.AI · 2026-05-10 · unverdicted · none · ref 103
RT-Transformer: The Transformer Block as a Spherical State Estimator cs.LG · 2026-05-10 · unverdicted · none · ref 14
PnP-Corrector: A Universal Correction Framework for Coupled Spatiotemporal Forecasting cs.AI · 2026-05-09 · unverdicted · none · ref 92
The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents cs.CL · 2026-05-08 · unverdicted · none · ref 1
Colorful-Noise: Training-Free Low-Frequency Noise Manipulation for Color-Based Conditional Image Generation cs.CV · 2026-05-01 · unverdicted · none · ref 101
Muon is Scalable for LLM Training cs.LG · 2025-02-24 · unverdicted · none · ref 33
Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction math.OC · 2026-05-09 · unverdicted · none · ref 73
Towards General Text Embeddings with Multi-stage Contrastive Learning cs.CL · 2023-08-07 · unverdicted · none · ref 60
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models cs.CL · 2026-05-12 · unverdicted · none · ref 87
YEZE at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization via Heterogeneous Ensembling cs.CL · 2026-05-07 · unverdicted · none · ref 26
