hub Mixed citations

YaRN: Efficient Context Window Extension of Large Language Models

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole · 2023 · cs.CL · arXiv 2309.00071

Mixed citation behavior. Most common role is background (62%).

92 Pith papers citing it

Background 62% of classified citations

open full Pith review browse 92 citing papers arXiv PDF

abstract

Rotary Position Embeddings (RoPE) have been shown to effectively encode positional information in transformer-based language models. However, these models fail to generalize past the sequence length they were trained on. We present YaRN (Yet another RoPE extensioN method), a compute-efficient method to extend the context window of such models, requiring 10x less tokens and 2.5x less training steps than previous methods. Using YaRN, we show that LLaMA models can effectively utilize and extrapolate to context lengths much longer than their original pre-training would allow, while also surpassing previous the state-of-the-art at context window extension. In addition, we demonstrate that YaRN exhibits the capability to extrapolate beyond the limited context of a fine-tuning dataset. Code is available at https://github.com/jquesnelle/yarn

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 11 method 4 other 1

citation-polarity summary

background 10 use method 4 unclear 2

claims ledger

abstract Rotary Position Embeddings (RoPE) have been shown to effectively encode positional information in transformer-based language models. However, these models fail to generalize past the sequence length they were trained on. We present YaRN (Yet another RoPE extensioN method), a compute-efficient method to extend the context window of such models, requiring 10x less tokens and 2.5x less training steps than previous methods. Using YaRN, we show that LLaMA models can effectively utilize and extrapolate to context lengths much longer than their original pre-training would allow, while also surpassing

co-cited works

representative citing papers

A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention

stat.ML · 2026-05-12 · unverdicted · novelty 8.0

The upper-tail accumulation scale derived from the gap-counting function N_n sets the critical inverse temperature for softmax attention concentration, unifying prior conflicting laws as special cases of different N_n.

ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning

cs.AI · 2026-07-02 · unverdicted · novelty 7.0

RECONTEXT is a recursive evidence replay technique that improves long-context reasoning in LLMs by constructing and replaying a query-conditioned evidence pool before final generation.

Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

Prefilling-dLLM partitions prefixes into chunks, caches KV representations, and applies sparse top-K selection during decoding to cut dLLM inference complexity to quadratic in decode length only.

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

LongLive-RAG formulates long video generation as retrieval-augmented generation by treating self-generated latents as a dynamic searchable history and adding a Window Temporal Delta Loss for better retrieval.

Cognitive Fatigue in Autoregressive Transformers: Formalization and Measurement

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Autoregressive transformers exhibit measurable cognitive fatigue during extended generation, quantified by the Fatigue Index that predicts degradation (AUROC 0.95) and repetition (rho 0.94).

VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

VINS-120K supplies the first large-scale set of instruction-image-edited-image triplets at ultra-high resolution together with an adaptation strategy that improves detail synthesis.

HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.

Jordan-RoPE: Non-Semisimple Relative Positional Encoding via Complex Jordan Blocks

cs.LG · 2026-05-05 · unverdicted · novelty 7.0

Jordan-RoPE realizes a distance-modulated phase basis via non-semisimple Jordan blocks, generating features such as d e^{iωd} for relative positional encoding.

Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation

cs.DC · 2026-04-24 · unverdicted · novelty 7.0

GVR uses previous-step Top-K predictions, pre-indexed stats, secant counting, and shared-memory verification to deliver 1.88x average speedup over radix-select while preserving bit-exact Top-K on DeepSeek-V3.2 workloads.

Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

q-bio.QM · 2026-04-09 · unverdicted · novelty 7.0

Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.

TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

cs.CL · 2026-04-06 · unverdicted · novelty 7.0

TriAttention compresses KV cache by exploiting stable pre-RoPE Q/K concentration and trigonometric distance preferences to match full-attention reasoning accuracy with far lower memory and higher speed.

SHARP: Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion in Remote Sensing Synthesis

cs.CV · 2026-03-23 · conditional · novelty 7.0

SHARP applies a spectrum-aware dynamic RoPE scaling schedule that promotes resolution more strongly in early denoising stages and relaxes it later, outperforming static baselines on quality metrics for remote sensing images.

Group Representational Position Encoding

cs.LG · 2025-12-08 · unverdicted · novelty 7.0

GRAPE unifies RoPE and ALiBi as special cases of group actions on positions, providing a principled design space for positional encodings via SO(d) rotations and GL unipotent transformations.

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

cs.LG · 2024-07-11 · accept · novelty 7.0

FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

cs.CL · 2024-04-10 · conditional · novelty 7.0

Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

cs.CL · 2024-02-21 · unverdicted · novelty 7.0

LongRoPE extends LLM context windows to 2048k tokens via search for non-uniform positional interpolation, progressive fine-tuning from 256k, and short-context readjustment.

Hierarchical Global Attention (HGA)

cs.LG · 2026-06-29 · unverdicted · novelty 6.0

HGA uses RoPE-aware chunk summaries for two-level hierarchical routing to approximate dense causal attention at 3% sparsity with 0.01-0.02 nats quality gap, as a drop-in replacement requiring no retraining.

End-to-End Context Compression at Scale

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

LCLMs are scaled 0.6B-encoder 4B-decoder compressors pre-trained on over 350B tokens that improve the Pareto frontier for general-task performance, compression speed, and peak memory in long-context language model inference.

PJ-RoPE: A Fourier-Jet-Affine Position Space for Relative Attention

cs.LG · 2026-06-03 · unverdicted · novelty 6.0

PJ-RoPE organizes relative-position mechanisms as a learnable Fourier-Jet-Affine space derived from lag-shift dynamics, extending RoPE and ALiBi with explicit jets and sector selection.

AdaCodec: A Predictive Visual Code for Video MLLMs

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

AdaCodec introduces a predictive visual code that cuts visual token use in video MLLMs by sending full frames only on high predictive cost and otherwise encoding inter-frame changes as P-tokens, yielding better benchmark scores at lower budgets.

Give it Space! Explicit Disentangling of Positional and Semantic Representations in Encoders

cs.CL · 2026-05-28 · unverdicted · novelty 6.0

Explicitly disentangling semantic and positional streams in a Transformer encoder reveals that absolute positional representations collapse to a 2D document-structure manifold, attention heads specialize by role, and the approach improves linguistic probing performance on 49 of 65 phenomena.

Simulating Human Memory with Language Models

cs.CL · 2026-05-25 · unverdicted · novelty 6.0

Language models show superior memory to humans on psych experiments but can be adjusted via prompting and compaction to forget more human-like, yielding better user simulators.

SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

SEGA adaptively scales RoPE attention components using spectral-energy guidance from the latent to improve structural coherence and fine details in high-resolution DiT synthesis.

citing papers explorer

Showing 18 of 18 citing papers after filters.

Jordan-RoPE: Non-Semisimple Relative Positional Encoding via Complex Jordan Blocks cs.LG · 2026-05-05 · unverdicted · none · ref 19 · internal anchor
Jordan-RoPE realizes a distance-modulated phase basis via non-semisimple Jordan blocks, generating features such as d e^{iωd} for relative positional encoding.
Group Representational Position Encoding cs.LG · 2025-12-08 · unverdicted · none · ref 15 · internal anchor
GRAPE unifies RoPE and ALiBi as special cases of group actions on positions, providing a principled design space for positional encodings via SO(d) rotations and GL unipotent transformations.
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision cs.LG · 2024-07-11 · accept · none · ref 43 · internal anchor
FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.
Hierarchical Global Attention (HGA) cs.LG · 2026-06-29 · unverdicted · none · ref 14 · internal anchor
HGA uses RoPE-aware chunk summaries for two-level hierarchical routing to approximate dense causal attention at 3% sparsity with 0.01-0.02 nats quality gap, as a drop-in replacement requiring no retraining.
PJ-RoPE: A Fourier-Jet-Affine Position Space for Relative Attention cs.LG · 2026-06-03 · unverdicted · none · ref 7 · internal anchor
PJ-RoPE organizes relative-position mechanisms as a learnable Fourier-Jet-Affine space derived from lag-shift dynamics, extending RoPE and ALiBi with explicit jets and sector selection.
Less Data, Faster Training: repeating smaller datasets speeds up learning via sampling biases cs.LG · 2026-05-19 · unverdicted · none · ref 33 · internal anchor
Repeating smaller datasets speeds up training via sampling biases that enable appropriate layer-wise growth, leading to compute savings over larger datasets across tasks and architectures.
Remember to Forget: Gated Adaptive Positional Encoding cs.LG · 2026-05-11 · unverdicted · none · ref 21 · internal anchor
GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.
Democratizing Tool Learning with Environments Fully Simulated by a Free 8B Language Model cs.LG · 2026-04-20 · unverdicted · none · ref 34 · internal anchor
TRUSTEE uses an 8B LM to simulate complete dynamic environments for RL-based tool learning and outperforms baselines that require extra external resources.
In-Place Test-Time Training cs.LG · 2026-04-07 · conditional · none · ref 43 · internal anchor
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
Short window attention enables long-term memorization cs.LG · 2025-09-29 · unverdicted · none · ref 27 · internal anchor
Short sliding windows in hybrid attention-xLSTM models boost long-context performance by encouraging long-term memory use, and stochastic window sizing improves both short and long tasks.
Muon is Scalable for LLM Training cs.LG · 2025-02-24 · unverdicted · none · ref 11 · internal anchor
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
DataComp-LM: In search of the next generation of training sets for language models cs.LG · 2024-06-17 · unverdicted · none · ref 143 · internal anchor
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
HSAP: A Hierarchical Sequence-aware Parallelism for Hybrid-Context Generative Models cs.LG · 2026-06-29 · unverdicted · none · ref 13 · 2 links · internal anchor
HSAP introduces a hierarchical framework and sequence-aware algorithm with JIT-optimized NCCL communication to enable correct causal attention computation on hybrid-context packed sequences without limiting parallelism.
Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning cs.LG · 2026-06-20 · unverdicted · none · ref 149 · internal anchor
Survey mapping RL techniques onto LLM training and highlighting gaps in value-based, off-policy, and bootstrapping methods.
Energy-Gated Attention and Wavelet Positional Encoding: Complementary Inductive Biases for Transformer Attention cs.LG · 2026-05-25 · unverdicted · none · ref 14 · internal anchor
EGA and MoPE together yield a 0.119 validation loss improvement on TinyShakespeare that exceeds the sum of their individual effects, indicating complementary inductive biases for salience and locality.
VIP-COP: Context Optimization for Tabular Foundation Models cs.LG · 2026-05-13 · unverdicted · none · ref 28 · internal anchor
VIP-COP is a black-box method that optimizes context for tabular foundation models by ranking and selecting high-value samples and features via online KernelSHAP regression, outperforming baselines on large high-dimensional data.
Kimi K2: Open Agentic Intelligence cs.LG · 2025-07-28 · unverdicted · none · ref 59 · internal anchor
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench cs.LG · 2026-01-28 · unreviewed · ref 15 · internal anchor

YaRN: Efficient Context Window Extension of Large Language Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer