LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
hub Mixed citations
Extending Context Window of Large Language Models via Positional Interpolation
Mixed citation behavior. Most common role is background (65%).
abstract
We present Position Interpolation (PI) that extends the context window sizes of RoPE-based pretrained LLMs such as LLaMA models to up to 32768 with minimal fine-tuning (within 1000 steps), while demonstrating strong empirical results on various tasks that require long context, including passkey retrieval, language modeling, and long document summarization from LLaMA 7B to 65B. Meanwhile, the extended model by Position Interpolation preserve quality relatively well on tasks within its original context window. To achieve this goal, Position Interpolation linearly down-scales the input position indices to match the original context window size, rather than extrapolating beyond the trained context length which may lead to catastrophically high attention scores that completely ruin the self-attention mechanism. Our theoretical study shows that the upper bound of interpolation is at least $\sim 600 \times$ smaller than that of extrapolation, further demonstrating its stability. Models extended via Position Interpolation retain its original architecture and can reuse most pre-existing optimization and infrastructure.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Autoregressive transformers exhibit measurable cognitive fatigue during extended generation, quantified by the Fatigue Index that predicts degradation (AUROC 0.95) and repetition (rho 0.94).
Tensor Cache augments sliding-window attention with an eviction-fed outer-product associative memory and a training correction to improve long-context performance under bounded memory.
ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.
Jordan-RoPE realizes a distance-modulated phase basis via non-semisimple Jordan blocks, generating features such as d e^{iωd} for relative positional encoding.
Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.
Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters that maintain stable performance at long contexts.
SHARP applies a spectrum-aware dynamic RoPE scaling schedule that promotes resolution more strongly in early denoising stages and relaxes it later, outperforming static baselines on quality metrics for remote sensing images.
GRAPE unifies RoPE and ALiBi as special cases of group actions on positions, providing a principled design space for positional encodings via SO(d) rotations and GL unipotent transformations.
Stacked causal self-attention combined with LayerNorm induces recency bias in Transformer decoders, reversing the earlier-token bias seen in attention alone.
KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency reduction.
Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.
LongRoPE extends LLM context windows to 2048k tokens via search for non-uniform positional interpolation, progressive fine-tuning from 256k, and short-context readjustment.
LPES uses per-layer scaling factors optimized by a genetic algorithm with Bézier curves to balance attention and improve long-context LLM performance by up to 11.2% on key-value retrieval.
Language models show superior memory to humans on psych experiments but can be adjusted via prompting and compaction to forget more human-like, yielding better user simulators.
SEGA adaptively scales RoPE attention components using spectral-energy guidance from the latent to improve structural coherence and fine details in high-resolution DiT synthesis.
Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
KV-Fold turns frozen transformers into stable long-context models by folding the KV cache across sequence chunks in repeated forward passes.
EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.
GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.
SCoL trains LLMs via meta-reinforcement learning to generate layer-specific update instructions that improve knowledge acquisition and retention from context streams over standard baselines.
RouteHead trains a lightweight router to dynamically select optimal LLM attention heads per query for improved attention-based document re-ranking.
SharedLLM stacks two copies of a short-context LLM so the lower one compresses context into query-aware multi-grained tokens that are injected only at the lowest layers of the upper one, enabling generalization from 8K training to 128K+ inputs.
citing papers explorer
-
PRISM: Position-encoded Regressive Inverse Spectral Model for Multilayer Thin-Film Design
PRISM is a position-encoded autoregressive transformer that solves the inverse design of multilayer thin films via spectrum prefix conditioning and cumulative-depth RoPE, reporting over 50% MAE reduction versus baselines with fewer parameters.
-
A Survey of Context Engineering for Large Language Models
The survey organizes Context Engineering into retrieval, processing, management, and integrated systems like RAG and multi-agent setups while identifying an asymmetry where LLMs handle complex inputs well but struggle with equally sophisticated long outputs.
-
Phi-4-reasoning Technical Report
A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related benchmarks.
-
Gemma 3 Technical Report
Gemma 3 introduces multimodal open models with architectural changes for efficient long context, trained via distillation and a new post-training recipe that makes the 4B version competitive with prior 27B models and the 27B version comparable to Gemini-1.5-Pro.
-
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.
-
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, and Chinese alignment while adding autonomous tool use for web, code, and image generation.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
-
A Comprehensive Overview of Large Language Models
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
- EndPrompt: Efficient Long-Context Extension via Terminal Anchoring
- A Recipe for Long-Context Reasoning in Large Language Models via On-Policy Optimization and Distillation
- Sustainable Hybrid Document-Routed Retrieval for Financial RAG: Resolving the Robustness-Precision Trade-off