Multimodal KB-VQA exhibits a primacy bias where gold passages at prompt start outperform those at the end by 16-26 points, flipping the text-only lost-in-the-middle pattern.
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
Canonical reference. 71% of citing Pith papers cite this work as background.
abstract
Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has yet to be answered: how does a model achieve extrapolation at inference time for sequences that are longer than it saw during training? We first show that extrapolation can be enabled by simply changing the position representation method, though we find that current methods do not allow for efficient extrapolation. We therefore introduce a simpler and more efficient position method, Attention with Linear Biases (ALiBi). ALiBi does not add positional embeddings to word embeddings; instead, it biases query-key attention scores with a penalty that is proportional to their distance. We show that this method trains a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048 but training 11% faster and using 11% less memory. ALiBi's inductive bias towards recency also leads it to outperform multiple strong position methods on the WikiText-103 benchmark.
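The biasing step described in the abstract can be illustrated with a minimal sketch in plain Python. It assumes causal attention, and it assumes per-head slopes that form a geometric sequence starting at 2^(-8/n) for n heads; that schedule is taken from the ALiBi paper and is not stated in the abstract. The function names (`alibi_slopes`, `alibi_bias`, `biased_attention_weights`) are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math


def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8/num_heads).

    Assumption: num_heads is a power of two. The schedule is not given
    in the abstract; it is an illustrative choice based on the paper.
    """
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Causal bias matrix for one head.

    Entry (i, j) is -slope * (i - j) for keys at or before the query
    (j <= i), and -inf for future keys, which masks them out.
    """
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_weights(scores, slope):
    """Add the linear distance penalty to raw query-key scores, then softmax.

    No positional embeddings are involved: position enters only
    through the additive bias.
    """
    n = len(scores)
    bias = alibi_bias(n, slope)
    return [
        softmax([scores[i][j] + bias[i][j] for j in range(n)])
        for i in range(n)
    ]
```

With all raw scores equal, the weights in each row decay with distance from the query, which is the recency bias the abstract refers to. Because the penalty depends only on the distance i - j, the same function applies unchanged to sequences longer than any seen in training.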
representative citing papers
SlotGCG uses Vulnerable Slot Score (VSS) to identify and target the most vulnerable prompt positions for adversarial token insertion, delivering 14% higher ASR than standard GCG and 42% higher against defenses.
SISA adds an SSM importance term inside the attention score and runs the full operation as one SDPA call on augmented Q/K vectors, reporting better LAMBADA and perfect NIAH at small scale.
CORTIS combines Fisher-information masking and orthogonal projection to enable sequential speaker unlearning in ZS-TTS without access to prior unlearned data while preserving forgetting.
A hierarchical anisometric positional encoding that combines distance-indexed in-route and depot-anchored angular cross-route components improves transformer-based solvers for vehicle routing problems over index-based alternatives.
ALiBi bias is the expectation of positional LSH-induced block masks, yielding spectral and max-norm approximation bounds that reduce long-context biased attention to randomized short-context unbiased attention.
Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
Multi-head Gaussian kernels inject temporal scale discrepancy as inductive bias to enable full-duplex talking-listening avatar generation, supported by a new decoupled VoxHear dataset and claimed SOTA naturalness.
GRAPE unifies RoPE and ALiBi as special cases of group actions on positions, providing a principled design space for positional encodings via SO(d) rotations and GL unipotent transformations.
Transformers with O(sum m^j) blocks and O(d sum m^j) parameters can exactly interpolate any finite dataset of input sequences in R^d to output sequences of lengths m^j.
Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.
Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
DPPE decouples rotation and translation in camera positional encodings for multi-view transformers to resolve late-stage training stagnation and improve generalization in novel view synthesis.
CHIAR-Former routes tokens via spectral entropy to DCT mixing or attention, yielding 35-40% FLOP savings at 400M parameters with modest perplexity increase on WikiText-103.
SMT reduces RNN training to supervised learning on memory transitions (m_t, x_{t+1}) to m_{t+1} obtained from a Transformer encoder, enabling time-parallel training with O(1) gradient paths.
Absolute position information leaks into RoPE decoder-only transformers through the causal mask's position-dependent softmax denominator and the residual stream's closed dynamics at position 0 read by sink heads.
Q-K=V projection sharing in transformers matches standard QKV performance with 50% KV cache reduction and combines with GQA/MQA for up to 96.9% reduction across vision and language tasks.
Explicitly disentangling semantic and positional streams in a Transformer encoder reveals that absolute positional representations collapse to a 2D document-structure manifold, attention heads specialize by role, and the approach improves linguistic probing performance on 49 of 65 phenomena.
Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.
PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.
Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
Theoretical analysis of continual factual knowledge acquisition shows data replay stabilizes pretrained knowledge by shifting convergence dynamics while regularization only slows forgetting, leading to the STOC method for attention-based replay selection.
EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.
GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.
citing papers explorer
FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation
FAAST performs test-time supervised adaptation by analytically deriving fast weights from examples in one forward pass, matching backprop performance with over 90% less adaptation time and up to 95% memory savings versus memory-based methods.
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.