pith. machine review for the scientific record.

arxiv: 2104.09864 · v5 · submitted 2021-04-20 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links


RoFormer: Enhanced Transformer with Rotary Position Embedding

Ahmed Murtadha, Bo Wen, Jianlin Su, Shengfeng Pan, Yu Lu, Yunfeng Liu

Pith reviewed 2026-05-10 19:30 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.LG
keywords rotary position embedding · transformer · position encoding · self-attention · relative position · language model · long text classification

The pith

Rotary position embeddings encode absolute positions through rotation matrices while folding relative position dependencies directly into the self-attention dot product.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first surveys existing ways to inject positional information into transformers. It then presents Rotary Position Embedding, which multiplies each token's query and key vectors by a rotation matrix whose angle depends on the token's absolute index. The resulting dot product between any pair of tokens therefore depends on their relative distance, not just their absolute locations. Experiments on long-text classification benchmarks show that the resulting RoFormer models outperform prior position-encoding schemes. A short theoretical section links the rotation construction to the observed decay of attention scores with distance and to compatibility with linear attention.

Core claim

Applying a position-dependent rotation matrix to the query and key vectors causes their inner product to become a function of the relative offset between the two positions; this single operation simultaneously supplies absolute positional information and explicit relative-position modulation inside the attention mechanism.
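In the paper's notation, the core claim can be written compactly (a sketch following the paper's Section 3; $W_q$ and $W_k$ are the usual query and key projections):

$$
q_m^\top k_n = (R_m W_q x_m)^\top (R_n W_k x_n) = x_m^\top W_q^\top R_{n-m} W_k x_n,
\qquad \text{since } R_m^\top R_n = R_{n-m},
$$

where $R_m$ is block-diagonal over $d/2$ planar rotations, the $i$-th block rotating by angle $m\theta_i$ with $\theta_i = 10000^{-2i/d}$. The attention logit therefore depends on the positions $m$ and $n$ only through their difference.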

What carries the argument

The rotation matrix applied to each query and key vector according to its absolute position index; since R_m^T R_n = R_{n-m}, the resulting attention score depends only on the relative distance between the two tokens, not their absolute positions.
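A minimal numerical check of that property is sketched below: it applies the rotary map to toy query and key vectors and verifies that the dot product is unchanged when both positions are shifted by the same amount. The function name, toy dimensions, and seed are illustrative, not from the paper.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply the rotary position map to a vector x (even dimension) at integer position pos."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2-D subspace
    ang = pos * theta
    out = np.empty_like(x)
    out[0::2] = x[0::2] * np.cos(ang) - x[1::2] * np.sin(ang)   # planar rotation of each pair
    out[1::2] = x[0::2] * np.sin(ang) + x[1::2] * np.cos(ang)
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# The score depends only on the relative offset (here 7), not on absolute positions.
s_near = rope(q, pos=3) @ rope(k, pos=10)
s_far  = rope(q, pos=50) @ rope(k, pos=57)
print(np.isclose(s_near, s_far))   # True (up to floating point)
```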

If this is right

  • The model can accept sequences of arbitrary length at inference time without retraining or special padding.
  • Attention scores between tokens naturally decrease as their relative distance grows.
  • Linear self-attention variants can receive relative position information without extra parameters or quadratic cost.
  • The same rotation construction yields measurable gains on multiple long-text classification benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could eliminate the need for separate learned position embeddings in many transformer variants.
  • Because the relative-distance effect arises from geometry rather than learned tables, the same matrices might support extrapolation beyond the training length.
  • The rotation idea is modular and could be inserted into attention layers of non-transformer architectures that still compute pairwise similarities.

Load-bearing premise

A single fixed set of rotation frequencies will generate useful relative-position signals for every task and model scale without any task-specific adjustment.

What would settle it

Training RoFormer and a standard absolute-position transformer on the same long-text classification dataset and finding no accuracy difference between them would show that the rotary construction adds no practical benefit.

read the original abstract

Position encoding recently has shown effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding(RoPE) to effectively leverage the positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in self-attention formulation. Notably, RoPE enables valuable properties, including the flexibility of sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping the linear self-attention with relative position encoding. Finally, we evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets. Our experiments show that it consistently overcomes its alternatives. Furthermore, we provide a theoretical analysis to explain some experimental results. RoFormer is already integrated into Huggingface: \url{https://huggingface.co/docs/transformers/model_doc/roformer}.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Rotary Position Embedding (RoPE) for Transformer models. RoPE encodes absolute positions via rotation matrices applied to query and key vectors, which by construction makes the self-attention logits depend only on relative distances. The authors highlight resulting properties including sequence-length flexibility, distance-decaying inter-token dependencies, and compatibility with linear attention. They evaluate the resulting RoFormer on long-text classification benchmarks, reporting consistent gains over prior positional encoding methods, and supply a theoretical analysis of the decay property.

Significance. If the empirical gains and properties hold under scrutiny, RoPE supplies a lightweight, parameter-free mechanism for relative positional information that avoids the length limitations of learned absolute embeddings. The explicit construction of relative dependency and the accompanying theoretical account of decay are strengths; the method's subsequent adoption in Hugging Face Transformers further indicates practical utility for long-sequence modeling.

major comments (3)
  1. [§3.2] §3.2, Eq. (5)–(7): the rotation frequencies are fixed at θ_i = 10000^{-2i/d} with no ablation on the base value, no sensitivity analysis with respect to model dimension d, and no demonstration that the same schedule remains effective when scale or task distribution changes. Because the claimed 'valuable properties' (decaying dependency, relative encoding) rest on this specific choice, the absence of robustness checks is load-bearing for the central claim that RoPE reliably incorporates explicit relative dependency.
  2. [§4] §4, Tables 1–3: all reported accuracies are single-point estimates with no error bars, no standard deviations across random seeds, and no statistical significance tests. Given the stochasticity of Transformer training, these results do not yet support the strong claim of 'consistent' outperformance over alternatives.
  3. [§3.3] §3.3: the extension to linear self-attention is asserted but the paper provides neither an explicit derivation showing how the rotation matrices compose with the linear-attention kernel nor any empirical verification on a linear-attention backbone; this weakens the generality claim.
minor comments (2)
  1. [Abstract] The abstract states that RoPE 'enables valuable properties' but the precise conditions under which the decay property holds are only derived later; a forward reference or brief statement in the abstract would improve readability.
  2. [§3] Notation for the rotation matrix R_m is introduced in §3 but the explicit block-diagonal form is not shown until an appendix; moving the compact matrix definition into the main text would aid readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our paper. We address each of the major comments below and have incorporated revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [§3.2] §3.2, Eq. (5)–(7): the rotation frequencies are fixed at θ_i = 10000^{-2i/d} with no ablation on the base value, no sensitivity analysis with respect to model dimension d, and no demonstration that the same schedule remains effective when scale or task distribution changes. Because the claimed 'valuable properties' (decaying dependency, relative encoding) rest on this specific choice, the absence of robustness checks is load-bearing for the central claim that RoPE reliably incorporates explicit relative dependency.

    Authors: We appreciate this observation. The relative encoding property derives directly from the use of rotation matrices and holds independently of the specific frequency values, as shown in our derivation in Section 3. The particular choice of θ_i follows the convention established in the original Transformer model to induce the desirable decay in dependencies with distance, which we analyze theoretically in the paper. While we did not perform extensive ablations in the original submission, we agree that robustness checks are valuable. In the revised manuscript, we have added a discussion in §3.2 on the choice of base value and included a small-scale sensitivity analysis for different θ bases on the text classification task. revision: partial

  2. Referee: [§4] §4, Tables 1–3: all reported accuracies are single-point estimates with no error bars, no standard deviations across random seeds, and no statistical significance tests. Given the stochasticity of Transformer training, these results do not yet support the strong claim of 'consistent' outperformance over alternatives.

    Authors: We acknowledge the referee's concern regarding the lack of statistical reporting. Training multiple independent runs for each configuration on long-sequence tasks is computationally intensive. Nevertheless, to address this, we have rerun the main experiments with three random seeds and report means and standard deviations in the revised Tables 1-3. We have also added a note on the statistical significance where applicable. revision: yes

  3. Referee: [§3.3] §3.3: the extension to linear self-attention is asserted but the paper provides neither an explicit derivation showing how the rotation matrices compose with the linear-attention kernel nor any empirical verification on a linear-attention backbone; this weakens the generality claim.

    Authors: Thank you for highlighting this. In the original manuscript, we briefly mentioned the compatibility but indeed omitted the detailed derivation and experiments. We have now added an explicit derivation in the appendix showing how RoPE can be integrated with linear attention mechanisms (e.g., by applying the rotations before the kernel approximation). Additionally, we have included preliminary empirical results on a linear attention variant in the revised Section 3.3 to support the claim. revision: yes
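Response 3 turns on how the rotations compose with a linear-attention kernel. A minimal sketch of one such composition is below: the rotation is applied to the feature-mapped queries and keys in the numerator, while the normalizer uses unrotated features, which mirrors the formulation the authors point to; the elu+1 feature map, toy shapes, and all names are assumptions of this sketch rather than the paper's exact setup.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Same rotary map as above: planar rotations with theta_i = base^(-2i/d).
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    ang = pos * theta
    out = np.empty_like(x)
    out[0::2] = x[0::2] * np.cos(ang) - x[1::2] * np.sin(ang)
    out[1::2] = x[0::2] * np.sin(ang) + x[1::2] * np.cos(ang)
    return out

def phi(x):
    # A common non-negative feature map for linear attention (elu(x) + 1); an assumption here.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_rope(Q, K, V):
    """Non-causal linear attention with rotations on the feature-mapped q/k in the numerator."""
    n, _ = Q.shape
    fq, fk = phi(Q), phi(K)
    rq = np.stack([rope(fq[m], m) for m in range(n)])   # R_m phi(q_m)
    rk = np.stack([rope(fk[m], m) for m in range(n)])   # R_n phi(k_n)
    num = rq @ (rk.T @ V)          # O(n * d^2): no n-by-n attention matrix is formed
    den = fq @ fk.sum(axis=0)      # normalizer left unrotated so it stays positive
    return num / den[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(128, 32)) for _ in range(3))
print(linear_attention_rope(Q, K, V).shape)   # (128, 32)
```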

Circularity Check

0 steps flagged

No significant circularity: RoPE is a novel definitional construction whose relative-position property follows from trigonometry, not from data fits or self-citations.

full rationale

The paper defines Rotary Position Embedding by applying a fixed rotation matrix (with frequencies taken from the original Transformer) to query and key vectors. The claimed incorporation of explicit relative-position dependency is a direct algebraic consequence of the rotation formulation and the dot-product attention, shown via trigonometric identities in the derivation. This is a self-contained mathematical property of the proposed ansatz, not a quantity fitted to evaluation data or derived from prior self-citations. Experiments compare RoFormer against baselines on long-text classification benchmarks, and a separate theoretical section explains observed behaviors; neither reduces the central claims to fitted parameters or load-bearing self-references. The construction is therefore independent of its own outputs.
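The trigonometric step the rationale leans on is easiest to see in the complex form of a single 2-D block (a sketch; $\tilde q$ and $\tilde k$ pack the two block coordinates into one complex number and the bar denotes complex conjugation):

$$
\langle R_m q, R_n k \rangle
= \mathrm{Re}\!\left[(e^{im\theta}\tilde q)\,\overline{(e^{in\theta}\tilde k)}\right]
= \mathrm{Re}\!\left[\tilde q\,\overline{\tilde k}\,e^{i(m-n)\theta}\right],
$$

which depends on the positions only through $m-n$; summing the same identity over the $d/2$ blocks gives the full attention logit.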

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The method rests on the standard transformer attention equations plus the choice of a fixed set of rotation frequencies (conventionally base 10000) that are set by hand rather than derived from data. No new entities are postulated.

free parameters (1)
  • rotation base frequency
    The base value (typically 10000) that sets the wavelength of each rotation dimension is chosen by hand rather than learned or derived.
axioms (2)
  • standard math Standard scaled dot-product attention remains the core operation.
    Invoked when the rotation is applied inside the existing QK^T computation.
  • domain assumption Rotation matrices preserve the inner-product structure needed for relative distance encoding.
    Used to claim that relative position dependency appears automatically.
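To make the single free parameter concrete, the snippet below prints the wavelength 2π/θ_i of the fastest and slowest rotary subspaces implied by a few base values; the head dimension and the alternative bases are illustrative, with 10000 being the conventional choice the ledger refers to.

```python
import numpy as np

def rope_wavelengths(d=64, base=10000.0):
    """Wavelength (in tokens) of each 2-D rotary subspace, theta_i = base^(-2i/d)."""
    theta = base ** (-np.arange(0, d, 2) / d)
    return 2 * np.pi / theta

for base in (100.0, 10000.0, 1000000.0):
    wl = rope_wavelengths(d=64, base=base)
    print(f"base={base:>9.0f}  shortest={wl.min():6.1f}  longest={wl.max():11.1f} tokens")
```

Only the longest wavelengths move with the base, which is why the base acts as a knob on how far apart two tokens can be before their rotary phases wrap.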

pith-pipeline@v0.9.0 · 5497 in / 1369 out tokens · 44943 ms · 2026-05-10T19:30:15.540802+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention

    stat.ML 2026-05 unverdicted novelty 8.0

    The upper-tail accumulation scale derived from the gap-counting function N_n sets the critical inverse temperature for softmax attention concentration, unifying prior conflicting laws as special cases of different N_n.

  2. RULER: What's the Real Context Size of Your Long-Context Language Models?

    cs.CL 2024-04 accept novelty 8.0

    RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

  3. Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    cs.LG 2023-12 unverdicted novelty 8.0

    Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

  4. Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

    stat.ML 2023-10 unverdicted novelty 8.0

    Score entropy loss enables discrete diffusion models (SEDD) that cut perplexity 25-75% versus prior diffusion methods and outperform GPT-2 on language modeling while supporting infilling and compute-quality tradeoffs.

  5. CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    CRePE supplies depth-aware positional distributions along curved rays for stable unified-camera control in frozen video DiT models.

  6. From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.

  7. Cosine-Gated Adam-Decay: Drop-In Staleness-Aware Outer Optimization for Decoupled DiLoCo

    cs.LG 2026-05 unverdicted novelty 7.0

    CGAD is a staleness-aware Adam variant for DiLoCo that gates gradients with cosine and exponential decay, proves a convergence bound independent of maximum delay, and demonstrates stable pretraining of 25M to 7B param...

  8. Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    Memory Inception steers LLMs via selective latent KV cache injection at chosen layers, delivering better control-drift balance than prompting or CAA on personality and reasoning tasks while reducing storage needs.

  9. Jordan-RoPE: Non-Semisimple Relative Positional Encoding via Complex Jordan Blocks

    cs.LG 2026-05 conditional novelty 7.0

    Jordan-RoPE realizes a non-semisimple relative positional operator that produces coupled oscillatory-polynomial features such as d e^{i omega d} for causal query-key lags.

  10. Graph Transformers and Stabilized Reinforcement Learning for Large-Scale Dynamic Routing Modulation and Spectrum Allocation in Elastic Optical Networks

    cs.NI 2026-05 conditional novelty 7.0

    Graph transformer RL for dynamic RMSA supports up to 13% more traffic than benchmarks on networks up to 143 nodes and 362 links.

  11. Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning

    astro-ph.GA 2026-04 unverdicted novelty 7.0

    A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.

  12. Attention Is Not All You Need for Diffraction

    cond-mat.mtrl-sci 2026-04 unverdicted novelty 7.0

    Physics-informed transformer with sin^2(theta) encoding, physics-aware positional encoding, multi-task decoder, and three-stage curriculum classifies powder diffraction into 99 extinction groups, with structured error...

  13. Video Analysis and Generation via a Semantic Progress Function

    cs.CV 2026-04 unverdicted novelty 7.0

    A Semantic Progress Function is defined as a 1D curve of cumulative semantic shifts from frame embeddings, supporting a linearization procedure that retimes video sequences for constant-rate semantic evolution.

  14. WildSplatter: Feed-forward 3D Gaussian Splatting with Appearance Control from Unconstrained Images

    cs.CV 2026-04 unverdicted novelty 7.0

    WildSplatter jointly learns 3D Gaussians and appearance embeddings from unconstrained photo collections to enable fast feed-forward reconstruction and flexible lighting control in 3D Gaussian Splatting.

  15. Masked-Token Prediction for Anomaly Detection at the Large Hadron Collider

    hep-ph 2026-04 unverdicted novelty 7.0

    The work demonstrates masked-token prediction with transformers for model-independent anomaly detection in LHC data, achieving strong results on top-rich BSM signatures like four-top production using VQ-VAE tokenization.

  16. When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence

    cs.LG 2026-04 conditional novelty 7.0

    FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.

  17. Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

    q-bio.QM 2026-04 unverdicted novelty 7.0

    Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...

  18. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  19. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    cs.LG 2024-05 unverdicted novelty 7.0

    Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

  20. When is Warmstarting Effective for Scaling Language Models?

    cs.LG 2026-05 unverdicted novelty 6.0

    A 2x growth factor in model warmstarting yields reliable training speedups for language models under 20 tokens/parameter budgets, with an empirical upper bound on effective growth factors.

  21. Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory

    cs.LG 2026-05 unverdicted novelty 6.0

    PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.

  22. SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.

  23. RT-Transformer: The Transformer Block as a Spherical State Estimator

    cs.LG 2026-05 unverdicted novelty 6.0

    Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.

  24. Sparse Layers are Critical to Scaling Looped Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.

  25. Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    Memory Inception is a training-free method that injects latent KV banks at chosen layers to steer LLMs, achieving superior control-drift balance and up to 118x storage reduction on personality and structured-reasoning tasks.

  26. Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization

    stat.ML 2026-05 unverdicted novelty 6.0

    Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-...

  27. Feature Starvation as Geometric Instability in Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global featu...

  28. Spiking Sequence Machines and Transformers

    cs.NE 2026-05 unverdicted novelty 6.0

    Spiking SDM and transformers implement identical functional operations for sequences via cosine similarity retrieval, unified by a phase-latency isomorphism between spike timing and sinusoidal positional encoding.

  29. Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.

  30. Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy a...

  31. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  32. LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

    cs.CV 2026-04 unverdicted novelty 6.0

    LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.

  33. Towards Real-Time ECG and EMG Modeling on $\mu$NPUs

    cs.LG 2026-04 unverdicted novelty 6.0

    PhysioLite delivers Transformer-comparable ECG/EMG performance using learnable wavelet filters and hardware-aware design at ~370KB quantized size on μNPUs.

  34. Shape: A Self-Supervised 3D Geometry Foundation Model for Industrial CAD Analysis

    cs.CV 2026-04 unverdicted novelty 6.0

    A 10.9M-parameter self-supervised model pretrained on 61k CAD meshes achieves R²=0.729 reconstruction and 98.1% top-1 retrieval on held-out data via masked normalized geometry reconstruction and multi-resolution contr...

  35. Parcae: Scaling Laws For Stable Looped Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...

  36. MT-OSC: Path for LLMs that Get Lost in Multi-Turn Conversation

    cs.CL 2026-04 unverdicted novelty 6.0

    MT-OSC condenses chat history via a one-off sequential process with a few-shot Condenser and lightweight Decider to reduce tokens and preserve LLM accuracy in multi-turn settings.

  37. LoMa: Local Feature Matching Revisited

    cs.CV 2026-04 unverdicted novelty 6.0

    Scaling data, model size, and compute for local feature matching produces large performance gains on challenging benchmarks and a new manually annotated HardMatch dataset.

  38. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    cs.CV 2025-07 unverdicted novelty 6.0

    GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

  39. SAM 2: Segment Anything in Images and Videos

    cs.CV 2024-08 conditional novelty 6.0

    SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation datas...

  40. Chameleon: Mixed-Modal Early-Fusion Foundation Models

    cs.CL 2024-05 unverdicted novelty 6.0

    Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...

  41. StarCoder 2 and The Stack v2: The Next Generation

    cs.SE 2024-02 accept novelty 6.0

    StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.

  42. Efficient Streaming Language Models with Attention Sinks

    cs.CL 2023-09 accept novelty 6.0

    StreamingLLM lets finite-window LLMs generalize to infinite-length sequences by retaining initial-token KV states as attention sinks, enabling stable streaming inference up to 4M tokens.

  43. YaRN: Efficient Context Window Extension of Large Language Models

    cs.CL 2023-08 unverdicted novelty 6.0

    YaRN extends the context window of RoPE-based LLMs like LLaMA more efficiently than prior methods, using 10x fewer tokens and 2.5x fewer steps while surpassing state-of-the-art performance and enabling extrapolation b...

  44. Retentive Network: A Successor to Transformer for Large Language Models

    cs.CL 2023-07 unverdicted novelty 6.0

    RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.

  45. Textbooks Are All You Need

    cs.CL 2023-06 unverdicted novelty 6.0

    A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.

  46. BloombergGPT: A Large Language Model for Finance

    cs.LG 2023-03 conditional novelty 6.0

    BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

  47. PaLM: Scaling Language Modeling with Pathways

    cs.CL 2022-04 accept novelty 6.0

    PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

  48. Mela: Test-Time Memory Consolidation based on Transformation Hypothesis

    cs.CL 2026-05 unverdicted novelty 5.0

    Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.

  49. StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

    cs.LG 2026-05 accept novelty 5.0

    Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

  50. Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model

    cs.LG 2026-04 unverdicted novelty 5.0

    Nautile-370M is a hybrid small language model using SeqCond Attention layers alternating with transformers, with a claimed proof that the spectral operator matches full self-attention expressiveness in the continuous limit.

  51. When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer

    cs.LG 2026-04 unverdicted novelty 5.0

    DyT improves validation loss 27% at 64M params/1M tokens but worsens it 19% at 118M tokens, with saturation levels predicting the sign of the effect.

  52. Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity

    cs.CL 2026-04 unverdicted novelty 5.0

    Fixed-width and decay-based attention mechanisms inspired by working memory improve Transformer grammatical accuracy and human alignment under limited training data.

  53. Sessa: Selective State Space Attention

    cs.LG 2026-04 unverdicted novelty 5.0

    Sessa integrates attention within recurrent paths to achieve power-law memory tails and flexible non-decaying selective retrieval, outperforming baselines on long-context tasks.

  54. HieraSparse: Hierarchical Semi-Structured Sparse KV Attention

    cs.DC 2026-04 unverdicted novelty 5.0

    HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...

  55. Woosh: A Sound Effects Foundation Model

    cs.SD 2026-04 accept novelty 5.0

    Woosh is a new publicly released foundation model optimized for high-quality sound effect generation from text or video, showing competitive or better results than open alternatives like Stable Audio Open.

  56. BARFI-Q: Quantum-Enhanced Block Attention Residual Fusion Framework for Multivariate Time-Series Forecasting in Atom Interferometry

    quant-ph 2026-05 unverdicted novelty 4.0

    BARFI-Q integrates patch-based embedding, dual-branch temporal modeling, hierarchical fusion, adaptive block-attention residuals, and quantum feature mapping to forecast atom interferometry time-series, outperforming ...

  57. Fall Risk and Gait Analysis in Community-Dwelling Older Adults using World-Spaced 3D Human Mesh Recovery

    cs.CV 2026-04 unverdicted novelty 4.0

    Video-based 3D mesh recovery extracts gait parameters that correlate with sensor measurements and are associated with higher fall risk in older adults.

  58. Improving Local Feature Matching by Entropy-inspired Scale Adaptability and Flow-endowed Local Consistency

    cs.CV 2026-04 unverdicted novelty 4.0

    A semi-dense image matching pipeline adds scale adaptability via score-matrix hints at the coarse stage and local flow consistency via gradient loss at the fine stage.

  59. Ministral 3

    cs.CL 2026-01 unverdicted novelty 4.0

    Ministral 3 releases 3B/8B/14B parameter-efficient language models with base, instruction, and reasoning variants derived via iterative pruning and distillation, including image understanding capabilities.

  60. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

    cs.LG 2024-03 accept novelty 4.0

    A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · cited by 67 Pith papers · 3 internal anchors
