
arxiv: 2307.08691 · v1 · submitted 2023-07-17 · 💻 cs.LG

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Pith reviewed 2026-05-11 02:34 UTC · model grok-4.3

classification 💻 cs.LG
keywords attention mechanism · GPU optimization · transformers · FlashAttention · work partitioning · parallelism · language modeling · matrix multiplication

The pith

FlashAttention-2 speeds up transformer attention by about 2 times through better partitioning of work across GPU thread blocks and warps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that attention remains the main slowdown when scaling transformers to longer sequences because its quadratic cost in sequence length dominates runtime and memory. It shows that the original FlashAttention still wastes GPU capacity through low occupancy and extra shared-memory traffic caused by how work is split among thread blocks and warps. By changing the partitioning in three concrete ways, the new version cuts unnecessary operations and raises hardware utilization without any approximation or loss of correctness. A sympathetic reader cares because faster exact attention makes training and inference on longer contexts practical on existing hardware, directly affecting language modeling, image understanding, and generation tasks.

Core claim

FlashAttention-2 reduces non-matrix-multiplication FLOPs, parallelizes the attention computation for even a single head across multiple thread blocks to raise occupancy, and redistributes work inside each block across warps to cut shared-memory reads and writes. These changes produce roughly 2 times speedup over FlashAttention, lifting performance from 25-40 percent to 50-73 percent of the A100's theoretical peak FLOPs per second and delivering end-to-end training throughput up to 225 TFLOPs per second per GPU with 72 percent model FLOPs utilization on GPT-style models.
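The first of the three changes can be made concrete with a small sketch. The summary above does not say how the non-matmul FLOPs are reduced, so the code below assumes one plausible mechanism: keep the output accumulator unnormalized while streaming over key/value blocks and divide by the softmax denominator once at the end, rather than renormalizing after every block. The function names, the `block_size` parameter, and the pure-Python single-row setting are all illustrative assumptions, not the paper's kernel.

```python
import math


def attention_row_reference(q, keys, values):
    """Plain softmax attention for one query row (all scores held at once)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_row_blockwise(q, keys, values, block_size=2):
    """Blockwise online softmax with a single normalization after the loop.

    Assumption for illustration: deferring the division is one way to cut
    non-matmul work; the summarized paper may differ in detail.
    """
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized output accumulator

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]

        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # rescales earlier blocks
        weights = [math.exp(s - new_max) for s in scores]

        running_sum = running_sum * scale + sum(weights)
        acc = [a * scale + sum(w * v[d] for w, v in zip(weights, v_blk))
               for d, a in enumerate(acc)]
        running_max = new_max

    # One division per output element, instead of one per block.
    return [a / running_sum for a in acc]
```

In real arithmetic both functions compute the same row of softmax(qKᵀ)V; in floating point they can differ by rounding because the reductions happen in a different order.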

What carries the argument

Repartitioning scheme that parallelizes single-head attention across thread blocks and distributes sub-tasks between warps to reduce shared-memory traffic and non-matmul operations.
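A back-of-the-envelope sketch shows why splitting a single head across thread blocks can matter for occupancy. It assumes a baseline grid with one thread block per (batch element, head) and a hypothetical alternative that also splits each head's query rows into chunks of `rows_per_block`; both grid shapes and the chunk size are illustrative assumptions, since the summary does not state them. The only hardware fact used is that an A100 has 108 streaming multiprocessors (SMs).

```python
import math

A100_NUM_SMS = 108  # streaming multiprocessors on an A100 GPU


def blocks_per_head_grid(batch, heads):
    """Hypothetical grid with one thread block per (batch element, head)."""
    return batch * heads


def blocks_seq_split_grid(batch, heads, seq_len, rows_per_block):
    """Hypothetical grid that also splits each head along the sequence."""
    return batch * heads * math.ceil(seq_len / rows_per_block)


def fraction_of_sms_busy(num_blocks, num_sms=A100_NUM_SMS):
    """Upper bound on the share of SMs that can hold at least one block."""
    return min(1.0, num_blocks / num_sms)
```

With batch size 1, 16 heads, and a sequence length of 8192, the first grid launches 16 blocks, enough to occupy at most about 15 percent of the SMs, while the second with `rows_per_block=128` launches 1024 blocks. Long sequences tend to force small batches, which is the regime where this difference is largest.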

If this is right

  • Attention layers approach the efficiency of optimized matrix multiplications on the same hardware.
  • End-to-end training of GPT-style models reaches up to 225 TFLOPs per second per A100 GPU.
  • Longer sequence lengths become feasible without quadratic memory growth and without approximating attention.
  • GPU occupancy increases and unnecessary memory traffic drops for the attention kernel.
  • The same attention implementation can be used for both training and inference at higher throughput.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same partitioning ideas could apply to other memory-bound kernels that mix matrix multiplies with reductions.
  • Further gains might appear when the method is combined with sequence parallelism or different GPU architectures.
  • Longer-context applications in audio, video, and code generation become more accessible on current hardware.
  • The gap between attention and GEMM efficiency narrows, suggesting attention need not remain the dominant bottleneck.

Load-bearing premise

The assumption that low occupancy and extra shared-memory traffic are the main remaining bottlenecks and that the three partitioning changes will deliver the measured speedups on target GPUs without hidden numerical or correctness costs.

What would settle it

Running the same attention benchmarks and end-to-end GPT training on A100 hardware and observing less than 1.5 times speedup over FlashAttention or model FLOPs utilization below 60 percent.
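The thresholds in this test are simple arithmetic, sketched below. The peak of 312 TFLOPs/s is the A100's dense FP16/BF16 tensor-core figure, against which 225 TFLOPs/s works out to roughly 72 percent; the function names and the pass/fail helper are illustrative assumptions.

```python
A100_PEAK_TFLOPS = 312.0  # dense FP16/BF16 tensor-core peak of an A100


def model_flops_utilization(achieved_tflops, peak_tflops=A100_PEAK_TFLOPS):
    """Fraction of the theoretical peak that a run actually achieves."""
    return achieved_tflops / peak_tflops


def claim_survives(speedup, utilization,
                   min_speedup=1.5, min_utilization=0.60):
    """True if a replication clears both thresholds named above."""
    return speedup >= min_speedup and utilization >= min_utilization
```

For example, `model_flops_utilization(225.0)` is about 0.72, so a replication that also measures a 2 times speedup clears both bars, while one that measures 1.4 times does not, whatever its utilization.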

Original abstract

Scaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, and video generation. The attention layer is the main bottleneck in scaling to longer sequences, as its runtime and memory increase quadratically in the sequence length. FlashAttention exploits the asymmetric GPU memory hierarchy to bring significant memory saving (linear instead of quadratic) and runtime speedup (2-4× compared to optimized baselines), with no approximation. However, FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40% of the theoretical maximum FLOPs/s. We observe that the inefficiency is due to suboptimal work partitioning between different thread blocks and warps on the GPU, causing either low-occupancy or unnecessary shared memory reads/writes. We propose FlashAttention-2, with better work partitioning to address these issues. In particular, we (1) tweak the algorithm to reduce the number of non-matmul FLOPs (2) parallelize the attention computation, even for a single head, across different thread blocks to increase occupancy, and (3) within each thread block, distribute the work between warps to reduce communication through shared memory. These yield around 2× speedup compared to FlashAttention, reaching 50-73% of the theoretical maximum FLOPs/s on A100 and getting close to the efficiency of GEMM operations. We empirically validate that when used end-to-end to train GPT-style models, FlashAttention-2 reaches training speed of up to 225 TFLOPs/s per A100 GPU (72% model FLOPs utilization).
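The "linear instead of quadratic" memory saving mentioned in the abstract can be illustrated with a short calculation. The sketch assumes a standard implementation stores the full score matrix in 16-bit precision, and that a memory-efficient one keeps only a single 32-bit statistic per query row; both storage choices and the function names are assumptions made for illustration.

```python
def score_matrix_bytes(seq_len, bytes_per_element=2):
    """Memory for a full seq_len x seq_len score matrix, per head."""
    return seq_len * seq_len * bytes_per_element


def row_statistics_bytes(seq_len, bytes_per_element=4):
    """Memory for one softmax statistic per query row, per head."""
    return seq_len * bytes_per_element
```

At a sequence length of 8192 the score matrix takes 134,217,728 bytes (128 MiB) per head, against 32,768 bytes (32 KiB) for the per-row statistics. Doubling the sequence length quadruples the former and only doubles the latter.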

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces FlashAttention-2, which improves upon FlashAttention by reducing non-matmul FLOPs, parallelizing attention computation (including single-head cases) across thread blocks to raise occupancy, and redistributing warp-level work to cut shared-memory traffic. These changes are claimed to deliver ~2× kernel speedup over FlashAttention while remaining mathematically equivalent (no approximation), reaching 50-73% of A100 theoretical peak FLOPs/s and up to 225 TFLOPs/s (72% MFU) in end-to-end GPT-style training.

Significance. If the performance numbers and equivalence hold, the result would meaningfully advance practical long-context training by bringing attention kernels closer to GEMM efficiency on current hardware. Credit is due for the direct wall-clock and FLOPs measurements on A100, the end-to-end training runs, and the parameter-free algorithmic modifications that avoid fitted constants or self-referential definitions.

major comments (1)
  1. [§3.3] §3.3 (parallel block-level softmax): the manuscript describes combining partial online-softmax statistics (max and sum) across thread blocks when a single head is split, which necessarily changes the order of floating-point reductions relative to the original per-block schedule. No side-by-side tensor-equality tests, max-abs-diff bounds, or end-to-end loss/gradient-norm comparisons between FlashAttention and FlashAttention-2 outputs are reported; this verification is load-bearing for the central “no approximation” claim.
minor comments (2)
  1. [Figure 3] Figure 3 and §4.1: the occupancy and shared-memory traffic diagrams would benefit from explicit annotation of the warp-to-block mapping and the exact reduction tree used for cross-block statistics.
  2. [Table 2] Table 2: the reported fractions of theoretical peak throughput are given as a range (50-73%); adding per-configuration raw TFLOPs/s numbers alongside the percentages would improve reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive recommendation of minor revision and for highlighting the importance of explicit numerical verification. We address the major comment point-by-point below.

Point-by-point responses
  1. Referee: [§3.3] §3.3 (parallel block-level softmax): the manuscript describes combining partial online-softmax statistics (max and sum) across thread blocks when a single head is split, which necessarily changes the order of floating-point reductions relative to the original per-block schedule. No side-by-side tensor-equality tests, max-abs-diff bounds, or end-to-end loss/gradient-norm comparisons between FlashAttention and FlashAttention-2 outputs are reported; this verification is load-bearing for the central “no approximation” claim.

    Authors: We agree that the manuscript does not report explicit side-by-side numerical checks, and this is a fair observation. The block-parallel softmax uses the online-softmax merge rule (max and rescaled sum) that is mathematically exact in real arithmetic, as established in the original FlashAttention work; the only difference is the order of floating-point reductions. In practice the resulting discrepancy is on the order of machine epsilon scaled by the magnitude of the values. We will add the requested verification in the revision: (1) direct tensor comparisons for sequence lengths 512–4096 and head dimensions 64–128, reporting max-abs-diff < 1e-5 in FP32; (2) end-to-end GPT training runs confirming that loss curves and gradient norms match within FP tolerance. These results will appear in §3.3 or a new appendix. The “no approximation” claim remains unchanged because the algorithm performs the identical mathematical operations, only reordered.

    revision: yes
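The merge rule that the referee and the rebuttal discuss, together with the kind of max-abs-diff check being requested, can be sketched in a few lines. The code works on a single query row in pure Python (double precision), so it illustrates the algebra rather than GPU behavior; all function names are hypothetical and nothing here is taken from the paper's implementation.

```python
import math


def partial_attention(q, keys, values):
    """Return (row max, sum of exps, unnormalized output) for one slice."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, sum(weights), out


def merge(left, right):
    """Combine two partial results by rescaling both to a shared max."""
    m1, s1, o1 = left
    m2, s2, o2 = right
    m = max(m1, m2)
    a1, a2 = math.exp(m1 - m), math.exp(m2 - m)
    return m, s1 * a1 + s2 * a2, [x * a1 + y * a2 for x, y in zip(o1, o2)]


def finalize(stats):
    """Normalize the accumulated output by the softmax denominator."""
    _, total, out = stats
    return [x / total for x in out]


def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))
```

A check in the spirit of the referee's request would compute `finalize(partial_attention(q, keys, values))` as the reference, compute `finalize(merge(...))` over two disjoint slices of the same keys and values, and report `max_abs_diff` between them. The two agree exactly in real arithmetic; any nonzero difference comes from the changed order of floating-point operations.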

Circularity Check

0 steps flagged

Minor self-citation to FlashAttention baseline; speedup claims rest on independent hardware measurements

Full rationale

The paper's chain proceeds from observed GPU occupancy and shared-memory traffic issues in the prior FlashAttention algorithm, through three explicit partitioning modifications (non-matmul reduction, block-level head parallelism, warp-level distribution), to direct empirical timing and TFLOPs/s measurements on A100 hardware. These measurements are external benchmarks, not outputs of any fitted model or self-referential definition. The only self-citation is to the original FlashAttention work for the baseline description; it is not load-bearing for the new claims, which are independently specified and validated. No equations reduce by construction to their inputs, no parameters are fitted then renamed as predictions, and no uniqueness theorem or ansatz is smuggled via self-citation. The skeptic note on reduction order affects numerical verification but does not create a circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard domain assumptions about GPU memory hierarchy and thread scheduling rather than any new fitted constants or invented entities.

axioms (1)
  • domain assumption: GPU memory hierarchy is asymmetric, with fast shared memory per block and slower global memory, and thread occupancy and shared-memory traffic are the dominant remaining bottlenecks after FlashAttention.
    Invoked to justify the three partitioning changes; stated in the abstract as the observed cause of inefficiency.



Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

    cs.CV 2026-05 accept novelty 8.0

    DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.

  2. Efficient Training on Multiple Consumer GPUs with RoundPipe

    cs.DC 2026-04 conditional novelty 8.0

    RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...

  3. NPU Design for Diffusion Language Model Inference

    cs.AR 2026-01 unverdicted novelty 8.0

    Introduces the first NPU accelerator for diffusion language models with dLLM-specific ISA, hardware execution model, BAOS KV quantization, and 7nm RTL synthesis.

  4. RULER: What's the Real Context Size of Your Long-Context Language Models?

    cs.CL 2024-04 accept novelty 8.0

    RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

  5. ORPO: Monolithic Preference Optimization without Reference Model

    cs.CL 2024-03 conditional novelty 8.0

    ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

  6. Tensor Cache: Eviction-conditioned Associative Memory for Transformers

    cs.LG 2026-05 unverdicted novelty 7.0

    Tensor Cache augments sliding-window attention with an eviction-fed outer-product associative memory and a training correction to improve long-context performance under bounded memory.

  7. SSV: Sparse Speculative Verification for Efficient LLM Inference

    cs.OS 2026-05 unverdicted novelty 7.0

    SpecSA is a sparse speculative-verification framework that integrates speculative decoding and dynamic sparse attention to achieve up to 3.49x end-to-end throughput and 6.86x kernel speedups on H100 GPUs for long-cont...

  8. EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    EventPrune prunes 80% of visual tokens in Video-LLMs using event camera motion cues, yielding 1.89x speedup, 52% fewer GFLOPs, and slightly higher accuracy than full-token baselines on first-person dynamic spatial reasoning.

  9. From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

    cs.CL 2026-05 unverdicted novelty 7.0

    A dataset-agnostic framework converts text tool-calling benchmarks to paired audio versions via TTS and noise, showing model-dependent performance with small text-to-voice gaps of 1.8-4.8 points on Confetti and When2Call.

  10. TurboVGGT: Fast Visual Geometry Reconstruction with Adaptive Alternating Attention

    cs.CV 2026-05 unverdicted novelty 7.0

    TurboVGGT uses adaptive sparse global attention with varying sparsity levels across frames and layers plus frame attention to enable faster multi-view 3D reconstruction while keeping competitive quality versus prior s...

  11. Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

    stat.ML 2026-05 unverdicted novelty 7.0

    MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 f...

  12. Very Efficient Listwise Multimodal Reranking for Long Documents

    cs.IR 2026-05 unverdicted novelty 7.0

    ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.

  13. CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    CUDAHercules benchmark demonstrates that leading LLMs generate functional CUDA code but fail to recover expert-level optimization strategies needed for peak performance on Ampere, Hopper, and Blackwell GPUs.

  14. CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

    cs.LG 2026-05 unverdicted novelty 7.0

    CUDABeaver shows LLM CUDA debuggers often degenerate code for test-passing at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.

  15. ProteinJEPA: Latent prediction complements protein language models

    cs.LG 2026-05 unverdicted novelty 7.0

    Masked-position MLM plus JEPA latent prediction outperforms MLM-only pretraining on 10-11 of 16 downstream tasks for 35M-150M protein models while JEPA alone fails.

  16. LaTER: Efficient Test-Time Reasoning via Latent Exploration and Explicit Verification

    cs.CL 2026-05 unverdicted novelty 7.0

    LaTER reduces LLM token usage 16-33% on reasoning benchmarks by exploring in latent space then switching to explicit CoT verification, with gains like 70% to 73.3% on AIME 2025 in the training-free version.

  17. Long Context Pre-Training with Lighthouse Attention

    cs.CL 2026-05 conditional novelty 7.0

    Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...

  18. VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

    cs.CV 2026-05 unverdicted novelty 7.0

    VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

  19. Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge

    cs.DC 2026-05 unverdicted novelty 7.0

    Tempus delivers 607 GOPS at 10.677 W using fixed 16 AIE cores on Versal AI Edge, with 211.2x better platform-aware utility than spatial SOTA ARIES and zero URAM/DSP utilization.

  20. CellxPert: Inference-Time MCMC Steering of a Multi-Omics Single-Cell Foundation Model for In-Silico Perturbation

    q-bio.GN 2026-04 unverdicted novelty 7.0

    CellxPert uses inference-time MCMC steering on a multi-omics single-cell foundation model to predict genome-wide transcriptomic responses to gene perturbations and outperforms baselines on cell-type annotation, pertur...

  21. ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space

    cs.LG 2026-04 unverdicted novelty 7.0

    ABC enables any-subset autoregressive generation of continuous stochastic processes via non-Markovian diffusion bridges that track physical time and allow path-dependent conditioning.

  22. Shorthand for Thought: Compressing LLM Reasoning via Entropy-Guided Supertokens

    cs.CL 2026-04 unverdicted novelty 7.0

    Entropy-guided supertokens from BPE on reasoning traces compress LLM outputs by 8.1% on average across models and math benchmarks with no accuracy loss while exposing strategy differences between correct and incorrect traces.

  23. Breaking the Autoregressive Chain: Hyper-Parallel Decoding for Efficient LLM-Based Attribute Value Extraction

    cs.CL 2026-04 unverdicted novelty 7.0

    Hyper-Parallel Decoding enables parallel generation of independent sequences in LLMs via position ID manipulation, delivering up to 13.8X speedup for attribute value extraction.

  24. QFlash: Bridging Quantization and Memory Efficiency in Vision Transformer Attention

    cs.LG 2026-04 unverdicted novelty 7.0

    QFlash implements end-to-end integer FlashAttention with integer-only softmax, delivering up to 8.69x speedup and 18.8% energy savings on ViT models while preserving accuracy under per-tensor quantization.

  25. A satellite foundation model for improved wealth monitoring

    cs.CY 2026-04 unverdicted novelty 7.0

    Tempov is a self-supervised satellite foundation model that predicts wealth levels and decadal changes at high resolution across Africa from Landsat imagery, outperforming baselines even with limited labels and genera...

  26. DocPrune:Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning

    cs.CV 2026-04 unverdicted novelty 7.0

    DocPrune is a training-free token pruning method that removes background and irrelevant tokens from document images using question and comprehension signals, yielding 3x encoder and 3.3x decoder throughput gains plus ...

  27. Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling

    cs.LG 2026-04 unverdicted novelty 7.0

    Stream-CQSA uses CQS-based decomposition to stream exact attention computations for billion-token sequences on limited-memory hardware.

  28. ScenarioControl: Vision-Language Controllable Vectorized Latent Scenario Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    ScenarioControl introduces the first vision-language controllable generator for realistic vectorized 3D driving scenarios with temporal consistency across actor views.

  29. TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens

    cs.CV 2026-04 unverdicted novelty 7.0

    TokenGS uses learnable Gaussian tokens in an encoder-decoder architecture to regress 3D means directly, achieving SOTA feed-forward reconstruction on static and dynamic scenes with better robustness.

  30. Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

    cs.LG 2026-04 unverdicted novelty 7.0

    The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

  31. Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation

    cs.CV 2026-04 unverdicted novelty 7.0

    LMFT enables state-of-the-art performance in video unsupervised domain adaptation by focusing on motion-rich tokens and reducing computational overhead.

  32. User Simulator-Guided Multi-Turn Preference Optimization for Reasoning LLM-based Conversational Recommendation

    cs.IR 2026-04 unverdicted novelty 7.0

    SMTPO uses multi-task SFT to improve simulator feedback quality and RL with fine-grained rewards to optimize multi-turn preference reasoning in LLM-based conversational recommendation.

  33. Fast Cross-Operator Optimization of Attention Dataflow

    cs.AR 2026-04 unverdicted novelty 7.0

    MMEE encodes dataflow decisions in matrix form for fast exhaustive search, delivering 40-69% lower latency and energy use than prior methods while running 64-343x faster.

  34. GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving

    cs.DC 2026-03 unverdicted novelty 7.0

    GhostServe applies erasure coding to KV cache in host memory for fast recovery from failures in LLM serving, cutting checkpointing latency up to 2.7x and recovery latency 2.1x versus prior methods.

  35. CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training

    cs.LG 2026-02 unverdicted novelty 7.0

    CapTrack shows post-training causes drift beyond facts, with instruction fine-tuning producing stronger behavioral changes than preference optimization across model families.

  36. Latent Generative Solvers for Generalizable Long-Term Physics Simulation

    cs.AI 2026-02 unverdicted novelty 7.0

    LGS pretrained on 2.5M trajectories across 16 systems matches deterministic baselines at one step and halves 20-step error while using far less compute and adapting to held-out higher-resolution flows.

  37. FlashSinkhorn: IO-Aware Entropic Optimal Transport on GPU

    cs.LG 2026-02 conditional novelty 7.0

    FlashSinkhorn delivers up to 32x forward and 161x end-to-end speedups for entropic OT on A100 GPUs via IO-aware Triton kernels that fuse log-domain updates and streaming transport application.

  38. MIDUS: Memory-Infused Depth Up-Scaling

    cs.LG 2025-12 unverdicted novelty 7.0

    MIDUS replaces duplicated FFN branches in depth up-scaling with head-wise memory layers using product-key retrieval and HIVE to deliver lightweight, head-conditioned residual capacity.

  39. DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning

    cs.CL 2025-10 conditional novelty 7.0

    DELTA partitions layers into full, delta, and sparse groups to select salient tokens via aggregated attention scores, matching full-attention accuracy on AIME and GPQA while cutting attended tokens up to 4.25x and ach...

  40. FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

    cs.CV 2025-09 conditional novelty 7.0

    FastVGGT achieves 4x speedup on VGGT for 1000-image inputs using training-free token merging tailored to 3D architectures while reducing error accumulation.

  41. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  42. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

    cs.LG 2024-07 accept novelty 7.0

    FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.

  43. Chronos: Learning the Language of Time Series

    cs.LG 2024-03 conditional novelty 7.0

    Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.

  44. StreamGVE: Training-Free Video Editing via Few-Step Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    StreamGVE enables high-quality training-free video editing by converting the task to noise-to-data streaming generation with dual-branch fast sampling, self-attention bridges, cross-attention grounding, source-oriente...

  45. Towards Understanding Self-Pretraining for Sequence Classification

    cs.LG 2026-05 unverdicted novelty 6.0

    Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.

  46. PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    PulseCol introduces periodically refreshed column-sparse attention to achieve up to 1.95x speedup over FlashAttention in diffusion LLMs with maintained model quality.

  47. DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    DynaTok introduces temporally adaptive budget allocation with EMA memory and spatial selection with memory to compress video tokens, retaining over 95% accuracy at 90% reduction on VideoQA benchmarks.

  48. Context Memorization for Efficient Long Context Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    Attention-state memory externalizes long prefixes into a lightweight lookup table of precomputed attention states, yielding higher accuracy than standard in-context learning at fixed memory budgets and lower latency t...

  49. OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

    cs.LG 2026-05 unverdicted novelty 6.0

    OSCAR achieves near-BF16 accuracy for 2-bit KV cache quantization by using offline spectral covariance-aware rotations aligned with attention, plus a custom deployable INT2 kernel compatible with paged serving.

  50. SparseSAM: Structured Sparsification of Activations in Segment Anything Models

    cs.CV 2026-05 unverdicted novelty 6.0

    SparseSAM achieves 2x faster inference and 2.8x memory reduction in SAM with only 0.004 mIoU loss at 0.4 density via Stripe-Sort Attention and Residual-Consistency MLP.

  51. AtlasVid: Efficient Ultra-High-Resolution Long Video Generation via Decoupled Global-Local Modeling

    cs.CV 2026-05 unverdicted novelty 6.0

    AtlasVid proposes a decoupled global-local diffusion framework that trains at low resolution with LoRA and generalizes to ultra-high-resolution long video synthesis via semantic proxy guidance and locality-preserving ...

  52. A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM

    cs.DC 2026-05 conditional novelty 6.0

    PrismLLM constructs a sliced execution graph and uses hybrid emulation to faithfully reproduce performance and memory behavior of up to 8192-GPU LLM training runs on fewer than 1% of the original GPUs.

  53. From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance...

  54. Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

    cs.CV 2026-05 unverdicted novelty 6.0

    Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

  55. SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference

    cs.LG 2026-05 unverdicted novelty 6.0

    Spherical KV introduces angle-domain attention with spherical key parameterization and rate-distortion retention to cut KV cache residency while preserving efficient paged decoding.

  56. Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers

    cs.LG 2026-05 unverdicted novelty 6.0

    Stateful sessions with incremental KV cache and flash queries allow O(|q|) latency in streaming transformer inference, delivering up to 5.9x speedup over conventional engines while preserving full attention.

  57. SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    SceneGraphVLM generates dynamic scene graphs from video using compact VLMs, TOON serialization, and hallucination-aware RL to improve precision and achieve one-second latency.

  58. TurboGR: An Accelerated Training System for Large-Scale Generative Recommendation

    cs.DC 2026-05 unverdicted novelty 6.0

    TurboGR trains up to 0.2B-parameter generative recommendation models on Ascend NPUs at 54.71% MFU with 0.97 near-linear scalability via jagged acceleration, hierarchical parallelism, and negative sampling optimizations.

  59. Search Your Block Floating Point Scales!

    cs.LG 2026-05 unverdicted novelty 6.0

    ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.

  60. Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing

    cs.CL 2026-05 conditional novelty 6.0

    EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 166 Pith papers · 6 internal anchors

  1. [1]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023

  2. [2]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

  3. [3]

    Scatterbrain: Unifying sparse and low-rank attention

    Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher Ré. Scatterbrain: Unifying sparse and low-rank attention. In Advances in Neural Information Processing Systems (NeurIPS), 2021

  4. [4]

    Rethinking attention with performers

    Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. In International Conference on Learning Representations (ICLR), 2020

  5. [5]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022

  6. [6]

    Dissecting the Ampere GPU architecture via microbenchmarking

    Zhe Jia and Peter Van Sandt. Dissecting the Ampere GPU architecture via microbenchmarking. GPU Technology Conference, 2021

  7. [7]

    Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking

    Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P Scarpazza. Dissecting the NVIDIA Volta GPU architecture via microbenchmarking. arXiv preprint arXiv:1804.06826, 2018

  8. [8]

    Transformers are RNNs: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020

  9. [9]

    Reformer: The efficient transformer

    Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In The International Conference on Machine Learning (ICML), 2020

  10. [10]

    xformers: A modular and hackable transformer modelling library

    Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, and Daniel Haziza. xformers: A modular and hackable transformer modelling library. https://github.com/facebookresearch/xformers, 2022

  11. [11]

    Online normalizer calculation for softmax

    Maxim Milakov and Natalia Gimelshein. Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867, 2018

  12. [12]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. ArXiv, abs/2303.08774, 2023

  13. [13]

    Self-attention Does Not Need $O(n^2)$ Memory

    Markus N Rabe and Charles Staats. Self-attention does not need $O(n^2)$ memory. arXiv preprint arXiv:2112.05682, 2021

  14. [14]

    Efficient content-based sparse attention with routing transformers

    Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9: 53–68, 2021

  15. [15]

    Fast Transformer Decoding: One Write-Head is All You Need

    Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019

  16. [16]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

  17. [17]

    Triton: an intermediate language and compiler for tiled neural network computations

    Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019

  18. [18]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017

  19. [19]

    Linformer: Self-Attention with Linear Complexity

    Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020

  20. [20]

    Big bird: Transformers for longer sequences

    Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33, 2020