hub

FlashAttention-2: Faster attention with better parallelism and work partitioning

Tri Dao · 2024

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

browse 10 citing papers

hub tools

JSON dossier citing papers JSON

citation-role summary

method 3 background 1

citation-polarity summary

use method 3 background 1

representative citing papers

PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

PaceVGGT reduces VGGT inference latency by up to 5.1x on ScanNet-50 via pre-AA token pruning with a distilled Token Scorer, per-frame keep budgets, adaptive merge/prune, and feature-guided restoration, while preserving reconstruction quality on ScanNet-50 and 7-Scenes.

LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute

cs.CV · 2026-05-07 · conditional · novelty 7.0

LookWhen factorizes video recognition into learning when, where, and what to compute via uniqueness-based token selection and dual-teacher distillation, achieving better accuracy-FLOPs trade-offs than baselines on multiple datasets.

Neural posterior estimation of the neutrino direction in IceCube using transformer-encoded normalizing flows on the sphere

hep-ex · 2026-04-21 · unverdicted · novelty 7.0

A transformer-encoded spherical normalizing flow achieves state-of-the-art angular resolution for IceCube neutrino tracks and showers, improving median resolution by factors of 1.3-2.5 over B-spline likelihoods at 100 TeV and outperforming prior ML methods for muons.

Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU

cs.PF · 2026-04-16 · unverdicted · novelty 7.0

RPA kernel for TPUs achieves 86% MBU in decode and 73% MFU in prefill on Llama 3 8B via tiling for ragged memory, fused pipelines, and specialized compilation for prefill/decode workloads.

Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression

cs.CL · 2025-02-04 · unverdicted · novelty 7.0

KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency reduction.

OmniDrop: Layer-wise Token Pruning for Omni-modal LLMs via Query-Guidance

cs.AI · 2026-05-14 · unverdicted · novelty 6.0

OmniDrop is a training-free layer-wise token pruning framework for omni-modal LLMs that uses query guidance and temporal diversity to reduce prefill latency by up to 40% and memory by 14.7% while improving benchmark scores by up to 3.58 points.

ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference

cs.DC · 2026-05-11 · unverdicted · novelty 6.0

ChunkFlow achieves up to 1.28x step-time speedup and up to 49% lower peak GPU memory for DiT inference by using a first-order model to guide communication-aware chunked prefetching.

Scalable Spatiotemporal Inference with Biased Scan Attention Transformer Neural Processes

cs.LG · 2025-06-10 · unverdicted · novelty 6.0

BSA-TNP is a new neural process model with KRBlocks and biased scan attention that claims to match top accuracy while scaling inference to over 1M points in under a minute on a single GPU and supporting translation invariance.

ERASE: Eliminating Redundant Visual Tokens via Adaptive Two-Stage Token Pruning

cs.CV · 2026-05-11 · unverdicted · novelty 5.0

ERASE prunes 85% of vision tokens in Qwen2.5-VL-7B while retaining 89.46% accuracy, outperforming prior methods that retain only 78.1%.

Step-Audio-R1.5 Technical Report

eess.AS · 2026-04-28 · unverdicted · novelty 4.0

Step-Audio-R1.5 applies RLHF to audio reasoning models to escape the verifiable reward trap of RLVR, preserving analytical ability while restoring prosodic naturalness and immersion in long dialogues.

citing papers explorer

Showing 10 of 10 citing papers.

PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers cs.CV · 2026-05-08 · unverdicted · none · ref 32
PaceVGGT reduces VGGT inference latency by up to 5.1x on ScanNet-50 via pre-AA token pruning with a distilled Token Scorer, per-frame keep budgets, adaptive merge/prune, and feature-guided restoration, while preserving reconstruction quality on ScanNet-50 and 7-Scenes.
LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute cs.CV · 2026-05-07 · conditional · none · ref 33
LookWhen factorizes video recognition into learning when, where, and what to compute via uniqueness-based token selection and dual-teacher distillation, achieving better accuracy-FLOPs trade-offs than baselines on multiple datasets.
Neural posterior estimation of the neutrino direction in IceCube using transformer-encoded normalizing flows on the sphere hep-ex · 2026-04-21 · unverdicted · none · ref 43
A transformer-encoded spherical normalizing flow achieves state-of-the-art angular resolution for IceCube neutrino tracks and showers, improving median resolution by factors of 1.3-2.5 over B-spline likelihoods at 100 TeV and outperforming prior ML methods for muons.
Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU cs.PF · 2026-04-16 · unverdicted · none · ref 3
RPA kernel for TPUs achieves 86% MBU in decode and 73% MFU in prefill on Llama 3 8B via tiling for ragged memory, fused pipelines, and specialized compilation for prefill/decode workloads.
Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression cs.CL · 2025-02-04 · unverdicted · none · ref 8
KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency reduction.
OmniDrop: Layer-wise Token Pruning for Omni-modal LLMs via Query-Guidance cs.AI · 2026-05-14 · unverdicted · none · ref 5
OmniDrop is a training-free layer-wise token pruning framework for omni-modal LLMs that uses query guidance and temporal diversity to reduce prefill latency by up to 40% and memory by 14.7% while improving benchmark scores by up to 3.58 points.
ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference cs.DC · 2026-05-11 · unverdicted · none · ref 5
ChunkFlow achieves up to 1.28x step-time speedup and up to 49% lower peak GPU memory for DiT inference by using a first-order model to guide communication-aware chunked prefetching.
Scalable Spatiotemporal Inference with Biased Scan Attention Transformer Neural Processes cs.LG · 2025-06-10 · unverdicted · none · ref 12
BSA-TNP is a new neural process model with KRBlocks and biased scan attention that claims to match top accuracy while scaling inference to over 1M points in under a minute on a single GPU and supporting translation invariance.
ERASE: Eliminating Redundant Visual Tokens via Adaptive Two-Stage Token Pruning cs.CV · 2026-05-11 · unverdicted · none · ref 23
ERASE prunes 85% of vision tokens in Qwen2.5-VL-7B while retaining 89.46% accuracy, outperforming prior methods that retain only 78.1%.
Step-Audio-R1.5 Technical Report eess.AS · 2026-04-28 · unverdicted · none · ref 11
Step-Audio-R1.5 applies RLHF to audio reasoning models to escape the verifiable reward trap of RLVR, preserving analytical ability while restoring prosodic naturalness and immersion in long dialogues.

FlashAttention-2: Faster attention with better parallelism and work partitioning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer