PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation
Mixed citation behavior. Most common role is background (57%).
citing papers explorer
-
StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation
StreamKL is the first fused GPU primitive for attention KL divergence that reduces memory from O(N_Q N_K) to O(1) via an online one-pass formulation and tile-wise recomputation.
-
Demystifying the Silence of Correctness Bugs in PyTorch Compiler
First empirical study of correctness bugs in torch.compile characterizes their patterns and proposes AlignGuard, which found 23 confirmed new bugs via LLM-guided test mutation.
-
Training a Predictive Coding Network on ImageNet using Equilibrium Propagation
A VGG10 predictive coding network is trained on ImageNet via equilibrium propagation to 13.23% top-5 error, close to the 12.2% backpropagation baseline, marking the first such demonstration at this scale.
-
Locking Pretrained Weights via Deep Low-Rank Residual Distillation
DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via module-wise distillation.
-
End-to-End Keyword Spotting on FPGA Using Graph Neural Networks with a Neuromorphic Auditory Sensor
An FPGA implementation of a neuromorphic auditory sensor plus graph neural network achieves 87.43% accuracy on Google Speech Commands v2 with sub-35 µs latency and 1.12 W power.
-
Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval
A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.
-
VNN-LIB 2.0: Rigorous Foundations for Neural Network Verification
VNN-LIB 2.0 defines a network theory abstraction, formal query syntax, type system over numeric domains, and Agda-mechanized semantics to provide rigorous foundations for neural network verification independent of evolving model formats.
-
Sarus Suite: Cloud-native Containers for HPC
Sarus Suite shows HPC can match production container performance using an unmodified Podman engine plus explicit system layers for scheduling, scalable images, and host integration.
-
Neuro-Symbolic ODE Discovery with Latent Grammar Flow
Latent Grammar Flow discovers ODEs by placing grammar-based equation representations in a discrete latent space, using a behavioral loss to cluster similar equations, and sampling via a discrete flow model guided by data fit and constraints.
-
A Large-Scale Comparative Analysis of Imputation Methods for Single-Cell RNA Sequencing Data
A large benchmark finds traditional imputation methods for scRNA-seq data generally outperform deep learning ones, but numerical recovery does not reliably improve biological downstream analyses and no method wins across all settings.
-
In Situ Training of Implicit Neural Compressors for Scientific Simulations via Sketch-Based Regularization
Sketch-based regularization allows in situ training of implicit neural compressors to approximately match offline performance on 2D/3D simulation data at high compression rates.
-
Finding Compiler-Platform Interaction Bugs in Deep Learning Pipelines via Cross-Layer Constraints
XCheck extracts cross-layer constraints to generate test models and monitor behaviors, revealing 2,034 compiler-platform interaction bugs in three DL compilers.
-
GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving
GF-DiT introduces elastic GPU parallelism scheduling for DiT serving via asynchronous trajectory tasks and group-free collectives, reporting up to 6.01x throughput gains over static configurations.
-
Detecting Sensitive Personal Information in Japanese Pre-Training Corpora for Large Language Models
The paper constructs an SCPI dataset via LLM-based annotation and trains classifiers to detect sensitive personal information in Japanese pre-training corpora, claiming this is the first such exploration.
-
WHET: Welding Homomorphic Encryption to Accelerator Architectures
WHET applies fine-grained coefficient-to-slot transforms, plaintext compression, and modulus raising plus lightweight hardware tweaks to FHE accelerators, delivering 1.38-8.74x per-area gains and sub-millisecond CKKS bootstrapping.
-
Optimal Post-Training Quantization Scales and Where to Find Them
PiSO computes exact optimal channel-wise quantization scales for PTQ by partitioning the scale search space into intervals admitting closed-form minimizers, with extensions to group-wise quantization and error correction.
-
ANNS-AMP: Accelerating Approximate Nearest Neighbor Search via Adaptive Mixed-Precision Computing
ANNS-AMP adapts distance-computation precision to vector-space regions via a lightweight cluster-level predictor and a bit-serial accelerator, delivering 163.76x/10.57x/2.06x average speedups and 1100x/39.41x/6.66x energy reductions versus CPU/GPU/custom baselines with <2.7% accuracy loss.
-
KForge: LLM-Driven Cross-Platform Kernel Generation for AI Accelerators
KForge uses dual LLM agents for cross-platform kernel generation, reporting 2.12% throughput gain on NVIDIA B200 vs TensorRT-LLM and 5.13x geometric mean speedup on Intel Arc B580 vs PyTorch on 37 workloads.
-
PINNs Failure Modes are Overfitting
PINN failure modes are overfitting to collocation points; regularization and double backpropagation over full residuals fix them, achieving SOTA with up to 23x fewer points on standard benchmarks.
-
Learning Minimally Rigid Graphs with High Realization Counts
Reinforcement learning with graph neural networks finds minimally rigid graphs that match known planar realization optima and set new records for spherical realization counts.
-
ShardTensor: Domain Parallelism for Scientific Machine Learning
ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
-
LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.
-
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while providing a theoretical dominance result.
-
Doubly Robust Proxy Causal Learning with Neural Mean Embeddings
A neural doubly robust proxy causal learning framework using mean embeddings for treatment bridges provides consistent estimators for causal dose-response functions under unobserved confounding for continuous and structured treatments.
-
ExecuTorch -- A Unified PyTorch Solution to Run AI Models On-Device
ExecuTorch is a unified PyTorch-native deployment framework that enables seamless on-device execution of AI models across heterogeneous hardware while preserving original PyTorch semantics.
-
TouchAnything: Diffusion-Guided 3D Reconstruction from Sparse Robot Touches
TouchAnything reconstructs accurate 3D object geometries from only a few tactile contacts by optimizing for consistency with a pretrained visual diffusion prior.
-
Mambalaya: Einsum-Based Fusion Optimizations on State-Space Models
Mambalaya delivers 4.9x prefill and 1.9x generation speedups on Mamba layers over prior accelerators by systematically fusing inter-Einsum operations.
-
M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling
M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.
-
Fast and memory-efficient classical simulation of quantum machine learning via forward and backward gate fusion
Gate fusion applied to both forward and backward passes in quantum circuit simulation achieves 20-30x throughput gains and supports training large 20-qubit 1000-layer QML models with 60000 parameters using gradient checkpointing.
-
Mitigating Catastrophic Forgetting in Target Language Adaptation of LLMs via Source-Shielded Updates
SSU mitigates catastrophic forgetting in low-resource LLM target-language adaptation by scoring and column-wise freezing source-critical parameters, reducing source degradation to ~3% versus ~20% for full fine-tuning while matching target performance.
-
Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants
Flashlight is a compiler-native PyTorch framework that generates efficient fused kernels for arbitrary and data-dependent attention variants, supporting more cases than FlexAttention with competitive performance.
-
ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution
ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.
-
GraphMend: Code Transformations for Fixing Graph Breaks in PyTorch 2
GraphMend uses two Jaseci-based code transformations to eliminate dynamic-control-flow and side-effect graph breaks in PyTorch 2, reducing breaks to zero in six of eight Hugging Face models and yielding up to 75% latency reduction on RTX 3090 and A40 GPUs.
-
Flex Attention: A Programming Model for Generating Optimized Attention Kernels
FlexAttention supplies a compiler-driven interface that expresses common attention variants in a few lines of PyTorch and emits optimized kernels whose speed matches hand-written implementations.
-
Deep Learning Alternatives of the Kolmogorov Superposition Theorem
ActNet is a new KST-based neural network that outperforms KANs and competes with MLPs in PINN benchmarks for PDE simulation tasks.
-
Bundle Adjustment in the Eager Mode
Introduces an eager-mode PyTorch BA library with GPU-accelerated sparse ops claiming 18.5-23x speedups over GTSAM, g2o, and Ceres.
-
MatterSim: A Deep Learning Atomistic Model Across Elements, Temperatures and Pressures
MatterSim delivers a single deep learning force field that simulates inorganic materials across elements, 0-5000 K, and up to 1000 GPa with near first-principles accuracy for lattice dynamics, mechanics, and Gibbs free energies.
-
Sakana Fugu Technical Report
Sakana Fugu trains LLM orchestrators using fine-tuning, evolutionary algorithms, and RL to build query-adaptive multi-agent scaffolds, claiming SOTA results on benchmarks including SWE-Bench Pro and GPQA-Diamond.
-
Piper: A Programmable Distributed Training System
Piper decouples user-defined distributed training strategies from runtime execution using transformations on a unified global training DAG IR, achieving parity on ZeRO and gains on composed strategies like DualPipe.
-
Accelerated Fourier SAT (AFSAT): Fully Realising a GPU-based Symmetric Pseudo-Boolean SAT Solver
AFSAT realizes FastFourierSAT as a production GPU solver for heterogeneous symmetric pseudo-Boolean SAT via JAX-compiled continuous local search, with tailored DFT for stability and near-linear multi-accelerator scaling.
-
Anytime Training with Schedule-Free Spectral Optimization
SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.
-
torchtune: PyTorch native post-training library
torchtune is a modular PyTorch library for LLM post-training that delivers competitive performance and memory efficiency while supporting rapid research iteration through hackable components.
-
The $\textit{Silicon Society}$ Cookbook: Design Space of LLM-based Social Simulations
The base LLM choice dominates simulation outcomes in LLM-based social networks, while other design parameters show either additive or complex interactive effects.
-
Flint: Compiler Enabled Cluster-Free Design Space Exploration for Distributed ML
Flint generates compiler-derived workload graphs that support cluster-free design space exploration for distributed machine learning systems.
-
The Role and Relationship of Initialization and Densification in 3D Gaussian Splatting
Current densification methods in 3D Gaussian Splatting do not significantly benefit from dense initializations and perform similarly to sparse SfM-based ones.
-
Hyperdimensional Decoding of Spiking Neural Networks
SNN-HDC decoding delivers better accuracy, lower latency, and 1.24x-3.67x lower estimated energy than standard methods on DvsGesture and SL-Animals-DVS while detecting 100% of samples from an untrained class.
-
A Study of Parallel Continuous Local Search
Empirical study of parallel continuous local search for SAT finds redundant constraints can slow convergence, CLS works as a hybrid sub-solver, and search stabilizes quickly due to saddle-dense objectives.
-
Can Muon Fine-tune Adam-Pretrained Models?
Constraining fine-tuning updates with LoRA mitigates performance degradation when switching from Adam to Muon on pretrained models.
-
Evaluating Artificial Intelligence Algorithms for the Standardization of Transtibial Prosthetic Socket Shape Design
Random forest predicting prosthetist adaptations from limb scans achieves median surface-to-surface error of 1.24 mm, outperforming direct socket shape prediction and other models.
-
An Engineering Journey Training Large Language Models at Scale on Alps: The Apertus Experience
Apertus, a 70B open multilingual foundation model, was pre-trained on the Alps supercomputer, with details on adapting HPC infrastructure into a resilient ML platform.