StreamKL is the first fused GPU primitive for attention KL divergence that reduces memory from O(N_Q N_K) to O(1) via an online one-pass formulation and tile-wise recomputation.
Mixed citations
PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation
Mixed citation behavior. Most common role is background (57%).
representative citing papers
First empirical study of correctness bugs in torch.compile characterizes their patterns and proposes AlignGuard, which found 23 confirmed new bugs via LLM-guided test mutation.
A multi-scale spectral pipeline using deep learning filament detection automatically identifies 91 oscillatory events in two weeks of 2014 GONG data, recovering known events and finding new ones with periods 20-126 min.
A VGG10 predictive coding network is trained on ImageNet via equilibrium propagation to 13.23% top-5 error, close to the 12.2% backpropagation baseline, marking the first such demonstration at this scale.
Reformulates RkNN queries as graphics ray casting to leverage GPU ray-tracing cores, claiming better performance than prior methods in challenging spatial database scenarios.
AIGaitor is claimed to be the first end-to-end, on-device pipeline for monocular motion capture and deep-learning gait analysis demonstrated on consumer smartphones.
DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via module-wise distillation.
An FPGA implementation of a neuromorphic auditory sensor plus graph neural network achieves 87.43% accuracy on Google Speech Commands v2 with sub-35 µs latency and 1.12 W power.
A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.
VNN-LIB 2.0 defines a network theory abstraction, formal query syntax, type system over numeric domains, and Agda-mechanized semantics to provide rigorous foundations for neural network verification independent of evolving model formats.
Sarus Suite shows HPC can match production container performance using an unmodified Podman engine plus explicit system layers for scheduling, scalable images, and host integration.
Latent Grammar Flow discovers ODEs by placing grammar-based equation representations in a discrete latent space, using a behavioral loss to cluster similar equations, and sampling via a discrete flow model guided by data fit and constraints.
A large benchmark finds traditional imputation methods for scRNA-seq data generally outperform deep learning ones, but numerical recovery does not reliably improve biological downstream analyses and no method wins across all settings.
Sketch-based regularization allows in situ training of implicit neural compressors to approximately match offline performance on 2D/3D simulation data at high compression rates.
A monotonic ICNN architecture with domain reduction to the positive octant approximates polyconvex envelopes of isotropic functions more efficiently than existing necessary-and-sufficient methods, demonstrated on Saint Venant-Kirchhoff energy.
MALOQ introduces a scalable SO(2)-equivariant ML framework with custom kernels and edge-wise graph distribution for predicting large-scale quantum transport operators.
XCheck extracts cross-layer constraints to generate test models and monitor behaviors, revealing 2,034 compiler-platform interaction bugs in three DL compilers.
GF-DiT dynamically adapts parallelism during DiT serving via trajectory tasks and group-free collectives, reporting up to 6x throughput and 95% latency reduction versus static configurations.
The paper constructs an SCPI dataset via LLM-based annotation and trains classifiers to detect sensitive personal information in Japanese pre-training corpora, claiming this is the first such exploration.
WHET applies fine-grained coefficient-to-slot transforms, plaintext compression, and modulus raising plus lightweight hardware tweaks to FHE accelerators, delivering 1.38-8.74x per-area gains and sub-millisecond CKKS bootstrapping.
PiSO computes exact optimal channel-wise quantization scales for PTQ by partitioning the scale search space into intervals admitting closed-form minimizers, with extensions to group-wise quantization and error correction.
ANNS-AMP adapts distance-computation precision to vector-space regions via a lightweight cluster-level predictor and a bit-serial accelerator, delivering 163.76x/10.57x/2.06x average speedups and 1100x/39.41x/6.66x energy reductions versus CPU/GPU/custom baselines with <2.7% accuracy loss.
KForge uses dual LLM agents for cross-platform kernel generation, reporting 2.12% throughput gain on NVIDIA B200 vs TensorRT-LLM and 5.13x geometric mean speedup on Intel Arc B580 vs PyTorch on 37 workloads.
PINN failure modes stem from overfitting to collocation points; regularization and double backpropagation over full residuals fix them, achieving SOTA with up to 23x fewer points on standard benchmarks.
citing papers explorer
- In Situ Training of Implicit Neural Compressors for Scientific Simulations via Sketch-Based Regularization
  Sketch-based regularization allows in situ training of implicit neural compressors to approximately match offline performance on 2D/3D simulation data at high compression rates.
- Mitigating Catastrophic Forgetting in Target Language Adaptation of LLMs via Source-Shielded Updates
  SSU mitigates catastrophic forgetting in low-resource LLM target-language adaptation by scoring and column-wise freezing source-critical parameters, reducing source degradation to ~3% versus ~20% for full fine-tuning while matching target performance.
- Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants
  Flashlight is a compiler-native PyTorch framework that generates efficient fused kernels for arbitrary and data-dependent attention variants, supporting more cases than FlexAttention with competitive performance.
- ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution
  ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.
- Hyperdimensional Decoding of Spiking Neural Networks
  SNN-HDC decoding delivers better accuracy, lower latency, and 1.24x-3.67x lower estimated energy than standard methods on DvsGesture and SL-Animals-DVS while detecting 100% of samples from an untrained class.
- Evaluating Artificial Intelligence Algorithms for the Standardization of Transtibial Prosthetic Socket Shape Design
  Random forest predicting prosthetist adaptations from limb scans achieves median surface-to-surface error of 1.24 mm, outperforming direct socket shape prediction and other models.
- GraphMend: Code Transformations for Fixing Graph Breaks in PyTorch 2
- Synthetic Homes: A Multimodal Generative AI Pipeline for Residential Building Data Generation under Data Scarcity