PyTorch: An Imperative Style, High-Performance Deep Learning Library

Adam Lerer; Adam Paszke; Alban Desmaison; Alykhan Tejani; Andreas Köpf; Benoit Steiner; Edward Yang; Francisco Massa; Gregory Chanan; James Bradbury

arXiv: 1912.01703 · v1 · submitted 2019-12-03 · cs.LG · cs.MS · stat.ML

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala

Pith reviewed 2026-05-24 15:12 UTC · model grok-4.3

keywords: deep learning, machine learning library, imperative programming, Python integration, GPU acceleration, dynamic models, usability, performance

The pith

PyTorch shows that an imperative Pythonic style can deliver both usability and performance in deep learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deep learning frameworks have often focused on either usability or speed, but not both. This paper presents PyTorch as a library that demonstrates the two goals are compatible. It provides an imperative and Pythonic programming style that supports code as a model, makes debugging easy, and stays consistent with other scientific computing libraries. The library remains efficient and supports hardware accelerators such as GPUs. The authors detail the architectural principles and show benchmark results for individual subsystems and overall speed.

Core claim

PyTorch provides an imperative and Pythonic programming style that supports code as a model, makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs. Every aspect of PyTorch is a regular Python program under the full control of its user, and the careful pragmatic implementation of key runtime components enables them to work together for compelling performance.

What carries the argument

The imperative style in which model definition occurs through direct execution of Python code under user control, backed by runtime components that deliver efficiency and accelerator support.

If this is right

Models can be defined and altered using ordinary Python control flow structures such as loops and conditionals.
Debugging reduces to standard Python debugging tools and workflows.
The library interface aligns directly with tools such as NumPy for seamless data handling.
Hardware acceleration through GPUs is available without altering the core programming model.
Overall system speed on standard benchmarks reaches levels competitive with other deep learning libraries.
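The first point can be illustrated with a minimal pure-Python sketch of define-by-run differentiation. It is not PyTorch code and makes no claim about how the library implements autograd: the `Value` class, the `model` function, and the threshold of 10 are hypothetical names and values chosen for illustration. The sketch records each operation as it executes, so a data-dependent `while` loop needs no special graph construct.

```python
class Value:
    """Scalar that records how it was computed so gradients can flow backward."""

    def __init__(self, data, parents=(), grad_fns=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents      # Values this one was computed from
        self._grad_fns = grad_fns    # maps upstream gradient to each parent's share

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def backward(self):
        # Order the recorded graph topologically, then apply the chain rule.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, grad_fn in zip(node._parents, node._grad_fns):
                parent.grad += grad_fn(node.grad)


def model(x, w, max_steps=10):
    """Multiply by w until the value exceeds 10: a data-dependent loop."""
    y, steps = x, 0
    while y.data <= 10.0 and steps < max_steps:
        y = y * w
        steps += 1
    return y, steps


x, w = Value(1.0), Value(2.0)
y, steps = model(x, w)   # the loop runs 4 times, so y = x * w**4 = 16
y.backward()
print(steps, y.data, w.grad, x.grad)  # 4 16.0 32.0 16.0
```

Because the loop is ordinary Python, a breakpoint or `print` placed inside `model` behaves as it would in any other program, which is the debugging property the list above describes.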

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same imperative approach could be applied to libraries targeting new accelerator hardware beyond GPUs.
Interactive model exploration in notebooks would become more reliable because changes to control flow are immediately visible.
Other frameworks might incorporate similar Python-native model definitions to reduce the gap between research code and production code.

Load-bearing premise

The careful and pragmatic implementation of key runtime components enables them to work together to achieve compelling performance.

What would settle it

A set of benchmarks on common deep learning tasks where PyTorch runs substantially slower than alternative frameworks while still using the described imperative Python interface.

Figures

Figures reproduced from arXiv: 1912.01703 by Adam Lerer, Adam Paszke, Alban Desmaison, Alykhan Tejani, Andreas Köpf, Benoit Steiner, Edward Yang, Francisco Massa, Gregory Chanan, James Bradbury, Junjie Bai, Luca Antiga, Lu Fang, Martin Raison, Natalia Gimelshein, Sam Gross, Sasank Chilamkurthy, Soumith Chintala, Trevor Killeen, Zach DeVito, Zeming Lin.

**Figure 1.** A representative timeline of execution for the first few operations of a ResNet-50 model. The host CPU, which queues the work, quickly outpaces the execution of the operators on the GPU. This allows PyTorch to achieve almost perfect device utilization. In this example, GPU execution takes around three times longer than CPU scheduling. The exact ratio depends on the relative performance of the host CPU …
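The scheduling effect in the Figure 1 caption can be sketched with a small deterministic simulation. This is an illustrative model only, not a measurement and not PyTorch's scheduler: `simulate_timeline`, its parameters, and the time units are hypothetical, and every operator is assumed to cost the same. The host queues one operator per `cpu_cost` units, and the device runs each operator for `gpu_cost` units as soon as it is both queued and the device is free.

```python
def simulate_timeline(num_ops, cpu_cost, gpu_cost):
    """Return (gpu_utilization, host_lead) for a queue of identical operators.

    cpu_cost: time the host spends queuing one operator (hypothetical units)
    gpu_cost: time the device spends executing one operator
    """
    cpu_time = 0.0      # when the host finishes queuing the current operator
    gpu_free_at = 0.0   # when the device finishes its previous operator
    gpu_busy = 0.0      # total time the device spent executing
    for _ in range(num_ops):
        cpu_time += cpu_cost                 # host queues the operator and moves on
        start = max(cpu_time, gpu_free_at)   # device needs both the work and a free slot
        gpu_free_at = start + gpu_cost
        gpu_busy += gpu_cost
    utilization = gpu_busy / gpu_free_at
    host_lead = gpu_free_at - cpu_time       # how long the device runs after the host is done
    return utilization, host_lead


# Device three times slower than host scheduling, as in the caption's example.
print(simulate_timeline(10, cpu_cost=1.0, gpu_cost=3.0))  # utilization 30/31, lead 21.0
# Reversed ratio: the device now waits on the host between operators.
print(simulate_timeline(10, cpu_cost=3.0, gpu_cost=1.0))  # utilization 10/31, lead 1.0
```

Under these assumptions the device sits idle only while the first operator is being queued, so utilization approaches 1 as the number of operators grows. When the host is the slower side, the device waits between operators and utilization drops.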

**Figure 2.** The behavior of the first iteration differs significantly from that of subsequent ones. At first, calls to the CUDA memory management functions (cudaMalloc and cudaFree) slow down execution dramatically by blocking the CPU thread for long periods of time, lowering the utilization of the GPU. This effect disappears in subsequent iterations as the PyTorch caching memory allocator starts reu…
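The idea behind a caching allocator, as described in the Figure 2 caption, can be sketched in a few lines of pure Python. This is a toy under stated assumptions, not PyTorch's allocator: `CachingAllocator`, `run_iteration`, and the block sizes are hypothetical, blocks are matched by exact size only, and a counter stands in for the slow, blocking device calls.

```python
from collections import defaultdict


class CachingAllocator:
    """Toy allocator that keeps freed blocks for reuse instead of releasing them."""

    def __init__(self):
        self.free_blocks = defaultdict(list)  # size -> cached blocks of that size
        self.device_mallocs = 0               # counts slow, blocking device allocations
        self._next_id = 0

    def malloc(self, size):
        if self.free_blocks[size]:
            return self.free_blocks[size].pop()  # fast path: reuse a cached block
        self.device_mallocs += 1                 # slow path: stands in for cudaMalloc
        self._next_id += 1
        return (self._next_id, size)

    def free(self, block):
        _, size = block
        self.free_blocks[size].append(block)     # cache the block; no cudaFree call


def run_iteration(allocator, sizes):
    """Allocate every buffer for one training step, then free them all."""
    blocks = [allocator.malloc(size) for size in sizes]
    for block in blocks:
        allocator.free(block)


allocator = CachingAllocator()
sizes = [1024, 4096, 1024]
run_iteration(allocator, sizes)
print(allocator.device_mallocs)  # 3: the first iteration pays for every allocation
run_iteration(allocator, sizes)
print(allocator.device_mallocs)  # 3: the second iteration reuses cached blocks
```

In this sketch only the first iteration reaches the slow path, which mirrors the caption's observation that the slowdown disappears once blocks are being reused.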

**Figure 3.** Among arXiv papers each month that mention common deep learning frameworks, the percentage that mention PyTorch.

Original abstract

Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it provides an imperative and Pythonic programming style that supports code as a model, makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs. In this paper, we detail the principles that drove the implementation of PyTorch and how they are reflected in its architecture. We emphasize that every aspect of PyTorch is a regular Python program under the full control of its user. We also explain how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance. We demonstrate the efficiency of individual subsystems, as well as the overall speed of PyTorch on several common benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper claims that PyTorch demonstrates the compatibility of usability and speed in deep learning libraries through an imperative, Pythonic interface that treats code as the model, enables straightforward debugging, maintains consistency with scientific computing libraries such as NumPy, and achieves competitive efficiency via pragmatic runtime design while supporting GPU accelerators. It details the guiding implementation principles and their reflection in the architecture, stresses that every component remains a regular Python program under user control, and reports benchmark results on individual subsystems and overall performance on common tasks.

Significance. If the architectural claims and benchmark evidence hold, the paper is significant for documenting a framework whose design choices directly enabled widespread adoption of dynamic neural network models in research. The explicit focus on full Python control and pragmatic runtime integration provides a concrete reference for balancing flexibility with performance, influencing subsequent library designs and lowering barriers to experimentation with non-static computation graphs.

minor comments (1)

The abstract states that benchmarks demonstrate 'compelling performance' but does not name the specific tasks or hardware configurations; adding one sentence with example workloads (e.g., ResNet training throughput) would improve precision without altering the central narrative.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and for recommending acceptance. The summary accurately captures the core claims regarding PyTorch's design principles and performance characteristics.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper is a systems description of a library implementation. Its central claim (imperative Pythonic interface is compatible with competitive performance) rests on explicit architecture choices, runtime details, and benchmark comparisons to external baselines. No equations, fitted parameters renamed as predictions, self-citations used as uniqueness theorems, or ansatzes smuggled via prior work appear in the provided text. The argument does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper describes a software library and its implementation; it introduces no free parameters, mathematical axioms, or invented entities beyond standard programming and hardware assumptions already present in the field.

pith-pipeline@v0.9.0 · 5757 in / 1014 out tokens · 22031 ms · 2026-05-24T15:12:54.148960+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Efficient Training on Multiple Consumer GPUs with RoundPipe
cs.DC 2026-04 conditional novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...
Stability and Generalization in Looped Transformers
cs.LG 2026-04 unverdicted novelty 8.0

Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant ...
Traces of Helium Detected in Type Ic Supernova 2014L
astro-ph.HE 2026-03 accept novelty 8.0

Quantitative Bayesian inference using a deep-learning emulator detects 0.018-0.020 M_sun of helium in the Type Ic supernova 2014L.
LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller
cs.RO 2025-12 conditional novelty 8.0

First in-orbit demonstration of a DRL-trained AI satellite attitude controller that performs robust inertial pointing after sim-to-real transfer.
Automated discovery of heralded ballistic graph state generators for fusion-based photonic quantum computation
quant-ph 2025-08 unverdicted novelty 8.0

A two-pass optimization framework with polynomial-based simulation discovers heralded ballistic circuits for 3-5 qubit graph states achieving up to 7.5x higher success probabilities than fusion baselines, including fi...
Editing Models with Task Arithmetic
cs.LG 2022-12 accept novelty 8.0

Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
Forecasting megaelectron-volt electron flux in the Earth's outer radiation belt using supervised machine learning algorithms and a timeseries foundation model
astro-ph.IM 2026-05 unverdicted novelty 7.0

Hybrid TimesFM plus ridge regression on covariates forecasts 1-MeV electron flux with average R² of 0.9 on out-of-sample 2024 data, outperforming linear regression, CNN, LSTM and Transformer models.
Reconstructing the Stripping History of the Sagittarius Stream with Neural Networks
astro-ph.GA 2026-05 unverdicted novelty 7.0

A neural network trained on simulations infers stripping times for Sagittarius stream stars from phase-space data, measuring a 0.3 dex/Gyr metallicity gradient and estimating ages for globular clusters such as Pal 12 ...
Events as Triggers for Behavioral Diversity in Multi-Agent Reinforcement Learning
cs.MA 2026-05 unverdicted novelty 7.0

Events trigger on-the-fly LoRA module generation via hypernetworks over a shared team policy in MARL, paired with a Neural Manifold Diversity metric, enabling sequential role reassignment while preserving reward maximization.
Frequency-Space Mechanics: A Sequence and Coordinate-Free Representation for Protein Function Prediction
q-bio.BM 2026-05 unverdicted novelty 7.0

Vibrational mode graphs from molecular dynamics enable sequence-free protein function prediction via graph neural networks, with entrainment improving signals for collective dynamics.
End-to-End Population Inference from Gravitational-Wave Strain using Transformers
gr-qc 2026-05 unverdicted novelty 7.0

Dingo-Pop uses a transformer to perform amortized, end-to-end population inference from GW strain data in seconds, bypassing per-event Monte Carlo sampling.
Learning reveals invisible structure in low-rank RNNs
cs.LG 2026-05 unverdicted novelty 7.0

Learning in low-rank RNNs reduces to an exact low-dimensional ODE system in overlap space, where loss-invisible overlaps encode training history without affecting function.
Dynamical magnetotropic susceptibility as a new probe of Kitaev materials and beyond
cond-mat.str-el 2026-05 unverdicted novelty 7.0

Dynamical magnetotropic susceptibility k(ω) acts as a probe of uniform spin and charge fluctuations, with its static scaling in α-RuCl3 arising specifically from dominant Kitaev interactions in the models examined.
Sampling two-dimensional spin systems with transformers
cond-mat.dis-nn 2026-04 unverdicted novelty 7.0

Transformer networks sample up to 180x180 2D Ising systems and 64x64 Edwards-Anderson systems by generating spin groups with probability approximations, yielding ~20x higher effective sample size than prior neural sam...
Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning
astro-ph.GA 2026-04 unverdicted novelty 7.0

A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.
Big Dipper, Help Me Find A Way -- Dip-hunting at hadron colliders
hep-ph 2026-04 unverdicted novelty 7.0

Parametric neural networks learn likelihood ratios to infer top-philic scalar resonances from dip patterns caused by signal-background interference in hadron collider data.
Graph-Conditioned Meta-Optimizer for QAOA Parameter Generation on Multiple Problem Classes
quant-ph 2026-04 unverdicted novelty 7.0

A graph-conditioned meta-optimizer learns QAOA parameter trajectories from one problem class and transfers them to others, yielding better initializations than standard methods in an empirical study of 64 settings.
Rates of forgetting for the sequentially Markov coalescent
math.PR 2026-04 unverdicted novelty 7.0

SMC forgets its initial condition geometrically in the jump chain and as 1/ℓ in continuous genetic distance, justifying independent-locus approximations.
Concept Graph Convolutions: Message Passing in the Concept Space
cs.LG 2026-04 unverdicted novelty 7.0

Concept Graph Convolutions perform message passing on node concepts to increase interpretability of graph neural networks without losing task performance.
A Non-stationary, Amortized, Transfer Learning Approach for Modeling Italian Air Quality
stat.AP 2026-04 unverdicted novelty 7.0

A neural network learns non-stationary anisotropic correlations from gridded CTM outputs and transfers the structure via LatticeKrig basis functions to station data for refined fine-scale NO2 predictions with uncertainty.
Probing the 3D Structures of Supernovae through IR Signatures of CO and SiO
astro-ph.HE 2026-04 unverdicted novelty 7.0

MOFAT applied to SN2024ggi shows CO triggering inner SiO formation with a receding edge, order-of-magnitude mass drop, clumping signatures, and no dust formation.
Partitioning Unstructured Sparse Tensor Algebra for Load-Balanced Parallel Execution
cs.PL 2026-04 unverdicted novelty 7.0

A new partitioning algorithm that provably load-balances arbitrary sparse tensor algebra expressions by generalizing parallel merging to multi-operand, multi-dimensional hierarchical structures, implemented in a compi...
Tensor Memory Engine: On-the-fly Data Reorganization for Ideal Locality
cs.AR 2026-04 unverdicted novelty 7.0

The Tensor Memory Engine provides on-the-fly data reorganization to achieve ideal memory locality for CPU computations in edge systems.
Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings
q-bio.QM 2026-04 unverdicted novelty 7.0

Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...
How pore-scale disorder controls fluid stretching in porous media
physics.flu-dyn 2026-04 unverdicted novelty 7.0

Pore-scale disorder accelerates fluid stretching in porous media, producing quadratic time growth and faster mixing than the linear growth seen in ordered structures.
Vibe Coding XR: Accelerating AI + XR Prototyping with XR Blocks and Gemini
cs.HC 2026-03 unverdicted novelty 7.0

XR Blocks supplies an LLM-optimized Reality Model and Vibe Coding XR workflow that converts high-level prompts into working physics-aware XR applications with high one-shot success.
Polarized Target Nuclear Magnetic Resonance Measurements with Deep Neural Networks
physics.ins-det 2026-03 unverdicted novelty 7.0

Deep neural networks reduce fitting uncertainties in CW-NMR polarization measurements for dynamically polarized targets.
RLGT: A reinforcement learning framework for extremal graph theory
cs.LG 2026-02 unverdicted novelty 7.0

RLGT is a modular reinforcement learning framework for extremal graph theory that handles undirected, directed, looped, and multi-colored graphs to facilitate future research.
Reduced-Order Surrogates for Forced Flexible Mesh Coastal-Ocean Models
cs.CE 2026-02 unverdicted novelty 7.0

Koopman autoencoders with forcings and temporal unrolling deliver accurate year-long predictions for coastal-ocean models at 300-1400x speedup, outperforming POD in two of three cases.
Cobble: Compiling Block Encodings for Quantum Computational Linear Algebra
cs.PL 2025-11 unverdicted novelty 7.0

Cobble is a domain-specific language for quantum block encodings that compiles high-level matrix expressions to optimized circuits using analyses and quantum singular value transformation, achieving 2.6x-25.4x speedup...
Atomistic Machine Learning with Irreducible Cartesian Natural Tensors
cond-mat.mtrl-sci 2025-10 unverdicted novelty 7.0

CarNet develops irreducible Cartesian natural tensors and an equivariant model that matches leading spherical-tensor performance for ML interatomic potentials and high-rank tensor predictions like elastic constants.
pop-cosmos: Star formation over 12 Gyr from generative modelling of a deep infrared-selected galaxy catalogue
astro-ph.GA 2025-09 unverdicted novelty 7.0

A score-based diffusion generative model on deep infrared galaxy photometry yields a star formation rate density peaking at z=1.3 and shows distinct non-parametric star formation histories plus AGN activity peaking du...
Meson spectroscopy of exotic symmetries of Ising criticality in Rydberg atom arrays
quant-ph 2025-06 unverdicted novelty 7.0

Rydberg arrays realize Ising criticality with E8 mass spectra in chains and first signatures of D8^(1)-organized bound states from interchain confinement in ladders.
GraphGDel: Constructing and Learning Graph Representations of Genome-Scale Metabolic Models for Growth-Coupled Gene Deletion Prediction
q-bio.QM 2025-04 conditional novelty 7.0

GraphGDel builds graph representations from constraint-based metabolic models and trains a deep learning framework integrating graph structure with gene and metabolite sequences to predict growth-coupled gene deletion...
KernelBench: Can LLMs Write Efficient GPU Kernels?
cs.LG 2025-02 accept novelty 7.0

KernelBench shows that even the best current LLMs generate correct and faster-than-baseline GPU kernels in fewer than 20 percent of realistic ML workloads.
Clustering in pure-attention hardmax transformers and its role in sentiment analysis
cs.CL 2024-06 unverdicted novelty 7.0

Hardmax transformers converge to leader-determined clusters, enabling an interpretable model for sentiment analysis.
Diffusion Models Beat GANs on Image Synthesis
cs.LG 2021-05 accept novelty 7.0

Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
Equation of State at High Baryon Densities from a Thermodynamically Informed Neural Network
hep-ph 2026-05 unverdicted novelty 6.0

A thermodynamically consistent neural-network equation of state for QCD matter at finite temperature and conserved charges that matches known low-density results and extrapolates to high baryon densities for use in re...
When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning
cs.LG 2026-05 unverdicted novelty 6.0

Position-Weighted On-Policy Self-Distillation (PW-OPSD) weights later tokens more heavily after a diagnostic shows position predicts teacher reliability better than entropy, yielding +1.0 and +1.1 Avg@12 gains on AIME...
Universal Jaynes-Cummings Control of an Oscillator
quant-ph 2026-05 unverdicted novelty 6.0

Experimental demonstration of universal qudit control on a cavity oscillator via compiled Jaynes-Cummings gates with a transmon ancilla, reaching 96% mean post-selected process fidelity for qutrit gates.
CAM-VFD: Cross-Attention Multimodal Video Forgery Detection
cs.CV 2026-05 unverdicted novelty 6.0

CAM-VFD detects video forgeries by using cross-attention to identify contradictions between CLIP appearance, VideoMAE motion, and MiDaS depth features.
Mechanistically Interpretable Neural Encoding Reveals Fine-Grained Functional Selectivity in Human Visual Cortex
cs.CV 2026-05 unverdicted novelty 6.0

MINE uses mechanistic interpretability on language-aligned image representations to generate per-voxel feature descriptions, validated via image generation and counterfactual edits that causally shift brain activation.
Events as Triggers for Behavioral Diversity in Multi-Agent Reinforcement Learning
cs.MA 2026-05 unverdicted novelty 6.0

Proposes an event-triggered MARL framework with Neural Manifold Diversity and event-based hypernetworks to enable dynamic, agent-agnostic behavioral transitions while preserving reward maximization.
Block-Based Double Decoders
cs.LG 2026-05 unverdicted novelty 6.0

Block-based double decoders achieve full supervision in pretraining like decoder-only models and efficient inference like encoder-decoders through doubly-causal block-based attention masks, outperforming encoder-decod...
Curvature-Aware Captioning: Leveraging Geodesic Attention for 3D Scene Understanding
cs.CV 2026-05 unverdicted novelty 6.0

A new framework combines self-attention on the Oblique manifold with bidirectional geodesic cross-attention on the Lorentz hyperboloid to improve both localization accuracy and descriptive coherence in 3D dense captioning.
POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles
cs.LG 2026-05 unverdicted novelty 6.0

POETS uses compute-efficient LLM policy ensembles to implicitly perform KL-regularized Thompson sampling, delivering O(sqrt(T gamma_T)) regret bounds and state-of-the-art sample efficiency in scientific discovery task...
What If We Let Forecasting Forget? A Sparse Bottleneck for Cross-Variable Dependencies
cs.LG 2026-05 unverdicted novelty 6.0

MS-FLOW uses a capacity-limited sparse routing mechanism to model only critical inter-variable dependencies in time series data, achieving state-of-the-art accuracy on 12 benchmarks with fewer but more reliable connections.
Why Does Agentic Safety Fail to Generalize Across Tasks?
cs.LG 2026-05 conditional novelty 6.0

Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...
Decoding Alignment without Encoding Alignment: A critique of similarity analysis in neuroscience
q-bio.NC 2026-05 unverdicted novelty 6.0

Decoding alignment metrics can remain high and unchanged even when encoding manifold topology is causally altered, so they do not imply similar function or computation across neural populations.
Stage Light is Sequence$^2$: Multi-Light Control via Imitation Learning
cs.MM 2026-05 unverdicted novelty 6.0

SeqLight maps music to multi-light HSV control via SkipBART for global color prediction followed by hybrid imitation learning in a goal-conditioned MDP to decompose colors across lights.
Euclid preparation. CosmoPostProcess: A simulation calibrated framework for weak lensing selection bias in richness-selected galaxy clusters
astro-ph.CO 2026-05 unverdicted novelty 6.0

CosmoPostProcess delivers simulation-calibrated radial corrections for projection-induced selection bias (20-40% amplitude near 1 h^{-1} Mpc) and baryonic effects in Euclid richness-selected cluster weak lensing profiles.
ClarifySTL: An Interactive LLM Agent Framework for STL Transformation through Requirements Clarification
cs.SE 2026-05 unverdicted novelty 6.0

ClarifySTL uses LLM agents to interactively detect and resolve vagueness and ambiguity in natural language requirements via clarification queries before generating STL formulas, with evaluations on existing and new be...
Compressibility of micromagnetic solutions in tensor train format
cond-mat.mes-hall 2026-04 unverdicted novelty 6.0

Tensor-train compressed micromagnetic solutions for flux-closure states in soft-magnetic prisms scale as L^{1.8} and (1/a)^{1.2} by exploiting spatial sparsity in domain walls versus uniform domains.
Towards Generalizable Mapping of Hedges and Linear Woody Features from Earth Observation Data: a national Product for Germany
cs.CV 2026-04 unverdicted novelty 6.0

A modular deep learning workflow maps linear woody features at national scale in Germany from three different resolution EO sources using a single trained model.
Rendering-Aware Sparse Sampling for BRDF Acquisition
cs.CV 2026-04 unverdicted novelty 6.0

A sampler network learns to select informative sparse BRDF measurement directions by optimizing against a fixed pretrained hypernetwork reconstructor and differentiable renderer, improving low-budget reconstruction on...
A Physics Informed Bayesian Neural Network for the Neutron Star Equation of State
astro-ph.HE 2026-04 unverdicted novelty 6.0

A physics-informed Bayesian neural network learns neutron-star equations of state from theoretical priors and constraints, then generates posterior mass-radius and mass-tidal-deformability distributions consistent wit...
Data-Driven Acceleration of Eccentricity Reduction for Binary Black Hole Simulations
gr-qc 2026-04 unverdicted novelty 6.0

A Gaussian Process Regression model trained on an archive of eccentricity-reduced binary black hole simulations predicts initial conditions that achieve low eccentricity with zero or one iteration.
JAX-BEM: Gradient-Based Acoustic Shape Optimisation via a Differentiable Boundary Element Method
cs.CE 2026-04 unverdicted novelty 6.0

A JAX-based differentiable BEM solver matches traditional BEM accuracy on benchmarks and supports gradient-driven acoustic geometry optimization.
PRiMeFlow: Capturing Complex Expression Heterogeneity in Perturbation Response Modelling
cs.LG 2026-04 unverdicted novelty 6.0

PRiMeFlow is a flow-matching model that approximates the full empirical distribution of single-cell gene expression after perturbations.
TCL: Enabling Fast and Efficient Cross-Hardware Tensor Program Optimization via Continual Learning
cs.LG 2026-04 conditional novelty 6.0

TCL delivers 16.8x faster tuning on CPU and 12.48x on GPU with modestly lower inference latency by combining RDU active sampling, a lightweight Mamba cost model, and cross-platform continual knowledge distillation.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 140 Pith papers · 4 internal anchors

[1]

Caffe: Convolutional Architecture for Fast Feature Embedding

Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[2]

Cntk: Microsoft’s open-source deep-learning toolkit

Frank Seide and Amit Agarwal. Cntk: Microsoft’s open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 2135–2135, New York, NY, USA, 2016. ACM.

work page 2016
[3]

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murra...

work page 2015
[4]

Theano: A Python framework for fast computation of mathematical expressions

Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[5]

Chainer: a next-generation open source framework for deep learning

Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. Chainer: a next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), 2015

work page 2015
[6]

Torch: a modular machine learning software library

Ronan Collobert, Samy Bengio, and Johnny Mariéthoz. Torch: a modular machine learning software library. Technical report, Idiap, 2002

work page 2002
[7]

DyNet: The Dynamic Neural Network Toolkit

G. Neubig, C. Dyer, Y. Goldberg, A. Matthews, W. Ammar, A. Anastasopoulos, M. Ballesteros, D. Chiang, D. Clothiaux, T. Cohn, K. Duh, M. Faruqui, C. Gan, D. Garrette, Y. Ji, L. Kong, A. Kuncoro, G. Kumar, C. Malaviya, P. Michel, Y. Oda, M. Richardson, N. Saphra, S. Swayamdipta, and P. Yin. DyNet: The Dynamic Neural Network Toolkit. ArXiv e-prints, Jan...

work page 2017
[8]

Philip S. Abrams. An APL Machine. PhD thesis, Stanford University, 1970

work page 1970
[9]

MATLAB and Statistics Toolbox

The MathWorks, Inc., Natick, Massachusetts, United States. MATLAB and Statistics Toolbox

work page
[10]

R: A Language and Environment for Statistical Computing

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria

work page
[11]

Julia: A fresh approach to numerical computing

Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B Shah. Julia: A fresh approach to numerical computing. SIAM review, 59(1):65–98, 2017

work page 2017
[12]

NumPy: A guide to NumPy

Travis Oliphant. NumPy: A guide to NumPy. USA: Trelgol Publishing, 2006. http://www.numpy.org/

work page 2006
[13]

Eigen v3

Gaël Guennebaud, Benoît Jacob, et al. Eigen v3. http://eigen.tuxfamily.org, 2010

work page 2010
[14]

Lush reference manual

Y LeCun and L Bottou. Lush reference manual. Technical report, code available at http://lush.sourceforge.net, 2002

[15]

Automatic differentiation in machine learning: A survey

Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: A survey. J. Mach. Learn. Res., 18(1):5595–5637, January 2017

[16]

Modeling, Inference and Optimization with Composable Differentiable Procedures

Dougal Maclaurin. Modeling, Inference and Optimization with Composable Differentiable Procedures. PhD thesis, Harvard University, April 2016

[17]

Matthew Johnson et al. Jax. https://github.com/google/jax, 2018

[18]

Mike Innes et al. Flux.jl. https://github.com/FluxML/Flux.jl, 2018

[19]

SciPy: Open source scientific tools for Python, 2001–

Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001–. http://www.scipy.org/

[20]

Data structures for statistical computing in python

Wes McKinney. Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference, 51-56, 2010

[21]

Eblearn: Open-source energy-based learning in c++

Pierre Sermanet, Koray Kavukcuoglu, and Yann LeCun. Eblearn: Open-source energy-based learning in c++. In 2009 21st IEEE International Conference on Tools with Artificial Intelligence, pages 693–697. IEEE, 2009

[22]

cuDNN: Efficient Primitives for Deep Learning

Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan D. Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014

[23]

maxdnn: An efficient convolution kernel for deep learning with maxwell gpus

Andrew Lavin. maxdnn: An efficient convolution kernel for deep learning with maxwell gpus, January 2015

[24]

Fast algorithms for convolutional neural networks

Andrew Lavin and Scott Gray. Fast algorithms for convolutional neural networks. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4013–4021, 2016

[25]

Torch7: A matlab-like environment for machine learning

Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A matlab-like environment for machine learning. In NIPS 2011, 2011

[26]

The rise of worse is better

Richard Gabriel. The rise of worse is better. http://dreamsongs.com/RiseOfWorseIsBetter.html

[27]

MNIST handwritten digit database

Yann LeCun and Corinna Cortes. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/

[28]

StarCraft II: A New Challenge for Reinforcement Learning

Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, John Quan, Stephen Gaffney, Stig Petersen, Karen Simonyan, Tom Schaul, Hado van Hasselt, David Silver, Timothy P. Lillicrap, Kevin Calderone, Paul Keet, Anthony Brunasso, David Lawre...

[29]

Dlpack: Open in memory tensor structure

DMLC. Dlpack: Open in memory tensor structure. https://github.com/dmlc/dlpack

[30]

Automatic differentiation in pytorch

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS Workshop, 2017

[31]

Automatic differentiation, C++ templates, and photogrammetry

Dan Piponi. Automatic differentiation, C++ templates, and photogrammetry. J. Graphics, GPU, & Game Tools, 9(4):41–55, 2004

[32]

Automatic differentiation facilitates OF-integration into steering-angle-based road vehicle tracking

Holger Leuck and Hans-Hellmut Nagel. Automatic differentiation facilitates OF-integration into steering-angle-based road vehicle tracking. In 1999 Conference on Computer Vision and Pattern Recognition (CVPR ’99), 23-25 June 1999, Ft. Collins, CO, USA, pages 2360–2365, 1999

[33]

The cpython global interpreter lock

The Python team. The cpython global interpreter lock. https://wiki.python.org/moin/GlobalInterpreterLock

[34]

Nimtorch

Giovanni Petrantoni and Jörg Wollenschläger. Nimtorch. https://github.com/fragcolor-xyz/nimtorch

[35]

Hasktorch

Austin Huang, Junji Hashimoto, and Sam Stites. Hasktorch. https://github.com/hasktorch/hasktorch

[36]

Forward modeling for partial observation strategy games - a starcraft defogger

G. Synnaeve, Z. Lin, J. Gehring, D. Gant, V. Mella, V. Khalidov, N. Carion, and N. Usunier. Forward modeling for partial observation strategy games - a starcraft defogger. In Advances in Neural Information Processing Systems, pages 10761–10771, 2018

[37]

Torch Script

The PyTorch team. Torch Script. https://pytorch.org/docs/stable/jit.html

[38]

Cuda streams

Justin Luitjens. Cuda streams. GPU technology conference, 2014

[39]

Hoard: A scalable memory allocator for multithreaded applications

Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson. Hoard: A scalable memory allocator for multithreaded applications. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS IX, pages 117–128, New York, NY, USA, 2000. ACM

[40]

J. Evans. A scalable concurrent malloc(3) implementation for freebsd. In BSDCan — The Technical BSD Conference, May 2006

[41]

Tcmalloc: Thread-caching malloc

S. Ghemawat and P. Menage. Tcmalloc: Thread-caching malloc.

[42]

Hogwild: A lock-free approach to parallelizing stochastic gradient descent

Benjamin Recht, Christopher Ré, Stephen J. Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems. Proceedings of a meeting held 12-14 December 2011, Granada, Spain, pages 693–701, 2011

[43]

Matthew Hertz and Emery D. Berger. Quantifying the performance of garbage collection vs. explicit memory management. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications, OOPSLA ’05, pages 313–326, New York, NY, USA, 2005. ACM

[44]

Pytorch Autograd Profiler

The PyTorch team. Pytorch Autograd Profiler. https://pytorch.org/docs/1.0.1/autograd.html#profiler
