Optimal transport (OT) has emerged as a fundamental tool in modern machine learning, yet its computational cost remains a significant bottleneck for large-scale applications. Harnessing the massive parallelism of modern GPU hardware is critical for efficiency, but the de facto standard Sinkhorn algorithm, despite being easy to parallelize, often converges slowly on challenging problems. The more recent sparse-plus-low-rank quasi-Newton method offers a better balance between convergence rate and per-iteration complexity; however, its efficiency on GPUs is severely hindered by the serial nature of sparse-matrix symbolic analysis and by irregular memory access patterns. To bridge this gap, we present cuRegOT, a high-performance GPU solver tailored to entropic-regularized OT. We introduce a suite of algorithmic and architectural optimizations, including an amortized symbolic-analysis strategy that mitigates CPU bottlenecks, an asynchronous mechanism for generating Sinkhorn iterates, and a fused kernel for bandwidth-efficient gradient evaluation. These strategies are backed by rigorous theoretical guarantees of algorithmic convergence. Extensive numerical experiments demonstrate that cuRegOT achieves significant speedups over state-of-the-art GPU-based solvers across a variety of benchmark tasks.
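For reference, the Sinkhorn baseline that the abstract contrasts against consists of alternating diagonal rescalings whose dense matrix-vector products map naturally onto GPUs. Here is a minimal NumPy sketch of the standard iteration (our own illustration of the textbook algorithm, not cuRegOT code):

```python
import numpy as np

def sinkhorn(a, b, C, eps, n_iters=1000, tol=1e-9):
    """Standard Sinkhorn iteration for entropic-regularized OT.

    a, b : marginal histograms (1-D, nonnegative, summing to 1)
    C    : cost matrix of shape (len(a), len(b))
    eps  : entropic regularization strength
    Returns the transport plan P = diag(u) @ K @ diag(v).
    """
    K = np.exp(-C / eps)             # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u_new = a / (K @ v)          # row rescaling
        v = b / (K.T @ u_new)        # column rescaling
        if np.max(np.abs(u_new - u)) < tol:
            u = u_new
            break
        u = u_new
    return u[:, None] * K * v[None, :]
```

Each update touches every entry of the Gibbs kernel, which is precisely why per-iteration cost rather than parallelism becomes the limiting factor when convergence is slow.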
We present CombOL (Combinatorial Objects Library), an open-source library for the enumeration and Boltzmann sampling of combinatorial classes. Classes can be specified by a concise string syntax and may depend on an arbitrary number of parameters. CombOL automatically derives the associated generating functions, enabling the generation of counting sequences and the compilation of Boltzmann samplers. The library supports exact- and approximate-size Boltzmann rejection sampling with automatic parameter tuning to target specific sizes. In addition to implementing established methods, CombOL contributes a novel early-rejection scheme, as well as guaranteed statistical correctness, achieved by dynamically increasing the numerical precision to eliminate bias due to floating-point rounding errors. Through the Python interface, sampled structures can be mapped to application-specific objects, enabling direct sampling of domain objects such as graphs, chemical structure representations, or other complex data types. CombOL is available from PyPI as 'combol' (pypi.org/project/combol). The source code is available at gitlab.com/casbjorn/combol.
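To make the sampling model concrete, here is a generic Boltzmann sampler with early rejection and exact-size rejection for plane binary trees specified by B(x) = 1 + x B(x)^2. This is an independent illustration of the technique, not CombOL's API; all names below are ours:

```python
import math
import random

class _TooBig(Exception):
    """Raised to abort a draw whose size already exceeds the target."""

def boltzmann_tree(x, max_size):
    # Boltzmann sampler for B(x) = 1 + x*B(x)^2 (plane binary trees,
    # size = number of internal nodes). Valid for 0 < x <= 1/4.
    B = (1.0 - math.sqrt(1.0 - 4.0 * x)) / (2.0 * x)  # value of the GF at x
    size = 0
    def draw():
        nonlocal size
        if random.random() < 1.0 / B:     # P(leaf) = 1/B(x)
            return "leaf"
        size += 1                          # internal node contributes 1
        if size > max_size:
            raise _TooBig                  # early rejection: stop immediately
        return (draw(), draw())            # P(node) = x*B(x), sums to 1
    return draw(), size

def sample_exact_size(x, n, max_tries=100_000):
    # Exact-size rejection sampling: retry until a draw has size exactly n.
    for _ in range(max_tries):
        try:
            tree, size = boltzmann_tree(x, n)
        except _TooBig:
            continue
        if size == n:
            return tree
    raise RuntimeError(f"no tree of size {n} after {max_tries} tries")
```

Tuning x toward the singularity at 1/4 shifts the size distribution upward; CombOL automates this tuning so the expected size matches the target and the rejection loop stays cheap.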
We discuss implementation details of OSCAR's serialization framework, highlighting the design decisions that allow fine-tuning of serialization methods for specific use cases.
In particular, we show how the mrdi file format can be used for distributed computing.
We describe the Bandicoot GPU linear algebra toolkit, a C++-based library that prioritises ease of use without compromising efficiency. Bandicoot's API is compatible with the popular Armadillo CPU linear algebra library, enabling easy transition of existing CPU-based codebases. Unlike other GPU-focused toolkits, Bandicoot uses template metaprogramming to generate fused GPU kernels directly at compile time, yielding efficient kernels that are often able to saturate memory bandwidth; this avoids runtime kernel-generation overhead and removes the need for JIT infrastructure. Empirical results show that Bandicoot outperforms commonly used linear algebra toolkits, including PyTorch, TensorFlow, and JAX, sometimes by considerable margins.
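The benefit of compile-time fusion is fewer passes over memory. The Python below is purely a conceptual model of what a fused kernel computes (Bandicoot derives the fusion from C++ expression templates at compile time; none of this is Bandicoot code):

```python
import numpy as np

n = 1_000_000
a, b, c = (np.random.rand(n) for _ in range(3))

# Eager, unfused evaluation: every operator is a separate pass over
# memory and materializes a temporary array.
t1 = 2.0 * a          # pass 1, temporary t1
t2 = t1 + b           # pass 2, temporary t2
d_unfused = t2 * c    # pass 3

# What a fused kernel computes: one pass, no temporaries. On a GPU,
# each iteration of this loop would be one thread's work.
d_fused = np.empty(n)
for i in range(n):
    d_fused[i] = (2.0 * a[i] + b[i]) * c[i]

assert np.allclose(d_unfused, d_fused)
```

Because element-wise kernels are memory-bound, collapsing three passes into one is what allows fused kernels to approach memory-bandwidth saturation.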
The rapid rise of scientific machine learning (SciML) has expanded the role of differentiable modeling, surrogate modeling, and data-driven constitutive laws in large-scale simulation. The JAX framework provides an attractive environment for these workflows through automatic differentiation, vectorization, and GPU acceleration, while enabling seamless training of surrogate models. However, large-scale simulation still relies on mature HPC infrastructure: libraries such as PETSc provide scalable MPI-based parallelism, robust linear and nonlinear solvers, and advanced preconditioning capabilities that remain difficult to reproduce in JAX-only workflows. We present JetSCI, a hybrid JAX-PETSc framework that unifies these complementary strengths. JetSCI uses JAX for GPU-parallel differentiable discretizations and PETSc for robust, scalable solution of the resulting systems on distributed-memory architectures, exposing multilevel parallelism through GPU acceleration within nodes and MPI parallelism across nodes. For finite element discretizations of heterogeneous micromechanics problems, JetSCI outperforms JAX-only implementations in efficiency and accuracy.
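A minimal sketch of the division of labor, using a 1-D Poisson model problem: the residual is written in JAX (hence differentiable and GPU-ready), and the resulting linear system is the object that would be handed off to PETSc. All names are ours, and jnp.linalg.solve stands in for PETSc's distributed Krylov solvers:

```python
import jax
import jax.numpy as jnp

def residual(u, f, h):
    # -u'' - f on a uniform interior grid with homogeneous Dirichlet BCs;
    # written in pure JAX, so it is differentiable and jit/GPU-ready.
    up = jnp.pad(u, 1)                                 # boundary zeros
    lap = (up[2:] - 2.0 * up[1:-1] + up[:-2]) / h**2
    return -lap - f

n = 63
h = 1.0 / (n + 1)
x = jnp.linspace(h, 1.0 - h, n)
f = jnp.sin(jnp.pi * x)

# The problem is linear, so the Jacobian of the residual is the system
# matrix itself; at scale, this solve is what PETSc (e.g., via petsc4py)
# would perform across MPI ranks instead of jnp.linalg.solve.
u0 = jnp.zeros(n)
A = jax.jacfwd(residual)(u0, f, h)
u = jnp.linalg.solve(A, -residual(u0, f, h))   # u is approx. sin(pi x)/pi^2
```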
We leverage highly successful prior projects sponsored by multiple NSF grants and gifts from industry, the BLAS-like Library Instantiation Software (BLIS) and the libflame effort, to lay the foundation for a new flexible framework that vertically integrates the dense linear algebra and multi-linear (tensor) software stacks important to modern computing. This vertical integration will enable high-performance computation from the node level to massively parallel systems, across both CPU and GPU architectures. The effort builds on decades of experience by the research team in turning fundamental research on the systematic derivation of algorithms (the NSF-sponsored FLAME project) into practical software for this domain, targeting single- and multi-core (BLIS, TBLIS, and libflame), GPU-accelerated (SuperMatrix), and massively parallel (PLAPACK, Elemental, and ROTE) compute environments. This project will implement key linear algebra and tensor operations that highlight the flexibility and effectiveness of the new framework, and will set the stage for further work in broadening functionality and integration into diverse scientific and machine learning software.
We introduce a code-based challenge for automated, open-ended mathematical discovery based on the $k$-server conjecture, a central open problem in competitive analysis. The task is to discover a potential function satisfying a large graph-structured system of simple linear inequalities. The resulting evaluation procedure is sound but incomplete: any violated inequality definitively refutes a candidate, whereas satisfying all inequalities does not by itself constitute a proof of the corresponding special case of the conjecture. Nevertheless, a candidate that passes all constraints would be strong evidence toward a valid proof and, to the best of our knowledge, no currently known potential achieves this under our formulation in the open $k=4$ circle case. As such, a successful candidate would already be an interesting contribution to the $k$-server conjecture, and could become a substantial theoretical result when paired with a full proof.
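Schematically, the evaluation procedure amounts to checking a candidate against every linear inequality: a single violated row refutes it, while passing all rows is only evidence. A hypothetical checker in this spirit (the data layout and names are ours, not the challenge's actual harness):

```python
import numpy as np

def check_potential(phi, rows, tol=0.0):
    """Check a candidate potential against a system of linear inequalities.

    phi  : candidate potential, a vector indexed by configuration
    rows : iterable of (idx, coeffs, b), each encoding the sparse
           inequality  sum_j coeffs[j] * phi[idx[j]] <= b
    Returns a list of (row number, violation amount) for violated rows.
    """
    violations = []
    for k, (idx, coeffs, b) in enumerate(rows):
        lhs = float(np.dot(coeffs, phi[idx]))
        if lhs > b + tol:                  # one bad row refutes the candidate
            violations.append((k, lhs - b))
    return violations
```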
Experiments on the resolved $k=3$ regime show that current agentic methods can solve nontrivial instances, and in the open $k=4$ regime they reduce the number of violations relative to existing potentials without fully resolving the task. Taken together, these results suggest that the task is challenging but plausibly within reach of current methods.
Beyond its relevance to the $k$-server community, where the developed tooling enables researchers to test new hypotheses and potentially improve on the current record, the task also serves as a useful \emph{benchmark} for developing code-based discovery agents. In particular, our $k=3$ results show that it mitigates important limitations of existing open-ended code-based benchmarks, including early saturation and the weak separation between naive random baselines and more sophisticated methods.
Polylab is a MATLAB toolbox for multivariate polynomial scalars and polynomial matrices with a unified symbolic-numeric interface across CPU- and GPU-oriented backends. The software exposes three aligned classes: MPOLY for CPU execution, MPOLY_GPU as a legacy GPU baseline, and MPOLY_HP as an improved GPU-oriented implementation. Across these backends, Polylab supports polynomial construction, algebraic manipulation, simplification, matrix operations, differentiation, Jacobian and Hessian construction, LaTeX export, CPU-side LaTeX reconstruction, backend conversion, and interoperability with YALMIP and SOSTOOLS. Versions 3.0 and 3.1 add two practically important extensions: explicit variable identity and naming for safe mixed-variable expression handling, and affine-normal direction computation via automatic differentiation, MF-logDet-Exact, and MF-logDet-Stochastic. The toolbox has already been used successfully in prior research applications, and Polylab Version 3.1 adds a new geometry-oriented computational layer on top of a mature polynomial modeling core. This article documents the architecture and user-facing interface of the software, organizes its functionality by workflow, presents representative MATLAB sessions with actual outputs, and reports reproducible benchmarks. The results show that MPOLY is the right default for lightweight interactive workloads, whereas MPOLY_HP becomes advantageous for reduction-heavy simplification and medium-to-large affine-normal computation; the stochastic log-determinant variant becomes attractive in larger sparse regimes under approximation-oriented parameter choices.
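For intuition about why a stochastic log-determinant variant pays off in large sparse regimes, recall that d/dt log det A(t) = tr(A^{-1} dA/dt), and Hutchinson's estimator replaces the trace with an average over random probes, needing only linear solves with A. The sketch below illustrates that general technique in Python; it is our own illustration and not Polylab's MF-logDet-Stochastic implementation:

```python
import numpy as np

def logdet_directional(A, dA, num_probes=32, seed=None):
    """Estimate d/dt log det A(t) = tr(A^{-1} dA) stochastically.

    Hutchinson's estimator averages z^T A^{-1} (dA z) over random
    Rademacher vectors z, so only linear solves with A are needed
    (cheap when A is sparse and well-preconditioned).
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    est = 0.0
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe vector
        w = np.linalg.solve(A, dA @ z)        # A^{-1} (dA z)
        est += z @ w
    return est / num_probes
```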
Floating-point arithmetic is error-prone and unintuitive. Floating-point debuggers instrument programs to monitor floating-point arithmetic at run time and flag numerical issues. They estimate residues, i.e., the difference between the actual floating-point value and the ideal real value, for every floating-point value in the program. Prior work explores various approaches for computing these residues accurately and efficiently. Unfortunately, the most efficient methods, based on "error-free transformations", have a high rate of false reports, while the most accurate methods, based on high-precision arithmetic, are very slow. This paper builds on error-free-transformation-based approaches and aims to improve their accuracy while preserving their efficiency. To compute residues more accurately, this paper divides residue computation into two steps (rounding-error computation and residue-function evaluation) and shows how to perform each step accurately via careful improvements to the current state of the art. We evaluate on 44 large scientific computing workloads, focusing on the 14 benchmarks where prior tools produce false reports: our approach eliminates false reports on 10 benchmarks and substantially reduces them on 3 of the remaining benchmarks. Moreover, complex numerical issues require additional care due to absorption, where two machine-precision residues cannot both be computed accurately in a single execution. This paper introduces residue override, which re-executes the program multiple times, computing different residues in different executions and assembling a final "patchwork" execution. We evaluate on 169 standard benchmarks drawn from numerical analysis papers and textbooks, requiring only 3.6 re-executions on average. Among 34 benchmarks with false reports in the initial run, residue override is triggered on 29 of them and reduces false reports on 25 of them, averaging 7.1 re-executions.
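The error-free transformations this line of work builds on are exact in a precise sense: Knuth's TwoSum, sketched below, returns both the rounded sum and its exact rounding error. The final lines also demonstrate the absorption phenomenon that motivates residue override:

```python
def two_sum(a: float, b: float):
    """Knuth's TwoSum: an error-free transformation for addition.

    Returns (s, e) with s = fl(a + b) and a + b = s + e exactly,
    so e is the exact rounding error of the floating-point addition.
    """
    s = a + b
    bb = s - a                     # portion of b actually absorbed into s
    e = (a - (s - bb)) + (b - bb)  # exact error, recovered in two parts
    return s, e

# Absorption in action: the 1.0 vanishes from the rounded sum because it
# is below the ulp of 1e16, but the error term recovers it exactly.
s, e = two_sum(1e16, 1.0)
assert s == 1e16 and e == 1.0
```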