AlphaEvolve: A coding agent for scientific and algorithmic discovery
Canonical reference. 74% of citing Pith papers cite this work as background.
abstract
In this white paper, we present AlphaEvolve, an evolutionary coding agent that substantially enhances capabilities of state-of-the-art LLMs on highly challenging tasks such as tackling open scientific problems or optimizing critical pieces of computational infrastructure. AlphaEvolve orchestrates an autonomous pipeline of LLMs, whose task is to improve an algorithm by making direct changes to the code. Using an evolutionary approach, continuously receiving feedback from one or more evaluators, AlphaEvolve iteratively improves the algorithm, potentially leading to new scientific and practical discoveries. We demonstrate the broad applicability of this approach by applying it to a number of important computational problems. When applied to optimizing critical components of large-scale computational stacks at Google, AlphaEvolve developed a more efficient scheduling algorithm for data centers, found a functionally equivalent simplification in the circuit design of hardware accelerators, and accelerated the training of the LLM underpinning AlphaEvolve itself. Furthermore, AlphaEvolve discovered novel, provably correct algorithms that surpass state-of-the-art solutions on a spectrum of problems in mathematics and computer science, significantly expanding the scope of prior automated discovery methods (Romera-Paredes et al., 2023). Notably, AlphaEvolve developed a search algorithm that found a procedure to multiply two $4 \times 4$ complex-valued matrices using $48$ scalar multiplications, offering the first improvement, after 56 years, over Strassen's algorithm in this setting. We believe AlphaEvolve and coding agents like it can have a significant impact in improving solutions of problems across many areas of science and computation.
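The propose-evaluate-select loop described in the abstract can be illustrated with a toy sketch. This is not AlphaEvolve's implementation: the candidate here is a list of numbers rather than program code, `propose_change` is a random perturbation standing in for an LLM-proposed edit, and `evaluate`, `evolve`, and all parameter values are hypothetical names and choices made for illustration only.

```python
import random
from typing import Callable, List, Tuple

Candidate = List[float]


def evaluate(candidate: Candidate) -> float:
    """Hypothetical evaluator: higher is better, best possible score is 0."""
    return -sum((x - 1.0) ** 2 for x in candidate)


def propose_change(parent: Candidate, rng: random.Random) -> Candidate:
    """Stand-in for an LLM-proposed edit: perturb one coordinate at random."""
    child = list(parent)
    i = rng.randrange(len(child))
    child[i] += rng.gauss(0.0, 0.3)
    return child


def evolve(
    seed: Candidate,
    evaluate_fn: Callable[[Candidate], float],
    propose_fn: Callable[[Candidate, random.Random], Candidate],
    generations: int = 200,
    population_size: int = 8,
    rng_seed: int = 0,
) -> Tuple[float, Candidate]:
    """Run a simple elitist evolutionary loop and return (score, candidate)."""
    rng = random.Random(rng_seed)
    population: List[Tuple[float, Candidate]] = [(evaluate_fn(seed), seed)]
    for _ in range(generations):
        # Tournament selection: sample a few members, keep the best as parent.
        contenders = rng.sample(population, k=min(3, len(population)))
        _, parent = max(contenders, key=lambda entry: entry[0])
        # Propose a modified candidate and score it with the evaluator.
        child = propose_fn(parent, rng)
        population.append((evaluate_fn(child), child))
        # Keep only the highest-scoring candidates for the next round.
        population.sort(key=lambda entry: entry[0], reverse=True)
        del population[population_size:]
    return population[0]


if __name__ == "__main__":
    best_score, best = evolve([0.0, 0.0, 0.0], evaluate, propose_change)
    print(best_score, best)
```

Because the highest-scoring candidates are always retained, the returned score in this sketch is never worse than the seed's score. A real system would replace the numeric candidate with source code and the perturbation with model-generated edits.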
citing papers explorer
-
Resolving the Schwartz Quadratic Meander Number Conjecture
The maximum meander number for cyclic permutations on n letters is bounded above and below by quadratic functions of n.
-
AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?
AutoLab benchmark shows frontier models mostly fail at sustained iterative optimization due to premature termination, with persistence as the key success factor.
-
The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?
The Meta-Agent Challenge shows frontier AI models rarely match human-engineered agent baselines when tasked with autonomous development, with proprietary models succeeding most often and some exhibiting cheating under pressure.
-
LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning
LLM-guided evolutionary search yields the first domain-independent C++ planning heuristics that exceed the strongest hand-engineered baselines on coverage and speed trade-offs across unseen domains.
-
FastKernels: Benchmarking GPU Kernel Generation in Production
FastKernels is a production-aligned benchmark covering 96.2% of HuggingFace Transformers that reveals state-of-the-art kernel agents deliver at most 0.94x aggregate speedup.
-
Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
-
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning tasks at low cost.
-
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.
-
MappingEvolve: LLM-Driven Code Evolution for Technology Mapping
MappingEvolve applies LLMs through Planner-Evolver-Evaluator agents to evolve technology mapping code, delivering 10.04% area reduction versus ABC and 7.93% versus mockturtle on EPFL benchmarks.
-
Prism: Symbolic Superoptimization of Tensor Programs
Prism is the first symbolic superoptimizer for tensor programs that uses sGraph for compact representation of program families, two-level search, e-graph equivalence checking, and auto-tuning to achieve up to 2.2x speedup over prior superoptimizers on LLM workloads.
-
InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis
InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
-
CHIA: An open-source framework for principled, agentic AI-driven hardware/software co-design research
CHIA introduces a framework for building and deploying agentic AI co-design flows as CHIA loops with tool nodes, reliability mechanisms, and five case-study demonstrations.
-
Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization
KernelPro combines LLM code generation, roofline-guided tool orchestration, and domain-adapted MCTS to produce GPU kernels that outperform prior automated and some hand-tuned baselines on KernelBench and VeOmni workloads.
-
SPIRAL: Learning to Search and Aggregate
SPIRAL is a reinforcement learning framework that jointly optimizes sequential reasoning, parallel trace generation, and aggregation in language models for improved test-time performance.
-
A catalog of fast matrix multiplication algorithms with frontier-closure search
A machine-checkable catalog of low-rank matrix multiplication algorithms up to 32x32x32 is built over multiple fields via frontier-closure search that recombines entries while preserving a non-overlap property with prior bilinear cores.
-
Mathematical perspective on genetic algorithms with optimization guided operators
Presents a query-complexity framework for genetic algorithms with guided operators and shows necessity of multiple operators and tight bounds for diversity in solution pools.
-
AgentCanary: A Security Evaluation Framework for Autonomous AI Agents in Real Executable Environments
AgentCanary introduces an Entry × Impact risk taxonomy, high-fidelity real tool environments with persistent state, and multi-dimensional trajectory evaluation to assess AI agent security across models and attacks.
-
Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries
EinsteinArena is a platform for AI agents to collectively discover new mathematical results through open interaction, achieving 12 new state-of-the-art outcomes including raising the 11-dimensional kissing number lower bound from 593 to 604.
-
Self-Harness: Harnesses That Improve Themselves
Self-Harness lets LLM agents autonomously refine their interaction harnesses through weakness mining, proposal generation, and validation, raising held-out pass rates on Terminal-Bench-2.0 from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1% across three models.
-
FunctionEvolve: Structure-Guided Symbolic Regression with LLMs
FunctionEvolve recovers 107 exact symbolic forms out of 129 synthetic tasks (82.9% SA@50) by using expression-tree structure for evolutionary search, parent selection, mutation, and coefficient scoring with LLMs.
-
MotionDisco: Motion Discovery for Extreme Humanoid Loco-Manipulation
MotionDisco discovers long-horizon humanoid loco-manipulation motions from scratch via LLM-guided evolutionary search, trajectory optimization, and pruning, then transfers them to real robots with RL policies.
-
An automated proof that R(B_8,B_10)=37
Proves R(B_8, B_10) = 37 via an AI-assisted short proof with a Lean formalization of the upper bound.
-
LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization
LeanMarathon uses four contract-scoped agents on an evolving blueprint coordinated by a two-stage orchestrator to formalize seven theorems from Erdős problems in Lean, proving 258 lemmas with no sorry across three runs.
-
Classification of independent sets in signed Johnson graphs and applications to kissing arrangements
Enumeration yields 1579 non-isomorphic maximum independent sets in J±(12,4) giving non-isometric kissing arrangements of size 840, with a proof that for n≡2 or 4 mod 6 all such sets arise from Steiner quadruple systems.
-
MobEvolve: An Agentic Self-Evolving Heuristic System for Interpretable Human Mobility Generation
MobEvolve is an agentic self-evolving heuristic framework that generates interpretable human mobility trajectories and outperforms deep generative and LLM-based methods on Singapore and Montreal benchmarks.
-
When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?
PromptPO shows LLMs can act as black-box policy optimizers for sequential RL when leveraging prior knowledge, matching baselines in exploration and robotics but underperforming in MuJoCo.
-
Discovering a Zeta Map Algorithm on Dyck Paths via Mechanistic Interpretability
A deliberately small transformer trained on the zeta map on Dyck paths yields, via attention and probing analysis, an explicit combinatorial algorithm proven equivalent to the zeta map.
-
DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking
DiscoverPhysics is a new benchmark with 22 on-demand N-body simulated worlds where LLM agents design experiments to infer non-standard physics, evaluated via held-out trajectory MSE and LLM-judged explanation quality.
-
CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly
CyberEvolver introduces a four-layer self-evolving agent architecture with trace-to-diagnosis and population beam search that raises seed agent success rates by 13.6% on CTF, exploitation, and penetration tasks across four LLMs.
-
FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization
FrontierOR benchmark shows frontier LLMs outperform Gurobi on solution quality and efficiency in only 31% of one-shot cases and 50% with test-time evolution on hard large-scale optimization tasks.
-
Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems
IDS is an agentic LLM system that incrementally synthesizes both implementation and proof for distributed key-value stores, succeeding on all 7 specs where prior agents succeeded on only 2.
-
Advancing Mathematics Research with AI-Driven Formal Proof Search
An LLM-based agent with Lean verification autonomously solved multiple open Erdős problems and OEIS conjectures in the first large-scale test.
-
Forecasting Scientific Progress with Artificial Intelligence
Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.
-
Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents
Life-Harness evolves reusable interventions from training trajectories to enhance frozen LLM agents on unseen tasks across seven deterministic environments, yielding 88.5% average relative improvement in 116 of 126 model-environment settings.
-
What Do Evolutionary Coding Agents Evolve?
Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.
-
Automated Kernel Discovery Towards Understanding High-dimensional Bayesian Optimization
An LLM-based evolutionary search discovers novel kernels for high-dimensional Bayesian optimization, achieving an average rank of 1.2 out of 17 on five benchmarks via two-stage proposal and LOO-CRPS selection.
-
FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics
FML-Bench shows a simple greedy hill-climber nearly matches tree search on dense-opportunity tasks while an adaptive agent that broadens search on stagnation outperforms six baselines across 18 tasks.
-
Latent Heuristic Search: Continuous Optimization for Automated Algorithm Design
Latent Heuristic Search performs continuous optimization over learned embeddings of heuristics, using normalizing flows and LLM prompting to discover competitive solvers for TSP, CVRP, KSP, and OBP.
-
Probabilistic Seasonal Streamflow Forecasting Across California's Sierra Nevada Watersheds with Agentic AI
An agentic AI workflow evolves an adaptive XGBoost quantile regression ensemble that reduces watershed-averaged forecast error by up to 29% versus California's operational forecasts for April-July runoff at 1-6 month leads across 23 Sierra Nevada sites.
-
Property-Guided LLM Program Synthesis for Planning
Property-guided LLM program synthesis with counterexample feedback creates direct heuristics for PDDL planning domains that require far fewer generations and less evaluation cost than score-based baselines.
-
From I/O to Code with Discovery Agent
DIO-Agent frames IO2Code as LLM-driven evolutionary search over programs with a Transformation Priority Premise to favor simple hypotheses, outperforming baselines on a new IO2CodeBench.
-
SMCEvolve: Principled Scientific Discovery via Sequential Monte Carlo Evolution
SMCEvolve applies Sequential Monte Carlo sampling to LLM program search with adaptive resampling, mutation mixtures, and convergence control, delivering finite-sample complexity bounds and benchmark gains over prior systems.
-
SemaTune: Semantic-Aware Online OS Tuning with Large Language Models
SemaTune uses LLM guidance with semantic context to tune up to 41 Linux OS parameters, delivering 72.5% performance gains over defaults and 153.3% over non-LLM baselines on 13 workloads while avoiding degraded states.
-
Adapting AlphaEvolve to Optimize Fully Homomorphic Encryption on TPUs
AlphaEvolve automates optimization of TFHE and CKKS FHE kernels on TPUv5e, finding changes that cut bootstrap latency by 2.5x and rotation/multiplication by 1.31x/1.18x versus human baselines.
-
Test-Time Learning with an Evolving Library
EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without parameter updates or supervision.
-
FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale
FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.
-
Harnessing Agentic Evolution
AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, achieving a 26% relative improvement over baselines on agentic benchmarks and SOTA results on open-ended tasks.
-
Learning POMDP World Models from Observations with Language-Model Priors
Pinductor leverages language-model priors to learn POMDP world models from limited trajectories, matching privileged-access methods in performance and exceeding tabular baselines in sample efficiency.
-
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents
AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.
-
Budget-Efficient Automatic Algorithm Design via Code Graph
A code-graph and correction-based LLM search framework outperforms full-algorithm generation at equal token budgets on three combinatorial optimization problems.