arxiv: 2605.09764 · v1 · submitted 2026-05-10 · 💻 cs.NE · cs.AI

Recognition: 2 theorem links

· Lean Theorem

LEVI: Stronger Search Architectures Can Substitute for Larger LLMs in Evolutionary Search

Temoor Tanveer

Authors on Pith no claims yet

Pith reviewed 2026-05-12 02:55 UTC · model grok-4.3

classification 💻 cs.NE cs.AI

keywords evolutionary algorithmslarge language modelssearch architecturesdiversity preservationmutation routingproxy benchmarksprompt optimizationsystems research

0 comments

The pith

LEVI demonstrates that stronger evolutionary search architectures can substitute for larger LLMs, achieving better results at far lower cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that existing LLM evolutionary methods waste resources on poor diversity management, inappropriate model sizing for mutations, and full evaluations. LEVI addresses this with a diversity database, mutation router, and proxy benchmark. If correct, this means advanced evolutionary search no longer requires expensive frontier models, making it more accessible for systems research and prompt optimization. A sympathetic reader would care because it shifts the focus from scaling models to scaling smart search designs.

Core claim

LEVI is an evolutionary framework built on the idea that architectural strength can replace model size. Its key features include a solution database that preserves diversity from the start and throughout, a mutation router that assigns tasks to large or small LLMs based on their strengths, and a rank-preserving proxy for efficient benchmarking in rollout-heavy scenarios. Experiments show it reaches the highest scores on systems benchmarks with 3.3-6.7 times smaller budgets than prior methods, matching the best result in one case at 35 times lower cost, and performs comparably or better on prompt optimization with half the budget.

What carries the argument

The combination of a diversity-establishing and maintaining solution database, a strength-aware mutation router for large and small LLMs, and a rank-preserving proxy benchmark.

Load-bearing premise

The reported improvements stem directly from the three architectural innovations rather than variations in baseline implementations, problem sets, or experimental conditions.

What would settle it

Reproducing the baseline methods with identical problem selections, evaluation protocols, and model versions would eliminate the performance and cost advantages claimed for LEVI.

Figures

Figures reproduced from arXiv: 2605.09764 by Temoor Tanveer.

**Figure 2.** Figure 2: LEVI uses one bootstrapped seed pass to initialize solution families, calibrate the CVT-MAP [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Controlled architecture comparison on Transaction Scheduling (left) and Spot Scheduling [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Ablation on the solution database, studying the impact of bootstrapped initialization and [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Ablation on the model allocation, studying the impact of using larger models and role [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Proxy-benchmark ranking quality as a function of [PITH_FULL_IMAGE:figures/full_fig_p029_6.png] view at source ↗

**Figure 7.** Figure 7: Transaction Scheduling — final-archive composition for the three diversity-ablation [PITH_FULL_IMAGE:figures/full_fig_p030_7.png] view at source ↗

**Figure 8.** Figure 8: Spot Scheduling — final-archive composition for the three diversity-ablation conditions. [PITH_FULL_IMAGE:figures/full_fig_p030_8.png] view at source ↗

read the original abstract

LLM-guided evolutionary methods such as AlphaEvolve have proven effective in domains like math, systems research, and algorithmic discovery, but their reliance on frontier models makes each run expensive. We argue this is largely an artifact of how existing frameworks allocate search: archives that fail to preserve solution diversity force compensation through stronger mutation models; blind model use spends frontier dollars on local edits a smaller model could handle; and full-set evaluation wastes rollouts on redundant examples. We introduce LEVI, a harness-first evolutionary framework built on the bet that stronger search architectures can substitute for or even outperform larger LLMs in evolutionary search. LEVI improves on three core components of evolutionary search: a solution database that establishes diversity from the beginning, and then maintains it throughout the run; a smarter mutation router that plays into the strengths of large and small LLMs; and a rank-preserving proxy benchmark for rollout-heavy settings. Across systems-research benchmarks LEVI attains the highest score on a budget 3.3-6.7x smaller than the published frontier-model runs of existing frameworks like ShinkaEvolve, GEPA, and AdaEvolve; on one problem, LEVI matches the existing best at a 35x lower cost. On prompt optimization, LEVI matches or exceeds GEPA at less than half of its rollout budget on four different benchmarks. LEVI is available as an open-source framework at https://github.com/ttanv/levi.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LEVI claims better search architecture can replace bigger LLMs for evolutionary tasks, but the gains rest on comparisons to published baselines rather than matched re-runs.

read the letter

The main takeaway is that LEVI tries to show you can improve LLM-guided evolutionary search by strengthening the search setup instead of defaulting to larger models. The framework adds three pieces: an archive that builds and keeps diversity from the start, a router that sends mutations to big or small LLMs depending on the edit, and a proxy benchmark that keeps ranking order without full evaluations every time. They report LEVI hitting the best scores on systems benchmarks at 3.3-6.7x lower budget than the published runs of ShinkaEvolve, GEPA, and AdaEvolve, with one case matching the prior best at 35x less cost. On prompt optimization it matches or beats GEPA at under half the rollout budget. The code is open-sourced, which makes the claims easier to test directly. That openness and the concrete component list are the parts that stand out as useful. The comparisons are the soft spot. The paper measures against the numbers already published by the other frameworks rather than re-running those baselines under the same problems, models, rollout counts, and protocols. This leaves open the chance that some of the reported savings come from setup differences instead of the three new pieces. No error bars or statistical tests appear in the abstract, and the full text would need to show ablations that isolate each change. There is no formal derivation or proof, just the empirical numbers. Citations hit the main prior systems work without obvious holes. This is for people running evolutionary search with LLMs on discovery or optimization problems where budget is a constraint. A reader who wants a ready-to-try framework and some headline cost numbers would get value from it. I would send it to peer review because the ideas are specific, the code is public, and the central bet is testable even if the current experiments need tighter controls.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces LEVI, a harness-first evolutionary framework for LLM-guided search that augments three components: a solution database that enforces diversity from initialization onward, a mutation router that routes edits to large or small LLMs based on task demands, and a rank-preserving proxy benchmark that reduces rollout costs while preserving relative ordering. The central empirical claim is that these changes allow LEVI to reach the highest scores on systems-research benchmarks at 3.3-6.7x lower budget than published frontier-model runs of ShinkaEvolve, GEPA, and AdaEvolve (with one problem matched at 35x lower cost) and to match or exceed GEPA on four prompt-optimization benchmarks at less than half the rollout budget. The framework is released open-source.

Significance. If the performance advantages survive controlled re-evaluation, the result would be significant for evolutionary computation and LLM-augmented discovery: it supplies concrete evidence that search-architecture refinements can substitute for model scale, lowering the cost barrier for math, systems, and algorithmic tasks. The open-source release and the explicit framing against published baselines are strengths that would aid adoption and follow-on work.

major comments (2)

[Experimental Results] Experimental Results section: all quantitative claims (3.3-6.7x budget reduction, 35x on one problem, half rollout budget on prompt optimization) rest on comparisons to previously published numbers from ShinkaEvolve, GEPA, and AdaEvolve rather than re-executions of those baselines on the identical benchmark instances, evaluation protocols, rollout counts, and LLM checkpoints used by LEVI. This is load-bearing for the central attribution that the three architectural changes (diversity database, mutation router, rank-preserving proxy) are responsible for the gains; uncontrolled differences in problem difficulty or metric definitions could produce the same cost reductions.
[Method and Evaluation] Method and Evaluation sections: no ablation tables or controlled variants are presented that isolate the contribution of each of the three proposed changes while holding the LLM size, total rollout budget, and problem set fixed. Without such isolations, the claim that stronger search architecture substitutes for larger LLMs remains correlational.

minor comments (3)

[Abstract] Abstract: the phrases 'highest score' and 'matches the existing best' would be clearer if the exact metric (e.g., pass@1, success rate) and the number of independent runs were stated.
[Figures and Tables] Figure captions and tables: several budget-vs-performance plots lack error bars or run-to-run variance, making it hard to judge whether the reported margins exceed noise.
[Related Work] Related Work: the positioning against AlphaEvolve and other LLM-evolution frameworks could include a short table summarizing their archive, mutation, and evaluation strategies for direct comparison with LEVI's three changes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments correctly identify areas where stronger controls would improve the attribution of our results. We address each point below and have prepared revisions to the manuscript that incorporate additional discussion and new experiments where feasible.

read point-by-point responses

Referee: [Experimental Results] Experimental Results section: all quantitative claims (3.3-6.7x budget reduction, 35x on one problem, half rollout budget on prompt optimization) rest on comparisons to previously published numbers from ShinkaEvolve, GEPA, and AdaEvolve rather than re-executions of those baselines on the identical benchmark instances, evaluation protocols, rollout counts, and LLM checkpoints used by LEVI. This is load-bearing for the central attribution that the three architectural changes (diversity database, mutation router, rank-preserving proxy) are responsible for the gains; uncontrolled differences in problem difficulty or metric definitions could produce the same cost reductions.

Authors: We agree that re-executing the baselines under identical conditions would provide the most direct evidence. However, the published numbers come from the original authors' implementations and hardware setups, and repeating those runs at frontier scale would require resources beyond the scope of this work. In the revised manuscript we have added a dedicated subsection that enumerates known differences in problem instances, evaluation protocols, and model checkpoints, together with an explicit discussion of how these factors could affect the reported speed-ups. We also emphasize that the open-source release of LEVI enables independent re-evaluation by the community. revision: partial
Referee: [Method and Evaluation] Method and Evaluation sections: no ablation tables or controlled variants are presented that isolate the contribution of each of the three proposed changes while holding the LLM size, total rollout budget, and problem set fixed. Without such isolations, the claim that stronger search architecture substitutes for larger LLMs remains correlational.

Authors: We concur that isolating each component is necessary to move beyond correlational claims. The original submission reported only the full-system results. For the revision we have added a new ablation study section that holds LLM size, total rollout budget, and benchmark instances fixed while individually enabling or disabling the diversity database, the mutation router, and the rank-preserving proxy. The resulting tables quantify the marginal contribution of each change and directly support the architectural-substitution thesis. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on benchmark comparisons

full rationale

The paper introduces LEVI as an empirical evolutionary search framework with three described architectural components (diversity database, mutation router, rank-preserving proxy) and reports performance gains via direct comparisons to published baseline runs on systems-research and prompt-optimization benchmarks. No equations, derivations, fitted parameters, or self-referential definitions appear in the abstract or described content. The central claims are experimental outcomes rather than reductions of predictions to inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The derivation chain is therefore self-contained against external benchmarks, consistent with the default expectation for non-circular empirical systems papers.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The central claim implicitly assumes that evolutionary search with LLMs is limited by archive diversity, model allocation, and evaluation cost rather than by model capability per se.

axioms (2)

domain assumption LLM-guided mutation and selection can discover high-quality solutions when diversity is maintained.
Core premise of all cited LLM evolutionary methods; invoked throughout the abstract.
domain assumption A rank-preserving proxy benchmark can substitute for full evaluation without changing relative ordering.
Required for the third component; not justified in abstract.

pith-pipeline@v0.9.0 · 5556 in / 1299 out tokens · 36559 ms · 2026-05-12T02:55:36.843705+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes
LEVI improves on three core components of evolutionary search: a solution database that establishes diversity from the beginning... a smarter mutation router... and a rank-preserving proxy benchmark
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability echoes
a CVT-MAP-Elites archive preserves diversity by construction

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

[1]

PE” = punctuated equilibrium; “cascade

Spot Multi uses five custom AST extractors capturing state-machine structure rather than the generic AST features. Benchmark Budget Mutation models AST descriptors Score-key descrip- tors Cloudcast $3.00 Qwen3-30B loops, branches, math ops 5 per-cloud cost columns EPLB $4.50 MiMo-v2 + Qwen3-30B loop nesting, cyclomatic, math ops exec. time + 3 work- loads...

work page
[2]

Your code must be COMPLETE and RUNNABLE as a standalone file

work page
[3]

Include ALL necessary imports at the top of your code

work page
[4]

The function signature must match exactly what is specified

work page
[5]

‘‘‘ DO NOT include any explanation or text outside the code block

Ensure there are no syntax errors (matching parentheses, quotes, indentation) Your response must follow this structure: ‘‘‘python # all necessary imports here def function_name(...): # your complete implementation return ... ‘‘‘ DO NOT include any explanation or text outside the code block. DO NOT assume any imports are already available - include every i...

work page
[6]

The resulting code must be COMPLETE and RUNNABLE

work page
[7]

Do NOT remove import statements unless replacing with different imports

work page
[8]

Ensure your replacements maintain valid Python syntax 17

work page
[9]

If adding new functionality that needs imports, add them with a separate SEARCH/REPLACE block RULES:

work page
[10]

Make SURGICAL changes - small, focused edits (5-20 lines max per block)

work page
[11]

Copy the SEARCH section EXACTLY from the original (including whitespace/indentation)

work page
[12]

Use multiple small SEARCH/REPLACE blocks instead of one large block

work page
[13]

Start your response immediately with <<<<<<< SEARCH

work page
[14]

Do NOT include any explanation or text outside the blocks

work page
[15]

The representative_solutions field is filled with the highest-scoring elite from each k-means cluster of the occupied archive cells

Do NOT use ‘‘‘python code blocks E.4 Paradigm shift (PE trigger) Sent to the paradigm-shift (heavy) model whenever PE fires (Appendix F). The representative_solutions field is filled with the highest-scoring elite from each k-means cluster of the occupied archive cells. # Algorithmic Paradigm Shift Challenge ## Problem {problem_description} ## Function Si...

work page
[16]

**Identify current paradigms**: What algorithmic strategies do the existing solutions use? (e.g., greedy, graph-based, dynamic programming, heuristic search, brute-force with pruning, etc.)

work page
[17]

**Find the gap**: What paradigms are NOT represented in the current solutions?

work page
[18]

**Design a novel approach**: Synthesize a solution using a completely different conceptual framework and data structure strategy than those found in the examples ### Instructions:

work page
[19]

Study the function signature carefully - match it EXACTLY

work page
[20]

Actively avoid the core logic, heuristics, and search patterns used in the existing solutions

work page
[21]

No explanations before or after

Design a solution using a COMPLETELY DIFFERENT strategy ### Critical Requirements: - Your function signature MUST match exactly: ‘{function_signature}‘ - Use only standard Python libraries (numpy, collections, itertools, math, heapq, functools, etc.) and torch if needed - The code must be syntactically valid and complete - Include ALL necessary imports at...

work page
[22]

Keeping the core algorithmic approach intact

work page
[23]

replace X with Y

Making targeted modifications to: - Constants and thresholds - Secondary heuristics - Edge case handling - Implementation details The variant should explore nearby regions of the solution space while preserving the novel approach. ## Output Output ONLY the complete Python code in a ‘‘‘python block. E.6 Prompt-optimization mutation (single-prompt artifact)...

work page
[24]

Cluster occupied cells.Run k-means in descriptor space over the behavior vectors of the currently occupied centroids, withk=n_clusters(default 3)

work page
[25]

The result is a small set of mutually-distant, individually-strong solutions

Select representatives.From each cluster, take the highest-scoring elite. The result is a small set of mutually-distant, individually-strong solutions

work page
[26]

Route to the paradigm-shift model (default: Gemini Flash 3) with max_tokens=4096,timeout=300s, at the configuredpe.temperature

Paradigm-shift call.Build the paradigm-shift prompt (Section E.4) with the represen- tatives as context. Route to the paradigm-shift model (default: Gemini Flash 3) with max_tokens=4096,timeout=300s, at the configuredpe.temperature

work page
[27]

Insert into the archive if it dominates its target cell

Evaluate paradigm.Score the resulting candidate against the full evaluator. Insert into the archive if it dominates its target cell. 5.Variant burst.Generaten variants (default 3) variants of the accepted paradigm shift using the variant-generation prompt (Section E.5), routed to the smaller mutation model. Each variant is independently evaluated and inse...

work page
[28]

These logs are what the figures in Figure 5 are derived from

Logging.Record paradigm_generated, paradigm_accepted, variants_generated, variants_accepted, and total_cost for the trigger event. These logs are what the figures in Figure 5 are derived from. The trigger interval is the only PE knob that varies across our experiments: the per-benchmark and ablation runs override interval between 5 and 18 depending on bud...

work page
[29]

-> tuple[torch.Tensor, torch.Tensor, torch.Tensor]: 10’’’ 11Paradigm: Spectral-Inspired Balanced Partitioning (SIBP). 12 20 13Instead of greedy allocation or simple interleaving, this approach treats the 14load balancing problem as a ’Bin Packing with Variable Item Sizes’ combined 15with a ’Static Circular Buffer’ layout. 16

work page
[30]

Approximation: We use a Logarithmic Binning strategy to allocate expert counts, 18representing a shift from linear proportionality to entropy-minimizing distribution

work page
[31]

Point-of-No-Return

Placement: We utilize a ’Circular Zig-Zag’ mapping to assign experts to physical 20slots, ensuring that the highest-load experts are geographically distributed 21using a relative-prime offset and localized mirroring. 22’’’ 23num_layers, num_logical_experts = weight.shape 24device = weight.device 25 26# 1. Expert Count Determination: Entropy-based Logarith...

work page
[32]

Initial Greedy Packing using a Pressure-Aware First-Fit

work page
[33]

Simulated Annealing / Local Search to refine the bottleneck (max pressure). 22""" 23 24def calculate_kvpr(gpu_models): 25if not gpu_models: 26return 0.0 27total_weight = sum(m.req_rate / m.slo for m in gpu_models) 28used_mem = sum(m.model_size for m in gpu_models) 29remaining_mem = GPU_MEM_SIZE - used_mem 30# Respect hard constraint; return inf if exceede...

work page
[34]

20 21Strategy:

-> pd.DataFrame: 18""" 19Optimized Column and Row Reordering for Maximum Prefix Hit Rate. 20 21Strategy:

work page
[35]

Apply column merging with robust error handling

work page
[36]

25- For larger sets, use greedy selection based on compression-like utility (sum of squares of frequencies)

Use a hybrid column ordering approach: 24- For small column sets (<= col_stop), use brute-force with early pruning. 25- For larger sets, use greedy selection based on compression-like utility (sum of squares of frequencies)

work page
[37]

Row reordering via sequential stable frequency ranking (high-frequency values grouped together)

work page
[38]

Leverage vectorized operations and sampling to ensure runtime < 10s

work page
[39]

_".join(valid_cols) 48# Vectorized string concatenation: fast and safe 49df[merged_name] = df[valid_cols].astype(str).fillna(

Avoid common pitfalls: nested multiprocessing, apply(), missing columns, and row truncation. 29 30Key improvements over v1/v2: 31- Avoids ‘df.apply(lambda)‘ entirely; uses vectorized string operations. 32- Prevents daemonic process errors by using ThreadPoolExecutor instead of ProcessPoolExecutor. 33- Handles edge cases: empty DataFrame, missing columns, ...

work page
[40]

deadline-aware restart-overhead cautious

-> float: 171""" 172Vectorized prefix hit rate calculation using string concatenation. 173Avoids loops and apply; uses pandas vectorized operations. 174 175Args: 176col_order: List of column names in order 177sample_str: DataFrame of string values (already converted) 178 179Returns: 180Average prefix hit rate (fraction of rows where current row starts wit...

work page