arxiv: 2601.02997 · v2 · submitted 2026-01-06 · 💻 cs.LG · cs.CV


From Memorization to Creativity: LLM as a Designer of Novel Neural Architectures

Waleed Khalid, Dmitry Ignatov, Radu Timofte


Pith reviewed 2026-05-16 17:32 UTC · model grok-4.3


keywords neural architecture search · LLM fine-tuning · program synthesis · LoRA · CIFAR-10 · novel architectures · self-supervised learning · convolutional networks


The pith

Iterative fine-tuning with execution feedback and novelty filtering turns an LLM into a specialized generator of novel neural network architectures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that running an LLM through 22 cycles of code synthesis, low-fidelity validation, MinHash-Jaccard novelty filtering, and LoRA updates on successful examples produces a clear shift in its output distribution. On CIFAR-10 the valid generation rate for PyTorch CNNs stabilizes at 50.6% (peaking at 74.5%), mean first-epoch accuracy rises from 28.1% to 51.0%, and 455 previously unseen architectures enter the training set. A sympathetic reader would care because the method replaces hand-crafted search spaces with an annotation-free loop that improves reliability, performance proxies, and structural diversity at the same time. The same gains transfer to CIFAR-100 and SVHN, indicating the learned architectural prior is not tied to one dataset.

Core claim

Over 22 supervised fine-tuning cycles the LLM internalizes empirical architectural priors so that valid and high-performing outputs evolve from scarce to dominant: on CIFAR-10 the valid generation rate stabilizes at 50.6 percent, mean first-epoch accuracy rises from 28.1 percent to 51.0 percent, candidates exceeding 40 percent accuracy grow from 2.0 percent to 96.8 percent, and 455 unique architectures absent from the original corpus are admitted under the novelty filter.
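The three headline metrics can be computed from per-candidate outcomes roughly as follows. This is a minimal sketch, not the authors' code: the function name `summarize_cycle`, the convention that an invalid generation is recorded as `None`, and the choice to compute the above-threshold fraction over valid candidates only are all assumptions, since the source does not say which denominator the paper uses. The toy numbers are made up.

```python
from statistics import mean

def summarize_cycle(results, threshold=0.40):
    """Summarize one cycle.

    results: first-epoch accuracies in [0, 1]; None marks an invalid generation
             (assumed convention, not taken from the paper).
    """
    valid = [acc for acc in results if acc is not None]
    valid_rate = len(valid) / len(results) if results else 0.0
    mean_acc = mean(valid) if valid else 0.0
    # Assumption: the fraction is taken over valid candidates only.
    frac_above = sum(acc > threshold for acc in valid) / len(valid) if valid else 0.0
    return {
        "valid_rate": valid_rate,
        "mean_first_epoch_acc": mean_acc,
        "frac_above_threshold": frac_above,
    }

# Made-up outcomes for five generations, two of which failed to run.
toy = [None, 0.25, 0.5, None, 0.75]
summary = summarize_cycle(toy)
print(summary)
```

Tracking these three numbers per cycle is what would make the "scarce to dominant" trend visible as a curve rather than as two endpoints.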

What carries the argument

The closed-loop NNGPT pipeline that pairs LLM code synthesis with low-fidelity execution feedback and MinHash-Jaccard redundancy filtering before converting high-performing novel candidates into prompt-code pairs for LoRA fine-tuning.
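The control flow of such a loop can be sketched as below. Everything here is illustrative: `run_closed_loop` and the toy stand-ins are hypothetical names, the generator is a seeded random string producer rather than an LLM, the evaluator is an arithmetic stub rather than a first-epoch training run, novelty is checked by exact match rather than MinHash-Jaccard, and the 0.40 cutoff for building training pairs is an assumption borrowed from the accuracy threshold quoted elsewhere on this page.

```python
import random

def run_closed_loop(generate, evaluate, is_novel, fine_tune,
                    cycles=22, per_cycle=8, acc_threshold=0.40):
    """Run the generate -> evaluate -> filter -> fine-tune loop."""
    corpus = []    # valid, novel candidates admitted so far
    history = []   # per-cycle counts
    for cycle in range(cycles):
        pairs, admitted = [], 0
        for code in generate(cycle, per_cycle):
            acc = evaluate(code)            # None: the code failed to run
            if acc is None or not is_novel(code, corpus):
                continue
            corpus.append(code)
            admitted += 1
            if acc >= acc_threshold:        # assumed cutoff for training pairs
                pairs.append((f"prompt for cycle {cycle}", code, acc))
        fine_tune(pairs)                    # stand-in for a LoRA update
        history.append({"cycle": cycle, "admitted": admitted, "pairs": len(pairs)})
    return corpus, history

# Toy stand-ins so the sketch runs without an LLM or a GPU.
def toy_generate(cycle, n):
    rng = random.Random(cycle)
    return [f"conv{rng.randint(1, 9)}-fc{rng.randint(1, 9)}" for _ in range(n)]

def toy_evaluate(code):
    score = sum(map(ord, code)) % 100
    return None if score % 5 == 0 else score / 100

trained = []
corpus, history = run_closed_loop(
    toy_generate, toy_evaluate,
    is_novel=lambda code, seen: code not in seen,  # exact match, not MinHash
    fine_tune=trained.extend,
    cycles=3,
)
print(history)
```

Passing the four stages in as callables keeps the loop itself independent of the model, the proxy, and the filter, so each can be swapped or ablated on its own.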

If this is right

Valid generation rates rise from scarce to dominant across cycles.
Mean first-epoch accuracy and the fraction of high-accuracy candidates increase steadily.
Structural novelty is maintained while redundancy is controlled by the filter.
Improved validity, accuracy distributions, and novelty transfer to CIFAR-100 and SVHN without retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The learned prior could be frozen and used to seed architecture search for new vision tasks or modalities.
The same feedback loop might be applied to recurrent or transformer blocks once suitable low-fidelity proxies are defined.
The method supplies a reproducible way to grow task-specific architecture corpora without human annotation.

Load-bearing premise

Low-fidelity performance signals serve as reliable proxies for true model quality and the MinHash-Jaccard filter removes redundancy without discarding useful novel structures.
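To make the second half of this premise concrete, here is a minimal MinHash sketch for estimating Jaccard similarity between two code strings. It is not the paper's implementation: the whitespace tokenization, the 3-token shingles, the 64 hash functions, and the 0.8 rejection threshold are all assumptions chosen for illustration (the referee report notes that the paper omits these values), and the helper names are hypothetical.

```python
import hashlib

def shingles(code, k=3):
    """Set of k-token shingles from whitespace-split code (assumed tokenization)."""
    tokens = code.split()
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}

def minhash_signature(items, num_perm=64):
    """One minimum per salted hash function; num_perm=64 is an assumed setting."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(), "big"
            )
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def is_novel(code, corpus_signatures, threshold=0.8):
    """Novel if similarity to every stored signature is below the assumed threshold."""
    sig = minhash_signature(shingles(code))
    return all(estimated_jaccard(sig, other) < threshold for other in corpus_signatures)

a = "conv bn relu conv bn relu pool fc"
c = "attention norm mlp attention norm mlp"
sig_a = minhash_signature(shingles(a))
sig_c = minhash_signature(shingles(c))
print(estimated_jaccard(sig_a, sig_a), estimated_jaccard(sig_a, sig_c))
```

The premise is at risk exactly where this estimate is coarse: two architectures that differ in one consequential layer can share most shingles and be rejected as redundant.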

What would settle it

Full end-to-end training of the top proxy performers from cycle 22 on CIFAR-10 showing no accuracy gain over the top performers from cycle 1 or over standard hand-designed baselines.

Original abstract

Large language models (LLMs) excel in program synthesis, yet their capacity for neural architecture design -- balancing syntactic reliability, performance, and structural novelty -- remains underexplored. We present a closed-loop architecture synthesis pipeline within the NNGPT framework, in which a code-oriented LLM evolves over 22 supervised fine-tuning cycles. At each cycle, the LLM synthesizes PyTorch convolutional networks, validated via low-fidelity performance signals and filtered via a MinHash--Jaccard criterion to prevent structural redundancy before being incorporated into the LEMUR dataset. High-performing candidates with novel architectures are converted into prompt--code pairs for parameter-efficient LoRA fine-tuning. This feedback loop drives a measurable distributional shift, progressively internalizing empirical architectural priors such that valid and high-performing outputs evolve from scarce to dominant across cycles. On CIFAR-10, the valid generation rate stabilizes at 50.6% (peaking at 74.5%), mean first-epoch accuracy rises from 28.1% to 51.0%, and candidates exceeding 40% accuracy grow from 2.0% to 96.8%. Cross-dataset transfer to CIFAR-100 and SVHN confirms that improved validity, shifted accuracy distributions, and sustained novelty generalize across benchmarks of varying difficulty and visual domain. Across 22 cycles, 455 unique architectures absent from the original corpus are admitted under the novelty filter. By grounding synthesis in execution feedback and novelty filtering, we demonstrate that iterative self-supervised fine-tuning reshapes an LLM into a task-specialized architectural prior -- improving generation reliability, proxy performance, and structural diversity -- offering a reproducible, annotation-free alternative to hand-crafted search spaces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper runs a closed LLM fine-tuning loop that shifts CNN generation toward higher validity and first-epoch accuracy, but the first-epoch proxy is the main untested link.


The main takeaway is that the authors close the loop on an LLM generating PyTorch CNN code: it produces candidates, scores them on first-epoch accuracy, drops duplicates via MinHash-Jaccard, and feeds the keepers back as LoRA pairs. After 22 cycles the outputs are valid more often, the mean first-epoch accuracy rises from 28.1% to 51.0%, and nearly all candidates (96.8%) clear 40%. They also admit 455 architectures that were not in the starting set and show that the same pattern holds on CIFAR-100 and SVHN.

Referee Report

3 major / 2 minor

Summary. The manuscript presents a closed-loop pipeline in which a code-oriented LLM undergoes 22 supervised fine-tuning cycles. At each cycle, the model generates PyTorch CNN architectures that are validated with low-fidelity first-epoch accuracy signals, filtered by a MinHash-Jaccard novelty criterion to remove redundancy, and used to create prompt-code pairs for LoRA updates. The central claim is that this process induces a distributional shift, raising valid generation rate to 50.6% (peak 74.5%), mean first-epoch accuracy from 28.1% to 51.0%, and the fraction of candidates exceeding 40% accuracy from 2.0% to 96.8% on CIFAR-10, with transfer to CIFAR-100 and SVHN and admission of 455 novel architectures absent from the initial corpus.

Significance. If the reported shifts are robust, the work supplies concrete evidence that iterative execution-grounded fine-tuning can internalize empirical architectural priors, yielding an annotation-free route to task-specialized generators. The emphasis on structural novelty, cross-dataset generalization, and reproducibility distinguishes it from purely prompt-based synthesis methods and could inform future automated architecture search pipelines.

major comments (3)

[Pipeline description (candidate labeling step)] The decision to label architectures as high-performing on the basis of first-epoch accuracy alone lacks any reported correlation analysis with final test accuracy after full training. Because the feedback loop directly optimizes the LLM toward this proxy, the observed jumps (2.0% → 96.8% above 40%) may reflect selection for fast-converging rather than ultimately superior architectures, which is load-bearing for the claim of genuine architectural improvement.
[Results section and abstract] No error bars, standard deviations, or multi-seed statistics are supplied for the key metrics (valid rate, mean accuracy, high-accuracy fraction) across the 22 cycles, preventing assessment of whether the distributional shifts are statistically reliable or sensitive to random seeds and prompt phrasing.
[Novelty filtering subsection] The precise MinHash-Jaccard similarity threshold, number of hash functions, and any ablation on its retention of high-potential structures are omitted. Without these details it is impossible to verify that the filter removes only redundancy rather than discarding architectures that would have performed well under full evaluation.
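The correlation analysis requested in the first major comment could look like the sketch below, which computes Pearson and Spearman (rank) correlation between first-epoch and fully trained accuracies. The helper names are hypothetical and the five accuracy pairs are invented purely to exercise the code; they are not results from the paper.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Invented accuracies for five hypothetical architectures.
first_epoch = [0.30, 0.42, 0.45, 0.51, 0.55]
fully_trained = [0.70, 0.81, 0.79, 0.88, 0.90]
rho = spearman(first_epoch, fully_trained)
r = pearson(first_epoch, fully_trained)
print(round(rho, 3), round(r, 3))
```

For a selection loop, the rank correlation is arguably the more relevant of the two, because the pipeline only needs the proxy to order candidates correctly, not to predict final accuracy on a linear scale.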

minor comments (2)

[Abstract and methods] The term 'low-fidelity performance signals' is used without an explicit definition of the first-epoch training protocol (optimizer, batch size, number of epochs).
[Cross-dataset experiments] Clarify whether the same prompt templates and LoRA hyperparameters were reused for CIFAR-100 and SVHN or whether any dataset-specific adjustments were introduced.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.


Referee: Pipeline description (candidate labeling step): the decision to label architectures as high-performing on the basis of first-epoch accuracy alone lacks any reported correlation analysis with final test accuracy after full training. Because the feedback loop directly optimizes the LLM toward this proxy, the observed jumps (2.0% → 96.8% above 40%) may reflect selection for fast-converging rather than ultimately superior architectures, which is load-bearing for the claim of genuine architectural improvement.

Authors: We appreciate the referee's concern that first-epoch accuracy serves only as a proxy. This choice was made to keep the closed loop computationally tractable. In the revised manuscript we will add a correlation analysis (Pearson and rank) between first-epoch and fully trained test accuracies on a held-out sample of 100 architectures drawn from multiple cycles, together with a brief discussion of the proxy's limitations.
revision: yes
Referee: Results section and abstract: no error bars, standard deviations, or multi-seed statistics are supplied for the key metrics (valid rate, mean accuracy, high-accuracy fraction) across the 22 cycles, preventing assessment of whether the distributional shifts are statistically reliable or sensitive to random seeds and prompt phrasing.

Authors: We agree that statistical characterization is needed. The original experiments used single runs per cycle because of the high cost of 22 full cycles. In revision we will (i) report standard deviations computed across three independent prompt-phrasing variants for the final three cycles and (ii) add error bars to the main figures; we will also note the single-seed limitation in the text.
revision: partial
Referee: Novelty filtering subsection: the precise MinHash-Jaccard similarity threshold, number of hash functions, and any ablation on its retention of high-potential structures are omitted. Without these details it is impossible to verify that the filter removes only redundancy rather than discarding architectures that would have performed well under full evaluation.

Authors: We apologize for the missing implementation details. The revised subsection will state the exact similarity threshold, the number of hash functions, and the MinHash configuration, and will include a short ablation showing the fraction of high-accuracy (>40%) candidates retained at different thresholds.
revision: yes
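The threshold ablation promised in the last response could be organized along these lines. This is a sketch under stated assumptions: it uses exact Jaccard similarity on token sets as a stand-in for the MinHash estimate, the function `retained_high_accuracy` is a hypothetical name, candidates are processed in order and rejected when similarity to anything already admitted reaches the threshold, and the three toy candidates and their accuracies are invented.

```python
def jaccard(a, b):
    """Exact Jaccard similarity of two sets (stand-in for a MinHash estimate)."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def retained_high_accuracy(candidates, corpus, sim_threshold, acc_threshold=0.40):
    """Fraction of high-accuracy candidates that survive the novelty filter.

    candidates: list of (token_set, first_epoch_accuracy) pairs.
    corpus: token sets already in the dataset.
    """
    high = [(tokens, acc) for tokens, acc in candidates if acc > acc_threshold]
    if not high:
        return 0.0
    seen = list(corpus)
    kept = 0
    for tokens, _ in high:
        # Reject when similarity to any admitted entry reaches the threshold.
        if all(jaccard(tokens, other) < sim_threshold for other in seen):
            kept += 1
            seen.append(tokens)
    return kept / len(high)

# Invented data: one corpus entry and three candidates.
corpus = [{"conv", "bn", "relu", "fc"}]
candidates = [
    ({"conv", "bn", "relu", "fc", "drop"}, 0.55),  # near-duplicate of the corpus entry
    ({"conv", "pool", "fc"}, 0.48),                # moderately similar
    ({"dw", "se", "gap"}, 0.30),                   # below the accuracy cutoff
]
table = {t: retained_high_accuracy(candidates, corpus, t) for t in (0.9, 0.5, 0.3)}
print(table)
```

A table of this shape, reported per cycle, would show how quickly a stricter threshold starts discarding candidates that the accuracy proxy rates highly.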

Circularity Check

0 steps flagged

No significant circularity; empirical results rely on external execution feedback


The paper presents an iterative empirical pipeline in which candidate architectures are generated, evaluated via independent low-fidelity execution signals on held-out data (first-epoch accuracy), filtered for novelty with MinHash-Jaccard, and used to create prompt-code pairs for LoRA fine-tuning. Reported improvements in valid generation rate, mean first-epoch accuracy, and fraction of high-accuracy candidates are measured on newly synthesized outputs after each cycle rather than derived from parameters fitted directly to the target metric or reduced by definition to the selection inputs. No equations, self-definitional claims, or load-bearing self-citations appear in the text that would make the distributional shift equivalent to the input data by construction. The process is self-contained against external benchmarks and falsifiable via the observed cross-dataset transfer results.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions from LLM code generation and neural architecture search rather than new postulates; no invented entities are introduced.

free parameters (2)

Number of supervised fine-tuning cycles
Fixed at 22 to observe the evolution of validity and accuracy distributions.
Accuracy threshold for high-performing candidates
Set at 40% to select architectures for inclusion in the LEMUR dataset and subsequent fine-tuning.

axioms (2)

domain assumption: Low-fidelity performance signals (first-epoch accuracy) correlate sufficiently with final model quality to guide selection.
Invoked in the validation step of the closed-loop pipeline.
domain assumption: MinHash-Jaccard similarity effectively identifies and filters structural redundancy without excluding high-potential novel architectures.
Used in the filtering criterion before dataset incorporation.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Delta-Based Neural Architecture Search: LLM Fine-Tuning via Code Diffs
cs.LG 2026-05 unverdicted novelty 7.0

Fine-tuned 7B LLMs generating unified diffs for neural architecture refinement achieve 66-75% valid rates and 64-66% mean first-epoch accuracy, outperforming full-generation baselines by large margins while cutting ou...
Closed-Loop LLM Discovery of Non-Standard Channel Priors in Vision Models
cs.CV 2026-01 unverdicted novelty 6.0

Closed-loop LLM search with AST-generated examples discovers non-standard channel widths that improve vision model performance over initial architectures on CIFAR-100.