From Memorization to Creativity: LLM as a Designer of Novel Neural Architectures
Pith reviewed 2026-05-16 17:32 UTC · model grok-4.3
The pith
Iterative fine-tuning with execution feedback and novelty filtering turns an LLM into a specialized generator of novel neural network architectures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Over 22 supervised fine-tuning cycles the LLM internalizes empirical architectural priors so that valid and high-performing outputs evolve from scarce to dominant: on CIFAR-10 the valid generation rate stabilizes at 50.6 percent, mean first-epoch accuracy rises from 28.1 percent to 51.0 percent, candidates exceeding 40 percent accuracy grow from 2.0 percent to 96.8 percent, and 455 unique architectures absent from the original corpus are admitted under the novelty filter.
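As a rough illustration of how the three headline metrics could be computed from per-candidate records, here is a minimal sketch. The record layout (`valid`, `acc`), the function name `summarize_cycle`, and the choice to compute the above-threshold fraction over valid candidates only are assumptions made for this example; the denominators the paper uses are not stated in this summary. The sample values are placeholders, not results.

```python
from statistics import mean


def summarize_cycle(candidates, threshold=0.40):
    """Summarize one cycle of generated candidates.

    candidates: list of dicts with 'valid' (bool) and 'acc'
    (first-epoch accuracy as a float, or None if the code failed to run).
    """
    total = len(candidates)
    accs = [c["acc"] for c in candidates if c["valid"]]
    return {
        # Share of generated candidates that executed successfully.
        "valid_rate": len(accs) / total if total else 0.0,
        # Mean first-epoch accuracy over valid candidates (assumed denominator).
        "mean_acc": mean(accs) if accs else 0.0,
        # Share of valid candidates above the accuracy threshold.
        "frac_above": sum(a > threshold for a in accs) / len(accs) if accs else 0.0,
    }


# Placeholder records for illustration only.
cycle = [
    {"valid": True, "acc": 0.52},
    {"valid": True, "acc": 0.38},
    {"valid": False, "acc": None},
    {"valid": True, "acc": 0.45},
]
stats = summarize_cycle(cycle)
print(stats)
```

Tracking these three numbers per cycle is enough to reproduce the shape of the reported trend, although the absolute values depend on which denominator is chosen.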
What carries the argument
The closed-loop NNGPT pipeline that pairs LLM code synthesis with low-fidelity execution feedback and MinHash-Jaccard redundancy filtering before converting high-performing novel candidates into prompt-code pairs for LoRA fine-tuning.
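The control flow of one cycle can be sketched as follows. This is a simplified illustration, not the NNGPT implementation: `run_cycle`, the callables `generate`, `evaluate`, and `is_novel`, and the default `acc_threshold=0.40` are all hypothetical names and values, and the LoRA update itself is left out.

```python
def run_cycle(generate, evaluate, is_novel, corpus, n_samples=8, acc_threshold=0.40):
    """One hypothetical cycle: generate, execute, filter, collect training pairs.

    generate() -> (prompt, code)
    evaluate(code) -> first-epoch accuracy, or None if the code fails to run
    is_novel(code, corpus) -> True if the code is not redundant with the corpus
    """
    pairs = []
    for _ in range(n_samples):
        prompt, code = generate()
        acc = evaluate(code)
        if acc is None or acc < acc_threshold:
            continue  # invalid or low-performing candidate
        if not is_novel(code, corpus):
            continue  # structurally redundant candidate
        corpus.append(code)
        pairs.append({"prompt": prompt, "code": code, "acc": acc})
    return pairs  # would feed the next fine-tuning step


# Stub components standing in for the LLM, the trainer, and the novelty filter.
samples = iter([("p1", "net_a"), ("p2", "net_b"), ("p3", "net_a"), ("p4", "broken")])
scores = {"net_a": 0.55, "net_b": 0.30, "broken": None}
corpus = []
pairs = run_cycle(
    generate=lambda: next(samples),
    evaluate=scores.get,
    is_novel=lambda code, seen: code not in seen,
    corpus=corpus,
    n_samples=4,
)
print(pairs)
```

Passing the three stages in as callables keeps the loop independent of any particular model, proxy, or filter, which is convenient when swapping in a different low-fidelity signal.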
If this is right
- Valid generations go from scarce to dominant across cycles.
- Mean first-epoch accuracy and the fraction of high-accuracy candidates increase steadily.
- Structural novelty is maintained while redundancy is controlled by the filter.
- Improved validity, accuracy distributions, and novelty transfer to CIFAR-100 and SVHN without retraining.
Where Pith is reading between the lines
- The learned prior could be frozen and used to seed architecture search for new vision tasks or modalities.
- The same feedback loop might be applied to recurrent or transformer blocks once suitable low-fidelity proxies are defined.
- The method supplies a reproducible way to grow task-specific architecture corpora without human annotation.
Load-bearing premise
Low-fidelity performance signals serve as reliable proxies for true model quality and the MinHash-Jaccard filter removes redundancy without discarding useful novel structures.
What would settle it
Full end-to-end training of the top proxy performers from cycle 22 on CIFAR-10 showing no accuracy gain over the top performers from cycle 1 or over standard hand-designed baselines.
Original abstract
Large language models (LLMs) excel in program synthesis, yet their capacity for neural architecture design -- balancing syntactic reliability, performance, and structural novelty -- remains underexplored. We present a closed-loop architecture synthesis pipeline within the NNGPT framework, in which a code-oriented LLM evolves over 22 supervised fine-tuning cycles. At each cycle, the LLM synthesizes PyTorch convolutional networks, validated via low-fidelity performance signals and filtered via a MinHash-Jaccard criterion to prevent structural redundancy before being incorporated into the LEMUR dataset. High-performing candidates with novel architectures are converted into prompt-code pairs for parameter-efficient LoRA fine-tuning. This feedback loop drives a measurable distributional shift, progressively internalizing empirical architectural priors such that valid and high-performing outputs evolve from scarce to dominant across cycles. On CIFAR-10, the valid generation rate stabilizes at 50.6% (peaking at 74.5%), mean first-epoch accuracy rises from 28.1% to 51.0%, and candidates exceeding 40% accuracy grow from 2.0% to 96.8%. Cross-dataset transfer to CIFAR-100 and SVHN confirms that improved validity, shifted accuracy distributions, and sustained novelty generalize across benchmarks of varying difficulty and visual domain. Across 22 cycles, 455 unique architectures absent from the original corpus are admitted under the novelty filter. By grounding synthesis in execution feedback and novelty filtering, we demonstrate that iterative self-supervised fine-tuning reshapes an LLM into a task-specialized architectural prior -- improving generation reliability, proxy performance, and structural diversity -- offering a reproducible, annotation-free alternative to hand-crafted search spaces.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a closed-loop pipeline in which a code-oriented LLM undergoes 22 supervised fine-tuning cycles. At each cycle, the model generates PyTorch CNN architectures that are validated with low-fidelity first-epoch accuracy signals, filtered by a MinHash-Jaccard novelty criterion to remove redundancy, and used to create prompt-code pairs for LoRA updates. The central claim is that this process induces a distributional shift, raising valid generation rate to 50.6% (peak 74.5%), mean first-epoch accuracy from 28.1% to 51.0%, and the fraction of candidates exceeding 40% accuracy from 2.0% to 96.8% on CIFAR-10, with transfer to CIFAR-100 and SVHN and admission of 455 novel architectures absent from the initial corpus.
Significance. If the reported shifts are robust, the work supplies concrete evidence that iterative execution-grounded fine-tuning can internalize empirical architectural priors, yielding an annotation-free route to task-specialized generators. The emphasis on structural novelty, cross-dataset generalization, and reproducibility distinguishes it from purely prompt-based synthesis methods and could inform future automated architecture search pipelines.
major comments (3)
- Pipeline description (candidate labeling step): the decision to label architectures as high-performing on the basis of first-epoch accuracy alone lacks any reported correlation analysis with final test accuracy after full training. Because the feedback loop directly optimizes the LLM toward this proxy, the observed jumps (2.0% → 96.8% above 40%) may reflect selection for fast-converging rather than ultimately superior architectures, which is load-bearing for the claim of genuine architectural improvement.
- Results section and abstract: no error bars, standard deviations, or multi-seed statistics are supplied for the key metrics (valid rate, mean accuracy, high-accuracy fraction) across the 22 cycles, preventing assessment of whether the distributional shifts are statistically reliable or sensitive to random seeds and prompt phrasing.
- Novelty filtering subsection: the precise MinHash-Jaccard similarity threshold, number of hash functions, and any ablation on its retention of high-potential structures are omitted. Without these details it is impossible to verify that the filter removes only redundancy rather than discarding architectures that would have performed well under full evaluation.
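To make concrete which parameters the last comment asks for, here is a minimal MinHash-Jaccard novelty check. It is a generic sketch of the technique, not the paper's filter: the token-shingle size `k=3`, `num_hashes=64`, the `threshold=0.8`, and the use of salted BLAKE2b as the hash family are all assumptions chosen for illustration.

```python
import hashlib
import re


def shingles(code, k=3):
    """Split source code into overlapping k-token shingles."""
    tokens = re.findall(r"\w+|[^\w\s]", code)
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items, num_hashes=64):
    """One minimum per salted hash function; the salt selects the function."""
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big"
            )
            for s in items
        ))
    return sig


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def is_novel(code, corpus_sigs, threshold=0.8, num_hashes=64):
    """Admit a candidate only if it is below the threshold against every stored signature."""
    sig = minhash_signature(shingles(code), num_hashes)
    return all(estimated_jaccard(sig, other) < threshold for other in corpus_sigs)


a = "nn.Conv2d(3, 16, 3)"
b = "self.pool = MaxPool(2)"
sig_a = minhash_signature(shingles(a))
print(is_novel(a, [sig_a]), is_novel(b, [sig_a]))
```

Both the threshold and the number of hash functions change which candidates are admitted: fewer hash functions make the similarity estimate noisier, and a lower threshold rejects more near-duplicates along with some genuinely different designs.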
minor comments (2)
- Abstract and methods: the term 'low-fidelity performance signals' is used without an explicit definition of the first-epoch training protocol (optimizer, batch size, number of epochs).
- Cross-dataset experiments: clarify whether the same prompt templates and LoRA hyperparameters were reused for CIFAR-100 and SVHN or whether any dataset-specific adjustments were introduced.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.
- Referee: Pipeline description (candidate labeling step): the decision to label architectures as high-performing on the basis of first-epoch accuracy alone lacks any reported correlation analysis with final test accuracy after full training. Because the feedback loop directly optimizes the LLM toward this proxy, the observed jumps (2.0% → 96.8% above 40%) may reflect selection for fast-converging rather than ultimately superior architectures, which is load-bearing for the claim of genuine architectural improvement.
  Authors: We appreciate the referee's concern that first-epoch accuracy serves only as a proxy. This choice was made to keep the closed loop computationally tractable. In the revised manuscript we will add a correlation analysis (Pearson and rank) between first-epoch and fully trained test accuracies on a held-out sample of 100 architectures drawn from multiple cycles, together with a brief discussion of the proxy's limitations.
  Revision: yes
- Referee: Results section and abstract: no error bars, standard deviations, or multi-seed statistics are supplied for the key metrics (valid rate, mean accuracy, high-accuracy fraction) across the 22 cycles, preventing assessment of whether the distributional shifts are statistically reliable or sensitive to random seeds and prompt phrasing.
  Authors: We agree that statistical characterization is needed. The original experiments used single runs per cycle because of the high cost of 22 full cycles. In revision we will (i) report standard deviations computed across three independent prompt-phrasing variants for the final three cycles and (ii) add error bars to the main figures; we will also note the single-seed limitation in the text.
  Revision: partial
- Referee: Novelty filtering subsection: the precise MinHash-Jaccard similarity threshold, number of hash functions, and any ablation on its retention of high-potential structures are omitted. Without these details it is impossible to verify that the filter removes only redundancy rather than discarding architectures that would have performed well under full evaluation.
  Authors: We apologize for the missing implementation details. The revised subsection will state the exact similarity threshold, the number of hash functions, the MinHash configuration, and will include a short ablation showing the fraction of high-accuracy (>40%) candidates retained at different thresholds.
  Revision: yes
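The correlation analysis promised in the first response could look roughly like the sketch below, which computes Pearson and Spearman (rank) coefficients between first-epoch and fully trained accuracies. The function names are illustrative, and the accuracy values are placeholders made up for the example, not measurements from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder accuracies for five hypothetical architectures.
first_epoch = [0.30, 0.42, 0.48, 0.51, 0.55]
fully_trained = [0.71, 0.80, 0.78, 0.88, 0.90]
r = pearson(first_epoch, fully_trained)
rho = spearman(first_epoch, fully_trained)
print(round(r, 3), round(rho, 3))
```

For a selection loop, the rank coefficient is arguably the more relevant of the two, since the pipeline only needs the proxy to order candidates correctly, not to predict final accuracy on an absolute scale.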
Circularity Check
No significant circularity; empirical results rely on external execution feedback
The paper presents an iterative empirical pipeline in which candidate architectures are generated, evaluated via independent low-fidelity execution signals on held-out data (first-epoch accuracy), filtered for novelty with MinHash-Jaccard, and used to create prompt-code pairs for LoRA fine-tuning. Reported improvements in valid generation rate, mean first-epoch accuracy, and fraction of high-accuracy candidates are measured on newly synthesized outputs after each cycle rather than derived from parameters fitted directly to the target metric or reduced by definition to the selection inputs. No equations, self-definitional claims, or load-bearing self-citations appear in the text that would make the distributional shift equivalent to the input data by construction. The process is self-contained against external benchmarks and falsifiable via the observed cross-dataset transfer results.
Axiom & Free-Parameter Ledger
free parameters (2)
- Number of supervised fine-tuning cycles
- Accuracy threshold for high-performing candidates
axioms (2)
- Domain assumption: low-fidelity performance signals (first-epoch accuracy) correlate sufficiently with final model quality to guide selection.
- Domain assumption: MinHash-Jaccard similarity effectively identifies and filters structural redundancy without excluding high-potential novel architectures.
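One way to probe the second assumption is to measure how many high-accuracy candidates survive the filter at different similarity thresholds. The sketch below assumes each candidate is summarized by its maximum estimated similarity to the existing corpus and its first-epoch accuracy; the function name, the tuple layout, and the sample numbers are hypothetical.

```python
def retention_by_threshold(candidates, thresholds, acc_cut=0.40):
    """Fraction of high-accuracy candidates admitted at each similarity threshold.

    candidates: list of (max_similarity_to_corpus, first_epoch_accuracy).
    A candidate is admitted when its similarity is strictly below the threshold.
    """
    high = [sim for sim, acc in candidates if acc > acc_cut]
    retained = {}
    for t in thresholds:
        kept = sum(sim < t for sim in high)
        retained[t] = kept / len(high) if high else 0.0
    return retained


# Placeholder candidates for illustration only.
candidates = [
    (0.95, 0.52),
    (0.60, 0.47),
    (0.85, 0.44),
    (0.30, 0.35),
    (0.70, 0.61),
]
result = retention_by_threshold(candidates, [0.5, 0.75, 0.9])
print(result)
```

A curve of this kind shows how aggressively a given threshold trades redundancy control against discarding candidates that already score well on the proxy.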
Forward citations
Cited by 2 Pith papers
- Delta-Based Neural Architecture Search: LLM Fine-Tuning via Code Diffs
  Fine-tuned 7B LLMs generating unified diffs for neural architecture refinement achieve 66-75% valid rates and 64-66% mean first-epoch accuracy, outperforming full-generation baselines by large margins while cutting ou...
- Closed-Loop LLM Discovery of Non-Standard Channel Priors in Vision Models
  Closed-loop LLM search with AST-generated examples discovers non-standard channel widths that improve vision model performance over initial architectures on CIFAR-100.