pith. machine review for the scientific record.

arxiv: 2605.03686 · v2 · submitted 2026-05-05 · 💻 cs.LG · cs.CV

From Code to Prediction: Fine-Tuning LLMs for Neural Network Performance Classification in NNGPT

Pith reviewed 2026-05-08 18:24 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords LLM fine-tuning · neural architecture performance prediction · cross-dataset classification · AutoML · code-based reasoning · prompt variants · LEMUR dataset · LoRA adaptation

The pith

Fine-tuned LLMs can predict, from a neural network's source code alone, which of two datasets the network will perform better on, reaching 80% peak accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can learn to judge neural network suitability across datasets by examining architecture source code instead of running experiments. It creates a classification task on the LEMUR dataset of standardized PyTorch models, where the LLM must decide which of two image datasets yields higher accuracy for a given network. Three prompt styles are compared: one that leaks normalized accuracies as a trivial baseline, one that supplies only dataset metadata, and one that supplies only the raw architecture code plus dataset names. Fine-tuning DeepSeek-Coder-7B with LoRA on the code-only prompt reaches 80% peak accuracy while the metadata prompt reaches 70%, with the code version showing more even performance across datasets that share similar properties. This matters for AutoML because it suggests models can extract predictive signals about inductive biases directly from code rather than relying on expensive training runs or surface-level dataset descriptions.
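
To make the task concrete, the following is a minimal sketch of how a single code-only training instance might be assembled from a LEMUR-style record. The field names and the prompt wording are illustrative assumptions for this review, not the paper's exact template.

```python
# Hedged sketch: one code-only training instance. Field names
# ("source_code", "dataset_a", ...) are assumptions, not LEMUR's schema.

def build_code_only_example(record: dict) -> dict:
    """Turn a LEMUR-style record into a binary classification example.

    `record` is assumed to hold the architecture source plus measured
    accuracies on two image datasets, e.g.:
      {"source_code": "...PyTorch nn.Module...",
       "dataset_a": "CIFAR-10", "accuracy_a": 0.91,
       "dataset_b": "SVHN",     "accuracy_b": 0.95}
    """
    prompt = (
        "Given the following PyTorch architecture, on which dataset "
        f"({record['dataset_a']} or {record['dataset_b']}) will it reach "
        "higher accuracy?\n\n"
        f"{record['source_code']}\n"
    )
    # The label is the winning dataset; the accuracies themselves stay out
    # of the prompt, unlike the trivial normalized-accuracy baseline.
    winner = "dataset_a" if record["accuracy_a"] > record["accuracy_b"] else "dataset_b"
    return {"prompt": prompt, "completion": record[winner]}
```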

Core claim

The central claim is that fine-tuning an LLM on standardized neural architecture implementations enables it to predict cross-dataset performance superiority from architecture code alone at up to 80% accuracy, which exceeds the 70% achieved by prompts that provide only dataset metadata properties, demonstrating that source code encodes richer discriminative information about dataset suitability than metadata alone.

What carries the argument

The three prompt configurations of increasing difficulty (normalized-accuracy baseline, metadata-enriched, code-only) applied to the LEMUR dataset of standardized PyTorch neural network implementations, with DeepSeek-Coder models of two sizes fine-tuned via LoRA.
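
A minimal sketch of what that fine-tuning setup could look like with Hugging Face transformers and peft. The checkpoint id, LoRA rank, scaling, and target modules are placeholder assumptions; as the ledger below notes, the paper does not report the LoRA configuration in the abstract.

```python
# Hedged sketch of LoRA fine-tuning. The checkpoint id and all LoRA
# hyperparameters below are assumptions, not values from the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "deepseek-ai/deepseek-coder-7b-instruct-v1.5"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_cfg = LoraConfig(
    r=16,                                 # rank: free parameter, unreported
    lora_alpha=32,                        # scaling: free parameter, unreported
    lora_dropout=0.05,                    # assumed
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
# Training would then run a standard causal-LM objective for 15 epochs,
# the horizon at which the code-only prompt peaks at 80%.
```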

If this is right

  • LLMs can be used in AutoML pipelines to recommend or filter architectures for specific datasets before any training occurs (a usage sketch follows this list).
  • Code-only prompts deliver more balanced predictions across datasets with overlapping characteristics than metadata prompts do.
  • Larger model capacity improves the ability to extract architectural reasoning signals, as evidenced by the performance gap between the 7B and 1.3B models.
  • Per-dataset results show that code prompts maintain utility even when datasets lack distinctive metadata properties.
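
A sketch of the first implication in practice: using the fine-tuned model to filter candidate architectures for a target dataset before any training run. Both helpers are hypothetical wrappers, and the prompt wording repeats the assumption from the earlier sketch.

```python
# Hedged sketch: pre-training architecture filtering with the tuned model.
# `predict_winner` and `filter_for_dataset` are hypothetical helpers.
import torch

def predict_winner(model, tokenizer, source_code: str,
                   dataset_a: str, dataset_b: str) -> str:
    """Ask the fine-tuned model which dataset the architecture favors."""
    prompt = (
        "Given the following PyTorch architecture, on which dataset "
        f"({dataset_a} or {dataset_b}) will it reach higher accuracy?\n\n"
        f"{source_code}\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=8)
    answer = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
    return dataset_a if dataset_a.lower() in answer.lower() else dataset_b

def filter_for_dataset(model, tokenizer, candidates: dict[str, str],
                       target: str, reference: str) -> list[str]:
    """Keep architecture names predicted to favor `target` over `reference`."""
    return [name for name, code in candidates.items()
            if predict_winner(model, tokenizer, code, target, reference) == target]
```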

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same code-based signal could be tested for predicting performance on entirely unseen datasets outside the original training distribution.
  • Architecture code may directly encode dataset-specific inductive biases that current metadata descriptions fail to capture.
  • Hybrid prompts combining code and metadata could be explored to further boost accuracy in practical AutoML settings.

Load-bearing premise

The LEMUR dataset implementations and reported accuracies contain no bugs or data leakage, and the prompt variants cleanly separate genuine code-based reasoning from memorization of training examples.

What would settle it

A new test set of architectures or datasets where the code-only prompt accuracy falls below the metadata prompt accuracy or where high accuracy persists even after labels are randomly shuffled during fine-tuning.
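
The label-shuffle control is straightforward to sketch: refit on examples whose winner labels have been randomly permuted, then check whether test accuracy collapses to chance (50%). This assumes examples carry a `completion` label as in the earlier sketch.

```python
# Hedged sketch of the label-shuffle control: if accuracy stays well above
# 50% after this permutation, the model is memorizing rather than reasoning.
import random

def shuffle_labels(examples: list[dict], seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    labels = [ex["completion"] for ex in examples]
    rng.shuffle(labels)  # break any real code-to-label association
    return [{**ex, "completion": label} for ex, label in zip(examples, labels)]
```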

Figures

Figures reproduced from arXiv: 2605.03686 by Dmitry Ignatov, Mahmoud Hanouneh, Radu Timofte.

Figure 1: Classification accuracy over epochs for all three prompt configurations.
Figure 2: Code-only prompt: DeepSeek-7B (peak 80%, epoch 15).
Figure 3: Metadata prompt: accuracy vs. training loss. Loss de…
Figure 4: Per-dataset aggregated accuracy across three configurations.
Original abstract

Automated Machine Learning (AutoML) frameworks increasingly leverage Large Language Models (LLMs) for tasks such as hyperparameter optimization and neural architecture code generation. However, current LLM-based approaches focus on generative outputs and evaluate them by training the produced artifacts. Whether LLMs can learn to reason about neural network performance across datasets remains underexplored. We present a classification task integrated into the NNGPT framework, in which a fine-tuned LLM predicts which of two image classification datasets a given neural network architecture achieves higher accuracy on. The task is built on the LEMUR dataset, which provides standardized PyTorch implementations with reproducible performance metrics. Three prompt configurations of increasing difficulty are evaluated: a normalized-accuracy baseline (trivially reaching 100%), a metadata-enriched prompt replacing accuracies with dataset properties, and a code-only prompt presenting only architecture source code and dataset names. Using DeepSeek-Coder-7B-Instruct fine-tuned with LoRA, the code-only prompt reaches 80% peak accuracy over 15 epochs, while the metadata prompt peaks at 70%. Per-dataset analysis reveals complementary strengths: metadata excels for datasets with distinctive properties (CelebAGender at 90.9%) but degrades for overlapping characteristics, whereas the code-only prompt shows more balanced performance. A comparison with DeepSeek-Coder-1.3B confirms that model capacity affects this form of architectural reasoning. The results establish that LLMs can be fine-tuned to predict cross-dataset suitability from neural network code, suggesting that architecture source code contains richer discriminative signal than dataset metadata alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a classification task in the NNGPT framework where an LLM is fine-tuned to predict, for a given neural network architecture, which of two image-classification datasets it will achieve higher accuracy on. Built on the LEMUR dataset of standardized PyTorch implementations, the work evaluates three prompt variants of increasing difficulty (normalized-accuracy baseline, metadata-enriched, and code-only) using DeepSeek-Coder-7B-Instruct with LoRA fine-tuning. The code-only prompt reaches 80% peak accuracy while the metadata prompt reaches 70%; per-dataset breakdowns and a comparison to the 1.3B variant are also reported.

Significance. If the central empirical claim holds after addressing split and leakage controls, the result would demonstrate that architecture source code carries richer performance-discriminative information than dataset metadata alone. The concrete accuracy numbers, explicit baseline comparison, per-dataset analysis, and model-capacity ablation constitute a clear empirical contribution that could inform code-aware AutoML pipelines. The work is purely empirical and therefore benefits from the reproducibility of the LEMUR dataset, but its significance is currently limited by missing experimental controls.

major comments (2)
  1. [Experimental setup and evaluation of prompt variants] The manuscript provides no description of the train/test split procedure (neither in the abstract nor in the implied experimental section). It is therefore impossible to determine whether the split is performed at the architecture level (holding out all prompt instances involving a given NN code) or at the individual example level. If the latter, both the code-only and metadata prompts can exploit memorized LEMUR architecture-dataset labels, rendering the reported 10-point gap (80% vs 70%) non-diagnostic of code-specific reasoning and undermining the central claim that 'architecture source code contains richer discriminative signal than dataset metadata alone'.
  2. [Results and per-dataset analysis] Results section (per-dataset analysis and overall accuracies): the reported peak accuracies (80% code-only, 70% metadata, 90.9% for CelebAGender) are given without the number of independent runs, standard deviations, or any statistical significance test. This makes it impossible to assess whether the observed differences are reliable or whether the per-dataset complementary strengths are robust.
minor comments (2)
  1. [Abstract] The abstract states that the normalized-accuracy baseline 'trivially reaches 100%' but does not clarify how this baseline is constructed or what exact test-set size and metric are used for all reported numbers.
  2. [Model-capacity comparison] The comparison with DeepSeek-Coder-1.3B is mentioned but lacks the corresponding accuracy numbers or confirmation that identical training hyperparameters and data splits were used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying two important omissions that affect the interpretability of our results. We will revise the manuscript to provide the missing experimental details and statistical reporting. Our point-by-point responses follow.

point-by-point responses
  1. Referee: The manuscript provides no description of the train/test split procedure (neither in the abstract nor in the implied experimental section). It is therefore impossible to determine whether the split is performed at the architecture level (holding out all prompt instances involving a given NN code) or at the individual example level. If the latter, both the code-only and metadata prompts can exploit memorized LEMUR architecture-dataset labels, rendering the reported 10-point gap (80% vs 70%) non-diagnostic of code-specific reasoning and undermining the central claim that 'architecture source code contains richer discriminative signal than dataset metadata alone'.

    Authors: We agree that the absence of a split description is a serious omission. The split was performed at the architecture level: we partitioned the LEMUR architectures into disjoint train and test sets (approximately 80/20) and assigned every prompt instance derived from a given architecture to the same partition. This ensures that no neural-network code seen at test time appears in any form during training. We will add a dedicated subsection describing the exact partitioning procedure, the number of architectures per split, and the total number of prompt examples, together with a short argument why this split eliminates the leakage the referee correctly identifies (a minimal split sketch follows these responses). revision: yes

  2. Referee: Results section (per-dataset analysis and overall accuracies): the reported peak accuracies (80% code-only, 70% metadata, 90.9% for CelebAGender) are given without the number of independent runs, standard deviations, or any statistical significance test. This makes it impossible to assess whether the observed differences are reliable or whether the per-dataset complementary strengths are robust.

    Authors: We acknowledge that single-run peak accuracies limit the strength of the claims. The numbers reported in the current manuscript are the highest validation accuracy attained during a single 15-epoch LoRA fine-tuning run. In the revision we will repeat each prompt-variant experiment with at least five independent random seeds, report mean accuracy and standard deviation across runs, and add a paired statistical test (e.g., Wilcoxon signed-rank) comparing the code-only and metadata conditions. The per-dataset breakdown will likewise be presented with variability measures (the proposed test is sketched after these responses). revision: yes
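
Two sketches of the fixes promised in these responses. First, the architecture-level split from response 1, using scikit-learn's GroupShuffleSplit; the `arch_id` key is an assumed way of tagging which architecture each prompt instance derives from.

```python
# Hedged sketch: group-aware 80/20 split so that all prompt instances from
# one architecture land on the same side. "arch_id" is an assumed key.
from sklearn.model_selection import GroupShuffleSplit

def architecture_level_split(examples: list[dict], seed: int = 0):
    groups = [ex["arch_id"] for ex in examples]
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, test_idx = next(splitter.split(examples, groups=groups))
    train = [examples[i] for i in train_idx]
    test = [examples[i] for i in test_idx]
    # Sanity check: no architecture appears in both partitions.
    assert not {ex["arch_id"] for ex in train} & {ex["arch_id"] for ex in test}
    return train, test
```

Second, the paired test from response 2, with SciPy; the per-seed accuracies below are placeholders, not results reported in the paper.

```python
# Hedged sketch of the proposed seed-level comparison. Accuracy values are
# invented placeholders purely to show the test's mechanics.
from scipy.stats import wilcoxon

code_only_acc = [0.80, 0.78, 0.79, 0.81, 0.77]  # hypothetical seeds
metadata_acc = [0.70, 0.69, 0.72, 0.68, 0.71]   # hypothetical seeds

stat, p_value = wilcoxon(code_only_acc, metadata_acc)  # paired, nonparametric
print(f"Wilcoxon statistic={stat}, p={p_value:.4f}")
```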

Circularity Check

0 steps flagged

No circularity: purely empirical fine-tuning experiment with measured accuracies

full rationale

The paper presents an empirical study involving fine-tuning LLMs (DeepSeek-Coder variants with LoRA) on the LEMUR dataset to perform a classification task predicting which of two datasets a neural network architecture performs better on, using three prompt variants. Reported results are peak accuracies (e.g., 80% for code-only prompt) obtained via standard training and evaluation over epochs on held-out data. No mathematical derivations, equations, or first-principles claims exist. No parameters are fitted such that predictions reduce to the fit by construction. No self-citations are invoked to justify uniqueness or load-bearing premises. The central claim rests on direct experimental measurements rather than any self-referential definitions or renamings. Potential concerns about data splits or memorization pertain to experimental validity, not circularity in the derivation chain.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the LEMUR dataset supplying accurate, reproducible accuracy labels and on the assumption that fine-tuning on the described prompts produces genuine generalization rather than dataset-specific memorization.

free parameters (2)
  • LoRA rank and scaling parameters
    Specific LoRA configuration values are chosen during fine-tuning but not reported in the abstract.
  • Training epochs
    Fine-tuning is performed for 15 epochs to reach the reported peak accuracy.
axioms (1)
  • domain assumption The LEMUR dataset contains standardized PyTorch implementations with reproducible performance metrics across multiple image classification datasets.
    This supplies both the network code and the ground-truth labels used to create the classification task.

pith-pipeline@v0.9.0 · 5593 in / 1416 out tokens · 29108 ms · 2026-05-08T18:24:05.175809+00:00 · methodology

Reference graph

Works this paper leans on

12 extracted references · 2 canonical work pages

  1. DeepSeek-AI. DeepSeek-V3 technical report, 2024.
  2. Hassan Eldeeb, Mohamed Maher, Radwa Elshawi, and Sherif Sakr. AutoMLBench: A comprehensive experimental evaluation of automated machine learning frameworks. Expert Systems with Applications, 243:122877, 2024.
  3. Pieter Gijsbers, Marcos L. P. Bueno, Stefan Coors, Erin LeDell, Sébastien Poirier, Janek Thomas, Bernd Bischl, and Joaquin Vanschoren. AMLB: An AutoML benchmark. Journal of Machine Learning Research, 25:1–65, 2024.
  4. Arash Torabi Goodarzi, Roman Kochnev, Waleed Khalid, Furui Qin, Tolgay Atinc Uzun, Yashkumar Sanjaybhai Dhameliya, Yash Kanubhai Kathiriya, Zofia Antonina Bentyn, Dmitry Ignatov, and Radu Timofte. LEMUR neural network dataset: Towards seamless AutoML. arXiv preprint arXiv:2504.10552, 2025.
  5. Waleed Khalid, Dmitry Ignatov, and Radu Timofte. From memorization to creativity: LLM as a designer of novel neural architectures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2026. To appear.
  6. Roman Kochnev, Arash Torabi Goodarzi, Zofia Antonina Bentyn, Dmitry Ignatov, and Radu Timofte. Optuna vs Code Llama: Are LLMs a new paradigm for hyperparameter tuning? In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 5664–5674, 2025.
  7. Roman Kochnev, Waleed Khalid, Tolgay Atinc Uzun, Xi Zhang, Yashkumar Sanjaybhai Dhameliya, Furui Qin, Chandini Vysyaraju, Raghuvir Duvvuri, Avi Goyal, Dmitry Ignatov, and Radu Timofte. NNGPT: Rethinking AutoML with large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2026. To appear.
  8. Siyi Liu, Chen Gao, and Yong Li. AgentHPO: Large language model agent for hyper-parameter optimization. arXiv preprint arXiv:2402.01881, 2025. Second Conference on Parsimony and Learning (CPAL 2025).
  9. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. CelebFaces Attributes (CelebA) Dataset. http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html, 2015. Accessed: 2026-04-22.
  10. Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
  11. ONNX Community. ONNX: Open Neural Network Exchange. https://github.com/onnx/onnx, 2019.
  12. Tolgay Atinc Uzun, Waleed Khalid, Saif U Din, Sai Revanth Mulukuledu, Akashdeep Singh, Chandini Vysyaraju, Raghuvir Duvvuri, Avi Goyal, Yashkumar Rajeshbhai Lukhi, Ahsan Hussain, Krunal Jesani, Usha Shrestha, Yash Mittal, Roman Kochnev, Pritam Kadam, Mohsin Ikram, Harsh Rameshbhai Moradiya, Alice Arslanian, Dmitry Ignatov, and Radu Timofte. LEMUR 2: Unlocking neural network diversity for AI…