Auto-FL-Research: Agentic Search for Federated Learning Algorithms

Andrew Feng; Chester Chen; Daguang Xu; Holger R. Roth; Peter Cnudde; Ziyue Xu

arxiv: 2607.01366 · v1 · pith:45KPEXC2new · submitted 2026-07-01 · 💻 cs.AI

Pith reviewed 2026-07-03 20:31 UTC · model grok-4.3

classification 💻 cs.AI

keywords: federated learning · agentic search · algorithm discovery · FLamby · LEAF · coding agents · recipe search · hyperparameter controls

The pith

A constrained coding-agent workflow searches for federated learning algorithmic recipes and separates recipe changes from scalar tuning and seed variance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Auto-FL-Research, a system in which coding agents propose and implement modifications to federated learning procedures such as server aggregation rules, client update schedules, local objectives, and model variants. Fixed task profiles hold constant the allowable mutations, resource budgets, communication contracts, and final evaluation methods, so that performance differences can be attributed to the proposed algorithmic edits. Across five healthcare FLamby tasks and six LEAF profiles, five-seed repeat evaluations record gains on four FLamby tasks and five of the six LEAF profiles, while same-budget scalar controls and held-out checks show that some improvements trace to fixed-recipe tuning or single-run artifacts rather than repeatable recipe mechanisms.

Core claim

Auto-FL-Research deploys coding agents to propose and implement candidate federated learning training algorithms inside profiles that fix the mutation surface, compute budget, communication contract, and model evaluation. Five-seed repeat testing supports gains on four FLamby tasks and five of six LEAF profiles. Same-budget baselines show that several gains arise from changes to the federated learning recipe itself, whereas others are recovered by scalar controls on the original recipe or fail to replicate under repeated or held-out evaluation. These outcomes make it possible to separate repeated algorithmic mechanisms from tuning effects and search-selected single-run artifacts.

What carries the argument

The constrained coding-agent workflow that proposes and edits code for server aggregation, client schedules, local objectives, and model variants while task profiles fix mutation surface, compute budget, communication contract, and final model evaluation.
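To make the idea of a task profile concrete, the sketch below encodes one as a frozen Python dataclass and checks a candidate against it. Everything here is an assumption for illustration: the field names, the glob-based mutation surface, counting the compute budget in FL rounds, and modelling the communication contract as the set of fields a client may send. The paper's actual profile format is not described in this summary.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical task profile; field names are illustrative, not AFR's schema."""
    name: str
    mutable_globs: tuple              # mutation surface: paths the agent may edit
    frozen_globs: tuple               # evaluation/communication code that must stay fixed
    max_rounds: int                   # compute budget, expressed here as FL rounds
    allowed_message_keys: frozenset   # communication contract: fields a client may send


def check_candidate(profile, edited_files, rounds, message_keys):
    """Return a list of violations; an empty list means the candidate is admissible."""
    violations = []
    for path in edited_files:
        if any(fnmatch(path, g) for g in profile.frozen_globs):
            violations.append(f"edits frozen file: {path}")
        elif not any(fnmatch(path, g) for g in profile.mutable_globs):
            violations.append(f"outside mutation surface: {path}")
    if rounds > profile.max_rounds:
        violations.append(f"budget exceeded: {rounds} > {profile.max_rounds} rounds")
    extra = set(message_keys) - profile.allowed_message_keys
    if extra:
        violations.append(f"breaks communication contract: sends {sorted(extra)}")
    return violations


profile = TaskProfile(
    name="example_profile",  # hypothetical profile, paths, and limits
    mutable_globs=("src/aggregators/*.py", "src/client_update.py"),
    frozen_globs=("src/eval/*", "src/comm/*"),
    max_rounds=50,
    allowed_message_keys=frozenset({"weights", "num_samples"}),
)
print(check_candidate(profile, ["src/aggregators/median.py"], 50, ["weights", "num_samples"]))  # []
```

Rejecting a candidate before it runs, rather than penalizing it afterwards, keeps evaluation code out of the agent's reach, which is the property the attribution argument depends on.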

If this is right

Gains are recorded on four FLamby tasks and five of six LEAF profiles after five-seed repeat evaluations.
Several gains correspond to changes in the federated learning recipe rather than scalar adjustments within a fixed recipe.
Seed-sensitive candidates and search-selected failure cases appear during evaluation.
Agent-generated candidates can be grouped into repeated mechanisms, fixed-surface tuning effects, and single-run artifacts.
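One way to picture the three-way grouping is a small decision rule over a candidate's repeat-seed gains and the gain reached by a same-budget scalar control. The rule, the threshold `min_gain`, and the order of the checks are assumptions, not the paper's stated procedure.

```python
from statistics import mean


def classify_candidate(repeat_gains, scalar_control_gain, min_gain=0.0):
    """Bucket an agent candidate using a hypothetical decision rule.

    repeat_gains: gain over the seed-matched baseline on each repeat seed.
    scalar_control_gain: best mean gain a same-budget scalar search reached
        on the unmodified recipe.
    min_gain: threshold the mean repeat gain must exceed to count as real.
    """
    repeated = mean(repeat_gains)
    if repeated <= min_gain:
        # The gain seen during search does not survive repeat seeds.
        return "single-run artifact"
    if scalar_control_gain >= repeated:
        # Tuning scalars of the original recipe recovers the gain.
        return "fixed-surface tuning effect"
    return "repeated FL mechanism"


print(classify_candidate([0.021, 0.018, 0.025, 0.019, 0.022], scalar_control_gain=0.004))
# repeated FL mechanism
```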

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same workflow could be applied to other machine learning settings where many small algorithmic choices interact with training dynamics.
Tighter integration of scalar control baselines inside the agent loop might further automate the separation of recipe novelty from tuning.
Extending profiles to vary communication constraints could expose additional classes of non-robust agent proposals.

Load-bearing premise

Task profiles can be configured to fix the mutation surface, compute budget, communication contract, and evaluation protocol tightly enough to isolate algorithmic recipe changes cleanly from scalar tuning effects and seed variance.

What would settle it

An experiment in which scalar hyperparameter search on the original fixed recipes matches or exceeds every agent-proposed gain across the same tasks and five-seed repeats, or in which no gains survive on a fresh set of held-out tasks.
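The two refutation conditions can be written as a single predicate over per-task gains. This is a hypothetical helper; the dictionary layout and the use of mean five-seed gains per task are assumptions.

```python
def claim_would_be_refuted(agent_gains, scalar_gains, heldout_gains):
    """Return True if either refutation condition holds (gains keyed by task name).

    agent_gains, scalar_gains: mean five-seed gain per task for the agent's best
        candidate and for the same-budget scalar control on the original recipe.
    heldout_gains: gain of each selected candidate on fresh held-out tasks.
    """
    # Condition 1: scalar tuning matches or beats the agent on every task.
    scalar_matches_everything = all(
        scalar_gains[task] >= gain for task, gain in agent_gains.items()
    )
    # Condition 2: no positive gain survives on the held-out tasks.
    nothing_survives_heldout = all(gain <= 0 for gain in heldout_gains.values())
    return scalar_matches_everything or nothing_survives_heldout
```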

Figures

Figures reproduced from arXiv: 2607.01366 by Andrew Feng, Chester Chen, Daguang Xu, Holger R. Roth, Peter Cnudde, Ziyue Xu.

**Figure 2.** AFR loop and evaluation coverage. Left: the agent starts from research intent, program.md, an active task profile, a fixed budget, and a fixed mutation surface. Candidate NVFlare runs append results to results.tsv; reviewed batches are kept, narrowed, discarded, or used to select the next candidate. Right: stylized benchmark modalities from FLamby and grouped-client LEAF profiles evaluated through the same…
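The logging step in Figure 2 can be sketched with the standard-library `csv` module. The column names are assumptions loosely based on what the abstract says each campaign records (scores, runtime, edited files, failure status); the real `results.tsv` layout is not given here.

```python
import csv
import os

# Assumed columns; the paper's actual results.tsv schema may differ.
FIELDS = ["candidate_id", "score", "runtime_s", "edited_files", "status"]


def append_result(path, row):
    """Append one candidate run to a tab-separated log, writing a header first if needed."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, delimiter="\t")
        if new_file:
            writer.writeheader()
        writer.writerow(row)


def best_completed(path):
    """Return the highest-scoring row with status 'ok', or None if there is none."""
    with open(path, newline="") as f:
        rows = [r for r in csv.DictReader(f, delimiter="\t") if r["status"] == "ok"]
    return max(rows, key=lambda r: float(r["score"]), default=None)
```

Failed runs stay in the log, so their failure status remains auditable, but they are never picked as the next starting point.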

**Figure 3.** Mean relative gains over matched repeated baselines across the two benchmark suites (five-seed repeat).
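Figure 3 reports mean relative gains over matched repeated baselines. A plausible reading, assumed here rather than taken from the paper, is the average over seeds of the per-seed relative improvement, with the sign flipped for metrics where lower is better.

```python
from statistics import mean


def mean_relative_gain(candidate, baseline, higher_is_better=True):
    """Average per-seed relative gain of a candidate over a seed-matched baseline.

    candidate, baseline: one final score per seed, listed in the same seed order.
    """
    if len(candidate) != len(baseline):
        raise ValueError("need one baseline score per candidate seed")
    if any(b == 0 for b in baseline):
        raise ValueError("relative gain is undefined for a zero baseline score")
    sign = 1.0 if higher_is_better else -1.0
    return mean(sign * (c - b) / abs(b) for c, b in zip(candidate, baseline))
```

Matching seeds pairs each candidate run with the baseline run that shares its data order and initialization, so seed noise common to both largely cancels in each ratio.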

**Figure 4.** FEMNIST ablation gains over matched baselines.

**Figure 5.** Validation-selected and held-out-reported check.
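The idea behind Figure 5's check is that the held-out score of a candidate chosen on validation data is never used for the choice itself. A minimal sketch, assuming each candidate carries one validation score and one held-out score (higher is better); the scores are toy numbers, not results from the paper:

```python
def select_then_report(candidates):
    """Select on validation only, then report the held-out score of the winner.

    candidates: name -> (validation_score, heldout_score), higher is better.
    Returns (name, validation_score, heldout_score, optimism_gap).
    """
    best = max(candidates, key=lambda name: candidates[name][0])
    val, heldout = candidates[best]
    return best, val, heldout, val - heldout


scores = {"a": (0.80, 0.78), "b": (0.85, 0.74), "c": (0.79, 0.79)}  # toy numbers
name, val, heldout, gap = select_then_report(scores)
print(name, val, heldout)  # b 0.85 0.74
```

In the toy call, candidate `b` wins on validation but has the lowest held-out score, which is the kind of search-selected artifact the check is meant to expose.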

Original abstract

Federated learning (FL) research often depends on many small but consequential algorithmic choices: optimizer variants, server aggregation rules, local training schedules, normalization, regularization, and model architecture. These choices are expensive to explore manually and difficult to compare fairly when candidate changes can also alter the FL training or evaluation path. In this work, we present Auto-FL-Research (AFR), a constrained coding-agent workflow for FL algorithmic recipe search. Agents may propose and implement candidate training algorithms, including server aggregation rules, client update schedules, local objectives, and registered model variants, while task profiles fix the mutation surface, compute budget, communication contract, and final model evaluation. Each campaign records candidate scores, runtime, edited files, artifacts, and failure status. We evaluate AFR on five healthcare cross-silo FLamby tasks and on grouped-client profiles for the five fixed LEAF datasets plus the LEAF synthetic task. Five-seed repeat evaluations support gains on four FLamby tasks and five of six LEAF profiles, while also exposing seed-sensitive and search-selected failure cases. Same-budget controls show that several gains correspond to FL-recipe changes, whereas other improvements are recovered by fixed-surface scalar controls or fail under repeat or held-out evaluation. These mixed outcomes are part of the contribution: they show how agent-generated candidates can be separated into repeated FL mechanisms, fixed-surface tuning effects, and selected single-run artifacts.
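A server aggregation rule is one of the components the agents may replace. For orientation, the sketch below shows the standard FedAvg rule (a sample-size-weighted average of client parameters, McMahan et al., 2017) next to coordinate-wise median, a robust alternative studied by Yin et al. (2018). Parameters are plain Python lists for brevity; this is not NVFlare's aggregator API, and neither rule is claimed to be one the agents produced.

```python
from statistics import median


def fedavg(client_params, num_samples):
    """FedAvg: average client parameter vectors weighted by local sample counts."""
    total = sum(num_samples)
    return [
        sum(p[i] * n for p, n in zip(client_params, num_samples)) / total
        for i in range(len(client_params[0]))
    ]


def coordinate_median(client_params, num_samples=None):
    """Coordinate-wise median; sample counts are ignored by design."""
    return [median(column) for column in zip(*client_params)]


clients = [[1.0, 2.0], [3.0, 6.0], [100.0, -1.0]]  # third client is an outlier
sizes = [10, 30, 10]
print(fedavg(clients, sizes))             # [22.0, 3.8]
print(coordinate_median(clients, sizes))  # [3.0, 2.0]
```

Because both functions share a signature, swapping one for the other is a one-line change to the recipe, which is the granularity at which an agent's edit to the server rule can be attributed.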

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's core is a constrained agentic workflow for searching FL recipes with same-budget controls and transparent mixed results, but the enforcement of task profiles to separate recipe changes from scalar tuning needs stronger demonstration.

The main thing to know is that this work applies coding agents to propose and implement changes to federated learning setups like server aggregation and local objectives, while using task profiles to fix the mutation surface, budget, and evaluation. They test on FLamby healthcare tasks and LEAF profiles, run five-seed repeats, and compare against same-budget scalar controls.

What the paper does well is report the outcomes honestly. Gains appear on four of five FLamby tasks and five of six LEAF profiles, but the text flags seed sensitivity, cases where scalar tuning recovers the improvement, and failures under repeat or held-out checks. Recording edited files and failure status is a practical addition that lets readers see what the agents actually did.

The soft spot is the constraint mechanism itself. The claim that gains come from algorithmic recipe changes rather than tuning or variance rests on the profiles strictly limiting what agents can alter. The abstract asserts that this separation works, but the stress-test concern stands: without explicit lists of enforced constraints, or examples showing how communication contracts and evaluation paths stay fixed, it is hard to confirm that the controls are tight enough. This is not a fatal issue, but it is load-bearing for the main argument.

The paper is aimed at FL researchers who want systematic ways to explore algorithmic variants on standard benchmarks. It deserves peer review because the workflow is concrete, the evaluation uses real datasets with repeats, and the mixed results are presented without overclaim.

Referee Report

2 major / 2 minor

Summary. The paper introduces Auto-FL-Research (AFR), a constrained coding-agent workflow for searching federated learning algorithmic recipes (server aggregation, client schedules, local objectives, model variants). Task profiles are asserted to fix the mutation surface, compute budget, communication contract, and evaluation; agents propose and implement candidates, with campaigns logging scores, runtimes, edits, and failures. Five-seed repeat evaluations on five FLamby healthcare tasks and six LEAF profiles (five fixed datasets plus synthetic) report gains on four FLamby tasks and five of the six LEAF profiles, while same-budget scalar controls and held-out checks separate recipe-driven gains from tuning effects or single-run artifacts; mixed outcomes (seed sensitivity, search-selected failures) are presented as part of the contribution.

Significance. If the claimed separation between FL-recipe changes and scalar tuning holds under the stated controls, the work supplies a reproducible agentic search method plus transparent negative results that could reduce manual trial-and-error in FL algorithm design. The explicit reporting of seed-sensitive and control-recovered cases is a strength that supports falsifiability.

major comments (2)

[Abstract] Abstract (evaluation protocol): the central attribution of gains to 'FL-recipe changes' rather than scalar tuning or seed variance rests on the claim that task profiles strictly fix the mutation surface, communication contract, and evaluation path. No mechanism, constraint list, or worked example is supplied showing how the surface is enforced or how the same-budget scalar baseline is constructed to match the agent's search space exactly; without this, the separation cannot be verified and the reported gains remain vulnerable to post-hoc selection.
[Abstract] Abstract (five-seed repeats and controls): while five-seed repeats and same-budget controls are described, the abstract notes that 'several gains correspond to FL-recipe changes, whereas other improvements are recovered by fixed-surface scalar controls or fail under repeat.' This mixed outcome is load-bearing for the claim that AFR isolates algorithmic novelty; the manuscript must quantify how many of the reported gains survive both the scalar control and the held-out evaluation, with per-task breakdowns.

minor comments (2)

[Abstract] The abstract refers to 'grouped-client profiles for the five fixed LEAF datasets plus the LEAF synthetic task' without defining the grouping criteria or how they differ from standard LEAF partitions.
[Abstract] Terminology: 'fixed-surface scalar controls' and 'mutation surface' are used without an explicit definition or reference to the section that operationalizes them.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments emphasizing verifiability of the evaluation protocol and explicit quantification of results. We address each major comment below and note the revisions that will be incorporated.

Referee: [Abstract] Abstract (evaluation protocol): the central attribution of gains to 'FL-recipe changes' rather than scalar tuning or seed variance rests on the claim that task profiles strictly fix the mutation surface, communication contract, and evaluation path. No mechanism, constraint list, or worked example is supplied showing how the surface is enforced or how the same-budget scalar baseline is constructed to match the agent's search space exactly; without this, the separation cannot be verified and the reported gains remain vulnerable to post-hoc selection.

Authors: We agree that the abstract does not supply these details. The manuscript body defines task profiles with explicit constraint lists (fixed vs. mutable components) and describes same-budget scalar baselines as restricting mutations to scalar hyperparameters only. To make the separation immediately verifiable from the abstract alone, we will add a short description of the enforcement mechanism together with a worked example of baseline construction. revision: partial
Referee: [Abstract] Abstract (five-seed repeats and controls): while five-seed repeats and same-budget controls are described, the abstract notes that 'several gains correspond to FL-recipe changes, whereas other improvements are recovered by fixed-surface scalar controls or fail under repeat.' This mixed outcome is load-bearing for the claim that AFR isolates algorithmic novelty; the manuscript must quantify how many of the reported gains survive both the scalar control and the held-out evaluation, with per-task breakdowns.

Authors: The referee correctly notes that the abstract uses the term 'several' without counts or per-task detail. The results section and supplements already contain the per-task data on which gains survive the scalar controls and held-out checks. We will revise the abstract to state the exact counts and include a concise per-task breakdown of surviving gains. revision: yes
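The first response describes same-budget scalar baselines as restricting mutations to scalar hyperparameters only. A minimal sketch of such a control, assuming that "same budget" means the same number of training runs as the agent's candidates and using random search as a swapped-in optimizer; `train_and_score`, the knob names `lr` and `mu`, and the toy objective are hypothetical:

```python
import math
import random


def scalar_control(train_and_score, space, n_trials, seed=0):
    """Random search over scalar knobs of the fixed, unmodified recipe.

    train_and_score: hypothetical callable(config) -> score, higher is better.
    space: name -> (low, high, log_scale); scalar hyperparameters only, no code edits.
    n_trials: set equal to the agent's number of candidate runs to match budgets.
    """
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {}
        for name, (low, high, log_scale) in space.items():
            if log_scale:
                cfg[name] = 10 ** rng.uniform(math.log10(low), math.log10(high))
            else:
                cfg[name] = rng.uniform(low, high)
        score = train_and_score(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


# Toy objective standing in for a real FL run: best at lr = 0.01, mu = 0.1.
def toy(cfg):
    return -(math.log10(cfg["lr"]) + 2) ** 2 - (cfg["mu"] - 0.1) ** 2


cfg, score = scalar_control(toy, {"lr": (1e-4, 1.0, True), "mu": (0.0, 1.0, False)}, n_trials=40)
print(cfg, score)
```

Fixing the seed makes the control reproducible, and matching `n_trials` to the agent's run count is what lets a recovered gain be read as a tuning effect rather than a recipe change.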

Circularity Check

0 steps flagged

No significant circularity; evaluation uses external benchmarks and explicit controls

The paper describes an empirical agent-based search workflow evaluated on FLamby and LEAF benchmarks. It reports five-seed repeats and same-budget scalar controls to separate recipe changes from tuning effects. No equations, fitted parameters, or self-citations are presented as load-bearing derivations that reduce the reported gains to quantities defined inside the search itself. The central claims rest on external task profiles and held-out evaluations rather than self-referential constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, no explicit free parameters, and no invented entities; the workflow implicitly assumes that the mutation surface can be constrained without introducing new unmeasured biases, but this cannot be audited from the given text.

pith-pipeline@v0.9.1-grok · 5794 in / 1320 out tokens · 19570 ms · 2026-07-03T20:31:24.737711+00:00 · methodology
