pith. machine review for the scientific record.

arxiv: 2604.14455 · v1 · submitted 2026-04-15 · 💻 cs.AI

Recognition: unknown

AIBuildAI: An AI Agent for Automatically Building AI Models

Li Zhang, Peijia Qin, Pengtao Xie, Qi Cao, Ruiyi Zhang


Pith reviewed 2026-05-10 12:45 UTC · model grok-4.3

classification 💻 cs.AI
keywords AI agent · automated model building · hierarchical agents · LLM agents · MLE-Bench · AutoML · model development

The pith

A hierarchical AI agent system automatically builds complete models from task descriptions and data, achieving first place on a benchmark of realistic development tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AIBuildAI as an agent that receives only a task description and training data, then produces a working AI model through fully automated steps. It structures the process with one manager agent directing three sub-agents that separately handle modeling choices, code creation and fixes, and training adjustments. This goes beyond conventional AutoML tools, which operate only inside fixed model families and hyperparameter spaces. The system is tested on MLE-Bench, a collection of Kaggle-style problems covering image, text, time-series, and tabular data, where it records the top score and matches the results of experienced human engineers. The central demonstration is that coordinated LLM agents can carry out the entire model-development lifecycle without ongoing human guidance.

Core claim

AIBuildAI uses a manager agent to coordinate a designer sub-agent for choosing modeling strategies, a coder sub-agent for writing and debugging code, and a tuner sub-agent for training and performance refinement. Each sub-agent is an LLM-based system that performs multi-step reasoning and tool use. On the MLE-Bench benchmark of diverse real-world tasks, this architecture delivers a 63.1% medal rate, the highest among tested methods and comparable to the output of skilled human practitioners.
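The 63.1% figure is a benchmark aggregate over per-task outcomes. As a minimal illustration of how such a rate is computed (MLE-Bench itself defines the per-task medal thresholds; the tasks and results below are hypothetical):

```python
# Hypothetical illustration of a medal-rate aggregate; MLE-Bench defines
# the actual per-task medal thresholds from Kaggle leaderboards.
def medal_rate(results):
    """results: list of dicts like {"task": str, "medal": bool}."""
    if not results:
        return 0.0
    return sum(r["medal"] for r in results) / len(results)

runs = [
    {"task": "image-cls", "medal": True},
    {"task": "tabular-reg", "medal": False},
    {"task": "text-cls", "medal": True},
]
print(f"{medal_rate(runs):.1%}")  # 66.7%
```

The metric is a simple fraction of tasks medaled, which is why the unreported trial counts and retry budgets flagged below matter so much to its interpretation.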

What carries the argument

Hierarchical agent architecture in which a manager coordinates three specialized LLM agents (designer, coder, tuner) that together execute architecture selection, code implementation, debugging, and optimization.
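The paper's orchestration code is not reproduced here; as one minimal sketch of the manager-plus-specialists pattern described above (every class and method name is hypothetical, and real sub-agents would wrap LLM calls and tool use), the control flow might look like:

```python
# Minimal sketch of the manager/sub-agent hierarchy described in the paper.
# All names are hypothetical; the real system wraps LLM calls and tool use.
class SubAgent:
    def __init__(self, role):
        self.role = role

    def run(self, task, state):
        # Stand-in for an LLM-driven multi-step reasoning + tool-use loop.
        state[self.role] = f"{self.role} output for {task}"
        return state

class Manager:
    def __init__(self):
        self.designer = SubAgent("design")
        self.coder = SubAgent("code")
        self.tuner = SubAgent("tuning")

    def build_model(self, task_description, data):
        state = {"data": data}
        # The manager sequences the three specialists; in practice it would
        # also route failures back to the responsible sub-agent.
        for agent in (self.designer, self.coder, self.tuner):
            state = agent.run(task_description, state)
        return state

artifact = Manager().build_model("classify images", data="train.csv")
print(sorted(artifact))  # ['code', 'data', 'design', 'tuning']
```

The sketch only fixes the delegation order; the substantive open questions (communication protocols, error recovery) are exactly what the referee report asks the authors to specify.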

If this is right

  • End-to-end automation becomes feasible for the full AI model development process from specification to deployable artifact.
  • Performance on realistic tasks reaches levels previously associated only with experienced human engineers.
  • The approach surpasses existing AutoML systems by handling open-ended architecture design and implementation steps.
  • AI model creation could become accessible with far less specialized expertise than is currently required.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same coordination pattern might transfer to other multi-stage engineering workflows that currently demand teams of specialists.
  • Further reliability gains in the underlying language models could reduce the frequency of failures on harder or less common data modalities.
  • Combining the agent with existing code repositories or external APIs might shorten the remaining manual review steps even more.

Load-bearing premise

LLM-based agents can execute long sequences of architecture design, coding, debugging, and performance tuning across different data types without repeated human corrections or breakdowns.

What would settle it

A follow-up evaluation on additional MLE-Bench tasks or similar problems in which AIBuildAI produces no competitive model or requires substantial human fixes to reach working performance.

Original abstract

AI models underpin modern intelligent systems, driving advances across science, medicine, finance, and technology. Yet developing high-performing AI models remains a labor-intensive process that requires expert practitioners to iteratively design architectures, engineer representations, implement training pipelines and refine approaches through empirical evaluation. Existing AutoML methods partially alleviate this burden but remain limited to narrow aspects such as hyperparameter optimization and model selection within predefined search spaces, leaving the full development lifecycle largely dependent on human expertise. To address this gap, we introduce AIBuildAI, an AI agent that automatically builds AI models from a task description and training data. AIBuildAI adopts a hierarchical agent architecture in which a manager agent coordinates three specialized sub-agents: a designer for modeling strategy, a coder for implementation and debugging, and a tuner for training and performance optimization. Each sub-agent is itself a large language model (LLM) based agent capable of multi-step reasoning and tool use, enabling end-to-end automation of the AI model development process that goes beyond the scope of existing AutoML approaches. We evaluate AIBuildAI on MLE-Bench, a benchmark of realistic Kaggle-style AI development tasks spanning visual, textual, time-series and tabular modalities. AIBuildAI ranks first on MLE-Bench with a medal rate of 63.1%, outperforming all existing baseline methods and matching the capability of highly experienced AI engineers. These results demonstrate that hierarchical agent systems can automate the full AI model development process from task specification to deployable model, suggesting a pathway toward broadly accessible AI development with minimal human intervention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces AIBuildAI, a hierarchical LLM-based agent system in which a manager agent coordinates three specialized sub-agents (designer for modeling strategy, coder for implementation/debugging, and tuner for optimization) to automate the full AI model development pipeline from task description and data to a deployable model. It evaluates the system on MLE-Bench, a collection of realistic Kaggle-style tasks across visual, textual, time-series, and tabular modalities, and claims a first-place ranking with a 63.1% medal rate that outperforms existing AutoML and agent baselines while matching experienced human engineers.

Significance. If the performance claims can be substantiated with complete experimental details, this would constitute a notable advance in automated machine learning. The hierarchical multi-agent design extends beyond conventional AutoML (limited to hyperparameter search within fixed spaces) by attempting end-to-end automation of architecture design, coding, debugging, and tuning. Successful validation would support the broader hypothesis that LLM agents can reliably handle complex, multi-step engineering workflows across modalities with minimal human oversight.

major comments (3)
  1. [Abstract and Experiments section] The headline result of a 63.1% medal rate and first-place ranking on MLE-Bench is presented without any description of the base LLMs powering the manager/designer/coder/tuner agents, the number of tasks attempted versus completed, the number of independent trials per task, retry budgets, failure-handling protocols, or the precise definition of a 'medal' used by the benchmark. These omissions make it impossible to determine whether the reported superiority is attributable to the hierarchical architecture or to unreported implementation choices, directly undermining the central empirical claim.
  2. [Method section] The paper asserts that the sub-agents enable 'end-to-end automation ... without human intervention,' yet provides no concrete specification of inter-agent communication protocols, tool-use interfaces, state sharing, or error-recovery mechanisms. Without these details, the weakest assumption—that LLM agents can reliably execute the full multi-step pipeline across diverse modalities—cannot be evaluated, leaving the architectural contribution untestable.
  3. [Experiments section] The claim that AIBuildAI 'outperforms all existing baseline methods' is unsupported by any description of baseline re-implementations, statistical significance tests, variance across runs, or ablation studies isolating the contribution of the manager or individual sub-agents. This absence prevents assessment of whether the medal-rate advantage is robust or confounded by differences in underlying model capabilities.
minor comments (2)
  1. [Abstract] The abstract states that results 'match the capability of highly experienced AI engineers' without any quantitative human baseline or side-by-side comparison; this phrasing should be qualified or removed unless supported by data in the full evaluation.
  2. [Method section] A system diagram or pseudocode illustrating the exact workflow and handoff between manager and sub-agents would improve clarity of the hierarchical architecture.
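The significance testing requested in major comment 3 could take a simple paired form. A sketch, assuming per-task medal indicators for the agent and a baseline (all numbers hypothetical, not from the paper):

```python
# Illustrative paired bootstrap for a medal-rate difference; not taken from
# the paper, which reports no significance tests or variance across runs.
import random

def bootstrap_diff_ci(a, b, n_boot=10000, seed=0):
    """95% CI for the medal-rate difference between paired task outcomes."""
    rng = random.Random(seed)
    n = len(a)
    diffs = []
    for _ in range(n_boot):
        # Resample tasks with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        da = sum(a[i] for i in idx) / n
        db = sum(b[i] for i in idx) / n
        diffs.append(da - db)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]

agent = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # hypothetical medal indicators
base  = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
lo, hi = bootstrap_diff_ci(agent, base)
print((lo, hi))
```

A confidence interval that excludes zero would support the superiority claim; with LLM stochasticity, the per-task indicators should themselves be averaged over multiple independent trials first.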

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which identify important gaps in experimental transparency and methodological specification. We agree that addressing these points will strengthen the manuscript's reproducibility and allow for a more rigorous evaluation of the hierarchical agent architecture. We respond to each major comment below and commit to the indicated revisions.

Point-by-point responses
  1. Referee: [Abstract and Experiments section] The headline result of a 63.1% medal rate and first-place ranking on MLE-Bench is presented without any description of the base LLMs powering the manager/designer/coder/tuner agents, the number of tasks attempted versus completed, the number of independent trials per task, retry budgets, failure-handling protocols, or the precise definition of a 'medal' used by the benchmark. These omissions make it impossible to determine whether the reported superiority is attributable to the hierarchical architecture or to unreported implementation choices, directly undermining the central empirical claim.

    Authors: We agree that these details are necessary to substantiate the central empirical claims and to distinguish the contribution of the architecture from implementation specifics. In the revised manuscript we will add a dedicated 'Experimental Setup' subsection that explicitly states the base LLMs used for the manager and each sub-agent, the total number of MLE-Bench tasks attempted and completed, the number of independent trials per task, the retry budgets and failure-handling protocols, and the precise definition of a 'medal' as specified by the benchmark. These additions will be placed in both the Experiments section and referenced from the abstract where appropriate. revision: yes

  2. Referee: [Method section] The paper asserts that the sub-agents enable 'end-to-end automation ... without human intervention,' yet provides no concrete specification of inter-agent communication protocols, tool-use interfaces, state sharing, or error-recovery mechanisms. Without these details, the weakest assumption—that LLM agents can reliably execute the full multi-step pipeline across diverse modalities—cannot be evaluated, leaving the architectural contribution untestable.

    Authors: We acknowledge that the current Method section lacks the level of implementation detail required to make the system testable and reproducible. In the revision we will expand the hierarchical architecture description with a new subsection that specifies the inter-agent communication protocols (including message formats and delegation procedures), tool-use interfaces (code execution environment, data loaders, and evaluation tools), state sharing mechanisms (shared workspace and conversation history), and error-recovery mechanisms (retry logic, fallback strategies, and escalation to the manager). These concrete specifications will allow readers to assess the reliability of the end-to-end automation claim. revision: yes

  3. Referee: [Experiments section] The claim that AIBuildAI 'outperforms all existing baseline methods' is unsupported by any description of baseline re-implementations, statistical significance tests, variance across runs, or ablation studies isolating the contribution of the manager or individual sub-agents. This absence prevents assessment of whether the medal-rate advantage is robust or confounded by differences in underlying model capabilities.

    Authors: We will revise the Experiments section to include detailed descriptions of all baseline methods, specifying whether they were re-implemented from original code or taken from published results and noting any adaptations required for fair comparison. We will also report statistical significance tests, discuss observed variance across runs (accounting for LLM stochasticity), and present ablation studies that isolate the manager agent and each sub-agent. Where full multi-run variance or exhaustive ablations were not performed in the original experiments, we will explicitly note this as a limitation and provide the available partial results. revision: partial
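The retry and escalation behavior the authors promise to document could follow a standard pattern. A minimal sketch, with hypothetical names and a simulated flaky step (this is not the paper's implementation):

```python
# Hypothetical sketch of the promised error-recovery mechanism: bounded
# retries on a sub-agent step, then escalation to the manager agent.
class StepFailed(Exception):
    pass

def run_with_retries(step, max_retries=3):
    """Run a pipeline step, retrying on failure before escalating."""
    for attempt in range(1, max_retries + 1):
        try:
            return step(attempt)
        except StepFailed as err:
            last = err
    # Retry budget exhausted: escalate instead of looping forever.
    raise RuntimeError(f"escalating to manager after {max_retries} attempts: {last}")

def flaky_step(attempt):
    # Simulated coder sub-agent step that only succeeds on the second try.
    if attempt < 2:
        raise StepFailed("unit test failed")
    return "patched code"

print(run_with_retries(flaky_step))  # patched code
```

Whatever the real mechanism is, specifying its retry budget and escalation path is what makes the "without human intervention" claim auditable.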

Circularity Check

0 steps flagged

No circularity; empirical benchmark result stands independent of internal definitions

Full rationale

The paper reports an empirical outcome (63.1% medal rate on external MLE-Bench) obtained by running the described hierarchical LLM agent on a fixed public benchmark. No equations, fitted parameters, or first-principles derivations are present; the central claim is a measured performance number on tasks whose success criteria and data are defined outside the paper. No self-citations are invoked to justify uniqueness or to close any logical loop, and the architecture description does not redefine or presuppose the reported metric. The result is therefore self-contained against the external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim depends on the unproven assumption that current LLMs can serve as reliable autonomous agents for software engineering tasks and that MLE-Bench tasks are representative of real deployment scenarios. The AIBuildAI system itself is the primary new entity introduced.

axioms (2)
  • domain assumption Large language models can perform reliable multi-step reasoning, tool use, code generation, and iterative debugging for AI model development tasks.
    Invoked when describing the capabilities of the designer, coder, and tuner sub-agents.
  • domain assumption The MLE-Bench benchmark tasks and evaluation protocol accurately reflect the capabilities of experienced human AI engineers.
    Used to interpret the 63.1% medal rate as matching human expert performance.
invented entities (1)
  • AIBuildAI hierarchical agent system · no independent evidence
    purpose: To coordinate design, coding, and tuning sub-agents for end-to-end AI model construction
    The system is the primary contribution; no independent external evidence for its reliability is provided in the abstract.

pith-pipeline@v0.9.0 · 5592 in / 1481 out tokens · 54498 ms · 2026-05-10T12:45:07.400103+00:00 · methodology

