pith. machine review for the scientific record.

arXiv: 2605.08305 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI · cs.CL · cs.PF · cs.SE

Recognition: 2 Lean theorem links

LLMSYS-HPOBench: Hyperparameter Optimization Benchmark Suite for Real-World LLM Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:18 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.PF · cs.SE
keywords hyperparameter optimization · large language models · benchmark suite · AutoML · LLM systems · inference optimization · fidelity factors · cost-aware optimization

The pith

A benchmark suite supplies real run data from large language model systems to support hyperparameter optimization research.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current hyperparameter optimization benchmarks miss the compound configuration spaces that mix AI model settings with system-level choices in LLM deployments, along with nonlinear effects from fidelity choices and uneven measurement costs. The paper introduces LLMSYS-HPOBench as the first live collection of profiled data drawn directly from running LLM systems. It includes 364,450 hyperparameter configurations in 12-23 dimensions, 932 fidelity settings from 3-5 factors, multiple inference metrics, and cost metrics, plus associated logs. The release aims to let existing HPO methods be tested on frontier LLM workloads and to open new research avenues in AutoML that address these system realities.

Core claim

No existing benchmark captures the full mix of AI and non-AI hyperparameters, fidelity implications, and cost diversity that appear when optimizing real LLM systems; LLMSYS-HPOBench addresses this by releasing datasets of inference objective values obtained from actual system runs, currently containing 364,450 configurations with 12-23 dimensions, 3-5 fidelity dimensions that produce 932 settings, 3-9 objective metrics, 2-10 cost metrics, and the generated measurement logs.

What carries the argument

LLMSYS-HPOBench, the suite that aggregates profiled inference objective values and cost data from hyperparameter configurations executed on live LLM systems.
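Figure 4 of the paper gives high-level pseudocode for this unified interface, but the actual API is not reproduced here. As a hedged illustration only, a tabular lookup over released per-system CSVs could look like the sketch below; the class name, method name, and exact-match semantics are assumptions, not the suite's real interface.

import csv

class TabularLLMBench:
    """Illustrative stand-in: serves pre-profiled metrics for (config, fidelity) pairs."""

    def __init__(self, csv_path):
        with open(csv_path, newline="") as f:
            self.rows = list(csv.DictReader(f))

    def query(self, config, fidelity):
        """Return the recorded metrics row for an exact configuration + fidelity match."""
        wanted = {**config, **fidelity}
        for row in self.rows:
            if all(str(row.get(k)) == str(v) for k, v in wanted.items()):
                return row  # the same row carries objective and cost columns
        raise KeyError("configuration not profiled in this dataset")

An HPO method would then call query as a cheap stand-in for a real system run, charging itself the recorded cost metric rather than wall-clock time.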

If this is right

  • Existing HPO algorithms can now be revalidated directly against data from frontier LLM systems rather than synthetic or simplified proxies.
  • AutoML methods can be developed that explicitly handle mixed AI and system hyperparameters together with fidelity and cost trade-offs (see the sketch after this list).
  • The live nature of the suite permits ongoing addition of new LLM systems and workloads as the field evolves.
  • Improved HPO on these benchmarks could translate into lower inference latency or resource use in deployed LLM applications.
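To make the second point above concrete, here is a minimal cost-aware random search over a mixed AI/system space. The search space, the metric names, and the evaluate interface are illustrative assumptions, with a stub benchmark standing in so the loop runs end to end.

import random

# Mixed space: one AI-side and two system-side hyperparameters (illustrative values).
SPACE = {
    "temperature": lambda: round(random.uniform(0.0, 1.5), 2),  # AI component
    "max_batch_size": lambda: random.choice([8, 16, 32, 64]),   # serving engine
    "kv_cache_gb": lambda: random.choice([4, 8, 16]),           # serving engine
}

class StubBench:
    """Stand-in for the real suite; returns fake metrics so the sketch is runnable."""
    def evaluate(self, config):
        return {"latency_ms": random.uniform(50, 500),
                "profiling_cost_s": random.uniform(1.0, 10.0)}

def cost_aware_random_search(bench, budget_s):
    """Random search that spends a measurement-cost budget, not an iteration budget."""
    spent, best = 0.0, None
    while spent < budget_s:
        config = {name: draw() for name, draw in SPACE.items()}
        result = bench.evaluate(config)
        spent += result["profiling_cost_s"]  # charge the recorded profiling cost
        if best is None or result["latency_ms"] < best["latency_ms"]:
            best = {**config, **result}
    return best

print(cost_aware_random_search(StubBench(), budget_s=60.0))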

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark may reveal that standard HPO techniques require substantial adaptation to manage the high-dimensional mixed spaces and variable costs typical of LLM serving.
  • Researchers could use the provided logs to derive new cost models that predict measurement expense before running full configurations (a sketch follows this list).
  • Integration with automated deployment tools might allow closed-loop optimization where HPO results feed back into live system tuning.
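For the cost-model idea in the second bullet, one simple starting point could be an ordinary least-squares predictor over numeric log features. The feature and cost column names below are assumptions about the log schema, not fields documented by the paper.

import numpy as np

def fit_cost_model(rows, feature_cols, cost_col="profiling_cost_s"):
    """Fit a linear map from numeric log features to measurement cost (plus intercept)."""
    X = np.array([[float(r[c]) for c in feature_cols] + [1.0] for r in rows])
    y = np.array([float(r[cost_col]) for r in rows])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)

    def predict(config):
        return float(sum(wi * float(config[c]) for wi, c in zip(w, feature_cols)) + w[-1])

    return predict

A predictor like this could let an optimizer skip configurations whose estimated measurement expense exceeds the remaining budget.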

Load-bearing premise

The specific LLM systems and profiled runs used to build the datasets represent the compound spaces, nonlinear fidelities, and cost patterns found across wider real-world LLM deployments.

What would settle it

Demonstrating that hyperparameter optimization algorithms tuned on this benchmark produce no measurable gains when applied to independent, production-scale LLM deployments would undermine the claim of representativeness.

Figures

Figures reproduced from arXiv: 2605.08305 by Gangda Xiong, Pengzhou Chen, Siyu Wu, Tao Chen, Yulong Ye, Zezhen Xiang.

Figure 1: Workflow for constructing LLMSYS-HPOBench.
Figure 3: An example of the hyperparameter configuration data of moderate-req1-code_generation.csv.
Figure 4: High-level pseudocode example of using the unified …
Figure 5: Overview of the LLMSYS-HPOBench workflow.
Figure 6: Optimization trajectories of HPO algorithms over 10 runs using …
Figure 7: Sanitized logging trace of the LLM system A.
Original abstract

Large Language Model (LLM) systems have been the frontier of AI in many application domains, leading to new challenges and opportunities for hyperparameter optimization (HPO) for the AutoML community. However, this type of system exhibits an unprecedented compound space of hyperparameter configuration from both the AI and non-AI components; rich and nonlinear implications from the fidelity factors; and diverse costs of measuring hyperparameter configurations, none of which have been fully captured in existing benchmarks. This paper presents the first (live) benchmark suite and datasets for HPO of real-world LLM systems, dubbed LLMSYS-HPOBench, covering data related to the inference objective values of hyperparameter configurations profiled from running the LLM systems. Currently, LLMSYS-HPOBench contains 364,450 hyperparameter configurations with a dimensionality of 12-23, 3-5 dimensions of fidelity factor leading to 932 settings, 3-9 inference objective metrics, and 2-10 cost metrics, together with generated logs from measuring the LLM systems. What we seek to advocate is not only a revalidation of the existing HPO algorithms over the frontier LLM systems, but also to provide an evolving platform for the AutoML community to explore new directions of research in this regard. The benchmark suite has been made available at: https://github.com/ideas-labo/llmsys-hpobench

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LLMSYS-HPOBench, presented as the first live benchmark suite and associated datasets for hyperparameter optimization (HPO) of real-world LLM systems. It profiles 364,450 hyperparameter configurations (dimensionality 12-23) across LLM systems, incorporating 3-5 fidelity factors (yielding 932 settings), 3-9 inference objective metrics, and 2-10 cost metrics, together with generated measurement logs. The work supplies the raw data via GitHub and advocates for revalidating existing HPO algorithms on these frontier systems while enabling new AutoML research directions on compound spaces, nonlinear fidelities, and diverse costs not captured in prior benchmarks.

Significance. If the profiled systems and measurements prove representative, the benchmark would fill a clear gap by supplying realistic, high-dimensional, multi-fidelity, multi-objective HPO instances with explicit costs for the AutoML community. The public release of raw datasets and logs is a concrete strength that supports reproducibility and secondary analyses, distinguishing this from purely synthetic or low-fidelity benchmarks.

major comments (2)
  1. [Abstract] Abstract and the benchmark description: the central claim that the suite captures 'the compound space of hyperparameter configuration from both the AI and non-AI components; rich and nonlinear implications from the fidelity factors; and diverse costs of measuring hyperparameter configurations' for real-world LLM systems rests on the unvalidated assumption that the specific profiled systems, hardware, and protocols generalize; no diversity analysis, cross-system comparison, coverage metrics, or sensitivity study is supplied to support extrapolation beyond the chosen instances.
  2. [Abstract] Data collection and validation section (inferred from the abstract's description of profiling): the manuscript provides no details on data collection methodology, measurement validation protocols, or noise handling, which directly affects the soundness of the 364,450 configurations and the 3-9 objectives as reliable ground truth for HPO algorithm testing.
minor comments (2)
  1. [Abstract] The abstract lists aggregate statistics (364,450 configurations, 12-23 dims, etc.) but does not state how many distinct LLM systems or hardware platforms were used; adding this count would help readers gauge coverage.
  2. [Abstract] The GitHub link is given but the manuscript does not describe the exact file formats, schema, or example usage scripts for the released logs and datasets, which would improve immediate usability.
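A usage script of the kind the second minor comment asks for could be as small as the sketch below. The file name comes from the paper's Figure 3; the obj_/cost_ column prefixes are assumptions, since the released schema is not described in the text.

import pandas as pd

# Load one released dataset (file name taken from Figure 3 of the paper).
df = pd.read_csv("moderate-req1-code_generation.csv")
print(df.shape, df.columns.tolist())

# Assumed split of columns into hyperparameters vs. objective/cost metrics.
metric_cols = [c for c in df.columns if c.startswith(("obj_", "cost_"))]
config_cols = [c for c in df.columns if c not in metric_cols]
print(f"{len(config_cols)} hyperparameter columns, {len(metric_cols)} metric columns")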

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of generalizability and methodological transparency that we will address in the revision to strengthen the presentation of LLMSYS-HPOBench as a practical benchmark for the AutoML community.

Point-by-point responses
  1. Referee: [Abstract] Abstract and the benchmark description: the central claim that the suite captures 'the compound space of hyperparameter configuration from both the AI and non-AI components; rich and nonlinear implications from the fidelity factors; and diverse costs of measuring hyperparameter configurations' for real-world LLM systems rests on the unvalidated assumption that the specific profiled systems, hardware, and protocols generalize; no diversity analysis, cross-system comparison, coverage metrics, or sensitivity study is supplied to support extrapolation beyond the chosen instances.

    Authors: We agree that the manuscript would benefit from greater clarity on the scope and representativeness of the profiled systems. LLMSYS-HPOBench is explicitly positioned as a collection of real-world instances drawn from prevalent open-source LLM inference setups (e.g., Llama and Mistral variants on standard GPU hardware), rather than a universal claim about all possible LLM deployments. In the revised version we will add a dedicated subsection that (i) states the selection criteria for the systems and hardware, (ii) discusses limitations on generalization, and (iii) explains why these concrete, high-dimensional, multi-fidelity instances still fill a documented gap in existing HPO benchmarks. We will not add new empirical diversity or sensitivity analyses, as the core contribution is the release of the raw profiling data and logs themselves; however, the added discussion will make the extrapolation assumptions explicit and invite community extensions. revision: partial

  2. Referee: [Abstract] Data collection and validation section (inferred from the abstract's description of profiling): the manuscript provides no details on data collection methodology, measurement validation protocols, or noise handling, which directly affects the soundness of the 364,450 configurations and the 3-9 objectives as reliable ground truth for HPO algorithm testing.

    Authors: We acknowledge that the current manuscript does not devote sufficient space to these procedural details. Although the GitHub repository contains the raw measurement logs, the paper text itself should make the collection process transparent. In the revision we will expand (or add, if the existing section is too brief) a 'Data Collection and Validation' subsection that describes: the inference frameworks and hardware used, the exact protocol for recording each objective and cost metric, the number of repeated runs per configuration, the averaging procedure employed to reduce measurement noise, and any outlier filtering steps. This addition will directly support the claim that the released data constitute reliable ground truth for HPO algorithm evaluation. revision: yes
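A minimal sketch of the repeated-run handling the response describes: group runs per configuration, drop 1.5×IQR outliers, and average the rest. The grouping key, metric name, and threshold are illustrative assumptions rather than the authors' protocol.

import pandas as pd

def average_repeats(df, key_cols, metric="latency_ms"):
    """Within each configuration group, drop 1.5*IQR outliers and average the metric."""
    def collapse(group):
        q1, q3 = group[metric].quantile([0.25, 0.75])
        iqr = q3 - q1
        kept = group[(group[metric] >= q1 - 1.5 * iqr) &
                     (group[metric] <= q3 + 1.5 * iqr)]
        return kept[metric].mean()
    return (df.groupby(key_cols)
              .apply(collapse)
              .rename(f"mean_{metric}")
              .reset_index())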

Circularity Check

0 steps flagged

No circularity: empirical benchmark release with no derivations or self-referential claims

Full rationale

The manuscript presents LLMSYS-HPOBench as a new collection of 364,450 profiled configurations and logs from specific LLM systems. No equations, predictions, fitted parameters, or first-principles derivations appear in the provided text. The central claim is simply the existence and release of the dataset itself; this is self-contained empirical material and does not reduce to any input by construction, self-citation, or renaming. Representativeness concerns are validity issues, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or fitted models are presented; the contribution is an empirical resource collection. No free parameters, axioms, or invented entities are required for the central claim.

pith-pipeline@v0.9.0 · 5575 in / 1100 out tokens · 36163 ms · 2026-05-12T01:18:13.724443+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
