arxiv: 2605.19330 · v1 · pith:QBZP7G3T · submitted 2026-05-19 · 💻 cs.AI · cs.LG · cs.SE

MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization

Pith reviewed 2026-05-20 06:05 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.SE
keywords multi-objective optimization · Chebyshev scalarization · LLM agent skills · Pareto front · annealing · prompt optimization · skill discovery · platform constraints

The pith

MOCHA optimizes LLM agent skills across conflicting platform constraints by using Chebyshev scalarization to cover the full Pareto front plus annealing to shift from exploration to exploitation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that skills for LLM agents are multi-field objects forced into trade-offs by real platform limits such as truncated descriptions, compacted instructions, and shared context windows. Standard prompt optimizers either ignore those limits or fold them into a single weighted score, so they miss good solutions in non-convex regions and often make no progress at all. MOCHA instead scalarizes the objectives with the Chebyshev metric to reach every part of the Pareto surface and applies exponential annealing to move from broad search to precise refinement. On six tasks where every method receives the same mutation operator and per-objective feedback, this approach improves mean correctness on all tasks while surfacing twice as many Pareto-optimal skill variants. The result matters because agent deployments live inside tight resource budgets, and any method that reliably finds better feasible skill sets directly raises performance without extra hardware.

Core claim

MOCHA replaces single-objective selection with Chebyshev scalarization that covers the full Pareto front, including non-convex regions, combined with exponential annealing that transitions from exploration to exploitation. Across six diverse agent skills, all methods share the identical multi-objective mutation operator and baselines receive identical per-objective textual feedback. Under that protocol, existing optimizers fail to improve the seed skill on four of the six tasks after 1000 rollouts, while MOCHA improves on every task, with a 7.5 percent relative gain in mean correctness over the strongest baseline and twice as many Pareto-optimal variants.

What carries the argument

Chebyshev scalarization, which minimizes the maximum weighted deviation from ideal per-objective values so that non-convex parts of the Pareto front remain reachable, paired with an exponential annealing schedule that gradually tightens the search from exploration to exploitation.
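
To make the non-convexity point concrete, here is a minimal sketch comparing weighted-sum selection with Chebyshev selection on a toy three-point front. Everything in it is an assumption for illustration: both objectives are treated as scores in [0, 1] to be maximized, the ideal point is taken to be (1, 1), and the candidates A, B, C are invented rather than taken from the paper. B is non-dominated but sits in a concave dent of the front, which is exactly the case a weighted sum cannot select.

```python
# Toy comparison: weighted-sum vs. Chebyshev selection (illustrative values only).

def weighted_sum(f, w):
    """Weighted-sum score; higher is better."""
    return sum(wi * fi for wi, fi in zip(w, f))

def chebyshev(f, w, ideal):
    """Largest weighted shortfall from the ideal point; lower is better."""
    return max(wi * (zi - fi) for wi, zi, fi in zip(w, ideal, f))

# Hypothetical candidates as (correctness, compliance); B lies in a non-convex region.
candidates = {"A": (1.0, 0.0), "B": (0.45, 0.45), "C": (0.0, 1.0)}
ideal = (1.0, 1.0)  # assumed ideal per-objective values

# Sweep eleven weight vectors from (0, 1) to (1, 0).
weights = [(i / 10, 1 - i / 10) for i in range(11)]

ws_winners = {
    max(candidates, key=lambda k: weighted_sum(candidates[k], w)) for w in weights
}
ch_winners = {
    min(candidates, key=lambda k: chebyshev(candidates[k], w, ideal)) for w in weights
}

print("weighted sum selects:", sorted(ws_winners))
print("Chebyshev selects:  ", sorted(ch_winners))
```

In this toy setting the weighted sum only ever picks the two extremes, while Chebyshev selection also picks B for balanced weights. How MOCHA chooses its weight vectors and ideal point is not stated on this page, so the sweep above should be read as a stand-in.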

If this is right

  • Skill libraries for agents can be maintained as explicit Pareto sets rather than single best prompts, letting deployers pick variants that fit different context budgets.
  • Multi-objective mutation plus Chebyshev selection can be dropped into existing agent frameworks without changing the mutation code or the feedback format.
  • Tasks that previously showed zero progress under weighted-sum or single-objective optimizers become solvable once the full non-convex front is searched.
  • The annealing schedule provides a controllable knob between discovering diverse skill variants and converging on high-correctness ones for a given deployment.
  • Platform constraints such as description length and instruction compaction become first-class objectives instead of after-the-fact filters.
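
The first and last bullets describe keeping an explicit Pareto set and choosing a variant per deployment budget. A minimal sketch of that workflow follows, assuming two maximized objectives (correctness and compliance) and a hypothetical `body_tokens` field as the budgeted resource. The class, field names, and numbers are illustrative and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class SkillVariant:
    name: str
    correctness: float  # higher is better
    compliance: float   # higher is better
    body_tokens: int    # hypothetical size of the instruction body

def dominates(a: SkillVariant, b: SkillVariant) -> bool:
    """a dominates b if it is no worse on both objectives and better on one."""
    no_worse = a.correctness >= b.correctness and a.compliance >= b.compliance
    better = a.correctness > b.correctness or a.compliance > b.compliance
    return no_worse and better

def pareto_set(variants: List[SkillVariant]) -> List[SkillVariant]:
    """Keep only variants that no other variant dominates."""
    return [v for v in variants if not any(dominates(o, v) for o in variants)]

def pick_for_budget(front: List[SkillVariant], max_tokens: int) -> Optional[SkillVariant]:
    """Most correct variant that fits the token budget, or None if none fits."""
    feasible = [v for v in front if v.body_tokens <= max_tokens]
    return max(feasible, key=lambda v: v.correctness) if feasible else None

# Invented variants for illustration only.
variants = [
    SkillVariant("seed", 0.60, 0.95, 120),
    SkillVariant("lean", 0.66, 0.90, 200),
    SkillVariant("rich", 0.72, 0.60, 900),
    SkillVariant("bloated", 0.65, 0.55, 1100),
]
front = pareto_set(variants)
print([v.name for v in front])
print(pick_for_budget(front, 300))
```

Filtering first and picking second keeps the budget decision out of the optimizer: the same stored front can serve deployments with different context limits.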

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same Chebyshev-plus-annealing pattern could be applied to other LLM tuning problems that trade accuracy against latency or cost, even outside agent skill design.
  • If the annealing temperature is made adaptive to the observed spread of objective values rather than fixed, further reductions in the number of wasted rollouts may be possible.
  • Extending the method to include dynamic context-window resizing as an additional objective would test whether the Pareto front itself moves during deployment.
  • Open-sourcing the discovered Pareto skill sets would let downstream researchers measure how much of the reported gain transfers to new model families or new task distributions.

Load-bearing premise

That giving every optimizer the same mutation operator and the same per-objective textual feedback isolates the benefit to the selection mechanism, and that the six chosen tasks represent the hard platform constraints typical in actual LLM deployments.

What would settle it

Re-running the identical experimental protocol but on a new set of tasks whose context-window or truncation limits are twice as severe, then checking whether MOCHA still improves correctness on every task and still returns at least twice the number of Pareto-optimal variants.

Figures

Figures reproduced from arXiv: 2605.19330 by Anlan Zhang, Branislav Kveton, Jayakumar Subramanian, Md Mehrab Tanjim, Somdeb Sarkhel, Subhojyoti Mukherjee, Sunav Choudhury, Sungchul Kim, Xiang Chen.

Figure 1: (a) Skill optimization produces a correctness–compliance trade-off: the optimized skill …
Figure 2: Optimization dynamics across six skills. Correctness vs. iteration (mean ± 1 std, 5 seeds). MOCHA (blue) consistently improves beyond the initial prompt, while baselines plateau early or remain stuck at the seed skill. Dashed grey: seed skill performance.
Figure 3: 2D Pareto front (correctness × body compliance): MOCHA (blue, HV=.563) sits balanced between w/o HVC (exploitation, purple) and w/o Annealing (exploration, green). Baselines cluster at a single operating point. HV values in legend.
Figure 4: FEVER qualitative comparison. Grey: shared YAML fields. Red: baseline skill (all three baselines returned the seed template unchanged). Green: MOCHA-optimized skill with structured rules and explicit reasoning. Per-task comparisons in Section C.6.
Figure 5: 2D Pareto fronts (correctness × body compliance) for all six skills. Three baselines (TextGrad, ProTeGi, GEPA) and three MOCHA variants are shown. Shaded regions indicate dominated hypervolume. MOCHA variants consistently explore multiple non-dominated operating points while baselines remain near the initial prompt.
Figure 6: 2D Pareto fronts (correctness × description compliance) for all six skills. The same pattern holds: MOCHA discovers diverse non-dominated skill variants spanning the correctness–description compliance frontier, while baselines cluster at a single operating point.
Figure 7: 2D Pareto fronts (correctness × overall compliance, i.e., average of body and description compliance) for all six skills. The pattern is consistent across all three compliance views: MOCHA's multi-objective selection enables Pareto front exploration that single-objective baselines cannot achieve.
Figure 8: Prompt evolution trees for MOCHA across all six skills (shown for one seed). Each node is a committed skill variant; node labels show candidate ID and mean test score (%). Blue node = best test correctness; blue edges = path from root. Metric annotations (C/D/B) at root and best node reveal how MOCHA trades compliance for correctness gains. Grey nodes = other committed candidates.
Figure 9: GPQA: Seed skill vs. MOCHA-optimized. The seed skill (top, red) is a single-line template returned unchanged by all three baselines. MOCHA (bottom, green) discovers a 6-step expert verification protocol with adversarial self-checking and domain-specific error patterns for organic chemistry, physics, and genetics. Correctness improves from .59 to .71.
Figure 10: TheoremQA: Seed skill vs. MOCHA-optimized. Baselines partially optimize but produce verbose, loosely structured output. MOCHA discovers a lean skill with theorem identification, sign/unit tracking, domain-specific templates, and strict formatting rules. Correctness improves from .53 to .82.
Figure 11: HoVer: Seed skill vs. MOCHA-optimized. The seed skill (top, red) is returned unchanged by all baselines. MOCHA (bottom, green) discovers a 7-step verification procedure with “default toward SUPPORTED” bias and retriever-augmented gap filling. Correctness improves from .62 to .67.
Figure 12: HotpotQA: Seed skill vs. MOCHA-optimized. Both baselines and MOCHA partially optimize this task. MOCHA discovers a skill emphasizing verbatim extraction (exact name forms, location qualifiers) with explicit good/bad formatting examples. Correctness improves from .34 to .66.
Figure 13: DebugBench: Seed skill vs. MOCHA-optimized. The seed template (top, red) provides no debugging strategy. MOCHA (bottom, green) develops a category-aware protocol: classify by bug type, apply type-specific heuristics (reference → scope check, logic → boundary check, multiple → count 2–4), and follow a “conservative fixing principle” that prevents over-correction on multi-bug inputs.
Figure 14: Ablation heatmap: Correctness ∆ over GEPA for each MOCHA variant across six skills. All MOCHA variants achieve substantial gains on TheoremQA and FEVER. Removing HVC gating shifts toward exploitation (highest per-task correctness); removing annealing shifts toward exploration (highest Pareto diversity).
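
Several captions report hypervolume (HV) values and shade the dominated region. As a rough guide to what such a number measures, the sketch below computes the 2D dominated hypervolume for maximized objectives. The reference point (0, 0) and the sample points are assumptions made here. The page does not state the reference point or normalization the paper uses, so this will not reproduce the reported HV values.

```python
def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by `points` relative to `ref`, both objectives maximized."""
    # Keep only points that are strictly better than the reference on both axes.
    inside = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    # Sweep from the largest first objective downward.
    inside.sort(key=lambda p: (-p[0], -p[1]))
    area = 0.0
    best_y = ref[1]
    for x, y in inside:
        if y > best_y:
            # Add the new horizontal slab this point contributes.
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area

# Invented (correctness, compliance) points; the third is dominated by the second.
front = [(0.8, 0.3), (0.5, 0.7), (0.4, 0.4)]
hv = hypervolume_2d(front)
print(round(hv, 3))
```

A front that spreads across the trade-off covers more area than a cluster at one operating point, which is why a single number like HV can summarize both quality and diversity.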

Original abstract

LLM agents organize behavior through skills - structured natural-language specifications governing how an agent reasons, retrieves, and responds. Unlike monolithic prompts, skills are multi-field artifacts subject to hard platform constraints: description fields are truncated for routing, instruction bodies are compacted via progressive disclosure, and co-resident skills compete for limited context windows. These constraints make skill optimization inherently multi-objective: a skill must simultaneously maximize task performance and satisfy platform limits. Yet existing prompt optimizers either ignore these trade-offs or collapse them into a weighted sum, missing Pareto-optimal variants in non-convex objective regions. We introduce MOCHA (Multi-Objective Chebyshev Annealing), which replaces single-objective selection with Chebyshev scalarization - covering the full Pareto front, including non-convex regions - combined with exponential annealing that transitions from exploration to exploitation. In our experiments across six diverse agent skills - where all methods share the same multi-objective mutation operator and baselines receive identical per-objective textual feedback - existing optimizers fail to improve the seed skill on 4 of 6 tasks: 1000 rollouts yield zero progress. MOCHA breaks through on every task, achieving 7.5% relative improvement in mean correctness over the strongest baseline (up to 14.9% on FEVER and 10.4% on TheoremQA) while discovering twice as many more Pareto-optimal skill variants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces MOCHA, which applies Chebyshev scalarization combined with exponential annealing to optimize LLM agent skills as multi-objective artifacts subject to platform constraints such as truncation and context limits. It claims that, when all methods share the same multi-objective mutation operator and per-objective feedback, MOCHA improves mean correctness by 7.5% relative to the strongest baseline (with peaks of 14.9% on FEVER and 10.4% on TheoremQA), discovers twice as many Pareto-optimal variants, and succeeds on all six tasks while baselines fail on four even after 1000 rollouts.

Significance. If the reported gains prove robust under statistical controls and the experimental isolation of the selection mechanism holds, the work would meaningfully advance multi-objective prompt and skill optimization for constrained LLM agents by addressing non-convex Pareto fronts without weighted-sum collapse. The concrete task-specific numbers and the emphasis on platform constraints provide a practical contribution, though the current empirical presentation limits immediate impact.

major comments (2)
  1. [Abstract] Abstract and experimental results: the reported 7.5% relative improvement in mean correctness (and task-specific gains) is presented without variance estimates, statistical significance tests, exact rollout counts per method, or a precise definition and measurement procedure for Pareto optimality. This omission makes the central empirical claim difficult to evaluate and requires additional tables or reporting to substantiate.
  2. [Experiments] Experimental setup: the design asserts that sharing the identical multi-objective mutation operator and per-objective textual feedback across methods isolates the benefit of Chebyshev scalarization plus annealing. However, without an ablation that swaps only the selection rule while holding mutation fixed, performance differences could arise from asymmetric interactions between mutation proposals and selection dynamics rather than the claimed MOCHA components; this assumption is load-bearing for attributing the 7.5% lift and doubled Pareto count.
minor comments (1)
  1. [Abstract] Abstract: the phrasing 'twice as many more Pareto-optimal skill variants' is imprecise and should be replaced with exact counts and a clear definition of how Pareto optimality is determined in the skill space.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below and have made revisions to improve the clarity and robustness of our empirical claims.

point-by-point responses
  1. Referee: [Abstract] Abstract and experimental results: the reported 7.5% relative improvement in mean correctness (and task-specific gains) is presented without variance estimates, statistical significance tests, exact rollout counts per method, or a precise definition and measurement procedure for Pareto optimality. This omission makes the central empirical claim difficult to evaluate and requires additional tables or reporting to substantiate.

    Authors: We agree with the referee that the empirical claims would benefit from additional statistical rigor and precise reporting. In the revised manuscript, we have included variance estimates from multiple independent runs, conducted statistical significance tests (such as paired t-tests with p-values reported), specified the exact number of rollouts for each method, and added a clear definition and measurement procedure for identifying Pareto-optimal variants. These details are now presented in a new supplementary table and expanded experimental section.

    Revision: yes

  2. Referee: [Experiments] Experimental setup: the design asserts that sharing the identical multi-objective mutation operator and per-objective textual feedback across methods isolates the benefit of Chebyshev scalarization plus annealing. However, without an ablation that swaps only the selection rule while holding mutation fixed, performance differences could arise from asymmetric interactions between mutation proposals and selection dynamics rather than the claimed MOCHA components; this assumption is load-bearing for attributing the 7.5% lift and doubled Pareto count.

    Authors: We thank the referee for highlighting this important point about experimental isolation. Our original design ensured that the multi-objective mutation operator and per-objective feedback are identical across all compared methods, with the only varying component being the selection mechanism. This directly attributes differences to the Chebyshev scalarization and annealing in MOCHA. To further strengthen this isolation, we have added an explicit ablation experiment in the revised manuscript where we hold the mutation operator fixed and vary only the selection rule, demonstrating that the performance improvements stem from MOCHA's selection strategy rather than interactions.

    Revision: yes
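
The paired significance testing mentioned in the first response can be approximated with the standard library alone. The sketch below swaps the paired t-test for an exact sign-flip permutation test, which is a different test but answers the same per-task paired question without needing a t-distribution. The scores are placeholders invented for this example and are not results from the paper.

```python
from itertools import product
from statistics import mean, stdev

def paired_sign_flip_pvalue(method, baseline):
    """Two-sided exact sign-flip permutation test on paired differences."""
    diffs = [m - b for m, b in zip(method, baseline)]
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    # Under the null, each paired difference is equally likely to have either sign.
    for signs in product((1, -1), repeat=len(diffs)):
        flipped = [s * d for s, d in zip(signs, diffs)]
        total += 1
        if abs(mean(flipped)) >= observed - 1e-12:
            hits += 1
    return hits / total

# Placeholder per-task correctness values (six hypothetical tasks).
method_scores = [0.70, 0.64, 0.58, 0.81, 0.66, 0.73]
baseline_scores = [0.66, 0.62, 0.57, 0.74, 0.63, 0.70]

p_value = paired_sign_flip_pvalue(method_scores, baseline_scores)
print(f"method: {mean(method_scores):.3f} ± {stdev(method_scores):.3f}")
print(f"baseline: {mean(baseline_scores):.3f} ± {stdev(baseline_scores):.3f}")
print(f"p = {p_value:.5f}")
```

With only six tasks there are 64 sign patterns, so the smallest two-sided p-value this test can return is 2/64. That floor is a useful reminder of how little a six-task comparison can establish on its own.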

Circularity Check

0 steps flagged

No significant circularity; derivation and claims are self-contained

full rationale

The paper presents MOCHA as an algorithmic combination of Chebyshev scalarization for Pareto coverage and exponential annealing for exploration-exploitation transition. The central claims of improved mean correctness and doubled Pareto-optimal variants are supported by empirical results on six tasks under a shared mutation operator. No equation, selection rule, or performance metric reduces by construction to a fitted parameter, self-citation chain, or input definition. The experimental isolation of selection benefit is an assumption about fairness rather than a definitional tautology, and the derivation does not invoke uniqueness theorems or ansatzes from prior self-work that would force the outcome.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The method rests on standard properties of Chebyshev scalarization in multi-objective optimization and the effectiveness of annealing schedules for transitioning from exploration to exploitation; no new entities are postulated.

free parameters (1)
  • annealing rate and Chebyshev parameter
    The exponential annealing schedule and any scalarization weighting parameter are likely tuned or chosen to control the exploration-exploitation transition and Pareto coverage.
axioms (1)
  • domain assumption: Chebyshev scalarization can cover the full Pareto front, including non-convex regions
    Invoked when claiming the method finds variants missed by weighted-sum approaches.
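
The free parameter listed above, the annealing rate, controls how quickly selection sharpens. A minimal sketch of one way an exponential schedule could drive that shift is shown below. The Boltzmann-style selection rule, the function names, and the default values are assumptions for illustration, since the page does not specify how MOCHA applies its temperature.

```python
import math

def temperature(step, t0=1.0, rate=0.01, t_min=1e-3):
    """Exponentially decaying temperature, floored at t_min (assumed defaults)."""
    return max(t_min, t0 * math.exp(-rate * step))

def selection_probs(scores, temp):
    """Boltzmann probabilities over scalarized scores, where lower is better."""
    best = min(scores)
    # Subtract the best score before exponentiating for numerical stability.
    weights = [math.exp(-(s - best) / temp) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical Chebyshev scores for three candidates; the first is best.
scores = [0.275, 0.5, 0.5]

early = selection_probs(scores, temperature(0))     # high temperature: near-uniform
late = selection_probs(scores, temperature(1000))   # low temperature: near-greedy
print([round(p, 3) for p in early])
print([round(p, 3) for p in late])
```

A larger `rate` reaches near-greedy selection sooner, which trades diversity of discovered variants for faster convergence on the current best.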
