Recognition: 1 theorem link · Lean theorem
Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning
Pith reviewed 2026-05-11 01:27 UTC · model grok-4.3
The pith
LLM agents reach at most 50.6 percent accuracy (within 25 percent relative error) when estimating chemical procurement costs from reaction descriptions, even with access to supplier-quote tools.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current LLM agents, even frontier models equipped with retrieval tools, achieve at most 50.6 percent accuracy within 25 percent relative error when computing chemical procurement costs from reaction descriptions; performance falls substantially once inputs contain realistic perturbations such as chemical aliases or omitted fields. Stage-level breakdown reveals that errors stem from brittle parsing of identities and quantities, ineffective integration of retrieved evidence, invalid selection of purchasable packs, and non-convergent tool-calling loops.
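The claim concerns a five-stage pipeline: ground chemical identities, retrieve quotes, select packs, normalize quantities, compute cost. A minimal sketch of that decomposition follows; the function names, the Pack record, and the stubs are illustrative assumptions, not the paper's actual interfaces.

```python
# Hypothetical skeleton of the five-stage procurement task the claim
# describes: ground -> retrieve -> select pack -> normalize -> compute.
from dataclasses import dataclass

@dataclass
class Pack:
    quantity_g: float  # pack size in grams
    price_usd: float   # quoted price for one pack
    purity: float      # fractional purity, e.g. 0.98

def ground(name: str) -> str:
    """Resolve a chemical name or alias to a canonical identifier."""
    raise NotImplementedError  # e.g. synonym lookup against the snapshot

def retrieve(chem_id: str) -> list[Pack]:
    """Fetch supplier quotes for the chemical from the frozen snapshot."""
    raise NotImplementedError

def select_pack(packs: list[Pack], required_mass_g: float) -> tuple[Pack, int]:
    """Choose a valid purchasable pack and unit count for the required mass."""
    raise NotImplementedError  # concrete rules: see the sketch near refs [39]-[42]

def component_cost(name: str, required_mass_g: float) -> float:
    """Normalize quantities and compute the arithmetic cost for one reagent."""
    pack, units = select_pack(retrieve(ground(name)), required_mass_g)
    return pack.price_usd * units
```

Each stub corresponds to one stage the benchmark scores separately, which is what makes stage-level error attribution possible.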
What carries the argument
The ChemCost benchmark: a collection of 1,427 evaluable reactions grounded to a frozen snapshot of 230,775 supplier quotes for 2,261 chemicals. It supplies exact ground-truth costs and enables separate scoring of the grounding, retrieval, procurement, and arithmetic stages.
If this is right
- Tool access by itself does not enable reliable scientific procurement reasoning in current LLMs.
- Stage-level error analysis can pinpoint whether failures occur during chemical grounding, quote retrieval, pack selection, or arithmetic.
- Performance measured on clean inputs overestimates real-world capability once input variations such as aliases or missing fields are introduced.
- Non-convergent tool use and invalid pack choices remain dominant failure modes even in the strongest tested agents.
Where Pith is reading between the lines
- Extending the benchmark to include dynamic real-time pricing feeds would test whether agents can handle time-varying data rather than a static snapshot.
- Similar grounded, judge-free tasks could be constructed for other scientific domains that require lookup and normalization of external numeric data.
- Training regimens that explicitly reward convergent tool sequences and correct pack-size selection might close part of the observed performance gap.
Load-bearing premise
The 1,427 curated reactions and the single frozen pricing snapshot of 230,775 quotes form an unbiased and representative sample of real-world chemical procurement tasks.
What would settle it
A new agent that reaches above 80 percent accuracy within 25 percent relative error on the clean benchmark inputs and shows no more than a 10-point drop on the noise-injected views would falsify the claim that tool access remains insufficient.
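For concreteness, a minimal sketch of the accuracy-within-relative-error metric used throughout; the function and variable names are assumptions, not the paper's evaluation code. The sweep over 10, 25, and 50 percent thresholds mirrors the sensitivity analysis discussed in the referee exchange below.

```python
def accuracy_within(preds, truths, tau=0.25):
    """Fraction of predictions whose relative error is at most tau.

    A prediction of None (agent declined or failed) counts as a miss.
    """
    hits = sum(
        1 for p, t in zip(preds, truths)
        if p is not None and t > 0 and abs(p - t) / t <= tau
    )
    return hits / len(truths)

# Toy usage: sweep the threshold to see how much the headline number moves.
preds = [10.2, None, 8.0, 15.0]   # hypothetical agent cost estimates
truths = [10.0, 5.0, 12.0, 14.0]  # hypothetical ground-truth costs
for tau in (0.10, 0.25, 0.50):
    print(f"accuracy@{int(tau * 100)}%: {accuracy_within(preds, truths, tau):.2f}")
```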
Original abstract
Large Language Models (LLMs) have become increasingly capable as tool-using agents, with benchmarks spanning diverse general agentic tasks. Yet rigorous evaluation of scientific tool use remains limited. In chemistry, recent agents can plan syntheses and invoke domain-specific tools, but evaluations often rely on curated demonstrations, expert assessment, or LLM-as-judge scoring rather than exact, judge-free ground truth. We address this gap with chemical procurement cost estimation, a practical task in which an agent must ground chemical identities, retrieve supplier quotes, select valid purchasable packs, normalize quantities, and compute cost from a reaction description. We introduce ChemCost, a benchmark of 1,427 evaluable reactions grounded to a frozen pricing snapshot covering 2,261 chemicals and 230,775 supplier quotes, supporting scalar scoring and stage-level diagnosis of grounding, retrieval, procurement, and arithmetic failures. To evaluate robustness, we further construct controlled noise-injected views that perturb chemical aliases, quantity expressions, missing fields, and input formatting. Experiments with frontier, open-weight, and chemistry-specialized LLM agents show that tool access is necessary but insufficient for solving the task. The strongest agents reach only 50.6% accuracy within 25% relative error on clean inputs and degrade substantially with realistic noise. Stage-level analysis further shows that failures arise from brittle parsing, ineffective evidence integration, invalid pack selection, and non-convergent tool use.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ChemCost, a benchmark of 1,427 evaluable reactions grounded to a single frozen pricing snapshot (2,261 chemicals, 230,775 quotes). It evaluates frontier, open-weight, and chemistry-specialized LLM agents on the multi-stage task of grounding chemicals, retrieving quotes, selecting packs, normalizing quantities, and computing costs from reaction descriptions. The central results are that tool access is necessary but insufficient, with the strongest agents reaching only 50.6% accuracy within 25% relative error on clean inputs and degrading under controlled noise; stage-level analysis attributes failures to brittle parsing, invalid pack selection, and non-convergent tool use.
Significance. If the benchmark construction is representative, this supplies a rare judge-free, scalar-scored evaluation framework for agentic tool use in a practical scientific domain. The static snapshot enables exact, reproducible matching against ground-truth quotes, and the stage-level diagnostics provide concrete, actionable failure modes rather than aggregate scores. These elements strengthen the claim that current LLMs remain limited on realistic procurement reasoning even when tools are available.
Major comments (2)
- [Benchmark construction] Benchmark construction section: the criteria used to select the 1,427 'evaluable' reactions and to validate them against the pricing snapshot (e.g., chemical alias matching, quantity normalizability, supplier coverage) are not specified in sufficient detail. Because the reactions are filtered to be evaluable, this selection step risks curation bias that could make the observed 50.6% ceiling and the listed failure modes (brittle parsing, invalid pack selection) artifacts of the benchmark rather than intrinsic agent limits.
- [Evaluation metrics] Evaluation metrics subsection: the 25% relative-error success threshold is introduced without justification, sensitivity analysis, or comparison to other cutoffs (e.g., 10% or 50%). This choice directly determines the headline accuracy figure and the interpretation that tool access is 'insufficient,' so the threshold requires explicit rationale or robustness checks.
Minor comments (2)
- [Abstract] Abstract: the benchmark size (1,427 reactions) and snapshot scale (230,775 quotes) are stated but could be foregrounded earlier to give readers immediate context for the scale of the evaluation.
- [Noise-injection experiments] Noise-injection description: a short table or set of concrete examples illustrating each perturbation type (alias, quantity expression, missing fields) would improve reproducibility of the robustness experiments.
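To illustrate what such a table might contain, here is a hypothetical sketch of the four perturbation families named in the abstract. The concrete strings are invented for illustration and are not drawn from the paper.

```python
# One clean input record and four hypothetical noise-injected views,
# mirroring the perturbation families in the abstract: chemical aliases,
# quantity expressions, missing fields, and input formatting.
CLEAN = {"chemical": "sodium chloride", "amount": "5 g", "role": "reagent"}

NOISY_VIEWS = {
    "alias": {**CLEAN, "chemical": "NaCl"},                        # synonym swap
    "quantity": {**CLEAN, "amount": "5000 mg"},                    # unit rewrite
    "missing_field": {k: v for k, v in CLEAN.items() if k != "amount"},
    "formatting": "reagent: sodium chloride (amt=5g)",             # flattened free text
}
```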
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have made revisions to the manuscript where appropriate to improve clarity and robustness.
Point-by-point responses
- Referee: [Benchmark construction] Benchmark construction section: the criteria used to select the 1,427 'evaluable' reactions and to validate them against the pricing snapshot (e.g., chemical alias matching, quantity normalizability, supplier coverage) are not specified in sufficient detail. Because the reactions are filtered to be evaluable, this selection step risks curation bias that could make the observed 50.6% ceiling and the listed failure modes (brittle parsing, invalid pack selection) artifacts of the benchmark rather than intrinsic agent limits.
  Authors: We agree that the original manuscript did not provide sufficient detail on the selection criteria. In the revised version, we have substantially expanded the Benchmark Construction section to describe the full filtering pipeline, including explicit rules for chemical alias matching against the pricing snapshot, quantity normalizability validation, and supplier coverage requirements. We also report the number of reactions filtered at each stage and include a supplementary analysis comparing failure-mode distributions on the final evaluable set versus a broader unfiltered sample, which supports that the observed limitations are not artifacts of curation. (Revision: yes.)
- Referee: [Evaluation metrics] Evaluation metrics subsection: the 25% relative-error success threshold is introduced without justification, sensitivity analysis, or comparison to other cutoffs (e.g., 10% or 50%). This choice directly determines the headline accuracy figure and the interpretation that tool access is 'insufficient,' so the threshold requires explicit rationale or robustness checks.
  Authors: The 25% threshold was chosen to reflect practical tolerances in chemical procurement, where minor price fluctuations and pack-size rounding commonly produce deviations of this magnitude. To address the concern, we have added a sensitivity analysis subsection that reports accuracy at 10%, 25%, and 50% relative-error thresholds. The trends in agent performance and the conclusion that tool access is necessary but insufficient remain consistent across these cutoffs, with only modest changes in absolute numbers. (Revision: yes.)
Circularity Check
No significant circularity: the results are empirical metrics scored against external supplier data.
Full rationale
The paper introduces ChemCost as a benchmark of 1,427 reactions grounded to a frozen external pricing snapshot of 230,775 supplier quotes. All reported results (50.6% accuracy, stage-level failure modes, noise robustness) are direct comparisons of agent outputs to this independent ground truth rather than any fitted parameter, self-referential metric, or derivation that reduces to the authors' own inputs by construction. No equations, predictions, or uniqueness claims appear in the provided text; the evaluation pipeline is judge-free and externally anchored. Curation choices affect representativeness but do not create circularity in the reported performance numbers.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: chemical procurement cost can be decomposed into independent stages of grounding, retrieval, pack selection, and arithmetic.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Unclear: the relation between the paper passage and the cited Recognition theorem could not be established.
  Linked passage: "We introduce ChemCost, a benchmark of 1,427 evaluable reactions grounded to a frozen pricing snapshot covering 2,261 chemicals and 230,775 supplier quotes, supporting scalar scoring and stage-level diagnosis of grounding, retrieval, procurement, and arithmetic failures."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [2] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- [3] Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023.
- [4] Andres M Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. ChemCrow: Augmenting large-language models with chemistry tools. arXiv preprint arXiv:2304.05376, 2023.
- [5] Zehui Chen, Weihua Du, Wenwei Zhang, Kuikun Liu, Jiangning Liu, Miao Zheng, Jingming Zhuo, Songyang Zhang, Dahua Lin, Kai Chen, et al. T-Eval: Evaluating the tool utilization capability of large language models step by step. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9510–9529, 2024.
- [6] Murat Cihan Sorkun, Baptiste Saliou, and Süleyman Er. ChemPrice, a Python package for automated chemical price search. Chemistry-Methods, 5(2):e202400005, 2025.
- [7] Samuel Genheden and Esben Bjerrum. PaRoutes: towards a framework for benchmarking retrosynthesis route predictions. Digital Discovery, 1(4):527–539, 2022.
- [8] Taicheng Guo, Bozhao Nan, Zhenwen Liang, Zhichun Guo, Nitesh Chawla, Olaf Wiest, Xiangliang Zhang, et al. What can large language models do in chemistry? A comprehensive benchmark on eight tasks. Advances in Neural Information Processing Systems, 36:59662–59688, 2023.
- [9] Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao Wan, Neil Zhenqiang Gong, et al. MetaTool benchmark for large language models: Deciding whether to use tools and which to use. arXiv preprint arXiv:2310.03128, 2023.
- [10] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.
- [11] Steven M Kearnes, Michael R Maser, Michael Wleklinski, Anton Kast, Abigail G Doyle, Spencer D Dreher, Joel M Hawkins, Klavs F Jensen, and Connor W Coley. The Open Reaction Database. Journal of the American Chemical Society, 143(45):18820–18826, 2021.
- [12] Sunghwan Kim, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin A Shoemaker, Paul A Thiessen, Bo Yu, et al. PubChem 2023 update. Nucleic Acids Research, 51(D1):D1373–D1380, 2023.
- [13] Tomasz Klucznik, Barbara Mikulak-Klucznik, Michael P McCormack, Heather Lima, Sara Szymkuć, Manishabrata Bhowmick, Karol Molga, Yubai Zhou, Lindsey Rickershauser, Ewa P Gajewska, et al. Efficient syntheses of diverse, medicinally relevant targets planned by computer and executed in the laboratory. Chem, 4(3):522–532, 2018.
- [14] Greg Landrum et al. RDKit documentation. Release, 1(1-79):4, 2013.
- [15] Hao Li, He Cao, Bin Feng, Yanjun Shao, Xiangru Tang, Zhiyuan Yan, Li Yuan, Yonghong Tian, and Yu Li. Beyond chemical QA: Evaluating LLMs' chemical reasoning with modular chemical operations. arXiv preprint arXiv:2505.21318, 2025.
- [16] Jie Jack Li. Name Reactions: A Collection of Detailed Mechanisms and Synthetic Applications, fifth edition, 2004.
- [17] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
- [18] Andres M Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. Augmenting large language models with chemistry tools. Nature Machine Intelligence, 6(5):525–535, 2024.
- [19] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. In The Twelfth International Conference on Learning Representations, 2023.
- [20] Simon Rohrbach, Mindaugas Šiaučiulis, Greig Chisholm, Petrisor-Alin Pirvan, Michael Saleeb, S Hessam M Mehr, Ekaterina Trushina, Artem I Leonov, Graham Keenan, Aamir Khan, et al. Digitization and validation of a chemical synthesis literature database in the ChemPU. Science, 377(6602):172–180, 2022.
- [21] Yixiang Ruan, Chenyin Lu, Ning Xu, Yuchen He, Yixin Chen, Jian Zhang, Jun Xuan, Jianzhang Pan, Qun Fang, Hanyu Gao, et al. An automatic end-to-end chemical synthesis development platform powered by large language models. Nature Communications, 15(1):10160, 2024.
- [22] Martin Seifrid, Riley J Hickman, Andrés Aguilar-Granda, Cyrille Lavigne, Jenya Vestfrid, Tony C Wu, Théophile Gaudin, Emily J Hopkins, and Alán Aspuru-Guzik. RouteScore: punching the ticket to more efficient materials development. ACS Central Science, 8(1):122–131, 2022.
- [23]
- [24] Zhangde Song, Jieyu Lu, Yuanqi Du, Botao Yu, Thomas M Pruyn, Yue Huang, Kehan Guo, Xiuzhe Luo, Yuanhao Qu, Yi Qu, et al. Evaluating large language models in scientific discovery. arXiv preprint arXiv:2512.15567, 2025.
- [25] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- [26] Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-VL technical report. arXiv preprint arXiv:2504.07491, 2025.
- [27] Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024.
- [28] Zhengkai Tu, Sourabh J Choure, Mun Hong Fong, Jihye Roh, Itai Levin, Kevin Yu, Joonyoung F Joung, Nathan Morgan, Shih-Cheng Li, Xiaoqi Sun, et al. ASKCOS: open-source, data-driven synthesis planning. Accounts of Chemical Research, 58(11):1764–1775, 2025.
- [29] Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. SciBench: Evaluating college-level scientific problem-solving abilities of large language models. arXiv preprint arXiv:2307.10635, 2023.
- [30] David Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988.
- [31] Mengsong Wu, YaFei Wang, Yidong Ming, Yuqi An, Yuwei Wan, Wenliang Chen, Binbin Lin, Yuqiang Li, Tong Xie, and Dongzhan Zhou. Chematagent: Enhancing LLMs for chemistry and materials science through tree-search based tool learning. arXiv preprint arXiv:2506.07551, 2025.
- [32] Yuyang Wu, Jinhui Ye, Shuhao Zhang, Lu Dai, Yonatan Bisk, and Olexandr Isayev. MolErr2Fix: Benchmarking LLM trustworthiness in chemistry via modular error detection, localization, explanation, and correction. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19365–19382, 2025.
- [33] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
- [34] Botao Yu, Frazier N Baker, Ziqi Chen, Xia Ning, and Huan Sun. LlaSMol: Advancing large language models for chemistry with a large-scale, comprehensive, high-quality instruction tuning dataset. arXiv preprint arXiv:2402.09391, 2024.
- [35] Di Zhang, Wei Liu, Qian Tan, Jingdan Chen, Hang Yan, Yuliang Yan, Jiatong Li, Weiran Huang, Xiangyu Yue, Wanli Ouyang, et al. ChemLLM: A chemical large language model. arXiv preprint arXiv:2402.06852, 2024.
- [36] Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13643–13658, 2024.
- [37] Zihan Zhao, Da Ma, Lu Chen, Liangtai Sun, Zihao Li, Yi Xia, Hongshen Xu, Zichen Zhu, Su Zhu, Shuai Fan, et al. ChemDFM: A large language foundation model for chemistry. In NeurIPS 2024 Workshop on Foundation Models for Science: Progress, Opportunities, and Challenges, 2024.
- [38] Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.
Entries [39] through [42] are not citations; they are fragments of the benchmark's procurement prompt captured by the reference parser:
- [39] Use only quotes with purity ≥ 95%.
- [40] Select the smallest pack with quantity_g ≥ required_mass_g.
- [41] If no single pack covers the required mass, buy ⌈required_mass_g / largest_pack_g⌉ units of the largest pack.
- [42] The selected pack price, or the total price of repeated largest packs, is the component purchase cost. This is a non-interactive benchmark. Do not ask clarifying questions. Return a JSON answer. If you cannot complete the estimate, return {"predicted_cost_per_gram": null, "predicted_components": []}. Respond with the final answer in JSON: {"predicted_cost...
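Rules [39] through [42] are explicit enough to implement directly. A minimal sketch, under the assumption that each quote is a dict with purity, quantity_g, and price_usd keys (the key names are illustrative, not the benchmark's schema):

```python
import math

def purchase_cost(quotes, required_mass_g):
    """Component purchase cost under prompt rules [39]-[42]."""
    # [39] Use only quotes with purity >= 95%.
    valid = [q for q in quotes if q["purity"] >= 0.95]
    if not valid:
        return None  # no purchasable quote; the agent should return null
    # [40] Select the smallest pack with quantity_g >= required_mass_g.
    covering = [q for q in valid if q["quantity_g"] >= required_mass_g]
    if covering:
        best = min(covering, key=lambda q: q["quantity_g"])
        return best["price_usd"]  # [42] the selected pack price
    # [41] Otherwise buy ceil(required_mass_g / largest_pack_g) units
    # of the largest pack.
    largest = max(valid, key=lambda q: q["quantity_g"])
    units = math.ceil(required_mass_g / largest["quantity_g"])
    return units * largest["price_usd"]  # [42] total over repeated packs
```

Invalid pack selection, one of the dominant failure modes named in the stage-level analysis, corresponds to an agent deviating from exactly this branch structure.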