pith. sign in

arxiv: 2605.31478 · v1 · pith:YI6C52JSnew · submitted 2026-05-29 · 💻 cs.SE · cs.CL· cs.SY· eess.SY

Knowledge Boundary Probing and Demand-Guided Intervention for LLM-Based Power System Code Generation

Pith reviewed 2026-06-28 21:23 UTC · model grok-4.3

classification 💻 cs.SE cs.CLcs.SYeess.SY
keywords large language modelscode generationpower systemsAPI knowledge boundariesbenchmarkprompt interventionon-premise deploymentsimulation libraries
0
0 comments X

The pith

A boundary-aware intervention that corrects API knowledge gaps raises LLM accuracy on power-system code generation by 32 to 56 points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that errors when LLMs generate code for power system analysis arise mainly from gaps in knowing exact library functions and parameters rather than from shortfalls in logical reasoning. It develops a probing method to map each model's API knowledge profile across documentation levels and pairs it with an intervention that estimates documentation needs for a query then supplies the information either before generation or through later correction. This produces large accuracy gains on a benchmark of thousands of natural-language to code tasks that include numerical verification. A sympathetic reader would conclude that targeted knowledge injection at inference time can make open models suitable for sensitive on-premise use in regulated settings without retraining.

Core claim

The authors claim that first-pass failures in power-system code generation are dominated by structured API-knowledge boundary errors such as hallucinated function names, misused parameters, and mishandled result tables in versioned simulation libraries. They introduce a boundary-aware intervention that combines query-side API demand estimation with targeted proactive documentation injection and routed reactive correction, which improves accuracy by 32 to 56 points across every evaluated open-weight model of at least 7B parameters and every commercial API on a 2,000-task frozen benchmark.

What carries the argument

The boundary-aware intervention, which estimates the API documentation demands from the input query and supplies relevant information proactively or corrects outputs reactively.

If this is right

  • The intervention improves every evaluated open-weight model of at least 7B parameters and every commercial API by 32 to 56 accuracy points.
  • Open-weight models in the 70B-120B range reach the commercial mid-tier accuracy range.
  • The targeted prompts preserve the full-context accuracy ceiling while using 41 percent of the prompt-token cost.
  • The largest evaluated open-weight models lead the panel on the benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same probing and injection pattern could be tested on code generation tasks that rely on other versioned domain libraries.
  • The results suggest that maintaining accurate documentation access may be more efficient than repeated fine-tuning for keeping models current.
  • Utilities could apply the method to achieve reliable local assistance without sending data to external services.

Load-bearing premise

That the dominant source of first-pass errors in this domain is incomplete knowledge of library APIs rather than deeper limitations in reasoning about the underlying physical or mathematical problems.

What would settle it

A collection of tasks where models still produce incorrect code even after correct API documentation is supplied, because they misunderstand the underlying calculations or optimization logic, would show that the intervention does not address the main failure mode.

Figures

Figures reproduced from arXiv: 2605.31478 by Hui Wu, Xiaoyang Wang, Zhong Fan.

Figure 1
Figure 1. Figure 1: Reliability workflow for LLM-based power-system code generation. Data flow per item: natural-language grid-analysis request → LLM-generated program → execution on the target grid case → scalar engineering output. PowerCodeBench provides execution-validated tasks across power-system analyses (contingency, OPF, short-circuit, time-series). L0–L3 probing converts structured API documentation into per-model kn… view at source ↗
Figure 2
Figure 2. Figure 2: Frozen-release composition statistics. (a) Item count per difficulty level (D1–D4). (b) Task-family frequency distribution across the 15 families sampled in this release. (c) Grid network size distribution across the 39 test cases used (bus count). 6 [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The L0–L3 probe generator and profile output. Structured API documentation is transformed into recognition, recall, comprehension, and application probes, then aggregated into a model-specific knowledge-risk profile ρM(f) over the API corpus. instantiated from the corresponding documentation fields when available. This tests whether the model can correctly reason about API semantics when the function name … view at source ↗
Figure 4
Figure 4. Figure 4: Category-averaged knowledge-risk profile derived from ρM(f)= 1 − L0..L3 across the open-weight panel (10 models) and 4 closed-source APIs; warmer colours indicate higher risk. Vertical separators delimit the two groups; the right-most column is the all-model mean. must-inject floor, admitted only when its per-layer injection score Iℓ ranks above competing candidates in the budget. L3 also scales non-monoto… view at source ↗
Figure 5
Figure 5. Figure 5: Boundary-aware intervention: two-phase pipeline. Phase 1 (Proactive) injects layer-specific API documentation before generation, guided by task demand dˆ(f, q) and model-specific knowledge deficits over L0–L3. Phase 2 (Reactive) maps execution feedback to either runtime/API repair routes or value-level semantic diagnosis before re-execution. Three properties distinguish boundary-aware intervention from gen… view at source ↗
Figure 6
Figure 6. Figure 6: Cross-vendor results for the 9-model panel of [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Per-task accuracy heatmap with identical row and column ordering in both panels for direct side-by-side comparison. (a) Baseline A (no proactive injection, R0). (b) Full method C+FDRS after 3 fix rounds. Rows are pandapower-derived task families; columns split into open-weight models (left) and closed-source APIs (right) [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Token economy: Pareto trade-offs (a, b) and per-model saving breakdowns (c.1, c.2). (a) Proactive R0: prompt tokens vs. matched items (out of 2,000) for conditions A/B/C/X/R. (b) Reactive economy under base C: prompt tokens vs. successful fix attempts for FX/FD/FDR/FDRS. (c.1) Per-model R0 prompt decomposition under C—base prompt, documentation block (L0–L3 snippets plus DataFrame boundary cards), and temp… view at source ↗
read the original abstract

Large language models (LLMs) are increasingly used to automate power-system analysis, but many utilities and energy-research labs require on-premise serving for confidentiality, regulatory, reproducibility, and cost reasons. This makes the reliability of open-weight models a deployment issue. We show that first-pass failures in power-system code generation are dominated not by reasoning alone, but by structured API-knowledge boundary errors: hallucinated function names, misused parameters, and mishandled result tables in versioned simulation libraries. We introduce PowerCodeBench, an execution-validated benchmark generator that pairs natural-language operator queries with pandapower code and numerical ground truth; an L0-L3 documentation-driven probing procedure that measures per-model API knowledge profiles; and a boundary-aware intervention that combines query-side API demand estimation with targeted proactive documentation injection and routed reactive correction. On a 2,000-task frozen release, we evaluate ten open-weight LLMs (1.5B-480B parameters) and four commercial mid-tier APIs. The intervention improves every evaluated open-weight model of at least 7B parameters and every commercial API by 32 to 56 accuracy points. Open-weight models in the 70B-120B range match the commercial mid-tier accuracy range, while Llama-3.1-405B and Qwen3-Coder-480B lead the panel. The targeted prompts preserve the full-context accuracy ceiling while using 41% of the prompt-token cost. The result is an accuracy-side, deployment-time path toward reliable on-premise LLM assistance for grid-analysis workflows without fine-tuning or cloud inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript claims that first-pass failures in power-system code generation are dominated by structured API-knowledge boundary errors rather than reasoning shortfalls. It introduces PowerCodeBench (an execution-validated benchmark pairing natural-language queries with pandapower code and numerical ground truth), an L0-L3 documentation-driven probing procedure to measure per-model API knowledge profiles, and a boundary-aware intervention (query-side demand estimation + targeted documentation injection + routed correction). On a frozen 2,000-task release, the intervention is reported to raise accuracy by 32-56 points for every evaluated open-weight model ≥7B and every commercial API tested, while preserving the full-context accuracy ceiling at 41% of the prompt-token cost.

Significance. If the empirical results hold, the work supplies a practical, deployment-time method to improve reliability of open-weight LLMs for grid-analysis workflows without fine-tuning or cloud inference. The uniform gains across model scales (including 70B-120B models matching commercial mid-tier performance) and the explicit targeting of hallucinated function names, misused parameters, and mishandled result tables constitute a concrete contribution to on-premise LLM deployment in regulated domains.

major comments (1)
  1. [Abstract] The abstract asserts large accuracy gains and benchmark validation but supplies no experimental details, error bars, dataset construction steps, or statistical tests; only the abstract is available, so the support for the central claim cannot be verified.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review. The sole major comment concerns the abstract's lack of experimental details and the fact that only the abstract appears to have been available. We address this point below; the full manuscript supplies all requested information in the body and appendices.

read point-by-point responses
  1. Referee: [Abstract] The abstract asserts large accuracy gains and benchmark validation but supplies no experimental details, error bars, dataset construction steps, or statistical tests; only the abstract is available, so the support for the central claim cannot be verified.

    Authors: We agree the abstract is concise by design and omits granular experimental details, which is standard to keep abstracts high-level. The full manuscript details dataset construction (Section 3: execution-validated pairing of queries with pandapower code and numerical ground truth), error bars and statistical tests (Section 4 and Appendix B: mean accuracy with standard deviations over runs plus significance testing), and the full evaluation protocol. The observation that 'only the abstract is available' indicates the complete manuscript was not supplied in the review package; we ask the editor to provide the full text. No revision to the abstract is planned, as expansion would violate length norms without improving clarity. revision: no

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper reports an empirical study: it constructs PowerCodeBench as an execution-validated benchmark, runs an L0-L3 probing procedure to measure API knowledge profiles, applies a boundary-aware intervention (demand estimation + documentation injection + routed correction), and measures accuracy gains on a frozen 2,000-task release. No equations, fitted parameters, or derivations are present; the accuracy improvements are direct experimental measurements on held-out tasks. No self-citations appear in the provided text, and the central premise (that first-pass errors are dominated by structured API-knowledge failures) is tested by the intervention mechanism itself rather than assumed via prior results. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Based solely on the abstract; full paper may contain additional parameters or assumptions not visible here.

axioms (1)
  • domain assumption First-pass failures in power-system code generation are dominated by structured API-knowledge boundary errors rather than reasoning limitations
    Explicitly stated as the motivating observation in the abstract.
invented entities (2)
  • PowerCodeBench no independent evidence
    purpose: Execution-validated benchmark generator pairing natural-language operator queries with pandapower code and numerical ground truth
    Newly introduced benchmark described in the abstract.
  • L0-L3 documentation-driven probing procedure no independent evidence
    purpose: Measures per-model API knowledge profiles
    New procedure introduced in the abstract.

pith-pipeline@v0.9.1-grok · 5832 in / 1353 out tokens · 35740 ms · 2026-06-28T21:23:30.110353+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 34 canonical work pages · 7 internal anchors

  1. [1]

    Donti and J

    Priya L. Donti and J. Zico Kolter. Machine learning for sustainable energy systems.Annual Review of Environment and Resources, 46:719–747, 2021. doi: 10.1146/annurev-environ-020220-061831

  2. [2]

    Lopez Garcia, Christophe Ballif, and Matthias Galus

    Fabian Heymann, Hugo Quest, Tania B. Lopez Garcia, Christophe Ballif, and Matthias Galus. Reviewing 40 years of artificial intelligence applied to power systems—a taxonomic perspective.Energy and AI, 15:100322,

  3. [3]

    doi: 10.1016/j.egyai.2023.100322

  4. [4]

    Large foundation models for power systems

    Chenghao Huang, Siyang Li, Ruohong Liu, Hao Wang, and Yize Chen. Large foundation models for power systems. In2024 IEEE Power & Energy Society General Meeting (PESGM), pages 1–5, 2024. doi: 10.1109/ PESGM51994.2024.10688670

  5. [5]

    Thatte, Na Li, and Le Xie

    Subir Majumder, Lin Dong, Fatemeh Doudi, Yuting Cai, Chao Tian, Dileep Kalathil, Kevin Ding, Anupam A. Thatte, Na Li, and Le Xie. Exploring the capabilities and limitations of large language models in the electric energy sector.Joule, 8(6):1544–1549, 2024. doi: 10.1016/j.joule.2024.05.009

  6. [6]

    Thurner, A

    Leon Thurner, Alexander Scheidler, Florian Schäfer, Jan-Hendrik Menke, Julian Dollichon, Friederike Meier, Steffen Meinecke, and Martin Braun. pandapower—an open-source python tool for convenient modeling, analysis, and optimization of electric power systems.IEEE Transactions on Power Systems, 33(6):6510–6521, 2018. doi: 10.1109/TPWRS.2018.2829021. 40

  7. [7]

    AI in power systems: A systematic review of key matters of concern.Energy Informatics, 8:76, 2025

    Felipe Henao, Robert Edgell, Ambar Sharma, and Jeffrey Olney. AI in power systems: A systematic review of key matters of concern.Energy Informatics, 8:76, 2025. doi: 10.1186/s42162-025-00529-1

  8. [8]

    Secure and trustworthy energy systems: A four-layer threat model and defense-in-depth framework.Energy, 344:140027, 2026

    Yuheng Cheng, Xiyuan Zhou, Huan Zhao, Gaoqi Liang, Fushuan Wen, and Junhua Zhao. Secure and trustworthy energy systems: A four-layer threat model and defense-in-depth framework.Energy, 344:140027, 2026. doi: 10.1016/j.energy.2026.140027

  9. [9]

    CodeUpdateArena: Bench- marking knowledge editing on API updates.ArXiv, abs/2407.06249, 2024

    Zeyu Leo Liu, Shrey Pandit, Xi Ye, Eunsol Choi, and Greg Durrett. CodeUpdateArena: Benchmarking knowledge editing on API updates.arXiv preprint arXiv:2407.06249, 2024

  10. [10]

    LibEvolutionEval: A benchmark and study for version-specific code generation

    Sachit Kuhar, Wasi Uddin Ahmad, Zijian Wang, Nihal Jain, Haifeng Qian, Baishakhi Ray, Murali Krishna Ramanathan, Xiaofei Ma, and Anoop Deoras. LibEvolutionEval: A benchmark and study for version-specific code generation. InProceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech...

  11. [11]

    Identifying and mitigating API misuse in large language models.arXiv preprint arXiv:2503.22821, 2025

    Terry Yue Zhuo, Junda He, Jiamou Sun, Zhenchang Xing, David Lo, John Grundy, and Xiaoning Du. Identifying and mitigating API misuse in large language models.arXiv preprint arXiv:2503.22821, 2025

  12. [12]

    Towards mitigating API hallucination in code generated by LLMs with hierarchical dependency aware.arXiv preprint arXiv:2505.05057, 2025

    Yujia Chen, Mingyu Chen, Cuiyun Gao, Zhihan Jiang, Zhongqi Li, and Yuchi Ma. Towards mitigating API hallucination in code generated by LLMs with hierarchical dependency aware.arXiv preprint arXiv:2505.05057, 2025

  13. [13]

    Gorilla: Large Language Model Connected with Massive APIs

    Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive APIs.arXiv preprint arXiv:2305.15334, 2023

  14. [14]

    Gridmind: Llms-powered agents for power system analysis and operations.arXiv preprint arXiv:2509.02494, 2025

    Hongwei Jin, Kibaek Kim, and Jonghwan Kwon. Gridmind: Llms-powered agents for power system analysis and operations.arXiv preprint arXiv:2509.02494, 2025

  15. [15]

    Poweragent: A road map toward agentic intelligence in power systems: Foundation model, model context protocol, and workflow.IEEE Power & Energy Magazine, 23(5):93–101, 2025

    Qian Zhang and Le Xie. Poweragent: A road map toward agentic intelligence in power systems: Foundation model, model context protocol, and workflow.IEEE Power & Energy Magazine, 23(5):93–101, 2025. doi: 10.1109/MPE.2025.3579718

  16. [16]

    A large language model for advanced power dispatch.Scientific Reports, 15:8925, 2025

    Yuheng Cheng, Huan Zhao, Xiyuan Zhou, Junhua Zhao, Yuji Cao, Chao Yang, and Xinlei Cai. A large language model for advanced power dispatch.Scientific Reports, 15:8925, 2025. doi: 10.1038/s41598-025-91940-x

  17. [17]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  18. [18]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

  19. [19]

    Measuring coding challenge competence with APPS

    Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with APPS. In Advances in Neural Information Processing Systems, volume 34, pages 20389–20403, 2021

  20. [20]

    Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world github issues? InInternational Conference on Learning Representations, 2024

  21. [21]

    DS-1000: A natural and reliable benchmark for data science code generation

    Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. DS-1000: A natural and reliable benchmark for data science code generation. In Proceedings of the 40th International Conference on Machine Learning, pages 18319–18345, 2023. 41

  22. [22]

    Scicode: A research coding benchmark curated by scientists

    Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, et al. Scicode: A research coding benchmark curated by scientists. InAdvances in Neural Information Processing Systems, 2024

  23. [23]

    Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions

    Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. InInternational Conference on Learning Representations, 2025

  24. [24]

    Elecbench: A power dispatch evaluation benchmark for large language models.arXiv preprint arXiv:2407.05365, 2024

    Xiyuan Zhou, Huan Zhao, Yuheng Cheng, Yuji Cao, Gaoqi Liang, Guolong Liu, Wenxuan Liu, Yan Xu, and Junhua Zhao. Elecbench: A power dispatch evaluation benchmark for large language models.arXiv preprint arXiv:2407.05365, 2024

  25. [25]

    Abhyankar

    Shuangshuang Jin and Shrirang G. Abhyankar. Chatgrid: Power grid visualization empowered by large language model. In2024 IEEE Workshop on Energy Data Visualization (EnergyVis), pages 12–17, 2024. doi: 10.1109/ EnergyVis63885.2024.00007

  26. [26]

    Bonadia, Fernanda C

    Rodrigo S. Bonadia, Fernanda C. L. Trindade, Walmir Freitas, and Bala Venkatesh. On the potential of ChatGPT to generate distribution systems for load flow studies using OpenDSS.IEEE Transactions on Power Systems, 38 (6):5965–5968, 2023. doi: 10.1109/TPWRS.2023.3315543

  27. [27]

    Enabling large language models to perform power system simulations with previously unseen tools: A case of Daline.arXiv preprint arXiv:2406.17215, 2024

    Mengshuo Jia, Zeyu Cui, and Gabriela Hug. Enabling large language models to perform power system simulations with previously unseen tools: A case of Daline.arXiv preprint arXiv:2406.17215, 2024

  28. [28]

    Enhancing LLMs for power system simulations: A feedback-driven multi-agent framework.arXiv preprint arXiv:2411.16707, 2024

    Mengshuo Jia, Zeyu Cui, and Gabriela Hug. Enhancing LLMs for power system simulations: A feedback-driven multi-agent framework.arXiv preprint arXiv:2411.16707, 2024

  29. [29]

    A systematic review of transformers and large language models in the energy sector: Towards agentic digital twins.Applied Energy, 401:126670, 2025

    Gabriel Antonesi, Tudor Cioara, Ionut Anghel, Vasilis Michalakopoulos, Elissaios Sarmas, and Liana Toderean. A systematic review of transformers and large language models in the energy sector: Towards agentic digital twins.Applied Energy, 401:126670, 2025. doi: 10.1016/j.apenergy.2025.126670

  30. [30]

    Large language models meet energy systems: Opportunities, challenges, and future perspectives.Applied Energy, 403:127076, 2026

    Chaobo Zhang, Jian Zhang, Jie Lu, and Yang Zhao. Large language models meet energy systems: Opportunities, challenges, and future perspectives.Applied Energy, 403:127076, 2026. doi: 10.1016/j.apenergy.2025.127076

  31. [31]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. InAdvances in Neural Information Processing Systems, volume 36, 2023

  32. [32]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations, 2023

  33. [33]

    API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

    Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. API-bank: A comprehensive benchmark for tool-augmented LLMs.arXiv preprint arXiv:2304.08244, 2023

  34. [34]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 16000+ real-world APIs.arXiv preprint arXiv:2307.16789, 2023

  35. [35]

    Toolace: Winning the points of LLM function calling

    Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, et al. Toolace: Winning the points of LLM function calling. InInternational Conference on Learning Representations, 2025

  36. [36]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. InAdvances in Neural Information Processing Systems, volume 33, 2020

  37. [37]

    Repocoder: Repository-level code completion through iterative retrieval and generation

    Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. Repocoder: Repository-level code completion through iterative retrieval and generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023. 42

  38. [38]

    Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C. Park. Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7036...

  39. [39]

    Teaching Large Language Models to Self-Debug

    Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128, 2023

  40. [40]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Sy...

  41. [41]

    Anderson, David R

    Lorin W. Anderson, David R. Krathwohl, Peter W. Airasian, Kathleen A. Cruikshank, Richard E. Mayer, Paul R. Pintrich, James Raths, and Merlin C. Wittrock.A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives. Longman, New York, 2001

  42. [42]

    Paris, Marjorie Y

    Scott G. Paris, Marjorie Y. Lipson, and Karen K. Wixson. Becoming a strategic reader.Contemporary Educational Psychology, 8(3):293–316, 1983. doi: 10.1016/0361-476X(83)90018-8

  43. [43]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  44. [44]

    Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models.Transactions on Machine Learning Research, 2022

  45. [45]

    The probabilistic relevance framework: Bm25 and beyond,

    Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond.Foundations and Trends in Information Retrieval, 3(4):333–389, 2009. doi: 10.1561/1500000019

  46. [46]

    Sentence-BERT: Sentence embeddings using siamese BERT-networks

    Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3982–3992, 2019

  47. [47]

    Smola, Arthur Gretton, Karsten M

    Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Karsten M. Borgwardt, and Bernhard Schölkopf. Correcting sample selection bias by unlabeled data. InAdvances in Neural Information Processing Systems, volume 19, pages 601–608, 2007

  48. [48]

    Zimmerman, Carlos E

    Ray D. Zimmerman, Carlos E. Murillo-Sánchez, and Robert J. Thomas. Matpower: Steady-state operations, planning, and analysis tools for power systems research and education.IEEE Transactions on Power Systems, 26 (1):12–19, 2011. doi: 10.1109/TPWRS.2010.2051168

  49. [49]

    Dugan and Thomas E

    Roger C. Dugan and Thomas E. McDermott. An open source platform for collaborating on smart grid research. In2011 IEEE Power and Energy Society General Meeting, pages 1–7, 2011. doi: 10.1109/PES.2011.6039829

  50. [50]

    Hybrid symbolic-numeric framework for power system modeling and analysis.IEEE Transactions on Power Systems, 36(2):1373–1384, 2021

    Hantao Cui, Fangxing Li, and Kevin Tomsovic. Hybrid symbolic-numeric framework for power system modeling and analysis.IEEE Transactions on Power Systems, 36(2):1373–1384, 2021. doi: 10.1109/TPWRS.2020.3017019. 43