Recognition: unknown
Test-Time Learning with an Evolving Library
Pith reviewed 2026-05-15 01:39 UTC · model grok-4.3
The pith
Large language models improve on complex reasoning by building and evolving a shared library of skills extracted from their own inference trajectories without any parameter updates or external supervision.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EvoLib maintains a shared library of knowledge abstractions, including modular skills and reflective insights, automatically extracted from the model's own inference trajectories. A principled weighting and consolidation mechanism jointly optimizes for immediate utility and long-term value, allowing simple, instance-specific abstractions to evolve into more general and reusable ones over time.
What carries the argument
The evolving library of modular skills and reflective insights extracted from inference trajectories, together with a weighting and consolidation mechanism that trades off short-term and long-term value.
If this is right
- Performance rises substantially on mathematical reasoning tasks compared with other test-time scaling methods.
- Code generation improves through reuse of previously extracted modular skills.
- Multi-turn agentic environments benefit from the gradual shift toward more general strategies.
- Models can keep improving across a sequence of problems without retraining or labeled data.
Where Pith is reading between the lines
- The same library idea could support continuous adaptation in deployed systems where external feedback is unavailable.
- Linking the library to external memory stores might extend its reach to longer planning horizons.
- Applying the approach to models of different sizes would test whether the consolidation process scales with capacity.
Load-bearing premise
That modular skills and reflective insights automatically extracted from the model's own inference trajectories can be weighted and consolidated into increasingly general and reusable abstractions that deliver long-term value without any external supervision or ground-truth signals.
What would settle it
Running the method on the reported mathematical reasoning, code generation, and agent benchmarks and finding no substantial gains over the strongest test-time scaling baselines, or observing that the stored abstractions stay narrow and fail to generalize across instances.
Figures
read the original abstract
We introduce EvoLib, a test-time learning framework that enables large language models to accumulate, reuse, and evolve knowledge across problem instances without parameter updates or external supervision. Instead of adapting model parameters, our approach maintains a shared library of knowledge abstractions, including modular skills and reflective insights, automatically extracted from the model's own inference trajectories. To support continual improvement, we introduce a principled weighting and consolidation mechanism that jointly optimizes for immediate utility and long-term value. This allows simple, instance-specific abstractions to evolve into more general and reusable ones over time. Across challenging benchmarks in mathematical reasoning, code generation, and multi-turn agentic environments, EvoLib improves substantially over the top test-time scaling and learning methods without ground-truth feedback.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EvoLib, a test-time learning framework for large language models that maintains a shared evolving library of knowledge abstractions (modular skills and reflective insights) automatically extracted from the model's own inference trajectories. It proposes a weighting and consolidation mechanism to jointly optimize immediate utility and long-term value, allowing instance-specific abstractions to evolve into more general reusable ones, and claims substantial improvements over top test-time scaling and learning methods on mathematical reasoning, code generation, and multi-turn agentic benchmarks without ground-truth feedback or parameter updates.
Significance. If the central claims hold with rigorous empirical support, the work would be significant for enabling unsupervised continual adaptation in LLMs at inference time. It offers a modular, library-based alternative to parameter updates that could accumulate reusable knowledge across tasks, addressing a key limitation in current test-time scaling approaches.
major comments (3)
- [Abstract] Abstract: the central claim of 'substantial improvements' over existing methods is asserted without any quantitative results, baseline comparisons, ablation studies, or error analysis. This absence is load-bearing because the manuscript's value rests entirely on demonstrating net long-term gains from the self-extracted library.
- [Weighting and consolidation mechanism] Weighting and consolidation mechanism (described in abstract): the process extracts and weights abstractions solely from the model's inference trajectories with no external verifier or ground-truth signal. No concrete mechanism, equation, or example is supplied showing how systematic errors or locally coherent but globally incorrect steps are prevented from being promoted into the library, which directly risks undermining the long-term value claim.
- [Experimental Evaluation] Experimental section (implied by benchmark claims): no details are given on the specific benchmarks, number of instances, evolution of library size over time, or statistical significance of gains. Without these, it is impossible to evaluate whether the consolidation step delivers the claimed reusable abstractions or merely reinforces model biases.
minor comments (1)
- [Abstract] The abstract uses 'principled weighting' without defining the objective function or optimization procedure, which should be formalized early for clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below by referencing the full manuscript content and indicate where revisions will be made to strengthen clarity and support for the claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim of 'substantial improvements' over existing methods is asserted without any quantitative results, baseline comparisons, ablation studies, or error analysis. This absence is load-bearing because the manuscript's value rests entirely on demonstrating net long-term gains from the self-extracted library.
Authors: We agree that the abstract would benefit from including key quantitative highlights to immediately convey the empirical support. The full manuscript (Section 4 and Tables 1-3) provides the requested comparisons, ablations, and error analysis across benchmarks, showing consistent gains. We will revise the abstract to incorporate representative quantitative results and a brief mention of the evaluation scope. revision: yes
-
Referee: [Weighting and consolidation mechanism] Weighting and consolidation mechanism (described in abstract): the process extracts and weights abstractions solely from the model's inference trajectories with no external verifier or ground-truth signal. No concrete mechanism, equation, or example is supplied showing how systematic errors or locally coherent but globally incorrect steps are prevented from being promoted into the library, which directly risks undermining the long-term value claim.
Authors: Section 3.2 of the manuscript details the weighting and consolidation mechanism, including the utility function that balances immediate task performance with reuse frequency and a generalization score computed across instances. Self-reflection steps during extraction are used to mitigate propagation of errors. We acknowledge that an explicit worked example of error filtering would improve accessibility and will add one in the revised version. revision: partial
-
Referee: [Experimental Evaluation] Experimental section (implied by benchmark claims): no details are given on the specific benchmarks, number of instances, evolution of library size over time, or statistical significance of gains. Without these, it is impossible to evaluate whether the consolidation step delivers the claimed reusable abstractions or merely reinforces model biases.
Authors: The experimental section (Section 4) specifies the benchmarks (e.g., GSM8K, MATH, HumanEval, WebShop), instance counts, library size trajectories (visualized in Figure 3), and statistical tests (paired t-tests with p-values). We will expand the text to include additional details on library evolution and an explicit discussion of potential bias reinforcement in the revision. revision: yes
Circularity Check
Self-contained framework with no load-bearing circularity in derivation
full rationale
The EvoLib framework extracts modular skills and reflective insights directly from the model's own inference trajectories and applies an internal weighting and consolidation mechanism to evolve them into reusable abstractions. No equations, predictions, or central claims in the abstract reduce by construction to fitted parameters, self-defined quantities, or prior self-citations; the improvements are presented as emerging from the iterative internal process itself rather than from any tautological renaming or imported uniqueness theorem. The derivation chain remains independent of external benchmarks or ground-truth signals by design, qualifying as a normal non-circular outcome with only minor potential for self-referential elements that do not carry the load of the main result.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Inference trajectories contain extractable modular skills and reflective insights that can be automatically identified and stored.
invented entities (1)
-
Evolving library of knowledge abstractions
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations, 2023
2023
-
[2]
Large language models are better reasoners with self-verification
Yixuan Weng, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Shengping Liu, Bin Sun, Kang Liu, and Jun Zhao. Large language models are better reasoners with self-verification. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2550–2575, Singapore, December 2023. Association for Compu...
2023
-
[3]
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters.arXiv preprint arXiv:2408.03314, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
Yufan Zhuang, Chandan Singh, Liyuan Liu, Yelong Shen, Dinghuai Zhang, Jingbo Shang, Jianfeng Gao, and Weizhu Chen. Test-time recursive thinking: Self-improvement without external feedback.arXiv preprint arXiv:2602.03094, 2026
-
[5]
s1: Simple test-time scaling
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Lan...
2025
-
[6]
Siddarth Venkatraman, Vineet Jain, Sarthak Mittal, Vedant Shah, Johan Obando-Ceron, Yoshua Bengio, Brian R Bartoldson, Bhavya Kailkhura, Guillaume Lajoie, Glen Berseth, et al. Recursive self-aggregation unlocks deep thinking in large language models.arXiv preprint arXiv:2509.26626, 2025
-
[7]
Memory-assisted prompt editing to improve GPT-3 after deployment
Aman Madaan, Niket Tandon, Peter Clark, and Yiming Yang. Memory-assisted prompt editing to improve GPT-3 after deployment. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors,Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2833–2861, Abu Dhabi, United Arab Emirates, December 2022. Association for Compu...
2022
-
[8]
Thought-retriever: Don’t just retrieve raw data, retrieve thoughts, 2024
Tao Feng, Pengrui Han, Guanyu Lin, Ge Liu, and Jiaxuan You. Thought-retriever: Don’t just retrieve raw data, retrieve thoughts, 2024
2024
-
[9]
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H Chi, et al. Evo-memory: Benchmarking llm agent test-time learning with self-evolving memory.arXiv preprint arXiv:2511.20857, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[10]
Test-time learning for large language models
Jinwu Hu, Zitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, and Mingkui Tan. Test-time learning for large language models. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of the 42nd International Conference on Machine Learni...
2025
-
[11]
The surprising effectiveness of test-time training for few-shot learning
Ekin Akyürek, Mehul Damani, Adam Zweiger, Linlu Qiu, Han Guo, Jyothish Pari, Yoon Kim, and Jacob Andreas. The surprising effectiveness of test-time training for few-shot learning. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Proceedings of the 42nd International Conf...
2025
-
[12]
Learning to discover at test time.arXiv preprint arXiv:2601.16175, 2026
Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, et al. Learning to discover at test time.arXiv preprint arXiv:2601.16175, 2026. 10
-
[13]
Learning to (learn at test time): Rnns with expressive hidden states
Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden states. arXiv preprint arXiv:2407.04620, 2024
-
[14]
Yuri Kuratov, Matvey Kairov, Aydar Bulatov, Ivan Rodkin, and Mikhail Burtsev. Gradmem: Learning to write context into memory with test-time gradient descent.arXiv preprint arXiv:2603.13875, 2026
-
[15]
How to grow a mind: Statistics, structure, and abstraction.science, 331(6022):1279–1285, 2011
Joshua B Tenenbaum, Charles Kemp, Thomas L Griffiths, and Noah D Goodman. How to grow a mind: Statistics, structure, and abstraction.science, 331(6022):1279–1285, 2011
2011
-
[16]
Cognitive skill acquisition.Annual review of psychology, 47(1):513–539, 1996
Kurt VanLehn. Cognitive skill acquisition.Annual review of psychology, 47(1):513–539, 1996
1996
-
[17]
On the Measure of Intelligence
François Chollet. On the measure of intelligence.arXiv preprint arXiv:1911.01547, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1911
-
[18]
Walt: Web agents that learn tools.arXiv preprint arXiv:2510.01524, 2025
Viraj Prabhu, Yutong Dai, Matthew Fernandez, Jing Gu, Krithika Ramakrishnan, Yanqi Luo, Silvio Savarese, Caiming Xiong, Junnan Li, Zeyuan Chen, et al. Walt: Web agents that learn tools.arXiv preprint arXiv:2510.01524, 2025
-
[19]
Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. Skill0: In-context agentic reinforcement learning for skill internalization.arXiv preprint arXiv:2604.02268, 2026
-
[20]
ReGAL: Refactoring programs to discover generalizable abstractions
Elias Stengel-Eskin, Archiki Prasad, and Mohit Bansal. ReGAL: Refactoring programs to discover generalizable abstractions. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Ma...
2024
-
[21]
Inducing programmatic skills for agentic tasks
Zora Zhiruo Wang, Apurva Gandhi, Graham Neubig, and Daniel Fried. Inducing programmatic skills for agentic tasks. InSecond Conference on Language Modeling, 2025
2025
-
[22]
Reuseit: Synthesizing reusable ai agent workflows for web automation
Yimeng Liu, Misha Sra, Jeevana Priya Inala, and Chenglong Wang. Reuseit: Synthesizing reusable ai agent workflows for web automation. InProceedings of the 31st International Conference on Intelligent User Interfaces, pages 885–908, 2026
2026
-
[23]
Libin Qiu, Zhirong Gao, Junfu Chen, Yuhang Ye, Weizhi Huang, Xiaobo Xue, Wenkai Qiu, and Shuo Tang. Autorefine: From trajectories to reusable expertise for continual llm agent refinement.arXiv preprint arXiv:2601.22758, 2026
-
[24]
Matthew Ho, Chen Si, Zhaoxiang Feng, Fangxu Yu, Yichi Yang, Zhijian Liu, Zhiting Hu, and Lianhui Qin. Arcmemo: Abstract reasoning composition with lifelong llm memory.arXiv preprint arXiv:2509.04439, 2025
-
[25]
Expel: Llm agents are experiential learners.Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):19632– 19642, Mar
Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners.Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):19632– 19642, Mar. 2024
2024
-
[26]
EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle
Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, et al. Evolver: Self-evolving llm agents through an experience-driven lifecycle.arXiv preprint arXiv:2510.16079, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[27]
Evolving programmatic skill networks.arXiv preprint arXiv:2601.03509, 2026
Haochen Shi, Xingdi Yuan, and Bang Liu. Evolving programmatic skill networks.arXiv preprint arXiv:2601.03509, 2026
-
[28]
Yuxiao Qu, Anikait Singh, Yoonho Lee, Amrith Setlur, Ruslan Salakhutdinov, Chelsea Finn, and Aviral Kumar. Rlad: Training llms to discover abstractions for solving reasoning problems.arXiv preprint arXiv:2510.02263, 2025
-
[29]
Hybrid-gym: Training coding agents to generalize across tasks.arXiv preprint arXiv:2602.16819, 2026
Yiqing Xie, Emmy Liu, Gaokai Zhang, Nachiket Kotalwar, Shubham Gandhi, Sathwik Acharya, Xingyao Wang, Carolyn Rose, Graham Neubig, and Daniel Fried. Hybrid-gym: Training coding agents to generalize across tasks.arXiv preprint arXiv:2602.16819, 2026
-
[30]
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[31]
AlphaEvolve: A coding agent for scientific and algorithmic discovery
Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery.arXiv preprint arXiv:2506.13131, 2025. 11
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[32]
Thetaevolve: Test-time learning on open problems.arXiv preprint arXiv:2511.23473, 2025
Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, et al. Thetaevolve: Test-time learning on open problems.arXiv preprint arXiv:2511.23473, 2025
-
[33]
SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangx- iang Chu. Skillclaw: Let skills evolve collectively with agentic evolver.arXiv preprint arXiv:2604.08377, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[34]
Evoskill: Automated skill discovery for multi-agent systems.arXiv preprint arXiv:2603.02766, 2026
Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi-agent systems.arXiv preprint arXiv:2603.02766, 2026
-
[35]
Agent workflow memory
Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. In Forty-second International Conference on Machine Learning, 2025
2025
-
[36]
Dynamic cheatsheet: Test-time learning with adaptive memory
Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, and James Zou. Dynamic cheatsheet: Test-time learning with adaptive memory. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7080–7106, 2026
2026
-
[37]
Reasoningbank: Scaling agent self-evolving with reasoning memory
Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, and Tomas Pfister. Reasoningbank: Scaling agent self-evolving with reasoning memory. In The Fourteenth International Conference on Learning Represen...
2026
-
[38]
Judging llm-as-a- judge with mt-bench and chatbot arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E Gonzalez, and Ion Stoica. Judging llm-as-a- judge with mt-bench and chatbot arena. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Sy...
2023
-
[39]
Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions
Terry Yue Zhuo, Vu Minh Chien, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen GONG, James Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Davi...
2025
-
[40]
Livecodebench: Holistic and contamination free evaluation of large language models for code
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar- Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. InThe Thirteenth International Conference on Learning Representations, 2025
2025
-
[41]
Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. ScienceWorld: Is your agent smarter than a 5th grader? In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors,Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298, Abu Dhabi, United Arab Emirates, December 2022. Association...
2022
-
[42]
The 2014 international planning competition: Progress and trends.Ai Magazine, 36(3):90–98, 2015
Mauro Vallati, Lukas Chrpa, Marek Grze´s, Thomas Leo McCluskey, Mark Roberts, Scott Sanner, et al. The 2014 international planning competition: Progress and trends.Ai Magazine, 36(3):90–98, 2015
2014
-
[43]
Agentboard: An analytical evaluation board of multi-turn llm agents
Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, and Junxian He. Agentboard: An analytical evaluation board of multi-turn llm agents. In A. Glober- son, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 74325–...
2024
-
[44]
Matharena: Evaluating llms on uncontaminated math competitions.Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmark, 2025
Mislav Balunovi´c, Jasper Dekoninck, Ivo Petrov, Nikola Jovanovi ´c, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions.Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmark, 2025
2025
-
[45]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 12 Table A.2: Summary of datasets used in evaluation. Domain Dataset # Instances Notes Math Reasoning HMMT Feb 2025 30 Competitive math problems HMMT Nov ...
work page internal anchor Pith review Pith/arXiv arXiv 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.