Pith · machine review for the scientific record

arxiv: 2604.20183 · v1 · submitted 2026-04-22 · 💻 cs.CL

Recognition: unknown

Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:35 UTC · model grok-4.3

classification 💻 cs.CL
keywords: LLM agents · optimization problems · multi-paradigm ambiguity · memory augmentation · knowledge inheritance · training-free methods · clustering historical solutions · structural ambiguity

The pith

Dual clusters of historical solutions let LLMs resolve conflicting modeling paradigms in optimization problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often encounter optimization problems that admit multiple valid but incompatible modeling approaches, which leads to inconsistent or failed solution attempts. The paper introduces a method that gathers past solutions, places them into separate modeling and coding clusters, and condenses each cluster into three reusable forms of guidance: core approaches, verification checklists, and common pitfalls. This distilled knowledge then directs the model at inference time to select appropriate paths, spot mistakes, and shift strategies when needed. A reader would care because the process requires no model retraining and produces measurable gains on standard benchmarks while allowing smaller models to benefit from memory built by larger ones.

Core claim

The paper establishes that Dual-Cluster Memory Construction, by partitioning historical solutions into modeling and coding clusters and distilling each into Approach, Checklist, and Pitfall knowledge, supplies generalizable guidance. Paired with Memory-augmented Inference that employs this knowledge to navigate solution paths, detect errors, and adaptively switch reasoning, the resulting agent raises solution quality on optimization problems that contain structural ambiguity, as measured by 11-21 percent average gains across seven benchmarks and by the observed transfer of effective memory from larger models to smaller ones.

What carries the argument

Dual-Cluster Memory Construction, which separates historical solutions into modeling and coding clusters then distills each cluster into three structured knowledge types that support error detection and path switching during inference.
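The paper does not publish its construction code, but the mechanism described above can be sketched in outline. Everything below is an editorial illustration: the solution field names, the keyword-overlap cluster assignment (standing in for whatever similarity measure the authors use), and the injected `distill` stub (standing in for the paper's LLM distillation call) are all assumptions, not the authors' implementation.

```python
from typing import Callable

# Hypothetical shape of one historical solution: the paper separates a
# natural-language modeling write-up from the solver code it produced.
Solution = dict  # {"modeling": str, "code": str}

def assign_cluster(text: str, centroids: dict[str, set[str]]) -> str:
    """Toy cluster assignment by keyword overlap.

    The paper does not specify its clustering algorithm; a real system
    would likely use embedding similarity. The centroid keyword sets
    here are stand-ins for learned cluster centers.
    """
    tokens = set(text.lower().split())
    return max(centroids, key=lambda name: len(tokens & centroids[name]))

def build_dual_cluster_memory(
    solutions: list[Solution],
    modeling_centroids: dict[str, set[str]],
    coding_centroids: dict[str, set[str]],
    distill: Callable[[list[str]], dict],
) -> dict:
    """Partition solutions into modeling and coding clusters, then distill
    each cluster into Approach / Checklist / Pitfall entries. `distill`
    is an LLM call in the paper; here it is injected as a stub."""
    clusters: dict = {"modeling": {}, "coding": {}}
    for sol in solutions:
        m = assign_cluster(sol["modeling"], modeling_centroids)
        c = assign_cluster(sol["code"], coding_centroids)
        clusters["modeling"].setdefault(m, []).append(sol["modeling"])
        clusters["coding"].setdefault(c, []).append(sol["code"])
    return {
        side: {name: distill(texts) for name, texts in members.items()}
        for side, members in clusters.items()
    }

# Demo with a stub distiller that just counts its inputs.
stub = lambda texts: {"approach": f"{len(texts)} solutions summarized",
                      "checklist": [], "pitfall": []}
memory = build_dual_cluster_memory(
    [{"modeling": "linear program maximize profit", "code": "import pulp"},
     {"modeling": "integer program assign workers", "code": "import gurobipy"}],
    modeling_centroids={"LP": {"linear", "continuous"},
                        "MIP": {"integer", "binary"}},
    coding_centroids={"pulp": {"pulp"}, "gurobi": {"gurobipy"}},
    distill=stub,
)
```

The point of the two-sided structure is visible even in this toy: a solution's modeling paradigm and its coding idiom are clustered independently, so guidance about "which formulation" and "which solver pattern" can be retrieved separately at inference time.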

If this is right

  • LLMs gain the capacity to detect errors and switch between alternative reasoning paths by consulting distilled checklists and pitfalls.
  • Smaller models reach higher success rates when supplied with memory distilled from solutions produced by larger models.
  • The training-free memory construction delivers the reported gains immediately on any existing LLM without additional fine-tuning.
  • Average performance rises 11-21 percent on benchmarks that feature problems admitting multiple conflicting modeling paradigms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same clustering and distillation process could be applied to other domains where LLMs encounter decision ambiguity, such as multi-step planning or code refactoring.
  • Knowledge inheritance from large to small models offers a practical route for reducing inference compute while preserving performance on ambiguous tasks.
  • The distilled guidance might lose effectiveness if future problems shift substantially in style or domain from the historical set used to build the memory.

Load-bearing premise

Past solutions from earlier optimization problems contain patterns that can be clustered and distilled into guidance that generalizes to new problems without overfitting or overlooking fresh paradigms.

What would settle it

Measure performance on a new collection of optimization problems whose modeling paradigms do not appear in the historical memory; the central claim would be falsified if the agent shows no improvement over ordinary LLM prompting on that collection.
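That settling experiment reduces to a paired comparison on a held-out set. As a sketch of how the outcome could be scored, the stdlib snippet below computes a paired-bootstrap confidence interval for the success-rate gain; the per-problem 0/1 outcomes are hypothetical, and no such protocol appears in the paper itself.

```python
import random

def bootstrap_gain_ci(agent: list[int], base: list[int],
                      n_boot: int = 2000, seed: int = 0) -> tuple[float, float]:
    """95% paired-bootstrap CI for the agent's success-rate gain over
    plain prompting. Inputs are per-problem 0/1 outcomes on problems
    whose modeling paradigms were excluded from the historical memory
    (hypothetical data for illustration)."""
    rng = random.Random(seed)
    n = len(agent)
    gains = sorted(
        sum(agent[i] - base[i] for i in (rng.randrange(n) for _ in range(n))) / n
        for _ in range(n_boot)
    )
    return gains[int(0.025 * n_boot)], gains[int(0.975 * n_boot)]

# Illustrative outcomes: agent solves 15/20 held-out problems, baseline 8/20.
lo, hi = bootstrap_gain_ci(agent=[1] * 15 + [0] * 5, base=[1] * 8 + [0] * 12)
# The central claim would be falsified if the interval included zero.
claim_survives = lo > 0.0
```

With identical outcomes for both systems the interval collapses to (0.0, 0.0), i.e. no evidence of improvement, which is exactly the falsifying result described above.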

Figures

Figures reproduced from arXiv: 2604.20183 by Bifan Wei, Boxuan Zhang, Jun Liu, Lingling Zhang, Xinyu Zhang, Yuchen Wan, Zesheng Yang.

Figure 1: Illustration of a single production planning …
Figure 2: Overview of the Dual-Cluster Memory Agent (DCM-Agent). This agent operates in two distinct phases: …
Figure 3: Two examples in our Dual-Cluster Memory.
Figure 4: Statistical distribution of the number of two …
Figure 5: Ablation study on NLP4LP, OptiBench, and …
Figure 6: Comparison between the baseline and DCM-Agent on a discrete optimization task. DCM-Agent correctly …
Original abstract

Large Language Models (LLMs) often struggle with structural ambiguity in optimization problems, where a single problem admits multiple related but conflicting modeling paradigms, hindering effective solution generation. To address this, we propose Dual-Cluster Memory Agent (DCM-Agent) to enhance performance by leveraging historical solutions in a training-free manner. Central to this is Dual-Cluster Memory Construction. This agent assigns historical solutions to modeling and coding clusters, then distills each cluster's content into three structured types: Approach, Checklist, and Pitfall. This process derives generalizable guidance knowledge. Furthermore, this agent introduces Memory-augmented Inference to dynamically navigate solution paths, detect and repair errors, and adaptively switch reasoning paths with structured knowledge. The experiments across seven optimization benchmarks demonstrate that DCM-Agent achieves an average performance improvement of 11%- 21%. Notably, our analysis reveals a ``knowledge inheritance'' phenomenon: memory constructed by larger models can guide smaller models toward superior performance, highlighting the framework's scalability and efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Dual-Cluster Memory Agent (DCM-Agent), a training-free framework that addresses structural (multi-paradigm) ambiguity in optimization problems for LLMs. It constructs dual clusters of historical solutions (modeling vs. coding), distills each into three structured knowledge types (Approach, Checklist, Pitfall) via LLM, and augments inference with dynamic path navigation, error repair, and adaptive switching. Experiments on seven optimization benchmarks report 11-21% average gains, plus a 'knowledge inheritance' effect in which memory from larger models improves smaller models.

Significance. If the performance claims and inheritance observation are substantiated with proper controls, the work would offer a practical, scalable method for injecting structured historical guidance into LLM reasoning on ambiguous optimization tasks without fine-tuning. The dual-cluster distillation and memory-augmented inference could generalize beyond the reported benchmarks and support efficient deployment of smaller models.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (Experiments): the central performance claim of 11-21% average improvement is stated without any baseline systems, number of runs, variance, error bars, or statistical significance tests. This leaves the magnitude and reliability of the gains unverifiable and prevents assessment of whether the Dual-Cluster Memory Construction is the causal factor.
  2. [§3.2] §3.2 (Dual-Cluster Memory Construction): the LLM-based distillation of finite historical solutions into Approach/Checklist/Pitfall forms risks embedding benchmark-specific modeling patterns rather than paradigm-agnostic rules. No ablation or invariance test is described that would confirm the distilled knowledge transfers to genuinely new multi-paradigm instances outside the historical distribution.
  3. [§4, §5] §4 and §5 (knowledge inheritance analysis): the observation that larger-model memory guides smaller models is presented without specifying the model pairs, the exact transfer protocol, or controls that isolate inheritance from simple prompt length or example count effects. This weakens the scalability claim.
minor comments (2)
  1. [Abstract] Abstract: '11%- 21%' contains an extraneous space; standardize formatting.
  2. [§3] Notation for the three distilled knowledge types (Approach, Checklist, Pitfall) is introduced without a clear table or figure showing their exact template structure or how they are retrieved at inference time.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the empirical rigor and clarity of our claims. We address each major comment point by point below and will incorporate revisions to improve the manuscript.

Point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (Experiments): the central performance claim of 11-21% average improvement is stated without any baseline systems, number of runs, variance, error bars, or statistical significance tests. This leaves the magnitude and reliability of the gains unverifiable and prevents assessment of whether the Dual-Cluster Memory Construction is the causal factor.

    Authors: We agree that the current presentation of results lacks sufficient detail to fully substantiate the claims. The experiments compared DCM-Agent against standard zero-shot and few-shot prompting baselines across the seven benchmarks using models such as GPT-4. We performed multiple runs with varied seeds and computed averages, but these specifics were not explicitly reported. In the revised version, we will expand §4 to describe the baselines in detail, specify the number of runs (increased to 5 for robustness), include standard deviations and error bars in all tables and figures, and report statistical significance via paired t-tests. These additions will allow readers to verify that the observed gains are attributable to the dual-cluster memory mechanism. revision: yes

  2. Referee: [§3.2] §3.2 (Dual-Cluster Memory Construction): the LLM-based distillation of finite historical solutions into Approach/Checklist/Pitfall forms risks embedding benchmark-specific modeling patterns rather than paradigm-agnostic rules. No ablation or invariance test is described that would confirm the distilled knowledge transfers to genuinely new multi-paradigm instances outside the historical distribution.

    Authors: This concern about potential overfitting to the historical distribution is well-taken. The distillation process was designed to extract higher-level patterns by prompting the LLM to generalize across the provided solutions, but we did not include explicit tests for transfer to unseen problems. We will revise §3.2 to better explain the generalization intent and add an ablation study in §4. This study will evaluate the distilled knowledge on held-out multi-paradigm instances excluded from memory construction and test invariance by varying the size and diversity of the historical solution set, thereby demonstrating transfer beyond the original distribution. revision: yes

  3. Referee: [§4, §5] §4 and §5 (knowledge inheritance analysis): the observation that larger-model memory guides smaller models is presented without specifying the model pairs, the exact transfer protocol, or controls that isolate inheritance from simple prompt length or example count effects. This weakens the scalability claim.

    Authors: We acknowledge that the inheritance analysis requires more precise specification and controls to support the scalability interpretation. The experiments transferred memory constructed by larger models (e.g., GPT-4 to GPT-3.5; Llama-2-70B to Llama-2-13B) by directly providing the distilled Approach/Checklist/Pitfall structures during inference on the smaller model. In the revised manuscript, we will explicitly list the model pairs and transfer protocol in §5. We will also add control experiments comparing against prompts of matched length containing random examples or non-distilled historical solutions, isolating the contribution of the structured, distilled knowledge from mere example count or length effects. revision: yes

Circularity Check

0 steps flagged

No significant circularity; the empirical method relies on external historical data without self-referential reduction.

Full rationale

The paper presents an empirical agent framework that constructs memory from historical optimization solutions via clustering and LLM distillation, then applies it in inference. No equations, derivations, or fitted parameters are described. The central process (Dual-Cluster Memory Construction) operates on external prior data in a training-free manner, and performance gains are measured on separate benchmarks. No self-citations, uniqueness theorems, or ansatzes are invoked to justify load-bearing steps. The derivation chain does not reduce to its inputs by construction; claims rest on observable transfer from historical corpora rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unstated premise that clustering and distilling historical solutions produces transferable guidance; no free parameters, axioms, or invented entities are specified in the abstract.

pith-pipeline@v0.9.0 · 5491 in / 1035 out tokens · 22820 ms · 2026-05-10T00:35:26.824440+00:00 · methodology

discussion (0)

