Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions

Feiran Huang; Hao Wu; Hongxia Yang; Junnan Dong; Linyi Li; Su Dong; Xiao Huang; Yilin Xiao; Yujing Zhang; Zhu Wang

arxiv: 2501.11790 · v5 · pith:YXECJJ7Jnew · submitted 2025-01-20 · 💻 cs.CL · cs.AI


Zijin Hong, Hao Wu, Su Dong, Junnan Dong, Yilin Xiao, Yujing Zhang, Zhu Wang, Feiran Huang, Linyi Li, Hongxia Yang, Xiao Huang



keywords: llms, reasoning, mathematical, random, rv-bench, rvqs, unseen, variable


Recent studies have raised significant concerns regarding the reliability of current mathematics benchmarks, highlighting issues such as simplistic design and potential data contamination. Consequently, developing a reliable benchmark that effectively evaluates large language models' (LLMs) genuine capabilities in mathematical reasoning remains a critical challenge. To address these concerns, we propose RV-Bench, a novel evaluation methodology for Benchmarking LLMs with Random Variables in mathematical reasoning. Specifically, we build question-generating functions to produce random variable questions (RVQs), whose background content mirrors original benchmark problems, but with randomized variable combinations, rendering them "unseen" to LLMs. Models must completely understand the inherent question pattern to correctly answer RVQs with diverse variable combinations. Thus, an LLM's genuine reasoning capability is reflected through its accuracy and robustness on RV-Bench. We conducted extensive experiments on over 30 representative LLMs across more than 1,000 RVQs. Our findings suggest that LLMs exhibit a proficiency imbalance between encountered and "unseen" data distributions. Furthermore, RV-Bench reveals that proficiency generalization across similar mathematical reasoning tasks is limited, though we verify that it can still be effectively elicited through test-time scaling.
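
To make the idea of a question-generating function concrete, here is a minimal Python sketch. It is not the RV-Bench implementation: the train problem template, the variable ranges, the names `generate_rvq`, `evaluate`, `oracle_model`, and `memorizing_model`, and the "all instances correct" robustness proxy are all assumptions chosen for illustration. The abstract does not specify how accuracy and robustness are computed.

```python
import random
import re


def generate_rvq(rng):
    """Hypothetical question-generating function for one problem template.

    The background text stays fixed; only the variables are randomized.
    """
    speed = rng.randint(30, 90)  # km/h, assumed range
    hours = rng.randint(2, 9)    # assumed range
    question = (
        f"A train travels at a constant speed of {speed} km/h. "
        f"How many kilometres does it cover in {hours} hours?"
    )
    answer = speed * hours  # ground truth computed from the sampled variables
    return question, answer


def evaluate(model, n_instances=5, seed=0):
    """Score a model on several random instances of the same template.

    Returns (accuracy, all_correct). `all_correct` is an assumed stand-in
    for robustness: the model must solve every sampled variant.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_instances):
        question, answer = generate_rvq(rng)
        results.append(model(question) == answer)
    return sum(results) / n_instances, all(results)


def oracle_model(question):
    """Stand-in for a model that has grasped the pattern: multiply the two numbers."""
    speed, hours = (int(n) for n in re.findall(r"\d+", question))
    return speed * hours


def memorizing_model(question):
    """Stand-in for a model that repeats one fixed answer regardless of the variables."""
    return -1
```

In this sketch, `evaluate(oracle_model)` scores every variant correctly, while `evaluate(memorizing_model)` fails all of them, which is the kind of gap between pattern understanding and recall that randomized variables are meant to expose.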



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation
cs.CL 2025-02 unverdicted novelty 6.0

KaSLA applies knapsack optimization hierarchically to schema linking for LLM text-to-SQL, claiming better results than large models and improved SQL generation on Spider and BIRD.