ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics
Pith reviewed 2026-06-27 13:14 UTC · model grok-4.3
The pith
A benchmark of 100 Olympiad combinatorics problems shows that rigorous proof reasoning and constructive realization are distinct capabilities in frontier language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ComBench comprises 100 human-annotated problems divided into analysis-centric settings that test rigorous arguments and construction-centric settings that additionally require valid explicit constructions. The evaluation protocol pairs rubric-guided proof grading with deterministic construction verification. Frontier models reach at most 65.4 percent overall average and 75.3 percent Best@4, with Kimi-K2.6 trailing GPT-5.5 on analysis-centric proof grading yet surpassing it on construction-centric Best@4, while existence and construction problems prove consistently hardest.
What carries the argument
ComBench benchmark consisting of analysis-centric and construction-centric problem sets together with rubric-guided proof grading and deterministic construction verification.
If this is right
- Frontier models require separate diagnostic evaluation on proof reasoning versus construction tasks rather than aggregate scores alone.
- Existence and construction problems remain the hardest category across current models and will need targeted advances.
- The observed divergence between models on the two settings indicates that scaling or general training does not automatically align both skills.
- Benchmarks that isolate these capabilities can guide development of models that handle both rigorous justification and explicit object creation.
Where Pith is reading between the lines
- The distinction may extend to other discrete mathematics domains where proof and construction appear together.
- Training regimes could be designed to balance exposure to proof-heavy and construction-heavy examples.
- Hybrid problems that demand both skills simultaneously might serve as stronger stress tests for future models.
Load-bearing premise
The 100 selected problems represent typical Olympiad combinatorics difficulty and the combined grading methods accurately capture the intended reasoning skills without systematic bias or verification mistakes.
What would settle it
A controlled test in which the same model achieves top scores on both analysis-centric proof grading and construction-centric verification without the observed performance divergence would undermine the claim that the two capabilities are distinct.
Figures
read the original abstract
Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier models remain uneven on Olympiad combinatorics, revealing a gap in creative mathematical reasoning. We introduce ComBench, an Olympiad-level combinatorics benchmark for evaluating and diagnosing the combinatorial reasoning capabilities of large language models. ComBench contains 100 human-annotated competition-level problems organized around two complementary settings: analysis-centric problems, which primarily require rigorous mathematical arguments, and construction-centric problems, which require explicit constructions in addition to correctness justifications. The evaluation protocol combines rubric-guided proof grading with deterministic construction verification, exposing cases where proof quality and construction validity diverge. Experiments on frontier open- and closed-source models show that ComBench is far from saturated: the strongest model reaches 65.4% overall Avg. and 75.3% overall Best@4. We further find that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities: Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4, while Existence and Construction problems remain consistently hardest across representative frontier models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ComBench, a benchmark of 100 human-annotated Olympiad-level combinatorics problems split into analysis-centric (rigorous proof) and construction-centric (explicit construction plus justification) categories. It describes an evaluation protocol combining rubric-guided proof grading with deterministic construction verification, reports that frontier models remain far from saturation (strongest reaches 65.4% overall Avg. and 75.3% overall Best@4), and concludes that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities on the basis of a performance crossover (Kimi-K2.6 trails GPT-5.5 on analysis-centric grading but surpasses it on construction-centric Best@4).
Significance. If the problem categorization and grading protocol prove reliable, ComBench would supply a diagnostically useful resource for combinatorial reasoning in LLMs, with the reported distinction between proof and construction skills offering a concrete signal for model development. The work is a pure empirical evaluation on a newly collected dataset and contains no fitted parameters or self-referential derivations.
major comments (2)
- [Methods (problem annotation and categorization)] Methods (problem sourcing and annotation): The manuscript states that the 100 problems are human-annotated and organized into analysis-centric versus construction-centric settings but reports neither inter-annotator agreement figures nor an error-rate audit on a held-out subset. This is load-bearing for the central claim of distinct capabilities, because the observed Kimi-K2.6 vs. GPT-5.5 crossover could be an artifact of annotation variance without sensitivity analysis showing robustness to modest re-labeling.
- [Evaluation protocol] Evaluation protocol: Exact rubric criteria for proof grading and the precise deterministic checks for construction validity are not supplied, nor is any validation that the protocol measures the intended skills without systematic bias. This directly undermines the support for the performance numbers and the conclusion that divergence reflects distinct reasoning skills rather than grading artifacts.
minor comments (2)
- [Results] The abstract and results sections refer to 'Existence and Construction problems' as consistently hardest but do not include a per-subcategory breakdown table across models to substantiate the consistency claim.
- Consider adding a short table or appendix entry listing the distribution of the 100 problems across the two main categories and any subcategories.
Simulated Author's Rebuttal
We thank the referee for highlighting important aspects of our methodology and evaluation protocol. We provide point-by-point responses to the major comments and indicate the revisions we will make to strengthen the manuscript.
read point-by-point responses
-
Referee: Methods (problem annotation and categorization): The manuscript states that the 100 problems are human-annotated and organized into analysis-centric versus construction-centric settings but reports neither inter-annotator agreement figures nor an error-rate audit on a held-out subset. This is load-bearing for the central claim of distinct capabilities, because the observed Kimi-K2.6 vs. GPT-5.5 crossover could be an artifact of annotation variance without sensitivity analysis showing robustness to modest re-labeling.
Authors: We agree that providing inter-annotator agreement and an error-rate audit would enhance confidence in the categorization. The annotation was performed by domain experts using explicit guidelines for distinguishing the two categories. In revision, we will expand the methods section with a detailed account of the annotation process, including examples of categorization decisions. Additionally, we will conduct and report a sensitivity analysis on a held-out subset to assess robustness of the performance crossover to potential re-labeling. We note that the observed distinction is supported by consistent patterns across multiple models, but we acknowledge the value of these additional checks. revision: partial
-
Referee: Evaluation protocol: Exact rubric criteria for proof grading and the precise deterministic checks for construction validity are not supplied, nor is any validation that the protocol measures the intended skills without systematic bias. This directly undermines the support for the performance numbers and the conclusion that divergence reflects distinct reasoning skills rather than grading artifacts.
Authors: We appreciate this point and will revise the manuscript to include the full rubric criteria in an appendix, along with the exact deterministic verification procedures for constructions. We will also add a section discussing the validation steps taken during protocol development, such as pilot grading sessions and alignment with expert judgments, to demonstrate that the protocol targets the intended skills. revision: yes
Circularity Check
Empirical benchmark evaluation with no derivation chain or self-referential reductions
full rationale
The paper collects a new dataset of 100 human-annotated Olympiad combinatorics problems, defines two categories (analysis-centric and construction-centric), applies rubric-guided grading plus deterministic verification, and reports empirical model scores. No equations, fitted parameters, predictions, uniqueness theorems, or ansatzes appear. No self-citations are invoked to justify core claims. The observed performance differences are direct measurements on the external dataset rather than reductions to the paper's own inputs or prior fitted quantities. This is a standard self-contained empirical benchmark paper.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Accessed: 2026-05-26
Google Keyword Blog. Accessed: 2026-05-26. Google DeepMind. Gemini 3.1 Pro. https://deepmind.google/models/gemini/pro/,
2026
-
[2]
Accessed: 2026-05-26. Moonshot AI. Kimi K2.6: Advancing Open-Source Coding. https://www.kimi.com/blog/ kimi-k2-6, 2026. Moonshot AI Tech Blog. Accessed: 2026-05-26. 9 OpenAI. Introducing GPT -5.5. 2026. URL https://openai.com/index/ introducing-gpt-5-5/. Accessed: 2026-05-26. Qwen Team. Qwen3.6-35B-A3B: Agentic Coding Power, Now Open to All. 2026. URL htt...
-
[3]
Please use LaTeX format to represent the variables and formulas used in the solution process and results
-
[4]
Please provide a complete and explicit solution process in the response
-
[5]
"" [Original problem statement]
In the end of the response: - If the problem has final answer(s), put them in \boxed{}. - If the problem requires multiple answers, list them in order, each in a separate \boxed{}. - If the problem is a proof or does not require a final numerical answer, provide a complete and rigorous proof, and do not use \boxed{}. Problem: [Original problem statement] ...
2019
-
[6]
the first cell in the sequence is a valley, meaning the number written is less than all its orthogonal neighbors
-
[7]
each subsequent cell in the sequence is orthogonally adjacent to the previous cell; and
-
[8]
standard zig- zag deletion pattern
the numbers written in the cells in the sequence are in increasing order. Find, as a function ofn, the smallest possible total number of uphill paths in a Nordic square. Relevant grading guidelines. • Partial:Establish the lower-bound idea by associating adjacent pairs with uphill paths, correctly count the 2n(n−1) adjacent pairs in the grid, and add at l...
2009
-
[9]
there is an oriented edgev→wwithh(w)−h(v) =−3, or
-
[10]
at most two
there is an oriented edgew→vwithh(v)−h(w) = 1. In either case, comparing with the corresponding valid edge difference forH shows that d(w)≥d(v) . Moreover, in both cases, h(w)< h(v) . Thus from v we can move to another vertex w∈D with h(w)< h(v) . Repeating this process gives an infinite sequence of vertices in the finite set D along which the integerh-va...
2020
-
[11]
Then for any other pair (x, y), the determinant ay−bx is a multiple of gcd(a, b) =d
Only primitive pairs can contribute.Suppose a pair (a, b) has d= gcd(a, b)>1 . Then for any other pair (x, y), the determinant ay−bx is a multiple of gcd(a, b) =d . Hence |ay−bx| ≥d >1 . Thus a non-primitive pair cannot form an edge with any other pair. To maximize N, we may restrict our attention to primitive pairs, those with gcd(a, b) = 1 . The only pr...
-
[12]
Connect two vertices by an edge if their determinant is ±1
Graph interpretation.Consider the set V of all primitive pairs (a, b) with a, b≥0 and (a, b)̸= (0,0) . Connect two vertices by an edge if their determinant is ±1. This graph is exactly the Stern-Brocot tree, or Farey graph. It is well known that the Stern-Brocot tree is a tree: it is connected and contains no cycles
-
[13]
A forest onnvertices has at mostn−1edges
Upper bound.Choosing any 100 vertices from a tree induces a subgraph which is a forest. A forest onnvertices has at mostn−1edges. ThereforeN≤100−1 = 99
-
[14]
,(1,98),(1,99)
Achieving the bound.Take the 100 pairs (1,0),(1,1),(1,2), . . . ,(1,98),(1,99) . All are distinct, nonnegative, and primitive. For any0≤k≤98, |1·(k+ 1)−1·k|= 1, so each consecutive pair contributes an edge. There are exactly 99 such edges, and no other pair gives determinant±1. HenceN= 99is attainable. 99 Scoring interpretation.The response contains a val...
2021
-
[15]
, m−1 , remove row 3k+ 1
Clear the middle rows.For each k= 0, . . . , m−1 , remove row 3k+ 1 . After these clears the only remaining stones are • in row3k: columns3ℓ+ 1for allℓ; • in row3k+ 2: columns3ℓand3ℓ+ 1for allℓ. 3.Second batch of placements.For everyk, ℓ∈ {0, . . . , m−1}, place stones at (3k+ 1,3ℓ). These moves are also pairwise disjoint and are legal because row 3k+ 1 i...
-
[16]
, m−1 , remove column 3ℓ and then column 3ℓ+ 1
Clear the full columns.For each ℓ= 0, . . . , m−1 , remove column 3ℓ and then column 3ℓ+ 1 . All stones have been removed, and many moves were performed. Hencen= 3mworks. Impossibility when 3∤n .Colour the columns by their index modulo 3; call the three colour classes C0, C1, C2. For a configuration letN i be the number of stones inC i. A placement at (r,...
2025
-
[17]
, n−1} , the cells (0, c),(1, c),
Upper Triangle (r < c) .For each column c∈ {1,2, . . . , n−1} , the cells (0, c),(1, c), . . . ,(c− 1, c)form a vertical rectangle of sizec×1. There aren−1such rectangles
-
[18]
, n−1} , the cells (r,0),(r,1),
Lower Triangle (r > c) .For each row r∈ {1,2, . . . , n−1} , the cells (r,0),(r,1), . . . ,(r, r−1) form a horizontal rectangle of size1×r. There aren−1such rectangles. These 2n−2 rectangles are disjoint and cover all cells except the diagonal holes. Thus, 2n−2 tiles are sufficient. Minimality (Lower Bound).Analysis of small cases n= 1,2,3,4 shows that th...
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.