ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

Bangjie Xu; Dongrui Liu; Haodi Lei; Haoran Zhang; Ning Ding; Qianjia Cheng; Runzhe Zhan; Shunkai Zhang; Xiaoye Qu; Xinmiao Han

arxiv: 2606.10479 · v1 · pith:F75A6PKKnew · submitted 2026-06-09 · 💻 cs.AI

ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

Shunkai Zhang , Haoran Zhang , Yun Luo , Qianjia Cheng , Haodi Lei , Yizhuo Li , Runzhe Zhan , Zhilin Wang

show 10 more authors

Bangjie Xu Yucheng Su Xinmiao Han Xiaoye Qu Dongrui Liu Zhouchen Lin Yu Qiao Ning Ding Yafu Li Yu Cheng

This is my paper

Pith reviewed 2026-06-27 13:14 UTC · model grok-4.3

classification 💻 cs.AI

keywords ComBenchOlympiad combinatoricsLLM evaluationproof reasoningconstructive realizationmathematical reasoning benchmarkcombinatorics problemsfrontier model assessment

0 comments

The pith

A benchmark of 100 Olympiad combinatorics problems shows that rigorous proof reasoning and constructive realization are distinct capabilities in frontier language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ComBench to evaluate large language models on competition-level combinatorics tasks that demand either detailed mathematical proofs or explicit constructions plus justifications. It splits problems into analysis-centric and construction-centric categories and uses rubric-based grading plus deterministic verification to measure performance. Experiments reveal that overall scores remain well below saturation even for the strongest models, and that different models lead in different categories, with proof grading and construction success diverging. This separation matters because it identifies specific gaps in how models handle the dual demands of combinatorial reasoning.

Core claim

ComBench comprises 100 human-annotated problems divided into analysis-centric settings that test rigorous arguments and construction-centric settings that additionally require valid explicit constructions. The evaluation protocol pairs rubric-guided proof grading with deterministic construction verification. Frontier models reach at most 65.4 percent overall average and 75.3 percent Best@4, with Kimi-K2.6 trailing GPT-5.5 on analysis-centric proof grading yet surpassing it on construction-centric Best@4, while existence and construction problems prove consistently hardest.

What carries the argument

ComBench benchmark consisting of analysis-centric and construction-centric problem sets together with rubric-guided proof grading and deterministic construction verification.

If this is right

Frontier models require separate diagnostic evaluation on proof reasoning versus construction tasks rather than aggregate scores alone.
Existence and construction problems remain the hardest category across current models and will need targeted advances.
The observed divergence between models on the two settings indicates that scaling or general training does not automatically align both skills.
Benchmarks that isolate these capabilities can guide development of models that handle both rigorous justification and explicit object creation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The distinction may extend to other discrete mathematics domains where proof and construction appear together.
Training regimes could be designed to balance exposure to proof-heavy and construction-heavy examples.
Hybrid problems that demand both skills simultaneously might serve as stronger stress tests for future models.

Load-bearing premise

The 100 selected problems represent typical Olympiad combinatorics difficulty and the combined grading methods accurately capture the intended reasoning skills without systematic bias or verification mistakes.

What would settle it

A controlled test in which the same model achieves top scores on both analysis-centric proof grading and construction-centric verification without the observed performance divergence would undermine the claim that the two capabilities are distinct.

Figures

Figures reproduced from arXiv: 2606.10479 by Bangjie Xu, Dongrui Liu, Haodi Lei, Haoran Zhang, Ning Ding, Qianjia Cheng, Runzhe Zhan, Shunkai Zhang, Xiaoye Qu, Xinmiao Han, Yafu Li, Yizhuo Li, Yu Cheng, Yucheng Su, Yun Luo, Yu Qiao, Zhilin Wang, Zhouchen Lin.

**Figure 2.** Figure 2: Distribution of ComBench categories. We introduce ComBench, a benchmark of 100 Olympiadlevel combinatorics problems designed to evaluate mathematical reasoning beyond final-answer correctness. It is balanced between 50 analysis-centric problems and 50 construction-centric problems. The benchmark spans 15 competition sources from 2000 to 2025, including the International Mathematical Olympiad (IMO), Uni… view at source ↗

**Figure 3.** Figure 3: Construction-centric data builder pipeline for ComBench. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Verifier-gated scoring rule for construction-centric records. ComBench keeps proof grading and construction verification as separate signals so that Rigorous Proof Reasoning and Constructive Realization can be analyzed independently. For model ranking, however, construction-centric records also require a single score. Analysis-centric records use the proof score returned by the rubric-guided judge. For c… view at source ↗

**Figure 5.** Figure 5: Category-level performance of four repre [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Distribution of primary proof error reasons over all below-full-credit proof samples. Primary [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

read the original abstract

Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier models remain uneven on Olympiad combinatorics, revealing a gap in creative mathematical reasoning. We introduce ComBench, an Olympiad-level combinatorics benchmark for evaluating and diagnosing the combinatorial reasoning capabilities of large language models. ComBench contains 100 human-annotated competition-level problems organized around two complementary settings: analysis-centric problems, which primarily require rigorous mathematical arguments, and construction-centric problems, which require explicit constructions in addition to correctness justifications. The evaluation protocol combines rubric-guided proof grading with deterministic construction verification, exposing cases where proof quality and construction validity diverge. Experiments on frontier open- and closed-source models show that ComBench is far from saturated: the strongest model reaches 65.4% overall Avg. and 75.3% overall Best@4. We further find that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities: Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4, while Existence and Construction problems remain consistently hardest across representative frontier models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ComBench adds a split between proof and construction in combinatorics reasoning but the distinction claim needs annotation validation to hold up.

read the letter

ComBench introduces a benchmark of 100 Olympiad combinatorics problems divided into analysis-centric and construction-centric tracks. The key observation is that models can perform differently on rigorous proof tasks versus those requiring explicit constructions, with Kimi-K2.6 trailing GPT-5.5 on proofs but leading on constructions.

The new part is the complementary tracks and the evidence that these are distinct capabilities rather than the same thing. The protocol uses rubric grading for proofs and deterministic verification for constructions, which helps expose the divergence. The top model scores are not close to ceiling, so the benchmark has room to grow.

This setup does a solid job of focusing on combinatorics, an area that mixes discrete reasoning and creative insight, and the results point to where models still struggle, especially on existence and construction problems.

The main concern is the missing details on how the problems were selected and labeled. No inter-annotator agreement is reported, and there's no mention of checks for consistency in the categorization or grading rubrics. That leaves the central claim about distinct capabilities on somewhat shaky ground, as annotation differences could explain the model crossover instead of real capability differences.

Readers who work on AI for mathematical reasoning or who build evaluation suites will get the most from this. It gives a concrete dataset and some diagnostic findings that could guide future model improvements.

The paper shows clear thinking in designing the two settings and running the experiments. It deserves peer review because a well-vetted benchmark in this space would be valuable, even if the current version needs more on the reliability side.

I would recommend sending it out for review and asking the authors to include agreement metrics and perhaps a small audit on the annotations.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ComBench, a benchmark of 100 human-annotated Olympiad-level combinatorics problems split into analysis-centric (rigorous proof) and construction-centric (explicit construction plus justification) categories. It describes an evaluation protocol combining rubric-guided proof grading with deterministic construction verification, reports that frontier models remain far from saturation (strongest reaches 65.4% overall Avg. and 75.3% overall Best@4), and concludes that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities on the basis of a performance crossover (Kimi-K2.6 trails GPT-5.5 on analysis-centric grading but surpasses it on construction-centric Best@4).

Significance. If the problem categorization and grading protocol prove reliable, ComBench would supply a diagnostically useful resource for combinatorial reasoning in LLMs, with the reported distinction between proof and construction skills offering a concrete signal for model development. The work is a pure empirical evaluation on a newly collected dataset and contains no fitted parameters or self-referential derivations.

major comments (2)

[Methods (problem annotation and categorization)] Methods (problem sourcing and annotation): The manuscript states that the 100 problems are human-annotated and organized into analysis-centric versus construction-centric settings but reports neither inter-annotator agreement figures nor an error-rate audit on a held-out subset. This is load-bearing for the central claim of distinct capabilities, because the observed Kimi-K2.6 vs. GPT-5.5 crossover could be an artifact of annotation variance without sensitivity analysis showing robustness to modest re-labeling.
[Evaluation protocol] Evaluation protocol: Exact rubric criteria for proof grading and the precise deterministic checks for construction validity are not supplied, nor is any validation that the protocol measures the intended skills without systematic bias. This directly undermines the support for the performance numbers and the conclusion that divergence reflects distinct reasoning skills rather than grading artifacts.

minor comments (2)

[Results] The abstract and results sections refer to 'Existence and Construction problems' as consistently hardest but do not include a per-subcategory breakdown table across models to substantiate the consistency claim.
Consider adding a short table or appendix entry listing the distribution of the 100 problems across the two main categories and any subcategories.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting important aspects of our methodology and evaluation protocol. We provide point-by-point responses to the major comments and indicate the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: Methods (problem annotation and categorization): The manuscript states that the 100 problems are human-annotated and organized into analysis-centric versus construction-centric settings but reports neither inter-annotator agreement figures nor an error-rate audit on a held-out subset. This is load-bearing for the central claim of distinct capabilities, because the observed Kimi-K2.6 vs. GPT-5.5 crossover could be an artifact of annotation variance without sensitivity analysis showing robustness to modest re-labeling.

Authors: We agree that providing inter-annotator agreement and an error-rate audit would enhance confidence in the categorization. The annotation was performed by domain experts using explicit guidelines for distinguishing the two categories. In revision, we will expand the methods section with a detailed account of the annotation process, including examples of categorization decisions. Additionally, we will conduct and report a sensitivity analysis on a held-out subset to assess robustness of the performance crossover to potential re-labeling. We note that the observed distinction is supported by consistent patterns across multiple models, but we acknowledge the value of these additional checks. revision: partial
Referee: Evaluation protocol: Exact rubric criteria for proof grading and the precise deterministic checks for construction validity are not supplied, nor is any validation that the protocol measures the intended skills without systematic bias. This directly undermines the support for the performance numbers and the conclusion that divergence reflects distinct reasoning skills rather than grading artifacts.

Authors: We appreciate this point and will revise the manuscript to include the full rubric criteria in an appendix, along with the exact deterministic verification procedures for constructions. We will also add a section discussing the validation steps taken during protocol development, such as pilot grading sessions and alignment with expert judgments, to demonstrate that the protocol targets the intended skills. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no derivation chain or self-referential reductions

full rationale

The paper collects a new dataset of 100 human-annotated Olympiad combinatorics problems, defines two categories (analysis-centric and construction-centric), applies rubric-guided grading plus deterministic verification, and reports empirical model scores. No equations, fitted parameters, predictions, uniqueness theorems, or ansatzes appear. No self-citations are invoked to justify core claims. The observed performance differences are direct measurements on the external dataset rather than reductions to the paper's own inputs or prior fitted quantities. This is a standard self-contained empirical benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper; no mathematical derivations, fitted constants, or new theoretical entities are introduced.

pith-pipeline@v0.9.1-grok · 5816 in / 1064 out tokens · 18376 ms · 2026-06-27T13:14:27.815334+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

18 extracted references · 1 canonical work pages

[1]

Accessed: 2026-05-26

Google Keyword Blog. Accessed: 2026-05-26. Google DeepMind. Gemini 3.1 Pro. https://deepmind.google/models/gemini/pro/,

2026
[2]

Empty input

Accessed: 2026-05-26. Moonshot AI. Kimi K2.6: Advancing Open-Source Coding. https://www.kimi.com/blog/ kimi-k2-6, 2026. Moonshot AI Tech Blog. Accessed: 2026-05-26. 9 OpenAI. Introducing GPT -5.5. 2026. URL https://openai.com/index/ introducing-gpt-5-5/. Accessed: 2026-05-26. Qwen Team. Qwen3.6-35B-A3B: Agentic Coding Power, Now Open to All. 2026. URL htt...

work page arXiv 2026
[3]

Please use LaTeX format to represent the variables and formulas used in the solution process and results
[4]

Please provide a complete and explicit solution process in the response
[5]

"" [Original problem statement]

In the end of the response: - If the problem has final answer(s), put them in \boxed{}. - If the problem requires multiple answers, list them in order, each in a separate \boxed{}. - If the problem is a proof or does not require a final numerical answer, provide a complete and rigorous proof, and do not use \boxed{}. Problem: [Original problem statement] ...

2019
[6]

the first cell in the sequence is a valley, meaning the number written is less than all its orthogonal neighbors
[7]

each subsequent cell in the sequence is orthogonally adjacent to the previous cell; and
[8]

standard zig- zag deletion pattern

the numbers written in the cells in the sequence are in increasing order. Find, as a function ofn, the smallest possible total number of uphill paths in a Nordic square. Relevant grading guidelines. • Partial:Establish the lower-bound idea by associating adjacent pairs with uphill paths, correctly count the 2n(n−1) adjacent pairs in the grid, and add at l...

2009
[9]

there is an oriented edgev→wwithh(w)−h(v) =−3, or
[10]

at most two

there is an oriented edgew→vwithh(v)−h(w) = 1. In either case, comparing with the corresponding valid edge difference forH shows that d(w)≥d(v) . Moreover, in both cases, h(w)< h(v) . Thus from v we can move to another vertex w∈D with h(w)< h(v) . Repeating this process gives an infinite sequence of vertices in the finite set D along which the integerh-va...

2020
[11]

Then for any other pair (x, y), the determinant ay−bx is a multiple of gcd(a, b) =d

Only primitive pairs can contribute.Suppose a pair (a, b) has d= gcd(a, b)>1 . Then for any other pair (x, y), the determinant ay−bx is a multiple of gcd(a, b) =d . Hence |ay−bx| ≥d >1 . Thus a non-primitive pair cannot form an edge with any other pair. To maximize N, we may restrict our attention to primitive pairs, those with gcd(a, b) = 1 . The only pr...
[12]

Connect two vertices by an edge if their determinant is ±1

Graph interpretation.Consider the set V of all primitive pairs (a, b) with a, b≥0 and (a, b)̸= (0,0) . Connect two vertices by an edge if their determinant is ±1. This graph is exactly the Stern-Brocot tree, or Farey graph. It is well known that the Stern-Brocot tree is a tree: it is connected and contains no cycles
[13]

A forest onnvertices has at mostn−1edges

Upper bound.Choosing any 100 vertices from a tree induces a subgraph which is a forest. A forest onnvertices has at mostn−1edges. ThereforeN≤100−1 = 99
[14]

,(1,98),(1,99)

Achieving the bound.Take the 100 pairs (1,0),(1,1),(1,2), . . . ,(1,98),(1,99) . All are distinct, nonnegative, and primitive. For any0≤k≤98, |1·(k+ 1)−1·k|= 1, so each consecutive pair contributes an edge. There are exactly 99 such edges, and no other pair gives determinant±1. HenceN= 99is attainable. 99 Scoring interpretation.The response contains a val...

2021
[15]

, m−1 , remove row 3k+ 1

Clear the middle rows.For each k= 0, . . . , m−1 , remove row 3k+ 1 . After these clears the only remaining stones are • in row3k: columns3ℓ+ 1for allℓ; • in row3k+ 2: columns3ℓand3ℓ+ 1for allℓ. 3.Second batch of placements.For everyk, ℓ∈ {0, . . . , m−1}, place stones at (3k+ 1,3ℓ). These moves are also pairwise disjoint and are legal because row 3k+ 1 i...
[16]

, m−1 , remove column 3ℓ and then column 3ℓ+ 1

Clear the full columns.For each ℓ= 0, . . . , m−1 , remove column 3ℓ and then column 3ℓ+ 1 . All stones have been removed, and many moves were performed. Hencen= 3mworks. Impossibility when 3∤n .Colour the columns by their index modulo 3; call the three colour classes C0, C1, C2. For a configuration letN i be the number of stones inC i. A placement at (r,...

2025
[17]

, n−1} , the cells (0, c),(1, c),

Upper Triangle (r < c) .For each column c∈ {1,2, . . . , n−1} , the cells (0, c),(1, c), . . . ,(c− 1, c)form a vertical rectangle of sizec×1. There aren−1such rectangles
[18]

, n−1} , the cells (r,0),(r,1),

Lower Triangle (r > c) .For each row r∈ {1,2, . . . , n−1} , the cells (r,0),(r,1), . . . ,(r, r−1) form a horizontal rectangle of size1×r. There aren−1such rectangles. These 2n−2 rectangles are disjoint and cover all cells except the diagonal holes. Thus, 2n−2 tiles are sufficient. Minimality (Lower Bound).Analysis of small cases n= 1,2,3,4 shows that th...

2025

[1] [1]

Accessed: 2026-05-26

Google Keyword Blog. Accessed: 2026-05-26. Google DeepMind. Gemini 3.1 Pro. https://deepmind.google/models/gemini/pro/,

2026

[2] [2]

Empty input

Accessed: 2026-05-26. Moonshot AI. Kimi K2.6: Advancing Open-Source Coding. https://www.kimi.com/blog/ kimi-k2-6, 2026. Moonshot AI Tech Blog. Accessed: 2026-05-26. 9 OpenAI. Introducing GPT -5.5. 2026. URL https://openai.com/index/ introducing-gpt-5-5/. Accessed: 2026-05-26. Qwen Team. Qwen3.6-35B-A3B: Agentic Coding Power, Now Open to All. 2026. URL htt...

work page arXiv 2026

[3] [3]

Please use LaTeX format to represent the variables and formulas used in the solution process and results

[4] [4]

Please provide a complete and explicit solution process in the response

[5] [5]

"" [Original problem statement]

In the end of the response: - If the problem has final answer(s), put them in \boxed{}. - If the problem requires multiple answers, list them in order, each in a separate \boxed{}. - If the problem is a proof or does not require a final numerical answer, provide a complete and rigorous proof, and do not use \boxed{}. Problem: [Original problem statement] ...

2019

[6] [6]

the first cell in the sequence is a valley, meaning the number written is less than all its orthogonal neighbors

[7] [7]

each subsequent cell in the sequence is orthogonally adjacent to the previous cell; and

[8] [8]

standard zig- zag deletion pattern

the numbers written in the cells in the sequence are in increasing order. Find, as a function ofn, the smallest possible total number of uphill paths in a Nordic square. Relevant grading guidelines. • Partial:Establish the lower-bound idea by associating adjacent pairs with uphill paths, correctly count the 2n(n−1) adjacent pairs in the grid, and add at l...

2009

[9] [9]

there is an oriented edgev→wwithh(w)−h(v) =−3, or

[10] [10]

at most two

there is an oriented edgew→vwithh(v)−h(w) = 1. In either case, comparing with the corresponding valid edge difference forH shows that d(w)≥d(v) . Moreover, in both cases, h(w)< h(v) . Thus from v we can move to another vertex w∈D with h(w)< h(v) . Repeating this process gives an infinite sequence of vertices in the finite set D along which the integerh-va...

2020

[11] [11]

Then for any other pair (x, y), the determinant ay−bx is a multiple of gcd(a, b) =d

Only primitive pairs can contribute.Suppose a pair (a, b) has d= gcd(a, b)>1 . Then for any other pair (x, y), the determinant ay−bx is a multiple of gcd(a, b) =d . Hence |ay−bx| ≥d >1 . Thus a non-primitive pair cannot form an edge with any other pair. To maximize N, we may restrict our attention to primitive pairs, those with gcd(a, b) = 1 . The only pr...

[12] [12]

Connect two vertices by an edge if their determinant is ±1

Graph interpretation.Consider the set V of all primitive pairs (a, b) with a, b≥0 and (a, b)̸= (0,0) . Connect two vertices by an edge if their determinant is ±1. This graph is exactly the Stern-Brocot tree, or Farey graph. It is well known that the Stern-Brocot tree is a tree: it is connected and contains no cycles

[13] [13]

A forest onnvertices has at mostn−1edges

Upper bound.Choosing any 100 vertices from a tree induces a subgraph which is a forest. A forest onnvertices has at mostn−1edges. ThereforeN≤100−1 = 99

[14] [14]

,(1,98),(1,99)

Achieving the bound.Take the 100 pairs (1,0),(1,1),(1,2), . . . ,(1,98),(1,99) . All are distinct, nonnegative, and primitive. For any0≤k≤98, |1·(k+ 1)−1·k|= 1, so each consecutive pair contributes an edge. There are exactly 99 such edges, and no other pair gives determinant±1. HenceN= 99is attainable. 99 Scoring interpretation.The response contains a val...

2021

[15] [15]

, m−1 , remove row 3k+ 1

Clear the middle rows.For each k= 0, . . . , m−1 , remove row 3k+ 1 . After these clears the only remaining stones are • in row3k: columns3ℓ+ 1for allℓ; • in row3k+ 2: columns3ℓand3ℓ+ 1for allℓ. 3.Second batch of placements.For everyk, ℓ∈ {0, . . . , m−1}, place stones at (3k+ 1,3ℓ). These moves are also pairwise disjoint and are legal because row 3k+ 1 i...

[16] [16]

, m−1 , remove column 3ℓ and then column 3ℓ+ 1

Clear the full columns.For each ℓ= 0, . . . , m−1 , remove column 3ℓ and then column 3ℓ+ 1 . All stones have been removed, and many moves were performed. Hencen= 3mworks. Impossibility when 3∤n .Colour the columns by their index modulo 3; call the three colour classes C0, C1, C2. For a configuration letN i be the number of stones inC i. A placement at (r,...

2025

[17] [17]

, n−1} , the cells (0, c),(1, c),

Upper Triangle (r < c) .For each column c∈ {1,2, . . . , n−1} , the cells (0, c),(1, c), . . . ,(c− 1, c)form a vertical rectangle of sizec×1. There aren−1such rectangles

[18] [18]

, n−1} , the cells (r,0),(r,1),

Lower Triangle (r > c) .For each row r∈ {1,2, . . . , n−1} , the cells (r,0),(r,1), . . . ,(r, r−1) form a horizontal rectangle of size1×r. There aren−1such rectangles. These 2n−2 rectangles are disjoint and cover all cells except the diagonal holes. Thus, 2n−2 tiles are sufficient. Minimality (Lower Bound).Analysis of small cases n= 1,2,3,4 shows that th...

2025