GeometryZero: Advancing Geometry Solving via Group Contrastive Policy Optimization
Pith reviewed 2026-05-19 11:16 UTC · model grok-4.3
The pith
Group Contrastive Policy Optimization lets smaller models learn when auxiliary constructions help geometry problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that naively applying GRPO produces unconditional rewards that encourage harmful auxiliary constructions. Group Contrastive Masking instead assigns positive or negative rewards to those constructions according to their contextual utility for reaching the correct final answer. Paired with a length reward that favors longer reasoning chains, this lets smaller models learn selective, helpful construction use, as shown by consistent gains over RL baselines on Geometry3K and MathVista.
What carries the argument
Group Contrastive Masking, which labels each construction as positive or negative according to whether it improves the chance of a correct final answer within its reasoning group.
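The masking idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the `eps` default, and the per-response reward layout are hypothetical. It also assumes a symmetric margin for the negative case, whereas this page's summary of Eq. 4 only says the reward is positive when the mean accuracy with a construction beats the mean without it by more than ϵ, and zero when the gap is within ϵ.

```python
from statistics import mean


def contrastive_mask(acc_with, acc_without, eps=0.05):
    """Return +1, -1, or 0 as the sign of the construction reward for one group.

    acc_with / acc_without: 0/1 accuracy rewards of sampled responses that
    did / did not use an auxiliary construction (hypothetical inputs).
    eps: margin below which the accuracy gap is treated as noise (assumed value).
    """
    if not acc_with or not acc_without:
        return 0  # assumption: without both subgroups there is nothing to contrast
    gap = mean(acc_with) - mean(acc_without)
    if gap > eps:
        return 1   # constructions helped in this group
    if gap < -eps:
        return -1  # constructions hurt in this group (symmetric margin is assumed)
    return 0       # gap within the margin: no construction reward


def masked_aux_rewards(used_construction, accuracy, eps=0.05):
    """Assign the group's sign to every response that used a construction."""
    acc_with = [a for u, a in zip(used_construction, accuracy) if u]
    acc_without = [a for u, a in zip(used_construction, accuracy) if not u]
    sign = contrastive_mask(acc_with, acc_without, eps)
    return [sign if u else 0 for u in used_construction]
```

Because the sign is computed once per group of responses to the same problem, a construction is rewarded only where it coincides with better outcomes for that problem. That is also why the referee's concern about downstream reasoning variance applies: the sign depends only on final-answer accuracy.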
If this is right
- Smaller language models can be trained to couple auxiliary construction decisions with geometric reasoning without relying on much larger models.
- Harmful or indiscriminate constructions are suppressed, reducing the performance penalty they cause under standard RL.
- Longer reasoning chains receive explicit encouragement, supporting more detailed step-by-step solutions.
- The approach yields measurable accuracy improvements on established geometry benchmarks while keeping model size fixed.
Where Pith is reading between the lines
- The same masking logic might transfer to other construction-heavy domains such as algebraic proofs or physics diagram problems where intermediate steps can be verified by outcome.
- Combining the method with visual encoders could let models decide when to request or generate diagram annotations during reasoning.
- If the reward signal proves reliable, training data requirements could drop by removing the need for human-labeled construction examples.
Load-bearing premise
The final-answer reward signal alone can correctly label constructions as useful or harmful without creating systematic bias in the policy updates.
What would settle it
If retraining with the same setup showed no accuracy gain, or a drop, for GeometryZero relative to the GRPO baseline on both Geometry3K and MathVista, that would show the masking does not deliver the claimed selective benefit.
Recent progress in large language models (LLMs) has boosted mathematical reasoning, yet geometry remains challenging where auxiliary construction is often essential. Prior methods either underperform or depend on very large models (e.g., GPT-4o), making them costly. We argue that reinforcement learning with verifiable rewards (e.g., GRPO) can train smaller models to couple auxiliary construction with solid geometric reasoning. However, naively applying GRPO yields unconditional rewards, encouraging indiscriminate and sometimes harmful constructions. We propose Group Contrastive Policy Optimization (GCPO), an RL framework with two components: (1) Group Contrastive Masking, which assigns positive/negative construction rewards based on contextual utility, and (2) a Length Reward that encourages longer reasoning chains. On top of GCPO, we build GeometryZero, an affordable family of geometry reasoning models that selectively use auxiliary construction. Experiments on Geometry3K and MathVista show GeometryZero consistently outperforms RL baselines (e.g., GRPO, ToRL). The code is available at https://github.com/ekonwang/GeometryZero.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GeometryZero, a family of smaller LLMs for geometry reasoning trained via Group Contrastive Policy Optimization (GCPO). GCPO augments standard RL (e.g., GRPO) with Group Contrastive Masking, which labels auxiliary constructions as positive or negative according to whether grouped trajectories produce a correct final answer, plus a Length Reward that favors longer reasoning chains. Experiments on Geometry3K and MathVista report consistent gains over RL baselines such as GRPO and ToRL.
Significance. If the empirical claims hold under rigorous controls, the work offers a practical route to improve auxiliary-construction handling in geometry solvers without scaling to GPT-4o-sized models. The open release of code is a clear reproducibility asset.
major comments (2)
- [§3.2] §3.2 (Group Contrastive Masking definition): The masking procedure assigns positive/negative labels to constructions solely by whether the completed trajectory yields a correct final answer. This implicitly assumes that intra-group performance differences are attributable to the construction step rather than downstream reasoning variance or base-policy stochasticity. No ablation isolating construction utility from these confounds is presented, which directly undermines the claim that GCPO selectively reinforces geometrically useful constructions.
- [Experiments] Experiments (results tables and §4): Outperformance is stated over GRPO and ToRL, yet the manuscript provides no statistical significance tests, exact baseline hyperparameter search protocols, or controls for random seeds. Without these, the central empirical claim that GCPO yields reliable gains remains only partially supported.
minor comments (2)
- [Abstract] Abstract: The phrase 'consistently outperforms' would benefit from a parenthetical note on the magnitude of gains and the model sizes employed.
- [Method] Notation: The contrastive group size and masking threshold appear as free parameters; their sensitivity should be reported in an appendix or ablation table.
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our paper. We provide detailed responses to each major comment below and indicate the revisions we will make to address the concerns raised.
- Referee: [§3.2] §3.2 (Group Contrastive Masking definition): The masking procedure assigns positive/negative labels to constructions solely by whether the completed trajectory yields a correct final answer. This implicitly assumes that intra-group performance differences are attributable to the construction step rather than downstream reasoning variance or base-policy stochasticity. No ablation isolating construction utility from these confounds is presented, which directly undermines the claim that GCPO selectively reinforces geometrically useful constructions.
  Authors: The referee correctly identifies a potential limitation in our current presentation. The Group Contrastive Masking is designed to leverage the final verifiable reward to differentiate construction quality within groups of trajectories sharing the same problem context. This approach is motivated by the fact that the reward signal is only available at the end. Nevertheless, to strengthen the claim, we will revise §3.2 to include a more detailed discussion of this assumption and add an ablation study in the experiments section that compares GCPO to a non-contrastive variant with the same grouping mechanism. This will help demonstrate that the performance gains are attributable to the selective masking of constructions.
  revision: yes
- Referee: [Experiments] Experiments (results tables and §4): Outperformance is stated over GRPO and ToRL, yet the manuscript provides no statistical significance tests, exact baseline hyperparameter search protocols, or controls for random seeds. Without these, the central empirical claim that GCPO yields reliable gains remains only partially supported.
  Authors: We agree with the referee that additional experimental details and statistical analysis are necessary to fully support our empirical claims. In the revised manuscript, we will update the experiments section to include: (1) results from multiple random seeds with mean and standard deviation, (2) details on the hyperparameter tuning process for baselines including GRPO and ToRL, and (3) statistical significance tests such as t-tests to confirm the improvements are significant. These changes will be reflected in the updated tables and text.
  revision: yes
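The multi-seed reporting and t-test promised in the rebuttal could look roughly like the following standard-library sketch. It is an assumption about one reasonable protocol, not the authors' analysis code: `summarize` and `welch_t` are hypothetical helper names, and Welch's unequal-variance t-test is swapped in here as one common choice, since the rebuttal says only "t-tests".

```python
from math import sqrt
from statistics import mean, stdev


def summarize(scores):
    """Mean and sample standard deviation of per-seed accuracies."""
    return mean(scores), stdev(scores)


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom.

    a, b: per-seed accuracies for two methods, at least two seeds each.
    Raises ZeroDivisionError if both samples have zero variance.
    """
    va = stdev(a) ** 2 / len(a)  # squared standard error of mean(a)
    vb = stdev(b) ** 2 / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

Calling `welch_t` with the per-seed accuracy lists of GCPO and a baseline gives a statistic that can be compared against a t distribution with `df` degrees of freedom. With only a handful of seeds, the resulting test has little power, so reporting the raw per-seed numbers alongside it is advisable.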
Circularity Check
No significant circularity in the derivation of GCPO or GeometryZero.
The paper defines Group Contrastive Masking by grouping trajectories according to whether they produce a correct final answer under the terminal reward and then contrasting constructions within those groups; this definition is independent of the reported accuracy numbers on Geometry3K or MathVista. The Length Reward is likewise an additive term whose functional form does not presuppose the performance gains. No equation reduces the claimed superiority over GRPO or ToRL to a fitted parameter or to a self-citation whose validity depends on the present result. The central modeling choice is an empirical hypothesis about the utility of contrastive signals, not a definitional tautology. Self-citations to prior RL work, if present, supply external algorithmic scaffolding rather than load-bearing justification for the new masking rule.
Axiom & Free-Parameter Ledger
free parameters (1)
- contrastive group size and masking threshold
axioms (1)
- domain assumption: Final-answer correctness provides a reliable and automatically verifiable reward signal for geometry problems
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Group Contrastive Masking function (Eq. 4) that masks $R_{aux}$ positive when $E(R_{acc}(O_{w})) > E(R_{acc}(O_{wo})) + \epsilon$, negative otherwise, zero when gap $\le \epsilon$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
- Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction
  Draw2Think recasts geometric reasoning as agentic interaction with a constraint engine, achieving 95.9% predicate-level construction fidelity and up to 16.4% accuracy gains on solid geometry tasks.
- Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning
  APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
- Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
  FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.