
arxiv: 2506.07160 · v3 · submitted 2025-06-08 · 💻 cs.CL

GeometryZero: Advancing Geometry Solving via Group Contrastive Policy Optimization

Pith reviewed 2026-05-19 11:16 UTC · model grok-4.3

classification 💻 cs.CL
keywords geometry reasoning · reinforcement learning · auxiliary construction · policy optimization · large language models · mathematical problem solving · contrastive masking

The pith

Group Contrastive Policy Optimization lets smaller models learn when auxiliary constructions help geometry problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Geometry problems frequently require adding extra points or lines that are not given in the diagram, yet standard reinforcement learning with answer-only rewards tends to reinforce every construction equally, including those that confuse the model. The paper introduces Group Contrastive Policy Optimization, which uses Group Contrastive Masking to assign positive rewards only to constructions that prove useful in context and adds a length reward to favor longer reasoning chains. Experiments show the resulting GeometryZero models outperform naive GRPO and other RL baselines on Geometry3K and MathVista. A sympathetic reader would care because the method promises to make reliable geometry reasoning available in smaller, cheaper models rather than requiring GPT-4o-scale systems. If the central mechanism holds, training pipelines could shift from indiscriminate exploration to selective construction use across similar reasoning domains.

Core claim

The paper claims that naively applying GRPO produces unconditional rewards that encourage harmful auxiliary constructions. Group Contrastive Masking instead assigns a positive or negative reward to each construction according to its contextual utility for reaching the correct final answer. Paired with a length reward for longer chains, this lets smaller models learn selective, helpful construction use, as demonstrated by consistent gains over RL baselines on Geometry3K and MathVista.
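
To make the "unconditional reward" problem concrete, here is a minimal sketch of group-relative advantage normalization in the style of GRPO, combined with a naive construction bonus. The function names, the use of the population standard deviation, and the value of `aux_bonus` are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each response's reward against the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


def naive_reward(correct, used_construction, aux_bonus=0.5):
    """Accuracy reward plus a bonus paid whenever a construction appears.

    aux_bonus is a hypothetical value; the bonus ignores whether the
    construction actually helped, which is the 'unconditional' failure mode.
    """
    return float(correct) + (aux_bonus if used_construction else 0.0)


# Two wrong answers to the same problem; only the first used a construction.
rewards = [naive_reward(False, True), naive_reward(False, False)]
advantages = group_relative_advantages(rewards)
```

In this toy group both answers are wrong, yet the response that added a construction receives the positive advantage, so the policy is pushed toward constructing regardless of usefulness.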

What carries the argument

Group Contrastive Masking, which labels each construction as positive or negative according to whether it improves the chance of a correct final answer within its reasoning group.
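
One plausible reading of this labeling rule can be sketched as follows. The summary above only says that labels depend on whether a construction improves the chance of a correct answer within its group; comparing mean accuracy between responses with and without a construction, and the `margin` parameter, are assumptions made here for illustration.

```python
from statistics import mean


def contrastive_mask(group, margin=0.0):
    """Return +1, -1, or 0 as the sign of the construction reward for one problem.

    group: list of (used_construction, correct) boolean pairs, one per
    sampled response to the same problem. margin is a hypothetical threshold.
    """
    with_aux = [float(correct) for used, correct in group if used]
    without_aux = [float(correct) for used, correct in group if not used]
    if not with_aux or not without_aux:
        return 0  # nothing to contrast against, so give no construction signal
    gap = mean(with_aux) - mean(without_aux)
    if gap > margin:
        return 1   # constructions look helpful on this problem
    if gap < -margin:
        return -1  # constructions look harmful on this problem
    return 0


helpful = contrastive_mask([(True, True), (True, True), (False, False), (False, True)])
harmful = contrastive_mask([(True, False), (False, True)])
```

Returning 0 when one side of the contrast is empty is a design choice of this sketch: with no comparison available, the construction is neither rewarded nor penalized.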

If this is right

  • Smaller language models can be trained to couple auxiliary construction decisions with geometric reasoning without relying on much larger models.
  • Harmful or indiscriminate constructions are suppressed, reducing the performance penalty they cause under standard RL.
  • Longer reasoning chains receive explicit encouragement, supporting more detailed step-by-step solutions.
  • The approach yields measurable accuracy improvements on established geometry benchmarks while keeping model size fixed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same masking logic might transfer to other construction-heavy domains such as algebraic proofs or physics diagram problems where intermediate steps can be verified by outcome.
  • Combining the method with visual encoders could let models decide when to request or generate diagram annotations during reasoning.
  • If the reward signal proves reliable, training data requirements could drop by removing the need for human-labeled construction examples.

Load-bearing premise

The final-answer reward signal alone can correctly label constructions as useful or harmful without creating systematic bias in the policy updates.

What would settle it

The claim would be undermined if retraining with the same setup showed no accuracy gain, or a drop, for GeometryZero relative to the GRPO baseline on both Geometry3K and MathVista. That outcome would indicate the masking does not deliver the claimed selective benefit.

Original abstract

Recent progress in large language models (LLMs) has boosted mathematical reasoning, yet geometry remains challenging where auxiliary construction is often essential. Prior methods either underperform or depend on very large models (e.g., GPT-4o), making them costly. We argue that reinforcement learning with verifiable rewards (e.g., GRPO) can train smaller models to couple auxiliary construction with solid geometric reasoning. However, naively applying GRPO yields unconditional rewards, encouraging indiscriminate and sometimes harmful constructions. We propose Group Contrastive Policy Optimization (GCPO), an RL framework with two components: (1) Group Contrastive Masking, which assigns positive/negative construction rewards based on contextual utility, and (2) a Length Reward that encourages longer reasoning chains. On top of GCPO, we build GeometryZero, an affordable family of geometry reasoning models that selectively use auxiliary construction. Experiments on Geometry3K and MathVista show GeometryZero consistently outperforms RL baselines (e.g., GRPO, ToRL). The code has been available at https://github.com/ekonwang/GeometryZero.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GeometryZero, a family of smaller LLMs for geometry reasoning trained via Group Contrastive Policy Optimization (GCPO). GCPO augments standard RL (e.g., GRPO) with Group Contrastive Masking, which labels auxiliary constructions as positive or negative according to whether grouped trajectories produce a correct final answer, plus a Length Reward that favors longer reasoning chains. Experiments on Geometry3K and MathVista report consistent gains over RL baselines such as GRPO and ToRL.
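
As a rough illustration of how the two components in this summary could combine into one scalar reward, consider the sketch below. The linear, capped length reward, the additive combination, and every constant (`target_tokens`, `max_bonus`, `aux_bonus`) are assumptions; the source does not give the functional forms.

```python
def length_reward(num_tokens, target_tokens=1024, max_bonus=0.2):
    """Hypothetical length term: grows linearly with length, capped at max_bonus."""
    if num_tokens <= 0:
        return 0.0
    return max_bonus * min(num_tokens / target_tokens, 1.0)


def total_reward(correct, used_construction, mask, num_tokens, aux_bonus=0.5):
    """Accuracy + signed construction term + length term (assumed additive).

    mask is the +1 / -1 / 0 label assigned to constructions for this problem.
    """
    accuracy = 1.0 if correct else 0.0
    construction = mask * aux_bonus if used_construction else 0.0
    return accuracy + construction + length_reward(num_tokens)


good = total_reward(correct=True, used_construction=True, mask=1, num_tokens=1024)
bad = total_reward(correct=False, used_construction=True, mask=-1, num_tokens=0)
```

Capping the length term is one way to keep it from dominating the accuracy signal; whether the paper does this is not stated in the material reviewed here.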

Significance. If the empirical claims hold under rigorous controls, the work offers a practical route to improve auxiliary-construction handling in geometry solvers without scaling to GPT-4o-sized models. The open release of code is a clear reproducibility asset.

major comments (2)
  1. [§3.2] §3.2 (Group Contrastive Masking definition): The masking procedure assigns positive/negative labels to constructions solely by whether the completed trajectory yields a correct final answer. This implicitly assumes that intra-group performance differences are attributable to the construction step rather than downstream reasoning variance or base-policy stochasticity. No ablation isolating construction utility from these confounds is presented, which directly undermines the claim that GCPO selectively reinforces geometrically useful constructions.
  2. [Experiments] Experiments (results tables and §4): Outperformance is stated over GRPO and ToRL, yet the manuscript provides no statistical significance tests, exact baseline hyperparameter search protocols, or controls for random seeds. Without these, the central empirical claim that GCPO yields reliable gains remains only partially supported.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'consistently outperforms' would benefit from a parenthetical note on the magnitude of gains and the model sizes employed.
  2. [Method] Notation: The contrastive group size and masking threshold appear as free parameters; their sensitivity should be reported in an appendix or ablation table.
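
The confound raised in major comment 1 can be quantified with a small exact calculation. The sketch assumes that labels come from the gap in mean accuracy between responses with and without a construction, and that trajectories are independent with the same true success rate `p` on both sides, so the construction has no real effect. Both assumptions are illustrative, and all names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def spurious_label_probability(n_with, n_without, p, margin=0.0):
    """P(|acc_with - acc_without| > margin) when the construction has no effect."""
    total = 0.0
    for a in range(n_with + 1):
        for b in range(n_without + 1):
            gap = a / n_with - b / n_without
            if abs(gap) > margin:
                total += binom_pmf(a, n_with, p) * binom_pmf(b, n_without, p)
    return total


prob = spurious_label_probability(n_with=4, n_without=4, p=0.5)
```

With four responses on each side and a 50% success rate, a nonzero label arises from sampling noise alone about 73% of the time (186/256), which is why the sensitivity of group size and threshold matters.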

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our paper. We provide detailed responses to each major comment below and indicate the revisions we will make to address the concerns raised.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Group Contrastive Masking definition): The masking procedure assigns positive/negative labels to constructions solely by whether the completed trajectory yields a correct final answer. This implicitly assumes that intra-group performance differences are attributable to the construction step rather than downstream reasoning variance or base-policy stochasticity. No ablation isolating construction utility from these confounds is presented, which directly undermines the claim that GCPO selectively reinforces geometrically useful constructions.

    Authors: The referee correctly identifies a potential limitation in our current presentation. The Group Contrastive Masking is designed to leverage the final verifiable reward to differentiate construction quality within groups of trajectories sharing the same problem context. This approach is motivated by the fact that the reward signal is only available at the end. Nevertheless, to strengthen the claim, we will revise §3.2 to include a more detailed discussion of this assumption and add an ablation study in the experiments section that compares GCPO to a non-contrastive variant with the same grouping mechanism. This will help demonstrate that the performance gains are attributable to the selective masking of constructions. revision: yes

  2. Referee: [Experiments] Experiments (results tables and §4): Outperformance is stated over GRPO and ToRL, yet the manuscript provides no statistical significance tests, exact baseline hyperparameter search protocols, or controls for random seeds. Without these, the central empirical claim that GCPO yields reliable gains remains only partially supported.

    Authors: We agree with the referee that additional experimental details and statistical analysis are necessary to fully support our empirical claims. In the revised manuscript, we will update the experiments section to include: (1) results from multiple random seeds with mean and standard deviation, (2) details on the hyperparameter tuning process for baselines including GRPO and ToRL, and (3) statistical significance tests such as t-tests to confirm the improvements are significant. These changes will be reflected in the updated tables and text. revision: yes
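
The kind of multi-seed comparison promised in this response could be computed along the following lines. This is a standard-library sketch of Welch's unequal-variance t statistic and its approximate degrees of freedom; the inputs are toy numbers chosen for easy arithmetic, not results from the paper, and a p-value would still need a t-distribution table or a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(sample_a), variance(sample_b)  # sample variances
    na, nb = len(sample_a), len(sample_b)
    se2 = va / na + vb / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2)
    df = se2**2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Toy per-seed scores, purely illustrative.
t_stat, dof = welch_t([1.0, 2.0, 3.0], [0.0, 1.0, 2.0])
```

Welch's variant is used here because it does not assume the two methods have equal variance across seeds.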

Circularity Check

0 steps flagged

No significant circularity in the derivation of GCPO or GeometryZero.

Full rationale

The paper defines Group Contrastive Masking by grouping trajectories according to whether they produce a correct final answer under the terminal reward and then contrasting constructions within those groups; this definition is independent of the reported accuracy numbers on Geometry3K or MathVista. The Length Reward is likewise an additive term whose functional form does not presuppose the performance gains. No equation reduces the claimed superiority over GRPO or ToRL to a fitted parameter or to a self-citation whose validity depends on the present result. The central modeling choice is an empirical hypothesis about the utility of contrastive signals, not a definitional tautology. Self-citations to prior RL work, if present, supply external algorithmic scaffolding rather than load-bearing justification for the new masking rule.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard RL assumptions plus a small number of reward-design choices that must be tuned; no new physical entities are postulated.

free parameters (1)
  • contrastive group size and masking threshold
    Hyperparameters that control how positive and negative construction examples are selected within each group.
axioms (1)
  • domain assumption: Final-answer correctness provides a reliable and automatically verifiable reward signal for geometry problems
    The entire reward structure depends on being able to judge solution correctness without human intervention.
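
What "automatically verifiable" might mean in practice can be sketched with a simple answer checker. The numeric tolerance, the fallback to case-insensitive string comparison for multiple-choice labels, and the function name are assumptions for illustration; the paper's actual verifier is not described in the material reviewed here.

```python
import math


def answers_match(predicted, reference, rel_tol=1e-2):
    """Compare two answer strings numerically if possible, else as text.

    rel_tol is a hypothetical tolerance for rounded numeric answers.
    """
    try:
        p, r = float(predicted), float(reference)
    except ValueError:
        # Non-numeric answers (e.g. choice letters): normalized text match.
        return predicted.strip().lower() == reference.strip().lower()
    return math.isclose(p, r, rel_tol=rel_tol, abs_tol=1e-9)


numeric_ok = answers_match("3.14", "3.1416")
choice_ok = answers_match("B", " b ")
wrong = answers_match("7", "8")
```

Any such checker accepts some wrong answers and rejects some right ones at the edges, which is one way the axiom above could fail.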

pith-pipeline@v0.9.0 · 5738 in / 1154 out tokens · 55914 ms · 2026-05-19T11:16:12.683121+00:00 · methodology



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    Draw2Think recasts geometric reasoning as agentic interaction with a constraint engine, achieving 95.9% predicate-level construction fidelity and up to 16.4% accuracy gains on solid geometry tasks.

  2. Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

    cs.CL · 2026-04 · unverdicted · novelty 5.0

    APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.

  3. Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

    cs.CL · 2026-04 · unverdicted · novelty 5.0

    FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.