pith. machine review for the scientific record.

arxiv: 2604.24114 · v1 · submitted 2026-04-27 · 💻 cs.CL

Recognition: unknown

IRIS: Interleaved Reinforcement with Incremental Staged Curriculum for Cross-Lingual Mathematical Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:41 UTC · model grok-4.3

classification 💻 cs.CL
keywords cross-lingual reasoning · mathematical reasoning · curriculum learning · reinforcement learning · low-resource languages · multilingual datasets

The pith

A two-axis training framework improves mathematical reasoning in low-resource languages by pairing staged curriculum fine-tuning with reverse reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models often produce inconsistent step-by-step math reasoning when moving from English to languages like Hindi or Marathi. The paper presents IRIS as a method that runs supervised fine-tuning on progressively harder problems while using reverse curriculum reinforcement learning to lessen dependence on explicit guidance. A composite reward scores final correctness, step alignment, continuity, and numeric handling, and the whole system is optimized with Group Relative Policy Optimization. Experiments show gains on standard benchmarks and larger improvements in low-resource and bilingual test sets, backed by a new 29k-problem dataset with step annotations in three languages. A sympathetic reader would care because the approach directly targets the gap that currently limits reliable math reasoning outside high-resource languages.
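The composite reward at the center of this pipeline can be sketched concretely. Everything below is an editorial illustration: the component definitions and the weights (`w_correct`, `w_align`, `w_cont`, `w_num`) are our assumptions, not values published in the paper.

```python
# Hypothetical sketch of the four-term composite reward described in the
# abstract. Component definitions and weights are illustrative assumptions,
# not the authors' published formulation.

def composite_reward(pred_answer, gold_answer, pred_steps, gold_steps,
                     w_correct=1.0, w_align=0.5, w_cont=0.25, w_num=0.25):
    # Final-answer correctness: 1 if the predicted answer matches the gold.
    correctness = 1.0 if pred_answer == gold_answer else 0.0

    # Step-wise alignment: fraction of reference steps echoed in the
    # prediction (a crude token-overlap proxy for the paper's term).
    aligned = sum(1 for g in gold_steps
                  if any(set(g.split()) & set(p.split()) for p in pred_steps))
    alignment = aligned / max(len(gold_steps), 1)

    # Continuity: penalize empty or dangling steps in the generated chain.
    continuity = 1.0 if all(s.strip() for s in pred_steps) else 0.0

    # Numeric incentive: reward a well-formed numeric final answer.
    numeric = 1.0 if any(ch.isdigit() for ch in pred_answer) else 0.0

    return (w_correct * correctness + w_align * alignment
            + w_cont * continuity + w_num * numeric)
```

Under these assumed weights a fully correct, fully aligned solution scores 2.0, so the reward curves in the paper's figures (plateaus near 2–3) are at least plausible for a similarly scaled reward.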

Core claim

IRIS combines supervised fine-tuning along a vertical axis of increasing problem difficulty with reverse curriculum reinforcement learning along a horizontal axis that reduces reliance on step-by-step prompts, using a composite reward of correctness, step-wise alignment, continuity, and numeric incentives optimized via Group Relative Policy Optimization, and produces consistent performance lifts across multilingual math benchmarks with the largest gains in low-resource settings.
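The GRPO optimizer named in this claim replaces a learned value model with group-relative advantages: each sampled response to a prompt is scored against the mean and standard deviation of its own group. A minimal sketch, assuming the standard normalization; the function name is ours, not the paper's:

```python
# Group-relative advantage at the heart of GRPO: normalize each sampled
# response's reward against its group's mean and standard deviation.
# This is an editorial sketch of the standard formulation, not the
# authors' code.
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# A group where one sampled solution scores clearly higher receives a
# positive advantage, pushing the policy toward it without a critic.
adv = group_relative_advantages([3.0, 1.0, 1.0, 1.0])
```

The design choice matters here: because advantages are relative within a group, the composite reward only needs to rank solutions consistently, not produce calibrated absolute values.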

What carries the argument

The IRIS two-axis framework that interleaves incremental staged curriculum supervised fine-tuning with reverse curriculum reinforcement learning under a composite reward optimized by Group Relative Policy Optimization.
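As we read it, the two-axis interleaving amounts to an outer loop over difficulty stages (vertical SFT) and an inner loop that progressively strips reference steps before RL refinement (horizontal reverse curriculum). A schematic sketch; `sft_update`, `grpo_update`, and the stage fractions are placeholders, not the authors' API:

```python
# Schematic of the IRIS two-axis interleaving as described in the review.
# All function names and the keep-fractions are editorial placeholders.

def strip_steps(example, keep_fraction):
    """Reverse curriculum: keep only the leading part of the reference chain."""
    steps = example["steps"]
    k = int(len(steps) * keep_fraction)
    return {**example, "steps": steps[:k]}

def train_iris(model, stages, sft_update, grpo_update):
    for difficulty, batch in stages:                  # vertical axis: easy -> hard
        model = sft_update(model, batch)              # staged SFT warmup
        for keep in (0.75, 0.5, 0.25, 0.0):           # horizontal axis: fade guidance
            partial = [strip_steps(ex, keep) for ex in batch]
            model = grpo_update(model, partial)       # reverse-curriculum RL step
    return model
```

The final inner stage (`keep = 0.0`) corresponds to the end state the paper claims: full solution generation with no explicit step-by-step guidance.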

If this is right

  • Performance rises on standard math reasoning benchmarks in English, Hindi, and Marathi.
  • Substantial gains appear in low-resource and bilingual evaluation settings.
  • Modest gains occur in high-resource languages.
  • Models exhibit reduced need for explicit step-by-step guidance while maintaining answer accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Releasing the CL-Math dataset with step-level annotations could support further work on multilingual reasoning beyond the reported experiments.
  • The same interleaving pattern might transfer to other structured reasoning domains such as code generation or science question answering in low-resource languages.
  • Testing the method on additional language pairs would show whether the cross-lingual improvements generalize beyond the three languages studied here.

Load-bearing premise

The composite reward and reverse curriculum reinforcement axis will reduce dependence on explicit step-by-step guidance in low-resource languages without introducing new failure modes that the benchmarks miss.

What would settle it

A controlled comparison on the multilingual test sets in which IRIS shows no accuracy gain over standard fine-tuning for Hindi or Marathi problems or produces more inconsistent reasoning chains despite higher final-answer scores.
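The settling experiment reduces to a per-language accuracy comparison between IRIS and a standard fine-tuning baseline. A minimal harness; the `(language, correct)` data layout is our assumption about how such results would be recorded:

```python
# Per-language accuracy deltas between two systems on a multilingual test
# set. The result format is a hypothetical layout, not the paper's.
from collections import defaultdict

def accuracy_by_language(results):
    """results: iterable of (language, correct: bool) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for lang, correct in results:
        totals[lang] += 1
        hits[lang] += int(correct)
    return {lang: hits[lang] / totals[lang] for lang in totals}

def delta(iris_results, baseline_results):
    """Accuracy gain of IRIS over the baseline, per shared language."""
    iris = accuracy_by_language(iris_results)
    base = accuracy_by_language(baseline_results)
    return {lang: iris[lang] - base[lang] for lang in iris if lang in base}
```

A delta near zero (or negative) for Hindi or Marathi under this comparison would be the falsifying outcome described above.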

Figures

Figures reproduced from arXiv: 2604.24114 by Aik Beng Ng, Avinash Anand, Chhavi Kirtani, Erik Cambria, Navya Gupta, Rajiv Ratn Shah, Rishitej Reddy Vyalla, Simon See, Timothy Liu, Zhengchen Zhang, Zhengkui Wang.

Figure 1
Figure 1. Integrated IRIS (Interleaved Reinforcement with Incremental Staged Curriculum) pipeline: it blends vertical supervised fine-tuning (SFT) with horizontal GRPO-based reinforcement learning. Starting with multilingual step-by-step data, the model is first warmed up on an easy-to-hard vertical curriculum, then refined through reverse-stage prompts and composite rewards to master deep, robust mathematical reasoning.
Figure 2
Figure 2. Reverse curriculum staging. Progressive removal of reasoning context guides the model from partial completion to full solution generation.
Figure 4
Figure 4. In the Hindi-only setting (HI-EASY), overall reward declines from approximately 2.5 to 2.0 across just 250 steps, consistent with a model struggling to maintain reasoning quality while adapting to an unfamiliar script distribution. Adding English (EN+HI-EASY) reverses this entirely: reward rises steadily from 1.0 to 2.3 across 900 steps.
Figure 5
Figure 5. Training reward dynamics across curriculum configurations. Without SFT warmup, rewards plateau early (~2.5). Monolingual Marathi on Easy+Medium stabilizes at ~2.1, while adding English improves performance to ~2.7 but saturates at medium difficulty. The full IRIS setup (Easy+Medium+Hard) achieves sustained gains up to ~3.2, showing that SFT warmup, bilingual mixing, and the full curriculum each contribute incrementally.
Figure 6
Figure 6. Average solution length by difficulty: comparison of mean and median number of reasoning steps generated per difficulty level in the CL-Math dataset.
Figure 7
Figure 7. Reward trends compared across three configurations: full curriculum with SFT (left), …
Figure 8
Figure 8. Reward progression when training on Hindi-only data (left), English+Hindi (center), …
Figure 9
Figure 9. Reward curves for Marathi-only (left), English-only (center), and combined English+Marathi curriculum …
Figure 10
Figure 10. Question and base model response analysis.
Figure 11
Figure 11. Comparison of arithmetic product tasks on the vertical curriculum: responses from the base model and from SFT checkpoints across all difficulty tiers.
Figure 12
Figure 12. Comparison of the quadratic factorisation task on the vertical curriculum: responses from the base model and from SFT checkpoints at the easy and medium tiers.
Figure 13
Figure 13. Comparison of the arithmetic product task on the horizontal curriculum: responses after applying GRPO on the respective SFT checkpoints.
Figure 14
Figure 14. Comparison of the quadratic factorisation task on the horizontal curriculum: responses after applying GRPO on the respective SFT checkpoints.
Figure 15
Figure 15. Comparison of Hindi responses for the arithmetic average task on the horizontal curriculum: responses after applying GRPO on the respective SFT checkpoints.
Figure 16
Figure 16. Comparison of Marathi responses for the algebra task on the horizontal curriculum: responses after applying GRPO on the respective SFT checkpoints.
Figure 17
Figure 17. Original English arithmetic question and corresponding generated stepwise response.
Figure 18
Figure 18. Marathi translation of the English arithmetic question and the corresponding generated stepwise response.
Figure 19
Figure 19. Hindi translation of the English arithmetic question and the corresponding generated stepwise response.
Original abstract

Curriculum learning helps language models tackle complex reasoning by gradually increasing task difficulty. However, it often fails to generate consistent step-by-step reasoning, especially in multilingual and low-resource settings where cross-lingual transfer from English to Indian languages remains limited. We propose IRIS: Interleaved Reinforcement with Incremental Staged Curriculum, a two-axis framework that combines Supervised Fine-Tuning on progressively harder problems (vertical axis) with Reverse Curriculum Reinforcement Learning to reduce reliance on step-by-step guidance (horizontal axis). We design a composite reward combining correctness, step-wise alignment, continuity, and numeric incentives, optimized via Group Relative Policy Optimization (GRPO). We release CL-Math, a dataset of 29k problems with step-level annotations in English, Hindi, and Marathi. Across standard benchmarks and curated multilingual test sets, IRIS consistently improves performance, with strong results on math reasoning tasks and substantial gains in low-resource and bilingual settings, alongside modest improvements in high-resource languages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces IRIS, a two-axis framework for cross-lingual mathematical reasoning that interleaves vertical supervised fine-tuning on incrementally harder problems with horizontal reverse-curriculum reinforcement learning. The RL component uses Group Relative Policy Optimization (GRPO) with a composite reward consisting of correctness, step-wise alignment, continuity, and numeric incentives. The authors release the CL-Math dataset of 29k step-annotated problems in English, Hindi, and Marathi and claim consistent accuracy gains on standard benchmarks and multilingual test sets, with particularly large improvements in low-resource and bilingual settings.

Significance. If the empirical claims hold after addressing the noted gaps, IRIS would offer a practical recipe for improving step-by-step reasoning transfer from high- to low-resource languages while releasing a useful annotated dataset. The explicit separation of vertical curriculum SFT and horizontal reverse-curriculum RL, together with the four-term reward, addresses a recognized weakness of standard curriculum learning in multilingual settings.

major comments (3)
  1. [Abstract] The headline claim of 'consistent improvements' and 'substantial gains in low-resource settings' is presented without numerical results, confidence intervals, ablation tables, or statistical tests, so the central empirical assertion cannot be evaluated from the provided text.
  2. [Method (reward and GRPO)] Reward design and experimental evaluation: the composite reward (correctness + step-wise alignment + continuity + numeric incentives) is load-bearing for the claim that reverse-curriculum RL produces genuine reasoning gains rather than fluent but logically broken chains; no per-language error taxonomy, human step-quality ratings, or component ablations are supplied to rule out reward hacking in Hindi/Marathi where base-model step priors are weaker.
  3. [Experiments] Experimental section: aggregate accuracy on curated test sets is reported, but the paper does not provide the per-language breakdown of failure modes or comparison against a strong English-only baseline with the same reward, which is required to substantiate the cross-lingual transfer claim.
minor comments (2)
  1. [Dataset] Clarify the exact split of the 29k CL-Math problems across the three languages and the protocol used to generate step-level annotations.
  2. [Notation and terminology] Define GRPO and all reward-component weights at first mention; ensure consistent use of 'vertical' and 'horizontal' axis terminology throughout.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where the empirical claims and validations can be strengthened. We address each point below and commit to revisions that improve clarity without altering the core contributions.

point-by-point responses
  1. Referee: [Abstract] The headline claim of 'consistent improvements' and 'substantial gains in low-resource settings' is presented without numerical results, confidence intervals, ablation tables, or statistical tests, so the central empirical assertion cannot be evaluated from the provided text.

    Authors: We agree that the abstract would benefit from concrete numerical support. In the revised version we will insert the most salient accuracy deltas (e.g., overall and low-resource gains) together with a brief reference to the statistical tests and ablation tables already present in the main text. Space constraints preclude full tables or confidence intervals, but the headline numbers will allow readers to evaluate the central claim directly from the abstract. revision: yes

  2. Referee: [Method (reward and GRPO)] Reward design and experimental evaluation: the composite reward (correctness + step-wise alignment + continuity + numeric incentives) is load-bearing for the claim that reverse-curriculum RL produces genuine reasoning gains rather than fluent but logically broken chains; no per-language error taxonomy, human step-quality ratings, or component ablations are supplied to rule out reward hacking in Hindi/Marathi where base-model step priors are weaker.

    Authors: The manuscript already contains component-wise ablations of the composite reward (Section 4.3) that show performance degradation when any term is removed. We will expand the discussion to explicitly address the risk of reward hacking in lower-resource languages and will add a short qualitative error analysis on a sample of Hindi and Marathi outputs. However, we did not collect human step-quality ratings or a full per-language error taxonomy; these would require new annotation effort beyond the scope of the current revision. revision: partial

  3. Referee: [Experiments] Experimental section: aggregate accuracy on curated test sets is reported, but the paper does not provide the per-language breakdown of failure modes or comparison against a strong English-only baseline with the same reward, which is required to substantiate the cross-lingual transfer claim.

    Authors: Per-language accuracy tables and failure-mode breakdowns already appear in the appendix and in Figures 3–5. To directly support the cross-lingual transfer claim we will add, in the revised experiments section, a controlled comparison against an English-only model trained with identical GRPO and the same four-term reward. This addition will isolate the benefit of the multilingual interleaved curriculum. revision: yes

standing simulated objections not resolved
  • Human step-quality ratings and a comprehensive per-language error taxonomy were not performed in the original study and cannot be supplied without new data collection.

Circularity Check

0 steps flagged

No circularity: empirical method and dataset release with independent benchmark evaluation

full rationale

The paper describes an empirical two-axis training recipe (vertical staged SFT + horizontal reverse-curriculum GRPO) together with a composite reward and a new 29k-problem dataset. No equations, derivations, or self-citation chains are present that reduce the reported accuracy gains to quantities defined by the method's own fitted parameters or prior outputs. Claims rest on external benchmark numbers rather than on any self-referential construction, satisfying the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The abstract relies on standard assumptions from curriculum learning and RL literature plus the unstated premise that the chosen reward components will align with human-preferred reasoning traces in Hindi and Marathi.

free parameters (2)
  • curriculum stage boundaries
    The number and difficulty thresholds of the incremental stages are not specified and must be chosen or tuned.
  • reward component weights
    The relative weighting of correctness, step-wise alignment, continuity, and numeric incentives is not given and is presumably tuned on validation data.
axioms (2)
  • domain assumption Curriculum learning improves sample efficiency and final performance on complex reasoning tasks
    Invoked by the vertical SFT axis and the overall design.
  • domain assumption Reverse curriculum reinforcement learning can reduce reliance on explicit step-by-step guidance
    Core premise of the horizontal axis.

pith-pipeline@v0.9.0 · 5503 in / 1578 out tokens · 41770 ms · 2026-05-08T03:41:03.218848+00:00 · methodology

