Revisiting group rel- ative policy optimization: Insights into on-policy and off- policy training

Revisiting group relative policy optimization: Insights into on-policy, off-policy training · 2024 · arXiv 2505.22257

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

read on arXiv browse 7 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Pramana: Fine-Tuning Large Language Models for Epistemic Reasoning through Navya-Nyaya

cs.AI · 2026-02-14 · conditional · novelty 7.0

Fine-tuning LLMs on Navya-Nyaya's six-phase reasoning structure yields 100% semantic correctness on held-out logical problems despite only 40% strict format adherence.

Beyond the Literal: Decomposing Pragmatic Intent in Multimodal Meme Understanding

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

Intent Projection decomposes literal and pragmatic signals in multimodal memes via orthogonal projection and contrastive objectives in LVLMs, outperforming baselines on six benchmarks especially for high-divergence posts.

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

Smaller models provide temporally correlated policy-level diversity that serves as structured exploration for training larger models in GRPO, yielding accuracy gains such as +8.8% on AIME 24 with reduced compute via the S2L-PO framework.

From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.

Gradient Extrapolation-Based Policy Optimization

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering up to 4x step speedup.

PubSwap: Public-Data Off-Policy Coordination for Federated RLVR

cs.LG · 2026-04-14 · unverdicted · novelty 5.0

PubSwap uses a small public dataset for selective off-policy response swapping in federated RLVR to improve coordination and performance over standard baselines on math and medical reasoning tasks.

POPI: Personalizing LLMs via Optimized Natural Language Preference Inference

cs.CL · 2025-10-17 · unverdicted · novelty 5.0

POPI distills user preferences into reusable natural-language summaries via a shared inference model and conditions a generator on them, trained jointly with RL to improve personalization quality while cutting context length by up to 10x on benchmarks.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Pramana: Fine-Tuning Large Language Models for Epistemic Reasoning through Navya-Nyaya cs.AI · 2026-02-14 · conditional · none · ref 5
Fine-tuning LLMs on Navya-Nyaya's six-phase reasoning structure yields 100% semantic correctness on held-out logical problems despite only 40% strict format adherence.

Revisiting group rel- ative policy optimization: Insights into on-policy and off- policy training

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer