GEAR reshapes GRPO trajectory advantages using divergence signals from a ground-truth-conditioned teacher to create adaptive token- and segment-level credit regions.
arXiv preprint arXiv:2501.12851 , year=
4 Pith papers cite this work. Polarity classification is still indexing.
years
2026 4verdicts
UNVERDICTED 4representative citing papers
Entropy polarity from a first-order entropy change approximation enables Polarity-Aware Policy Optimization (PAPO) that preserves complementary polarity branches and outperforms baselines on math and agentic RL fine-tuning tasks.
R2IF improves LLM function-calling accuracy by up to 34.62% on BFCL using a composite reward system with CER and SMV components optimized via GRPO, while increasing interpretability through positive CoT effectiveness.
LLMs show structural alignment bias by invoking semantically irrelevant tools when query attributes match tool parameters, revealed via SABEval dataset and mitigated by attention rebalancing.
citing papers explorer
-
GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
GEAR reshapes GRPO trajectory advantages using divergence signals from a ground-truth-conditioned teacher to create adaptive token- and segment-level credit regions.
-
Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
Entropy polarity from a first-order entropy change approximation enables Polarity-Aware Policy Optimization (PAPO) that preserves complementary polarity branches and outperforms baselines on math and agentic RL fine-tuning tasks.
-
R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling
R2IF improves LLM function-calling accuracy by up to 34.62% on BFCL using a composite reward system with CER and SMV components optimized via GRPO, while increasing interpretability through positive CoT effectiveness.
-
Do LLMs Know Tool Irrelevance? Demystifying Structural Alignment Bias in Tool Invocations
LLMs show structural alignment bias by invoking semantically irrelevant tools when query attributes match tool parameters, revealed via SABEval dataset and mitigated by attention rebalancing.