Pith Number

pith:BTZPDJO6

pith:2026:BTZPDJO66QJ4W3LPEDJYN5XP7X
not attested · not anchored · not stored · refs resolved

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Hongxu Yin, Jan Kautz, Kwang-Ting Cheng, Mingjie Liu, Min-Hung Chen, Pavlo Molchanov, Peter Belcak, Shih-Yang Liu, Shizhe Diao, Ximing Lu, Xin Dong, Yejin Choi, Yu-Chiang Frank Wang

Decoupling normalization of each reward in multi-reward RL prevents collapse of advantage values into identical signals.

arxiv:2601.05242 v1 · 2026-01-08 · cs.CL · cs.AI · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{BTZPDJO66QJ4W3LPEDJYN5XP7X}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure.

C2 · weakest assumption

That separately normalizing each reward before aggregation will faithfully preserve relative differences across reward combinations without introducing new scaling artifacts or training instabilities.

C3 · one line summary

GDPO decouples per-reward normalization in multi-reward RL to avoid advantage collapse and improve convergence over GRPO on tool-calling, math, and coding tasks.
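
The collapse described in C1 can be illustrated with a toy calculation. The following is a minimal sketch, not the authors' implementation: the function names, the two-rollout groups, the binary rewards, the use of the population standard deviation, and the `EPS` stabiliser are all assumptions made for illustration, and any further steps the paper applies after aggregation are omitted.

```python
from statistics import fmean, pstdev

EPS = 1e-8  # assumed stabiliser so a constant reward column does not divide by zero


def group_normalize(values):
    """Subtract the group mean and divide by the group (population) std."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def summed_then_normalized(rewards):
    """GRPO-style (as described in C1): sum each rollout's rewards, then normalize."""
    return group_normalize([sum(row) for row in rewards])


def normalized_then_summed(rewards):
    """Decoupled variant (as described in C3): normalize each reward, then sum."""
    columns = list(zip(*rewards))  # one tuple per reward
    normalized = [group_normalize(col) for col in columns]
    return [sum(parts) for parts in zip(*normalized)]


# Hypothetical groups: each row is one rollout, each column one binary reward.
GROUP_A = [[0, 0], [1, 0]]  # second rollout satisfies one reward
GROUP_B = [[0, 0], [1, 1]]  # second rollout satisfies both rewards

for name, group in (("A", GROUP_A), ("B", GROUP_B)):
    joint = [round(a, 3) for a in summed_then_normalized(group)]
    split = [round(a, 3) for a in normalized_then_summed(group)]
    print(name, "summed-then-normalized:", joint, "normalized-then-summed:", split)
```

Under these assumptions, normalizing the summed reward gives both groups roughly the same advantages (about -1 and 1), even though the better rollout in group B satisfies twice as many rewards. Normalizing each reward first yields about -1 and 1 for group A but about -2 and 2 for group B, so the two reward combinations stay distinguishable.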

References

46 extracted · 46 resolved · 17 Pith anchors

[1] Learn to reason efficiently with adaptive length-based reward shaping 2025
[2] Kimi k1.5: Scaling Reinforcement Learning with LLMs 2025 · arXiv:2501.12599
[4] Rule based rewards for language model safety. Advances in Neural Information Processing Systems, 37:108877–108901, 2024
[5] GRPO-CARE: Consistency-aware reinforcement learning for multimodal reasoning 2025
[6] DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models 2025 · arXiv:2512.02556

Cited by

26 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:53.386098Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

0cf2f1a5def413cb6d6f20d386f6effdedcc3264e3e7081e8404e0fc3fbf4847

Aliases

arxiv: 2601.05242 · arxiv_version: 2601.05242v1 · doi: 10.48550/arxiv.2601.05242 · pith_short_12: BTZPDJO66QJ4 · pith_short_16: BTZPDJO66QJ4W3LP · pith_short_8: BTZPDJO6
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/BTZPDJO66QJ4W3LPEDJYN5XP7X \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 0cf2f1a5def413cb6d6f20d386f6effdedcc3264e3e7081e8404e0fc3fbf4847
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "d646f2c556ec36788b65f958f77a2db7de5f740eceeddfc230153cfd0e2107c8",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-01-08T18:59:24Z",
    "title_canon_sha256": "b6bf6385df9e528004dff7db06dadda8378dad300ba2b59eae30c311d9848d4d"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2601.05242",
    "kind": "arxiv",
    "version": 1
  }
}
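
The hashing step of the verification command above can also be run offline against a local copy of the record. The following is a minimal sketch that mirrors the `json.dumps` arguments used in that command; the function name `canonical_hash` and the constants `RECORD` and `PUBLISHED_HASH` are illustrative, and it assumes the JSON shown on this page is the complete canonical record. Whether the digest matches has not been checked here, so the script reports the comparison rather than asserting it.

```python
import hashlib
import json


def canonical_hash(record):
    """SHA-256 of the record serialized with sorted keys and compact separators."""
    canonical = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode()  # UTF-8 by default, as in the one-liner above
    return hashlib.sha256(canonical).hexdigest()


# Local copy of the canonical record JSON shown on this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "d646f2c556ec36788b65f958f77a2db7de5f740eceeddfc230153cfd0e2107c8",
        "cross_cats_sorted": ["cs.AI", "cs.LG"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2026-01-08T18:59:24Z",
        "title_canon_sha256": "b6bf6385df9e528004dff7db06dadda8378dad300ba2b59eae30c311d9848d4d",
    },
    "schema_version": "1.0",
    "source": {"id": "2601.05242", "kind": "arxiv", "version": 1},
}

# Canonical hash as published on this page.
PUBLISHED_HASH = "0cf2f1a5def413cb6d6f20d386f6effdedcc3264e3e7081e8404e0fc3fbf4847"

digest = canonical_hash(RECORD)
print("computed: ", digest)
print("published:", PUBLISHED_HASH)
print("match:    ", digest == PUBLISHED_HASH)
```

Because keys are sorted before serialization, the digest does not depend on the order in which the record's fields were written, which is what lets independent mirrors recompute the same value.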