CRPO: Character-centric Group Relative Policy Optimization for Role-aware Reasoning in Role-playing Agents

Benyou Wang; Kehai Chen; Liang Yue; Min Zhang; Yihong Tang

arxiv: 2605.25511 · v1 · pith:HLMWWSUVnew · submitted 2026-05-25 · 💻 cs.CL

Yihong Tang, Kehai Chen, Liang Yue, Benyou Wang, Min Zhang

Pith reviewed 2026-06-29 21:48 UTC · model grok-4.3

classification 💻 cs.CL

keywords: role-playing agents, reinforcement learning, GRPO, character fidelity, style collapse, policy optimization, persona alignment, LLM reasoning

The pith

CRPO realigns RL for role-playing agents by decoupling task logic from stylistic rewards to preserve character fidelity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that applying standard Group Relative Policy Optimization to role-playing agents causes loss of character fidelity because it favors context-specific utility over persona alignment. CRPO counters this by separating the optimization of task logic from stylistic rewards, adapting constraints dynamically to each character's complexity, and treating generic responses as negative baselines. These changes are presented as resolving gradient conflicts and stopping reversion to a common distribution. A reader would care if the result is agents that stay consistent in emotion and style while still gaining from RL-based reasoning improvements. The work positions this as a direct fix for a mismatch between problem-centric RL and persona-driven tasks.

Core claim

CRPO is a framework that realigns RL objectives with the role-playing task. It improves character distinctiveness through three mechanisms: decoupling task logic from stylistic rewards to resolve gradient conflicts, dynamically adapting optimization constraints based on character complexity, and using generic responses as negative baselines to prevent the model from reverting to a common distribution. The paper reports that CRPO outperforms existing methods on consistency, emotion, and other metrics.

What carries the argument

Character-centric Group Relative Policy Optimization (CRPO), which decouples task logic from stylistic rewards, adapts constraints dynamically, and uses generic responses as negative baselines.
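
To make the decoupling idea concrete, here is a minimal sketch, not the paper's implementation. It assumes each sampled response in a group receives two scalar rewards (a task reward and a style reward), standardizes each stream within the group the way GRPO standardizes a single reward, and then mixes the two advantages with hypothetical weights `w_task` and `w_style`. The function names, the weights, and the mixing rule are all assumptions made for illustration; the abstract does not specify them.

```python
from statistics import mean, pstdev

def group_relative(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def decoupled_advantages(task_rewards, style_rewards, w_task=0.5, w_style=0.5):
    """Normalize each reward stream separately, then mix (hypothetical rule)."""
    if len(task_rewards) != len(style_rewards):
        raise ValueError("both streams must score the same responses")
    a_task = group_relative(task_rewards)
    a_style = group_relative(style_rewards)
    return [w_task * t + w_style * s for t, s in zip(a_task, a_style)]

# Toy group of four responses (made-up numbers, not from the paper).
task = [1.0, 1.0, 0.0, 0.0]
style = [0.9, 0.1, 0.8, 0.2]

# Coupled baseline: sum the rewards first, then standardize once.
coupled = group_relative([t + s for t, s in zip(task, style)])
# Decoupled variant: standardize per stream, then mix.
decoupled = decoupled_advantages(task, style)
```

When rewards are summed before normalization, a stream with a larger spread contributes more to the final advantage. Standardizing each stream first puts both on the same scale, so the mixing weights alone control the trade-off.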

If this is right

Role-playing agents maintain higher character distinctiveness and emotional consistency while retaining RL reasoning gains.
Gradient conflicts between utility and style objectives are resolved by the separation of reward signals.
Optimization constraints adjust automatically to character complexity, reducing the need for manual tuning.
Generic responses serve as effective negative baselines that block collapse to a common output distribution.
The approach yields measurable gains over prior methods on consistency and emotion metrics.
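
The negative-baseline idea in the list above can be sketched as follows. This is an illustrative guess at the mechanism, not the paper's algorithm: it assumes one scalar style reward per response and a single generic (persona-free) response scored by the same reward function, and it measures each sampled response against that anchor score rather than against the group mean. The names `anchor_relative_advantages` and `margin` are hypothetical.

```python
from statistics import pstdev

def anchor_relative_advantages(rewards, anchor_reward, margin=0.0, eps=1e-8):
    """Score each response against a generic anchor rather than the group mean.

    A response that does not beat the anchor by more than `margin` gets a
    non-positive advantage, even if it is the best response in its group.
    """
    # Scale by the spread of the group with the anchor included.
    scale = pstdev(list(rewards) + [anchor_reward]) + eps
    return [(r - anchor_reward - margin) / scale for r in rewards]

# Made-up style rewards for a group that has drifted toward generic output.
group = [0.30, 0.32, 0.28, 0.31]
generic_anchor = 0.35
adv = anchor_relative_advantages(group, generic_anchor)
```

With a group-mean baseline, the 0.32 response would receive a positive advantage simply for being the least generic of the four. Measured against the anchor, all four advantages are negative, so the update pushes away from generic output instead of rewarding the least-bad sample.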

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decoupling pattern could extend to other multi-objective RL settings where utility and style or persona compete.
Agents trained this way might support longer interactive sessions without gradual loss of persona traits.
The negative baseline technique may generalize to preventing mode collapse in other persona-conditioned generation tasks.

Load-bearing premise

The decoupling of task logic from stylistic rewards, along with dynamic constraint adaptation, can be implemented scalably across characters without new instabilities or per-character hyperparameter search.

What would settle it

Running the paper's role-playing benchmarks with CRPO and finding no measurable gain in character consistency or emotion metrics relative to standard GRPO would falsify the central improvement claim.
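
One hedged way to operationalize "no measurable gain" is a paired bootstrap over per-item metric differences. The sketch below assumes per-item consistency scores for both systems on the same benchmark items. The score lists are made-up placeholders, not results from the paper, and `paired_bootstrap_ci` is a hypothetical helper.

```python
import random
from statistics import mean

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (a - b)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired per benchmark item")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lo = means[round((alpha / 2) * n_resamples)]
    hi = means[round((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lo, hi

# Placeholder per-item scores (NOT from the paper).
crpo = [0.71, 0.64, 0.80, 0.58, 0.69, 0.75, 0.62, 0.77]
grpo = [0.66, 0.65, 0.72, 0.55, 0.70, 0.69, 0.60, 0.70]
delta, lo, hi = paired_bootstrap_ci(crpo, grpo)
```

If the interval `[lo, hi]` contains zero, the data do not show a measurable difference at the chosen level. Pairing by item matters here because both systems answer the same prompts, so item difficulty cancels out of the differences.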

Figures

Figures reproduced from arXiv: 2605.25511 by Benyou Wang, Kehai Chen, Liang Yue, Min Zhang, Yihong Tang.

**Figure 1.** Problem-centric optimization suffers from ambiguous rewards, rigid constraints, and style collapse, which impede consistent persona alignment. Adjacent text from the paper: "… transformation from behaviorist imitation to cognitivist reasoning. Traditional Supervised Fine-Tuning (SFT), representative of the former, often focuses on imitating the shallow linguistic style of characters. Conversely, recent preference-optimization methods have…"

**Figure 2.** The overall framework of CRPO. The method synergizes three core mechanisms to enhance role-playing: (1) Contrastive Anchor Sampling prevents style collapse by introducing generic negative samples; (2) Dual-Stream Advantage Estimation decouples task and style rewards to resolve optimization conflicts; and (3) Entropy-Aware Adaptive Exploitation dynamically regulates gradient updates based on character-speci…

**Figure 3.** Analysis of training dynamics. Adjacent text from the paper (Section 4.3, Ablation Studies): "We investigate the contribution of each component in…"

**Figure 4.** Human evaluation results. Adjacent text from the paper (Section 4.4, Human Evaluation): "To validate real-world utility, we conducted a blinded, pairwise human evaluation involving 4 experts across 100 interaction sessions. As shown in…"

**Figure 5.** Impact of y_anchor quantity on CharacterBench. Adjacent text from the paper (Dynamics of Character Identification Entropy): "Figure 3c illustrates the evolutionary dynamics of training stability. CRPO exhibits a distinct high-exploration, sharp-convergence phase transition. The initially high H_id validates our cautious exploration strategy driven by instance-level weighting, preventing early mode collapse. The subsequent precipitous dr…"

**Figure 6.** t-SNE visualization of the model's hidden states under different role groups. Adjacent text from the paper (Visualization of Character Representations): "We visualize the latent representations of responses across different role groups using t-SNE (…"

**Figure 7.** The role-playing prompt for training.

**Figure 8.** All reward curves for Qwen3-8B. Panels: BLEU-1, BLEU-2, BLEU-3, BLEU-4, Focus, Focus Attr., Format, ROUGE-1, ROUGE-2, ROUGE-L, … (x-axis ticks run from 0 to 100 on each panel).

**Figure 9.** (no caption in source)

Original abstract

Recent advancements in Reinforcement Learning (RL), particularly Group Relative Policy Optimization (GRPO), have significantly enhanced the reasoning capabilities of Large Language Models. However, applying these problem-centric optimization methods to role-playing agents often leads to a loss of character fidelity and style collapse, as they prioritize context-specific utility over persona alignment. To address this, we propose Character-Centric Group Relative Policy Optimization (CRPO), a framework designed to realign RL objectives with the role-playing task. CRPO improves character distinctiveness through three mechanisms: decoupling task logic from stylistic rewards to resolve gradient conflicts, dynamically adapting optimization constraints based on character complexity, and utilizing generic responses as negative baselines to prevent the model from reverting to a common distribution. Extensive experiments demonstrate that CRPO outperforms existing methods in consistency, emotion and others.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CRPO adds three targeted tweaks to GRPO for persona consistency in role-play agents, but the abstract supplies zero numbers or details to support the outperformance claim.

The main takeaway is that this paper tries to fix a real problem—standard GRPO causing style collapse and lost character fidelity in role-playing agents—by introducing CRPO with three mechanisms: decoupling task logic from stylistic rewards, dynamic constraint adaptation by character complexity, and generic responses as negative baselines.

What is new is the specific combination of those three changes framed as a character-centric shift from problem-centric RL. The abstract does a clear job naming the gradient conflict and distribution reversion issues that arise when applying existing methods to personas.

The soft spot is the total lack of evidence. The abstract states that extensive experiments show gains in consistency and emotion, yet provides no metrics, baselines, statistical tests, ablations, or even pseudocode for the mechanisms. Without those, the central claim cannot be checked. The dynamic adaptation step also raises a practical question about whether it scales without per-character tuning that would limit usability.

This is aimed at people working on role-playing LLMs for chat or entertainment applications. A reader already deep in RLHF for conversational agents might pick up the ideas, but only if the full paper contains reproducible results.

I would not send this to peer review in its current form; the absence of any data makes the contribution impossible to evaluate.

Referee Report

1 major / 0 minor

Summary. The paper introduces Character-Centric Group Relative Policy Optimization (CRPO) to adapt Group Relative Policy Optimization (GRPO) for role-playing agents. It claims that CRPO addresses the loss of character fidelity and style collapse in standard RL methods by decoupling task logic from stylistic rewards, dynamically adapting optimization constraints based on character complexity, and using generic responses as negative baselines. The authors assert that extensive experiments show CRPO outperforms existing methods in terms of consistency, emotion, and other metrics.

Significance. If the results hold, this work would offer a targeted improvement to RL-based fine-tuning for role-playing LLMs, helping to balance reasoning capabilities with persona consistency, which is a key challenge in deploying such agents.

major comments (1)

[Abstract] The abstract states that 'extensive experiments demonstrate that CRPO outperforms existing methods in consistency, emotion and others' but provides no quantitative results, baselines, metrics, statistical tests, or implementation details. This absence makes it impossible to assess or verify the central empirical claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and for highlighting the need for greater specificity in the abstract. We address the single major comment below.

Referee: [Abstract] The abstract states that 'extensive experiments demonstrate that CRPO outperforms existing methods in consistency, emotion and others' but provides no quantitative results, baselines, metrics, statistical tests, or implementation details. This absence makes it impossible to assess or verify the central empirical claim.

Authors: We agree that the current abstract is too high-level and does not provide enough quantitative grounding for the central claim. In the revised version we will expand the final sentence of the abstract to report the primary evaluation metrics (character consistency, emotion alignment, and style distinctiveness), the key baselines (standard GRPO and SFT), and the magnitude of the observed improvements (e.g., relative gains and statistical significance where computed). This change will make the empirical contribution verifiable directly from the abstract while preserving its brevity.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and description present CRPO as a new framework with three high-level mechanisms (decoupling rewards, dynamic constraints, generic baselines) without any equations, fitted parameters, or derivation steps that reduce to self-defined quantities or prior self-citations. No load-bearing claims rely on the authors' own previous work in a way that creates definitional equivalence or forced predictions. The central claim remains independent of its inputs based on the available text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the approach implicitly relies on standard RL policy optimization assumptions without detailing any new fitted quantities or postulates.

pith-pipeline@v0.9.1-grok · 5671 in / 1033 out tokens · 46405 ms · 2026-06-29T21:48:00.668061+00:00 · methodology


[1] [1]

Training-free group relative policy optimization.arXiv preprint arXiv:2510.08191,

Cai, Y ., Cai, S., Shi, Y ., Xu, Z., Chen, L., Qin, Y ., Tan, X., Li, G., Li, Z., Lin, H., Mao, Y ., Li, K., and Sun, X. Training-free group relative policy optimization.arXiv preprint arXiv:2510.08191,

work page arXiv

[2] [2]

SocialBench: Sociality evaluation of role-playing conversational agents

Chen, H., Chen, H., Yan, M., Xu, W., Xing, G., Shen, W., Quan, X., Li, C., Zhang, J., and Huang, F. SocialBench: Sociality evaluation of role-playing conversational agents. In Ku, L.-W., Martins, A., and Srikumar, V . (eds.),Findings of the Association for Computational Linguistics: ACL 2024, pp. 2108– 2126, Bangkok, Thailand, August

2024

[3] [4]

Plan Then Action:High-Level Planning Guidance Reinforcement Learning for LLM Reasoning

Dou, Z., Zhao, Q., Wan, Z., Zhang, D., Wang, W., Raiyan, T., Chen, B., Pan, Q., Ouyang, Y ., Gao, Z., et al. Plan then action: High-level planning guidance reinforcement learning for llm reasoning.arXiv preprint arXiv:2510.01833,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [5]

ORPP: Self-optimizing role-playing prompts to enhance language model capabilities

Duan, Y ., Tang, Y ., Chen, K., Nie, L., and Zhang, M. ORPP: Self-optimizing role-playing prompts to enhance language model capabilities. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V . (eds.), Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 28585– 28600, Suzhou, China, November

2025

[5] [6]

ISBN 979-8-89176-332-6

Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main.1453. Feng, X., Dou, L., and Kong, L. Reasoning does not necessarily improve role-playing ability. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.),Findings of the Association for Computational Linguistics: ACL 2025, pp. 10301–10314, Vienna,...

work page doi:10.18653/v1/2025.emnlp-main.1453 2025

[6] [7]

ISBN 979-8-89176-256-5

Association for Computational Linguistics. ISBN 979-8-89176-256-5. doi: 10.18653/ v1/2025.findings-acl.537. Gallego, V . Humanish-Roleplay-Llama-3.1-8B. https://huggingface.co/vicgalle/ Humanish-Roleplay-Llama-3.1-8B,

2025

[7] [8]

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

GLM, T., Zeng, A., Xu, B., Wang, B., Zhang, C., Yin, D., Zhang, D., Rojas, D., Feng, G., et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools.arXiv preprint arXiv:2406.12793,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [9]

REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/ v1/2025.acl-long.731. Hu, J., Liu, J. K., Xu, H., and Shen, W. Reinforce++: Stabilizing critic-free policy optimization with global ad- vantage normalization.arXiv preprint arXiv:2501.03262,

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [10]

Kool, W., van Hoof, H., and Welling, M

doi: 10.1145/3773279. Kool, W., van Hoof, H., and Welling, M. Buy 4 REINFORCE samples, get a baseline for free! InICLR 2019 Workshop drlStructPred,

work page doi:10.1145/3773279 2019

[10] [11]

Sgpo: Self-generated preference optimization based on self-improver.arXiv preprint arXiv:2507.20181,

Lee, H., Jo, D., Yun, S., and Kim, S. Sgpo: Self-generated preference optimization based on self-improver.arXiv preprint arXiv:2507.20181,

work page arXiv

[11] [12]

Understanding Generalization in Role-Playing Models via Information Theory

Li, Y ., Lang, H., Huang, F., Qian, T., and Li, Y . Understanding generalization in role-playing models via information theory.arXiv preprint arXiv:2512.17270,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [13]

Outcome-Grounded Advantage Reshaping for Fine-Grained Credit Assignment in Mathematical Reasoning

Li, Z., Kang, L., Xiao, F., Xing, L., Si, Q., Li, Z., Gong, W., Yang, D., Xiao, Y ., and Guo, H. Outcome- grounded advantage reshaping for fine-grained credit assignment in mathematical reasoning.arXiv preprint arXiv:2601.07408,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [14]

MOA: Multi-Objective Alignment for Role-Playing Agents

Liao, C., Wang, K., Wu, Y ., Huang, F., and Li, Y . Moa: Multi-objective alignment for role-playing agents.arXiv preprint arXiv:2512.09756,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [15]

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Liu, S.-Y ., Dong, X., Lu, X., Diao, S., Belcak, P., Liu, M., Chen, M.-H., Yin, H., Wang, Y .-C. F., Cheng, K.-T., et al. Gdpo: Group reward-decoupled normalization policy optimization for multi-reward rl optimization.arXiv preprint arXiv:2601.05242,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [16]

Understanding R1-Zero-Like Training: A Critical Perspective

Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., Lee, W. S., and Lin, M. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783,

work page internal anchor Pith review Pith/arXiv arXiv

[16] [17]

RoleMRC: A fine-grained composite benchmark for role-playing and instruction-following

Lu, J., Li, J., Shen, G., Gui, L., An, S., He, Y., Yin, D., and Sun, X. RoleMRC: A fine-grained composite benchmark for role-playing and instruction-following. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp. 21008–21030, Vienna, Austria, July 2025.

[17] [18]

Large language models are superpositions of all characters: Attaining arbitrary role-play via self-alignment

Association for Computational Linguistics. ISBN 979-8-89176-256-5. doi: 10.18653/v1/2025.findings-acl.1082. Lu, K., Yu, B., Zhou, C., and Zhou, J. Large language models are superpositions of all characters: Attaining arbitrary role-play via self-alignment. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the As...

[18] [19]

Visualizing data using t-SNE

18653/v1/2024.acl-long.423. Maaten, L. v. d. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[19] [20]

NGRPO: Negative-enhanced group relative policy optimization

Nan, G., Chen, S., Huang, J., Lu, M., Wang, D., Xie, C., Xiong, W., Zeng, X., Zhou, Q., Li, Y., et al. NGRPO: Negative-enhanced group relative policy optimization. arXiv preprint arXiv:2509.18851,

[20] [21]

Deriving character logic from storyline as codified decision trees

Peng, L., Zhou, K., Yun, L., Hou, Y., and Shang, J. Deriving character logic from storyline as codified decision trees. arXiv preprint arXiv:2601.10080,

[21] [22]

Qwen2.5 Technical Report

Qwen: Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115,

[22] [23]

RiskPO: Risk-based policy optimization with verifiable reward for LLM post-training

Ren, T., Jiang, J., Yang, H., Tian, W., and Peng, Y. RiskPO: Risk-based policy optimization with verifiable reward for LLM post-training. In NeurIPS 2025 Workshop MLxOR: Mathematical Foundations and Operational Integration of Machine Learning for Uncertainty-Aware Decision-Making, 2025.

[23] [24]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,


[24] [25]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

[25] [26]

GTPO: Stabilizing group relative policy optimization via gradient and entropy control

Simoni, M., Fontana, A., Rossolini, G., Saracino, A., and Mori, P. GTPO: Stabilizing group relative policy optimization via gradient and entropy control. arXiv preprint arXiv:2508.03772,

[26] [27]

The rise of darkness: Safety-utility trade-offs in role-playing dialogue agents

Tang, Y., Chen, K., Bai, X., Niu, Z.-Y., Wang, B., Liu, J., and Zhang, M. The rise of darkness: Safety-utility trade-offs in role-playing dialogue agents. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp. 16313–16337, Vienna, Austria, July 2025a. Association for Com...

[27] [28]

SCOPE-RL: Stable and Quantitative Control of Policy Entropy in RL Post-Training

Wang, C., Li, Z., Bai, J., Zhang, Y., Cui, S., Zhao, Z., and Wang, Y. Arbitrary entropy policy optimization breaks the exploration bottleneck of reinforcement learning. arXiv preprint arXiv:2510.08141, 2025a. Wang, H., Ma, C., Reid, I., and Yaqub, M. Kalman filter enhanced GRPO for reinforcement learning-based language model reasoning. arXiv preprint arXi...

[28] [29]

Dagrpo: Rectifying gradient conflict in reasoning via distinctiveness-aware group relative policy optimization

Xie, X., Wang, X., Wang, W., Chen, S., and Lin, W. Dagrpo: Rectifying gradient conflict in reasoning via distinctiveness-aware group relative policy optimization. arXiv preprint arXiv:2512.06337,


[29] [30]

Qwen3 Technical Report

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025a. Yang, A., Xiao, B., Wang, B., Zhang, B., Bian, C., Yin, C., Lv, C., et al. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2025b. Yang, S., Lu, Z., Yang, Y., Lv, B., Shen, Y., and Liu, N. Hycora: Hyper-contr...

[30] [31]

CPO: Addressing reward ambiguity in role-playing dialogue via comparative policy optimization

Ye, J., Wang, R., Wu, Y., Ma, V., Fang, F., Huang, F., and Li, Y. CPO: Addressing reward ambiguity in role-playing dialogue via comparative policy optimization. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 297–323, Suzhou, China, November 2025.

[31] [32]

Sotopia-RL: Reward design for social intelligence

Association for Computational Linguistics. ISBN 979-8-89176-335-7. doi: 10.18653/v1/2025.findings-emnlp.18. Yu, H., Qi, Z., Zhao, Y., Nottingham, K., Xuan, K., Majumder, B. P., Zhu, H., Liang, P. P., and You, J. Sotopia-RL: Reward design for social intelligence. arXiv preprint arXiv:2508.03905, 2025a. Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y...

[32] [33]

Thinking before speaking: A role-playing model with mindset

Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.697. Zhang, B., Huang, Y., Cui, W., and Zhang, H. Thinking before speaking: A role-playing model with mindset. arXiv preprint arXiv:2409.13752,

[33] [34]

ORPO: Monolithic preference optimization without reference model

Zhang, X., Wen, S., Wu, W., and Huang, L. EDGE-GRPO: Entropy-driven GRPO with guided error correction for advantage diversity. arXiv preprint arXiv:2507.21848, 2025a. Zhang, X., Wu, S., Zhu, Y., Tan, H., Yu, S., He, Z., and Jia, J. Scaf-GRPO: Scaffolded group relative policy optimization for enhancing LLM reasoning. arXiv preprint arXiv:2510.19807, 2025b. ...

[34] [35]

Sotopia: Interactive evaluation for social intelligence in language agents

Zhou, X., Zhu, H., Mathur, L., Zhang, R., Yu, H., Qi, Z., Morency, L.-P., Bisk, Y., Fried, D., Neubig, G., and Sap, M. Sotopia: Interactive evaluation for social intelligence in language agents. arXiv preprint arXiv:2310.11667, 2024b. Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. Fine-tuning la...

[35] [36]

brave”, “historical figure

Following Character-R1 (Tang et al., 2026), these focus dimensions correspond to the evaluation dimensions in CharacterBench and thus have natural annotations.

{Character Profile} You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. Duri...

2026

[36] [37]

Xiaoming

A fine-grained credit assignment that reshapes token-level advantages based on outcome influence.

D. Experimental Result Details

D.1. Performance on Comprehensive Role-Play Dialogue Generation

As shown in Table 3, CRPO consistently outperforms all baselines across both Llama-3.2-3B and Qwen3-8B backbones, establishing a new state-of-the-art on CharacterBe...