T²-GRPO trains caregiver agents for dementia care by decoupling turn-level environment rewards from trajectory-level rewards, normalizing each independently with centered ranks and applying a hard veto.
1 Pith paper cites this work.
Fields: cs.AI (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
Can the Environment Speak for Itself? $T^{2}$-GRPO: A Turn-Trajectory Group Relative Policy Optimization for Caregiver Agents
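The summary names two mechanisms, independent centered-rank normalization and a hard veto, without giving formulas. The Python sketch below shows one plausible reading and is not the paper's actual formulation. It assumes equal-length trajectories within a group, centered ranks scaled to [-0.5, 0.5] with ties sharing their average rank, an additive combination of turn and trajectory advantages, and a veto that overwrites every turn's advantage with a fixed penalty. The names `centered_ranks`, `t2_advantages`, `vetoed`, and `veto_value` are hypothetical.

```python
def centered_ranks(values):
    """Map values to centered ranks in [-0.5, 0.5]; ties share their average rank."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return [r / (n - 1) - 0.5 for r in ranks]


def t2_advantages(turn_rewards, traj_rewards, vetoed, veto_value=-1.0):
    """Per-turn advantages for a group of trajectories (assumed scheme).

    turn_rewards: one list of per-turn rewards per trajectory (equal lengths).
    traj_rewards: one scalar reward per trajectory.
    vetoed:       one bool per trajectory; True triggers the hard veto.
    """
    n_turns = len(turn_rewards[0])
    # Trajectory rewards are ranked across the group, independently of turns.
    traj_adv = centered_ranks(traj_rewards)
    # Turn rewards are ranked across the group separately at each turn index.
    turn_adv = [centered_ranks([tr[t] for tr in turn_rewards])
                for t in range(n_turns)]
    out = []
    for g in range(len(turn_rewards)):
        if vetoed[g]:
            # Assumption: a veto overrides both signals with a fixed penalty.
            out.append([veto_value] * n_turns)
        else:
            out.append([turn_adv[t][g] + traj_adv[g] for t in range(n_turns)])
    return out
```

Ranking the two reward streams separately keeps either scale from dominating the other, which is one way to read "decouples" in the summary. The additive combination is only the simplest choice for illustration.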