
arXiv: 2603.21016 · v2 · submitted 2026-03-22 · cs.CL · cs.AI · cs.LG


Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO

keywords: bias, PA-GRPO, permutations, selection, across, consistent, github, group

Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias due to non-semantic factors such as option positions and label symbols. Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code is available on GitHub (https://github.com/ECNU-Text-Computing/PA-GRPO).
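
The two mechanisms named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the agreement-fraction form of the consistency reward, the weight `lam`, and the optional standard-deviation normalization (borrowed from common GRPO practice) are all assumptions made for illustration. Only the idea of centering on the mean reward over all permutations of one instance comes from the abstract.

```python
import itertools
import random
from collections import Counter
from statistics import mean, pstdev

LABELS = "ABCDEFGH"  # hypothetical label symbols shown to the model


def build_permutation_group(num_options, k, seed=0):
    """Sample up to k distinct option orderings, identity first.

    perm[j] is the original index of the option displayed at position j.
    Enumerating all orderings is only practical for a handful of options.
    """
    rng = random.Random(seed)
    all_perms = list(itertools.permutations(range(num_options)))
    identity, rest = all_perms[0], all_perms[1:]
    rng.shuffle(rest)
    return [identity] + rest[: k - 1]


def to_canonical(label, perm):
    """Map a picked label (e.g. 'B') back to the original option index."""
    return perm[LABELS.index(label)]


def group_rewards(picked_labels, perms, gold_index, lam=0.5):
    """Correctness plus an assumed consistency bonus.

    The bonus is the fraction of the group that picked the same underlying
    option, so a choice only scores high if it survives reordering.
    """
    canon = [to_canonical(l, p) for l, p in zip(picked_labels, perms)]
    counts = Counter(canon)
    n = len(canon)
    return [float(c == gold_index) + lam * counts[c] / n for c in canon]


def cross_permutation_advantages(rewards, normalize=False, eps=1e-6):
    """Center each reward on the mean over all permutations of one instance.

    normalize=True additionally divides by the group standard deviation,
    which is an assumption here, not something the abstract states.
    """
    mu = mean(rewards)
    scale = pstdev(rewards) + eps if normalize else 1.0
    return [(r - mu) / scale for r in rewards]


if __name__ == "__main__":
    perms = [(0, 1, 2), (1, 0, 2), (2, 1, 0)]
    picked = ["A", "B", "A"]  # the third pick follows position, not content
    rewards = group_rewards(picked, perms, gold_index=0)
    print(cross_permutation_advantages(rewards))
```

In this toy group the first two responses select the same underlying option under different orderings and receive a positive advantage, while the third response, which picks position A regardless of content, falls below the group mean.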


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging

    cs.CL 2026-04 unverdicted novelty 6.0

    Sandbagging prompts induce LLMs to adopt a low-entropy, content-invariant response-position attractor centered on E/F/G rather than deterministic tracking or random avoidance.