pith. machine review for the scientific record.

arxiv: 2605.11974 · v1 · submitted 2026-05-12 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization

Guanyu Wang, Tong Mo, Weiping Li, Xinrong Chen, Xu Chu, Zhijie Tan, Ziyu Li

Pith reviewed 2026-05-13 07:29 UTC · model grok-4.3

classification 💻 cs.LG
keywords: order bias, LLM fairness, reinforcement learning, order sensitivity, RAG, consistency rate, dual group advantage

The pith

Dual Group Advantage Optimization uses reinforcement learning to reduce order bias in LLMs while raising accuracy on reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models change their outputs when the order of prompt elements shifts, which creates unreliable results in retrieval-augmented generation and similar settings. Earlier fixes either slow inference with search methods or train the model on reordered data at the cost of overall correctness. DGAO applies reinforcement learning that rewards responses that are correct within each group of order variants and stable across groups. It balances an intra-group accuracy advantage signal with an inter-group stability advantage signal to update the policy. Experiments report gains in both fairness and task performance on RAG, mathematical reasoning, and classification benchmarks.

Core claim

The paper claims that Dual Group Advantage Optimization (DGAO) enables simultaneous gains in accuracy and order stability by computing intra-group relative accuracy advantage and inter-group relative stability advantage. This reinforcement learning procedure rewards the policy for order-stable correct outputs and penalizes order-sensitive or incorrect responses. The method introduces Consistency Rate and Overconfidence Rate metrics to expose pseudo-stability in prior approaches. Extensive tests show improved order fairness together with higher performance on RAG, mathematical reasoning, and classification tasks.

What carries the argument

Dual Group Advantage Optimization (DGAO), a reinforcement learning procedure that computes and balances an intra-group relative accuracy advantage with an inter-group relative stability advantage to guide policy updates toward correct and order-invariant outputs.
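
A minimal sketch of how such a dual advantage could be computed for one query, assuming binary correctness rewards over order-variant rollouts that have already been assigned to order groups; the grouping criterion, the stability term, and the weighting parameter lam below are illustrative stand-ins, not the paper's exact estimator.

```python
import numpy as np

def dual_group_advantage(rewards, group_ids, lam=0.5):
    """Illustrative dual group advantage for one query.

    rewards   : (N,) binary correctness rewards, one per order-variant rollout
    group_ids : (N,) integer id of the order group each rollout belongs to
    lam       : hypothetical weight balancing accuracy vs. stability
    Returns an (N,) array of per-rollout advantages.
    """
    rewards = np.asarray(rewards, dtype=float)
    group_ids = np.asarray(group_ids)
    groups = np.unique(group_ids)

    # Intra-group accuracy advantage: center each reward on its group mean,
    # crediting rollouts that are more often correct than their group.
    a_intra = np.zeros_like(rewards)
    for g in groups:
        mask = group_ids == g
        a_intra[mask] = rewards[mask] - rewards[mask].mean()

    # Inter-group stability advantage: penalize groups whose mean reward
    # deviates from the mean over all groups, a simple proxy for
    # order-sensitive behaviour.
    group_means = np.array([rewards[group_ids == g].mean() for g in groups])
    overall_mean = group_means.mean()
    a_inter = np.zeros_like(rewards)
    for g, gm in zip(groups, group_means):
        a_inter[group_ids == g] = -abs(gm - overall_mean)

    # Aggregate the two signals into the final advantage used for the
    # policy update.
    return lam * a_intra + (1.0 - lam) * a_inter
```

A GRPO-style update would then scale each rollout's token log-probabilities by this final advantage; the paper's actual normalization and aggregation rule may differ.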

If this is right

  • DGAO produces higher accuracy and order stability on RAG, mathematical reasoning, and classification tasks.
  • Consistency Rate and Overconfidence Rate metrics expose cases where prior methods achieve apparent stability through consistent errors (a sketch of both metrics follows this list).
  • Reinforcement learning can directly mitigate LLMs' order sensitivity, which the paper presents as a first.
  • The approach supports more reliable in-context learning by reducing dependence on specific input arrangements.
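
Both diagnostic metrics can be read as counting statistics over order-variant predictions. A hedged sketch, assuming Consistency Rate is the share of instances answered identically under every order and Overconfidence Rate is the share answered consistently but incorrectly; the paper's exact definitions may differ.

```python
from collections import Counter

def consistency_and_overconfidence(predictions, gold):
    """Sketch of Consistency Rate (CR) and Overconfidence Rate (OR).

    predictions : per instance, the list of answers produced under
                  different input orders for that instance
    gold        : gold answers, one per instance
    """
    consistent = 0
    consistent_wrong = 0
    for variants, answer in zip(predictions, gold):
        counts = Counter(variants)
        if len(counts) == 1:              # same output under every order
            consistent += 1
            if variants[0] != answer:     # stable, but stably incorrect
                consistent_wrong += 1
    n = len(gold)
    return consistent / n, consistent_wrong / n

# Example with 3 instances and 2 order variants each:
cr, oc = consistency_and_overconfidence(
    [["A", "A"], ["B", "C"], ["D", "D"]],
    ["A", "B", "E"],
)
print(cr, oc)  # ~0.67 (2 of 3 consistent), ~0.33 (1 of 3 consistently wrong)
```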

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dual-advantage structure could be adapted to address other positional or formatting biases in sequence models.
  • Models trained this way may prove more reliable in production settings where users supply prompts in varying orders.
  • Balancing multiple advantage signals in this manner offers a template for multi-objective reinforcement learning in language model training.

Load-bearing premise

That dual advantage signals from accuracy within order groups and stability across order groups can be computed and balanced without creating new biases or over-penalizing reasoning that legitimately depends on order.

What would settle it

A controlled test on held-out tasks in which DGAO-trained models retain the same level of order sensitivity as the base model or show lower accuracy than supervised fine-tuning baselines would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.11974 by Guanyu Wang, Tong Mo, Weiping Li, Xinrong Chen, Xu Chu, Zhijie Tan, Ziyu Li.

Figure 1: Consistency Rates (obtained from 8 random …)
Figure 2: DGAO algorithm framework. A batch contains M (e.g., M = 2 in the figure) queries; each query consists of text prompts and Q. Each text prompt is constructed into N order variants, and rewards are obtained through respective inference. Based on the rewards, the intra-group relative advantage A_ind and the inter-group relative advantage A_group are calculated respectively, and the aggregated final advantage A_final …
Figure 3: Impact of input order on the performance of …
Figure 4: DGAO's Order Stability Generalization (using …)
Figure 5: Comparison examples of PAFT and DGAO test results.
Figure 6: Impact of input order on the performance of Qwen-2.5-3B across 8 random orders.
Figure 7: Impact of input order on the performance of Gemma-3-4B across 8 random orders.
Figure 8: Example of SST2 dataset.
Figure 9: Example of SQuAD v2 dataset.
Figure 10: Example of GSM8K dataset.
read the original abstract

Large Language Models (LLMs) suffer from order bias, where their performance is affected by the arrangement order of input elements. This unfairness limits the model's applications in scenarios such as in-context learning and Retrieval-Augmented Generation (RAG). Recent studies attempt to obtain optimal or suboptimal arrangements based on statistical results or using dataset-based search, but these methods increase inference overhead while leaving the model's inherent order bias unresolved. Other studies mitigate order sensitivity through supervised fine-tuning using augmented training sets with multiple order variants, but often at the cost of accuracy, trapping the model in consistent yet incorrect hallucinations. In this paper, we propose Dual Group Advantage Optimization (DGAO), which aims to improve model accuracy and order stability simultaneously. DGAO calculates and balances intra-group relative accuracy advantage and inter-group relative stability advantage, rewarding the policy model for generating order-stable and correct outputs while penalizing order-sensitive or incorrect responses. This marks the first time reinforcement learning has been used to mitigate LLMs' order sensitivity. We also propose two new metrics, Consistency Rate and Overconfidence Rate, to reveal the pseudo-stability of previous methods and guide more comprehensive evaluation. Extensive experiments demonstrate that DGAO achieves superior order fairness while improving performance on RAG, mathematical reasoning, and classification tasks. Our code is available at: https://github.com/Hyalinesky/DGAO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs exhibit order bias affecting performance in tasks such as RAG and in-context learning. It proposes Dual Group Advantage Optimization (DGAO), a reinforcement learning method that computes intra-group relative accuracy advantages and inter-group relative stability advantages from order-variant rollouts to simultaneously improve accuracy and order invariance. The work introduces Consistency Rate and Overconfidence Rate metrics, reports gains over prior methods on RAG, mathematical reasoning, and classification tasks, and positions this as the first RL-based approach to order fairness.

Significance. If the central claim holds after addressing the advantage estimation details, the result would be significant as the first demonstration of RL mitigating LLM order sensitivity without the accuracy-stability trade-off observed in supervised fine-tuning baselines. The new metrics provide a more nuanced evaluation framework than prior consistency measures, and the dual-advantage formulation offers a generalizable template for fairness-aware RL in language models.

major comments (2)
  1. §3.2 (DGAO formulation): the inter-group stability advantage is defined over the same set of order permutations used to partition groups and generate rollouts. This risks circularity, as the stability signal becomes a function of the model's current (potentially biased) distribution over those permutations rather than measuring true invariance; the paper must show that the advantage estimator does not simply reward consistency with existing order preferences.
  2. §4 (Experiments): the reported gains in accuracy and stability rest on comparisons whose statistical significance, variance across random seeds, and ablation on the relative weighting of the two advantage terms are not detailed. Without these, it is unclear whether improvements are attributable to DGAO or to other factors such as the RL training regime itself.
minor comments (2)
  1. §3: The abstract and method description would benefit from an explicit equation for the combined advantage (e.g., a weighted sum or product of intra- and inter-group terms) to make the balancing mechanism reproducible; a sketch of one possible form follows this list.
  2. §4.1: Table captions and experimental setup should clarify the exact number of order permutations sampled per instance and how groups are formed (pre-specified vs. post-hoc clustering).
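
As an illustration of what such an equation could look like, using the A_ind, A_group, and A_final notation from Figure 2; the additive form and the weight λ follow the referee's suggestion rather than a formula quoted from the paper.

```latex
% Hypothetical aggregation of the two advantage terms; the weighted-sum
% form and \lambda are illustrative, not taken from the paper.
A_{\mathrm{final}} = \lambda \, A_{\mathrm{ind}} + (1 - \lambda) \, A_{\mathrm{group}},
\qquad \lambda \in [0, 1].
```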

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major comment point by point below, providing clarifications on the DGAO formulation and committing to enhancements in the experimental section to strengthen the manuscript.

read point-by-point responses
  1. Referee: §3.2 (DGAO formulation): the inter-group stability advantage is defined over the same set of order permutations used to partition groups and generate rollouts. This risks circularity, as the stability signal becomes a function of the model's current (potentially biased) distribution over those permutations rather than measuring true invariance; the paper must show that the advantage estimator does not simply reward consistency with existing order preferences.

    Authors: We appreciate the referee's concern regarding potential circularity in the inter-group stability advantage. In our formulation, order permutations are first partitioned into distinct groups according to structural criteria (such as permutation type or shuffling pattern), with independent rollouts generated for each group. The intra-group advantage is computed solely from relative accuracy within a fixed permutation group, while the inter-group advantage measures relative stability by comparing consistency rates across these partitioned groups. This comparative structure ensures the stability signal rewards improvements in invariance that exceed the model's initial distribution, rather than reinforcing existing biases. We will revise §3.2 to include an explicit step-by-step derivation of the advantage estimators and a short proof sketch showing that the inter-group term is orthogonal to intra-group preferences. revision: partial

  2. Referee: §4 (Experiments): the reported gains in accuracy and stability rest on comparisons whose statistical significance, variance across random seeds, and ablation on the relative weighting of the two advantage terms are not detailed. Without these, it is unclear whether improvements are attributable to DGAO or to other factors such as the RL training regime itself.

    Authors: We agree that additional experimental rigor is needed. In the revised manuscript we will report results aggregated over five independent random seeds with mean and standard deviation for all metrics, include paired t-test p-values comparing DGAO against each baseline, and add a dedicated ablation subsection varying the weighting hyperparameter λ between the intra-group accuracy advantage and inter-group stability advantage. These additions will isolate the contribution of the dual-advantage mechanism from the general RL training procedure. revision: yes

Circularity Check

0 steps flagged

No significant circularity in DGAO derivation chain

full rationale

The paper's central derivation defines intra-group accuracy advantage and inter-group stability advantage directly from empirical rollout statistics on order variants, then applies standard RL policy optimization to balance them. These quantities are computed from observed correctness and consistency rates rather than being fitted to a target or defined in terms of the optimization output itself. No equations reduce the claimed prediction to a self-referential fit, no uniqueness theorem is imported via self-citation, and no ansatz is smuggled through prior work. The approach remains self-contained against external task benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The method relies on standard RL policy optimization assumptions and the ability to partition responses into accuracy and stability groups; no new free parameters, axioms, or invented entities are introduced beyond conventional RL components.

pith-pipeline@v0.9.0 · 5582 in / 1145 out tokens · 33558 ms · 2026-05-13T07:29:16.020476+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 9 internal anchors
