Regime-Aware Peer Specialization for Robust RAG under Heterogeneous Knowledge Conflicts

Bo Wang; Chong Feng; Ge Shi; Heyan Huang; Jiahao Teng; Yanghao Zhou; Yaolin Li; Ziyi Yang

arxiv: 2606.30518 · v1 · pith:YELJOOR4new · submitted 2026-06-29 · 💻 cs.CL

Regime-Aware Peer Specialization for Robust RAG under Heterogeneous Knowledge Conflicts

Bo Wang , Heyan Huang , Yaolin Li , Yanghao Zhou , Jiahao Teng , Ziyi Yang , Ge Shi , Chong Feng This is my paper

Pith reviewed 2026-06-30 06:03 UTC · model grok-4.3

classification 💻 cs.CL

keywords RAGknowledge conflictspeer specializationregime-aware trainingretrieval-augmented generationconflict resolutionrobust generation

0 comments

The pith

RAPS-DA disentangles learning signals for RAG conflicts by training regime-specific peer specialists and routing samples accordingly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a method to make retrieval-augmented generation more robust when external context conflicts with a model's internal knowledge. It partitions these conflicts into three regimes—Grounding, Arbitration, and Resistance—and trains a separate peer specialist for each from the same base model. Samples are routed to the matching peer for specialized supervision using reverse KL divergence, while a token-level selector focuses on informative tokens. This approach avoids mixing incompatible signals that plague uniform training methods. The resulting student model performs better on conflict scenarios without needing regime information at deployment time.

Core claim

RAPS-DA divides knowledge conflicts into Grounding, Arbitration, and Resistance regimes, trains one same-scale peer specialist per regime, hard-routes samples to the regime-matched peer for on-policy reverse-KL supervision, and applies a dual-layer token selector based on inter-teacher disagreement, student-teacher divergence, and student entropy to filter and upweight tokens. This yields a student that surpasses prompting, decoding, fine-tuning, RL, and single-teacher baselines on five conflict scenarios and two out-of-distribution benchmarks, with gains from specialization rather than stronger teachers.

What carries the argument

The regime-aware peer specialization framework with hard-routing to three regime-matched peers and a dual-layer selector for token supervision.

If this is right

Specialization at fixed model scale produces better robustness than single-teacher training.
The deployed model needs no regime labels or access to peers.
Token-level filtering gradually focuses on high-conflict tokens as training progresses.
Performance gains hold on out-of-distribution benchmarks beyond the training conflict scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar regime partitioning could help in other areas where training signals vary in reliability, such as instruction tuning with noisy data.
The approach might scale to more than three regimes if finer conflict distinctions prove useful.
Since peers are discarded after training, the method could be combined with distillation techniques for even smaller final models.

Load-bearing premise

Conflicts can be accurately partitioned into the three regimes so that routing to matched peers creates disentangled beneficial signals instead of new inconsistencies.

What would settle it

Training a model with this regime-aware routing and finding it performs no better than a single-teacher baseline on the same conflict scenarios would falsify the benefit of the specialization.

Figures

Figures reproduced from arXiv: 2606.30518 by Bo Wang, Chong Feng, Ge Shi, Heyan Huang, Jiahao Teng, Yanghao Zhou, Yaolin Li, Ziyi Yang.

**Figure 2.** Figure 2: Overview of the RAPS-DA framework. Panel 1: A shared base model is fine-tuned into three regime-specialized peer teachers via SFT on the Grounding, Arbitration, and Resistance subsets. Panel 2: Each training sample is hard-routed by its regime label to the matched teacher; the student generates on-policy rollouts and receives routed reverse-KL supervision. Panel 3: Three token-level diagnostic signals—int… view at source ↗

**Figure 3.** Figure 3: Regime-level performance comparison (Qwen-7B). Bars show the regime-averaged EM [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

**Figure 4.** Figure 4: Training dynamics under different annealing schedules. Each panel reports validation EM [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗

read the original abstract

Retrieval-augmented generation (RAG) improves language models by grounding generation in external context. However, it can be fragile when the retrieved context conflicts with the model's parametric knowledge. Such conflicts span a reliability spectrum, ranging from reliable and partially reliable evidence to adversarial context. Existing remedies often handle such heterogeneous conflicts with regime-agnostic supervision, which can conflate incompatible learning signals across reliability regimes. To disentangle these signals, we propose RAPS-DA, a regime-aware peer specialization framework that addresses conflict at two complementary granularities. At the sample level, conflicts are divided into three regimes, including Grounding, Arbitration, and Resistance, with one same-scale peer specialist trained per regime from a shared base model. Each sample is then hard-routed to its regime-matched peer for on-policy reverse-KL supervision. At the token level, a dual-layer selector uses inter-teacher disagreement, student-teacher divergence, and student entropy to filter uninformative or unstable tokens, upweight confidently misaligned ones, and gradually focus supervision on high-conflict tokens as the student matures. Gains stem from specialization at a fixed model scale, not from a stronger teacher, and the peer specialists exist only during training, so the deployed student requires no regime labels or peer access. Experiments on five conflict scenarios and two out-of-distribution benchmarks show RAPS-DA surpasses all prompting, decoding, fine-tuning, RL, and single-teacher baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RAPS-DA splits RAG conflicts into three regimes and trains matched peer specialists with a disagreement-based token selector, but the abstract leaves the routing benefit unproven.

read the letter

The main takeaway is that this paper tries to fix conflicting signals in RAG fine-tuning by splitting examples into Grounding, Arbitration, and Resistance regimes, training one peer per regime from the same base, hard-routing each sample to its peer for reverse-KL, and adding a dual-layer token selector that looks at teacher disagreement, divergence, and entropy. That combination is the actual new piece; it does not collapse to the single-teacher or regime-agnostic baselines they cite.

The approach makes sense on its face. Treating all conflicts the same can mix incompatible gradients, so separating them at both sample and token level is a direct response to that problem. The fact that the peers are only used in training and the final model needs no regime labels is also practical.

The soft spot is exactly the one in the stress-test note. The abstract describes the regime split and hard-routing but gives no classifier accuracy, no overlap stats between regimes, and no ablation that keeps the token selector but swaps regime-matched routing for random or single-peer routing. Without those checks, the reported gains on the five scenarios could come mostly from the selector or from other training details rather than the specialization itself. The out-of-distribution results are mentioned but not broken down by regime either.

This is for people already working on robust RAG pipelines who want a concrete training recipe. A serious referee should see it, mainly to verify the missing ablations and to check whether the regime definitions are stable enough to be useful. I would not cite it yet, but the core idea is worth a full review.

Referee Report

3 major / 2 minor

Summary. The paper proposes RAPS-DA, a regime-aware peer specialization method for robust RAG under heterogeneous knowledge conflicts. Conflicts are partitioned into three regimes (Grounding, Arbitration, Resistance); one same-scale peer is trained per regime from a shared base; each sample is hard-routed to its matched peer for on-policy reverse-KL supervision. A dual-layer token selector filters tokens using inter-teacher disagreement, student-teacher divergence, and student entropy. The deployed model requires no regime labels or peers. Experiments on five conflict scenarios plus two OOD benchmarks claim superiority over prompting, decoding, fine-tuning, RL, and single-teacher baselines.

Significance. If the regime partitioning and hard-routing mechanism can be shown to produce net-positive disentangled signals rather than misrouting artifacts, the approach would offer a training-only specialization strategy that improves conflict robustness at fixed inference scale. The dual-layer selector and reverse-KL formulation are technically coherent extensions of existing distillation ideas, but their interaction with regime routing remains unverified.

major comments (3)

[§3.2] §3.2 (Regime Definition and Classifier): The three regimes are defined and a classifier is introduced, yet no accuracy, precision-recall, or inter-regime overlap statistics are reported for the regime predictor on the training or validation splits. Without these numbers it is impossible to confirm that hard-routing actually delivers regime-matched supervision rather than noise.
[§4.3] §4.3 and Table 3 (Ablation Studies): Performance tables show gains over baselines, but the manuscript contains no ablation that replaces regime-matched routing with random routing or a single shared peer while freezing the dual-layer token selector. Consequently the central claim that specialization (rather than the selector or other factors) drives the reported improvements cannot be isolated.
[§5.1] §5.1 (Token Selector Interaction): The dual-layer selector is motivated by disagreement and entropy signals, yet no analysis quantifies how often the selector and regime router disagree on high-conflict tokens or whether selector up-weighting amplifies misrouted samples. This interaction is load-bearing for the “disentangled signals” argument.

minor comments (2)

[§3.3] Notation for the reverse-KL objective and the dual-layer selector weights is introduced without an explicit equation reference in the main text; a numbered equation would improve traceability.
[Figure 2] Figure 2 (regime distribution) lacks error bars or per-scenario breakdowns; adding these would clarify stability across the five conflict scenarios.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of validating the regime routing and component interactions. We address each major comment below and will incorporate additional analyses and ablations in the revised manuscript.

read point-by-point responses

Referee: [§3.2] §3.2 (Regime Definition and Classifier): The three regimes are defined and a classifier is introduced, yet no accuracy, precision-recall, or inter-regime overlap statistics are reported for the regime predictor on the training or validation splits. Without these numbers it is impossible to confirm that hard-routing actually delivers regime-matched supervision rather than noise.

Authors: We agree that the regime classifier performance metrics are necessary to substantiate the hard-routing mechanism. The current manuscript does not report these statistics. In the revision we will add accuracy, precision-recall, and F1 scores together with a confusion matrix for the regime predictor evaluated on both the training and validation splits. revision: yes
Referee: [§4.3] §4.3 and Table 3 (Ablation Studies): Performance tables show gains over baselines, but the manuscript contains no ablation that replaces regime-matched routing with random routing or a single shared peer while freezing the dual-layer token selector. Consequently the central claim that specialization (rather than the selector or other factors) drives the reported improvements cannot be isolated.

Authors: This point is well taken; an explicit random-routing ablation with the selector held fixed would strengthen isolation of the specialization effect. While single-teacher baselines are present, we did not include the requested random-routing control. We will run and report this ablation in the revised version, comparing regime-matched routing against random peer assignment under identical selector conditions. revision: yes
Referee: [§5.1] §5.1 (Token Selector Interaction): The dual-layer selector is motivated by disagreement and entropy signals, yet no analysis quantifies how often the selector and regime router disagree on high-conflict tokens or whether selector up-weighting amplifies misrouted samples. This interaction is load-bearing for the “disentangled signals” argument.

Authors: We acknowledge the value of quantifying the selector-router interaction. The manuscript currently lacks this analysis. In the revision we will add measurements of disagreement frequency between the token selector and regime router on high-conflict tokens, as well as an examination of whether selector up-weighting influences any misrouted samples. revision: yes

Circularity Check

0 steps flagged

No circularity: procedural method with no self-referential derivations

full rationale

The provided abstract and description define RAPS-DA as a new training procedure involving regime partitioning (Grounding/Arbitration/Resistance), per-regime peer training, hard-routing for reverse-KL, and a dual-layer token selector based on disagreement/divergence/entropy. No equations, fitted parameters renamed as predictions, or self-citations invoking uniqueness theorems appear. The central claims rest on experimental comparisons to baselines rather than any reduction of outputs to inputs by construction. This matches the reader's note that no derivations are present and the procedure introduces new components without visible circular reduction. Per rules, absent any quotable self-definitional or fitted-input steps, the finding is no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the approach relies on empirical choices for regime partitioning and token selection whose details are not supplied.

pith-pipeline@v0.9.1-grok · 5801 in / 1119 out tokens · 44789 ms · 2026-06-30T06:03:26.978807+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 20 canonical work pages · 11 internal anchors

[1]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InInt. Conf. Learn. Represent., 2024

2024
[2]

SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

Anonymous. Scope: Correctness-based dual-path token weighting for on-policy distillation. arXiv preprint arXiv:2604.10688, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

Srpo: Self-refined policy optimization via correctness-aware routing.arXiv preprint arXiv:2604.02288, 2026

Anonymous. Srpo: Self-refined policy optimization via correctness-aware routing.arXiv preprint arXiv:2604.02288, 2026

work page arXiv 2026
[4]

TIP: Token Importance in On-Policy Distillation

Anonymous. Tip: Token importance profiling for efficient on-policy distillation.arXiv preprint arXiv:2604.14084, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[5]

Self-rag: Learning to retrieve, generate, and critique through self-reflection

Akari Asai, Zeqiu Wu, Yizhong Wang, Avi Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. InInt. Conf. Learn. Represent., 2024. 15

2024
[6]

Parameters vs

Baolong Bi, Shenghua Liu, Yiwei Wang, Yilong Xu, Junfeng Fang, Lingrui Mei, and Xueqi Cheng. Parameters vs. context: Fine-grained control of knowledge reliance in language mod- els, 2025

2025
[7]

Dola: Decoding by contrasting layers improves factuality in large language models

Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. Dola: Decoding by contrasting layers improves factuality in large language models. InInt. Conf. Learn. Represent., 2024

2024
[8]

Retrieve only when it needs: Adaptive retrieval augmentation for hallucination mitigation in large language models

Hanxing Ding, Liang Pang, Zihao Wei, Huawei Shen, and Xueqi Cheng. Retrieve only when it needs: Adaptive retrieval augmentation for hallucination mitigation in large language models. arXiv preprint arXiv:2402.10612, 2024

work page arXiv 2024
[9]

Enhancing noise robustness of retrieval-augmented language models via RAAT

Yucheng Fang, Ruochen Wang, Kun Qian, Yansong Feng, Diyi Yang, and He He. Enhancing noise robustness of retrieval-augmented language models via RAAT. InProc. Annu. Meet. Assoc. Comput Linguist., 2024

2024
[10]

Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

Yao Fu et al. Revisiting on-policy distillation: Three failure modes and top-k truncated reverse- kl.arXiv preprint arXiv:2603.25562, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[11]

Minillm: Knowledge distillation of large language models

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. InInt. Conf. Learn. Represent., 2024

2024
[12]

Deepseek-r1 incentivizes reasoning in LLMs through reinforcement learning.Nature, 645(8081):633–638, sep 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in LLMs through reinforcement learning.Nature, 645(8081):633–638, sep 2025

2025
[13]

REALM: Retrieval-augmented language model pre-training

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval-augmented language model pre-training. InProceedings of the 37th International Conference on Machine Learning (ICML), pages 3929–3938, 2020

2020
[14]

Retrieving, rethinking and revising: The chain-of-verification can improve retrieval augmented generation

Bolei He, Nuo Chen, Xinran He, Lingyong Yan, Zhenkai Wei, Jinchang Luo, and Zhen-Hua Ling. Retrieving, rethinking and revising: The chain-of-verification can improve retrieval augmented generation. InConf. Empir. Methods Nat. Lang. Process., pages 10371–10393, 2024

2024
[15]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[16]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Za- mani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning.arXiv preprint arXiv:2503.09516, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Tug-of-war between knowledge: Exploring and resolving knowledge conflicts in retrieval-augmented language models

Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Xiaojian Jiang, Jiexin Xu, Li Qiuxia, and Jun Zhao. Tug-of-war between knowledge: Exploring and resolving knowledge conflicts in retrieval-augmented language models. InInt. Conf. Comput. Linguist., Lang. Resour. Eval., pages 16867–16878, 2024

2024
[18]

Reopold: Reward-based on-policy distillation with mixture-based reward clipping.arXiv preprint arXiv:2603.11137, 2026

Jongwoo Ko, Sungmin Park, and Joohyung Kim. Reopold: Reward-based on-policy distillation with mixture-based reward clipping.arXiv preprint arXiv:2603.11137, 2026

work page arXiv 2026
[19]

Retrieval-augmented generation for knowledge-intensive NLP tasks

Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. InAd- vances in Neural Information Processing Systems (NeurIPS), volume 33, pages 9459–9474, 2020

2020
[20]

Knowledge-aware fine-tuning for robust retrieval- augmented generation.arXiv preprint arXiv:2407.12854, 2024

Xiaoyu Li, Hao Zhang, and Zhiyuan Wang. Knowledge-aware fine-tuning for robust retrieval- augmented generation.arXiv preprint arXiv:2407.12854, 2024

work page arXiv 2024
[21]

Truthfulqa: Measuring how models mimic human falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. InProceedings of the 60th Annual Meeting of the Association for Compu- tational Linguistics (ACL), pages 3214–3252, 2022. 16

2022
[22]

Knowledgeable-r1: Reinforcement learn- ing for knowledge-conflict resolution in rag.arXiv preprint arXiv:2503.12345, 2025

Zhen Lin, Yifei Wang, Hao Chen, and Zhiyuan Liu. Knowledgeable-r1: Reinforcement learn- ing for knowledge-conflict resolution in rag.arXiv preprint arXiv:2503.12345, 2025

work page arXiv 2025
[23]

Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

Haoran Luo et al. Stable on-policy distillation: Mitigating length inflation in llm training. arXiv preprint arXiv:2604.08527, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[24]

Privileged Information Distillation for Language Models

Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, and Massimo Caccia. Privileged information distillation for language models.arXiv preprint arXiv:2602.04942, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

Qwen2.5 technical report, 2025

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

2025
[26]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Trusting your evidence: Hallucinate less with context-aware decoding

Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, and Scott Wen- tau Yih. Trusting your evidence: Hallucinate less with context-aware decoding. InProc. Conf. North American Chapter Assoc. Comput. Linguist., pages 783–800, 2024

2024
[28]

Conflictbank: a benchmark for evaluating knowledge conflicts in large language models

Zhaochen Su, Jun Zhang, Xiaoye Qu, Tong Zhu, Yanshu Li, Jiashuo Sun, Juntao Li, Min Zhang, and Yu Cheng. Conflictbank: a benchmark for evaluating knowledge conflicts in large language models. InAdv. Neural Inform. Process. Syst., pages 103242–103268, 2024

2024
[29]

Branch- train-mix: Mixing expert llms into a mixture-of-experts llm.arXiv preprint arXiv:2403.07816, 2024

Sainbayar Sukhbaatar, Naman Goyal, Gabriel Synnaeve, and Guillaume Lample. Branch- train-mix: Mixing expert llms into a mixture-of-experts llm.arXiv preprint arXiv:2403.07816, 2024

work page arXiv 2024
[30]

Knowledge fusion of large language models

Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, and Shuming Shi. Knowledge fusion of large language models. InInt. Conf. Learn. Represent., 2024

2024
[31]

Astute rag: Over- coming imperfect retrieval augmentation and knowledge conflicts for large language models

Fei Wang, Xingchen Wan, Ruoxi Sun, Jiefeng Chen, and Sercan Ö Arık. Astute rag: Over- coming imperfect retrieval augmentation and knowledge conflicts for large language models. Proc. Annu. Meet. Assoc. Comput Linguist., 2025

2025
[32]

Retrieval-augmented gen- eration with conflicting evidence

Han Wang, Archiki Prasad, Elias Stengel-Eskin, and Mohit Bansal. Retrieval-augmented gen- eration with conflicting evidence. InConference on Language Modeling, 2025

2025
[33]

Resolving knowledge conflicts in large language models.Conference on Language Modeling, 2024

Yike Wang, Shangbin Feng, Heng Wang, Weijia Shi, Vidhisha Balachandran, Tianxing He, and Yulia Tsvetkov. Resolving knowledge conflicts in large language models.Conference on Language Modeling, 2024

2024
[34]

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

Chengyue Wu et al. Lightning on-policy distillation: Teacher consistency is all you need. arXiv preprint arXiv:2604.13010, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[35]

Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts

Jian Xie, Kai Zhang, Jiangjie Chen, Renze Lou, and Yu Su. Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts. InInt. Conf. Learn. Represent., 2024

2024
[36]

Info-rag: Information-filtered on-policy retrieval-augmented generation.arXiv preprint arXiv:2406.19009, 2024

Chenliang Xu, Jiaxin Guo, Yiwei Wang, and Shenghua Liu. Info-rag: Information-filtered on-policy retrieval-augmented generation.arXiv preprint arXiv:2406.19009, 2024

work page arXiv 2024
[37]

Knowledge conflicts for llms: A survey

Rongwu Xu, Zehan Qi, Zhijiang Guo, Cunxiang Wang, Hongru Wang, Yue Zhang, and Wei Xu. Knowledge conflicts for llms: A survey. InConf. Empir. Methods Nat. Lang. Process., pages 8541–8565, 2024

2024
[38]

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

Zichun Yang et al. G-opd: Generalized on-policy distillation as dense kl-constrained reinforce- ment learning.arXiv preprint arXiv:2602.12125, 2026. 17

work page internal anchor Pith review Pith/arXiv arXiv 2026
[39]

On-Policy Context Distillation for Language Models

Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models.arXiv preprint arXiv:2602.12275, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[40]

Confidence-aware multi-teacher knowledge dis- tillation.arXiv preprint arXiv:2201.00007, 2022

Hailin Zhang, Defang Chen, and Can Wang. Confidence-aware multi-teacher knowledge dis- tillation.arXiv preprint arXiv:2201.00007, 2022

work page arXiv 2022
[41]

Dynamic contrastive decoding for knowledge conflict resolution in large language models.arXiv preprint arXiv:2405.13183, 2024

Xueying Zhang, Yanqiu Chen, and Yongkang Li. Dynamic contrastive decoding for knowledge conflict resolution in large language models.arXiv preprint arXiv:2405.13183, 2024. 18

work page arXiv 2024

[1] [1]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InInt. Conf. Learn. Represent., 2024

2024

[2] [2]

SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

Anonymous. Scope: Correctness-based dual-path token weighting for on-policy distillation. arXiv preprint arXiv:2604.10688, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

Srpo: Self-refined policy optimization via correctness-aware routing.arXiv preprint arXiv:2604.02288, 2026

Anonymous. Srpo: Self-refined policy optimization via correctness-aware routing.arXiv preprint arXiv:2604.02288, 2026

work page arXiv 2026

[4] [4]

TIP: Token Importance in On-Policy Distillation

Anonymous. Tip: Token importance profiling for efficient on-policy distillation.arXiv preprint arXiv:2604.14084, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[5] [5]

Self-rag: Learning to retrieve, generate, and critique through self-reflection

Akari Asai, Zeqiu Wu, Yizhong Wang, Avi Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. InInt. Conf. Learn. Represent., 2024. 15

2024

[6] [6]

Parameters vs

Baolong Bi, Shenghua Liu, Yiwei Wang, Yilong Xu, Junfeng Fang, Lingrui Mei, and Xueqi Cheng. Parameters vs. context: Fine-grained control of knowledge reliance in language mod- els, 2025

2025

[7] [7]

Dola: Decoding by contrasting layers improves factuality in large language models

Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. Dola: Decoding by contrasting layers improves factuality in large language models. InInt. Conf. Learn. Represent., 2024

2024

[8] [8]

Retrieve only when it needs: Adaptive retrieval augmentation for hallucination mitigation in large language models

Hanxing Ding, Liang Pang, Zihao Wei, Huawei Shen, and Xueqi Cheng. Retrieve only when it needs: Adaptive retrieval augmentation for hallucination mitigation in large language models. arXiv preprint arXiv:2402.10612, 2024

work page arXiv 2024

[9] [9]

Enhancing noise robustness of retrieval-augmented language models via RAAT

Yucheng Fang, Ruochen Wang, Kun Qian, Yansong Feng, Diyi Yang, and He He. Enhancing noise robustness of retrieval-augmented language models via RAAT. InProc. Annu. Meet. Assoc. Comput Linguist., 2024

2024

[10] [10]

Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

Yao Fu et al. Revisiting on-policy distillation: Three failure modes and top-k truncated reverse- kl.arXiv preprint arXiv:2603.25562, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[11] [11]

Minillm: Knowledge distillation of large language models

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. InInt. Conf. Learn. Represent., 2024

2024

[12] [12]

Deepseek-r1 incentivizes reasoning in LLMs through reinforcement learning.Nature, 645(8081):633–638, sep 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in LLMs through reinforcement learning.Nature, 645(8081):633–638, sep 2025

2025

[13] [13]

REALM: Retrieval-augmented language model pre-training

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval-augmented language model pre-training. InProceedings of the 37th International Conference on Machine Learning (ICML), pages 3929–3938, 2020

2020

[14] [14]

Retrieving, rethinking and revising: The chain-of-verification can improve retrieval augmented generation

Bolei He, Nuo Chen, Xinran He, Lingyong Yan, Zhenkai Wei, Jinchang Luo, and Zhen-Hua Ling. Retrieving, rethinking and revising: The chain-of-verification can improve retrieval augmented generation. InConf. Empir. Methods Nat. Lang. Process., pages 10371–10393, 2024

2024

[15] [15]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[16] [16]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Za- mani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning.arXiv preprint arXiv:2503.09516, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

Tug-of-war between knowledge: Exploring and resolving knowledge conflicts in retrieval-augmented language models

Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Xiaojian Jiang, Jiexin Xu, Li Qiuxia, and Jun Zhao. Tug-of-war between knowledge: Exploring and resolving knowledge conflicts in retrieval-augmented language models. InInt. Conf. Comput. Linguist., Lang. Resour. Eval., pages 16867–16878, 2024

2024

[18] [18]

Reopold: Reward-based on-policy distillation with mixture-based reward clipping.arXiv preprint arXiv:2603.11137, 2026

Jongwoo Ko, Sungmin Park, and Joohyung Kim. Reopold: Reward-based on-policy distillation with mixture-based reward clipping.arXiv preprint arXiv:2603.11137, 2026

work page arXiv 2026

[19] [19]

Retrieval-augmented generation for knowledge-intensive NLP tasks

Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. InAd- vances in Neural Information Processing Systems (NeurIPS), volume 33, pages 9459–9474, 2020

2020

[20] [20]

Knowledge-aware fine-tuning for robust retrieval- augmented generation.arXiv preprint arXiv:2407.12854, 2024

Xiaoyu Li, Hao Zhang, and Zhiyuan Wang. Knowledge-aware fine-tuning for robust retrieval- augmented generation.arXiv preprint arXiv:2407.12854, 2024

work page arXiv 2024

[21] [21]

Truthfulqa: Measuring how models mimic human falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. InProceedings of the 60th Annual Meeting of the Association for Compu- tational Linguistics (ACL), pages 3214–3252, 2022. 16

2022

[22] [22]

Knowledgeable-r1: Reinforcement learn- ing for knowledge-conflict resolution in rag.arXiv preprint arXiv:2503.12345, 2025

Zhen Lin, Yifei Wang, Hao Chen, and Zhiyuan Liu. Knowledgeable-r1: Reinforcement learn- ing for knowledge-conflict resolution in rag.arXiv preprint arXiv:2503.12345, 2025

work page arXiv 2025

[23] [23]

Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

Haoran Luo et al. Stable on-policy distillation: Mitigating length inflation in llm training. arXiv preprint arXiv:2604.08527, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[24] [24]

Privileged Information Distillation for Language Models

Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, and Massimo Caccia. Privileged information distillation for language models.arXiv preprint arXiv:2602.04942, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[25] [25]

Qwen2.5 technical report, 2025

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

2025

[26] [26]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[27] [27]

Trusting your evidence: Hallucinate less with context-aware decoding

Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, and Scott Wen- tau Yih. Trusting your evidence: Hallucinate less with context-aware decoding. InProc. Conf. North American Chapter Assoc. Comput. Linguist., pages 783–800, 2024

2024

[28] [28]

Conflictbank: a benchmark for evaluating knowledge conflicts in large language models

Zhaochen Su, Jun Zhang, Xiaoye Qu, Tong Zhu, Yanshu Li, Jiashuo Sun, Juntao Li, Min Zhang, and Yu Cheng. Conflictbank: a benchmark for evaluating knowledge conflicts in large language models. InAdv. Neural Inform. Process. Syst., pages 103242–103268, 2024

2024

[29] [29]

Branch- train-mix: Mixing expert llms into a mixture-of-experts llm.arXiv preprint arXiv:2403.07816, 2024

Sainbayar Sukhbaatar, Naman Goyal, Gabriel Synnaeve, and Guillaume Lample. Branch- train-mix: Mixing expert llms into a mixture-of-experts llm.arXiv preprint arXiv:2403.07816, 2024

work page arXiv 2024

[30] [30]

Knowledge fusion of large language models

Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, and Shuming Shi. Knowledge fusion of large language models. InInt. Conf. Learn. Represent., 2024

2024

[31] [31]

Astute rag: Over- coming imperfect retrieval augmentation and knowledge conflicts for large language models

Fei Wang, Xingchen Wan, Ruoxi Sun, Jiefeng Chen, and Sercan Ö Arık. Astute rag: Over- coming imperfect retrieval augmentation and knowledge conflicts for large language models. Proc. Annu. Meet. Assoc. Comput Linguist., 2025

2025

[32] [32]

Retrieval-augmented gen- eration with conflicting evidence

Han Wang, Archiki Prasad, Elias Stengel-Eskin, and Mohit Bansal. Retrieval-augmented gen- eration with conflicting evidence. InConference on Language Modeling, 2025

2025

[33] [33]

Resolving knowledge conflicts in large language models.Conference on Language Modeling, 2024

Yike Wang, Shangbin Feng, Heng Wang, Weijia Shi, Vidhisha Balachandran, Tianxing He, and Yulia Tsvetkov. Resolving knowledge conflicts in large language models.Conference on Language Modeling, 2024

2024

[34] [34]

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

Chengyue Wu et al. Lightning on-policy distillation: Teacher consistency is all you need. arXiv preprint arXiv:2604.13010, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[35] [35]

Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts

Jian Xie, Kai Zhang, Jiangjie Chen, Renze Lou, and Yu Su. Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts. InInt. Conf. Learn. Represent., 2024

2024

[36] [36]

Info-rag: Information-filtered on-policy retrieval-augmented generation.arXiv preprint arXiv:2406.19009, 2024

Chenliang Xu, Jiaxin Guo, Yiwei Wang, and Shenghua Liu. Info-rag: Information-filtered on-policy retrieval-augmented generation.arXiv preprint arXiv:2406.19009, 2024

work page arXiv 2024

[37] [37]

Knowledge conflicts for llms: A survey

Rongwu Xu, Zehan Qi, Zhijiang Guo, Cunxiang Wang, Hongru Wang, Yue Zhang, and Wei Xu. Knowledge conflicts for llms: A survey. InConf. Empir. Methods Nat. Lang. Process., pages 8541–8565, 2024

2024

[38] [38]

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

Zichun Yang et al. G-opd: Generalized on-policy distillation as dense kl-constrained reinforce- ment learning.arXiv preprint arXiv:2602.12125, 2026. 17

work page internal anchor Pith review Pith/arXiv arXiv 2026

[39] [39]

On-Policy Context Distillation for Language Models

Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models.arXiv preprint arXiv:2602.12275, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[40] [40]

Confidence-aware multi-teacher knowledge dis- tillation.arXiv preprint arXiv:2201.00007, 2022

Hailin Zhang, Defang Chen, and Can Wang. Confidence-aware multi-teacher knowledge dis- tillation.arXiv preprint arXiv:2201.00007, 2022

work page arXiv 2022

[41] [41]

Dynamic contrastive decoding for knowledge conflict resolution in large language models.arXiv preprint arXiv:2405.13183, 2024

Xueying Zhang, Yanqiu Chen, and Yongkang Li. Dynamic contrastive decoding for knowledge conflict resolution in large language models.arXiv preprint arXiv:2405.13183, 2024. 18

work page arXiv 2024