pith. machine review for the scientific record.

arxiv: 2604.02585 · v2 · submitted 2026-04-02 · 💻 cs.AI · cs.CL

Recognition: no theorem link

Mitigating LLM biases toward spurious social contexts using direct preference optimization

Dorottya Demszky, Hyunji Nam

Pith reviewed 2026-05-13 20:29 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords LLM debiasing · direct preference optimization · spurious contexts · classroom transcript evaluation · predictive accuracy · model robustness · self-supervised preference learning

The pith

Debiasing-DPO pairs neutral reasoning with biased reasoning to cut LLM sensitivity to spurious social contexts by 84 percent while raising predictive accuracy by 52 percent on average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that large language models shift their predictions substantially when irrelevant social details such as teacher experience or demographics are added to classroom transcript evaluations. Standard prompting and ordinary direct preference optimization fail to correct this sensitivity. The authors introduce Debiasing-DPO, a self-supervised procedure that generates a neutral reasoning trace from the query alone and contrasts it against the model's own biased trace produced when the spurious context is present. Training on these pairs, combined with supervised fine-tuning on ground-truth scores, produces models that ignore the irrelevant context. The gains appear consistently across Llama and Qwen models of 3B and 7-8B scale on the NCTE classroom dataset.
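Read through the lens of standard DPO (Rafailov et al., 2023), the procedure amounts to the usual preference loss with the neutral trace preferred over the biased trace, plus a supervised term on ground-truth scores. The sketch below is a generic rendering, not the paper's exact objective; in particular the weighting $\lambda$ is a placeholder:

```latex
\mathcal{L}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
  + \lambda\, \mathcal{L}_{\mathrm{SFT}}(\theta)
```

where $x$ is the context-augmented prompt, $y_w$ is the neutral reasoning trace, $y_l$ is the model's own context-swayed trace, $\pi_{\mathrm{ref}}$ is the frozen reference model, and $\mathcal{L}_{\mathrm{SFT}}$ is cross-entropy against expert rubric scores.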

Core claim

Debiasing-DPO reduces bias by 84% and improves predictive accuracy by 52% on average when applied to Llama 3B and 8B and Qwen 3B and 7B Instruct models evaluated on U.S. classroom transcripts paired with expert rubric scores.

What carries the argument

Debiasing-DPO, a direct preference optimization objective that treats neutral reasoning generated from the query alone as the preferred response and the model's own reasoning generated with added spurious context as the dispreferred response.
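The pairing step can be sketched in a few lines. Everything here (function names, prompt templates, the stubbed model call) is illustrative, not the authors' code; the record layout follows the prompt/chosen/rejected convention common to DPO training libraries:

```python
# Sketch of the self-supervised preference-pair construction described above.
# `generate` stands in for an LLM sampling call; it is deterministic here so
# the sketch is runnable, and the spurious-context template is hypothetical.

def generate(prompt: str) -> str:
    """Stub for an LLM call, biased by the presence of spurious context."""
    if "National Board-certified" in prompt:
        return "The teacher is certified, so instruction is likely strong. Score: 6"
    return "The dialogue shows frequent uptake of student ideas. Score: 4"

def build_preference_pair(transcript: str, rubric_query: str, spurious_context: str) -> dict:
    # Neutral branch: rubric query and transcript only.
    neutral_prompt = f"{rubric_query}\n\nTranscript:\n{transcript}"
    neutral_trace = generate(neutral_prompt)
    # Biased branch: same query with the irrelevant social context prepended.
    biased_prompt = f"{spurious_context}\n\n{neutral_prompt}"
    biased_trace = generate(biased_prompt)
    # DPO-style record: the context-laden prompt is shared; the neutral trace
    # is preferred, the model's own context-swayed trace is dispreferred.
    return {"prompt": biased_prompt, "chosen": neutral_trace, "rejected": biased_trace}

pair = build_preference_pair(
    transcript="T: What do you notice about the pattern? S: It doubles each time.",
    rubric_query="Rate the quality of instruction on a 7-point scale.",
    spurious_context="Note: the teacher is National Board-certified.",
)
```

Because both traces come from the model itself, no human preference labels are needed; the spurious context is the only variable that differs between the two branches.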

If this is right

  • Model predictions on teacher evaluation tasks become stable even when irrelevant details about experience, education, or identity are supplied.
  • The same training procedure raises alignment with expert rubric scores rather than trading robustness for accuracy.
  • The gains hold for both 3B and 7-8B scale models from two different families.
  • Standard prompt engineering and unmodified DPO remain insufficient to achieve comparable robustness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The pairing of self-generated neutral and biased reasoning traces may serve as a general template for preference optimization on other forms of context sensitivity.
  • If the method scales to larger models, it could reduce the need for costly human preference data when correcting spurious-context biases.
  • The result implies that explicit contrastive reasoning supervision can substitute for further model scaling in robustness tasks.

Load-bearing premise

Generating a neutral reasoning trace from the query alone supplies a trustworthy positive example that, when contrasted against the model's biased trace, trains the model to ignore spurious context without creating new distortions or accuracy losses.

What would settle it

Applying Debiasing-DPO to a new collection of spurious contexts or a different high-stakes prediction task and observing either no reduction in prediction shift or a decline in accuracy relative to the base model.

Figures

Figures reproduced from arXiv: 2604.02585 by Dorottya Demszky, Hyunji Nam.

Figure 1
Figure 1. Given a document input X and a query, the model outputs an evaluation of the quality of X but is undesirably swayed by spurious social context, such as the teacher’s level of certification. While a teacher’s certification may affect their instructional quality, given the same transcript, the model’s prediction should not change based on whether or not the teacher has a prestigious certification. view at source ↗
Figure 2
Figure 2. Both baseline implementations of DPO only focus on debiasing. In contrast, … view at source ↗
Figure 3
Figure 3. Score distribution across the entire NCTE transcripts (including both training … view at source ↗
Figure 4
Figure 4. Llama-8B-Instruct training curves. Left: Debiasing DPO (continues improving … view at source ↗
read the original abstract

LLMs are increasingly used for high-stakes decision-making, yet their sensitivity to spurious contextual information can introduce harmful biases. This is a critical concern when models are deployed for tasks like evaluating teachers' instructional quality, where biased assessment can affect teachers' professional development and career trajectories. We investigate model robustness to spurious social contexts using the largest publicly available dataset of U.S. classroom transcripts (NCTE) paired with expert rubric scores. Evaluating seven frontier and open-weight models across seven categories of spurious contexts (including teacher experience, education level, demographic identity, and sycophancy-inducing framings), we find that irrelevant contextual information can shift model predictions by up to 1.48 points on a 7-point scale, with larger models sometimes exhibiting greater sensitivity despite higher predictive accuracy. Mitigations using prompts and standard direct preference optimization (DPO) prove largely insufficient. We propose **Debiasing-DPO**,, a self-supervised training method that pairs neutral reasoning generated from the query alone, with the model's biased reasoning generated with both the query and additional spurious context. We further combine this objective with supervised fine-tuning on ground-truth labels to prevent losses in predictive accuracy. Applied to Llama 3B & 8B and Qwen 3B & 7B Instruct models, Debiasing-DPO reduces bias by 84% and improves predictive accuracy by 52% on average. Our findings from the educational case study highlight that robustness to spurious context is not a natural byproduct of model scaling and that our proposed method can yield substantial gains in both accuracy and robustness for prompt-based prediction tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs are sensitive to spurious social contexts (e.g., teacher demographics, experience) when scoring classroom transcripts from the NCTE dataset, with shifts up to 1.48 points on a 7-point rubric. Standard prompting and DPO fail to mitigate this. The authors introduce Debiasing-DPO, which constructs preference pairs by contrasting the model's query-only reasoning (treated as neutral) against its reasoning when spurious context is appended, then combines this DPO objective with SFT on ground-truth labels. On Llama-3B/8B and Qwen-3B/7B Instruct models, the method is reported to reduce bias by 84% while improving predictive accuracy by 52% on average.

Significance. If the empirical claims hold under proper controls, the work is significant for practical deployment of LLMs in high-stakes educational assessment, where biased scoring can affect careers. The self-supervised construction of preference pairs without extra human annotations is a practical strength, and the explicit combination with SFT to preserve accuracy addresses a common trade-off in debiasing methods. The finding that larger models can be more sensitive to spurious context also challenges scaling-based assumptions about robustness.

major comments (2)
  1. [§3] §3 (Debiasing-DPO definition): The central assumption that query-only generations are verifiably neutral is load-bearing for the 84% bias-reduction claim, yet the manuscript provides no quantitative validation (e.g., measuring the same bias metric on query-only outputs versus spurious-context outputs on a held-out set). If pretraining biases remain in the 'neutral' branch, the DPO objective optimizes toward flawed pairs.
  2. [§4.3] §4.3 (Results): The reported 84% bias reduction and 52% accuracy improvement are given only as averages across four models; no per-model tables, standard deviations, or statistical tests (e.g., paired t-tests or bootstrap CIs) are shown, and no ablation isolating the contribution of the neutral-vs-biased pairing versus the added SFT step is provided. This makes it impossible to assess whether the gains are robust or attributable to the proposed method.
minor comments (2)
  1. [Abstract] Abstract: Typo with double comma after 'Debiasing-DPO,'.
  2. [§4.1] Notation: The bias metric (shift on 7-point scale converted to percentage reduction) should be defined explicitly with an equation in §3 or §4.1 rather than left implicit.
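One plausible formalization of the metric the referee asks for, assuming "bias" means the mean absolute prediction shift between plain and context-augmented prompts; the paper leaves the definition implicit, so this is a reading, not the authors' equation, and the scores below are synthetic:

```python
# Hypothetical shift-based bias metric on a 7-point rubric scale.

def mean_abs_shift(scores_plain, scores_context):
    """Mean |s_context - s_plain| over matched transcripts."""
    assert len(scores_plain) == len(scores_context)
    return sum(abs(c - p) for p, c in zip(scores_plain, scores_context)) / len(scores_plain)

def bias_reduction_pct(shift_before, shift_after):
    """Percent reduction in the shift after training."""
    return 100.0 * (shift_before - shift_after) / shift_before

# Synthetic scores for four transcripts, with and without spurious context.
before = mean_abs_shift([4, 5, 3, 6], [5, 6, 4, 7])         # 1.0 point average shift
after = mean_abs_shift([4, 5, 3, 6], [4.2, 5.1, 3.2, 6.1])  # ~0.15 point average shift
reduction = bias_reduction_pct(before, after)                # ~85% reduction
```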

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The points raised highlight important aspects of our method's assumptions and the need for more granular reporting of results. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: [§3] §3 (Debiasing-DPO definition): The central assumption that query-only generations are verifiably neutral is load-bearing for the 84% bias-reduction claim, yet the manuscript provides no quantitative validation (e.g., measuring the same bias metric on query-only outputs versus spurious-context outputs on a held-out set). If pretraining biases remain in the 'neutral' branch, the DPO objective optimizes toward flawed pairs.

    Authors: We agree that explicit validation of the query-only branch as a neutral reference is necessary to support the bias-reduction results. In the revised manuscript we will add a new subsection that applies the same bias metric to query-only outputs versus spurious-context outputs on a held-out portion of the NCTE data. This will quantify the degree to which query-only generations already exhibit lower sensitivity to the spurious contexts. We will also include a brief discussion of the possibility of residual pretraining biases and how the combined DPO+SFT objective mitigates their impact on downstream predictions. revision: yes

  2. Referee: [§4.3] §4.3 (Results): The reported 84% bias reduction and 52% accuracy improvement are given only as averages across four models; no per-model tables, standard deviations, or statistical tests (e.g., paired t-tests or bootstrap CIs) are shown, and no ablation isolating the contribution of the neutral-vs-biased pairing versus the added SFT step is provided. This makes it impossible to assess whether the gains are robust or attributable to the proposed method.

    Authors: We acknowledge that aggregate averages alone limit interpretability. The revised version will include a new table reporting per-model bias reduction and accuracy change for all four models (Llama-3B/8B, Qwen-3B/7B), together with standard deviations computed over three independent training runs. We will add paired t-tests (or bootstrap confidence intervals) comparing Debiasing-DPO against the standard DPO and prompt baselines. Finally, we will insert an ablation study that isolates the neutral-vs-biased preference pairing from the SFT component, reporting the incremental contribution of each term to both bias reduction and accuracy preservation. revision: yes
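A percentile bootstrap of the kind proposed is straightforward to sketch; the per-transcript differences below (standard-DPO shift minus Debiasing-DPO shift) are synthetic, not the paper's data:

```python
import random

def bootstrap_ci(paired_diffs, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired per-item differences."""
    rng = random.Random(seed)
    n = len(paired_diffs)
    means = sorted(
        sum(rng.choice(paired_diffs) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    low = means[int(n_boot * alpha / 2)]
    high = means[int(n_boot * (1 - alpha / 2)) - 1]
    return low, high

# Synthetic per-transcript shift reductions; all positive, so the interval
# should exclude zero, supporting a real (not sampling-noise) improvement.
diffs = [0.9, 1.1, 0.7, 1.3, 0.8, 1.0, 1.2, 0.6]
low, high = bootstrap_ci(diffs)
```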

Circularity Check

0 steps flagged

No circularity: empirical measurements independent of method definition

full rationale

The paper defines Debiasing-DPO explicitly as a self-supervised pairing of query-only generations (treated as neutral) against context-augmented generations, followed by DPO plus SFT on ground-truth labels. The headline results (84% bias reduction, 52% accuracy gain) are reported as post-training measurements on the NCTE dataset across models, not quantities that algebraically reduce to the training objective or to any fitted parameter by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the core pairing or to derive the performance numbers. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the NCTE dataset with expert rubric scores serves as a faithful proxy for real-world spurious context effects and that the self-supervised pairing captures the relevant bias mechanism without side effects.

axioms (1)
  • domain assumption The NCTE dataset paired with expert rubric scores is a valid benchmark for measuring sensitivity to spurious social contexts.
    The evaluation and claims depend on this dataset representing the target problem.

pith-pipeline@v0.9.0 · 5589 in / 1259 out tokens · 60378 ms · 2026-05-13T20:29:19.628286+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Medical Model Synthesis Architectures: A Case Study

    cs.AI 2026-05 unverdicted novelty 5.0

    MedMSA framework retrieves knowledge via language models then builds formal probabilistic models to produce uncertainty-weighted differential diagnoses from symptoms.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 1 Pith paper · 6 internal anchors
