DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training
Pith reviewed 2026-05-10 07:14 UTC · model grok-4.3
The pith
DART training lets LLMs recognize when demographic differences matter in answers while cutting new harmful outputs by 72.6 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fine-tuning for difference-awareness accuracy triggers harm drift: model explanations become more harmful even as classification decisions improve. DART prevents this by distilling label-conditioned reasoning from a teacher model, auditing generated outputs for new harm relative to the baseline, and repairing flagged cases through severity-weighted fine-tuning, thereby raising accuracy and lowering harm on both closed benchmarks and open-ended domain queries.
What carries the argument
DART (Distill-Audit-Repair Training) pipeline that distills reasoning, audits for harm drift relative to baseline, and applies severity-weighted repair.
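To make the mechanism concrete, here is a minimal sketch of the three stages as this page describes them. The callables (teacher_rationale, harm_score, finetune, the severity weighting) are illustrative stand-ins, not the authors' released code.

```python
# Minimal sketch of the three DART stages as described above. Every interface
# here is an assumption for illustration, not the paper's implementation.
from typing import Callable, List, Tuple

def dart(
    train_set: List[Tuple[str, str]],                 # (prompt, gold label "YES"/"NO")
    eval_set: List[Tuple[str, str]],
    teacher_rationale: Callable[[str, str], str],     # label-conditioned teacher reasoning
    student_rationale: Callable[[str], str],          # current student generation
    baseline_rationale: Callable[[str], str],         # frozen pre-fine-tuning model
    harm_score: Callable[[str], float],               # e.g. a toxicity classifier in [0, 1]
    finetune: Callable[[List[Tuple[str, str, float]]], None],  # (prompt, target, weight)
    tau: float = 0.0,
) -> List[Tuple[str, str, float]]:
    # Stage I (Distill): supervise the student with label-conditioned teacher rationales.
    finetune([(x, teacher_rationale(x, y), 1.0) for x, y in train_set])

    # Stage II (Audit): a case drifts when the tuned model's rationale scores as more
    # harmful than the baseline's rationale for the same prompt.
    drift = []
    for x, y in eval_set:
        delta = harm_score(student_rationale(x)) - harm_score(baseline_rationale(x))
        if delta > tau:
            drift.append((x, y, delta))

    # Stage III (Repair): fine-tune only on flagged cases, up-weighted by severity
    # so the worst regressions dominate the repair objective.
    finetune([(x, teacher_rationale(x, y), 1.0 + delta) for x, y, delta in drift])
    return drift
```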
If this is right
- Accuracy on prompts that require equal treatment rises from 11.3 percent to 72.6 percent.
- Harm-drift cases fall by 72.6 percent across the eight benchmarks.
- Difference-appropriate responses on 280 real-world queries rise from 39.8 percent to 77.5 percent.
- Unnecessary refusals on those queries drop from 34.3 percent to 3.0 percent.
Where Pith is reading between the lines
- The same distill-audit-repair loop could be applied to other safety-accuracy tensions such as factual correction versus over-refusal.
- Scaling the audit step to larger models or additional domains would test whether the harm reduction generalizes beyond the tested medical, legal, policy, and education queries.
- If the repair step proves stable, the method offers a concrete route to difference-aware models that still satisfy regulatory safety requirements.
Load-bearing premise
The harm audit must catch every new harmful statement created during fine-tuning, and the severity-weighted repair must not introduce fresh undetected harms or hurt performance on cases never audited.
What would settle it
After DART training, run the model on the same prompts and find either new harmful content the audit missed or a drop in accuracy on a held-out set of non-audited prompts.
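A rough sketch of that settling test, assuming access to an independent harm scorer and a never-audited held-out set; all interfaces here are hypothetical rather than the authors' protocol.

```python
# Illustrative version of the settling test described above. The scorer, accuracy
# function, and data splits are assumed interfaces, not artifacts from the paper.
from typing import Callable, Dict, List

def settle_it(
    model: Callable[[str], str],                       # DART-trained model
    audited_prompts: List[str],                        # prompts the audit/repair loop saw
    held_out: List[dict],                              # never-audited prompts with gold labels
    independent_harm_score: Callable[[str], float],    # external scorer, not the audit's own
    baseline_harm: Dict[str, float],                   # per-prompt harm of the pre-DART model
    baseline_accuracy: float,                          # pre-DART accuracy on the held-out set
    accuracy: Callable[[Callable[[str], str], List[dict]], float],
    harm_margin: float = 0.0,
) -> dict:
    # (1) Harms the audit missed: outputs an external scorer still rates as more
    #     harmful than the baseline's output on the same prompt.
    missed = [p for p in audited_prompts
              if independent_harm_score(model(p)) > baseline_harm[p] + harm_margin]
    # (2) Collateral damage: accuracy drop on prompts the repair never touched.
    return {"missed_harms": missed,
            "accuracy_drop": baseline_accuracy - accuracy(model, held_out)}
```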
read the original abstract
Large language models (LLMs) tuned for safety often avoid acknowledging demographic differences, even when such acknowledgment is factually correct (e.g., ancestry-based disease incidence) or contextually justified (e.g., religious hiring preferences). This identity-blindness yields incorrect responses, unnecessary refusals, or generic "equal-treatment" defaults. We study this via difference-awareness classification: given a question involving demographic groups, the task is not to answer directly, but to classify whether a correct answer requires recognizing group differences (yes) or whether groups should be treated identically (no). Crucially, fine-tuning for accuracy triggers harm drift: model-generated explanations become increasingly harmful as decision accuracy improves, whether by elaborating harmful content, introducing problematic assumptions, or failing to flag harms the baseline identified. To mitigate this, we introduce DART (Distill-Audit-Repair Training), which distills label-conditioned reasoning from a teacher, audits outputs for harm drift cases relative to baseline, and repairs problematic cases via severity-weighted fine-tuning. On eight benchmarks, DART improves Llama-3-8B-Instruct accuracy from 39.0% to 68.8%, with largest gains on equal-treatment prompts (11.3% -> 72.6%), while reducing harm drift cases by 72.6%. It also transfers to 280 open-ended real-world queries across medical, legal, policy, and educational domains, improving difference-appropriate responses from 39.8% to 77.5% while reducing refusals from 34.3% to 3.0%. Our results demonstrate that accuracy and safety need not conflict when explicit detection and repair mechanisms are in place.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DART (Distill-Audit-Repair Training) to mitigate harm drift in LLMs during fine-tuning for difference-awareness classification tasks involving demographic groups. It claims that accuracy-oriented fine-tuning makes model explanations increasingly harmful (through elaborations, problematic assumptions, or missed flags), and that DART, by distilling label-conditioned reasoning, auditing outputs relative to the baseline, and applying severity-weighted repair, improves accuracy from 39.0% to 68.8% on eight benchmarks (with the largest gains on equal-treatment prompts), reduces harm-drift cases by 72.6%, and transfers to 280 open-ended real-world queries, improving difference-appropriate responses (39.8% to 77.5%) and reducing refusals (34.3% to 3.0%).
Significance. If the central results hold under rigorous validation, the work is significant for highlighting and addressing a concrete tension between factual accuracy and safety in LLM fine-tuning on demographic topics. The empirical scale (eight benchmarks plus 280 transfer queries across medical/legal/policy/educational domains) and the explicit audit-repair mechanism provide a practical template that could influence safety tuning pipelines. The paper ships clear numeric gains and a falsifiable operationalization of harm drift, which are strengths.
major comments (3)
- [§3.2] §3.2 (Audit component): the description of harm drift detection provides no concrete criteria for flagging harmful content, no inter-annotator agreement statistics, and no indication of whether auditing is human, LLM-based, or hybrid. This is load-bearing for the 72.6% harm-drift reduction and the transfer gains, as false negatives would render both claims untrustworthy.
- [§4.3] §4.3 and Table 3: no ablation isolates the severity-weighted repair step, and repaired outputs are not re-audited against an external standard. Without this, it remains possible that repair either creates new undetected harms or degrades performance on non-audited cases, directly undermining the safety claims.
- [§5.1] §5.1 (benchmark results): the reported accuracy lift from 39.0% to 68.8% and the equal-treatment prompt gains lack controls for prompt distribution shifts between the baseline and DART training data. This weakens the causal attribution of gains to the DART pipeline rather than data artifacts.
minor comments (2)
- [Abstract] Abstract: the term 'harm drift' is used without a one-sentence operational definition, which reduces accessibility for readers outside the immediate subfield.
- [Figure 1] Figure 1: the DART pipeline diagram would benefit from explicit arrows or labels distinguishing the audit stage from the repair stage.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments highlight important areas for improving clarity, validation, and causal attribution in the manuscript. We address each major comment below and will incorporate revisions to strengthen the paper.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Audit component): the description of harm drift detection provides no concrete criteria for flagging harmful content, no inter-annotator agreement statistics, and no indication of whether auditing is human, LLM-based, or hybrid. This is load-bearing for the 72.6% harm-drift reduction and the transfer gains, as false negatives would render both claims untrustworthy.
Authors: We agree that the audit component in §3.2 requires substantially more detail to support the harm-drift reduction claims. In the revised manuscript we will expand this section to specify: (1) concrete flagging criteria with examples of the three harm-drift categories (elaborations, problematic assumptions, and missed flags); (2) that auditing is a hybrid process (LLM-based initial screening followed by human review on flagged cases); and (3) inter-annotator agreement statistics (Fleiss’ kappa) along with the annotation guidelines as supplementary material. These additions will directly address concerns about false negatives and reproducibility. revision: yes
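For readers unfamiliar with the agreement statistic the authors promise to report, a minimal Fleiss' kappa computation looks like this. The 0/1 ratings are made-up placeholders (1 = "harm drift present"), not annotations from the paper.

```python
# Minimal Fleiss' kappa computation of the kind the authors say they will report.
# The ratings below are invented placeholders shown only to illustrate the call.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([       # rows = audited cases, columns = annotators
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
])
table, _ = aggregate_raters(ratings)   # per-case counts for each rating category
print(f"Fleiss' kappa = {fleiss_kappa(table):.3f}")
```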
-
Referee: [§4.3] §4.3 and Table 3: no ablation isolates the severity-weighted repair step, and repaired outputs are not re-audited against an external standard. Without this, it remains possible that repair either creates new undetected harms or degrades performance on non-audited cases, directly undermining the safety claims.
Authors: We acknowledge the absence of an isolated ablation for the severity-weighted repair step and the lack of re-auditing. In the revision we will add a new ablation row to Table 3 comparing full DART against a variant without the repair component. We will also re-audit a random sample of 50 repaired outputs using both an external LLM judge and human annotators, reporting any new harms introduced and confirming that performance on non-audited cases remains stable. These changes will provide direct evidence supporting the safety claims. revision: yes
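One plausible realization of "severity-weighted fine-tuning" is a per-example weight on the repair loss, derived from each drift case's audited severity tier. The tier names follow the paper's severity stratification, but the weight values and the loss shape below are illustrative assumptions.

```python
# Sketch of a severity-weighted repair loss: each repaired example's cross-entropy
# is scaled by a weight tied to its audited severity tier. Weights are assumed.
from typing import List
import torch
import torch.nn.functional as F

SEVERITY_WEIGHT = {"mild": 1.0, "moderate": 1.5, "severe": 2.0, "extreme": 3.0}

def severity_weighted_loss(logits: torch.Tensor,     # (batch, seq, vocab)
                           targets: torch.Tensor,    # (batch, seq), -100 = ignored
                           severities: List[str],    # one tier per example
                           ignore_index: int = -100) -> torch.Tensor:
    per_token = F.cross_entropy(logits.transpose(1, 2), targets,
                                ignore_index=ignore_index, reduction="none")
    mask = (targets != ignore_index).float()
    per_example = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    weights = torch.tensor([SEVERITY_WEIGHT[s] for s in severities],
                           device=logits.device)
    return (weights * per_example).mean()
```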
-
Referee: [§5.1] §5.1 (benchmark results): the reported accuracy lift from 39.0% to 68.8% and the equal-treatment prompt gains lack controls for prompt distribution shifts between the baseline and DART training data. This weakens the causal attribution of gains to the DART pipeline rather than data artifacts.
Authors: We appreciate this observation on potential distribution shifts. In the revised §5.1 we will explicitly document that the DART training prompts are drawn from the same underlying distributions as the eight evaluation benchmarks, with no additional curation that would introduce shifts. We will further add a control experiment in which the baseline model is fine-tuned on identical data using standard accuracy-only objectives (without distill-audit-repair), allowing direct isolation of the DART pipeline’s contribution to the accuracy and equal-treatment gains. revision: yes
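In data terms, the proposed control amounts to building supervision targets from bare gold labels rather than teacher rationales, on identical prompts. A sketch, with the "Conclusion: YES/NO" suffix mirroring the paper's answer format and everything else assumed.

```python
# Data-level sketch of the accuracy-only control versus DART Stage I supervision.
# Helper names are hypothetical; only the "Conclusion: <label>" suffix follows
# the answer format the paper's policies require.
from typing import Callable, Dict, List, Optional, Tuple

def build_sft_data(
    examples: List[Tuple[str, str]],                    # (prompt, gold label "YES"/"NO")
    teacher_rationale: Optional[Callable[[str, str], str]] = None,
    use_rationales: bool = False,
) -> List[Dict[str, str]]:
    data = []
    for prompt, gold in examples:
        if use_rationales:                               # DART-style rationale target
            target = f"{teacher_rationale(prompt, gold)}\nConclusion: {gold}"
        else:                                            # accuracy-only control target
            target = f"Conclusion: {gold}"
        data.append({"prompt": prompt, "target": target})
    return data
```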
Circularity Check
No circularity: empirical method with independent benchmark validation
full rationale
The paper describes an empirical training procedure (distill-audit-repair) and reports measured accuracy and harm-drift reductions on eight benchmarks plus 280 open-ended queries. No derivation chain, equations, or self-referential definitions appear; results are presented as direct experimental outcomes rather than predictions forced by fitted parameters or prior self-citations. The central claims rest on observable performance deltas, which remain falsifiable against external test sets and do not reduce to the inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Labeled data for difference-awareness classification is accurate and sufficient to train a teacher model.
invented entities (1)
- harm drift (no independent evidence)