Shortcuts in the Tail: Debiasing via Post-Hoc Spectral Compression of Fine-Tuning Updates

Dmitrii Troitskii; Edward Sun

arxiv: 2606.07596 · v1 · pith:ZHRLTIFWnew · submitted 2026-05-29 · 💻 cs.LG

Shortcuts in the Tail: Debiasing via Post-Hoc Spectral Compression of Fine-Tuning Updates

Edward Sun , Dmitrii Troitskii This is my paper

Pith reviewed 2026-06-28 23:28 UTC · model grok-4.3

classification 💻 cs.LG

keywords fine-tuningspurious correlationsdebiasingsingular value decompositionweight updatesshortcut learningpost-hoc interventiongroup fairness

0 comments

The pith

Truncating the tail of the SVD of the fine-tuning weight update reduces spurious group gaps while preserving accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a simple post-hoc step on the difference between fine-tuned and base model weights can reduce reliance on spurious correlations. By keeping only the largest singular components of that difference matrix and discarding the tail, the gap between groups shrinks on several benchmarks with almost no drop in overall performance. This approach requires no retraining, no group labels, and no extra data, unlike most existing debiasing techniques. The authors propose that the shortcut behavior is encoded in the lower-magnitude directions of the update, which explains why removing them works better than removing random or head components. Tests across model sizes from 0.5B to 7B and four classification tasks support the pattern, with the strongest effect on the CivilComments dataset.

Core claim

Fine-tuning introduces spurious correlations in the tail of the singular spectrum of the weight update matrix ΔW. Truncating this tail after fine-tuning reduces the performance gap on underrepresented groups while keeping task accuracy nearly the same. This holds across three instruction-tuned models and four benchmarks, and is not explained by generic rank reduction because bottom-k and random truncation do not produce the same benefit. A boundary case where the model learns only the shortcut shows the expected collapse when the tail is kept.

What carries the argument

the tail of the singular value decomposition of the fine-tuning update matrix ΔW = W_ft - W_base

If this is right

Debiasing can be performed after training without access to training details or group annotations.
The singular basis of the update separates useful task information from shortcut information.
Top-k truncation succeeds where random or bottom-k truncation fails, indicating non-uniform distribution of shortcuts.
The method works on instruction-tuned models of different sizes with minimal accuracy trade-off.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the singular ordering consistently isolates shortcuts, similar compression could be applied to other adaptation techniques like LoRA.
This coordinate system might help analyze what different fine-tuning runs have actually learned beyond the immediate task.
The approach could extend to regression or generation tasks where group disparities appear.

Load-bearing premise

The shortcut response is concentrated in the tail of the singular spectrum of the weight update rather than distributed evenly or located in the head.

What would settle it

Observing that random-k or bottom-k truncation of the SVD reduces the spurious gap by a comparable amount to top-k truncation would falsify the claim that the shortcut sits specifically in the tail.

Figures

Figures reproduced from arXiv: 2606.07596 by Dmitrii Troitskii, Edward Sun.

**Figure 1.** Figure 1: Post-hoc spectral compression of fine-tuning updates. For each weight matrix, compute ∆W = Wft − Wbase, take its SVD ∆W = UΣV ⊤, keep only the top k singular values, and reconstruct Wf = Wbase + U:,:kΣ:k,:kV ⊤ :,:k. No retraining, data, or group labels; debiasing comes from which singular directions are kept. out generic low-rank approximation and rank-constrained training. Contributions. (1) A label-free,… view at source ↗

**Figure 3.** Figure 3: Trajectory shape distinguishes the spectral picture from its boundary. Normalized gap (left) and accuracy (right) vs. retention r, QWEN-0.5B. CivilComments: gap and accuracy decouple. The gap drops through the sweet zone (green) to ∼ 0.3 ∆ft while accuracy stays flat at FT level (∼ 0.81). Spectral stratification predicts this: shortcut and task responses live in different parts of the singular basis, so re… view at source ↗

**Figure 2.** Figure 2: Bias-vs-accuracy trajectories, parametric in retention r. One panel per model. Each curve traces (accuracy loss, bias reduction) for one of CivilComments / MNLI / QQP / FEVER as r sweeps 90%→5%. Green band: no-cost zone (accuracy loss <2 pp); hollow rings mark each dataset’s sweet spot. The region to the left of the green band, where accuracy loss is negative, is also notable: as the model reverts toward a… view at source ↗

**Figure 5.** Figure 5: Top-k Pareto-dominates LoRA across model scales. Accuracy versus gap on CivilComments, one panel per model (0.5B, 1B, 7B). Blue: post-hoc top-k sweep, parametric in r. Red diamonds: LoRA rank sweep (marker size ∝ rank). Green: nocost zone (accuracy loss < 2 pp from FT). Rings mark the best in-zone point per method. On all three models the SVD ring sits at lower gap than the LoRA ring at matched accuracy. … view at source ↗

**Figure 6.** Figure 6: Per-(dataset, model) SVD top-k retention sweep. Both metrics are rescaled to the FT→base interval: acc f r = (accr − accft)/(accbase − accft) and ∆e r = (∆r − ∆ft)/(∆base − ∆ft), so 0 corresponds to FT and 1 to base on each axis. Bands: ±1σ over three seeds. Top four rows (CivilComments, MNLI, FEVER, QQP): decoupling, with blue (acc) staying near 0 while red (gap) rises toward 1, i.e. accuracy is preserved… view at source ↗

**Figure 7.** Figure 7: Top-k vs. bottom-k vs. random-k on three representative datasets crossed with three models. Y-axis: normalised gap ∆r/|∆ft|; dashed line at 1.0 marks the fine-tuned reference. Top-k approaches 0 smoothly; bottom-k and random-k either stay near 1.0 until accuracy collapses, or overshoot below 0 as the model reverts toward an unbiased base. 16 32 64 128 256 LoRA rank 0.04 0.06 0.08 0.10 Spurious gap (a) LoRA… view at source ↗

**Figure 8.** Figure 8: LoRA rank sweep on CivilComments. Dotted lines: per-model full-SFT reference. Low-rank LoRA points can fall below the SFT gap reference, but only because they also fall below the SFT accuracy reference (right panel): the gap is reduced by underfitting, not by selectively removing the shortcut. The matched-accuracy comparison in [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗

**Figure 9.** Figure 9: Per-layer subset compression at r=20%, real data, per-cell. Bars show |∆| when only the indicated layer subset is truncated; remaining layers retain the full-rank update. Dashed: fine-tuned |∆|; dotted: base |∆|. Some cells show clean MLP- or secondhalf-localised reduction (e.g. CivilComments on QWEN-0.5B); others show the relevant directions spread across the network (e.g. CivilComments on QWEN-7B). We d… view at source ↗

**Figure 10.** Figure 10: Real singular-value decay of ∆W for four representative MLP layers (mean ±1σ shaded band). Top row: σi/σmax on a log y-axis, all five datasets overlaid per model. Bottom row: percentage of singular components needed to capture 90% of the spectral energy. Spectra are similar across datasets within a model and are not sharply concentrated (90% of energy needs ∼73−78% of components). ∆W is therefore not appr… view at source ↗

read the original abstract

Fine-tuning often introduces spurious correlations alongside task knowledge, causing systematic failures on underrepresented groups. Existing mitigations require retraining, group labels, or curated counterfactual data. We show a simple post-hoc intervention reduces shortcut reliance without any of these: truncating the tail of the SVD of $\Delta W = W_\mathrm{ft} - W_\mathrm{base}$ reduces the spurious-group gap while preserving task accuracy. Across three instruction-tuned models ($0.5$B--$7$B) and four classification benchmarks, top-$k$ truncation reduces the gap on every cell at $<2$ pp accuracy loss, by up to $5\times$ on CivilComments. We propose this works because the shortcut response sits in the tail of the singular ordering of $\Delta W$, a claim about how truncation behaves rather than about the raw singular values, which are broadly distributed and look the same across all four datasets. A controlled boundary case in which fine-tuning has only a shortcut to learn shows the predicted FT-to-base collapse, and bottom-/random-$k$ and matched-rank LoRA controls rule out generic low-rank approximation and rank-constrained training as the explanation. We read this as preliminary evidence that the singular basis of $\Delta W$ is a useful coordinate system for studying what fine-tuning has learned.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's core observation is that truncating the tail of SVD(ΔW) cuts spurious gaps in fine-tuned models while the controls make a reasonable case that shortcuts concentrate in that tail rather than elsewhere.

read the letter

The thing to know is that truncating the lower part of the singular spectrum of the fine-tuning update reduces group gaps on several benchmarks with under 2 points accuracy drop. The authors test this on 0.5B to 7B instruction-tuned models across four classification tasks and show the effect holds in every reported cell, with the largest gain on CivilComments.

What is actually new is the post-hoc use of the singular basis of ΔW itself as the coordinate system, plus the specific claim that the shortcut directions sit in the tail rather than being spread out or in the head. The boundary-case experiment, where fine-tuning is forced to learn only the shortcut, produces the predicted concentration in the tail. The top-k versus bottom-k versus random-k comparisons, together with the matched-rank LoRA controls, rule out generic rank reduction as the explanation. Those controls are direct and address the most obvious alternative accounts.

The soft spot is that the method still rests on an empirical regularity rather than a derivation: we see that the ordering works, but we do not yet have a reason why shortcuts reliably land in the tail across datasets. The abstract notes the spectra look similar, so the ordering is doing the work, yet the paper does not supply a mechanistic account beyond the observation. k selection also appears to require some tuning even if the outcome is stable once chosen.

This is for people who fine-tune models and want a lightweight way to inspect or adjust what the update actually captured. The controls are solid enough that the central pattern is worth checking in more settings. I would send it to peer review; the empirical tests are clear enough to merit referee time even if the explanation needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper claims that a simple post-hoc intervention—truncating the tail of the SVD of the fine-tuning update ΔW = W_ft − W_base—reduces reliance on spurious correlations while preserving task accuracy. This is demonstrated across three instruction-tuned models (0.5B–7B) and four classification benchmarks, where top-k truncation reduces the spurious-group gap on every evaluated cell at <2 pp accuracy loss (up to 5× on CivilComments). The authors attribute success to shortcut directions residing in the tail of the singular ordering (rather than the value distribution), supported by controls showing that bottom-k and random-k truncation fail, matched-rank LoRA does not replicate the effect, and a boundary-case fine-tune containing only the shortcut produces the predicted collapse of the update into the tail. Spectra are reported as broadly distributed and similar across datasets.

Significance. If the empirical findings hold, the work offers a training-free, label-free debiasing method that could be practically significant for deployed fine-tuned models. It also supplies a coordinate system (the singular basis of ΔW) for dissecting what fine-tuning has learned. Credit is due for the direct controls (top-k vs. bottom-k vs. random-k, LoRA baseline, and boundary-case experiment) that isolate the positioning claim from generic rank reduction, and for the observation that raw singular-value distributions do not differ across datasets.

major comments (2)

[Experiments] Experiments section: the abstract and results claim consistent gap reductions 'on every cell' with <2 pp accuracy loss and up to 5× improvement on CivilComments, yet no error bars, number of runs, statistical tests, or exact dataset statistics are supplied; this directly affects verifiability of the central empirical claim.
[§3] §3 (method) and experiments: the precise rule for selecting the truncation rank k (fixed fraction, validation-based, or otherwise) is not stated, leaving the reported top-k results non-reproducible and the robustness of the 'tail' positioning claim harder to assess.

minor comments (2)

Notation for the SVD truncation operation (exactly which components are retained when 'truncating the tail') should be formalized with an equation or pseudocode to avoid ambiguity between top-k retention and other interpretations.
The boundary-case construction (fine-tuning containing only the shortcut) is described at a high level; a short appendix table listing the exact training data composition would strengthen the control.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. Both major comments identify important gaps in verifiability and reproducibility that we will correct in revision. We address each point below.

read point-by-point responses

Referee: [Experiments] Experiments section: the abstract and results claim consistent gap reductions 'on every cell' with <2 pp accuracy loss and up to 5× improvement on CivilComments, yet no error bars, number of runs, statistical tests, or exact dataset statistics are supplied; this directly affects verifiability of the central empirical claim.

Authors: We agree that the absence of error bars, run counts, statistical tests, and precise dataset statistics limits verifiability. In the revised manuscript we will report means and standard deviations over at least three independent fine-tuning runs for all main results, include exact train/validation/test splits and group sizes for each benchmark, and add paired t-tests or Wilcoxon tests where appropriate to support the reported gap reductions. revision: yes
Referee: [§3] §3 (method) and experiments: the precise rule for selecting the truncation rank k (fixed fraction, validation-based, or otherwise) is not stated, leaving the reported top-k results non-reproducible and the robustness of the 'tail' positioning claim harder to assess.

Authors: We acknowledge that the selection rule for k was omitted. In the experiments, k was chosen as a fixed fraction of the matrix rank (specifically the top 40 % of singular values, selected after inspecting the cumulative explained variance on a held-out validation split of each dataset). We will add an explicit paragraph in §3 describing this procedure, the exact fractions used per model size, and a short sensitivity plot showing that the debiasing effect is stable across a range of fractions around the chosen value. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The manuscript presents an empirical intervention (post-hoc top-k truncation of SVD(ΔW)) whose success is demonstrated via direct controls: differential performance of top-k vs. bottom-k vs. random-k truncation, matched-rank LoRA baselines, and a boundary-case fine-tune containing only the shortcut. The premise that shortcuts occupy the tail is framed as a testable claim about ordering (not raw singular-value magnitudes, which are reported as similar across datasets) and is probed by those controls rather than defined into existence. No equations, fitted parameters, or self-citations are used as load-bearing premises that reduce the result to its inputs by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unproven premise that shortcuts concentrate in the tail of SVD(ΔW); no free parameters or invented entities are introduced in the abstract, but the domain assumption about singular ordering is load-bearing.

axioms (1)

domain assumption The shortcut response is located in the tail of the singular ordering of ΔW
This premise is invoked to explain why top-k truncation removes the gap while bottom-k does not.

pith-pipeline@v0.9.1-grok · 5764 in / 1241 out tokens · 19079 ms · 2026-06-28T23:28:34.425017+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 10 internal anchors

[1]

Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification

URL https://arxiv.org/abs/1903.04561. Chen, Y ., Yao, Y ., Zhang, Y ., Shen, B., Liu, G., and Liu, S. Safety mirage: How spurious correlations undermine vlm safety fine-tuning and can be mitigated by machine unlearning,

work page internal anchor Pith review Pith/arXiv arXiv 1903
[2]

Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning

URL https://arxiv.org/abs/ 2503.11832. Dixon, L., Li, J., Sorensen, J., Thain, N., and Vasserman, L. Measuring and mitigating unintended bias in text classification. InProceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’18, pp. 67–73, New York, NY , USA,

work page internal anchor Pith review Pith/arXiv arXiv 2018
[3]

URL https://doi.org/ 10.1145/3278721.3278729

1145/3278721.3278729. URL https://doi.org/ 10.1145/3278721.3278729. Hsu, Y .-C., Hua, T., Chang, S., Lou, Q., Shen, Y ., and Jin, H. Language model compression with weighted low-rank factorization,

work page doi:10.1145/3278721.3278729
[4]

org/abs/2207.00112

URL https://arxiv. org/abs/2207.00112. Hu, E. J., Shen, Y ., Wallis, P., Allen-Zhu, Z., Li, Y ., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of 5 Shortcuts in the Tail: Debiasing via Post-Hoc Spectral Compression of Fine-Tuning Updates large language models,

work page arXiv
[5]

LoRA: Low-Rank Adaptation of Large Language Models

URL https://arxiv. org/abs/2106.09685. Ilharco, G., Ribeiro, M. T., Wortsman, M., Gururangan, S., Schmidt, L., Hajishirzi, H., and Farhadi, A. Editing models with task arithmetic,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Editing Models with Task Arithmetic

URL https:// arxiv.org/abs/2212.04089. Jain, S., Kirk, R., Lubana, E. S., Dick, R. P., Tanaka, H., Grefenstette, E., Rockt ¨aschel, T., and Krueger, D. S. Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

McCoy, R

URL https:// arxiv.org/abs/2311.12786. McCoy, R. T., Pavlick, E., and Linzen, T. Right for the wrong reasons: Diagnosing syntactic heuristics in natu- ral language inference,

work page arXiv
[8]

Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

URL https://arxiv. org/abs/1902.01007. Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case gen- eralization,

work page internal anchor Pith review Pith/arXiv arXiv 1902
[9]

Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization

URL https://arxiv.org/abs/ 1911.08731. Salles, M. M., Goyal, P., Sekhsaria, P., Huang, H., and Balestriero, R. Lora users beware: A few spurious to- kens can manipulate your finetuned model,

work page internal anchor Pith review Pith/arXiv arXiv 1911
[10]

Sharma, L., Graesser, L., Nangia, N., and Evci, U

URL https://arxiv.org/abs/2506.11402. Sharma, L., Graesser, L., Nangia, N., and Evci, U. Natural language understanding with the quora question pairs dataset,

work page arXiv
[11]

Natural Language Understanding with the Quora Question Pairs Dataset

URL https://arxiv.org/abs/ 1907.01041. Shuieh, J., Singhal, P., Shanker, A., Heyer, J., Pu, G., and Denton, S. Assessing robustness to spurious correlations in post-training language models,

work page internal anchor Pith review Pith/arXiv arXiv 1907
[12]

Thorne, J., Vlachos, A., Christodoulopoulos, C., and Mittal, A

URL https: //arxiv.org/abs/2505.05704. Thorne, J., Vlachos, A., Christodoulopoulos, C., and Mittal, A. FEVER: a large-scale dataset for fact extraction and VERification. InNAACL-HLT,

work page arXiv
[13]

Wang, S., Dong, Y ., Chang, R., Zhu, T., Sun, Y ., Lyu, K., and Li, J

URLhttps://arxiv.org/abs/2411.04097. Wang, S., Dong, Y ., Chang, R., Zhu, T., Sun, Y ., Lyu, K., and Li, J. When bias pretends to be truth: How spurious corre- lations undermine hallucination detection in llms, 2025a. URLhttps://arxiv.org/abs/2511.07318. Wang, X., Alam, S., Wan, Z., Shen, H., and Zhang, M. Svd-llm v2: Optimizing singular value truncation ...

work page arXiv
[14]

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

URL https://arxiv. org/abs/1704.05426. Wu, Y ., Gardner, M., Stenetorp, P., and Dasigi, P. Gen- erating data to mitigate spurious correlations in natu- ral language inference datasets,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Yang, Y ., Lee, C

URL https: //arxiv.org/abs/2203.12942. Yang, Y ., Lee, C. P., Feng, S., Zhao, D., Wen, B., Liu, A. Z., Tsvetkov, Y ., and Howe, B. Escaping the spuriverse: Can large vision-language models generalize beyond seen spurious correlations?,

work page arXiv
[16]

org/abs/2506.18322

URL https://arxiv. org/abs/2506.18322. Zhang, Y ., Baldridge, J., and He, L. Paws: Paraphrase adversaries from word scrambling,

work page arXiv
[17]

PAWS: Paraphrase Adversaries from Word Scrambling

URL https: //arxiv.org/abs/1904.01130. Zhou, Y ., Xu, P., Liu, X., An, B., Ai, W., and Huang, F. Explore spurious correlations at the concept level in language models for text classification,

work page internal anchor Pith review Pith/arXiv arXiv 1904
[18]

Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Byun, M

URL https://arxiv.org/abs/2311.08648. Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Byun, M. J., Wang, Z., Mallen, A., Basart, S., Koyejo, S., Song, D., Fredrikson, M., Kolter, J. Z., and Hendrycks, D. Representation engineering: A top-down approach to ai transparency,

work page arXiv
[19]

Representation Engineering: A Top-Down Approach to AI Transparency

URL https://arxiv.org/abs/2310.01405. 6 Shortcuts in the Tail: Debiasing via Post-Hoc Spectral Compression of Fine-Tuning Updates A. The IMDB-marker control: a predicted boundary of the spectral picture Construction.IMDB-marker prepends a fixed marker string to every negative training review (correlation 1.0 with the negative label). The fine-tuned model ...

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification

URL https://arxiv.org/abs/1903.04561. Chen, Y ., Yao, Y ., Zhang, Y ., Shen, B., Liu, G., and Liu, S. Safety mirage: How spurious correlations undermine vlm safety fine-tuning and can be mitigated by machine unlearning,

work page internal anchor Pith review Pith/arXiv arXiv 1903

[2] [2]

Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning

URL https://arxiv.org/abs/ 2503.11832. Dixon, L., Li, J., Sorensen, J., Thain, N., and Vasserman, L. Measuring and mitigating unintended bias in text classification. InProceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’18, pp. 67–73, New York, NY , USA,

work page internal anchor Pith review Pith/arXiv arXiv 2018

[3] [3]

URL https://doi.org/ 10.1145/3278721.3278729

1145/3278721.3278729. URL https://doi.org/ 10.1145/3278721.3278729. Hsu, Y .-C., Hua, T., Chang, S., Lou, Q., Shen, Y ., and Jin, H. Language model compression with weighted low-rank factorization,

work page doi:10.1145/3278721.3278729

[4] [4]

org/abs/2207.00112

URL https://arxiv. org/abs/2207.00112. Hu, E. J., Shen, Y ., Wallis, P., Allen-Zhu, Z., Li, Y ., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of 5 Shortcuts in the Tail: Debiasing via Post-Hoc Spectral Compression of Fine-Tuning Updates large language models,

work page arXiv

[5] [5]

LoRA: Low-Rank Adaptation of Large Language Models

URL https://arxiv. org/abs/2106.09685. Ilharco, G., Ribeiro, M. T., Wortsman, M., Gururangan, S., Schmidt, L., Hajishirzi, H., and Farhadi, A. Editing models with task arithmetic,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Editing Models with Task Arithmetic

URL https:// arxiv.org/abs/2212.04089. Jain, S., Kirk, R., Lubana, E. S., Dick, R. P., Tanaka, H., Grefenstette, E., Rockt ¨aschel, T., and Krueger, D. S. Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

McCoy, R

URL https:// arxiv.org/abs/2311.12786. McCoy, R. T., Pavlick, E., and Linzen, T. Right for the wrong reasons: Diagnosing syntactic heuristics in natu- ral language inference,

work page arXiv

[8] [8]

Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

URL https://arxiv. org/abs/1902.01007. Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case gen- eralization,

work page internal anchor Pith review Pith/arXiv arXiv 1902

[9] [9]

Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization

URL https://arxiv.org/abs/ 1911.08731. Salles, M. M., Goyal, P., Sekhsaria, P., Huang, H., and Balestriero, R. Lora users beware: A few spurious to- kens can manipulate your finetuned model,

work page internal anchor Pith review Pith/arXiv arXiv 1911

[10] [10]

Sharma, L., Graesser, L., Nangia, N., and Evci, U

URL https://arxiv.org/abs/2506.11402. Sharma, L., Graesser, L., Nangia, N., and Evci, U. Natural language understanding with the quora question pairs dataset,

work page arXiv

[11] [11]

Natural Language Understanding with the Quora Question Pairs Dataset

URL https://arxiv.org/abs/ 1907.01041. Shuieh, J., Singhal, P., Shanker, A., Heyer, J., Pu, G., and Denton, S. Assessing robustness to spurious correlations in post-training language models,

work page internal anchor Pith review Pith/arXiv arXiv 1907

[12] [12]

Thorne, J., Vlachos, A., Christodoulopoulos, C., and Mittal, A

URL https: //arxiv.org/abs/2505.05704. Thorne, J., Vlachos, A., Christodoulopoulos, C., and Mittal, A. FEVER: a large-scale dataset for fact extraction and VERification. InNAACL-HLT,

work page arXiv

[13] [13]

Wang, S., Dong, Y ., Chang, R., Zhu, T., Sun, Y ., Lyu, K., and Li, J

URLhttps://arxiv.org/abs/2411.04097. Wang, S., Dong, Y ., Chang, R., Zhu, T., Sun, Y ., Lyu, K., and Li, J. When bias pretends to be truth: How spurious corre- lations undermine hallucination detection in llms, 2025a. URLhttps://arxiv.org/abs/2511.07318. Wang, X., Alam, S., Wan, Z., Shen, H., and Zhang, M. Svd-llm v2: Optimizing singular value truncation ...

work page arXiv

[14] [14]

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

URL https://arxiv. org/abs/1704.05426. Wu, Y ., Gardner, M., Stenetorp, P., and Dasigi, P. Gen- erating data to mitigate spurious correlations in natu- ral language inference datasets,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Yang, Y ., Lee, C

URL https: //arxiv.org/abs/2203.12942. Yang, Y ., Lee, C. P., Feng, S., Zhao, D., Wen, B., Liu, A. Z., Tsvetkov, Y ., and Howe, B. Escaping the spuriverse: Can large vision-language models generalize beyond seen spurious correlations?,

work page arXiv

[16] [16]

org/abs/2506.18322

URL https://arxiv. org/abs/2506.18322. Zhang, Y ., Baldridge, J., and He, L. Paws: Paraphrase adversaries from word scrambling,

work page arXiv

[17] [17]

PAWS: Paraphrase Adversaries from Word Scrambling

URL https: //arxiv.org/abs/1904.01130. Zhou, Y ., Xu, P., Liu, X., An, B., Ai, W., and Huang, F. Explore spurious correlations at the concept level in language models for text classification,

work page internal anchor Pith review Pith/arXiv arXiv 1904

[18] [18]

Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Byun, M

URL https://arxiv.org/abs/2311.08648. Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Byun, M. J., Wang, Z., Mallen, A., Basart, S., Koyejo, S., Song, D., Fredrikson, M., Kolter, J. Z., and Hendrycks, D. Representation engineering: A top-down approach to ai transparency,

work page arXiv

[19] [19]

Representation Engineering: A Top-Down Approach to AI Transparency

URL https://arxiv.org/abs/2310.01405. 6 Shortcuts in the Tail: Debiasing via Post-Hoc Spectral Compression of Fine-Tuning Updates A. The IMDB-marker control: a predicted boundary of the spectral picture Construction.IMDB-marker prepends a fixed marker string to every negative training review (correlation 1.0 with the negative label). The fine-tuned model ...

work page internal anchor Pith review Pith/arXiv arXiv