arxiv: 2602.12134 · v2 · submitted 2026-02-12 · 💻 cs.AI · cs.HC

Value Alignment Tax: Measuring Value Trade-offs in LLM Alignment

Jiajun Chen, Hua Shen

Pith reviewed 2026-05-16 02:18 UTC · model grok-4.3

keywords: value alignment · LLM alignment · value trade-offs · alignment interventions · Schwartz values · system-level evaluation

The pith

Aligning an LLM to one value systematically shifts other values in structured ways.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VAT, a measurement framework that tracks how alignment interventions such as prompting or fine-tuning change not only the intended value but the full set of interconnected values. Using paired pre- and post-intervention judgments on a controlled dataset drawn from Schwartz value theory, the authors show that gains in a target value are accompanied by uneven co-movements in non-target values. These trade-offs remain invisible when evaluation looks only at the targeted value, yet they appear consistently across models and intervention types. The work therefore treats value alignment as a system-level process rather than an isolated improvement.

Core claim

Alignment interventions produce uneven and structured co-movement among values, so that improving a target value typically alters non-target values in predictable directions; the VAT metric quantifies this propagation by relating off-target change to the achieved on-target gain, exposing trade-offs that conventional single-value metrics miss.

What carries the argument

The VAT framework, which measures the ratio of alignment-induced change across non-target values to the gain achieved on the target value, using paired normative judgments on scenario-action items grounded in Schwartz theory.
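As a rough illustration of the ratio described above, the sketch below divides the mean absolute change in non-target values by the absolute gain on the target value. This is an assumed simplification, not the paper's exact VAT definition (which this page does not reproduce); the function name `vat_ratio`, the value names, and the scores are all hypothetical.

```python
from statistics import mean

def vat_ratio(pre, post, target):
    """Illustrative off-target / on-target ratio (assumed form, not the paper's formula).

    pre, post: dicts mapping value name -> mean judgment score before/after alignment.
    target:    name of the value the intervention was meant to move.
    """
    gain = post[target] - pre[target]
    if gain == 0:
        raise ValueError("no on-target gain; the ratio is undefined")
    # Mean absolute shift across every value except the target.
    off_target = [abs(post[v] - pre[v]) for v in pre if v != target]
    return mean(off_target) / abs(gain)

# Hypothetical scores on a 1-5 scale; the intervention suppresses "power".
pre = {"power": 3.0, "benevolence": 4.0, "security": 3.5}
post = {"power": 2.0, "benevolence": 4.5, "security": 3.25}
ratio = vat_ratio(pre, post, target="power")
print(ratio)  # 0.375
```

A target-only evaluation would report just the one-point drop in `power`; the ratio makes the accompanying movement in the other two values visible.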

If this is right

Conventional target-only benchmarks understate the full effects of alignment methods.
Different alignment techniques produce distinct patterns of value co-movement.
Trade-offs appear across multiple models and persist across prompting, fine-tuning, and preference optimization.
Value systems in LLMs behave as interconnected networks rather than independent dimensions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Future alignment pipelines could add an explicit VAT penalty term during training to reduce unintended shifts.
The same measurement approach could be applied to other AI systems that make value-laden decisions, such as recommendation engines or autonomous agents.
Longer-term monitoring of deployed models might track VAT drift as usage contexts evolve.
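The first extension above could look something like the following sketch, which adds a quadratic penalty on non-target drift to an existing target loss. Everything here is hypothetical: the paper does not propose this objective, and `penalized_loss`, `lam`, and the scores are illustrative names and numbers only.

```python
def penalized_loss(target_loss, pre, post, target, lam=0.1):
    """Target loss plus an assumed quadratic penalty on off-target value drift.

    pre, post: dicts mapping value name -> score before/after the update.
    lam:       hypothetical weight trading target progress against side effects.
    """
    drift = sum((post[v] - pre[v]) ** 2 for v in pre if v != target)
    return target_loss + lam * drift

# Hypothetical scores; "power" is the steering target.
pre = {"power": 3.0, "benevolence": 4.0, "security": 3.5}
post = {"power": 2.0, "benevolence": 4.5, "security": 3.25}
loss = penalized_loss(target_loss=1.0, pre=pre, post=post, target="power")
print(loss)
```

In a real training loop the scores would have to come from a differentiable proxy for the value judgments, which this sketch does not attempt to model.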

Load-bearing premise

The controlled scenario-action dataset and pre-post judgment pairs accurately reflect genuine, interconnected changes in an LLM's value expression rather than measurement artifacts.

What would settle it

Re-running the same interventions on a different value taxonomy or on real-world decision traces and finding that off-target shifts are random in direction and magnitude rather than structured would falsify the claim of systematic trade-offs.
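One simple way to probe whether off-target shifts are "random in direction" is an exact two-sided sign test on the shifts of a single non-target value across repeated runs. The sketch below is an assumed check, not the paper's procedure; `sign_test_p` and the example shifts are hypothetical.

```python
from math import comb

def sign_test_p(shifts):
    """Two-sided exact binomial sign test; zero shifts are dropped.

    Under the null hypothesis, positive and negative shifts are equally likely.
    """
    signs = [s for s in shifts if s != 0]
    n = len(signs)
    if n == 0:
        return 1.0
    k = sum(1 for s in signs if s > 0)
    tail = min(k, n - k)
    # Probability of a split at least this lopsided, doubled for two sides.
    p = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, p)

# Hypothetical shifts of one non-target value over eight intervention runs.
consistent = [0.2, 0.1, 0.3, 0.1, 0.4, 0.2, 0.1, 0.3]
mixed = [0.2, -0.1, 0.3, -0.1, 0.4, -0.2, 0.1, -0.3]
print(sign_test_p(consistent))  # 0.0078125
print(sign_test_p(mixed))       # 1.0
```

A small p-value for many non-target values would support structured co-movement; p-values near 1 throughout would point toward the falsifying outcome described above. The test only looks at direction, so magnitude would need a separate check.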

Figures

Figures reproduced from arXiv: 2602.12134 by Hua Shen, Jiajun Chen.

**Figure 1.** Illustration of Value Alignment Tax. Traditional trait-level evaluation reports independent value scores, whereas VAT elicits state-level value configurations and models values as a relational system, revealing alignment-induced trade-offs. Edge direction denotes influence; width indicates trade-off magnitude.

**Figure 2.** Value-level alignment coupling under different steering objectives. Top row: Normalized VAT(v)/nVAT profiles (radar plots) showing value participation strength under each steering objective. Bottom row: Corresponding value–value coupling structures (chord diagrams; top-|Ruv| edges, 8-shot). Red indicates strong positive coupling; blue indicates strong negative coupling.

**Figure 3.** Trade-off between target value gain and system-level alignment tax (nVAT) across SFT and DPO checkpoints when suppressing Power. Dashed lines indicate Pareto-efficient alignment regimes.

**Figure 4.** Value-level alignment tax projected onto the …

**Figure 5.** Alignment-induced risk amplification. Distribution of value-level amplification, derived from shifts in Likert-scale responses, for coordination hubs (high-VAT values) and non-hubs under different steering objectives (GPT-4o, 8-shot).

**Figure 6.** Bootstrap stability of normalized Value Align…

**Figure 7.** Rank agreement of value-level VAT under Spearman correlation. Bars indicate the Spearman rank correlation between VAT vectors computed under alternative correlation settings. High agreement indicates robustness of induced value ordering.

**Figure 9.** Value–value coupling matrices (heatmap view) for Qwen (top row) and GPT (bottom row). Columns from …

**Figure 10.** Value–value coupling matrices (heatmap view) for Gemini (top row) and DeepSeek (bottom row), using …

**Figure 11.** Value-level alignment coupling for GPT. Top: normalized VAT( …

**Figure 12.** Value-level alignment coupling for Gemini (same visualization protocol as Fig. …)

**Figure 13.** Value-level alignment coupling for DeepSeek (same visualization protocol as Fig. …)
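The robustness checks named in the captions for Figures 6 and 7 (bootstrap stability and Spearman rank agreement between VAT vectors) can be sketched as follows. This is a minimal stdlib-only illustration under stated assumptions: `score_fn` stands in for whatever per-value VAT score the paper computes, the resampling fraction `frac` and the choice to subsample scenes without replacement are assumptions, and the data are made up.

```python
import random
from statistics import mean

def ranks(xs):
    """1-based ranks; tied entries share the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(a, b):
    """Pearson correlation; undefined (division by zero) for constant input."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def spearman(a, b):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

def bootstrap_agreement(scene_deltas, score_fn, frac=0.8, n_boot=200, seed=0):
    """Rank agreement between full-data scores and scores on scene subsamples."""
    rng = random.Random(seed)
    full = score_fn(scene_deltas)
    k = max(2, int(frac * len(scene_deltas)))
    return [spearman(full, score_fn(rng.sample(scene_deltas, k)))
            for _ in range(n_boot)]

def mean_abs_per_value(scene_deltas):
    """Placeholder score: mean absolute pre-post delta for each value column."""
    return [mean(abs(row[c]) for row in scene_deltas)
            for c in range(len(scene_deltas[0]))]

# Hypothetical per-scene deltas for three values.
scenes = [[1.0 + 0.01 * i, 0.5, -0.2 + 0.001 * i] for i in range(10)]
agreement = bootstrap_agreement(scenes, mean_abs_per_value)
print(min(agreement))
```

Agreement values close to 1 across resamples would indicate that the value ordering does not hinge on a handful of scenes.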

Original abstract

Existing work on value alignment typically characterizes value relations statically, ignoring how alignment interventions, such as prompting, fine-tuning, or preference optimization, reshape the broader value system. In practice, aligning a target value can implicitly shift other values, creating value trade-offs that remain largely unmeasured. We introduce VAT, a framework that quantifies value trade-offs by measuring how alignment-induced changes propagate across interconnected values relative to achieved on-target gain. VAT captures the system-level dynamics of value expression under alignment intervention, enabling evaluation of both intended improvements and unintended side effects. Using a controlled scenario-action dataset grounded in Schwartz value theory, we collect paired pre-post normative judgments and analyze alignment effects across models, values, and interventions. Results show that alignment often produces uneven and structured co-movement among values, revealing systematic trade-offs between target and non-target values. These effects are largely invisible under conventional target-only evaluation, but become evident via VAT, highlighting process-level alignment risks and offering new insights into the dynamic nature of value alignment in LLMs. Dataset and code are open-sourced.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VAT measures how alignment on one value shifts others in LLMs, and the full paper backs the claim with controlled experiments and open data.

The core takeaway is that standard alignment steps often move non-target values in structured ways, and VAT gives a number for that side effect relative to the on-target gain. The paper shows this through pre-post judgments on a scenario dataset built from Schwartz values, with correlation matrices and regressions on the deltas. They add controls for prompt format and length, plus randomization steps, which keeps the main results from looking like artifacts. Code and data are released, so the numbers can be checked directly. That combination of a new measurement angle plus reproducible setup is the useful part. The experiments cover multiple models and intervention types, which helps show the pattern is not isolated.

The soft spots are limited. The scenarios stay within the Schwartz framework, so they may miss value connections that show up in messier real prompts, and the practical size of the trade-offs could use more context on when they matter for deployment. Still, the central pattern of uneven co-movement holds up in the reported tests without internal contradictions.

This paper is aimed at alignment researchers who already track value drift and want a way to quantify it beyond single-metric scores. Anyone building or auditing preference optimization pipelines will find the dataset and metric worth testing on their own setups. It deserves a serious referee because the methods are explicit, the data is public, and the results directly challenge target-only evaluation without relying on circular definitions.
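For readers who want a concrete picture of "correlation matrices on the deltas", the sketch below builds a value-by-value Pearson correlation matrix from per-scenario pre-post changes. It is an assumed reading of that phrase rather than the paper's released code; `coupling_matrix` and the delta values are hypothetical.

```python
from statistics import mean

def pearson(a, b):
    """Pearson correlation; undefined (division by zero) for constant input."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def coupling_matrix(deltas):
    """deltas: dict value name -> list of per-scenario (post - pre) changes.

    Returns a nested dict R with R[u][v] = correlation between the deltas of u and v.
    """
    names = sorted(deltas)
    return {u: {v: pearson(deltas[u], deltas[v]) for v in names} for u in names}

# Hypothetical deltas over four scenarios; benevolence mirrors power exactly.
deltas = {
    "power": [-1.0, -0.5, -1.5, -0.8],
    "benevolence": [0.5, 0.25, 0.75, 0.4],
    "security": [0.1, -0.2, 0.0, 0.3],
}
R = coupling_matrix(deltas)
print(round(R["power"]["benevolence"], 6))  # -1.0
```

Strongly negative entries would mark value pairs that move against each other under an intervention; entries near zero would mark values that shift independently.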

Referee Report

0 major / 2 minor

Summary. The paper introduces the Value Alignment Tax (VAT) framework to quantify value trade-offs induced by alignment interventions (prompting, fine-tuning, preference optimization) in LLMs. Using a controlled scenario-action dataset grounded in Schwartz value theory, it collects paired pre-post normative judgments and analyzes co-movement patterns, showing that alignment produces structured, uneven shifts across target and non-target values that remain invisible under target-only evaluation.

Significance. If the empirical results hold, VAT supplies a practical measurement tool for system-level alignment dynamics, enabling detection of unintended value side-effects. The open-sourced dataset and code, together with explicit controls for prompt length/format, randomization, and blinding, plus correlation matrices and regression analyses on target vs. non-target deltas, constitute concrete strengths that support reproducibility and extension.

minor comments (2)

[Methods] §3 (Methods): the regression specification on target vs. non-target deltas should report the full model equation, coefficient standard errors, and any variance-inflation-factor checks, as these directly support the claim of systematic trade-offs.
[Results] Table 2: the reported effect sizes for co-movement should include confidence intervals or p-values from the statistical tests to allow readers to gauge the strength of the unevenness finding.
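The kind of reporting requested in the two comments above might be sketched as follows: a single-predictor least-squares fit of a non-target delta on the target delta, with the slope's standard error and an approximate 95% interval. This is an assumed, simplified specification (the paper's actual regression is not reproduced on this page); the interval uses the normal quantile 1.96, which is crude for small samples, and all data are hypothetical. With only one predictor there is nothing for a variance-inflation check to compare, so that step is omitted.

```python
from statistics import mean

def ols_slope(x, y):
    """Simple linear regression y = a + b*x.

    Returns (slope, intercept, slope standard error, approximate 95% CI for slope).
    """
    n = len(x)
    if n < 3:
        raise ValueError("need at least 3 points to estimate residual variance")
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(r * r for r in resid) / (n - 2)  # unbiased residual variance
    se = (sigma2 / sxx) ** 0.5
    ci = (slope - 1.96 * se, slope + 1.96 * se)   # normal approximation
    return slope, intercept, se, ci

# Hypothetical target-value gains (x) and one non-target value's shifts (y).
target_delta = [0.0, 1.0, 2.0, 3.0, 4.0]
nontarget_delta = [0.1, -0.4, -1.1, -1.4, -2.1]
slope, intercept, se, ci = ols_slope(target_delta, nontarget_delta)
print(round(slope, 3), round(se, 3), tuple(round(c, 3) for c in ci))
```

An interval that excludes zero would indicate that the non-target shift scales with the on-target gain rather than fluctuating around it.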

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our VAT framework and its grounding in Schwartz value theory, as well as the recommendation for minor revision. The assessment correctly highlights the controlled dataset, reproducibility measures, and the distinction between target-only evaluation and system-level co-movement analysis.

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces VAT as an empirical measurement framework that quantifies value trade-offs via pre-post normative judgments on a controlled scenario-action dataset grounded in Schwartz value theory. No equations, derivations, or fitted parameters are presented that reduce the reported co-movement or trade-offs to inputs by construction. Statistical tests for structured value changes (correlation matrices, regression on deltas) operate independently of the measurement process. The central claims rest on observable empirical patterns rather than self-definitional loops, self-citation chains, or smuggled ansatzes. This is a standard non-circular empirical measurement study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on Schwartz value theory as a model of value interconnections and on the assumption that pre-post normative judgments isolate alignment effects.

axioms (1)

Domain assumption: Schwartz value theory provides a valid and complete model of interconnected human values suitable for LLM evaluation.
The dataset is grounded in it; the framework assumes this theory captures the relevant value relations and co-movements.

invented entities (1)

Value Alignment Tax (VAT): no independent evidence
Purpose: quantify value trade-offs by measuring propagation of alignment changes across values.
New framework introduced to capture system-level dynamics not addressed by prior static approaches.

pith-pipeline@v0.9.0 · 5477 in / 1132 out tokens · 27201 ms · 2026-05-16T02:18:44.366457+00:00 · methodology
