pith. machine review for the scientific record.

arxiv: 2604.11216 · v1 · submitted 2026-04-13 · 💻 cs.AI

Recognition: unknown

Measuring the Authority Stack of AI Systems: Empirical Analysis of 366,120 Forced-Choice Responses Across 8 AI Models

Seulki Lee

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:13 UTC · model grok-4.3

classification 💻 cs.AI
keywords AI authority stack, value priorities, evidence preferences, source trust, forced-choice evaluation, PRISM benchmark, model consistency, professional AI deployment

The pith

AI models possess measurable but framing-sensitive authority stacks that determine their value priorities, evidence preferences, and source trust in professional dilemmas.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes an empirical method to measure the decision hierarchies in AI systems by presenting them with thousands of forced-choice scenarios across value, evidence, and trust layers. It finds that models divide evenly between universalism-first and security-first orientations, with security dominating in defense settings, while evidence preferences differ and institutional sources are broadly trusted. These patterns emerge from over 366,000 responses and show high reliability on repeated tests but lower consistency when the same dilemma is rephrased. Understanding these stacks matters because they shape how AI would resolve conflicts in fields like healthcare, law, and security. If the measurements hold, they provide a basis for evaluating and selecting models for specific applications.

Core claim

Using the PRISM benchmark, which comprises 14,175 scenarios per layer across seven domains, the study maps the Authority Stack of eight AI models through 366,120 forced-choice responses. At the value-priority layer, models split symmetrically, with four favoring universalism and four favoring security. Evidence preferences at the next layer diverge among models, while source trust converges on institutional authorities. Consistency across scenario variants ranges from 57 to 69 percent, but test-retest reliability exceeds 91 percent, indicating that the instability arises from sensitivity to framing rather than from randomness. This demonstrates that AI systems exhibit structured and quantifiable decision hierarchies.
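
To make the two consistency metrics concrete, here is a minimal sketch of how a Paired Consistency Score and Test-Retest Reliability could be computed. The paper's exact formulas are not given in the abstract, so the variant-pairing scheme below is an assumption.

    # Hedged sketch: plausible PCS and TRR computations. The pairing of
    # scenario variants is assumed, not taken from the paper.
    from itertools import combinations

    def pcs(choices_by_variant):
        # choices_by_variant: dict variant_id -> choice ('A' or 'B') for one
        # base dilemma; returns the fraction of agreeing variant pairs
        pairs = list(combinations(choices_by_variant.values(), 2))
        return sum(a == b for a, b in pairs) / len(pairs)

    def trr(first_pass, second_pass):
        # identical scenario lists answered in two separate passes at
        # temperature 0; returns the fraction of identical answers
        return sum(a == b for a, b in zip(first_pass, second_pass)) / len(first_pass)

    # a dilemma whose five rephrasings elicit four 'A's and one 'B'
    print(pcs({"v1": "A", "v2": "A", "v3": "B", "v4": "A", "v5": "A"}))  # 0.6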

What carries the argument

The PRISM forced-choice instrument, a set of structured dilemmas that isolate value priorities, evidence-type preferences, and source trust hierarchies in AI responses.
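
For readers unfamiliar with forced-choice probing, a minimal sketch of how one PRISM-style item might be administered. Here query_model is a hypothetical stand-in for a chat-completion client; the actual PRISM prompt templates are not reproduced in the abstract.

    # Hedged sketch of a forced-choice probe; the prompt wording is
    # illustrative, not the paper's template.
    def forced_choice(query_model, scenario, option_a, option_b):
        prompt = (
            f"{scenario}\n\n"
            f"Option A: {option_a}\n"
            f"Option B: {option_b}\n\n"
            "You must choose exactly one option. Answer 'A' or 'B' only."
        )
        reply = query_model(prompt, temperature=0).strip().upper()
        return reply if reply in ("A", "B") else None  # refusal or malformed output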

If this is right

  • Four of the eight models consistently prioritize universalism over security at the value layer, while the other four do the reverse.
  • In defense-related scenarios, security values rise to near-certain preference in six models (see the win-rate sketch after this list).
  • Models differ in whether they favor empirical-scientific evidence or pattern-based and experiential types at the evidence layer.
  • All models show broad agreement in trusting institutional sources over others.
  • Value and trust responses remain stable on retesting but shift with changes in how scenarios are presented.
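
The referee's second minor comment below notes that "win-rate" is never precisely defined. One plausible reading, sketched here under that caveat, is the fraction of dilemmas pitting a value against any rival in which the model picks that value.

    # Hedged sketch: one plausible per-domain win-rate, given records like
    #   {"domain": "defense", "values": ("Security", "Universalism"),
    #    "chosen": "Security"}
    def win_rate(records, value, domain):
        contested = [r for r in records
                     if r["domain"] == domain and value in r["values"]]
        if not contested:
            return None  # value never contested in this domain
        return sum(r["chosen"] == value for r in contested) / len(contested)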

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If these authority stacks prove robust, organizations may need to test and select specific AI models based on alignment with required value priorities for different professional contexts.
  • The sensitivity to scenario variants suggests that deployment in ambiguous real-world situations could lead to unpredictable shifts in AI recommendations.
  • Extending this approach to additional domains or models could reveal whether authority stacks are general properties or artifacts of current training data.

Load-bearing premise

The PRISM scenarios and their forced binary choices validly measure the intended layers of the authority stack without being distorted by the specific wording, domain choices, or the requirement to pick one option.

What would settle it

Finding that the priority orders and trust hierarchies change substantially when the same models are tested with a different set of scenarios or an open-ended response format instead of forced choices would indicate that the measured stacks do not reflect stable underlying preferences.
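
One way to run that settling test: recover each model's value-priority order from PRISM and from an alternative instrument, then check ordinal agreement. A minimal sketch, assuming both instruments yield ranked lists over the same value names:

    # Hedged sketch: Kendall's tau between two recovered priority orders.
    from scipy.stats import kendalltau

    def rank_agreement(order_prism, order_alt):
        # each argument: value names, most- to least-preferred
        shared = [v for v in order_prism if v in order_alt]
        r1 = [order_prism.index(v) for v in shared]
        r2 = [order_alt.index(v) for v in shared]
        tau, _ = kendalltau(r1, r2)
        return tau  # near 1: stable preferences; near 0: instrument-bound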

read the original abstract

What values, evidence preferences, and source trust hierarchies do AI systems actually exhibit when facing structured dilemmas? We present the first large-scale empirical mapping of AI decision-making across all three layers of the Authority Stack framework (S. Lee, 2026a): value priorities (L4), evidence-type preferences (L3), and source trust hierarchies (L2). Using the PRISM benchmark -- a forced-choice instrument of 14,175 unique scenarios per layer, spanning 7 professional domains, 3 severity levels, 3 decision timeframes, and 5 scenario variants -- we evaluated 8 major AI models at temperature 0, yielding 366,120 total responses. Key findings include: (1) a symmetric 4:4 split between Universalism-first and Security-first models at L4; (2) dramatic defense-domain value restructuring where Security surges to near-ceiling win-rates (95.1%-99.8%) in 6 of 8 models; (3) divergent evidence hierarchies at L3, with some models favoring empirical-scientific evidence while others prefer pattern-based or experiential evidence; (4) broad convergence on institutional source trust at L2; and (5) Paired Consistency Scores (PCS) ranging from 57.4% to 69.2%, revealing substantial framing sensitivity across scenario variants. Test-Retest Reliability (TRR) ranges from 91.7% to 98.6%, indicating that value instability stems primarily from variant sensitivity rather than stochastic noise. These findings demonstrate that AI models possess measurable -- if sometimes unstable -- Authority Stacks with consequential implications for deployment across professional domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents the PRISM benchmark, a forced-choice instrument with 14,175 scenarios per layer spanning 7 domains, 3 severity levels, 3 timeframes, and 5 variants, to empirically map the three layers of the Authority Stack (L4 value priorities, L3 evidence preferences, L2 source trust) in 8 AI models, resulting in 366,120 responses. Key results include a 4:4 split between Universalism-first and Security-first models at L4, near-ceiling Security preferences in defense domains, divergent L3 hierarchies, convergence on institutional sources at L2, PCS of 57-69%, and TRR of 91-98%.
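
Before weighing the objections, the headline counts at least cohere arithmetically. A quick check (the 45-dilemma decomposition and the retest reading are inferences; the abstract does not itemize the totals):

    # Hedged arithmetic check of the reported benchmark size.
    domains, severities, timeframes, variants = 7, 3, 3, 5
    cells = domains * severities * timeframes * variants   # 315 combinations
    assert 14_175 % cells == 0                             # 45 base dilemmas per cell
    core = 8 * 3 * 14_175                                  # 8 models x 3 layers = 340,200
    # The abstract reports 366,120 total; the remaining 25,920 responses are
    # not itemized there (plausibly the retest passes behind TRR).
    print(cells, 14_175 // cells, core, 366_120 - core)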

Significance. The large-scale empirical data on AI decision-making hierarchies could inform deployment strategies if the instrument's validity is established. Strengths include the sample size and high test-retest reliability, which suggest stability beyond stochastic noise. However, the significance is constrained by the framework's self-referential nature and unaddressed potential artifacts in scenario design.

major comments (3)
  1. [Methods] The abstract and methods description omit details on model selection, exact prompt engineering, and controls for multiple comparisons across domains and variants; this is load-bearing for interpreting the symmetric 4:4 split and the 95%+ Security surges as model properties rather than methodological artifacts.
  2. [Methods] No pre-registration of scenario generation rules or expert validation of layer isolation is reported, raising concerns that the forced binary format and specific domain/severity/timeframe combinations (e.g., defense priming Security) may not cleanly separate L4, L3, and L2 as claimed.
  3. [Introduction] The results are framed as mapping the Authority Stack from the author's prior paper (S. Lee, 2026a) without external validation or comparison to alternative decision frameworks, which limits the ability to assess whether the observed patterns reflect stable AI properties or the instrument's construction.
minor comments (2)
  1. [Abstract] The Paired Consistency Scores (PCS) and Test-Retest Reliability (TRR) are reported with ranges but without specifying which models achieve which values or how variants were paired.
  2. [Results] Clarify the exact definition of 'win-rates' and how they are calculated across the 5 variants.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important areas for strengthening the transparency and framing of our work. We address each major comment below and have revised the manuscript to incorporate clarifications and additional details where feasible.

read point-by-point responses
  1. Referee: [Methods] The abstract and methods description omit details on model selection, exact prompt engineering, and controls for multiple comparisons across domains and variants; this is load-bearing for interpreting the symmetric 4:4 split and the 95%+ Security surges as model properties rather than methodological artifacts.

    Authors: We agree that greater specificity is required to support interpretation of the results. In the revised manuscript, the Methods section has been expanded to include the full list of the eight models evaluated (with providers and versions), the complete prompt templates and formatting instructions used for each layer, and the statistical procedures applied, including Holm-Bonferroni correction for the multiple comparisons arising from seven domains and five variants (a sketch of this correction follows the responses below). These additions confirm that the reported 4:4 split and domain-specific patterns remain after correction. revision: yes

  2. Referee: [Methods] No pre-registration of scenario generation rules or expert validation of layer isolation is reported, raising concerns that the forced binary format and specific domain/severity/timeframe combinations (e.g., defense priming Security) may not cleanly separate L4, L3, and L2 as claimed.

    Authors: We did not pre-register the scenario generation rules. The revised Methods section now provides a complete, step-by-step account of the generation algorithm, including the systematic construction rules for domains, severity levels, timeframes, and variants, and how these were intended to isolate the three layers. We have also added an explicit discussion in the Limitations section addressing potential artifacts from the forced-choice format and domain priming effects. We accept that pre-registration and independent expert validation would have strengthened the design and note this as a limitation for future extensions of the benchmark. revision: partial

  3. Referee: [Introduction] The results are framed as mapping the Authority Stack from the author's prior paper (S. Lee, 2026a) without external validation or comparison to alternative decision frameworks, which limits the ability to assess whether the observed patterns reflect stable AI properties or the instrument's construction.

    Authors: This manuscript constitutes the first large-scale empirical application of the Authority Stack framework introduced in S. Lee (2026a). In the revision, the Introduction now includes a concise comparison to established frameworks such as Schwartz's Theory of Basic Human Values and dual-process decision models, noting points of convergence and divergence. The Discussion section has been updated to state explicitly that external validation against other instruments is absent and to outline directions for such comparisons in future work. revision: yes
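
For reference, the Holm-Bonferroni procedure the rebuttal invokes is standard. A minimal sketch using statsmodels, with p-values invented purely for illustration:

    # Hedged sketch of Holm-Bonferroni over hypothetical per-domain tests.
    from statsmodels.stats.multitest import multipletests

    pvals = [0.001, 0.004, 0.019, 0.030, 0.047, 0.120, 0.450]  # 7 domains (invented)
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
    for p, padj, r in zip(pvals, p_adj, reject):
        print(f"raw={p:.3f}  adjusted={padj:.3f}  significant={r}")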

Circularity Check

1 step flagged

Authority Stack framework and PRISM instrument rest on self-cited prior definition; results map the construct rather than test it independently

specific steps
  1. load-bearing self-citation [Abstract]
    "across all three layers of the Authority Stack framework (S. Lee, 2026a): value priorities (L4), evidence-type preferences (L3), and source trust hierarchies (L2). Using the PRISM benchmark -- a forced-choice instrument of 14,175 unique scenarios per layer"

    The empirical mapping, scenario design, and all reported findings (symmetric 4:4 split, defense-domain Security surges, divergent L3 hierarchies, L2 convergence) are framed as measurements of the Authority Stack layers. Since the framework itself originates in the author's prior paper and receives no independent justification or external benchmark here, the interpretation that results demonstrate 'measurable Authority Stacks' reduces directly to the self-cited definition rather than providing new evidence for it.

full rationale

The paper's central claim—that AI models possess measurable Authority Stacks with implications for deployment—rests on interpreting all 366,120 responses through the three-layer framework (L4 value priorities, L3 evidence preferences, L2 source trust) introduced solely via citation to the author's prior work. The PRISM benchmark (14,175 scenarios per layer) is presented as isolating these layers, but no external validation, pre-registration, or independent falsifiability of the layer separation is shown; consistency metrics (PCS 57-69%, TRR 91-98%) only evaluate internal stability within the self-defined instrument. This creates load-bearing dependence on the self-citation for both the measurement apparatus and the interpretation of findings such as the 4:4 split and domain-specific surges.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the untested validity of the PRISM instrument and the three-layer Authority Stack framework introduced in the author's prior self-cited work; no independent evidence for these constructs is supplied in the abstract.

axioms (2)
  • domain assumption The PRISM forced-choice scenarios accurately isolate value priorities, evidence preferences, and source trust without introducing framing artifacts or domain biases.
    Invoked throughout the abstract as the basis for all reported win-rates and consistency scores.
  • domain assumption Responses at temperature 0 reflect stable model properties rather than prompt sensitivity or training artifacts.
    Assumed when interpreting PCS and TRR as evidence of measurable stacks.

pith-pipeline@v0.9.0 · 5604 in / 1579 out tokens · 61823 ms · 2026-05-10T16:13:02.048850+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

11 extracted references · 3 canonical work pages · 2 internal anchors

  1. Amodei, D., et al. (2016). Concrete problems in AI safety. arXiv:1606.06565
  2. Bai, Y., et al. (2022). Training a helpful and harmless assistant with RLHF. arXiv:2204.05862
  3. Hendrycks, D., et al. (2021). Aligning AI with shared human values. arXiv:2008.02275
  4. Lee, S. (2026a). AI Integrity: A new paradigm for verifiable AI governance. Preprint
  5. Lee, S. (2026c). PRISM Risk Signal Framework: Hierarchy-based red lines for AI behavioral risk. Preprint
  6. Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022
  7. Santurkar, S., et al. (2023). Whose opinions do language models reflect? ICML 2023
  8. Scherrer, N., Shi, C., Feder, A., & Blei, D. M. (2023). Evaluating the moral beliefs encoded in LLMs. NeurIPS 2023 (Spotlight)
  9. Schwartz, S. H. (1992). Universals in the content and structure of values. Advances in Experimental Social Psychology, 25, 1–65
  10. Schwartz, S. H. (2012). An overview of the Schwartz theory of basic values. Online Readings in Psychology and Culture, 2(1)
  11. Schwartz, S. H., et al. (2012). Refining the theory of basic individual values. Journal of Personality and Social Psychology, 103(4), 663–688