pith. machine review for the scientific record.

arxiv: 2604.23990 · v1 · submitted 2026-04-27 · 💻 cs.AI

Recognition: unknown

Failure-Centered Runtime Evaluation for Deployed Trilingual Public-Space Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:36 UTC · model grok-4.3

classification 💻 cs.AI
keywords failure-centered evaluation · runtime evaluation · trilingual agents · policy drift · public-space agents · deployment testing · regression testing

The pith

Failure-centered evaluation exposes cross-language drifts in deployed trilingual agents that aggregate scores hide.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that runtime systems require a shift in the basic unit of analysis from final scores to traceable failures. It introduces PSA-Eval to extend the evaluation process into a cycle that identifies, repairs, and regression-tests specific failures. The framework treats trilingual equivalent questions as controlled probes to detect group-level policy differences across languages. A pilot study on a real digital front-desk system found high overall scores alongside substantial drifts in many question groups. This makes deployment issues visible and actionable rather than masked by averages.

Core claim

When evaluation targets a runtime system instead of a static input-output mapping, the basic unit must shift from score to failure. PSA-Eval implements the shift by extending the chain to Question -> Batch -> Run -> Score -> Failure Case -> Repair -> Regression Batch, rendering failures traceable, reviewable, repairable, and regression-testable. Using trilingual equivalent inputs on a deployed single-model front-desk system, the pilot recorded an average score of 23.15/24 yet found non-zero cross-language drift in 14 of 27 groups, with a maximum of 9 points.
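
To make the contrast between aggregate score and group-level drift concrete, here is a minimal Python sketch of the drift measure the claim implies: for each trilingual group, drift is the gap between the highest and lowest of the three language scores. The group IDs, scores, and the `group_drift` helper are illustrative assumptions on the paper's 0-24 scale, not the paper's data or code.

```python
# Minimal sketch (not the paper's code): group-level cross-language drift as the
# gap between the highest and lowest language score within one trilingual group.
# Scores and group IDs are illustrative placeholders on the paper's 0-24 scale.
from statistics import mean

# scores[group_id][language] -> rubric score
scores = {
    "G01": {"zh_cn": 24, "zh_hk": 24, "en": 24},  # no drift
    "G02": {"zh_cn": 23, "zh_hk": 15, "en": 24},  # drift of 9 points
    "G03": {"zh_cn": 22, "zh_hk": 24, "en": 21},  # drift of 3 points
}

def group_drift(group_scores: dict) -> int:
    """Cross-language drift of one group: max score minus min score."""
    return max(group_scores.values()) - min(group_scores.values())

drifts = {g: group_drift(s) for g, s in scores.items()}
overall = mean(v for s in scores.values() for v in s.values())

print(f"aggregate average: {overall:.2f}/24")                        # can look high
print("groups with non-zero drift:", [g for g, d in drifts.items() if d > 0])
print("max drift:", max(drifts.values()))                            # while drift persists
```

The point of the toy numbers is only that a near-ceiling average and a 9-point group drift can coexist in the same data.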

What carries the argument

The PSA-Eval evaluation chain that inserts Failure Case -> Repair -> Regression Batch after scoring, paired with trilingual equivalent inputs as probes for observing cross-language policy drift.
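
A hedged sketch of what the extended chain could look like as a data model. The paper defines the chain conceptually; the dataclasses, field names, and the triage threshold in `to_failure` below are editorial assumptions chosen to show how a low-scoring run becomes a traceable, repairable, regression-testable object.

```python
# Editorial data-model sketch of the extended chain
# Question -> Batch -> Run -> Score -> Failure Case -> Repair -> Regression Batch.
# Class and field names are assumptions for illustration, not the paper's schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Question:
    group_id: str          # trilingual equivalent group, e.g. "G07"
    language: str          # "zh_cn", "zh_hk", or "en"
    text: str

@dataclass
class Run:
    question: Question
    answer: str
    score: int             # rubric score on the 0-24 scale

@dataclass
class FailureCase:
    run: Run
    diagnosis: str                                 # what went wrong, as reviewed
    repairs: list[str] = field(default_factory=list)

@dataclass
class RegressionBatch:
    failure: FailureCase
    questions: list[Question]                      # re-probes the same group after repair

def to_failure(run: Run, threshold: int = 20) -> Optional[FailureCase]:
    """Triage step: a low-scoring run becomes a traceable failure case."""
    if run.score < threshold:
        return FailureCase(run=run, diagnosis="below triage threshold")
    return None
```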

If this is right

  • Aggregate scores can conceal structured inconsistencies in multilingual runtime behavior.
  • Specific failures become directly linked to repair actions and subsequent regression batches.
  • Group-level drift measurements provide deployment signals usable for targeted maintenance.
  • The method applies to live public-space systems without requiring separate A/B model comparisons.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same failure-tracing cycle could be adapted to detect other runtime inconsistencies, such as those arising from context length or user demographics.
  • Prioritizing repair of high-drift failure groups may improve consistency more efficiently than optimizing average scores alone.
  • Public deployments of multilingual agents could adopt periodic failure audits as a standard monitoring practice.

Load-bearing premise

Trilingual equivalent inputs serve as valid controlled probes that reveal genuine cross-language policy drift rather than artifacts of phrasing or scoring rules.

What would settle it

A controlled test in which repeated identical questions in one language produce score variations comparable to the observed cross-language differences would falsify the claim that the drifts represent language-specific policy inconsistencies.
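
One way to run that settling test is a permutation comparison between within-language repeat ranges and cross-language group drifts: if the two distributions are statistically indistinguishable, the observed drift is plausibly run-to-run noise rather than language-specific policy. The sketch below uses placeholder data and a simple `permutation_p_value` helper written for this page; none of it comes from the paper.

```python
# Editorial sketch of the settling experiment: compare score ranges from repeated
# identical prompts in one language against cross-language drifts per group.
# All numbers are illustrative placeholders, not measurements from the paper.
import random

# ranges from repeated identical prompts within one language (placeholder data)
within_language_ranges = [0, 1, 0, 2, 1, 0, 0, 1]
# cross-language drifts per trilingual group (placeholder data)
cross_language_drifts = [0, 3, 9, 0, 4, 1, 0, 5]

def permutation_p_value(a: list, b: list, trials: int = 10_000) -> float:
    """One-sided permutation test: could mean(b) exceed mean(a) by chance?"""
    observed = sum(b) / len(b) - sum(a) / len(a)
    pooled = a + b
    hits = 0
    for _ in range(trials):
        random.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = sum(perm_b) / len(perm_b) - sum(perm_a) / len(perm_a)
        if diff >= observed:
            hits += 1
    return hits / trials

p = permutation_p_value(within_language_ranges, cross_language_drifts)
print(f"p ~ {p:.3f}: small values favor genuine cross-language drift "
      "over ordinary run-to-run noise")
```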

Figures

Figures reproduced from arXiv: 2604.23990 by M. Meng.

Figure 1. The runtime evaluation loop of PSA-Eval. The upper part shows the conventional static evaluation pipeline for comparison; the lower part shows how PSA-Eval connects a trilingual equivalent question bank, model/prompt/policy configuration, automatic triage, human review, a failure-case repository, repair, and regression batches into a continuously running evaluation process.

Figure 2. Failure case as a traceable evaluation object.

Figure 3. Pilot overview: score and risk distributions. Per-language statistics shown in the figure:

  Language             Samples   Avg. score   Min. score   Max. score
  Mandarin (zh_cn)     27        23.19        19           24
  Cantonese (zh_hk)    27        22.89        15           24
  English (en)         27        23.37        19           24

Figure 4. Overall distribution of group-level score drift and the top-5 high-drift groups.

Figure 5. Auto-Judge calibration bias and D7 saturation.
Original abstract

This paper presents PSA-Eval, a failure-centered runtime evaluation framework for deployed trilingual public-space agents. The central claim is that, when the evaluation object shifts from a static input-output mapping to a runtime system, the basic unit of analysis should shift from score to failure. PSA-Eval extends the conventional chain Question -> Answer -> Score -> End into Question -> Batch -> Run -> Score -> Failure Case -> Repair -> Regression Batch, making failures traceable, reviewable, repairable, and regression-testable. The framework uses trilingual equivalent inputs as controlled probes for observing group-level cross-language policy drift. We conduct a pilot study on a real trilingual digital front-desk system deployed in the lobby of an international financial institution. The pilot uses a simplified single-foundation-model setting (MA = MB), so the observed drift should not be interpreted as an A/B foundation-model difference. The study contains 81 samples organized into 27 trilingual equivalent question groups. Although the system achieves an average score of 23.15/24, 14 groups show non-zero cross-language score drift, 5 groups show drift of at least 3 points, and the maximum drift reaches 9 points. These results provide initial evidence that failure-centered runtime evaluation can expose structured deployment signals hidden by aggregate scoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes PSA-Eval, a failure-centered runtime evaluation framework for deployed trilingual public-space agents. It argues that evaluation of runtime systems should treat failure (rather than aggregate score) as the basic unit of analysis, extending the conventional Question-Answer-Score chain into a traceable loop of Batch-Run-Score-Failure Case-Repair-Regression Batch. Using trilingual equivalent inputs as controlled probes, the authors report a pilot on a real deployed digital front-desk system comprising 81 samples in 27 groups; despite an aggregate score of 23.15/24, 14 groups exhibit non-zero cross-language drift, 5 groups show drift of at least 3 points, and the maximum drift is 9 points. The study is conducted under a single-foundation-model simplification (MA = MB).

Significance. If the pilot measurements are robust, the work supplies concrete evidence that failure-centered evaluation can surface structured cross-language policy signals that aggregate scoring conceals. The use of a live deployed system and the reporting of specific drift counts (14 non-zero, 5 >=3 points) constitute a practical strength; the framework's emphasis on traceable, repairable failures also offers a clear operational path for maintainers of multilingual agents.

major comments (2)
  1. [pilot study] Pilot study description (81 samples, 27 trilingual groups): the manuscript provides no explicit validation that the trilingual equivalent inputs are semantically equivalent (e.g., back-translation checks, independent linguist review, or inter-rater agreement on equivalence). Without this, the reported drifts cannot be confidently attributed to runtime policy differences rather than phrasing or rubric artifacts, which directly undermines the central claim that the framework exposes 'structured deployment signals hidden by aggregate scoring.'
  2. [abstract / pilot results] Abstract and pilot results: scoring rules and the procedure for assigning the 0-24 scores are not described. It is therefore impossible to assess whether the observed cross-language differences (max drift 9) reflect genuine policy drift or inconsistent application of the rubric across languages, a load-bearing issue for interpreting the 14 non-zero drift groups.
minor comments (2)
  1. [abstract] The acronym PSA-Eval is introduced in the title and abstract without an immediate expansion; a parenthetical definition on first use would improve readability.
  2. [pilot study] The single-foundation-model simplification (MA = MB) is noted but its implications for generalizing the drift findings to multi-model deployments could be stated more explicitly in the discussion.
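
For context on major comment 1, here is a minimal sketch of the lightest-weight equivalence check the referee lists, inter-rater agreement: two annotators independently judge each trilingual group as semantically equivalent or not, and Cohen's kappa summarizes their agreement. The labels and group counts below are placeholders, not data from the paper or its revision.

```python
# Editorial sketch of an inter-rater agreement check on semantic equivalence.
# Labels are placeholder judgments (1 = equivalent, 0 = not equivalent).

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two binary raters over the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a1 = sum(rater_a) / n
    p_b1 = sum(rater_b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (observed - expected) / (1 - expected)

# one judgment per trilingual group (placeholders for 10 of the 27 groups)
rater_a = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
rater_b = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]

print(f"Cohen's kappa = {cohens_kappa(rater_a, rater_b):.2f}")
```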

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our pilot study. These points identify areas where additional detail will improve the manuscript's transparency. We address each major comment below and will revise the paper accordingly.

Point-by-point responses
  1. Referee: [pilot study] Pilot study description (81 samples, 27 trilingual groups): the manuscript provides no explicit validation that the trilingual equivalent inputs are semantically equivalent (e.g., back-translation checks, independent linguist review, or inter-rater agreement on equivalence). Without this, the reported drifts cannot be confidently attributed to runtime policy differences rather than phrasing or rubric artifacts, which directly undermines the central claim that the framework exposes 'structured deployment signals hidden by aggregate scoring.'

    Authors: We agree that the absence of explicit semantic-equivalence validation is a limitation in the current manuscript and weakens confidence in attributing drifts solely to policy differences. The trilingual inputs were prepared to be equivalent, but the manuscript does not document validation steps such as back-translation or inter-rater review. In the revision we will add a subsection describing the input-construction process and any equivalence checks that were performed, along with a discussion of remaining limitations. This will allow readers to assess the robustness of the reported cross-language drifts. revision: yes

  2. Referee: [abstract / pilot results] Abstract and pilot results: scoring rules and the procedure for assigning the 0-24 scores are not described. It is therefore impossible to assess whether the observed cross-language differences (max drift 9) reflect genuine policy drift or inconsistent application of the rubric across languages, a load-bearing issue for interpreting the 14 non-zero drift groups.

    Authors: We concur that the scoring rules and assignment procedure must be described for the drift results to be interpretable. The manuscript currently omits these details. In the revised version we will expand the pilot-study section to present the full scoring rubric (including the criteria that sum to the 0-24 range), the exact procedure used to apply the rubric, and how consistency was maintained across languages. This addition will directly address concerns about whether the observed drifts (including the maximum of 9 points) arise from policy differences or rubric application. revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework definition and pilot measurements are independent.

full rationale

The paper introduces PSA-Eval as an independent framework that extends the evaluation chain to include failure cases and regression testing. The pilot results (27 trilingual groups, 81 samples, reported drift statistics) are direct empirical measurements collected from an external deployed system rather than quantities fitted, predicted, or derived from the framework's own parameters or definitions. No self-citations, ansatzes, uniqueness theorems, or renamings of known results appear as load-bearing steps in the derivation. The trilingual probes are presented as an assumption for observing drift, but the observed numbers do not reduce to the framework inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on two domain assumptions: that matched trilingual questions isolate language effects, and that failures can be made reviewable and regression-testable without additional formal definitions of 'failure' or 'repair'.

axioms (2)
  • domain assumption Trilingual equivalent inputs control for semantic content while varying only language, allowing observation of policy drift.
    Invoked to justify the probe design in the pilot study.
  • domain assumption Failures identified in runtime can be repaired and re-tested in regression batches to confirm consistency.
    Core to the extended evaluation chain presented.
invented entities (1)
  • PSA-Eval framework no independent evidence
    purpose: To shift evaluation unit from score to traceable failure case with repair and regression steps.
    Newly introduced evaluation pipeline; no independent evidence provided beyond the pilot description.

pith-pipeline@v0.9.0 · 5525 in / 1278 out tokens · 33922 ms · 2026-05-08T03:36:55.707909+00:00 · methodology

discussion (0)

