A multi-task evaluation of LLMs’ processing of academic text input
3 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (3)
Verdicts: UNVERDICTED (3)
citing papers explorer
- Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges
LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).
- Whose Name Comes Up? III: Persona Prompting Effects in LLM-Based Scholar Recommendation
Audits of 43 LLMs show that varying persona prompts (language, location, role-and-task) and context affects technical quality and social representativeness of scholar recommendations, with location impacting diversity and factuality.
- LLM-based Schema-Guided Extraction and Validation of Missing-Person Intelligence from Heterogeneous Data Sources
The Guardian Parser Pack pipeline extracts structured intelligence from heterogeneous missing-person documents using schema-guided LLM assistance, achieving an F1 of 0.866 on 75 cases versus 0.258 for a deterministic baseline.