arXiv preprint arXiv:2407.02039
2 Pith papers cite this work.
abstract
Researchers are increasingly using language models (LMs) for text annotation. These approaches rely only on a prompt telling the model to return a given output according to a set of instructions. The reproducibility of LM outputs may nonetheless be vulnerable to small changes in the prompt design. This calls into question the replicability of classification routines. To tackle this problem, researchers have typically tested a variety of semantically similar prompts to determine what we call "prompt stability." These approaches remain ad hoc and task-specific. In this article, we propose a general framework for diagnosing prompt stability by adapting traditional approaches to intra- and inter-coder reliability scoring. We call the resulting metric the Prompt Stability Score (PSS) and provide a Python package, promptstability, for its estimation. Using six different datasets and twelve outcomes, we classify ~3.1m rows of data and ~300m input tokens to: a) diagnose when prompt stability is low; and b) demonstrate the functionality of the package. We conclude by providing best practice recommendations for applied researchers.
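The abstract describes adapting intra- and inter-coder reliability scoring to prompts but does not name the statistic involved. As a rough illustration only, the sketch below treats each repeated run (or each paraphrased prompt) as a "coder" and computes Krippendorff's alpha for nominal labels, which is one common reliability measure and is an assumption here. It does not use or reproduce the promptstability package API, and the label matrices are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha for nominal labels with no missing values.

    ratings: list of rows, one row per text; each row holds one label per
    "coder" (here: one per repeated run or per prompt variant).
    """
    # Coincidence matrix: every ordered pair of labels from different
    # coders within a unit contributes 1 / (m - 1).
    coincidences = Counter()
    for row in ratings:
        m = len(row)
        if m < 2:
            continue  # a unit with a single label carries no agreement info
        for a, b in permutations(row, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable labels.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)

    if expected == 0:
        # Only one label ever appears; treated as perfect agreement here
        # (a convention chosen for this sketch).
        return 1.0
    return 1.0 - observed / expected


# Hypothetical labels: 4 texts, each annotated in 3 repeated runs of the
# SAME prompt (an intra-prompt check).
intra_runs = [
    ["pos", "pos", "pos"],
    ["neg", "neg", "neg"],
    ["pos", "pos", "neg"],
    ["neg", "neg", "neg"],
]

# Hypothetical labels: the same 4 texts, one column per PARAPHRASED prompt
# (an inter-prompt check).
inter_prompts = [
    ["pos", "neg", "pos"],
    ["neg", "neg", "pos"],
    ["pos", "pos", "neg"],
    ["neg", "pos", "neg"],
]

intra_alpha = krippendorff_alpha_nominal(intra_runs)
inter_alpha = krippendorff_alpha_nominal(inter_prompts)
print(f"intra-prompt alpha: {intra_alpha:.3f}")
print(f"inter-prompt alpha: {inter_alpha:.3f}")
```

In this toy data the paraphrased prompts disagree more often than the repeated runs, so the inter-prompt value comes out lower. That gap is the kind of signal a prompt stability diagnostic is meant to surface.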
fields
cs.CL 2
years
2026 2
representative citing papers
Grain calibration decomposes theoretical constructs into clause-level components, tests each with extractive evidence, and combines results through explicit theory-derived rules to validate LLM coding beyond agreement with human annotators.
citing papers explorer
- LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics
  LLMs can provide cost-effective annotation of credibility in Danish asylum texts but produce inconsistent errors that vary by model and prompt, requiring checks beyond single-model accuracy.
- Correct codes for the wrong reasons? validating LLMs as measurement instruments for theoretical constructs
  Grain calibration decomposes theoretical constructs into clause-level components, tests each with extractive evidence, and combines results through explicit theory-derived rules to validate LLM coding beyond agreement with human annotators.