From specific-source feature-based to common-source score-based likelihood-ratio systems: ranking the stars
Pith reviewed 2026-05-08 09:18 UTC · model grok-4.3
The pith
Likelihood-ratio systems for trace-reference comparisons rank from specific-source feature-based (highest performing, hardest to realise) down to common-source score-based (lowest performing, most practical), with a direct performance-feasibility trade-off and one standout exception: common-source feature-based systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Applying strictly proper scoring rules to simulated trace-reference comparisons produces a ranking of LR system classes from specific-source feature-based to common-source anchored or non-anchored score-based. Performance and practical feasibility trade off directly, so the highest-performing class is the hardest to realise while the lowest-performing class is the easiest. Common-source feature-based systems are the single positive exception, combining good performance with lower experimental demands. Every class improves the updating of prior odds relative to using prior odds without any LR system.
What carries the argument
Classes of source-level likelihood-ratio (LR) systems, split by specific-source versus common-source and by feature-based versus score-based (anchored or non-anchored), compared via strictly proper scoring rules for performance and by experimental demands for feasibility.
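The machinery above can be sketched numerically. The following is a minimal, hypothetical illustration (not taken from the paper) of the two ingredients the comparison rests on: updating prior odds with an LR, and assessing the resulting posterior probability with two strictly proper scoring rules, the logarithmic score and the Brier score.

```python
import math

def posterior_prob(prior_odds: float, lr: float) -> float:
    """Update prior odds with a likelihood ratio; return the posterior probability."""
    post_odds = prior_odds * lr
    return post_odds / (1.0 + post_odds)

def log_score(p: float, outcome: int) -> float:
    """Strictly proper logarithmic score (higher is better): log p if the
    hypothesis turned out true, log(1 - p) otherwise."""
    return math.log(p) if outcome == 1 else math.log(1.0 - p)

def brier_score(p: float, outcome: int) -> float:
    """Strictly proper Brier score (lower is better)."""
    return (p - outcome) ** 2

# Hypothetical example: prior odds 1:4 for same-source, an LR of 10
# favouring same-source, ground truth same-source (outcome = 1).
p = posterior_prob(prior_odds=0.25, lr=10.0)
print(round(p, 3))                # → 0.714
print(round(log_score(p, 1), 3))  # → -0.336
```

Because both rules are strictly proper, a system whose LRs are informative and well calibrated improves the expected score relative to reporting the prior probability unchanged, which is the sense of "performance" used throughout.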
Load-bearing premise
That performance gaps measured by strictly proper scoring rules on simulated or idealized trace-reference comparisons will accurately reflect real-world differences across LR system classes without further assumptions about data distributions or case conditions.
What would settle it
A study of actual forensic casework outcomes showing that common-source score-based systems update prior odds more accurately than specific-source feature-based systems would falsify the claimed performance ranking.
Original abstract
This paper studies expected performance and practical feasibility of the most commonly used classes of source-level likelihood-ratio (LR) systems when applied to a trace-reference comparison problem. The paper compares performance of these classes of LR systems (used to update prior odds) to each other and to the use of prior odds only, using strictly proper scoring rules as performance measures. It also explores practical feasibility of the classes of LR systems. The present analysis allows for a ranking of these classes of LR systems: from specific-source feature-based to common-source anchored or non-anchored score-based. A trade-off between performance and practical feasibility is observed, meaning that the best performing class of LR systems is the hardest to realise in practice, while the least performing class is the easiest to realise in practice. The other classes of LR systems are in between the two extremes. The one positive exception is a common-source feature-based LR system, with good performance and relatively low experimental demands. The paper also argues against the claim that some classes of LR systems should not be used, by showing that all systems have merit (when updating prior odds) over just using the prior odds (i.e. not using the LR system).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript compares expected performance (via strictly proper scoring rules) and practical feasibility of source-level LR systems for trace-reference comparisons across four classes: specific-source feature-based, common-source feature-based, common-source anchored score-based, and common-source non-anchored score-based. It derives a ranking in which specific-source feature-based systems perform best, common-source score-based systems are intermediate, and common-source feature-based systems are a positive exception with strong performance and lower experimental demands. All classes are shown to outperform the use of prior odds alone, and a performance-feasibility trade-off is reported.
Significance. If the simulation-based ranking generalizes, the work offers practical guidance for forensic practitioners choosing LR systems by quantifying trade-offs between performance and implementation cost. The use of external strictly proper scoring rules provides an objective, non-circular performance metric, and the demonstration that every class improves on prior odds alone is a useful rebuttal to blanket prohibitions on certain LR approaches. The absence of concrete simulation details, quantitative tables, or real-data validation in the abstract, however, limits immediate applicability and requires verification that the ordering is not an artifact of the data-generating assumptions.
major comments (2)
- Abstract: the headline ranking (specific-source feature-based > common-source anchored/non-anchored score-based, with common-source feature-based as exception) and the performance-feasibility trade-off are asserted without any reported expected scores, simulation parameters, number of Monte Carlo replicates, or tables of results. Because the ordering is the central claim, the manuscript must supply these quantitative comparisons and the explicit distributional assumptions used to generate trace-reference pairs.
- Results section (inferred from abstract): the performance ordering rests on expected values of strictly proper scoring rules applied to simulated comparisons. No sensitivity analysis is described with respect to feature-distribution assumptions, source-variability parameters, or calibration of the score-based systems; if these assumptions systematically favor feature-based systems, the claimed ranking would not hold under realistic forensic data.
minor comments (1)
- Abstract: the phrase 'the present analysis allows for a ranking' is vague; a single sentence stating the explicit order (with the noted exception) would improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify how to strengthen the presentation of our simulation-based ranking of LR systems. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: Abstract: the headline ranking (specific-source feature-based > common-source anchored/non-anchored score-based, with common-source feature-based as exception) and the performance-feasibility trade-off are asserted without any reported expected scores, simulation parameters, number of Monte Carlo replicates, or tables of results. Because the ordering is the central claim, the manuscript must supply these quantitative comparisons and the explicit distributional assumptions used to generate trace-reference pairs.
Authors: We agree that the abstract should be more self-contained. In the revised version we will insert the key quantitative results (expected scores under the strictly proper scoring rules for each class), the number of Monte Carlo replicates, and the main distributional assumptions (multivariate normal features with specified within- and between-source covariance structures). This will allow readers to assess the reported ranking without first consulting the full text. revision: yes
- Referee: Results section (inferred from abstract): the performance ordering rests on expected values of strictly proper scoring rules applied to simulated comparisons. No sensitivity analysis is described with respect to feature-distribution assumptions, source-variability parameters, or calibration of the score-based systems; if these assumptions systematically favor feature-based systems, the claimed ranking would not hold under realistic forensic data.
Authors: The concern is valid. Although our simulations employ standard forensic assumptions (multivariate normal feature distributions with controlled source variability), we did not include a systematic sensitivity study. We will add a dedicated subsection that varies the degree of between-source variability, the dimensionality of the feature space, and the calibration procedure for the score-based systems. This will demonstrate the stability of the performance ordering and clarify the conditions under which the ranking holds. revision: yes
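For concreteness, here is a univariate stand-in for the kind of simulation the response describes (all parameter values are hypothetical; the paper's own setup is multivariate): a two-level normal model with within- and between-source variability, and a feature-based specific-source LR computed from it.

```python
import math
import random

# Hypothetical two-level normal model (univariate stand-in for the
# multivariate setup described in the response).
SIGMA_W = 1.0   # within-source standard deviation (assumed)
SIGMA_B = 3.0   # between-source standard deviation (assumed)

def norm_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def specific_source_lr(trace, source_mean):
    """Feature-based specific-source LR: numerator evaluates the trace under
    the known candidate source, denominator under the background population
    (the marginal over sources)."""
    numerator = norm_pdf(trace, source_mean, SIGMA_W)
    denominator = norm_pdf(trace, 0.0, math.sqrt(SIGMA_W**2 + SIGMA_B**2))
    return numerator / denominator

# A trace observed exactly at a candidate source mean that sits at the
# population centre yields LR = sqrt(1 + (SIGMA_B / SIGMA_W)**2).
print(round(specific_source_lr(0.0, 0.0), 3))  # → 3.162

# One simulated same-source comparison: draw a source, then a trace from it.
random.seed(1)
mu = random.gauss(0.0, SIGMA_B)
trace = random.gauss(mu, SIGMA_W)
print(round(specific_source_lr(trace, mu), 3))  # LR for this comparison
```

A sensitivity study of the kind the referee requests would sweep SIGMA_B/SIGMA_W (and, in the multivariate case, feature dimensionality and the calibration step) and recompute the expected scores for each class.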
Circularity Check
No circularity: ranking derived from external proper scoring rules applied to LR outputs
Full rationale
The paper evaluates classes of LR systems by computing expected values of strictly proper scoring rules on their outputs for trace-reference comparisons (simulated or idealized). This performance measure is independent of the internal definitions or parameterizations of the LR systems; proper scoring rules reward calibration and discrimination without reducing the ranking to a fit or self-definition. Feasibility assessment is qualitative and secondary. No load-bearing step quotes a self-citation as a uniqueness theorem or renames a fitted quantity as a prediction. The claim that all classes outperform prior odds follows directly from the properties of proper scoring rules whenever LR ≠ 1, without circularity. The derivation chain is self-contained against external benchmarks.
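The "merit over prior odds" point can be checked in a toy Monte Carlo. Everything here is illustrative rather than the paper's own setup: it assumes an idealized calibrated Gaussian log-LR system (log-LR distributed N(+σ²/2, σ²) under same-source and N(-σ²/2, σ²) under different-source) and shows that updating by its LRs yields a better average log score than leaving the prior odds untouched (LR = 1 everywhere).

```python
import math
import random

random.seed(0)

def expected_log_score(prior, lrs_h1, lrs_h2):
    """Average logarithmic score (higher is better) when prior odds are
    updated by the given LRs, weighting same-source (H1) and
    different-source (H2) comparisons equally."""
    def post(lr):
        odds = (prior / (1.0 - prior)) * lr
        return odds / (1.0 + odds)
    s1 = sum(math.log(post(lr)) for lr in lrs_h1) / len(lrs_h1)
    s2 = sum(math.log(1.0 - post(lr)) for lr in lrs_h2) / len(lrs_h2)
    return 0.5 * (s1 + s2)

# Idealized calibrated system (hypothetical parameters, for illustration).
sigma2 = 2.0
n = 10_000
lrs_h1 = [math.exp(random.gauss(+sigma2 / 2, math.sqrt(sigma2))) for _ in range(n)]
lrs_h2 = [math.exp(random.gauss(-sigma2 / 2, math.sqrt(sigma2))) for _ in range(n)]

with_lr = expected_log_score(0.5, lrs_h1, lrs_h2)
prior_only = expected_log_score(0.5, [1.0], [1.0])  # i.e. no LR system at all

print(with_lr > prior_only)  # True: the informative system beats the bare prior
```

This is exactly the non-circular structure described above: the scoring rule is external to the LR systems being ranked, and any system whose LRs carry information about the ground truth improves on LR = 1.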
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: strictly proper scoring rules provide a valid measure of expected performance for LR systems in trace-reference problems.