From specific-source feature-based to common-source score-based likelihood-ratio systems: ranking the stars
Pith reviewed 2026-05-08 09:18 UTC · model grok-4.3
The pith
Likelihood-ratio systems for trace-reference comparisons rank from specific-source feature-based (highest performing, hardest to realise) down to common-source score-based (lowest performing, most practical), with a direct performance-feasibility trade-off and one standout exception: common-source feature-based systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Applying strictly proper scoring rules to simulated trace-reference comparisons produces a ranking of LR system classes from specific-source feature-based to common-source anchored or non-anchored score-based. Performance and practical feasibility trade off directly, so the highest-performing class is the hardest to realise while the lowest-performing class is the easiest. Common-source feature-based systems are the single positive exception, combining good performance with lower experimental demands. Every class improves the updating of prior odds relative to using prior odds without any LR system.
What carries the argument
Classes of source-level likelihood-ratio (LR) systems, split by specific-source versus common-source and by feature-based versus score-based (anchored or non-anchored), compared via strictly proper scoring rules for performance and by experimental demands for feasibility.
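The machinery above can be sketched numerically. The following is a minimal, hypothetical illustration (not taken from the paper) of the two ingredients the comparison rests on: updating prior odds with an LR, and assessing the resulting posterior probability with two strictly proper scoring rules, the logarithmic score and the Brier score.

```python
import math

def posterior_prob(prior_odds: float, lr: float) -> float:
    """Update prior odds with a likelihood ratio; return the posterior probability."""
    post_odds = prior_odds * lr
    return post_odds / (1.0 + post_odds)

def log_score(p: float, outcome: int) -> float:
    """Strictly proper logarithmic score (higher is better): log p if the
    hypothesis turned out true, log(1 - p) otherwise."""
    return math.log(p) if outcome == 1 else math.log(1.0 - p)

def brier_score(p: float, outcome: int) -> float:
    """Strictly proper Brier score (lower is better)."""
    return (p - outcome) ** 2

# Hypothetical example: prior odds 1:4 for same-source, an LR of 10
# favouring same-source, ground truth same-source (outcome = 1).
p = posterior_prob(prior_odds=0.25, lr=10.0)
print(round(p, 3))                # → 0.714
print(round(log_score(p, 1), 3))  # → -0.336
```

Because both rules are strictly proper, a system whose LRs are informative and well calibrated improves the expected score relative to reporting the prior probability unchanged, which is the sense of "performance" used throughout.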
Load-bearing premise
That performance gaps measured by strictly proper scoring rules on simulated or idealized trace-reference comparisons will accurately reflect real-world differences across LR system classes without further assumptions about data distributions or case conditions.
What would settle it
A study of actual forensic casework outcomes showing that common-source score-based systems update prior odds more accurately than specific-source feature-based systems would falsify the claimed performance ranking.
Original abstract
This paper studies expected performance and practical feasibility of the most commonly used classes of source-level likelihood-ratio (LR) systems when applied to a trace-reference comparison problem. The paper compares performance of these classes of LR systems (used to update prior odds) to each other and to the use of prior odds only, using strictly proper scoring rules as performance measures. It also explores practical feasibility of the classes of LR systems. The present analysis allows for a ranking of these classes of LR systems: from specific-source feature-based to common-source anchored or non-anchored score-based. A trade-off between performance and practical feasibility is observed, meaning that the best performing class of LR systems is the hardest to realise in practice, while the least performing class is the easiest to realise in practice. The other classes of LR systems are in between the two extremes. The one positive exception is a common-source feature-based LR system, with good performance and relatively low experimental demands. The paper also argues against the claim that some classes of LR systems should not be used, by showing that all systems have merit (when updating prior odds) over just using the prior odds (i.e. not using the LR system).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript compares expected performance (via strictly proper scoring rules) and practical feasibility of source-level LR systems for trace-reference comparisons across four classes: specific-source feature-based, common-source feature-based, common-source anchored score-based, and common-source non-anchored score-based. It derives a ranking in which specific-source feature-based systems perform best, common-source score-based systems are intermediate, and common-source feature-based systems are a positive exception with strong performance and lower experimental demands. All classes are shown to outperform the use of prior odds alone, and a performance-feasibility trade-off is reported.
Significance. If the simulation-based ranking generalizes, the work offers practical guidance for forensic practitioners choosing LR systems by quantifying trade-offs between performance and implementation cost. The use of external strictly proper scoring rules provides an objective, non-circular performance metric, and the demonstration that every class improves on prior odds alone is a useful rebuttal to blanket prohibitions on certain LR approaches. The absence of concrete simulation details, quantitative tables, or real-data validation in the abstract, however, limits immediate applicability and requires verification that the ordering is not an artifact of the data-generating assumptions.
major comments (2)
- Abstract: the headline ranking (specific-source feature-based > common-source anchored/non-anchored score-based, with common-source feature-based as exception) and the performance-feasibility trade-off are asserted without any reported expected scores, simulation parameters, number of Monte Carlo replicates, or tables of results. Because the ordering is the central claim, the manuscript must supply these quantitative comparisons and the explicit distributional assumptions used to generate trace-reference pairs.
- Results section (inferred from abstract): the performance ordering rests on expected values of strictly proper scoring rules applied to simulated comparisons. No sensitivity analysis is described with respect to feature-distribution assumptions, source-variability parameters, or calibration of the score-based systems; if these assumptions systematically favor feature-based systems, the claimed ranking would not hold under realistic forensic data.
minor comments (1)
- Abstract: the phrase 'the present analysis allows for a ranking' is vague; a single sentence stating the explicit order (with the noted exception) would improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify how to strengthen the presentation of our simulation-based ranking of LR systems. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: Abstract: the headline ranking (specific-source feature-based > common-source anchored/non-anchored score-based, with common-source feature-based as exception) and the performance-feasibility trade-off are asserted without any reported expected scores, simulation parameters, number of Monte Carlo replicates, or tables of results. Because the ordering is the central claim, the manuscript must supply these quantitative comparisons and the explicit distributional assumptions used to generate trace-reference pairs.
Authors: We agree that the abstract should be more self-contained. In the revised version we will insert the key quantitative results (expected scores under the strictly proper scoring rules for each class), the number of Monte Carlo replicates, and the main distributional assumptions (multivariate normal features with specified within- and between-source covariance structures). This will allow readers to assess the reported ranking without first consulting the full text. revision: yes
- Referee: Results section (inferred from abstract): the performance ordering rests on expected values of strictly proper scoring rules applied to simulated comparisons. No sensitivity analysis is described with respect to feature-distribution assumptions, source-variability parameters, or calibration of the score-based systems; if these assumptions systematically favor feature-based systems, the claimed ranking would not hold under realistic forensic data.
Authors: The concern is valid. Although our simulations employ standard forensic assumptions (multivariate normal feature distributions with controlled source variability), we did not include a systematic sensitivity study. We will add a dedicated subsection that varies the degree of between-source variability, the dimensionality of the feature space, and the calibration procedure for the score-based systems. This will demonstrate the stability of the performance ordering and clarify the conditions under which the ranking holds. revision: yes
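For concreteness, here is a univariate stand-in for the kind of simulation the response describes (all parameter values are hypothetical; the paper's own setup is multivariate): a two-level normal model with within- and between-source variability, and a feature-based specific-source LR computed from it.

```python
import math
import random

# Hypothetical two-level normal model (univariate stand-in for the
# multivariate setup described in the response).
SIGMA_W = 1.0   # within-source standard deviation (assumed)
SIGMA_B = 3.0   # between-source standard deviation (assumed)

def norm_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def specific_source_lr(trace, source_mean):
    """Feature-based specific-source LR: numerator evaluates the trace under
    the known candidate source, denominator under the background population
    (the marginal over sources)."""
    numerator = norm_pdf(trace, source_mean, SIGMA_W)
    denominator = norm_pdf(trace, 0.0, math.sqrt(SIGMA_W**2 + SIGMA_B**2))
    return numerator / denominator

# A trace observed exactly at a candidate source mean that sits at the
# population centre yields LR = sqrt(1 + (SIGMA_B / SIGMA_W)**2).
print(round(specific_source_lr(0.0, 0.0), 3))  # → 3.162

# One simulated same-source comparison: draw a source, then a trace from it.
random.seed(1)
mu = random.gauss(0.0, SIGMA_B)
trace = random.gauss(mu, SIGMA_W)
print(round(specific_source_lr(trace, mu), 3))  # LR for this comparison
```

A sensitivity study of the kind the referee requests would sweep SIGMA_B/SIGMA_W (and, in the multivariate case, feature dimensionality and the calibration step) and recompute the expected scores for each class.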
Circularity Check
No circularity: ranking derived from external proper scoring rules applied to LR outputs
Full rationale
The paper evaluates classes of LR systems by computing expected values of strictly proper scoring rules on their outputs for trace-reference comparisons (simulated or idealized). This performance measure is independent of the internal definitions or parameterizations of the LR systems; proper scoring rules reward calibration and discrimination without reducing the ranking to a fit or self-definition. Feasibility assessment is qualitative and secondary. No load-bearing step quotes a self-citation as a uniqueness theorem or renames a fitted quantity as a prediction. The claim that all classes outperform prior odds follows directly from the properties of proper scoring rules whenever LR ≠ 1, without circularity. The derivation chain is self-contained against external benchmarks.
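The "merit over prior odds" point can be checked in a toy Monte Carlo. Everything here is illustrative rather than the paper's own setup: it assumes an idealized calibrated Gaussian log-LR system (log-LR distributed N(+σ²/2, σ²) under same-source and N(-σ²/2, σ²) under different-source) and shows that updating by its LRs yields a better average log score than leaving the prior odds untouched (LR = 1 everywhere).

```python
import math
import random

random.seed(0)

def expected_log_score(prior, lrs_h1, lrs_h2):
    """Average logarithmic score (higher is better) when prior odds are
    updated by the given LRs, weighting same-source (H1) and
    different-source (H2) comparisons equally."""
    def post(lr):
        odds = (prior / (1.0 - prior)) * lr
        return odds / (1.0 + odds)
    s1 = sum(math.log(post(lr)) for lr in lrs_h1) / len(lrs_h1)
    s2 = sum(math.log(1.0 - post(lr)) for lr in lrs_h2) / len(lrs_h2)
    return 0.5 * (s1 + s2)

# Idealized calibrated system (hypothetical parameters, for illustration).
sigma2 = 2.0
n = 10_000
lrs_h1 = [math.exp(random.gauss(+sigma2 / 2, math.sqrt(sigma2))) for _ in range(n)]
lrs_h2 = [math.exp(random.gauss(-sigma2 / 2, math.sqrt(sigma2))) for _ in range(n)]

with_lr = expected_log_score(0.5, lrs_h1, lrs_h2)
prior_only = expected_log_score(0.5, [1.0], [1.0])  # i.e. no LR system at all

print(with_lr > prior_only)  # True: the informative system beats the bare prior
```

This is exactly the non-circular structure described above: the scoring rule is external to the LR systems being ranked, and any system whose LRs carry information about the ground truth improves on LR = 1.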
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: strictly proper scoring rules provide a valid measure of expected performance for LR systems in trace-reference problems.