Learning study similarity to investigate heterogeneity in meta-analysis using LLMs and triplet loss

Faculty of Medicine; Freiburg im Breisgau; Germany); Harald Binder (1); Kanella Panagiotopoulou (1); Medical Center- University of Freiburg; Statistics; Theodoros Evrenoglou (1) ((1) Institute of Medical Biometry

arxiv: 2605.29603 · v2 · pith:QKLWQJGAnew · submitted 2026-05-28 · 📊 stat.ME

Learning study similarity to investigate heterogeneity in meta-analysis using LLMs and triplet loss

Kanella Panagiotopoulou (1) , Harald Binder (1) , Theodoros Evrenoglou (1) ((1) Institute of Medical Biometry , Statistics , Faculty of Medicine , Medical Center- University of Freiburg , Freiburg im Breisgau , Germany) This is my paper

Pith reviewed 2026-06-29 05:57 UTC · model grok-4.3

classification 📊 stat.ME

keywords meta-analysisheterogeneityLLMtriplet lossembedding modelobservational studiesstudy similaritypreterm birth

0 comments

The pith

LLMs generate study triplets that an embedding model trained with triplet loss uses to cluster similar observational studies and lower apparent heterogeneity before meta-analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Meta-analyses of observational studies frequently encounter high between-study heterogeneity that makes pooled estimates hard to interpret. The paper describes a method that feeds study characteristics into an LLM to produce triplets of one anchor study plus a similar and a dissimilar counterpart. These triplets train an embedding model via triplet loss so that clinically comparable studies lie close together in a learned vector space. Clustering within that space then permits separate meta-analyses on the resulting groups. In the preterm-birth cognitive-outcomes example, one of the three clusters retained markedly lower heterogeneity, produced a more extreme pooled effect, and showed a narrower prediction interval than the analysis of all 58 studies.

Core claim

By processing study-level clinical and methodological features with an LLM to form triplets and then training an embedding model with triplet loss, the framework maps studies into a similarity space whose clusters exhibit lower within-group heterogeneity than the full collection, enabling more precise within-cluster inference.

What carries the argument

Embedding model trained with triplet loss on LLM-generated study triplets (anchor, similar, dissimilar) that learns a similarity space for subsequent clustering.

If this is right

Within the identified clusters, between-study heterogeneity is lower than in the overall meta-analysis.
One homogeneous cluster yields a more extreme pooled effect estimate than the full-set analysis.
Prediction intervals become narrower inside the homogeneous cluster relative to the overall analysis.
Study characteristics are incorporated prior to model fitting rather than explored only after fitting a single model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same triplet-generation and embedding steps could be applied to other evidence-synthesis tasks that require grouping studies by similarity before pooling.
Clusters produced this way might serve as a data-driven alternative to pre-specified subgroup analyses that are vulnerable to selective reporting.
Simulated datasets in which known effect modifiers are planted would allow direct checking of whether the learned clusters recover the planted structure.

Load-bearing premise

The LLM-generated triplets, once embedded, produce clusters whose lower within-cluster heterogeneity arises from genuine similarity captured by the model rather than chance or post-hoc selection of groupings.

What would settle it

Randomly assigning the 58 studies to three groups of comparable sizes and showing that the learned clusters do not have statistically lower within-group heterogeneity than the random partitions.

read the original abstract

Meta-analyses of observational studies often show substantial between-study heterogeneity, limiting the interpretability of pooled estimates. Meta-regression can be used to explore heterogeneity, but it is often underpowered to handle multiple effect modifiers. We propose a novel framework that integrates large language models (LLMs) with deep metric learning to infer study-level similarity prior to meta-analysis. Study-level clinical and methodological characteristics were processed by an LLM to generate study triplets (anchor, similar, dissimilar). These triplets were constructed by treating each study as an anchor and comparing it with pairs of other studies to identify, in each instance, the study most similar to the anchor. Then, the triplets were used into an embedding model trained with triplet loss; a deep learning approach that learns an embedding space where clinically and methodologically similar studies are clustered together. We apply our framework to a meta-analysis dataset of 58 observational studies comparing cognitive outcomes between preterm- and term-born children. Subsequently, we fit meta-analysis models within the identified study clusters and compare the results with those of the overall analysis. Results suggested three clusters two of which retained considerable between-study heterogeneity. The remaining cluster comprised the most homogeneous group of studies and exhibited a more extreme pooled effect estimate together with a narrower prediction interval compared with the overall analysis. This work presents a novel approach for exploring heterogeneity in meta-analysis by incorporating study characteristics prior to model fitting. By transforming study information into a similarity space, the framework identifies coherent subgroups and supports more precise inference in heterogeneous real-world evidence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's main move is an LLM-triplet pipeline to embed and cluster studies before meta-analysis, but the 58-study example shows only partial heterogeneity reduction with no null-model checks.

read the letter

The paper's headline claim is that feeding study characteristics into an LLM to build triplets, then training an embedding with triplet loss, can produce clusters that support tighter meta-analysis. In the preterm cognitive outcomes example it yields three clusters; one is more homogeneous with a narrower prediction interval and a stronger pooled effect.

What is new is the specific sequence: LLM-generated triplets (anchor plus most-similar and dissimilar studies) followed by triplet-loss training and then standard random-effects meta-analysis inside each cluster. The authors apply it to a real set of 58 observational studies and report the within-cluster results side-by-side with the overall analysis.

The method is described clearly enough that a reader can see how the triplets are constructed and how the embedding is trained. That is useful for anyone who wants to experiment with the same idea.

The soft spots are straightforward. The abstract gives no comparison of the observed within-cluster heterogeneity against what would be expected from random partitions of the same 58 studies. Two of the three clusters still show substantial heterogeneity, and the authors highlight the remaining one without a pre-specified rule for which cluster to emphasize. With small N, both chance and post-hoc selection are plausible explanations for the apparent improvement. No sensitivity checks on the LLM prompting or comparison against ordinary meta-regression appear in the description.

This is for meta-analysts who already work with observational data and are curious about embedding methods for subgroup discovery. A reading group could discuss the pipeline, but the current evidence does not yet show that the clustering step reliably improves inference beyond what simpler approaches achieve.

I would send it for peer review only if the authors add a null-model comparison, code, and at least one sensitivity analysis; without those it is too preliminary for a serious referee.

Referee Report

3 major / 2 minor

Summary. The paper proposes a framework that uses LLMs to generate study triplets (anchor, similar, dissimilar) from clinical and methodological characteristics, trains an embedding model via triplet loss to produce a similarity space, performs clustering on the resulting embeddings, and then fits separate meta-analyses within the identified clusters. Applied to 58 observational studies on cognitive outcomes in preterm- versus term-born children, the approach yields three clusters; two retain substantial heterogeneity while the third is described as more homogeneous, with a more extreme pooled effect and narrower prediction interval than the overall analysis. The central claim is that transforming study information into a learned similarity space identifies coherent subgroups and thereby supports more precise inference in heterogeneous meta-analyses.

Significance. If the reported reduction in within-cluster heterogeneity can be shown to exceed what is expected under random partitioning or post-hoc selection, the method would represent a useful extension of unsupervised subgroup discovery in meta-analysis. It leverages textual study descriptions via LLMs and metric learning in a way that could complement or extend traditional meta-regression when multiple modifiers are present. The absence of any machine-checked proofs, reproducible code release, or pre-registered falsifiable predictions limits the immediate strength of the contribution, but the core idea of pre-clustering via learned embeddings is novel within the statistical methodology literature.

major comments (3)

[Results] Results section describing the three clusters: the manuscript states that one cluster is 'the most homogeneous group' with narrower prediction interval and more extreme effect, yet provides no comparison of the observed within-cluster τ² or I² values against the null distribution obtained by randomly partitioning the 58 studies into groups of comparable sizes; without such a check the reported reduction could arise from sampling variation or from selecting the minimum-heterogeneity partition after inspection.
[Methods] Methods section on triplet construction and clustering: no pre-specified rule or decision criterion is given for choosing which of the three clusters to highlight as the 'coherent subgroup' supporting more precise inference; the selection of the homogeneous cluster appears to have been made post-hoc on the basis of the heterogeneity statistics themselves.
[Discussion] No section or supplementary material compares the proposed LLM-triplet-loss pipeline to standard meta-regression (or to simple random-effects meta-analysis with study-level covariates) on the same 58-study dataset; such a benchmark is required to assess whether the embedding-based clustering yields meaningfully lower residual heterogeneity than conventional approaches.

minor comments (2)

[Abstract] Abstract: the phrase 'used into an embedding model' is grammatically incorrect and should read 'used in an embedding model'.
[Abstract] Abstract and Methods: the exact procedure for generating the 'most similar' and 'dissimilar' studies for each anchor is described only at a high level; a concrete example or pseudocode would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive comments. We respond to each major comment below, indicating where revisions will be made to the manuscript.

read point-by-point responses

Referee: [Results] Results section describing the three clusters: the manuscript states that one cluster is 'the most homogeneous group' with narrower prediction interval and more extreme effect, yet provides no comparison of the observed within-cluster τ² or I² values against the null distribution obtained by randomly partitioning the 58 studies into groups of comparable sizes; without such a check the reported reduction could arise from sampling variation or from selecting the minimum-heterogeneity partition after inspection.

Authors: We agree that demonstrating the reduction exceeds what would be expected under random partitioning would strengthen the results. In the revised manuscript, we will add a supplementary analysis comparing the within-cluster heterogeneity to that obtained from 1000 random partitions of the 58 studies into three groups of sizes matching the observed clusters. revision: yes
Referee: [Methods] Methods section on triplet construction and clustering: no pre-specified rule or decision criterion is given for choosing which of the three clusters to highlight as the 'coherent subgroup' supporting more precise inference; the selection of the homogeneous cluster appears to have been made post-hoc on the basis of the heterogeneity statistics themselves.

Authors: The framework is designed as an exploratory tool for identifying potential subgroups, and we present the heterogeneity statistics for all three clusters in the results. The highlighted cluster is the one with the lowest heterogeneity, which aligns with the goal of finding more homogeneous groups. We will revise the methods and discussion to explicitly state that cluster selection is based on post-clustering heterogeneity measures and that the approach is intended for hypothesis generation rather than confirmatory analysis. revision: partial
Referee: [Discussion] No section or supplementary material compares the proposed LLM-triplet-loss pipeline to standard meta-regression (or to simple random-effects meta-analysis with study-level covariates) on the same 58-study dataset; such a benchmark is required to assess whether the embedding-based clustering yields meaningfully lower residual heterogeneity than conventional approaches.

Authors: We recognize the importance of this benchmark. In the revised manuscript, we will include a comparison in the supplementary materials where we apply meta-regression using available study-level covariates (e.g., study design, sample size, year of publication) and report the residual τ², allowing direct comparison to the within-cluster heterogeneity from our method. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is exploratory and self-contained

full rationale

The paper describes generating triplets via LLM, training an embedding model with triplet loss, performing clustering, and then fitting separate meta-analyses within clusters. No equations, fitted parameters, or self-citations are shown that would make the reported within-cluster homogeneity or narrower prediction interval a direct algebraic or statistical consequence of the same data used to define the clusters. The derivation chain remains independent of its outputs, with the observed reduction in heterogeneity treated as an empirical finding rather than a definitional necessity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.1-grok · 5846 in / 1155 out tokens · 18661 ms · 2026-06-29T05:57:30.861635+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

10 extracted references · 9 canonical work pages · 1 internal anchor

[1]

Quantifying heterogeneity in a meta-analysis

Higgins JPT, Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med. 2002;21(11):1539-1558. doi:10.1002/sim.1186

work page doi:10.1002/sim.1186 2002
[2]

A re-evaluation of random-effects meta-analysis

Higgins JPT, Thompson SG, Spiegelhalter DJ. A re-evaluation of random-effects meta-analysis. J R Stat Soc Ser A Stat Soc. 2009;172(1):137-159. doi:10.1111/j.1467-985X.2008.00552.x

work page doi:10.1111/j.1467-985x.2008.00552.x 2009
[3]

Why sources of heterogeneity in meta-analysis should be investigated

Thompson SG. Why sources of heterogeneity in meta-analysis should be investigated. BMJ. 1994;309(6965):1351-1355. doi:10.1136/bmj.309.6965.1351

work page doi:10.1136/bmj.309.6965.1351 1994
[4]

Cochrane handbook for systematic reviews of interventions

Higgins JPT, Thomas J, Chandler J, et al. Cochrane handbook for systematic reviews of interventions. In: 2nd ed. 6.4. Wiley; 2019

2019
[5]

Meta-analysis models relaxing the random-effects normality assumption: methodological systematic review and simulation study

Panagiotopoulou K, Evrenoglou T, Schmid CH, Metelli S, Chaimani A. Meta-analysis models relaxing the random-effects normality assumption: methodological systematic review and simulation study. BMC Med Res Methodol. 2025;25(1):231. doi:10.1186/s12874-025-02658-3

work page doi:10.1186/s12874-025-02658-3 2025
[6]

Limitations of Meta-analyses of Studies With High Heterogeneity

Imrey PB. Limitations of Meta-analyses of Studies With High Heterogeneity. JAMA Netw Open. 2020;3(1):e1919325. doi:10.1001/jamanetworkopen.2019.19325

work page doi:10.1001/jamanetworkopen.2019.19325 2020
[7]

Dealing with substantial heterogeneity in Cochrane reviews

Schroll JB, Moustgaard R, Gøtzsche PC. Dealing with substantial heterogeneity in Cochrane reviews. Cross-sectional study. BMC Med Res Methodol. 2011;11:22. doi:10.1186/1471-2288-11-22

work page doi:10.1186/1471-2288-11-22 2011
[8]

Interpretation of random effects meta-analyses

Riley RD, Higgins JPT, Deeks JJ. Interpretation of random effects meta-analyses. BMJ. 2011;342:d549. doi:10.1136/bmj.d549

work page doi:10.1136/bmj.d549 2011
[9]

Embeddings and Representation Learning for Structured Data

Shrier I, Boivin JF, Steele RJ, et al. Should meta-analyses of interventions include observational studies in addition to randomized controlled trials? A critical examination of underlying principles. Am J Epidemiol. 2007;166(10):1203-1209. doi:10.1093/aje/kwm189 10.O’Connor AM, Sargeant JM. Meta-analyses including data from observational studies. Spec Is...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1093/aje/kwm189 2007
[10]

Risk for Behavioral Problems Independent of Cognitive Functioning in Children Born at Low Gestational Ages

doi:10.1001/jamapediatrics.2025.2221 27.Domellöf E, Johansson AM, Farooqi A, Domellöf M, Rönnqvist L. Risk for Behavioral Problems Independent of Cognitive Functioning in Children Born at Low Gestational Ages. Front Pediatr. 2020;8:311. doi:10.3389/fped.2020.00311 28.Martínez-Cruz CF, Poblano A, Fernández-Carrocera LA, Jiménez-Quiróz R, Tuyú-Torres N. Ass...

work page doi:10.1001/jamapediatrics.2025.2221 2025

[1] [1]

Quantifying heterogeneity in a meta-analysis

Higgins JPT, Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med. 2002;21(11):1539-1558. doi:10.1002/sim.1186

work page doi:10.1002/sim.1186 2002

[2] [2]

A re-evaluation of random-effects meta-analysis

Higgins JPT, Thompson SG, Spiegelhalter DJ. A re-evaluation of random-effects meta-analysis. J R Stat Soc Ser A Stat Soc. 2009;172(1):137-159. doi:10.1111/j.1467-985X.2008.00552.x

work page doi:10.1111/j.1467-985x.2008.00552.x 2009

[3] [3]

Why sources of heterogeneity in meta-analysis should be investigated

Thompson SG. Why sources of heterogeneity in meta-analysis should be investigated. BMJ. 1994;309(6965):1351-1355. doi:10.1136/bmj.309.6965.1351

work page doi:10.1136/bmj.309.6965.1351 1994

[4] [4]

Cochrane handbook for systematic reviews of interventions

Higgins JPT, Thomas J, Chandler J, et al. Cochrane handbook for systematic reviews of interventions. In: 2nd ed. 6.4. Wiley; 2019

2019

[5] [5]

Meta-analysis models relaxing the random-effects normality assumption: methodological systematic review and simulation study

Panagiotopoulou K, Evrenoglou T, Schmid CH, Metelli S, Chaimani A. Meta-analysis models relaxing the random-effects normality assumption: methodological systematic review and simulation study. BMC Med Res Methodol. 2025;25(1):231. doi:10.1186/s12874-025-02658-3

work page doi:10.1186/s12874-025-02658-3 2025

[6] [6]

Limitations of Meta-analyses of Studies With High Heterogeneity

Imrey PB. Limitations of Meta-analyses of Studies With High Heterogeneity. JAMA Netw Open. 2020;3(1):e1919325. doi:10.1001/jamanetworkopen.2019.19325

work page doi:10.1001/jamanetworkopen.2019.19325 2020

[7] [7]

Dealing with substantial heterogeneity in Cochrane reviews

Schroll JB, Moustgaard R, Gøtzsche PC. Dealing with substantial heterogeneity in Cochrane reviews. Cross-sectional study. BMC Med Res Methodol. 2011;11:22. doi:10.1186/1471-2288-11-22

work page doi:10.1186/1471-2288-11-22 2011

[8] [8]

Interpretation of random effects meta-analyses

Riley RD, Higgins JPT, Deeks JJ. Interpretation of random effects meta-analyses. BMJ. 2011;342:d549. doi:10.1136/bmj.d549

work page doi:10.1136/bmj.d549 2011

[9] [9]

Embeddings and Representation Learning for Structured Data

Shrier I, Boivin JF, Steele RJ, et al. Should meta-analyses of interventions include observational studies in addition to randomized controlled trials? A critical examination of underlying principles. Am J Epidemiol. 2007;166(10):1203-1209. doi:10.1093/aje/kwm189 10.O’Connor AM, Sargeant JM. Meta-analyses including data from observational studies. Spec Is...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1093/aje/kwm189 2007

[10] [10]

Risk for Behavioral Problems Independent of Cognitive Functioning in Children Born at Low Gestational Ages

doi:10.1001/jamapediatrics.2025.2221 27.Domellöf E, Johansson AM, Farooqi A, Domellöf M, Rönnqvist L. Risk for Behavioral Problems Independent of Cognitive Functioning in Children Born at Low Gestational Ages. Front Pediatr. 2020;8:311. doi:10.3389/fped.2020.00311 28.Martínez-Cruz CF, Poblano A, Fernández-Carrocera LA, Jiménez-Quiróz R, Tuyú-Torres N. Ass...

work page doi:10.1001/jamapediatrics.2025.2221 2025