pith. machine review for the scientific record.

arxiv: 2604.11104 · v1 · submitted 2026-04-13 · 💻 cs.AI · cs.IR · cs.LG · cs.NE

Recognition: unknown

Frugal Knowledge Graph Construction with Local LLMs: A Zero-Shot Pipeline, Self-Consistency and Wisdom of Artificial Crowds

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:27 UTC · model grok-4.3

classification 💻 cs.AI · cs.IR · cs.LG · cs.NE
keywords zero-shot learning · knowledge graph construction · local LLMs · self-consistency · multi-hop reasoning · relation extraction · RAGAS evaluation · frugal inference

The pith

A zero-shot pipeline with local LLMs extracts document relations at 0.70 F1 and reaches 0.55 EM on multi-hop questions via model routing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether local large language models can build and use knowledge graphs from text without any training data or fine-tuning. It runs a full pipeline on consumer hardware that extracts relations from documents, converts text to queries, and performs multi-hop reasoning. On 500 DocRED relations the system scores 0.70 F1, close to a supervised baseline of 0.80. Self-consistency sampling and a routing cascade between models lift multi-hop exact match from 0.46 to 0.55. The work also shows that strong agreement among repeated samples often marks collective hallucination instead of reliability.
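
To make the extraction stage concrete, here is a minimal sketch of a zero-shot relation-extraction call; the prompt template and the generate() interface are illustrative assumptions, since the paper's exact templates are described only at a high level.

    import json

    def generate(prompt: str, temperature: float = 0.0) -> str:
        """Placeholder for a local LLM call (e.g. a llama.cpp or vLLM server)."""
        raise NotImplementedError

    # Hypothetical prompt; not the paper's published template.
    RELATION_PROMPT = (
        "Extract (head, relation, tail) triples from the text below.\n"
        'Answer with a JSON list of objects with keys "head", "relation", "tail".\n\n'
        "Text: {document}\nTriples:"
    )

    def extract_relations(document: str) -> list[dict]:
        raw = generate(RELATION_PROMPT.format(document=document))
        try:
            return json.loads(raw)  # the model is instructed to emit strict JSON
        except json.JSONDecodeError:
            return []               # unparseable output counts as zero extracted triples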

Core claim

A reproducible zero-shot pipeline executed entirely with local inference achieves an F1 of 0.70 ± 0.041 on 500 document-level relations, compared with 0.80 for supervised DREEAM. Text-to-query reaches 0.80 accuracy on 200 samples. Multi-hop reasoning on 500 HotpotQA questions yields a baseline EM of 0.46, which self-consistency and a Phi-4 → GPT-OSS cascade improve to 0.55 while rerouting 45.4 percent of questions. High consensus among samples signals collective hallucination rather than correctness.
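
For reference, the EM numbers above are binary matches after answer normalization; a minimal sketch, assuming the standard SQuAD/HotpotQA-style normalization (the review does not spell out the exact variant used):

    import re
    import string

    def normalize(text: str) -> str:
        """Lowercase, strip punctuation and the articles a/an/the, collapse whitespace."""
        text = text.lower()
        text = "".join(ch for ch in text if ch not in string.punctuation)
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return " ".join(text.split())

    def exact_match(prediction: str, gold: str) -> bool:
        return normalize(prediction) == normalize(gold)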

What carries the argument

The multi-model zero-shot pipeline that combines relation extraction, text-to-query conversion, and multi-hop reasoning with self-consistency sampling at temperature 0.7 and a confidence-routing cascade across architectures.
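
A minimal sketch of how self-consistency voting and the routing cascade can compose; the vote-share confidence measure and the 0.6 threshold are assumptions, since the exact routing criterion is not stated here (a point the referee raises below).

    from collections import Counter
    from typing import Callable

    AskFn = Callable[[str, float], str]  # (question, temperature) -> answer

    def self_consistent_answer(ask: AskFn, question: str, k: int = 5,
                               temperature: float = 0.7) -> tuple[str, float]:
        """Sample k answers at T=0.7; return the majority answer and its vote share."""
        answers = [ask(question, temperature) for _ in range(k)]
        best, votes = Counter(answers).most_common(1)[0]
        return best, votes / k

    def cascade(ask_small: AskFn, ask_large: AskFn, question: str,
                threshold: float = 0.6) -> str:
        """Reroute to the larger model only when the small model's consensus is weak
        (threshold is hypothetical; the paper's criterion may differ)."""
        answer, confidence = self_consistent_answer(ask_small, question)
        if confidence >= threshold:
            return answer
        answer, _ = self_consistent_answer(ask_large, question)
        return answer

In the paper's setup the first model plays the Phi-4 role and the fallback plays GPT-OSS, with 45.4 percent of questions taking the second branch.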

Load-bearing premise

The tested local models and sampling settings maintain similar accuracy on new documents and tasks, with no hidden data leakage and no prompt tuning specific to the benchmarks.

What would settle it

Applying the identical pipeline to a fresh benchmark of 500 unseen documents with different relation types and measuring whether F1 falls below 0.55 or multi-hop EM falls below 0.40.
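
A minimal sketch of that decision rule, assuming triple-level set F1 (names and the set representation are illustrative):

    def set_f1(predicted: set, gold: set) -> float:
        """F1 between predicted and gold relation triples, treated as sets."""
        tp = len(predicted & gold)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold) if gold else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    def pipeline_generalizes(f1: float, em: float) -> bool:
        """The test proposed above: fail if F1 < 0.55 or multi-hop EM < 0.40."""
        return f1 >= 0.55 and em >= 0.40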

read the original abstract

This paper presents an empirical study of a multi-model zero-shot pipeline for knowledge graph construction and exploitation, executed entirely through local inference on consumer-grade hardware. We propose a reproducible evaluation framework integrating two external benchmarks (DocRED, HotpotQA), WebQuestionsSP-style synthetic data, and the RAGAS evaluation framework in an automated pipeline. On 500 document-level relations, our system achieves an F1 of 0.70 ± 0.041 in zero-shot, compared to 0.80 for supervised DREEAM. Text-to-query achieves an accuracy of 0.80 ± 0.06 on 200 samples. Multi-hop reasoning achieves an Exact Match (EM) of 0.46 ± 0.04 on 500 HotpotQA questions, with a RAGAS faithfulness of 0.96 ± 0.04 on 50 samples. Beyond the pipeline, we study diversity mechanisms for difficult multi-hop reasoning. On 181 questions unsolvable at zero temperature, self-consistency (k=5, T=0.7) recovers up to 23% EM with a single Mixture-of-Experts (MoE) model, but the cross-model oracle (3 architectures × 5 samples) reaches 46.4%. We highlight an agreement paradox: strong consensus among samples signals collective hallucination rather than a reliable answer, echoing the work of Moussaïd et al. on the wisdom of crowds. Extending to the full pipeline (500 questions), self-consistency (k=3) raises EM from 0.46 to 0.48 ± 0.04. A confidence-routing cascade mechanism (Phi-4 → GPT-OSS, k=5) achieves an EM of 0.55 ± 0.04, the best result obtained, with 45.4% of questions rerouted. Finally, we show that V3 prompt engineering applied to other models does not reproduce the gains observed with Gemma-4, confirming the specific prompt/model interaction. The entire system runs in ~5 h on a single RTX 3090, without any training, for an estimated carbon footprint of 0.09 kg CO2 eq.
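
The closing carbon figure is easy to sanity-check with back-of-envelope arithmetic; the board power and grid intensity below are assumptions (roughly an RTX 3090 at full load on a low-carbon grid), not the paper's stated accounting.

    power_kw = 0.350        # assumed RTX 3090 board power, kW
    hours = 5.0             # reported wall-clock time
    grid_intensity = 0.050  # assumed grid intensity, kg CO2 eq per kWh

    energy_kwh = power_kw * hours            # 1.75 kWh
    footprint = energy_kwh * grid_intensity  # ~0.0875 kg CO2 eq
    print(f"{footprint:.2f} kg CO2 eq")      # 0.09 kg CO2 eq, matching the abstract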

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents an empirical study of a zero-shot, multi-model pipeline for knowledge graph construction and multi-hop reasoning using only local LLMs on consumer hardware. It integrates DocRED, HotpotQA, synthetic data, and RAGAS in an automated framework, reporting F1 of 0.70 ± 0.041 on 500 DocRED document-level relations (vs. 0.80 for supervised DREEAM), 0.80 accuracy on text-to-query, and EM improvements from 0.46 to 0.55 via self-consistency (k=3/5) and a Phi-4 → GPT-OSS confidence-routing cascade on HotpotQA, with discussion of an agreement paradox and prompt/model specificity. The system runs in ~5 h on an RTX 3090 with an estimated footprint of 0.09 kg CO2 eq.

Significance. If the baseline comparisons and generalization claims hold, the work provides concrete evidence that frugal local-LLM pipelines can approach supervised performance on document-level relation extraction and multi-hop QA without any training or fine-tuning. The emphasis on reproducibility, hardware accessibility, carbon accounting, and mechanisms like self-consistency and cross-model routing adds practical value for resource-constrained settings.

major comments (2)
  1. [Abstract / DocRED evaluation] Abstract and results section on DocRED: the central claim that the zero-shot pipeline reaches F1 0.70 ± 0.041 'compared to 0.80 for supervised DREEAM' on the same 500 document-level relations is load-bearing for competitiveness; the manuscript must explicitly state whether DREEAM (or an equivalent supervised model) was re-evaluated on precisely those 500 instances or whether the 0.80 figure is taken from the published full DocRED test set, as post-hoc subset selection could inflate the apparent gap.
  2. [Multi-hop reasoning and cascade experiments] HotpotQA multi-hop results and cascade mechanism: the reported EM lift from 0.46 to 0.55 ± 0.04 via the Phi-4 → GPT-OSS (k=5) routing with 45.4% rerouting is a key empirical contribution, but the paper should provide the exact confidence threshold or routing criterion used and confirm that the 500-question set was fixed before any post-hoc analysis of unsolvable cases at T=0.
minor comments (2)
  1. [Method / Pipeline description] The manuscript should release the exact prompt templates, sampling parameters (k, T), and model versions (including any quantization) used for each stage to enable full reproduction, as these are described only at a high level.
  2. [Self-consistency and wisdom-of-crowds discussion] The figure or table presenting the agreement-paradox analysis would benefit from explicit statistical tests (e.g., a correlation between consensus strength and error rate) rather than qualitative description; one possible form of such a test is sketched after this list.
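
One shape the suggested test could take, as a hedged sketch: a point-biserial correlation between per-question vote share and a binary error flag (variable names and data layout are illustrative).

    from scipy.stats import pointbiserialr

    def agreement_vs_error(vote_share: list[float], is_wrong: list[int]) -> tuple[float, float]:
        """Correlate consensus strength with error; a significantly positive r would
        support 'strong consensus signals collective hallucination'."""
        r, p = pointbiserialr(is_wrong, vote_share)
        return r, p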

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will make the requested clarifications in the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract / DocRED evaluation] Abstract and results section on DocRED: the central claim that the zero-shot pipeline reaches F1 0.70 ± 0.041 'compared to 0.80 for supervised DREEAM' on the same 500 document-level relations is load-bearing for competitiveness; the manuscript must explicitly state whether DREEAM (or an equivalent supervised model) was re-evaluated on precisely those 500 instances or whether the 0.80 figure is taken from the published full DocRED test set, as post-hoc subset selection could inflate the apparent gap.

    Authors: We agree this distinction is necessary for transparent interpretation. The 0.80 DREEAM figure is taken directly from the original DREEAM publication on the full DocRED test set; we did not re-run the supervised model on our 500-instance subset, as our focus was exclusively on zero-shot local-LLM methods and re-implementing supervised baselines was outside the scope. We will revise the abstract and results section to state this explicitly and add a brief note acknowledging that the comparison is not on an identical subset. This provides the required context without altering the reported zero-shot F1. revision: yes

  2. Referee: [Multi-hop reasoning and cascade experiments] HotpotQA multi-hop results and cascade mechanism: the reported EM lift from 0.46 to 0.55 ± 0.04 via the Phi-4 → GPT-OSS (k=5) routing with 45.4% rerouting is a key empirical contribution, but the paper should provide the exact confidence threshold or routing criterion used and confirm that the 500-question set was fixed before any post-hoc analysis of unsolvable cases at T=0.

    Authors: We confirm that the 500-question HotpotQA set was selected and fixed prior to any experiments, including the identification of the 181 questions unsolvable at T=0; all self-consistency, oracle, and cascade results are reported on this pre-defined set. We will add the exact routing criterion and confidence threshold description to the methods and results sections in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurements on external benchmarks

full rationale

The paper reports direct experimental results (F1, EM, accuracy, RAGAS scores) from running a zero-shot LLM pipeline on public datasets (DocRED, HotpotQA, synthetic data) with fixed sampling parameters. No derivations, equations, fitted parameters renamed as predictions, or self-citations are used to justify central claims. All numbers are obtained by executing the described pipeline on the stated sample sizes; the comparison to DREEAM is a literature reference rather than a self-referential reduction. The work is self-contained against external benchmarks with no load-bearing self-definition or ansatz smuggling.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The work is empirical and relies on standard assumptions about LLM zero-shot behavior plus experimental hyperparameters; no new physical or mathematical entities are introduced.

free parameters (2)
  • self-consistency sample count k = 3-5
    Number of generations (k=5 or k=3) used for majority voting or routing decisions
  • sampling temperature T = 0.7
    Temperature (T=0.7) chosen to generate response diversity
axioms (2)
  • domain assumption Local LLMs can perform zero-shot relation extraction and multi-hop reasoning when prompted appropriately
    Core premise enabling the entire pipeline and all reported metrics
  • domain assumption The selected benchmarks and synthetic data are representative of real-world KG construction tasks
    Invoked when generalizing the F1 and EM scores to the proposed system

pith-pipeline@v0.9.0 · 5742 in / 1634 out tokens · 98004 ms · 2026-05-10T15:27:02.096421+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

20 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, A. Madotto, and P. Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.

  2. [2] E. Strubell, A. Ganesh, and A. McCallum. Energy and policy considerations for deep learning in NLP. In Proc. 57th Annual Meeting of the ACL, pages 3645–3650, 2019.

  3. [3] Y. Yao, D. Ye, P. Li, X. Han, Y. Lin, Z. Liu, Z. Liu, L. Huang, J. Zhou, and M. Sun. DocRED: A large-scale document-level relation extraction dataset. In Proc. 57th Annual Meeting of the ACL, pages 764–777, 2019.

  4. [4] W.-t. Yih, M. Richardson, C. Meek, M.-W. Chang, and J. Suh. The value of semantic parse labeling for knowledge base question answering. In Proc. 54th Annual Meeting of the ACL, pages 201–206, 2016.

  5. [5] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proc. EMNLP 2018, pages 2369–2380, 2018.

  6. [6] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in NeurIPS, volume 33, pages 9459–9474, 2020.

  7. [7] S. Es, J. James, L. Espinosa-Anke, and S. Schockaert. RAGAS: Automated evaluation of retrieval augmented generation. In Proc. 18th EACL: System Demonstrations, pages 150–163, 2024.

  8. [8] W. Zhou, K. Huang, T. Ma, and J. Huang. Document-level relation extraction with adaptive thresholding and localized context pooling. In Proc. AAAI 2021, volume 35, pages 14612–14620, 2021.

  9. [9] Y. Ma, A. Wang, and N. Okazaki. DREEAM: Guiding attention with evidence for improving document-level relation extraction. In Proc. 17th EACL, pages 1963–1975, 2023.

  10. [10] S. Wadhwa, S. Amir, and B. C. Wallace. Revisiting relation extraction in the era of large language models. In Proc. 61st Annual Meeting of the ACL, pages 15566–15589, 2023.

  11. [11] B. Li, G. Fang, Y. Yang, Q. Wang, W. Ye, W. Zhao, and S. Zhang. Evaluating ChatGPT’s information extraction capabilities. arXiv preprint arXiv:2304.11633, 2023.

  12. [12] Y. Ozyurt, S. Feuerriegel, and C. Zhang. Document-level in-context few-shot relation extraction via pre-trained language models. arXiv preprint arXiv:2310.11085, 2023.

  13. [13] P.-L. Huguet Cabot and R. Navigli. REBEL: Relation extraction by end-to-end language generation. In Findings of EMNLP 2021, pages 2370–2381, 2021.

  14. [14] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu. Unifying large language models and knowledge graphs: A roadmap. IEEE Trans. Knowledge and Data Engineering, 36(7):3580–3599, 2024.

  15. [15] Y. Gu, X. Deng, and Y. Su. Don’t generate, discriminate: A proposal for grounding language models to real-world environments. In Proc. 61st Annual Meeting of the ACL, pages 4928–4949, 2023.

  16. [16] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in NeurIPS, volume 35, pages 24824–24837, 2022.

  17. [17] A. Lacoste, A. Luccioni, V. Schmidt, and T. Dandres. Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700, 2019.

  18. [18] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. In Proc. ICLR, 2023.

  19. [19] M. Moussaïd, J. E. Kämmer, P. P. Analytis, and H. Neth. Social influence and the collective dynamics of opinion formation. PLoS ONE, 8(11):e78433, 2013.

  20. [20] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999.