pith. machine review for the scientific record.

arxiv: 2604.11104 · v1 · submitted 2026-04-13 · 💻 cs.AI · cs.IR · cs.LG · cs.NE

Recognition: unknown

Frugal Knowledge Graph Construction with Local LLMs: A Zero-Shot Pipeline, Self-Consistency and Wisdom of Artificial Crowds

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:27 UTC · model grok-4.3

classification 💻 cs.AI · cs.IR · cs.LG · cs.NE
keywords zero-shot learning · knowledge graph construction · local LLMs · self-consistency · multi-hop reasoning · relation extraction · RAGAS evaluation · frugal inference

The pith

A zero-shot pipeline with local LLMs extracts document relations at 0.70 F1 and reaches 0.55 EM on multi-hop questions via model routing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether local large language models can build and use knowledge graphs from text without any training data or fine-tuning. It runs a full pipeline on consumer hardware that extracts relations from documents, converts text to queries, and performs multi-hop reasoning. On 500 DocRED relations the system scores 0.70 F1, close to a supervised baseline of 0.80. Self-consistency sampling and a routing cascade between models lift multi-hop exact match from 0.46 to 0.55. The work also shows that strong agreement among repeated samples often marks collective hallucination instead of reliability.
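
To make the extraction stage concrete, here is a minimal sketch of a zero-shot relation-extraction call; the prompt template and the generate() interface are illustrative assumptions, since the paper's exact templates are described only at a high level.

    import json

    def generate(prompt: str, temperature: float = 0.0) -> str:
        """Placeholder for a local LLM call (e.g. a llama.cpp or vLLM server)."""
        raise NotImplementedError

    # Hypothetical prompt; not the paper's published template.
    RELATION_PROMPT = (
        "Extract (head, relation, tail) triples from the text below.\n"
        'Answer with a JSON list of objects with keys "head", "relation", "tail".\n\n'
        "Text: {document}\nTriples:"
    )

    def extract_relations(document: str) -> list[dict]:
        raw = generate(RELATION_PROMPT.format(document=document))
        try:
            return json.loads(raw)  # the model is instructed to emit strict JSON
        except json.JSONDecodeError:
            return []               # unparseable output counts as zero extracted triples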

Core claim

A reproducible zero-shot pipeline executed entirely with local inference achieves an F1 of 0.70 ± 0.041 on 500 document-level relations, compared with 0.80 for supervised DREEAM. Text-to-query reaches 0.80 accuracy on 200 samples. Multi-hop reasoning on 500 HotpotQA questions yields a baseline EM of 0.46, which self-consistency and a Phi-4 → GPT-OSS cascade improve to 0.55 while rerouting 45.4 percent of questions. High consensus among samples signals collective hallucination rather than correctness.
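
For reference, the EM numbers above are binary matches after answer normalization; a minimal sketch, assuming the standard SQuAD/HotpotQA-style normalization (the review does not spell out the exact variant used):

    import re
    import string

    def normalize(text: str) -> str:
        """Lowercase, strip punctuation and the articles a/an/the, collapse whitespace."""
        text = text.lower()
        text = "".join(ch for ch in text if ch not in string.punctuation)
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return " ".join(text.split())

    def exact_match(prediction: str, gold: str) -> bool:
        return normalize(prediction) == normalize(gold)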

What carries the argument

The multi-model zero-shot pipeline that combines relation extraction, text-to-query conversion, and multi-hop reasoning with self-consistency sampling at temperature 0.7 and a confidence-routing cascade across architectures.
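
A minimal sketch of how self-consistency voting and the routing cascade can compose; the vote-share confidence measure and the 0.6 threshold are assumptions, since the exact routing criterion is not stated here (a point the referee raises below).

    from collections import Counter
    from typing import Callable

    AskFn = Callable[[str, float], str]  # (question, temperature) -> answer

    def self_consistent_answer(ask: AskFn, question: str, k: int = 5,
                               temperature: float = 0.7) -> tuple[str, float]:
        """Sample k answers at T=0.7; return the majority answer and its vote share."""
        answers = [ask(question, temperature) for _ in range(k)]
        best, votes = Counter(answers).most_common(1)[0]
        return best, votes / k

    def cascade(ask_small: AskFn, ask_large: AskFn, question: str,
                threshold: float = 0.6) -> str:
        """Reroute to the larger model only when the small model's consensus is weak
        (threshold is hypothetical; the paper's criterion may differ)."""
        answer, confidence = self_consistent_answer(ask_small, question)
        if confidence >= threshold:
            return answer
        answer, _ = self_consistent_answer(ask_large, question)
        return answer

In the paper's setup the first model plays the Phi-4 role and the fallback plays GPT-OSS, with 45.4 percent of questions taking the second branch.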

Load-bearing premise

The tested local models and sampling settings maintain similar accuracy on new documents and tasks, with no hidden data leakage and no prompt tuning specific to the benchmarks.

What would settle it

Applying the identical pipeline to a fresh benchmark of 500 unseen documents with different relation types and measuring whether F1 falls below 0.55 or multi-hop EM falls below 0.40.
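
A minimal sketch of that decision rule, assuming triple-level set F1 (names and the set representation are illustrative):

    def set_f1(predicted: set, gold: set) -> float:
        """F1 between predicted and gold relation triples, treated as sets."""
        tp = len(predicted & gold)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold) if gold else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    def pipeline_generalizes(f1: float, em: float) -> bool:
        """The test proposed above: fail if F1 < 0.55 or multi-hop EM < 0.40."""
        return f1 >= 0.55 and em >= 0.40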

read the original abstract

This paper presents an empirical study of a multi-model zero-shot pipeline for knowledge graph construction and exploitation, executed entirely through local inference on consumer-grade hardware. We propose a reproducible evaluation framework integrating two external benchmarks (DocRED, HotpotQA), WebQuestionsSP-style synthetic data, and the RAGAS evaluation framework in an automated pipeline. On 500 document-level relations, our system achieves an F1 of 0.70 ± 0.041 in zero-shot, compared to 0.80 for supervised DREEAM. Text-to-query achieves an accuracy of 0.80 ± 0.06 on 200 samples. Multi-hop reasoning achieves an Exact Match (EM) of 0.46 ± 0.04 on 500 HotpotQA questions, with a RAGAS faithfulness of 0.96 ± 0.04 on 50 samples. Beyond the pipeline, we study diversity mechanisms for difficult multi-hop reasoning. On 181 questions unsolvable at zero temperature, self-consistency (k=5, T=0.7) recovers up to 23% EM with a single Mixture-of-Experts (MoE) model, but the cross-model oracle (3 architectures × 5 samples) reaches 46.4%. We highlight an agreement paradox: strong consensus among samples signals collective hallucination rather than a reliable answer, echoing the work of Moussaïd et al. on the wisdom of crowds. Extending to the full pipeline (500 questions), self-consistency (k=3) raises EM from 0.46 to 0.48 ± 0.04. A confidence-routing cascade mechanism (Phi-4 → GPT-OSS, k=5) achieves an EM of 0.55 ± 0.04, the best result obtained, with 45.4% of questions rerouted. Finally, we show that V3 prompt engineering applied to other models does not reproduce the gains observed with Gemma-4, confirming the specific prompt/model interaction. The entire system runs in ~5 h on a single RTX 3090, without any training, for an estimated carbon footprint of 0.09 kg CO2 eq.
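
The closing carbon figure is easy to sanity-check with back-of-envelope arithmetic; the board power and grid intensity below are assumptions (roughly an RTX 3090 at full load on a low-carbon grid), not the paper's stated accounting.

    power_kw = 0.350        # assumed RTX 3090 board power, kW
    hours = 5.0             # reported wall-clock time
    grid_intensity = 0.050  # assumed grid intensity, kg CO2 eq per kWh

    energy_kwh = power_kw * hours            # 1.75 kWh
    footprint = energy_kwh * grid_intensity  # ~0.0875 kg CO2 eq
    print(f"{footprint:.2f} kg CO2 eq")      # 0.09 kg CO2 eq, matching the abstract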

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents an empirical study of a zero-shot, multi-model pipeline for knowledge graph construction and multi-hop reasoning using only local LLMs on consumer hardware. It integrates DocRED, HotpotQA, synthetic data, and RAGAS in an automated framework, reporting F1 of 0.70 ± 0.041 on 500 DocRED document-level relations (vs. 0.80 for supervised DREEAM), 0.80 accuracy on text-to-query, and EM improvements from 0.46 to 0.55 via self-consistency (k=3/5) and a Phi-4 → GPT-OSS confidence-routing cascade on HotpotQA, with discussion of an agreement paradox and prompt/model specificity. The system runs in ~5 h on an RTX 3090 with an estimated footprint of 0.09 kg CO2 eq.

Significance. If the baseline comparisons and generalization claims hold, the work provides concrete evidence that frugal local-LLM pipelines can approach supervised performance on document-level relation extraction and multi-hop QA without any training or fine-tuning. The emphasis on reproducibility, hardware accessibility, carbon accounting, and mechanisms like self-consistency and cross-model routing adds practical value for resource-constrained settings.

major comments (2)
  1. [Abstract / DocRED evaluation] Abstract and results section on DocRED: the central claim that the zero-shot pipeline reaches F1 0.70 ± 0.041 'compared to 0.80 for supervised DREEAM' on the same 500 document-level relations is load-bearing for competitiveness; the manuscript must explicitly state whether DREEAM (or an equivalent supervised model) was re-evaluated on precisely those 500 instances or whether the 0.80 figure is taken from the published full DocRED test set, as post-hoc subset selection could inflate the apparent gap.
  2. [Multi-hop reasoning and cascade experiments] HotpotQA multi-hop results and cascade mechanism: the reported EM lift from 0.46 to 0.55 ± 0.04 via the Phi-4 → GPT-OSS (k=5) routing with 45.4% rerouting is a key empirical contribution, but the paper should provide the exact confidence threshold or routing criterion used and confirm that the 500-question set was fixed before any post-hoc analysis of unsolvable cases at T=0.
minor comments (2)
  1. [Method / Pipeline description] The manuscript should release the exact prompt templates, sampling parameters (k, T), and model versions (including any quantization) used for each stage to enable full reproduction, as these are described only at a high level.
  2. [Self-consistency and wisdom-of-crowds discussion] The figure or table presenting the agreement-paradox analysis would benefit from explicit statistical tests (e.g., a correlation between consensus strength and error rate) rather than qualitative description; one possible form of such a test is sketched after this list.
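
One shape the suggested test could take, as a hedged sketch: a point-biserial correlation between per-question vote share and a binary error flag (variable names and data layout are illustrative).

    from scipy.stats import pointbiserialr

    def agreement_vs_error(vote_share: list[float], is_wrong: list[int]) -> tuple[float, float]:
        """Correlate consensus strength with error; a significantly positive r would
        support 'strong consensus signals collective hallucination'."""
        r, p = pointbiserialr(is_wrong, vote_share)
        return r, p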

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will make the requested clarifications in the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract / DocRED evaluation] Abstract and results section on DocRED: the central claim that the zero-shot pipeline reaches F1 0.70 ± 0.041 'compared to 0.80 for supervised DREEAM' on the same 500 document-level relations is load-bearing for competitiveness; the manuscript must explicitly state whether DREEAM (or an equivalent supervised model) was re-evaluated on precisely those 500 instances or whether the 0.80 figure is taken from the published full DocRED test set, as post-hoc subset selection could inflate the apparent gap.

    Authors: We agree this distinction is necessary for transparent interpretation. The 0.80 DREEAM figure is taken directly from the original DREEAM publication on the full DocRED test set; we did not re-run the supervised model on our 500-instance subset, as our focus was exclusively on zero-shot local-LLM methods and re-implementing supervised baselines was outside the scope. We will revise the abstract and results section to state this explicitly and add a brief note acknowledging that the comparison is not on an identical subset. This provides the required context without altering the reported zero-shot F1. revision: yes

  2. Referee: [Multi-hop reasoning and cascade experiments] HotpotQA multi-hop results and cascade mechanism: the reported EM lift from 0.46 to 0.55 ± 0.04 via the Phi-4 → GPT-OSS (k=5) routing with 45.4% rerouting is a key empirical contribution, but the paper should provide the exact confidence threshold or routing criterion used and confirm that the 500-question set was fixed before any post-hoc analysis of unsolvable cases at T=0.

    Authors: We confirm that the 500-question HotpotQA set was selected and fixed prior to any experiments, including the identification of the 181 questions unsolvable at T=0; all self-consistency, oracle, and cascade results are reported on this pre-defined set. We will add the exact routing criterion and confidence threshold description to the methods and results sections in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurements on external benchmarks

full rationale

The paper reports direct experimental results (F1, EM, accuracy, RAGAS scores) from running a zero-shot LLM pipeline on public datasets (DocRED, HotpotQA, synthetic data) with fixed sampling parameters. No derivations, equations, fitted parameters renamed as predictions, or self-citations are used to justify central claims. All numbers are obtained by executing the described pipeline on the stated sample sizes; the comparison to DREEAM is a literature reference rather than a self-referential reduction. The work is self-contained against external benchmarks with no load-bearing self-definition or ansatz smuggling.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The work is empirical and relies on standard assumptions about LLM zero-shot behavior plus experimental hyperparameters; no new physical or mathematical entities are introduced.

free parameters (2)
  • self-consistency sample count k = 3-5
    Number of generations (k=5 or k=3) used for majority voting or routing decisions
  • sampling temperature T = 0.7
    Temperature (T=0.7) chosen to generate response diversity
axioms (2)
  • domain assumption Local LLMs can perform zero-shot relation extraction and multi-hop reasoning when prompted appropriately
    Core premise enabling the entire pipeline and all reported metrics
  • domain assumption The selected benchmarks and synthetic data are representative of real-world KG construction tasks
    Invoked when generalizing the F1 and EM scores to the proposed system

pith-pipeline@v0.9.0 · 5742 in / 1634 out tokens · 98004 ms · 2026-05-10T15:27:02.096421+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

20 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, A. Madotto, and P. Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.

  2. [2] E. Strubell, A. Ganesh, and A. McCallum. Energy and policy considerations for deep learning in NLP. In Proc. 57th Annual Meeting of the ACL, pages 3645–3650, 2019.

  3. [3] Y. Yao, D. Ye, P. Li, X. Han, Y. Lin, Z. Liu, Z. Liu, L. Huang, J. Zhou, and M. Sun. DocRED: A large-scale document-level relation extraction dataset. In Proc. 57th Annual Meeting of the ACL, pages 764–777, 2019.

  4. [4] W.-t. Yih, M. Richardson, C. Meek, M.-W. Chang, and J. Suh. The value of semantic parse labeling for knowledge base question answering. In Proc. 54th Annual Meeting of the ACL, pages 201–206, 2016.

  5. [5] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proc. EMNLP 2018, pages 2369–2380, 2018.

  6. [6] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in NeurIPS, volume 33, pages 9459–9474, 2020.

  7. [7] S. Es, J. James, L. Espinosa-Anke, and S. Schockaert. RAGAS: Automated evaluation of retrieval augmented generation. In Proc. 18th EACL: System Demonstrations, pages 150–163, 2024.

  8. [8] W. Zhou, K. Huang, T. Ma, and J. Huang. Document-level relation extraction with adaptive thresholding and localized context pooling. In Proc. AAAI 2021, volume 35, pages 14612–14620, 2021.

  9. [9] Y. Ma, A. Wang, and N. Okazaki. DREEAM: Guiding attention with evidence for improving document-level relation extraction. In Proc. 17th EACL, pages 1963–1975, 2023.

  10. [10] S. Wadhwa, S. Amir, and B. C. Wallace. Revisiting relation extraction in the era of large language models. In Proc. 61st Annual Meeting of the ACL, pages 15566–15589, 2023.

  11. [11] B. Li, G. Fang, Y. Yang, Q. Wang, W. Ye, W. Zhao, and S. Zhang. Evaluating ChatGPT’s information extraction capabilities. arXiv preprint arXiv:2304.11633, 2023.

  12. [12] Y. Ozyurt, S. Feuerriegel, and C. Zhang. Document-level in-context few-shot relation extraction via pre-trained language models. arXiv preprint arXiv:2310.11085, 2023.

  13. [13] P.-L. Huguet Cabot and R. Navigli. REBEL: Relation extraction by end-to-end language generation. In Findings of EMNLP 2021, pages 2370–2381, 2021.

  14. [14] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu. Unifying large language models and knowledge graphs: A roadmap. IEEE Trans. Knowledge and Data Engineering, 36(7):3580–3599, 2024.

  15. [15] Y. Gu, X. Deng, and Y. Su. Don’t generate, discriminate: A proposal for grounding language models to real-world environments. In Proc. 61st Annual Meeting of the ACL, pages 4928–4949, 2023.

  16. [16] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in NeurIPS, volume 35, pages 24824–24837, 2022.

  17. [17] A. Lacoste, A. Luccioni, V. Schmidt, and T. Dandres. Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700, 2019.

  18. [18] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. In Proc. ICLR, 2023.

  19. [19] M. Moussaïd, J. E. Kämmer, P. P. Analytis, and H. Neth. Social influence and the collective dynamics of opinion formation. PLoS ONE, 8(11):e78433, 2013.

  20. [20] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999.