pith. machine review for the scientific record.

arxiv: 2604.26048 · v1 · submitted 2026-04-28 · 💻 cs.CL


BioGraphletQA: Knowledge-Anchored Generation of Complex QA Datasets

Bárbara Maria Ribeiro de Abreu Martins, Richard A. A. Jonker, Sérgio Matos


Pith reviewed 2026-05-07 16:22 UTC · model grok-4.3

classification 💻 cs.CL
keywords complex QA · knowledge graphs · dataset creation · large language models · prompt engineering · biomedical QA · synthetic datasets

The pith

Using small knowledge-graph subgraphs to anchor prompts lets large language models generate complex QA datasets at scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish a scalable framework for generating complex question-answering datasets by basing prompts on small subgraphs extracted from knowledge graphs. This controls the complexity of the questions and ensures they are factually grounded in the graph structure. A sympathetic reader would care if this works because it could make it much easier to build large training sets for systems that need to answer sophisticated questions, especially where expert knowledge is required. The framework is tested by producing a biomedical dataset of over 119,000 pairs and showing that adding this data to training improves accuracy on standard benchmarks.

Core claim

The central discovery is that a generation process anchored in small subgraphs from a knowledge graph can be used to direct large language models toward producing complex QA pairs that retain factual accuracy and appropriate difficulty. When applied, this yields a large collection of such pairs, whose quality is supported by domain expert assessment of a sample and by the observation that models trained with the addition of these pairs achieve higher accuracy on biomedical QA tasks.

What carries the argument

The graphlet-anchored generation process, defined as using small subgraphs of up to five nodes from a knowledge graph in structured prompts to control complexity and ensure factual grounding of generated questions.
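The paper's actual prompt modules live in its linked repository; as a rough illustration of what graphlet anchoring means in practice, here is a minimal stdlib-Python sketch that enumerates connected subgraphs of up to five nodes from a toy triple set and renders each into a structured prompt. The facts and the prompt template are hypothetical, not taken from the paper or the OREGANO KG.

```python
from itertools import combinations

# Toy KG as (head, relation, tail) triples -- hypothetical facts,
# standing in for the OREGANO KG used in the paper.
TRIPLES = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "interacts_with", "warfarin"),
    ("warfarin", "treats", "thrombosis"),
    ("headache", "symptom_of", "migraine"),
]

def covers_and_connects(sub, node_set):
    """True if `sub`'s triples use exactly `node_set` and form one component."""
    if {n for h, _, t in sub for n in (h, t)} != node_set:
        return False
    seen, frontier = set(), [next(iter(node_set))]
    while frontier:
        n = frontier.pop()
        if n in seen:
            continue
        seen.add(n)
        for h, _, t in sub:
            if h == n:
                frontier.append(t)
            if t == n:
                frontier.append(h)
    return seen == node_set

def connected_subgraphs(triples, max_nodes=5):
    """Yield triple subsets whose nodes form one connected subgraph (a graphlet)."""
    nodes = {n for h, _, t in triples for n in (h, t)}
    for k in range(2, max_nodes + 1):
        for node_set in combinations(sorted(nodes), k):
            sub = [t for t in triples if t[0] in node_set and t[2] in node_set]
            if covers_and_connects(sub, set(node_set)):
                yield sub

def build_prompt(graphlet):
    """Structured prompt anchored to one graphlet (illustrative template only)."""
    facts = "\n".join(f"- {h} {r.replace('_', ' ')} {t}" for h, r, t in graphlet)
    return (
        "Using ONLY the facts below, write one complex question whose answer "
        "requires combining all of them, plus its answer.\n"
        f"Facts:\n{facts}"
    )

graphlets = list(connected_subgraphs(TRIPLES, max_nodes=5))
print(build_prompt(graphlets[0]))
```

The point of the anchoring is that every fact the question may draw on is explicitly in the prompt, so factual grounding can later be checked against the graphlet rather than against the whole graph.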

If this is right

  • Augmenting training data with the generated QA pairs increases accuracy on PubMedQA from 49.2% to 68.5% in a low-resource setting and on MedQA from 41.4% to 44.8% in a full-resource setting.
  • The process scales to produce 119,856 QA pairs while preserving the desired properties.
  • Evaluation by a domain expert on a sample of 106 pairs confirms high scientific validity and complexity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar subgraph-anchoring techniques could be applied in other knowledge-intensive domains to generate tailored QA datasets.
  • This method might help mitigate the challenges of data scarcity for training models on complex reasoning by providing a structured way to synthesize examples.
  • Extending the framework to include more types of graph structures or different large language models could further improve the diversity and utility of the generated data.

Load-bearing premise

That the use of small knowledge-graph subgraphs in prompts will continue to produce questions with preserved factual grounding and complexity when generating a full dataset of over one hundred thousand pairs, and that a review of just over one hundred samples is sufficient to represent the quality of the whole.
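What a 106-pair sample can and cannot establish is easy to bound with standard binomial statistics. A sketch, assuming (hypothetically) that all reviewed pairs passed and that the sample was drawn uniformly; the paper reports "high" validity rather than an exact pass count, so the zero-failure figure here is an assumption for illustration:

```python
import math

n_sampled = 106       # expert-reviewed pairs
n_total = 119_856     # full BioGraphletQA dataset
failures = 0          # hypothetical: every sampled pair judged valid

# Exact (Clopper-Pearson) 95% upper bound on the true error rate when
# zero failures are observed: solve (1 - p)^n = 0.05 for p.
upper = 1 - math.exp(math.log(0.05) / n_sampled)
print(f"95% upper bound on error rate: {upper:.3%}")
print(f"=> up to ~{round(upper * n_total)} flawed pairs could remain")
```

Even a flawless sample of 106 only caps the plausible error rate near 2.8%, i.e. a few thousand flawed pairs could still hide in the full dataset, which is exactly the generalization gap the premise leans on.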

What would settle it

Finding that a larger sample of the generated QA pairs contains many factual errors or lacks the claimed complexity, or that augmenting models with this data fails to improve or even decreases performance on independent QA benchmarks.

Figures

Figures reproduced from arXiv: 2604.26048 by Bárbara Maria Ribeiro de Abreu Martins, Richard A. A. Jonker, Sérgio Matos.

Figure 1: An overview of our framework, composed of five phases: 1) Initial cleaning of the OREGANO KG, including KG pre-processing and graphlet extraction; 2) Generation of the initial KGQA dataset, starting with a modular prompt building stage; 3) Automatic filtering stage using LLM-as-a-judge; 4) Supporting document retrieval from PubMed with BM25, using an LLM to classify the document relevance and extract snippe… view at source ↗
Figure 2: Node degree distribution of the OREGANO KG and the reduced version. Graphlets: the final preprocessing step involves extracting graphlets, which are small, connected, non-isomorphic subgraphs that capture local structural patterns within the KG. Rather than performing a simple random walk, the KG is partitioned into these graphlets to serve as the basis for QA generation. The process considers all 29 unique… view at source ↗
Figure 3: Compressed version of the selected prompt, showcasing extracts from some of the different modules. The full prompt and the modules can be inspected in our GitHub repository. [...] and 59-997 characters for answers, leading to the exclusion of an additional 4,658 pairs. 3.3 Post-Generation Filtering: To improve dataset quality, LLM-based filtering was applied after the initial generation phase. Although the prompt… view at source ↗
Figure 4: The 29 graphlet shapes with 3-5 nodes. Each shape is shown with the number of graphlets initially present (sampled to 10,000) and the acceptance ratio (QA pairs accepted / QA pairs generated). 3.4 Supporting Document Retrieval: While BioGraphletQA is grounded in structured knowledge, many biomedical QA scenarios also rely on unstructured text. To broaden our framework's applicability and enrich the final r… view at source ↗
Figure 5 shows a clear distributional shift in scores between the QA pairs our LLM filter accepted versus those it rejected. The accepted pairs consistently received high scores from the human expert, while rejected pairs scored poorly. This disparity provides strong empirical evidence that Stage 3 of our framework is effective at identifying and removing low-quality data, a critical step for ensuring the reliabil… view at source ↗
Figure 6: Boxplot of Likert-based human evaluation scores across 5 categories for the accepted QA pairs. The results, presented in… view at source ↗
Figure 7: Effect of synthetic data augmentation on PubMedQA (left) and MedQA (right). Plots show mean test accuracy over 5 different seeds vs. real training data size. Augmenting with 10,000 synthetic samples consistently provides the largest performance gains over the baseline. Shaded areas represent standard deviation. The findings show performance gains in both low-resource and full-resource settings. The most s… view at source ↗
read the original abstract

This paper presents a principled and scalable framework for systematically generating complex Question Answering (QA) data. In the core of this framework is a graphlet-anchored generation process, where small subgraphs from a Knowledge Graph (KG) are used in a structured prompt to control the complexity and ensure the factual grounding of questions generated by Large Language Models. The first instantiation of this framework is BioGraphletQA, a new biomedical KGQA dataset of 119,856 QA pairs. Each entry is grounded in a graphlet of up to five nodes from the OREGANO KG, with most of the pairs being enriched with relevant document snippets from PubMed. We start by demonstrating the framework's value and the dataset's quality through evaluation by a domain expert on 106 QA pairs, confirming the high scientific validity and complexity of the generated data. Secondly, we establish its practical utility by showing that augmenting downstream benchmarks with our data improves accuracy on PubMedQA from 49.2% to 68.5% in a low-resource setting, and on MedQA from a 41.4% baseline to 44.8% in a full-resource setting. Our framework provides a robust and generalizable solution for creating critical resources to advance complex QA tasks, including MCQA and KGQA. All resources supporting this work, including the dataset (https://zenodo.org/records/17381119) and framework code (https://github.com/ieeta-pt/BioGraphletQA), are publicly available to facilitate use, reproducibility and extension.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a graphlet-anchored framework for generating complex QA datasets from knowledge graphs. Its primary instantiation, BioGraphletQA, consists of 119,856 biomedical QA pairs derived from subgraphs of up to five nodes in the OREGANO KG, with most pairs augmented by PubMed snippets. Quality is supported by domain-expert review of 106 pairs confirming scientific validity and complexity; utility is shown via downstream augmentation experiments that raise PubMedQA accuracy from 49.2% to 68.5% (low-resource) and MedQA from 41.4% to 44.8% (full-resource). Code and data are released publicly.

Significance. If the generation process preserves factual grounding and complexity at scale, the work supplies a reproducible, extensible resource for complex biomedical QA that could improve training of models on multi-hop and factually anchored reasoning. The public Zenodo dataset and GitHub framework code are explicit strengths that enable direct reuse and extension beyond biomedicine.

major comments (2)
  1. [Evaluation / expert review] Expert validation is reported only on 106 pairs (abstract and evaluation section). No selection protocol, stratification by graphlet size, or power analysis is described, so it is unclear whether the observed high validity generalizes to the full 119,856 pairs, especially for larger graphlets where incomplete grounding or LLM inconsistencies could be more prevalent.
  2. [Method / graphlet-anchored generation] The generation pipeline (method section) supplies no concrete prompt templates, graphlet sampling procedure, or hallucination-mitigation steps. Without these, the central claim that graphlet anchoring reliably controls complexity and factual grounding cannot be independently verified or reproduced from the manuscript alone.
minor comments (2)
  1. [Abstract] The abstract states that 'most' pairs are enriched with PubMed snippets but does not report the exact fraction or the enrichment procedure.
  2. [Experiments] Downstream experiment details (e.g., how BioGraphletQA pairs are mixed with original training data, exact model checkpoints, and hyper-parameter settings) are referenced only at high level; a table or appendix listing these would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and commit to revisions that enhance the clarity, reproducibility, and transparency of the work without altering its core contributions.

read point-by-point responses
  1. Referee: [Evaluation / expert review] Expert validation is reported only on 106 pairs (abstract and evaluation section). No selection protocol, stratification by graphlet size, or power analysis is described, so it is unclear whether the observed high validity generalizes to the full 119,856 pairs, especially for larger graphlets where incomplete grounding or LLM inconsistencies could be more prevalent.

    Authors: We agree that the manuscript would benefit from greater detail on the expert validation process. The 106 pairs were chosen via stratified random sampling across graphlet sizes (1–5 nodes) to ensure representation of varying complexities; we will explicitly document this protocol, the expert assessment criteria, and the exact validity results in the revised evaluation section. A formal statistical power analysis was not performed, as expert review is resource-intensive, but we will add a limitations paragraph acknowledging that the sample size limits strong claims of generalization to the full dataset, particularly for 5-node graphlets. The downstream benchmark gains provide complementary evidence of utility at scale, and we will emphasize this as supporting (rather than definitive) validation. revision: yes

  2. Referee: [Method / graphlet-anchored generation] The generation pipeline (method section) supplies no concrete prompt templates, graphlet sampling procedure, or hallucination-mitigation steps. Without these, the central claim that graphlet anchoring reliably controls complexity and factual grounding cannot be independently verified or reproduced from the manuscript alone.

    Authors: We acknowledge that the method section lacks the concrete implementation details needed for standalone reproduction. The full graphlet sampling logic, prompt templates, and hallucination-mitigation steps (KG verification plus PubMed snippet grounding) are present in the released code at https://github.com/ieeta-pt/BioGraphletQA. To make the manuscript self-contained, we will insert the key prompt templates, a precise description of the sampling procedure, and the mitigation steps into a new appendix in the revised version. This will enable verification directly from the paper while retaining the open-source repository for executable code. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset construction without derivation chain

full rationale

The paper presents a constructive framework for generating QA pairs by anchoring LLM prompts to small subgraphs (graphlets) extracted from the OREGANO KG, producing the BioGraphletQA dataset of 119,856 pairs. Quality is checked by domain-expert review of 106 samples and utility is measured by empirical augmentation experiments on PubMedQA (49.2% to 68.5%) and MedQA (41.4% to 44.8%). No equations, fitted parameters, self-referential definitions, or first-principles derivations appear; the generation process is a procedural pipeline whose outputs are tested against external benchmarks and human judgment rather than reduced to the inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing justification for the central claims. The work is therefore self-contained as an empirical resource-creation effort.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework relies on standard LLM prompting assumptions and the existence of the OREGANO KG; no free parameters, ad-hoc axioms, or invented entities are described in the abstract.

axioms (1)
  • domain assumption LLM outputs remain factually consistent with the provided graphlet when prompted appropriately
    Implicit in the graphlet-anchored generation process described in the abstract.

pith-pipeline@v0.9.0 · 5595 in / 1315 out tokens · 50237 ms · 2026-05-07T16:22:59.663141+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 16 canonical work pages · 1 internal anchor

  1. Banerjee, D., Awale, S., Usbeck, R., Biemann, C.: DBLP-QuAD: A question answering dataset over the DBLP scholarly knowledge graph. arXiv preprint arXiv:2303.13351 (2023)

  2. Bao, J., Duan, N., Yan, Z., Zhou, M., Zhao, T.: Constraint-based question answering with knowledge graph. In: Matsumoto, Y., Prasad, R. (eds.) Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. pp. 2503–2514. The COLING 2016 Organizing Committee, Osaka, Japan (Dec 2016), https://aclanthology.o...

  3. Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. pp. 1247–1250 (2008)

  4. Boudin, M., Diallo, G., Drancé, M., Mougin, F.: The OREGANO knowledge graph for computational drug repurposing. Scientific Data 10(1), 871 (2023)

  5. Braga, M., Kasela, P., Raganato, A., Pasi, G.: Synthetic data generation with large language models for personalized community question answering. arXiv preprint arXiv:2410.22182 (2024)

  6. Chandak, P., Huang, K., Zitnik, M.: Building a knowledge graph to enable precision medicine. Nature Scientific Data (2023). https://doi.org/10.1038/s41597-023-01960-3

  7. Dubey, M., Banerjee, D., Abdelkawi, A., Lehmann, J.: LC-QuAD 2.0: A large dataset for complex question answering over Wikidata and DBpedia. In: The Semantic Web – ISWC 2019: 18th International Semantic Web Conference, Auckland, New Zealand, October 26–30, 2019, Proceedings, Part II 18. pp. 69–78. Springer (2019)

  8. Dutt, R., Khosla, S., Kumar, V.B., Gangadharaiah, R.: GrailQA++: A challenging zero-shot benchmark for knowledge base question answering. In: Proceedings of the 13th IJCNLP. pp. 897–909 (2023)

  9. Gu, Y., Kase, S., Vanni, M., Sadler, B., Liang, P., Yan, X., Su, Y.: Beyond IID: three levels of generalization for question answering on knowledge bases. In: Proceedings of the Web Conference 2021. pp. 3477–3488 (2021)

  10. Haas, R.: A survey of biomedical knowledge graphs and of resources for their construction (February 6, 2024), https://robert-haas.github.io/awesome-biomedical-knowledge-graphs/

  11. Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., Liu, T.: A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. 43(2) (Jan 2025). https://doi.org/10.1145/3703155

  12. Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A., Fung, P.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55(12) (Mar 2023). https://doi.org/10.1145/3571730

  13. Jiang, L., Usbeck, R.: Knowledge graph question answering datasets and their generalizability: Are they enough for future research? In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 3209–3218 (2022)

  14. Jin, D., Pan, E., Oufattole, N., Weng, W.H., Fang, H., Szolovits, P.: What disease does this patient have? A large-scale open domain question answering dataset from medical exams (2020), https://arxiv.org/abs/2009.13081

  15. Jin, Q., Dhingra, B., Liu, Z., Cohen, W., Lu, X.: PubMedQA: A dataset for biomedical research question answering. In: Inui, K., Jiang, J., Ng, V., Wan, X. (eds.) Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 2567–2577. Association f...

  16. Namboori, A., Mangale, S., Rosenbaum, A., Soltan, S.: GeMQuAD: Generating multilingual question answering datasets from large language models using few shot learning. arXiv preprint arXiv:2404.09163 (2024)

  17. Peixoto, T.P.: The graph-tool python library. figshare (2014). https://doi.org/10.6084/m9.figshare.1164194, http://figshare.com/articles/graph_tool/1164194

  18. Pradeep, R., Lee, D., Mousavi, A., Pound, J., Sang, Y., Lin, J., Ilyas, I., Potdar, S., Arefiyan, M., Li, Y.: ConvKGYarn: Spinning configurable and scalable conversational knowledge graph QA datasets with large language models. In: Dernoncourt, F., Preoţiuc-Pietro, D., Shimorina, A. (eds.) Proceedings of the 2024 Conference on Empirical Methods in N...

  19. Putman, T.E., Schaper, K., Matentzoglu, N., Rubinetti, V., Alquaddoomi, F., Cox, C., Caufield, J.H., Elsarboukh, G., Gehrke, S., Hegde, H., Reese, J., Braun, I., Bruskiewich, R., Cappelletti, L., Carbon, S., Caron, A., Chan, L., Chute, C., Cortes, K., De Souza, V., Fontana, T., Harris, N., Hartley, E., Hurwitz, E., Jacobsen, J.B., Krishnamurthy, M., Lar... Nucleic Acids Research 52(D1), D938–D949 (2023)

  20. Ribeiro, P., Paredes, P., Silva, M.E., Aparicio, D., Silva, F.: A survey on subgraph counting: concepts, algorithms, and applications to network motifs and graphlets. ACM Computing Surveys 54(2), 1–36 (2021)

  21. Robertson, S., Zaragoza, H., et al.: The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval 3(4), 333–389 (2009)

  22. Santos, A., Colaço, A.R., Nielsen, A.B., Niu, L., Strauss, M., Geyer, P.E., Coscia, F., Albrechtsen, N.J.W., Mundt, F., Jensen, L.J., et al.: A knowledge graph to interpret clinical proteomics data. Nature Biotechnology 40(5), 692–702 (2022)

  23. Sayers, E.W., Beck, J., Bolton, E.E., Bourexis, D., Brister, J.R., Canese, K., Comeau, D.C., Funk, K., Kim, S., Klimke, W., Marchler-Bauer, A., Landrum, M., Lathrop, S., Lu, Z., Madden, T.L., O'Leary, N., Phan, L., Rangwala, S.H., Schneider, V.A., Skripchenko, Y., Wang, J., Ye, J., Trawick, B.W., Pruitt, K.D., Sherry, S.T.: Database resources of the natio... Nucleic Acids Research 49(D1), D10–D17 (2020)

  24. Tang, R., Han, X., Jiang, X., Hu, X.: Does synthetic data generation of LLMs help clinical text mining? arXiv preprint arXiv:2303.04360 (2023)

  25. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems. vol. 30 (2017)

  26. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Communications of the ACM 57(10), 78–85 (2014)

  27. Wernicke, S.: Efficient detection of network motifs. IEEE/ACM Transactions on Computational Biology and Bioinformatics 3(4), 347–359 (2006). https://doi.org/10.1109/TCBB.2006.51

  28. Wu, I., Jayanthi, S., Viswanathan, V., Rosenberg, S., Pakazad, S., Wu, T., Neubig, G.: Synthetic multimodal question generation. arXiv preprint arXiv:2407.02233 (2024)

  29. Xu, C., Guan, S., Greene, D., Kechadi, M.T.: Benchmark data contamination of large language models: A survey (2024), https://arxiv.org/abs/2406.04244

  30. Yan, X., Westphal, P., Seliger, J., Usbeck, R.: Bridging the gap: Generating a comprehensive biomedical knowledge graph question answering dataset. In: ECAI 2024, pp. 1198–1205. IOS Press (2024)

  31. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q...

  32. Yasunaga, M., Leskovec, J., Liang, P.: LinkBERT: Pretraining language models with document links. In: Association for Computational Linguistics (ACL) (2022)