pith. machine review for the scientific record.

arxiv: 2605.01077 · v1 · submitted 2026-05-01 · 💻 cs.CL

Recognition: unknown

Teaching LLMs Brazilian Healthcare: Injecting Knowledge from Official Clinical Guidelines

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 18:45 UTC · model grok-4.3

classification 💻 cs.CL
keywords Brazilian clinical guidelines · LLM domain adaptation · synthetic data generation · continual pre-training · reinforcement learning · medical question answering · Portuguese language models

The pith

A 14B model trained on synthetic data from Brazilian clinical guidelines outperforms much larger general LLMs on domain-specific medical recall.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Brazil's public health system uses detailed official guidelines for diagnosis, treatment, and monitoring that serve over 200 million people, yet general LLMs lack reliable command of this localized knowledge. The paper demonstrates that a 14B-parameter base model can be adapted by first generating tens of millions of tokens of synthetic training material, in several formats, directly from 178 official documents. This material is produced by four different generator models to increase variety, then used in a two-stage process of continual pre-training followed by reinforcement learning. The adapted model reaches high accuracy on two new benchmarks that test true/false assertions and open-ended clinical questions drawn from the guidelines. The gains hold even against commercial systems many times larger, showing that targeted injection of protocol-specific data can close the gap for smaller models.

Core claim

The authors establish that converting official Brazilian clinical guidelines into diverse synthetic datasets via multiple generator LLMs, followed by continual pre-training and Group Relative Policy Optimization, produces a 14B model that scores 83.9 percent on a balanced true/false benchmark and 85.4 percent on open-ended questions, exceeding the performance of GPT-5.2, Claude Sonnet 4.6, Gemini 3.1 Pro, and web-grounded retrieval systems while using far fewer parameters.
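For context, the Group Relative Policy Optimization step named in this claim normalizes each sampled response's reward within its group instead of relying on a learned value baseline; a sketch of the group-relative advantage, following the original GRPO formulation rather than any notation from this paper:

```latex
% Group-relative advantage for response o_i with scalar reward r_i,
% among G responses sampled for the same prompt:
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}
% \hat{A}_i then enters a PPO-style clipped surrogate objective with a
% KL penalty toward the reference policy.
```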

What carries the argument

A data-generation pipeline that turns 178 official guidelines into roughly 70 million tokens of rephrases, wiki-style articles, and question-answer pairs using four generator LLMs, then applies continual pre-training and Group Relative Policy Optimization to embed the content into the base model.
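The skeleton of that pipeline can be sketched as follows; generator calls are stubbed, and every name here is illustrative rather than taken from the authors' released code:

```python
# Hedged sketch of the guideline-to-synthetic-data pipeline described above.
# The paper uses four distinct generator LLMs; this stub only mimics the
# combinatorics of guidelines x generators x formats.

from dataclasses import dataclass

FORMATS = ("rephrase", "wiki", "qa")  # the three synthetic formats

@dataclass
class SyntheticExample:
    guideline_id: str
    fmt: str
    generator: str
    text: str

def stub_generate(guideline_text: str, fmt: str, generator: str) -> str:
    # Placeholder for an LLM call; a real pipeline would prompt the generator
    # model with a guideline chunk and a format-specific template.
    return f"[{generator}/{fmt}] {guideline_text[:40]}"

def build_corpus(guidelines: dict, train_ids: set, generators: list) -> list:
    corpus = []
    for gid, text in guidelines.items():
        for gen in generators:
            for fmt in FORMATS:
                # Per Figure 1, QA pairs come only from the train split so the
                # held-out benchmark guidelines are never seen as QA targets.
                if fmt == "qa" and gid not in train_ids:
                    continue
                corpus.append(SyntheticExample(gid, fmt, gen,
                                               stub_generate(text, fmt, gen)))
    return corpus

guidelines = {"pcdt_asma": "Asthma protocol text ...",
              "pcdt_hiv": "HIV protocol text ..."}
corpus = build_corpus(guidelines, train_ids={"pcdt_asma"},
                      generators=["g1", "g2", "g3", "g4"])
# pcdt_asma: 4 generators x 3 formats = 12; pcdt_hiv: 4 x 2 (no QA) = 8
print(len(corpus))  # → 20
```

Continual pre-training on the resulting corpus and the subsequent GRPO stage are separate steps, not shown here.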

If this is right

  • Smaller models can achieve reliable command of guideline-specific details such as diagnostic criteria, dosages, and monitoring steps for Brazilian protocols.
  • Generator diversity and the reinforcement-learning stage each contribute measurably to the final accuracy.
  • The released datasets and benchmarks provide a reproducible way to measure how well any model captures Brazilian clinical protocols.
  • Deployment of such adapted models becomes practical in settings where compute resources are limited.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same guideline-to-synthetic-data approach could be repeated for clinical protocols from other national health systems.
  • Models trained this way might reduce certain types of medical hallucinations when answering questions within the covered guideline topics.
  • The new benchmarks could be extended over time to include cases from guidelines released after the current training cutoff.
  • Integration into clinical decision-support tools would still require separate safety and usability testing beyond the reported benchmarks.

Load-bearing premise

The synthetic data created from the guidelines accurately reflects their content without introducing factual errors, omissions, or biases that would distort what the model learns.
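One cheap, partial probe of this premise would be to verify that safety-critical numbers survive from a source passage into its synthetic rephrase; a hypothetical sketch, not the paper's validation procedure:

```python
# Hypothetical fidelity spot-check: numbers (dosages, intervals, durations)
# present in a source guideline chunk should reappear in its synthetic
# rephrase. Examples below are invented for illustration.

import re

def extract_numbers(text: str) -> set:
    # Capture integers and decimals, allowing either "." or "," separators
    # since the source guidelines are in Brazilian Portuguese.
    return set(re.findall(r"\d+(?:[.,]\d+)?", text))

def fidelity_check(source: str, synthetic: str) -> set:
    """Return the source numbers missing from the synthetic text."""
    return extract_numbers(source) - extract_numbers(synthetic)

source = "Administer amoxicillin 500 mg every 8 hours for 7 days."
good = "Amoxicillin should be given at 500 mg per dose, every 8 hours, over 7 days."
bad = "Amoxicillin should be given at 250 mg per dose, every 8 hours."

print(sorted(fidelity_check(source, good)))  # → []
print(sorted(fidelity_check(source, bad)))   # → ['500', '7']
```

A check like this catches dropped or altered dosages but not subtler distortions (negations, omitted contraindications), which would still need expert review.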

What would settle it

Observing the fine-tuned model give wrong answers on clinical scenarios taken directly from the guidelines, in cases where one of the larger comparison models answers correctly.

Figures

Figures reproduced from arXiv: 2605.01077 by Filipe Rocha Lopes, Hugo Abonizio, Roberto Lotufo, Rodrigo Nogueira.

Figure 1
Figure 1: Overview of our pipeline. The 178 clinical guidelines are split into train and test sets. All guidelines feed rephrase and wiki-style generation (dashed gray arrows), while QA pairs are generated only from train guidelines to prevent leakage. Four generator LLMs produce the synthetic corpus (∼70M tokens), which is used for continual pre-training of Qwen2.5-14B-Instruct. The HealthBench-BR train split provi…
Figure 2
Figure 2: Examples from the two proposed benchmarks. Left: HealthBench-BR pairs each true assertion with a variant in which one critical detail is modified to produce a plausible distractor. Right: PCDT-QA pairs an open-ended question with a reference answer, scored against the model's response by an LLM judge.
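The two item formats described in that caption imply simple data shapes; a hedged sketch in which all field names are assumptions, not the released schema:

```python
# Illustrative item shapes for the two benchmarks; field names are guesses,
# not the schema of the released datasets.

from dataclasses import dataclass

@dataclass
class HealthBenchBRItem:
    assertion: str
    label: bool  # True for the original assertion, False for its distractor

@dataclass
class PCDTQAItem:
    question: str
    reference_answer: str  # compared to the model's answer by an LLM judge

def make_pair(true_assertion: str, distractor: str):
    # Each true assertion is paired with a variant in which one critical
    # detail was changed, keeping the true/false labels balanced.
    return [HealthBenchBRItem(true_assertion, True),
            HealthBenchBRItem(distractor, False)]

pair = make_pair(
    "Drug X is dosed at 500 mg every 8 hours.",
    "Drug X is dosed at 500 mg every 12 hours.",
)
print([item.label for item in pair])  # → [True, False]
```

The pairing is consistent with the reported counts: 1,780 balanced assertions is 890 pairs, the same number as the 890 PCDT-QA questions.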
original abstract

Brazil's Unified Health System (SUS) relies on official clinical guidelines that define diagnostic criteria, treatments, dosages, and monitoring procedures for over 200 million citizens. Yet current LLMs perform poorly on this guideline-specific knowledge, and no benchmark evaluates clinical recall grounded in Brazilian Portuguese protocols. We address this gap by adapting Qwen2.5-14B-Instruct to the Brazilian clinical domain. From 178 official guidelines (~5.4M tokens), we generate ~70M tokens of synthetic data in three formats -- rephrases, wiki-style articles, and question-answer pairs -- using four generator LLMs. We then apply continual pre-training followed by Group Relative Policy Optimization (GRPO). We introduce HealthBench-BR, with 1,780 balanced true/false clinical assertions, and PCDT-QA, with 890 open-ended clinical questions scored by an LLM judge. Our best model achieves 83.9% on HealthBench-BR and 85.4% on PCDT-QA, outperforming GPT-5.2, Claude Sonnet 4.6, Gemini 3.1 Pro, and Google AI Overview's web-grounded RAG despite having only 14B parameters. Ablations show that generator diversity and reinforcement learning are critical to these gains. We release all datasets, benchmarks, and model weights to support reproducible clinical NLP research for Brazilian Portuguese. Code, data, and model weights are available at https://github.com/hugoabonizio/clinical-protocols-br

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper adapts Qwen2.5-14B-Instruct to Brazilian clinical guidelines by generating ~70M tokens of synthetic rephrases, wiki articles, and QA pairs from 178 official SUS documents using four generator LLMs, followed by continual pre-training and GRPO. It introduces HealthBench-BR (1,780 balanced true/false assertions) and PCDT-QA (890 open-ended questions scored by LLM judge), reporting 83.9% and 85.4% accuracy respectively. The model outperforms GPT-5.2, Claude Sonnet 4.6, Gemini 3.1 Pro, and web-grounded RAG despite its size; ablations highlight the value of generator diversity and RL. All datasets, benchmarks, and weights are released.

Significance. If the central results hold after verification, the work is significant for showing that targeted synthetic-data adaptation plus GRPO can enable a 14B model to surpass much larger general-purpose LLMs on domain-specific clinical recall in Brazilian Portuguese. The public release of the guideline-derived datasets, two new benchmarks, and fine-tuned weights provides concrete resources for reproducible clinical NLP research in a low-resource language setting and supports further investigation of efficient knowledge injection methods.

major comments (2)
  1. [Section 3.2] Section 3.2 (Synthetic Data Generation): No expert review, error-rate quantification, or fidelity check against the source PDFs is reported for the ~70M tokens of generated rephrases/wiki/QA. If generator LLMs introduce factual distortions (incorrect dosages, omitted contraindications, or stylistic biases), these will be reinforced by continual pre-training and GRPO, directly undermining the claim that performance gains reflect accurate guideline knowledge rather than artifacts.
  2. [Section 4] Section 4 (Benchmark Construction): HealthBench-BR and PCDT-QA lack reported details on item sourcing, overlap controls with the synthetic training data, inter-annotator agreement for the true/false labels, or human validation of the LLM judge used for PCDT-QA. Without these, the headline scores (83.9% and 85.4%) and outperformance claims cannot be independently verified and may be inflated by shared synthetic patterns.
minor comments (2)
  1. [Abstract and Section 5] The abstract states that ablations demonstrate the importance of generator diversity and reinforcement learning, but the main text should include the exact accuracy drops, confidence intervals, and statistical tests for each ablation condition.
  2. [Section 3.3] Notation for the GRPO objective and the three synthetic data formats could be clarified with a small table or pseudocode to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the transparency and rigor of the manuscript. We address each major point below and have revised the paper accordingly where possible.

point-by-point responses
  1. Referee: [Section 3.2] Section 3.2 (Synthetic Data Generation): No expert review, error-rate quantification, or fidelity check against the source PDFs is reported for the ~70M tokens of generated rephrases/wiki/QA. If generator LLMs introduce factual distortions (incorrect dosages, omitted contraindications, or stylistic biases), these will be reinforced by continual pre-training and GRPO, directly undermining the claim that performance gains reflect accurate guideline knowledge rather than artifacts.

    Authors: We acknowledge that the absence of expert review or explicit error-rate quantification for the synthetic data is a limitation. Performing such validation at the scale of 70M tokens would require significant clinical resources beyond the scope of this study. In response, we have added a dedicated paragraph in Section 3.2 describing the multi-generator approach and its intended bias-mitigation benefits, supported by our existing ablations. We have also inserted a Limitations section that explicitly notes the lack of expert fidelity checks and recommends that practitioners always consult the original SUS PDFs for safety-critical decisions. All generated data is released to facilitate external audits. revision: partial

  2. Referee: [Section 4] Section 4 (Benchmark Construction): HealthBench-BR and PCDT-QA lack reported details on item sourcing, overlap controls with the synthetic training data, inter-annotator agreement for the true/false labels, or human validation of the LLM judge used for PCDT-QA. Without these, the headline scores (83.9% and 85.4%) and outperformance claims cannot be independently verified and may be inflated by shared synthetic patterns.

    Authors: We agree that additional methodological details are required for independent verification. We have expanded Section 4 to document item sourcing (assertions and questions were manually derived by the authors from the 178 official guidelines), overlap controls (disjoint guideline sections plus embedding-based deduplication were applied), and the construction process. We have added a note clarifying that formal inter-annotator agreement was not computed because labels are objective extractions from the source documents, and we have included a human validation study of the LLM judge on a held-out subset of PCDT-QA. These additions allow readers to assess potential data leakage and judge reliability. revision: yes
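The embedding-based deduplication mentioned in this response can be sketched as follows; the bag-of-words "embedding" and the 0.9 threshold are toy stand-ins, not the authors' actual setup:

```python
# Hedged sketch of benchmark/training-data deduplication: drop benchmark
# items whose nearest training example exceeds a cosine-similarity threshold.
# A real pipeline would use a sentence encoder, not bag-of-words counts.

import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": word-count vector over whitespace tokens.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dedup(benchmark: list, train: list, threshold: float = 0.9) -> list:
    # Keep only benchmark items sufficiently far from every training example.
    return [item for item in benchmark
            if max((cosine(embed(item), embed(t)) for t in train),
                   default=0.0) < threshold]

train = ["asthma is treated with inhaled corticosteroids"]
benchmark = ["asthma is treated with inhaled corticosteroids",  # duplicate, dropped
             "hiv therapy requires viral load monitoring"]
kept = dedup(benchmark, train)
print(len(kept))  # → 1
```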

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper's central result is an empirical performance measurement obtained by ingesting external official Brazilian clinical guidelines, generating synthetic data via four generator LLMs, performing continual pre-training plus GRPO, and evaluating on two newly introduced benchmarks (HealthBench-BR true/false assertions and PCDT-QA open-ended questions). No equations, self-definitions, or fitted parameters reduce the reported accuracies (83.9% and 85.4%) to the inputs by construction; the benchmarks are presented as independent test sets, and no load-bearing self-citation or uniqueness theorem is invoked to justify the gains. The process remains externally grounded in the source guidelines and measurable outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central performance claim depends on the unverified assumption that LLM-generated synthetic data is a faithful proxy for the original guidelines and that the new benchmarks measure genuine guideline adherence rather than artifacts of the generation process.

free parameters (1)
  • Continual pre-training and GRPO hyperparameters
    Learning rate, number of epochs, and GRPO-specific parameters are chosen or tuned to produce the reported scores but are not enumerated in the abstract.
axioms (2)
  • domain assumption Synthetic data generated by four external LLMs accurately reflects the content and intent of the 178 official Brazilian clinical guidelines
    The method uses ~70M tokens of generated text as the primary training signal; any systematic distortion in that data would propagate to the final model.
  • domain assumption The LLM judge used to score PCDT-QA open-ended answers produces reliable, unbiased evaluations aligned with clinical expert judgment
    85.4% accuracy on PCDT-QA is reported using an LLM judge whose own accuracy and consistency are not independently validated in the abstract.
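The second assumption could be probed by measuring raw agreement between the LLM judge and human graders on a labeled subset, along these lines (the verdict data is invented for illustration):

```python
# Illustrative judge-validation check: the fraction of items on which the
# LLM judge's verdict matches a human grader's. Real validation would also
# report agreement corrected for chance (e.g. Cohen's kappa).

def agreement_rate(judge_verdicts: list, human_verdicts: list) -> float:
    if len(judge_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must be the same length")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)

judge = [True, True, False, True, False]
human = [True, True, False, False, False]
print(agreement_rate(judge, human))  # 4 of 5 agree → 0.8
```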

pith-pipeline@v0.9.0 · 5579 in / 1619 out tokens · 43489 ms · 2026-05-09T18:45:11.097601+00:00 · methodology

