MED-VRAG reaches 78.6% average accuracy on four medical QA benchmarks by iteratively retrieving PMC page images with ColQwen2.5 embeddings and a VLM that refines queries over up to three rounds.
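The retrieve-then-refine loop described above can be sketched in a few lines. Everything below (the toy corpus, the word-overlap retriever, and the query-refinement stand-in for the VLM) is a hypothetical illustration of the loop structure, not the paper's actual pipeline or the ColQwen2.5 retriever.

```python
# Minimal sketch of an iterative retrieve-and-refine loop: a retriever
# scores pages against the query, and a "refiner" (standing in for the
# VLM) expands the query for up to three rounds, stopping early once the
# retrieved set stops changing. Corpus and functions are toy stand-ins.

TOY_CORPUS = {
    "page_1": "aspirin dosage guidelines",
    "page_2": "beta blocker contraindications",
    "page_3": "aspirin interaction with warfarin",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Toy retriever: rank pages by word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(
        TOY_CORPUS,
        key=lambda p: len(q & set(TOY_CORPUS[p].split())),
        reverse=True,
    )
    return scored[:k]

def refine_query(query: str, pages: list[str]) -> str:
    """Toy refinement: append a salient term from each retrieved page."""
    extra = " ".join(TOY_CORPUS[p].split()[-1] for p in pages)
    return f"{query} {extra}"

def iterative_retrieve(query: str, max_rounds: int = 3) -> list[str]:
    pages: list[str] = []
    for _ in range(max_rounds):
        new_pages = retrieve(query)
        if new_pages == pages:  # converged: same pages as last round
            break
        pages = new_pages
        query = refine_query(query, pages)
    return pages

print(iterative_retrieve("aspirin interactions"))
```

The early-exit on an unchanged retrieval set is one simple way to realize "up to three rounds" without always paying for all three.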
Capabilities of GPT-4 on Medical Challenge Problems
16 Pith papers cite this work.
representative citing papers
Indistinguishability-based privacy is incomparable to extractability in LLMs, and a new (l, b)-inextractability definition with rank-based bounds provides a tighter measure of extraction risk than prior proxies.
CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tuned Qwen3-14B model and 50% relative improvement over baselines.
MedSkillAudit is a new domain-specific audit framework for medical research agent skills that achieved moderate agreement with expert reviews (ICC 0.449), exceeding the human inter-rater baseline (ICC 0.300).
HEG-TKG grounds LLM clinical reasoning in hierarchical evidence-based temporal knowledge graphs from 4,512 PubMed records, delivering 100% citation verifiability and error detectability where standard RAG and unprompted LLMs produce none.
MedDialBench shows LLMs suffer 1.7-3.4x larger diagnostic accuracy drops from patients fabricating symptoms than withholding them, with fabrication driving super-additive interaction effects across models.
EvidenceNet releases disease-specific biomedical knowledge bases with 7,872 and 6,622 evidence records for HCC and CRC, plus knowledge graphs, extracted via an LLM pipeline with reported high fidelity.
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
Domain fine-tuning of a 4B LLM yields a statistically significant 6.8 pp accuracy gain on MedQA-USMLE over a general baseline, while RAG over medical explanations produces no significant improvement.
VeriLLMed is an interactive visual debugging tool that maps LLM diagnostic reasoning to knowledge graphs to identify and categorize relation, branch, and missing errors.
EviCare uses deep model-guided evidence to enhance LLM in-context reasoning for accurate diagnosis prediction from EHRs, outperforming baselines by 20.65% on average and 30.97% for novel diagnoses on MIMIC datasets.
A 14B model trained on synthetic data from Brazilian clinical guidelines outperforms larger LLMs on new benchmarks for Brazilian healthcare protocols.
The paper introduces a dual-layer AI identification framework that integrates cryptographic, blockchain, and zero-knowledge techniques with governance checkpoints to support lifecycle accountability in digital enterprises.
Dense retrieval plus query reformulation and reranking reaches 60.49% accuracy on MedQA USMLE, outperforming other setups while domain-specialized models make better use of the retrieved evidence.
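A two-stage setup like the one summarized above can be sketched as dense retrieval for recall followed by a reranker for precision. The embeddings, documents, and both scorers below are illustrative stand-ins under that assumption, not the evaluated system.

```python
# Toy two-stage pipeline: stage 1 does dense retrieval (cosine over
# hypothetical embeddings), stage 2 reranks the candidates with a
# stand-in cross-style scorer based on exact term overlap.
import math

DOCS = {
    "d1": [0.9, 0.1],  # hypothetical dense embeddings
    "d2": [0.2, 0.8],
    "d3": [0.7, 0.3],
}
TEXT = {
    "d1": "first-line therapy for hypertension",
    "d2": "pediatric asthma management",
    "d3": "hypertension in pregnancy first-line",
}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def dense_retrieve(q_emb: list[float], k: int = 2) -> list[str]:
    """Stage 1: recall-oriented top-k by embedding similarity."""
    return sorted(DOCS, key=lambda d: cosine(q_emb, DOCS[d]), reverse=True)[:k]

def rerank(query_terms: list[str], cands: list[str]) -> list[str]:
    """Stage 2: precision-oriented reorder by term overlap with doc text."""
    return sorted(
        cands,
        key=lambda d: len(set(query_terms) & set(TEXT[d].split())),
        reverse=True,
    )

cands = dense_retrieve([0.8, 0.2], k=2)
ranked = rerank(["hypertension", "pregnancy"], cands)
print(ranked)
```

The point of the split is that the cheap dense stage keeps the candidate set small, so the more expensive reranker (here trivially cheap) only scores a handful of documents.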
Domain-specific models like ChatDoctor excel at medically accurate and contextually reliable text while general-purpose models like Grok and LLaMA perform better on structured medical question-answering tasks.