Supervised fine-tuning of LLMs often fails to fully internalize all training instances due to five recurring causes including missing prerequisites and data conflicts, as diagnosed via a new framework across multiple models.
Title resolution pending
19 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
GPT-4 exceeds the USMLE passing score by more than 20 points and outperforms both GPT-3.5 and the medically fine-tuned Med-PaLM on the MultiMedQA benchmarks.
Atomic fact-checking of LLM oncology recommendations increased clinician trust from 26.9% to 66.5% (Cohen's d=0.94) in a trial of 356 doctors.
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.
PMC-VQA dataset and MedVInT model achieve better generative performance on medical VQA benchmarks by visual instruction tuning on a newly constructed large-scale dataset.
Med-PaLM 2 achieves 86.5% accuracy on MedQA and approaches or exceeds prior state-of-the-art on other medical QA benchmarks while receiving higher physician preference ratings than human answers on consumer questions.
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
The Flan Collection demonstrates that task balancing, data enrichment, and mixed prompt training are critical to effective instruction tuning, yielding stronger Flan-T5 models released publicly.
TriageRA-CCF combines source-side confidence, coverage, and counterfactual signals to supervise an adaptive LoRA rank router, reporting modest average accuracy gains over LoRA/DoRA/MoELoRA baselines on two 8B models under matched training.
Introduces a French OSCE dialogue dataset of 240 interactions and a modular LLM-based controllable virtual patient generation system with multi-level LLM-as-Judge evaluation for clinical skills training.
MeMo encodes new knowledge into a separate memory model that integrates with frozen LLMs, showing strong performance on QA benchmarks while avoiding catastrophic forgetting and working without access to model weights.
Fine-tuned MedGemma outperforms untuned GPT-4 in zero-shot medical image disease classification, achieving 80.37% versus 69.58% mean test accuracy with higher sensitivity for cancer and pneumonia.
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
The paper surveys data-centric strategies for foundation models in computational healthcare and supplies a curated list of related models and datasets.
citing papers explorer
-
Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models
Supervised fine-tuning of LLMs often fails to fully internalize all training instances due to five recurring causes including missing prerequisites and data conflicts, as diagnosed via a new framework across multiple models.
-
Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment
Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
-
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
-
Capabilities of GPT-4 on Medical Challenge Problems
GPT-4 exceeds the USMLE passing score by more than 20 points and outperforms both GPT-3.5 and the medically fine-tuned Med-PaLM on the MultiMedQA benchmarks.
-
Atomic Fact-Checking Increases Clinician Trust in Large Language Model Recommendations for Oncology Decision Support: A Randomized Controlled Trial
Atomic fact-checking of LLM oncology recommendations increased clinician trust from 26.9% to 66.5% (Cohen's d=0.94) in a trial of 356 doctors.
-
Compared to What? Baselines and Metrics for Counterfactual Prompting
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.
-
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
PMC-VQA dataset and MedVInT model achieve better generative performance on medical VQA benchmarks by visual instruction tuning on a newly constructed large-scale dataset.
-
Towards Expert-Level Medical Question Answering with Large Language Models
Med-PaLM 2 achieves 86.5% accuracy on MedQA and approaches or exceeds prior state-of-the-art on other medical QA benchmarks while receiving higher physician preference ratings than human answers on consumer questions.
-
BloombergGPT: A Large Language Model for Finance
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
-
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
The Flan Collection demonstrates that task balancing, data enrichment, and mixed prompt training are critical to effective instruction tuning, yielding stronger Flan-T5 models released publicly.
-
TriageRA-CCF: Source-Side Clinical Confidence and Coverage Signals for Adaptive Rank Budgeting in Medical LLMs
TriageRA-CCF combines source-side confidence, coverage, and counterfactual signals to supervise an adaptive LoRA rank router, reporting modest average accuracy gains over LoRA/DoRA/MoELoRA baselines on two 8B models under matched training.
-
A French OSCE Dialogue Dataset and Controllable Virtual Patient System for Clinical Training
Introduces a French OSCE dialogue dataset of 240 interactions and a modular LLM-based controllable virtual patient generation system with multi-level LLM-as-Judge evaluation for clinical skills training.
-
MeMo: Memory as a Model
MeMo encodes new knowledge into a separate memory model that integrates with frozen LLMs, showing strong performance on QA benchmarks while avoiding catastrophic forgetting and working without access to model weights.
-
MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images
Fine-tuned MedGemma outperforms untuned GPT-4 in zero-shot medical image disease classification, achieving 80.37% versus 69.58% mean test accuracy with higher sensitivity for cancer and pneumonia.
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
-
Data-Centric Foundation Models in Computational Healthcare: A Survey
The paper surveys data-centric strategies for foundation models in computational healthcare and supplies a curated list of related models and datasets.
- CareTransition-Audit: A Benchmark to Audit Discharge Summaries for Efficient Care Transitions