Title resolution pending

Why chain of thought fails in clinical text understanding · 2025 · arXiv 2509.21933

5 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6,500 questions, 13,000 chains, 113,910 step labels, and a baseline that improves downstream QA accuracy by 3.2 to 6.7 points.

Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation

cs.AI · 2026-05-27 · unverdicted · novelty 6.0

In medical CoT distillation, answer accuracy on MedQA-USMLE rises from 74.7% to 84.4% while the step-level reasoning error rate increases from 30.6% to 50.3%, according to an LLM-judge audit.

LLM-as-a-Judge in Healthcare: A Scoping Analysis of Applications, Methods, and Human Alignment

cs.CY · 2026-05-24 · unverdicted · novelty 6.0

Scoping review of 134 studies on LLM-as-a-Judge in healthcare finds concentration in clinical decision support and NLP, frequent use of OpenAI models with prompt engineering, and moderate-to-strong human alignment where validated.

Compared to What? Baselines and Metrics for Counterfactual Prompting

cs.CL · 2026-05-01 · conditional · novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.

When Reasoning Hurts: Source-Aware Evaluation of Frontier LLMs for Clinical SOAP Note Generation

cs.CL · 2026-05-24 · unverdicted · novelty 5.0

Non-reasoning GPT-5.4 outperforms its reasoning version on SOAP note quality across datasets, with RAG giving smaller, model-dependent gains.

citing papers explorer

Showing 1 of 1 citing paper after filters.

LLM-as-a-Judge in Healthcare: A Scoping Analysis of Applications, Methods, and Human Alignment

cs.CY · 2026-05-24 · unverdicted · none · ref 158

Scoping review of 134 studies on LLM-as-a-Judge in healthcare finds concentration in clinical decision support and NLP, frequent use of OpenAI models with prompt engineering, and moderate-to-strong human alignment where validated.
