InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6071–6086

Jiageng Wu, Kevin Xie, Bowen Gu, Nils Krüger, Kueiyu Joshua Lin, Jie Yang · 2025 · arXiv 2509.21933

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

representative citing papers

MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6.7 points.

Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation

cs.AI · 2026-05-27 · unverdicted · novelty 6.0

In medical CoT distillation, answer accuracy on MedQA-USMLE rises from 74.7% to 84.4% while step-level reasoning error increases from 30.6% to 50.3% per LLM-judge audit.

LLM-as-a-Judge in Healthcare: A Scoping Analysis of Applications, Methods, and Human Alignment

cs.CY · 2026-05-24 · unverdicted · novelty 6.0

Scoping review of 134 studies on LLM-as-a-Judge in healthcare finds concentration in clinical decision support and NLP, frequent use of OpenAI models with prompt engineering, and moderate-to-strong human alignment where validated.

Compared to What? Baselines and Metrics for Counterfactual Prompting

cs.CL · 2026-05-01 · conditional · novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation cs.AI · 2026-05-27 · unverdicted · none · ref 4
In medical CoT distillation, answer accuracy on MedQA-USMLE rises from 74.7% to 84.4% while step-level reasoning error increases from 30.6% to 50.3% per LLM-judge audit.

InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6071–6086

fields

years

verdicts

representative citing papers

citing papers explorer