In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6071–6086

Jiageng Wu, Kevin Xie, Bowen Gu, Nils Krüger, Kueiyu Joshua Lin, Jie Yang · 2025 · arXiv 2509.21933

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6.7 points.

Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation

cs.AI · 2026-05-27 · unverdicted · novelty 6.0

In medical CoT distillation, answer accuracy on MedQA-USMLE rises from 74.7% to 84.4% while step-level reasoning error increases from 30.6% to 50.3% per LLM-judge audit.

LLM-as-a-Judge in Healthcare: A Scoping Analysis of Applications, Methods, and Human Alignment

cs.CY · 2026-05-24 · unverdicted · novelty 6.0

Scoping review of 134 studies on LLM-as-a-Judge in healthcare finds concentration in clinical decision support and NLP, frequent use of OpenAI models with prompt engineering, and moderate-to-strong human alignment where validated.

Compared to What? Baselines and Metrics for Counterfactual Prompting

cs.CL · 2026-05-01 · conditional · novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.

citing papers explorer

MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning · cs.CL · 2026-04-19 · unverdicted · none · ref 14
Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation · cs.AI · 2026-05-27 · unverdicted · none · ref 4
LLM-as-a-Judge in Healthcare: A Scoping Analysis of Applications, Methods, and Human Alignment · cs.CY · 2026-05-24 · unverdicted · none · ref 158
Compared to What? Baselines and Metrics for Counterfactual Prompting · cs.CL · 2026-05-01 · conditional · none · ref 66
