Reward Modeling for Scientific Writing Evaluation

2026 · cs.CL · arXiv 2601.11374

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Scientific writing is an expert-domain task that demands deep domain knowledge, task-specific requirements and reasoning capabilities that leverage the domain knowledge to satisfy the task specifications. While scientific text generation has been widely studied, its evaluation remains a challenging and open problem. It is critical to develop models that can be reliably deployed for evaluating diverse open-ended scientific writing tasks while adhering to their distinct requirements. However, existing LLM-based judges and reward models are primarily optimized for general-purpose benchmarks with fixed scoring rubrics and evaluation criteria. Consequently, they often fail to reason over sparse knowledge of scientific domains when interpreting task-dependent and multi-faceted criteria. Moreover, fine-tuning for each individual task is costly and impractical for low-resource settings. To bridge these gaps, we propose cost-efficient, open-source reward models tailored for scientific writing evaluation. We introduce a two-stage training framework that initially optimizes scientific evaluation preferences and then refines reasoning capabilities. Our multi-aspect evaluation design and joint training across diverse tasks enable fine-grained assessment and robustness to dynamic criteria and scoring rubrics. Experimental analysis shows that our training regime strongly improves LLM-based scientific writing evaluation. Our models generalize effectively across tasks and to previously unseen scientific writing evaluation settings, allowing a single trained evaluator to be reused without task-specific retraining.

representative citing papers

Subjective Code Preferences in Experts and Large Language Models

cs.HC · 2026-05-24 · unverdicted · novelty 6.0

LLMs frequently reverse their stated coding preferences when shown actual code instead of descriptions, show positional bias, and produce more polarized ratings than human experts on complexity, commenting, modularity, and readability.

Uncertainty-Aware Generation and Decision-Making Under Ambiguity

cs.CL · 2026-06-29 · unverdicted · novelty 4.0

Uncertainty-aware algorithms based on Bayesian decision theory improve generation utility on tutoring and reviewing tasks while risk-averse methods can degrade performance under high ambiguity, with conformal prediction providing guarantees.

