Refinebench: Evaluating refinement capability of language models via checklists. arXiv preprint arXiv:2511.22173, 2025

Young-Jun Lee, Seungone Kim, Byung-Kwan Lee, Minkyeong Moon, Yechan Hwang, Jong Myoung Kim, Graham Neubig, Sean Welleck, Ho-Jin Choi · 2025 · arXiv 2511.22173

2 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Evolution Fine-Tuning: Learning to Discover Across 371 Optimization Tasks

cs.CL · 2026-06-27 · unverdicted · novelty 6.0

Evolution Fine-Tuning trains LLMs on 156K trajectories spanning 371 tasks to achieve 10.22% average improvement on 22 held-out optimization tasks and match SOTA on select circle-packing problems when combined with test-time RL.

Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

cs.CL · 2026-05-27 · unverdicted · novelty 6.0

AXPO addresses the Thinking-Acting Gap in agentic RL training by targeted resampling of tool calls in all-wrong subgroups, delivering +1.8pp gains over GRPO on nine multimodal benchmarks with an 8B model beating a 32B baseline on Pass@4.

citing papers explorer

Showing 2 of 2 citing papers.

Evolution Fine-Tuning: Learning to Discover Across 371 Optimization Tasks · cs.CL · 2026-06-27 · unverdicted · none · ref 61
Agent Explorative Policy Optimization for Multimodal Agentic Reasoning · cs.CL · 2026-05-27 · unverdicted · none · ref 83
