CodeCriticBench: A holistic code critique benchmark for large language models

Alexander Zhang, Marcus Dong, Jiaheng Liu, Wei Zhang, Yejie Wang, Jian Yang, Ge Zhang, Tianyu Liu, Zhongyuan Peng, Yingshui Tan, Yuanxing Zhang, Zhexu Wang, Weixun Wang, Yancheng He, Ken Deng, Wangchunshu Zhou, Wenhao Huang, Zhaoxiang Z · 2025 · arXiv 2502.16614

3 Pith papers cite this work. Polarity classification is still indexing.



citation-role summary

dataset 1

citation-polarity summary

use dataset 1

representative citing papers

DeployBench: Benchmarking LLM Agents for Research Artifact Deployment

cs.SE · 2026-06-03 · unverdicted · novelty 7.0

DeployBench is a new benchmark of 51 research-artifact deployment tasks where four LLMs with OpenHands achieve 7.8-51% pass rates, with failures mostly from agents stopping after weaker self-checks than the paper requires.

Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces

cs.CL · 2026-06-05 · unverdicted · novelty 5.0

Reasoning in large output spaces proceeds via shortlisting then fine-grained reasoning; this characterization enables a mechanistic distillation strategy that outperforms standard distillation.

Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

cs.CL · 2026-05-04 · unverdicted · novelty 4.0

This survey organizes RL for LLM multi-agent systems into reward families, credit units, and five orchestration sub-decisions, notes the absence of explicit stopping-decision training in its paper pool, and releases a tagged corpus.

citing papers explorer

Showing 3 of 3 citing papers.

DeployBench: Benchmarking LLM Agents for Research Artifact Deployment · cs.SE · 2026-06-03 · unverdicted · none · ref 20
Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces · cs.CL · 2026-06-05 · unverdicted · none · ref 116
Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces · cs.CL · 2026-05-04 · unverdicted · none · ref 83
