arXiv preprint arXiv:2401.00595

(Cited on pg · 2022 · arXiv 2401.00595

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

cs.SE · 2024-03-12 · unverdicted · novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

cs.SE · 2024-01-05 · accept · novelty 6.0

The CRUXEval benchmark shows that current code models, including GPT-4, achieve at most 81% on input and output prediction for short Python functions, exposing gaps not captured by HumanEval.

citing papers explorer

Showing 3 of 3 citing papers.

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
cs.CL · 2026-05-07 · unverdicted · none · ref 14
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
cs.SE · 2024-03-12 · unverdicted · none · ref 106
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
cs.SE · 2024-01-05 · accept · none · ref 5
