A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
arXiv preprint arXiv:2401.00595 , year =
3 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
CRUXEval benchmark shows current code models including GPT-4 achieve at most 81% on input and output prediction for short Python functions, exposing gaps not captured by HumanEval.
citing papers explorer
-
Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
CRUXEval benchmark shows current code models including GPT-4 achieve at most 81% on input and output prediction for short Python functions, exposing gaps not captured by HumanEval.