NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts

Shudan Zhang, Hanlin Zhao, Xiao Liu, Qinkai Zheng, Zehan Qi, Xiaotao Gu, Xiaohan Zhang, Yuxiao Dong, Jie Tang · 2024 · arXiv 2405.04520

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

dataset 1

citation-polarity summary

use dataset 1

representative citing papers

Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

cs.CL · 2025-08-04 · unverdicted · novelty 6.0

Seed Diffusion Preview is a discrete diffusion language model that reaches 2146 tokens per second inference on H20 GPUs with competitive code benchmark performance, establishing a new speed-quality Pareto frontier.

I-WebGenBench: Evaluating Interactivity in LLM-Generated Scientific Web Applications

cs.CL · 2026-05-30 · unverdicted · novelty 5.0

A Paper-to-Interactive-System Agent and the I-WebGenBench benchmark (19 papers) enable the conversion of scientific PDFs into executable interactive web systems; the PaperVoyager framework is shown to improve quality.

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

cs.CL · 2024-06-18 · unverdicted · novelty 3.0

GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, and Chinese alignment while adding autonomous tool use for web, code, and image generation.

citing papers explorer

Showing 3 of 3 citing papers.

Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference cs.CL · 2025-08-04 · unverdicted · none · ref 31
I-WebGenBench: Evaluating Interactivity in LLM-Generated Scientific Web Applications cs.CL · 2026-05-30 · unverdicted · none · ref 50
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools cs.CL · 2024-06-18 · unverdicted · none · ref 55
