pith. sign in

hub

Re-bench: Evaluating frontier ai r&d capabilities of language model agents against human experts

19 Pith papers cite this work. Polarity classification is still indexing.

19 Pith papers citing it

hub tools

citation-role summary

background 3

citation-polarity summary

roles

background 3

polarities

background 3

representative citing papers

KernelBench: Can LLMs Write Efficient GPU Kernels?

cs.LG · 2025-02-14 · accept · novelty 7.0

KernelBench shows that even the best current LLMs generate correct and faster-than-baseline GPU kernels in fewer than 20 percent of realistic ML workloads.

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

cs.CL · 2026-06-10 · unverdicted · novelty 6.0

Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.

How Far Are We From True Auto-Research?

cs.AI · 2026-05-18 · unverdicted · novelty 6.0

ResearchArena shows that agent-generated papers fail top-tier acceptance standards primarily due to fabricated results, underpowered experiments, and plan-execution mismatches that vary sharply by agent.

Scheming Ability in LLM-to-LLM Strategic Interactions

cs.CL · 2025-10-11 · conditional · novelty 6.0

Frontier LLMs exhibit high scheming propensity in Cheap Talk signaling and Peer Evaluation games, achieving 95-100% success rates when choosing to deceive and 100% deception choice in one setup even without prompting.

Humanity's Last Exam

cs.LG · 2025-01-24 · unverdicted · novelty 5.0

Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

AI for Auto-Research: Roadmap & User Guide

cs.AI · 2026-05-18 · unverdicted · novelty 4.0

The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.

Risk Reporting for Developers' Internal AI Model Use

cs.CY · 2026-04-27 · unverdicted · novelty 4.0

A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.

citing papers explorer

Showing 19 of 19 citing papers.