Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

https://arxiv · 2024 · arXiv 2411.15114

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Principles and Guidelines for Randomized Controlled Trials in AI Evaluation

cs.CY · 2026-05-03 · unverdicted · novelty 6.0

The authors adapt established RCT validity principles from other fields into a standardized framework with 33 guidelines tailored to AI evaluation contexts.

TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration

cs.AI · 2026-04-15 · unverdicted · novelty 6.0

TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.

Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

cs.AI · 2026-04-14 · unverdicted · novelty 6.0

Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limited success.

Humanity's Last Exam

cs.LG · 2025-01-24 · unverdicted · novelty 5.0

Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

Risk Reporting for Developers' Internal AI Model Use

cs.CY · 2026-04-27 · unverdicted · novelty 4.0

A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.

citing papers explorer

Principles and Guidelines for Randomized Controlled Trials in AI Evaluation

cs.CY · 2026-05-03 · unverdicted · none · ref 34

TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration

cs.AI · 2026-04-15 · unverdicted · none · ref 48

Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

cs.AI · 2026-04-14 · unverdicted · none · ref 24

Humanity's Last Exam

cs.LG · 2025-01-24 · unverdicted · none · ref 60

Risk Reporting for Developers' Internal AI Model Use

cs.CY · 2026-04-27 · unverdicted · none · ref 48
