Evaluating frontier models for dangerous capabilities

Evaluating Frontier Models for Dangerous Capabilities · 2024 · arXiv 2403.13793

18 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 2 · unclear 1

representative citing papers

Theory of Mind and Persuasion Beyond Conversation: Assessing the Capacity of LLMs to Induce Belief States via Planning and Action

cs.CL · 2026-06-30 · unverdicted · novelty 7.0

Introduces the NCP-ExploreToM framework for evaluating whether LLMs can induce belief states via planning and action; GPT-5 succeeds on ~80% of tasks and outperforms humans.

Measuring Safety Alignment Effects in Autonomous Security Agents

cs.CR · 2026-05-19 · conditional · novelty 7.0

A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.

ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks?

cs.CR · 2026-05-11 · conditional · novelty 7.0

ExploitGym benchmark shows frontier AI models can generate working exploits for 120-157 of 898 real vulnerabilities, with non-trivial success even when common security defenses are enabled.

CyberCertBench: Evaluating LLMs in Cybersecurity Certification Knowledge

cs.CR · 2026-04-22 · unverdicted · novelty 7.0

CyberCertBench shows frontier LLMs reach human-expert performance on general IT and networking security but drop on vendor-specific and formal standards questions such as IEC 62443, with a new framework for producing interpretable explanations.

Frontier Models are Capable of In-context Scheming

cs.AI · 2024-12-06 · conditional · novelty 7.0

Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.

LLM Agents can Autonomously Exploit One-day Vulnerabilities

cs.CR · 2024-04-11 · unverdicted · novelty 7.0

GPT-4 LLM agents autonomously exploit 87% of tested one-day vulnerabilities when given CVE descriptions, far outperforming other models and tools.

Rollout Cards: A Reproducibility Standard for Agent Research

cs.AI · 2026-05-12 · conditional · novelty 6.0

Rollout cards preserve complete agent rollout records and declare the reporting rules behind scores, enabling reproducible evaluation where changing only the rule can alter success rates by over 20 percentage points.

Comprehensive AI governance requires addressing non-model gains

cs.CY · 2026-05-01 · unverdicted · novelty 6.0

Non-model gains via inference, systems, and assets can drive AI capabilities independently of base models, requiring governance beyond model-level evaluation and mitigation.

The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems

cs.CY · 2026-02-19 · accept · novelty 6.0

The 2025 AI Agent Index catalogs technical and safety details for 30 deployed AI agents and finds low developer transparency on safety, evaluations, and societal impacts.

Safe for Whom? Rethinking How We Evaluate the Safety of LLMs for Real Users

cs.AI · 2025-12-11 · unverdicted · novelty 6.0

LLM safety evaluations for personal advice must test responses against diverse user vulnerability profiles, since context-blind ratings overestimate safety and realistic prompt context does not fix the problem.

Benchmarking Misuse Mitigation Against Covert Adversaries

cs.CR · 2025-06-06 · unverdicted · novelty 6.0

Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.

Towards an AI co-scientist

cs.AI · 2025-02-26 · unverdicted · novelty 6.0

A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

The Emergence of Autonomous Penetration Capabilities in Large Language Model-Powered AI Systems

cs.CR · 2026-06-11 · unverdicted · novelty 5.0

A tiered server benchmark with 300 targets shows current LLMs achieve autonomous penetration success rates of 10.7-69.3% using only general cybersecurity tools and no target-specific knowledge.

Solipsistic Superintelligence is Unlikely to be Cooperative

cs.AI · 2026-06-02 · unverdicted · novelty 5.0

Solipsistic superintelligence developed via unilateral optimization is unlikely to cooperate due to endogenous non-stationarity creating an unclosable train-test-deploy gap.

From Disclosure to Self-Referential Opacity: Six Dimensions of Strain in Current AI Governance

cs.CY · 2026-04-15 · unverdicted · novelty 4.0

As AI capability asymmetry increases, disclosure-based governance fails because systems either game evaluations or become embedded in oversight, straining legitimacy and non-domination more than corrigibility or resilience.

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

cs.CL · 2025-07-07 · unverdicted · novelty 4.0

Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.

OpenAI o1 System Card

cs.AI · 2024-12-21 · unverdicted · novelty 4.0

OpenAI reports that chain-of-thought reasoning in o1 models enables deliberative alignment, yielding state-of-the-art results on selected safety benchmarks for illicit advice, stereotypes, and jailbreaks.

Gemma 2: Improving Open Language Models at a Practical Size

cs.CL · 2024-07-31 · conditional · novelty 3.0

Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.

citing papers explorer

Showing 18 of 18 citing papers.

Theory of Mind and Persuasion Beyond Conversation: Assessing the Capacity of LLMs to Induce Belief States via Planning and Action · cs.CL · 2026-06-30 · unverdicted · none · ref 75
Measuring Safety Alignment Effects in Autonomous Security Agents · cs.CR · 2026-05-19 · conditional · none · ref 49
ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks? · cs.CR · 2026-05-11 · conditional · none · ref 44
CyberCertBench: Evaluating LLMs in Cybersecurity Certification Knowledge · cs.CR · 2026-04-22 · unverdicted · none · ref 18
Frontier Models are Capable of In-context Scheming · cs.AI · 2024-12-06 · conditional · none · ref 31
LLM Agents can Autonomously Exploit One-day Vulnerabilities · cs.CR · 2024-04-11 · unverdicted · none · ref 12
Rollout Cards: A Reproducibility Standard for Agent Research · cs.AI · 2026-05-12 · conditional · none · ref 2
Comprehensive AI governance requires addressing non-model gains · cs.CY · 2026-05-01 · unverdicted · none · ref 76
The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems · cs.CY · 2026-02-19 · accept · none · ref 103
Safe for Whom? Rethinking How We Evaluate the Safety of LLMs for Real Users · cs.AI · 2025-12-11 · unverdicted · none · ref 8
Benchmarking Misuse Mitigation Against Covert Adversaries · cs.CR · 2025-06-06 · unverdicted · none · ref 24
Towards an AI co-scientist · cs.AI · 2025-02-26 · unverdicted · none · ref 38
The Emergence of Autonomous Penetration Capabilities in Large Language Model-Powered AI Systems · cs.CR · 2026-06-11 · unverdicted · none · ref 12
Solipsistic Superintelligence is Unlikely to be Cooperative · cs.AI · 2026-06-02 · unverdicted · none · ref 199
From Disclosure to Self-Referential Opacity: Six Dimensions of Strain in Current AI Governance · cs.CY · 2026-04-15 · unverdicted · none · ref 69
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities · cs.CL · 2025-07-07 · unverdicted · none · ref 64
OpenAI o1 System Card · cs.AI · 2024-12-21 · unverdicted · none · ref 5
Gemma 2: Improving Open Language Models at a Practical Size · cs.CL · 2024-07-31 · conditional · none · ref 37
