ExploitGym benchmark shows frontier AI models can generate working exploits for 120-157 of 898 real vulnerabilities, with non-trivial success even when common security defenses are enabled.
Title resolution pending
6 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
CyberCertBench shows frontier LLMs reach human-expert performance on general IT and networking security but drop on vendor-specific and formal standards questions such as IEC 62443, with a new framework for producing interpretable explanations.
Rollout cards preserve complete agent rollout records and declare the reporting rules behind scores, enabling reproducible evaluation where changing only the rule can alter success rates by over 20 percentage points.
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
As AI capability asymmetry increases, disclosure-based governance fails because systems either game evaluations or become embedded in oversight, straining legitimacy and non-domination more than corrigibility or resilience.
Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.
citing papers explorer
-
ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks?
ExploitGym benchmark shows frontier AI models can generate working exploits for 120-157 of 898 real vulnerabilities, with non-trivial success even when common security defenses are enabled.
-
CyberCertBench: Evaluating LLMs in Cybersecurity Certification Knowledge
CyberCertBench shows frontier LLMs reach human-expert performance on general IT and networking security but drop on vendor-specific and formal standards questions such as IEC 62443, with a new framework for producing interpretable explanations.
-
Rollout Cards: A Reproducibility Standard for Agent Research
Rollout cards preserve complete agent rollout records and declare the reporting rules behind scores, enabling reproducible evaluation where changing only the rule can alter success rates by over 20 percentage points.
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
From Disclosure to Self-Referential Opacity: Six Dimensions of Strain in Current AI Governance
As AI capability asymmetry increases, disclosure-based governance fails because systems either game evaluations or become embedded in oversight, straining legitimacy and non-domination more than corrigibility or resilience.
-
Gemma 2: Improving Open Language Models at a Practical Size
Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.