Weak-to-strong generalization: eliciting strong capabilities with weak supervision
12 Pith papers cite this work. Polarity classification is still indexing.
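The paper's core evaluation protocol — fine-tune a strong model on labels produced by a weaker supervisor, then measure how much of the weak-to-ceiling gap it recovers (PGR, "performance gap recovered") — can be illustrated with a minimal numpy sketch. This is an assumption-laden toy: the synthetic linear task, the 20% label-noise "weak supervisor", and all variable names are illustrative, not the paper's actual models or datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic task: ground-truth labels come from a linear rule over 10 features.
d, n_train, n_test = 10, 4000, 2000
w_true = rng.normal(size=d)
X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))
y_tr = (X_tr @ w_true > 0).astype(float)
y_te = (X_te @ w_true > 0).astype(float)

# "Weak supervisor": the true rule corrupted by 20% random label flips,
# so its accuracy is ~0.80 by construction.
noise_rate = 0.2
flip = rng.random(n_train) < noise_rate
weak_labels = np.where(flip, 1.0 - y_tr, y_tr)
weak_acc = 1.0 - noise_rate

def fit_logreg(X, y, steps=500, lr=0.5):
    """Plain logistic regression trained by full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def accuracy(w, X, y):
    return float(np.mean((X @ w > 0).astype(float) == y))

ceiling = accuracy(fit_logreg(X_tr, y_tr), X_te, y_te)     # strong model, clean labels
w2s = accuracy(fit_logreg(X_tr, weak_labels), X_te, y_te)  # strong model, weak labels

# Performance Gap Recovered: fraction of the weak-to-ceiling gap closed
# by the strong model trained only on weak supervision.
pgr = (w2s - weak_acc) / (ceiling - weak_acc)
print(f"weak={weak_acc:.2f}  weak-to-strong={w2s:.3f}  ceiling={ceiling:.3f}  PGR={pgr:.2f}")
```

In this linear toy the strong model beats its supervisor simply by averaging out symmetric label noise; in the paper the gains come from the strong model's pretrained representations, which this sketch does not model.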
citing papers explorer
- The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
  The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.
- The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge
  In two-layer networks, weak-to-strong training elicits the target feature direction from pre-trained subspaces and preserves correlated off-target features, unlike standard fine-tuning.
- Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images
  A new benchmark reveals MLLMs achieve only 13% or lower accuracy on advanced perspective-conditioned spatial tasks in omnidirectional images, with RL reward shaping raising a 7B model from 31% to 60% in controlled settings.
- A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
  A new 7×4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
- Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
  Pretrained base models yield to peer disagreement more readily than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.
- Automated alignment is harder than you think
  Automating alignment research with AI agents risks undetected systematic errors in fuzzy tasks, producing overconfident but misleading safety evaluations that could enable deployment of misaligned AI.
- Honest Reporting in Scored Oversight: True-KL0 Property via the Prekopa Principle
  For heterogeneous power-p pseudospherical scoring rules with d ≤ 4, the True-KL0 property R(M,p,d) < 1 holds for all M > 1, establishing unconditional DSIC via a Prekopa-based log-concavity argument on the loss integral.
- AI Alignment via Incentives and Correction
  AI alignment is reframed as a fixed-point incentive problem in a solver-auditor pipeline, solved via bilevel optimization and bandit search over reward profiles to maintain monitoring and reduce hallucinations in LLM coding tasks.
- Learning Stable Predictors from Weak Supervision under Distribution Shift
  Weak supervision supports in-domain learning for CRISPR transcriptomic perturbations, but temporal shifts cause negative R-squared and near-zero correlation across linear and tree models, unlike partial cell-line transfer.
- Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
  Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT that improves classifier generalization by ablating irrelevant features.
- Weak-to-Strong Knowledge Distillation Accelerates Visual Learning
  Weak-to-strong knowledge distillation applied early and then turned off accelerates convergence to target performance in visual learning tasks by factors of 1.7-4.8x.
- Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems
  Frontier AI needs contextual multi-objective optimization to select and balance multiple context-dependent objectives rather than relying on single stable goals.
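The "applied early and then turned off" schedule from the distillation entry above can be sketched as a loss weight that zeroes out after a cutoff step. This is a minimal sketch of the general idea only; the function names, the 1000-step cutoff, and the 0.5 mixing weight are assumptions, not values from that paper.

```python
def kd_weight(step, kd_off_step=1000, alpha=0.5):
    """Hypothetical schedule: weight the distillation term by alpha for the
    first kd_off_step updates, then drop it entirely."""
    return alpha if step < kd_off_step else 0.0

def total_loss(task_loss, kd_loss, step, kd_off_step=1000, alpha=0.5):
    """Convex mix of task loss and weak-teacher KD loss under the schedule."""
    a = kd_weight(step, kd_off_step, alpha)
    return (1.0 - a) * task_loss + a * kd_loss

# Early in training the weak teacher shapes the objective...
early = total_loss(task_loss=1.0, kd_loss=3.0, step=0)
# ...after the cutoff, only the task loss remains.
late = total_loss(task_loss=1.0, kd_loss=3.0, step=5000)
```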