R., Rocktäschel, T., and Perez, E
12 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 3
citation-polarity summary: background 3
citing papers explorer
- LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs
  LLM attackers persuade frontier LLMs to generate prohibited essays on consensus topics through multi-turn natural-language pressure, with success rates up to 100% in some model-topic pairs.
- EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
  EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.
- Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations
  LLMs engage in spontaneous persuasion in virtually all multi-turn conversations by favoring information-based strategies like logic and evidence, in contrast to human responses that rely more on social influence and negative emotions.
- Towards an AI co-scientist
  A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
- Seeing Like an AI: How LLMs Apply (and Misapply) Wikipedia Neutrality Norms
  LLMs achieve 64% accuracy detecting Wikipedia bias and remove 79% of words removed by editors when correcting, but produce high-recall low-precision edits rated more neutral by crowds than human versions.
- Interactive Critique-Revision Training for Reliable Structured LLM Generation
  DPA-GRPO trains a generator-verifier pair via group-relative policy optimization on paired counterfactual actions, improving structured output accuracy on TaxCalcBench over zero-shot and generator-only baselines.
- Representing expertise accelerates learning from pedagogical interaction data
  Transformer models trained on synthetic pedagogical interaction data in spatial navigation achieve more robust expert-like performance than those trained only on expert demonstrations, particularly when they can distinguish epistemic states of expert and novice agents.
- Fact-Checking with Contextual Narratives: Leveraging Retrieval-Augmented LLMs for Social Media Analysis
  CRAVE is a new framework that clusters retrieved text and image evidence into narratives and uses an LLM judge to produce explained fact-checking verdicts.
- Towards Robust Argumentative Essay Understanding via TIDE: An Interactive Framework with Trial and Debate
  TIDE integrates trial and debate mechanisms to improve criteria-based prompt optimization for argumentative essay tasks including automated scoring, component detection, and relation identification.
- AI Realtor: Towards Grounded Persuasive Language Generation for Automated Copywriting
  An LLM agent with grounding, personalization, and marketing modules generates real estate descriptions that human buyers prefer over expert-written ones while matching factual accuracy.
- Large Language Model Agent: A Survey on Methodology, Applications and Challenges
  A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.
- LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
  A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.