Red Teaming Language Models with Language Models. Proceedings of EMNLP 2022.
4 papers cite this work. Polarity classification is still indexing.
Citing papers explorer
- STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack
  STARE uses step-wise RL to attack multimodal models, achieving a 68% higher attack success rate while revealing that adversarial optimization concentrates conceptual toxicity early and detail toxicity late in the generation trajectory.
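The summary's "concept toxicity early, detail toxicity late" pattern can be illustrated with a toy per-step reward blend. This is purely an assumption-laden sketch, not STARE's actual reward: both the `stepwise_rewards` function and the linear weighting schedule are hypothetical illustrations of scoring a generation trajectory step by step.

```python
# Hypothetical sketch only (not STARE's method): blend a "concept" toxicity
# signal, weighted toward early steps, with a "detail" signal weighted toward
# late steps, yielding one scalar reward per generation step.

def stepwise_rewards(concept_tox, detail_tox):
    """concept_tox / detail_tox: per-step toxicity scores in [0, 1].
    Weight shifts linearly from the concept signal (start) to the
    detail signal (end) across the trajectory."""
    n = len(concept_tox)
    rewards = []
    for t in range(n):
        w = t / (n - 1) if n > 1 else 0.0  # 0.0 at the first step, 1.0 at the last
        rewards.append((1 - w) * concept_tox[t] + w * detail_tox[t])
    return rewards

print(stepwise_rewards([0.9, 0.5, 0.1], [0.1, 0.5, 0.9]))  # [0.9, 0.5, 0.9]
```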
- Hierarchical Alignment: Enforcing Hierarchical Instruction-Following in LLMs through Logical Consistency
  NSHA improves LLM handling of hierarchical instruction conflicts by combining solver-guided constraint satisfaction at inference time with distillation of those solver decisions into the model's parameters at training time.
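The flavor of "solver-guided constraint satisfaction" over an instruction hierarchy can be sketched as a toy resolver that keeps the highest-priority consistent subset of instructions. The function name, priority scheme, and conflict encoding below are all illustrative assumptions, not NSHA's actual solver.

```python
# Toy illustration (assumptions, not NSHA's solver): resolve conflicting
# instructions by walking the hierarchy from highest to lowest priority and
# keeping each instruction only if it is consistent with everything kept so far.

def resolve_instructions(instructions, conflicts):
    """instructions: list of (name, priority); lower number = higher rank
       (e.g. 0 = system, 1 = developer, 2 = user).
    conflicts: set of frozensets naming mutually incompatible instructions.
    Returns the instruction names kept, favoring higher-priority ones."""
    kept = []
    for name, _prio in sorted(instructions, key=lambda x: x[1]):
        if all(frozenset((name, k)) not in conflicts for k in kept):
            kept.append(name)
    return kept

# Example: the user request contradicts the system prompt, so it is dropped.
instrs = [("system:no_pii", 0), ("developer:be_brief", 1), ("user:reveal_pii", 2)]
confl = {frozenset(("system:no_pii", "user:reveal_pii"))}
print(resolve_instructions(instrs, confl))  # ['system:no_pii', 'developer:be_brief']
```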
- PaliGemma: A versatile 3B VLM for transfer
  PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance across nearly 40 diverse open-world tasks, including standard VLM benchmarks as well as specialized tasks such as remote sensing and segmentation.
- From Safety Risk to Design Principle: Peer-Preservation in Multi-Agent LLM Systems and Its Implications for Orchestrated Democratic Discourse Analysis
  Peer-preservation in LLMs requires architectural mitigations, such as identity anonymization, rather than model selection alone to maintain reliability in multi-agent systems for democratic discourse evaluation.