Red Teaming Language Models with Language Models. Proceedings of EMNLP 2022.
4 papers cite this work. Polarity classification is still indexing.
Citing papers explorer
- STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack
  STARE uses step-wise RL to attack multimodal models, achieving a 68% higher attack success rate while revealing that adversarial optimization concentrates conceptual toxicity early and detail toxicity late in the generation trajectory.
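The summary's "concept toxicity early, detail toxicity late" pattern can be illustrated with a toy per-step reward blend. This is purely an assumption-laden sketch, not STARE's actual reward: both the `stepwise_rewards` function and the linear weighting schedule are hypothetical illustrations of scoring a generation trajectory step by step.

```python
# Hypothetical sketch only (not STARE's method): blend a "concept" toxicity
# signal, weighted toward early steps, with a "detail" signal weighted toward
# late steps, yielding one scalar reward per generation step.

def stepwise_rewards(concept_tox, detail_tox):
    """concept_tox / detail_tox: per-step toxicity scores in [0, 1].
    Weight shifts linearly from the concept signal (start) to the
    detail signal (end) across the trajectory."""
    n = len(concept_tox)
    rewards = []
    for t in range(n):
        w = t / (n - 1) if n > 1 else 0.0  # 0.0 at the first step, 1.0 at the last
        rewards.append((1 - w) * concept_tox[t] + w * detail_tox[t])
    return rewards

print(stepwise_rewards([0.9, 0.5, 0.1], [0.1, 0.5, 0.9]))  # [0.9, 0.5, 0.9]
```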
- Hierarchical Alignment: Enforcing Hierarchical Instruction-Following in LLMs through Logical Consistency
  NSHA improves LLM handling of hierarchical instruction conflicts by combining solver-guided constraint satisfaction at inference time with distillation of those solver decisions into the model's parameters at training time.
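The flavor of "solver-guided constraint satisfaction" over an instruction hierarchy can be sketched as a toy resolver that keeps the highest-priority consistent subset of instructions. The function name, priority scheme, and conflict encoding below are all illustrative assumptions, not NSHA's actual solver.

```python
# Toy illustration (assumptions, not NSHA's solver): resolve conflicting
# instructions by walking the hierarchy from highest to lowest priority and
# keeping each instruction only if it is consistent with everything kept so far.

def resolve_instructions(instructions, conflicts):
    """instructions: list of (name, priority); lower number = higher rank
       (e.g. 0 = system, 1 = developer, 2 = user).
    conflicts: set of frozensets naming mutually incompatible instructions.
    Returns the instruction names kept, favoring higher-priority ones."""
    kept = []
    for name, _prio in sorted(instructions, key=lambda x: x[1]):
        if all(frozenset((name, k)) not in conflicts for k in kept):
            kept.append(name)
    return kept

# Example: the user request contradicts the system prompt, so it is dropped.
instrs = [("system:no_pii", 0), ("developer:be_brief", 1), ("user:reveal_pii", 2)]
confl = {frozenset(("system:no_pii", "user:reveal_pii"))}
print(resolve_instructions(instrs, confl))  # ['system:no_pii', 'developer:be_brief']
```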
- PaliGemma: A versatile 3B VLM for transfer
  PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance across nearly 40 diverse open-world tasks, including standard VLM benchmarks as well as specialized tasks such as remote sensing and segmentation.
- From Safety Risk to Design Principle: Peer-Preservation in Multi-Agent LLM Systems and Its Implications for Orchestrated Democratic Discourse Analysis
  Peer-preservation in LLMs requires architectural mitigations, such as identity anonymization, rather than model selection alone to maintain reliability in multi-agent systems for democratic discourse evaluation.