Felkner, Ho-Chun Herbert Chang, Eugene Jang, and Jonathan May

Association for Computational Linguistics · 2023 · DOI 10.18653/v1/2023.acl-long.507

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

cs.CL · 2026-06-17 · unverdicted · novelty 6.0

LLMs exhibit misfired alignment on stereotype questions at 4.7-18.9% rates on the new VETO benchmark of 2,032 contrastive pairs, unlike humans at 0%, due to overgeneralized safety cues after instruction tuning.

It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO

cs.CL · 2026-06-09 · unverdicted · novelty 5.0

One-shot GRPO on a single biased example induces generalizing stereotype bias in post-trained LLMs, with susceptibility varying by initial bias likelihood.

AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions

cs.AI · 2024-08-23 · unverdicted · novelty 4.0

The paper introduces a taxonomy of AI safety for LLMs organized into Trustworthy AI, Responsible AI, and Safe AI perspectives, accompanied by a review of state-of-the-art methods, challenges, and future directions.

citing papers explorer

The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs
cs.CL · 2026-06-17 · unverdicted · none · ref 29

It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO
cs.CL · 2026-06-09 · unverdicted · none · ref 7

AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions
cs.AI · 2024-08-23 · unverdicted · none · ref 206
