Is Attention Interpretable?
6 Pith papers cite this work.
Citing papers by year: 2026 (6). Citation roles: background (1). Citation polarities: background (1).
Citing papers
- CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges
  Presents a new expert-curated dataset of multi-turn counterspeech dialogues in five languages targeting hate against seven groups, with span annotations linking to verified external knowledge for RAG applications.
- Logit-Contribution Scoring Identifies Non-Literal Retrieval Heads
  LOCOS scores attention heads by projecting each head's OV-circuit output onto the unembedding directions of the answer tokens. It identifies non-literal retrieval heads whose ablation collapses performance on non-literal benchmarks more than ablating the heads flagged by prior literal-copy detectors.
- Don't Go Breaking My LLM: The Impact of Pruning Attention Layers on Explanation Faithfulness and Confidence Calibration
  Pruning attention layers in five LLMs across eight datasets maintains accuracy but degrades explanation faithfulness and confidence calibration.
- Beyond Topical Similarity: Contrastive Evidence Retrieval with Interpretable Attention Alignment in RAG
  CERA fine-tunes a dense retriever with triplet contrastive learning plus attention alignment to human rationales, claiming better retrieval effectiveness and faithfulness on clinical trial reports than Contriever and standard hard-negative baselines.
- Assisted Counterspeech Writing at the Crossroads of Hate Speech and Misinformation
  LLMs generate adequate counterspeech for co-occurring hate and misinformation in 40% of cases; a strategy that mixes knowledge from fact-checkers and NGOs proves most effective after expert revision.
- Mitigating Hallucinations in Large Vision-Language Models via Causal Route Gating
  A causal route gating intervention decomposes attention heads and suppresses text-dominant routes, using estimates that require only one forward pass and one gradient computation, to reduce unsupported content generation in LVLMs.