Title resolution pending

Ouyang, L · 2022

5 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph, and the title resolver will retry DOI and OpenAlex lookups on its next pass.

representative citing papers

Understanding New-Knowledge-Induced Factual Hallucinations in LLMs: Analysis and Interpretation

cs.CL · 2025-11-04 · unverdicted · novelty 6.0

Fine-tuning on new knowledge induces propagating hallucinations in LLMs by weakening attention to key entities, with mitigation via reintroducing known knowledge during later training stages.

Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal

cs.CL · 2025-09-07 · unverdicted · novelty 6.0

Sparse autoencoders plus greedy filtering and factorization-machine interaction modeling identify minimal sets of features in Gemma-2-2B-IT and LLaMA-3.1-8B-IT whose ablation produces jailbreaks by flipping refusal to compliance.

Uncovering the Internet's Hidden Values: An Empirical Study of Desirable Behavior Using Highly-Upvoted Content on Reddit

cs.HC · 2024-10-16 · unverdicted · novelty 6.0

LLM analysis of highly-upvoted Reddit comments yields 64-72 macro/meso/micro values per year; existing prosocial measures capture only 18% on average while the method also recovers and extends prior qualitative taxonomies.

LifeAlign: Lifelong Alignment for Large Language Models with Memory-Augmented Focalized Preference Optimization

cs.CL · 2025-09-21 · unverdicted · novelty 5.0

LifeAlign uses focalized preference optimization and short-to-long memory consolidation via dimensionality reduction to let LLMs align with new preferences while retaining prior knowledge.

The Ratchet Effect in Silico: How Interaction Drives Cumulative Intelligence in Large Language Models

cs.LG · 2025-07-25

citing papers explorer

Understanding New-Knowledge-Induced Factual Hallucinations in LLMs: Analysis and Interpretation · cs.CL · 2025-11-04 · unverdicted · none · ref 17
Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal · cs.CL · 2025-09-07 · unverdicted · none · ref 23
Uncovering the Internet's Hidden Values: An Empirical Study of Desirable Behavior Using Highly-Upvoted Content on Reddit · cs.HC · 2024-10-16 · unverdicted · none · ref 42
LifeAlign: Lifelong Alignment for Large Language Models with Memory-Augmented Focalized Preference Optimization · cs.CL · 2025-09-21 · unverdicted · none · ref 20
The Ratchet Effect in Silico: How Interaction Drives Cumulative Intelligence in Large Language Models · cs.LG · 2025-07-25 · unreviewed · ref 36