Smith, and Yejin Choi
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 2
citation-polarity summary: background 2
citing papers explorer
- Compositional Generalization in Autoregressive Models via Logit Composition
  Logit composition of autoregressive models is projective under factorized conditionals, preserved under smooth reparameterizations, and maintains length generalization when assumptions hold uniformly.
- Reading the Finetuning Prior: Verbatim Content Recovery via Contrastive Decoding Diffing
  Contrastive Decoding Diffing recovers exact implanted facts from finetuned LLMs via logit-space differences between finetuned and base models, outperforming white-box baselines with less access.
- Sampling from Your Language Model One Byte at a Time
  An inference-time technique turns BPE-based LMs into byte- or character-level models, solving the prompt boundary problem while unifying vocabularies across different tokenizers.
- ExecTune: Effective Steering of Black-Box LLMs with Guide Models
  ExecTune trains guide models via acceptance sampling, supervised fine-tuning, and structure-aware RL to boost executability of strategies for black-box LLMs, yielding up to 9.2% higher accuracy and 22.4% lower cost on math and code tasks.
- Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits
  Toxic prompt perturbations reduce LLM factual accuracy on three benchmarks and selectively amplify perturbation-sensitive nodes in attribution graphs.
- Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models
  Toxicity in language models is disproportionately encoded in early MLP layers and can be localized via activation differentials then suppressed at inference time without gradient descent.
- A Comparative Study of Controlled Text Generation Systems Using Level-Playing-Field Evaluation Principles
  Re-evaluating controlled text generation systems under standardized conditions reveals that many published performance claims do not hold, highlighting the need for consistent evaluation practices.
- Bye Bye Perspective API: Lessons for Measurement Infrastructure in NLP, CSS and LLM Evaluation
  Closure of the Perspective API exposes structural dependence on a single proprietary toxicity scorer, leaving non-updatable benchmarks and irreproducible results while risking continued reliance on closed LLMs.
- The Safety-Aware Denoiser for Text Diffusion Models