arXiv preprint arXiv:2009.06367

· 2009 · arXiv 2009.06367

7 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Decision Transformer: Reinforcement Learning via Sequence Modeling

cs.LG · 2021-06-02 · accept · novelty 8.0

Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.

Inference Time Causal Probing in LLMs

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.

Prefix-Tuning: Optimizing Continuous Prompts for Generation

cs.CL · 2021-01-01 · conditional · novelty 7.0

Prefix-tuning matches or exceeds fine-tuning on NLG tasks by optimizing a continuous prefix using 0.1% of parameters while keeping the LM frozen.

On The Effectiveness-Fluency Trade-Off In LLM Conditioning: A Systematic Study

cs.CL · 2026-06-10 · unverdicted · novelty 6.0

Systematic experiments reveal that activation steering trades fluency for concept control, is less effective on instruction-tuned models, and that prompting/SFT excel at injection but not removal, with textual metrics correlating to LLM judges.

Conditional Attribute Estimation with Autoregressive Sequence Models

cs.AI · 2026-05-13 · unverdicted · novelty 6.0

Conditional Attribute Transformers jointly estimate next-token probabilities and conditional attribute values for autoregressive sequence models, enabling credit assignment, counterfactuals, and steerable generation in one pass.

Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem

cs.CR · 2025-06-17 · unverdicted · novelty 6.0

Formalizes the jailbreak oracle problem for LLMs and introduces Boa, a two-phase breadth-first then depth-first search system to solve it efficiently.

Aligning AI With Shared Human Values

cs.CY · 2020-08-05 · conditional · novelty 6.0

Introduces the ETHICS benchmark, showing that current language models have a promising but incomplete ability to predict basic human ethical judgments on text scenarios.

citing papers explorer

Showing 7 of 7 citing papers.

Decision Transformer: Reinforcement Learning via Sequence Modeling · cs.LG · 2021-06-02 · accept · none · ref 65
Inference Time Causal Probing in LLMs · cs.AI · 2026-05-08 · unverdicted · none · ref 10
Prefix-Tuning: Optimizing Continuous Prompts for Generation · cs.CL · 2021-01-01 · conditional · none · ref 11
On The Effectiveness-Fluency Trade-Off In LLM Conditioning: A Systematic Study · cs.CL · 2026-06-10 · unverdicted · none · ref 64
Conditional Attribute Estimation with Autoregressive Sequence Models · cs.AI · 2026-05-13 · unverdicted · none · ref 16
Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem · cs.CR · 2025-06-17 · unverdicted · none · ref 32
Aligning AI With Shared Human Values · cs.CY · 2020-08-05 · conditional · none · ref 18
