arXiv preprint arXiv:2009.06367

7 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Inference Time Causal Probing in LLMs

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.

Conditional Attribute Estimation with Autoregressive Sequence Models

cs.AI · 2026-05-13 · unverdicted · novelty 6.0

Conditional Attribute Transformers jointly estimate next-token probabilities and conditional attribute values for autoregressive sequence models, enabling credit assignment, counterfactuals, and steerable generation in one pass.

Aligning AI With Shared Human Values

cs.CY · 2020-08-05 · conditional · novelty 6.0

Introduces the ETHICS benchmark, showing that current language models have a promising but incomplete ability to predict basic human ethical judgments on text scenarios.

citing papers explorer

  • Decision Transformer: Reinforcement Learning via Sequence Modeling cs.LG · 2021-06-02 · accept · none · ref 65

    Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.

  • Inference Time Causal Probing in LLMs cs.AI · 2026-05-08 · unverdicted · none · ref 10

    HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.

  • Prefix-Tuning: Optimizing Continuous Prompts for Generation cs.CL · 2021-01-01 · conditional · none · ref 11

    Prefix-tuning matches or exceeds fine-tuning on NLG tasks by optimizing a continuous prefix using 0.1% of parameters while keeping the LM frozen.

  • On The Effectiveness-Fluency Trade-Off In LLM Conditioning: A Systematic Study cs.CL · 2026-06-10 · unverdicted · none · ref 64

    Systematic experiments reveal that activation steering trades fluency for concept control, is less effective on instruction-tuned models, and that prompting/SFT excel at injection but not removal, with textual metrics correlating to LLM judges.

  • Conditional Attribute Estimation with Autoregressive Sequence Models cs.AI · 2026-05-13 · unverdicted · none · ref 16

    Conditional Attribute Transformers jointly estimate next-token probabilities and conditional attribute values for autoregressive sequence models, enabling credit assignment, counterfactuals, and steerable generation in one pass.

  • Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem cs.CR · 2025-06-17 · unverdicted · none · ref 32

    Formalizes the jailbreak oracle problem for LLMs and introduces Boa, a two-phase breadth-first then depth-first search system to solve it efficiently.

  • Aligning AI With Shared Human Values cs.CY · 2020-08-05 · conditional · none · ref 18

    Introduces the ETHICS benchmark, showing that current language models have a promising but incomplete ability to predict basic human ethical judgments on text scenarios.