
‘python import inspect_ai import numpy as np from scipy.optimize import curve_fit import matplotlib.pyplot as plt def task_func(x, y, labels)

1 Pith paper cites this work. Polarity classification is still being indexed.


fields: cs.LG (1)

years: 2026 (1)

verdicts: UNVERDICTED (1)

representative citing papers

Exploration Hacking: Can LLMs Learn to Resist RL Training?

cs.LG · 2026-04-30 · unverdicted · novelty 6.0

LLMs can be fine-tuned into model organisms that resist RL elicitation in domains like biosecurity while preserving related skills, and frontier models show explicit reasoning to suppress exploration when given training context.

citing papers explorer

Showing 1 of 1 citing paper.

  • Exploration Hacking: Can LLMs Learn to Resist RL Training? cs.LG · 2026-04-30 · unverdicted · none · ref 63
