Simple statistical gradient-following algorithms for connectionist reinforcement learning

Ronald J Williams · 1992

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

browse 3 citing papers

representative citing papers

LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

cs.CV · 2026-04-16 · unverdicted · novelty 7.0

LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.

Evaluation-driven Scaling for Scientific Discovery

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdos constructions.

What Is Preference Optimization Doing, and Why?

cs.LG · 2025-11-30 · unverdicted · novelty 5.0

Gradient analysis and ablations show DPO and PPO have different target directions and component roles in preference optimization for LLMs.

citing papers explorer

Showing 3 of 3 citing papers.

LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories cs.CV · 2026-04-16 · unverdicted · none · ref 49
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
Evaluation-driven Scaling for Scientific Discovery cs.LG · 2026-04-21 · unverdicted · none · ref 156
SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdos constructions.
What Is Preference Optimization Doing, and Why? cs.LG · 2025-11-30 · unverdicted · none · ref 24
Gradient analysis and ablations show DPO and PPO have different target directions and component roles in preference optimization for LLMs.

Simple statistical gradient-following algorithms for connectionist reinforcement learning

fields

years

verdicts

representative citing papers

citing papers explorer