pith. sign in

hub

Learning by distilling context

23 Pith papers cite this work. Polarity classification is still indexing.

23 Pith papers citing it

hub tools

citation-role summary

background 3

citation-polarity summary

roles

background 3

polarities

background 3

clear filters

representative citing papers

Near-Future Policy Optimization

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating convergence.

Context Memorization for Efficient Long Context Generation

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

Attention-state memory externalizes long prefixes into a lightweight lookup table of precomputed attention states, yielding higher accuracy than standard in-context learning at fixed memory budgets and lower latency than full attention.

Self-Supervised On-Policy Distillation for Reasoning Language Models

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.

Large Language Models Can Self-Improve

cs.CL · 2022-10-20 · unverdicted · novelty 6.0

A 540B-parameter LLM improves reasoning performance on GSM8K, DROP, OpenBookQA, and ANLI-A3 by fine-tuning on self-generated high-confidence CoT solutions from unlabeled data.

citing papers explorer

Showing 3 of 3 citing papers after filters.

  • From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation cs.LG · 2026-05-12 · conditional · none · ref 33

    Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.

  • Near-Future Policy Optimization cs.LG · 2026-04-22 · unverdicted · none · ref 22

    NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating convergence.

  • Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning cs.CL · 2026-04-09 · accept · none · ref 58

    LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.