SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Pith reviewed 2026-05-12 21:25 UTC · model grok-4.3
The pith
Reinforcement learning enables foundation models to generalize to unseen variants while supervised fine-tuning leads to memorization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reinforcement learning, particularly with outcome-based rewards, generalizes across both rule-based textual and visual variants in tasks like GeneralPoints and V-IRL. In contrast, supervised fine-tuning tends to memorize training data and struggles with out-of-distribution scenarios. Reinforcement learning enhances underlying visual recognition capabilities, but supervised fine-tuning is essential to stabilize output formats before reinforcement learning can achieve its gains.
What carries the argument
Outcome-based reward signals in reinforcement learning applied after supervised fine-tuning, tested on the GeneralPoints card game for arithmetic reasoning and V-IRL for real-world navigation.
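As a minimal sketch of what "outcome-based" means here, assuming the standard GeneralPoints setup in which an expression built from the four dealt cards must evaluate to a fixed target (24 in the base game), the reward checks only the final result, never the reasoning steps. The reward values and helper below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of an outcome-based reward for a GeneralPoints-style episode.
# Target value, reward magnitudes, and helper names are illustrative assumptions,
# not the paper's exact implementation.
import ast


def outcome_reward(card_values: list[int], expression: str, target: int = 24) -> float:
    """Score a model-produced arithmetic expression purely by its final outcome."""
    try:
        tree = ast.parse(expression, mode="eval")
    except SyntaxError:
        return -1.0  # unparseable output: format penalty

    # Collect the numeric literals the expression uses.
    used = sorted(
        node.value for node in ast.walk(tree)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float))
    )
    if used != sorted(card_values):
        return -1.0  # each dealt card must be used exactly once

    try:
        value = eval(compile(tree, "<expr>", "eval"), {"__builtins__": {}})
    except (ZeroDivisionError, ValueError, TypeError):
        return -1.0

    # Only the final result matters; intermediate reasoning is never inspected.
    return 1.0 if abs(value - target) < 1e-6 else 0.0


# Example: the hand [3, 3, 8, 8] admits the classic solution 8 / (3 - 8 / 3).
print(outcome_reward([3, 3, 8, 8], "8 / (3 - 8 / 3)"))  # 1.0
```

Because only the outcome is scored, any derivation that reaches the target earns full reward, which is the property the review contrasts with SFT's token-level imitation of demonstrations.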
If this is right
- RL-trained models perform well on unseen rule variants in text and new visual conditions without additional training.
- SFT alone results in poor generalization to new scenarios in multi-modal tasks.
- A two-stage process of SFT followed by RL combines format stability with generalization ability (a toy sketch of the two stages' objectives appears after this list).
- RL training improves core perceptual skills such as visual recognition that support broader task performance.
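To make the two-stage pipeline concrete, the sketch below fits a toy softmax "policy" to a demonstrated answer with cross-entropy (the SFT stage) and then updates it with REINFORCE on a binary outcome reward (the RL stage). The toy policy, learning rates, and reward are all illustrative assumptions; the paper applies the same ordering to full vision-language models, and this toy does not model the memorization-versus-generalization contrast itself.

```python
# Toy illustration of the two-stage pipeline on a single prompt with K candidate answers.
import numpy as np

rng = np.random.default_rng(0)
K = 8                        # number of candidate answers
logits = np.zeros(K)         # "model" parameters
demo_answer = 3              # answer shown in SFT demonstrations
verified_answer = 3          # answer the outcome reward checks

def probs(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Stage 1: SFT = gradient ascent on log p(demo_answer), i.e. cross-entropy imitation.
for _ in range(100):
    p = probs(logits)
    grad = -p
    grad[demo_answer] += 1.0           # d/dz log p[demo_answer] = onehot - p
    logits += 0.1 * grad               # output distribution stabilizes on the demo format

# Stage 2: RL = REINFORCE with a binary outcome reward on sampled answers.
for _ in range(300):
    p = probs(logits)
    a = rng.choice(K, p=p)
    reward = 1.0 if a == verified_answer else 0.0
    grad = -p
    grad[a] += 1.0                     # d/dz log p[a]
    logits += 0.1 * reward * grad      # only verified outcomes are reinforced

print(np.round(probs(logits), 3))      # probability mass ends up on the verified answer
```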
Where Pith is reading between the lines
- These results point to the value of incorporating outcome-based RL in post-training pipelines for better adaptability in changing environments.
- The pattern may extend to other foundation model applications like language understanding or decision making where rules can vary.
- Developers could experiment with different reward designs to see if they further enhance generalization beyond the tested games and navigation tasks.
Load-bearing premise
The performance gaps observed between SFT and RL on GeneralPoints and V-IRL truly indicate differences in generalization ability rather than being specific to the design of those two test environments.
What would settle it
Training new models with RL on GeneralPoints or V-IRL and then testing them on a completely different task with novel rules and visuals, finding no generalization advantage over SFT models, would falsify the central claim.
Original abstract
Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrates the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that RL (especially outcome-based) generalizes better than SFT to out-of-distribution rule-based textual and visual variants in the newly introduced GeneralPoints arithmetic card game and the adopted V-IRL navigation environment. SFT is said to memorize training data and fail on OOD cases, while RL also improves underlying visual recognition; SFT is nevertheless required to stabilize output formats before effective RL training.
Significance. If the central empirical distinction holds after addressing setup details, the work would usefully separate the memorization/generalization roles of SFT versus RL in multi-modal post-training and demonstrate that outcome rewards can yield rule-like behavior across modalities. The introduction of GeneralPoints and controlled variants of V-IRL supplies concrete testbeds that future studies could reuse.
major comments (3)
- [Experiments / Results sections] The manuscript provides no details on the number of training runs, random seeds, statistical significance tests, or error bars for the reported performance gaps between SFT and RL on GeneralPoints and V-IRL variants. Without these, it is impossible to judge whether the observed generalization advantage is robust or could be explained by variance or implementation differences.
- [GeneralPoints and V-IRL environment descriptions] The claim that performance differences reflect acquisition of generalizable rules rather than RL exploiting the outcome reward's tolerance for exploration (while SFT overfits surface patterns) rests on the untested assumption that the textual and visual variants in GeneralPoints and V-IRL alter only the intended rule or visual feature. No ablations are described that hold reward structure, visual complexity, or required computation constant while changing only the generalization target (see the illustrative sketch after this list for what such a control could look like).
- [Analysis of visual recognition] The additional assertion that RL improves underlying visual recognition in V-IRL is not separated from policy optimization effects; the paper does not report auxiliary recognition metrics or controlled probes that would show gains independent of the navigation policy.
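To illustrate the kind of control the second comment asks for, the following minimal sketch assumes, as one illustrative choice rather than the paper's exact specification, that a GeneralPoints rule variant changes only how face cards are valued while the deck, prompt format, and outcome reward stay fixed.

```python
# Illustrative sketch of a controlled rule variant: only the face-card valuation changes
# between training and evaluation; the deck, prompt template, and outcome reward are held
# fixed. Rule names and values here are assumptions for illustration.
FACE_VALUES = {
    "train_rule": {"J": 10, "Q": 10, "K": 10},   # e.g. all face cards count as 10
    "eval_rule":  {"J": 11, "Q": 12, "K": 13},   # unseen variant: distinct face values
}

def card_value(rank: str, rule: str) -> int:
    """Map a card rank to its numeric value under the given rule variant."""
    if rank in FACE_VALUES[rule]:
        return FACE_VALUES[rule][rank]
    return 1 if rank == "A" else int(rank)

hand = ["A", "5", "J", "Q"]
for rule in FACE_VALUES:
    values = [card_value(r, rule) for r in hand]
    # The reward (expression must hit the target using these values) is identical
    # across variants; only the rank-to-value mapping differs.
    print(rule, values)
```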
minor comments (2)
- [Abstract] Abstract contains a subject-verb agreement error: 'These findings demonstrates' should read 'These findings demonstrate'.
- [Figures and tables] All result tables and figures should include error bars, exact sample sizes, and explicit comparison of SFT-only, RL-only, and SFT-then-RL pipelines.
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our work. We provide detailed responses to each major comment below and outline the revisions we will make to strengthen the manuscript.
Point-by-point responses
-
Referee: [Experiments / Results sections] The manuscript provides no details on the number of training runs, random seeds, statistical significance tests, or error bars for the reported performance gaps between SFT and RL on GeneralPoints and V-IRL variants. Without these, it is impossible to judge whether the observed generalization advantage is robust or could be explained by variance or implementation differences.
Authors: We agree with this observation. The original manuscript omitted these details due to space constraints, but we recognize their importance. In the revised version, we will report the number of independent training runs (conducted with different random seeds), include error bars (standard deviations) on all performance figures, and add statistical significance tests (such as paired t-tests) to confirm that the generalization gaps between SFT and RL are significant. revision: yes
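As an illustration of such a seed-paired test, the sketch below compares hypothetical per-seed out-of-distribution success rates with SciPy's paired t-test; the numbers are placeholders, not results from the paper.

```python
# Hypothetical per-seed OOD success rates for SFT and RL runs (placeholder numbers,
# not results from the paper), compared with a paired t-test across seeds.
import numpy as np
from scipy import stats

sft_ood = np.array([0.12, 0.15, 0.11, 0.14, 0.13])  # one value per random seed
rl_ood  = np.array([0.41, 0.38, 0.44, 0.40, 0.43])  # same seeds, RL pipeline

t_stat, p_value = stats.ttest_rel(rl_ood, sft_ood)   # paired across seeds
print(f"mean gap = {np.mean(rl_ood - sft_ood):.3f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print(f"std across seeds: SFT {sft_ood.std(ddof=1):.3f}, RL {rl_ood.std(ddof=1):.3f}")
```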
-
Referee: [GeneralPoints and V-IRL environment descriptions] The claim that performance differences reflect acquisition of generalizable rules rather than RL exploiting the outcome reward's tolerance for exploration (while SFT overfits surface patterns) rests on the untested assumption that the textual and visual variants in GeneralPoints and V-IRL alter only the intended rule or visual feature. No ablations are described that hold reward structure, visual complexity, or required computation constant while changing only the generalization target.
Authors: The variants were designed with the explicit goal of isolating changes to the rule or visual feature. For GeneralPoints, textual variants modify the arithmetic operations while keeping card values, game mechanics, and the outcome-based reward identical. The V-IRL visual variants follow the same design principle. We will revise the environment sections to provide more explicit descriptions of these controls and how they hold other factors constant. While we did not perform additional ablations beyond the main experiments, the controlled construction supports our claims; we can add a paragraph discussing this design choice. revision: partial
-
Referee: [Analysis of visual recognition] The additional assertion that RL improves underlying visual recognition in V-IRL is not separated from policy optimization effects; the paper does not report auxiliary recognition metrics or controlled probes that would show gains independent of the navigation policy.
Authors: Our analysis in the manuscript includes probes on visual element recognition in V-IRL scenes after training. To address the separation from policy optimization, we will add explicit auxiliary metrics, such as accuracy on a visual recognition task using frozen policy models or separate evaluation on image classification of navigation-relevant objects. This will be included in the revised analysis section to demonstrate gains independent of the full navigation policy. revision: yes
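A minimal sketch of such a frozen-feature probe protocol follows, using random placeholder arrays in place of real frozen-model embeddings, purely to show the structure of the evaluation.

```python
# Sketch of a linear-probe protocol for isolating visual recognition from the policy:
# freeze each checkpoint, extract image features, and train only a linear classifier
# on landmark labels. Feature arrays below are random placeholders standing in for
# frozen-model embeddings; nothing here reproduces the paper's numbers.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def probe_accuracy(features: np.ndarray, labels: np.ndarray) -> float:
    """Held-out accuracy of a linear probe on frozen features (backbone is not updated)."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.25, random_state=0
    )
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return clf.score(x_te, y_te)

n, d, n_landmarks = 400, 64, 8
labels = rng.integers(0, n_landmarks, size=n)
feats_before_rl = rng.normal(size=(n, d))   # placeholder: embeddings from the SFT checkpoint
feats_after_rl  = rng.normal(size=(n, d))   # placeholder: embeddings from the RL checkpoint

# If RL genuinely improves recognition, the post-RL probe should score higher on the
# same held-out landmark labels (with real embeddings, not this noise).
print("probe acc before RL:", probe_accuracy(feats_before_rl, labels))
print("probe acc after  RL:", probe_accuracy(feats_after_rl, labels))
```

Because only the linear head is trained, any accuracy gap between checkpoints would reflect the frozen representation rather than the navigation policy.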
Circularity Check
No circularity: empirical results on new environments
Full rationale
The paper is an empirical study that introduces GeneralPoints and adopts V-IRL, then reports direct experimental comparisons of SFT versus RL on in-distribution and out-of-distribution textual and visual variants. No mathematical derivation, prediction, or first-principles claim is present that reduces by construction to fitted parameters, self-definitions, or self-citations. Generalization claims rest on measured performance gaps rather than any re-labeling of inputs as outputs. The evaluation is self-contained, relying on explicit task variants and metrics rather than external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Performance differences on held-out variants of GeneralPoints and V-IRL indicate true generalization rather than benchmark-specific effects.
Forward citations
Cited by 30 Pith papers
-
MeMo: Memory as a Model
MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...
-
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
-
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
-
TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preservi...
-
Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.
-
Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT
ConSFT prevents catastrophic forgetting in fine-tuning flow-matching VLAs by dynamically scaling gradients based on model confidence, retaining over 20% more pre-trained capability than standard SFT without prior data...
-
Fine-Tuning Small Reasoning Models for Quantum Field Theory
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
-
Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification
Rule-VLN is the first large-scale benchmark injecting 177 regulatory categories into an urban environment, and the proposed SNRM module equips pre-trained VLN agents with zero-shot semantic reasoning and detour planni...
-
S-GRPO: Unified Post-Training for Large Vision-Language Models
S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.
-
LASER: A Data-Centric Method for Low-Cost and Efficient SQL Rewriting based on SQL-GRPO
LASER generates complex slow-query training data with MCTS and aligns small models via SQL-GRPO to deliver efficient, low-cost SQL rewriting that outperforms rules and large models.
-
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...
-
Video-R1: Reinforcing Video Reasoning in MLLMs
Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
-
Teacher-Guided Policy Optimization for LLM Distillation
TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.
-
Video-ToC: Video Tree-of-Cue Reasoning
Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.
-
AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems
AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.
-
Generalization in LLM Problem Solving: The Case of the Shortest Path
LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.
-
SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization
Disjoint SFT and GRPO data for autoformalization yields up to 10.4pp semantic accuracy gains over full overlap, which renders the GRPO stage redundant.
-
ReRec: Reasoning-Augmented LLM-based Recommendation Assistant via Reinforcement Fine-tuning
ReRec uses reinforcement fine-tuning with dual-graph reward shaping, reasoning-aware advantage estimation, and online curriculum scheduling to improve LLM reasoning and performance in recommendation tasks.
-
Watch Before You Answer: Learning from Visually Grounded Post-Training
Filtering post-training data to visually grounded questions improves VLM video understanding performance by up to 6.2 points using 69% of the data.
-
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.
-
ToolRL: Reward is All Tool Learning Needs
A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.
-
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
-
On-Policy Distillation with Best-of-N Teacher Rollout Selection
BRTS improves on-policy distillation by selecting the highest-quality teacher trajectory from a small pool of samples based on correctness and alignment with the student, yielding gains on AIME and AMC math benchmarks.
-
On-Policy Distillation with Best-of-N Teacher Rollout Selection
BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.
-
Cheap Expertise: Mapping and Challenging Industry Perspectives in the Expert Data Gig Economy
AI data firms view human expertise as an extractable, low-cost resource to feed AI systems while treating institutional expertise as something needing liberation or reform to fit this model.
-
On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length
Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.
-
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
-
Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models
Newer LLM backbones in VLMs do not always improve performance; gains are task-dependent, with VQA models solving different questions due to better confidence calibration and stable representations.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
-
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.