pith. machine review for the scientific record.

arxiv: 2605.11549 · v1 · submitted 2026-05-12 · 💻 cs.HC

Recognition: no theorem link

UNIPO: Unified Interactive Visual Explanation for RL Fine-Tuning Policy Optimization

Aeree Cho, Alexander D. Greenhalgh, Anthony Peng, Duen Horng (Polo) Chau, Jonathan Bodea

Pith reviewed 2026-05-13 01:35 UTC · model grok-4.3

classification 💻 cs.HC
keywords: visualization tool · reinforcement learning · policy optimization · fine-tuning · language models · token-level dynamics · interactive comparison

The pith

UNIPO supplies the first interactive visualization that unifies token-level training dynamics across RL fine-tuning algorithms for language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning algorithms for fine-tuning large language models differ in clipping, advantage estimation, and reward aggregation, yet these details appear in separate papers with inconsistent notation. UNIPO supplies a single interactive interface with three linked views so users can watch how each design choice shapes training step by step at the token level. The views include an overview of the full training run, a detailed inspector for individual prompts and responses, and a side-by-side comparison of multiple algorithms. This setup is meant to help both learners in classrooms and practitioners selecting an algorithm for production use. The paper demonstrates the approach through two usage scenarios that illustrate these educational and practical benefits.

Core claim

UNIPO is presented as the first interactive visualization tool that exposes the token-level training dynamics of RL fine-tuning algorithms through a unified design. It integrates a high-level training overview, a step-level prompt and response inspector, and a side-by-side algorithm comparison to let users trace how modular differences in clipping, advantage estimation, and reward aggregation propagate through training.
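
The clipping slot is the easiest of these to make concrete. Below is a minimal sketch, assuming a generic PPO-style clipped surrogate applied per token as the shared skeleton of this algorithm family; the function name, the epsilon values, and the asymmetric clip-higher option are illustrative assumptions, not UNIPO's implementation.

    import numpy as np

    def clipped_token_objective(ratio, advantage, eps_low=0.2, eps_high=0.2):
        """Per-token clipped surrogate (to be maximized):
        min(ratio * A, clip(ratio, 1 - eps_low, 1 + eps_high) * A).
        eps_low == eps_high recovers the symmetric PPO/GRPO clip; letting
        eps_high exceed eps_low is the asymmetric clip-higher knob exposed
        by DAPO-style variants."""
        ratio = np.asarray(ratio, dtype=float)
        unclipped = ratio * advantage
        clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantage
        return np.minimum(unclipped, clipped)

    # Three tokens of one response whose group-relative advantage is +0.8:
    print(clipped_token_objective([0.7, 1.0, 1.5], advantage=0.8))
    # -> [0.56 0.8  0.96]  (the 1.5 ratio is clipped at 1.2)

Swapping in a different advantage estimator or aggregation rule leaves this per-token core untouched, which is exactly the modularity the three views are meant to surface.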

What carries the argument

The unified design that connects three complementary interactive views exposing token-level training dynamics.

Load-bearing premise

That inconsistent notation across papers creates a meaningful barrier and that an interactive visualization will help non-experts and practitioners compare and select algorithms more effectively.

What would settle it

A controlled study comparing participants who use UNIPO against participants given only the original papers, measuring how accurately and quickly each group identifies differences in clipping or advantage estimation.

Figures

Figures reproduced from arXiv: 2605.11549 by Aeree Cho, Alexander D. Greenhalgh, Anthony Peng, Duen Horng (Polo) Chau, Jonathan Bodea.

Figure 1. UNIPO unifies the visual explanation of policy optimization algorithms for RL fine-tuning through three coordinated views.
Figure 2. Selecting a token in the (A) Step Inspector opens its computation in the (B) Algorithm Explainer. Here, “17” receives a token-level objective of 0.000 even though the reward is 1.00. Every response in the group is correct, so the Advantage collapses to 0.000. This reveals that GRPO reinforces responses relative to the group, not by correctness alone.
Figure 3. Algorithm Explainer’s Comparison mode renders two algorithms side-by-side with color-coded differences, revealing evolutionary relationships across policy optimization methods. (A) GRPO vs. DAPO surfaces DAPO’s added Dynamic Sampling constraint, with tooltip explaining it in plain language for non-experts. (B) DAPO vs. Dr. GRPO contrasts aggregation strategies, annotating DAPO’s cross-group length bias a…
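
Figure 2's collapse to a 0.000 advantage follows directly from group-relative standardization. The sketch below illustrates this under the GRPO-style estimator that standardizes each reward against its own sampling group; the epsilon stabilizer, and whether the standard deviation is divided out at all (Dr. GRPO drops it), vary by variant and are assumptions here rather than UNIPO's exact computation.

    import numpy as np

    def group_relative_advantage(rewards, eps=1e-6):
        """Standardize each response's reward against the group sampled for
        the same prompt. If every response earns the same reward (e.g. all
        correct, as for the token "17" in Figure 2), the numerator is zero
        and every advantage collapses to 0.0."""
        r = np.asarray(rewards, dtype=float)
        return (r - r.mean()) / (r.std() + eps)

    print(group_relative_advantage([1.0, 1.0, 1.0, 1.0]))  # -> [0. 0. 0. 0.]
    print(group_relative_advantage([1.0, 0.0, 1.0, 0.0]))  # mixed group: roughly ±1

Uniform groups therefore contribute no gradient, which is the behavior Figure 2 surfaces and the kind of degenerate case DAPO's Dynamic Sampling constraint (Figure 3A) is designed to filter out.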
read the original abstract

Reinforcement learning has emerged as a dominant technique for fine-tuning the behavior of large language models, with policy optimization (PO) algorithms such as GRPO, DAPO, and Dr. GRPO emerging in rapid succession to advance state-of-the-art reasoning and alignment performance. However, the modular differences between these algorithms, including targeted improvements to clipping, advantage estimation, and reward aggregation, are introduced across separate papers with inconsistent notation, making them difficult to compare and intimidating to the non-expert community. We present UNIPO, the first interactive visualization tool that exposes the token-level training dynamics of RL fine-tuning algorithms through a unified design. UNIPO connects three complementary views, a high-level training overview, a step-level prompt and response inspector, and a side-by-side algorithm comparison, allowing learners to observe how individual design decisions propagate through training. Through two usage scenarios, we demonstrate how UNIPO supports both classroom instruction for non-experts and algorithm selection for AI practitioners. Our tool is open-source and publicly available at https://poloclub.github.io/unipo.
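
The reward-aggregation differences named in the abstract, and contrasted in Figure 3(B), largely come down to how per-token losses are reduced to a single scalar. Below is a hedged sketch of two common reductions with toy numbers; which reduction a particular algorithm uses, and how that induces a length bias, should be checked against the respective papers rather than read off this illustration.

    import numpy as np

    def per_response_mean(token_losses):
        """Average tokens within each response, then across the group:
        every response counts equally regardless of its length."""
        return float(np.mean([np.mean(t) for t in token_losses]))

    def pooled_token_mean(token_losses):
        """Pool all tokens in the group and average once: longer responses
        contribute proportionally more tokens to the update."""
        flat = np.concatenate([np.asarray(t, dtype=float) for t in token_losses])
        return float(flat.mean())

    # Toy group: a short response (2 tokens) and a long one (8 tokens).
    group = [[1.0, 1.0], [0.0] * 8]
    print(per_response_mean(group))   # 0.5  (responses weighted equally)
    print(pooled_token_mean(group))   # 0.2  (the long response dominates)

The two reductions disagree whenever response lengths differ, which is why a side-by-side annotation of aggregation terms carries weight for practitioners choosing between variants.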

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces UNIPO, an open-source interactive visualization tool for unifying the explanation of token-level training dynamics in RL fine-tuning algorithms such as GRPO, DAPO, and Dr. GRPO. It integrates three complementary views—a high-level training overview, a step-level prompt and response inspector, and a side-by-side algorithm comparison—to allow observation of how design decisions propagate through training. The authors claim it is the first such tool and demonstrate its utility for classroom instruction to non-experts and algorithm selection by practitioners through two usage scenarios.

Significance. If the tool's unified views prove effective at exposing token-level dynamics and aiding comparison despite inconsistent notations across papers, it could meaningfully lower barriers for non-experts learning RL fine-tuning and support practitioners in evaluating algorithmic variants, contributing to education and informed development in LLM alignment.

major comments (2)
  1. [Abstract] The claim that UNIPO is 'the first interactive visualization tool that exposes the token-level training dynamics of RL fine-tuning algorithms through a unified design' is presented without any systematic comparison to prior visualization tools or literature on RL/LLM explanation interfaces, which is load-bearing for the novelty positioning.
  2. [Abstract, usage scenarios] The central claims that the tool 'supports both classroom instruction for non-experts and algorithm selection for AI practitioners' rest solely on two narrative usage scenarios with no accompanying user studies, learning metrics, task performance data, or controlled comparisons to baseline resources such as papers or other tools.
minor comments (1)
  1. The manuscript would benefit from explicit discussion of how the three views handle specific modular differences (e.g., clipping, advantage estimation) in a way that resolves notation inconsistencies, with concrete examples tied to the views.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating where we agree and the specific revisions we will make.

read point-by-point responses
  1. Referee: [Abstract] The claim that UNIPO is 'the first interactive visualization tool that exposes the token-level training dynamics of RL fine-tuning algorithms through a unified design' is presented without any systematic comparison to prior visualization tools or literature on RL/LLM explanation interfaces, which is load-bearing for the novelty positioning.

    Authors: We agree that the novelty positioning would be strengthened by a more systematic comparison. While the manuscript contains a Related Work section discussing prior RL visualization efforts, it does not include an exhaustive survey or explicit feature comparison. In the revised manuscript we will expand Related Work with a dedicated subsection surveying existing visualization tools for RL training dynamics and LLM explanation interfaces. We will add a comparison table contrasting UNIPO's unified token-level views, multi-algorithm support, and interactive design against prior single-algorithm or non-token-level tools, thereby providing the requested grounding for the 'first' claim. revision: yes

  2. Referee: [Abstract, usage scenarios] The central claims that the tool 'supports both classroom instruction for non-experts and algorithm selection for AI practitioners' rest solely on two narrative usage scenarios with no accompanying user studies, learning metrics, task performance data, or controlled comparisons to baseline resources such as papers or other tools.

    Authors: The referee is correct that the claims rest on narrative scenarios without formal user studies or quantitative metrics. For an initial tool-introduction paper, narrative scenarios are a conventional method to illustrate potential use cases. However, to avoid overstatement we will revise the abstract and introduction to present the scenarios explicitly as 'illustrative usage scenarios' rather than as demonstrated support. We will also add a Limitations section that acknowledges the lack of controlled user studies, learning metrics, or comparisons to baselines and outlines plans for such evaluations as future work. revision: partial

Circularity Check

0 steps flagged

No circularity: tool description paper with no derivations or fitted predictions

full rationale

The paper presents UNIPO, an interactive visualization tool for comparing RL fine-tuning algorithms like GRPO and DAPO. It contains no mathematical derivation chain, equations, predictions, or first-principles results. Claims about supporting instruction and algorithm selection rest on two narrative usage scenarios rather than any self-referential definitions, fitted parameters renamed as outputs, or load-bearing self-citations. The design is self-contained as a software artifact description with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software tool paper focused on visualization and user interface design rather than a theoretical or empirical derivation, so no free parameters, axioms, or invented entities are involved.

pith-pipeline@v0.9.0 · 5501 in / 1027 out tokens · 38013 ms · 2026-05-13T01:35:00.651426+00:00 · methodology


Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 7 internal anchors

  1. [1]

    Shadows can be

    E. Aflalo, M. Du, S.-Y . Tseng, Y . Liu, C. Wu, N. Duan, et al. VL- InterpreT: An Interactive Visualization Tool for Interpreting Vision- Language Transformers . In2022 IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pp. 21374–21383. IEEE Computer Society, Los Alamitos, CA, USA, June 2022. doi: 10.1109/CVPR52688.2022.020722

  2. [2]

    Y . Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, et al. Training a helpful and harmless assistant with reinforcement learn- ing from human feedback, 2022. URL:https://arxiv.org/abs/ 2204.05862. 1

  3. [3]

    L. Biewald. Experiment tracking with weights and biases, 2020. URL: https://www.wandb.com/. 1, 2, 4

  4. [4]

    A. Cho, G. C. Kim, A. Karpekov, S. Lee, A. Helbling, B. Hoover, et al. Transformer explainer: Learning llm transformers with interactive vi- sual explanation and experimentation. InProceedings of the 2026 CHI Conference on Human Factors in Computing Systems, CHI ’26. As- sociation for Computing Machinery, New York, NY , USA, 2026. doi: 10.1145/3772318.37917252

  5. [5]

    P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In I. Guyon, U. V . Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish- wanathan, et al., eds.,Advances in Neural Information Processing Sys- tems, vol. 30. Curran Associates, Inc., 2017. 1

  6. [6]

    G. M. Draper, Y . Livnat, and R. F. Riesenfeld. A survey of radial meth- ods for information visualization.IEEE Transactions on Visualization & Computer Graphics, 15(05):759–776, 2009. 3

  7. [7]

    Endert, W

    A. Endert, W. Ribarsky, C. Turkay, B. W. Wong, I. Nabney, I. D. Blanco, et al. The state of the art in integrating machine learning into visual analytics.Computer Graphics F orum, 36(8):458–486, Mar

  8. [8]

    doi:10.1111/cgf.130922

  9. [9]

    D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, et al. Deepseek- r1 incentivizes reasoning in llms through reinforcement learning.Na- ture, 645(8081):633–638, 2025. doi:10.1038/s41586-025-09422-z1, 2

  10. [10]

    Hohman, M

    F. Hohman, M. Kahng, R. Pienta, and D. H. Chau. Visual analytics in deep learning: An interrogative survey for the next frontiers.IEEE Transactions on Visualization and Computer Graphics, 2018. 2

  11. [11]

    Huang, R

    S. Huang, R. F. J. Dossa, A. Raffin, A. Kanervisto, and W. Wang. The 37 implementation details of proximal policy optimization. InICLR Blog Track, 2022. https://iclr-blog-track.github.io/2022/03/25/ppo- implementation-details/. 2

  12. [12]

    Karpathy

    A. Karpathy. Deep dive into LLMs like ChatGPT. YouTube video, Feb. 2025.https://www.youtube.com/watch?v=7xTGNNLPyMI, URL:https://www.youtube.com/watch?v=7xTGNNLPyMI. 1

  13. [13]

    Y . Kilcher. [GRPO Explained] DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. YouTube video, Jan. 2025.https://www.youtube.com/watch?v=bAWV_yrqx4w, URL:https://www.youtube.com/watch?v=bAWV_yrqx4w. 1

  14. [14]

    Lakatos, J

    I. Lakatos, J. Worrall, and E. Zahar, eds.Proofs and Refutations: The Logic of Mathematical Discovery. Cambridge University Press, Cambridge and London, 1976. 2

  15. [15]

    N. Lambert. Reinforcement learning from human feedback, 2026. URL:https://arxiv.org/abs/2504.12501. 1

  16. [16]

    Lambert, L

    N. Lambert, L. Castricato, L. von Werra, and A. Havrilla. Illustrating reinforcement learning from human feedback (rlhf).Hugging Face Blog, 2022. https://huggingface.co/blog/rlhf. 1

  17. [17]

    Lambert, J

    N. Lambert, J. Morrison, V . Pyatkin, S. Huang, H. Ivison, F. Brahman, et al. Tulu 3: Pushing frontiers in open language model post-training. InSecond Conference on Language Modeling, 2025. 2

  18. [18]

    Y . Lian. Comparative analysis and parametric tuning of ppo, grpo, and dapo for llm reasoning enhancement, 2025. URL:https://arxiv. org/abs/2512.07611. 1

  19. [19]

    Z. Liu, C. Chen, W. Li, P. Qi, T. Pang, C. Du, et al. Understanding r1-zero-like training: A critical perspective, 2025. URL:https:// arxiv.org/abs/2503.20783. 2

  20. [20]

    E. Lobo, C. Agarwal, and H. Lakkaraju. On the impact of fine-tuning on chain-of-thought reasoning, 2025. URL:https://arxiv.org/ abs/2411.15382. 1

  21. [21]

    Ouyang, J

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, et al. Training language models to follow instructions with human feedback. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, eds.,Advances in Neural Information Process- ing Systems, vol. 35, pp. 27730–27744. Curran Associates, Inc., 2022. 1, 2

  22. [22]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms, 2017. URL:https:// arxiv.org/abs/1707.06347. 2

  23. [23]

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, et al. Deepseek- math: Pushing the limits of mathematical reasoning in open language models, 2024. URL:https://arxiv.org/abs/2402.03300. 2

  24. [24]

    Shneiderman

    B. Shneiderman. The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations . InVisual Languages, IEEE Sympo- sium on, p. 336. IEEE Computer Society, Los Alamitos, CA, USA, Sept. 1996. doi:10.1109/VL.1996.5453072

  25. [25]

    Smilkov, S

    D. Smilkov, S. Carter, D. Sculley, F. B. Vi ´egas, and M. Wattenberg. Direct-manipulation visualization of deep networks, 2017. URL: https://arxiv.org/abs/1708.03788. 2

  26. [26]

    Spinner, U

    T. Spinner, U. Schlegel, H. Sch ¨afer, and M. El-Assady. explainer: A visual analytics framework for interactive and explainable machine learning.IEEE Transactions on Visualization and Computer Graph- ics, 26(1):1064–1074, 2020. doi:10.1109/TVCG.2019.29346292

  27. [27]

    Steinarsson.Downsampling time series for visual representation

    S. Steinarsson.Downsampling time series for visual representation. PhD thesis, 2013. 3

  28. [28]

    R. S. Sutton and A. G. Barto.Reinforcement learning - an introduc- tion, 2nd Edition. MIT Press, 2018. 1

  29. [29]

    J. Vig. A multiscale visualization of attention in the transformer model. In M. R. Costa-juss `a and E. Alfonseca, eds.,Proceedings of the 57th Annual Meeting of the Association for Computational Lin- guistics: System Demonstrations, pp. 37–42. Association for Compu- tational Linguistics, Florence, Italy, July 2019. doi:10.18653/v1/P19 -30072

  30. [30]

    Y . Wang, J. Zhao, C. Zhao, S. Guan, G. Penn, and S. Liu.λ-grpo: Unifying the grpo frameworks with learnable token preferences, 2025. URL:https://arxiv.org/abs/2510.06870. 2

  31. [31]

    Z. Wang, K. Ramnath, B. Bi, S. K. Pentyala, S. Chaudhuri, S. Mehro- tra, et al. Reinforcement learning for llm post-training: A survey,

  32. [32]

    URL:https://arxiv.org/abs/2407.16216. 2

  33. [33]

    Z. J. Wang, R. Turko, O. Shaikh, H. Park, N. Das, F. Hohman, et al. Cnn explainer: Learning convolutional neural networks with interac- tive visualization.IEEE Transactions on Visualization and Computer Graphics, 27(2):1396–1406, Feb. 2021. doi:10.1109/tvcg.2020.3030418 2

  34. [34]

    R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning.Mach. Learn., 8(3–4):229–256, May 1992. doi:10.1007/BF009926962

  35. [35]

    C. Yeh, Y . Chen, A. Wu, C. Chen, F. Vi´egas, and M. Wattenberg. At- tentionviz: A global view of transformer attention.IEEE Transactions on Visualization and Computer Graphics, 30(1):262–272, Jan. 2024. doi:10.1109/TVCG.2023.33271632

  36. [36]

    Q. Yu, Z. Zhang, R. Zhu, Y . Yuan, X. Zuo, Y . Yue, et al. Dapo: An open-source llm reinforcement learning system at scale, 2025. URL: https://arxiv.org/abs/2503.14476. 1, 2

  37. [37]

    M. A. Zaharia, A. Chen, A. Davidson, A. Ghodsi, S. A. Hong, A. Kon- winski, et al. Accelerating the machine learning lifecycle with mlflow. IEEE Data Eng. Bull., 41:39–45, 2018. 2

  38. [38]

    Y . Zhang. From GRPO to DAPO and GSPO: What, why, and how. Hugging Face Blog, Aug. 2025.https://huggingface.co/ blog/NormalUhr/grpo-to-dapo-and-gspo, URL:https:// huggingface.co/blog/NormalUhr/grpo-to-dapo-and-gspo. 1, 2