Recon: Reconstruction-Guided Reasoning Synthesis for User Modeling

Alan Zhu; Andrew Zhou; Carolyn Wang; Joseph E. Gonzalez; Lisa Dunlap; Mihran Miroyan; Narges Norouzi

arxiv: 2605.26969 · v1 · pith:PNHVJ7RWnew · submitted 2026-05-26 · 💻 cs.CL · cs.AI

Alan Zhu, Mihran Miroyan, Carolyn Wang, Andrew Zhou, Lisa Dunlap, Narges Norouzi, Joseph E. Gonzalez

Pith reviewed 2026-06-29 18:48 UTC · model grok-4.3

keywords user modeling · reasoning synthesis · reconstruction · post-hoc rationalization · language models · behavior simulation · reward modeling

The pith

Reconstruction of actions from reasoning traces identifies better decision paths than conditioning on the action itself.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that typical reasoning synthesis for user modeling creates post-hoc justifications because the trace is generated while knowing the action. Recon instead generates candidate traces and scores them by how accurately a separate model can reconstruct the original action when given only the context plus the trace. This produces traces that more faithfully encode the latent factors driving user behavior. Across domains the method beats standard backward synthesis and yields higher-quality training data for user simulators.

Core claim

Recon scores candidate reasoning traces by how well they let a reconstruction model recover the observed action from the context and the trace, without seeing the action itself; traces that enable higher-fidelity reconstruction are retained or used as rewards. This replaces post-hoc rationalization with a predictive criterion and yields reasoning that improves downstream user modeling accuracy while transferring across models.

What carries the argument

Action reconstruction fidelity, the accuracy with which a model predicts the action given context plus a candidate reasoning trace.
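
The selection step can be pictured as a best-of-N filter over candidate traces. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: `sample_reasoning`, `reconstruct_action`, and `alignment` are hypothetical callables standing in for the reasoning model, the reconstruction (action) model, and the alignment scorer. The token-overlap scorer is a toy stand-in chosen only so the example runs; the paper reports using an LM judge instead.

```python
from typing import Callable, List, Tuple


def token_overlap(predicted: str, target: str) -> float:
    """Toy alignment score (assumption): Jaccard overlap of lowercase tokens."""
    p, t = set(predicted.lower().split()), set(target.lower().split())
    union = p | t
    return len(p & t) / len(union) if union else 0.0


def recon_select(
    context: str,
    action: str,
    sample_reasoning: Callable[[str, str], str],    # hypothetical reasoning model
    reconstruct_action: Callable[[str, str], str],  # hypothetical action model
    alignment: Callable[[str, str], float] = token_overlap,
    n_candidates: int = 4,
) -> Tuple[str, List[float]]:
    """Return the candidate trace with the highest reconstruction score."""
    # Candidates are sampled given the context-action pair.
    candidates = [sample_reasoning(context, action) for _ in range(n_candidates)]
    scores: List[float] = []
    for trace in candidates:
        # The reconstruction model sees only context + trace, never the action.
        predicted = reconstruct_action(context, trace)
        scores.append(alignment(predicted, action))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores
```

Keeping the scorer pluggable makes it easy to swap the toy overlap metric for a judge-based score without touching the selection logic.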

If this is right

Recon achieves a 54.7 percent win rate against backward synthesis across four domains.
Reward-based training with Recon scores reaches up to 70 percent win rate on user modeling tasks.
Reasoning synthesized under Recon transfers to different language models and improves performance beyond the reconstruction model itself.
Post-hoc rationalization is shown to be insufficient when the goal is to recover causal decision structure.
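
Whether a win rate such as 54.7% is distinguishable from the 50% chance level depends on the number of pairwise comparisons, which this summary does not report. The sketch below uses the standard Wilson score interval; the counts in the usage note are purely hypothetical, and treating ties as excluded is an assumption.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a win rate (z = 1.96 gives roughly 95% coverage)."""
    if total <= 0 or not 0 <= wins <= total:
        raise ValueError("need 0 <= wins <= total and total > 0")
    p = wins / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical counts: the same observed rate can be decisive or not.
    print(wilson_interval(547, 1000))
    print(wilson_interval(55, 100))
```

With the hypothetical 1,000 comparisons the interval lies above 0.5, while with 100 comparisons at a similar rate it straddles 0.5, which is why the missing sample sizes matter for interpreting the headline numbers.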

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If reconstruction fidelity tracks causal structure, the same scoring could be applied to chain-of-thought traces in non-user-modeling tasks to reduce spurious justifications.
User simulators built this way might produce more stable long-horizon behavior predictions because the traces are constrained to be predictive rather than merely consistent.
The approach supplies an automatic filter that could be inserted into any pipeline that generates synthetic reasoning from observed behavior.
Evaluating Recon traces in new contexts where the human decisions are known would directly test whether the method captures transferable decision rules.

Load-bearing premise

That accurate reconstruction of the observed action indicates the trace encodes the true latent decision process rather than incidental correlations.

What would settle it

A controlled experiment in which user models trained on Recon traces are tested on held-out human actions and fail to outperform models trained on post-hoc rationales.

Figures

Figures reproduced from arXiv: 2605.26969 by Alan Zhu, Andrew Zhou, Carolyn Wang, Joseph E. Gonzalez, Lisa Dunlap, Mihran Miroyan, Narges Norouzi.

**Figure 1.** RECON pipeline. Given a context–action pair $(c, a^*)$ (1), we sample $N = 4$ candidate rationalizations from the reasoning model $M_r$ (2). Each candidate is evaluated by using the action model $M_a$ to reconstruct the action from the context $c$ and candidate reasoning $\hat{r}_i$, and measuring its alignment with the ground-truth action $a^*$ (3). The resulting scores are used either to select rationalizations for RECON-Sel…

**Figure 2.** Data domains. We study Reddit, Podcasts, U.K. Parliament debates, and U.S. Supreme Court oral arguments, ranging from brief spoken formal questions to long-form informal writing.

… action $\hat{a}_j$ is most aligned with $a^*$. We follow HumanLM [27] and use an LM judge to score alignment between reconstructed and ground-truth actions along three predefined dimensions (style, intent, and values) before making an over…

**Figure 3.** RECON-Select and E2E-GRPO results. Win rates against Backward Synthesis across four domains and overall. (Left) On Qwen3-8B, RECON-Select achieves a 54.7% overall win rate over baseline. E2E-GRPO achieves only a 38.4% win rate, indicating that optimizing for action accuracy alone does not produce transferable reasoning traces. (Right) On GPT-5-mini, RECON-Select achieves a 53.5% overall win rate, demonstra…

**Figure 4.** RECON-Select results by dimension. Win rates of RECON-Select against Backward Synthesis across domains and overall broken down by alignment dimensions: Style (Left), Intent (Middle), and Values (Right). RECON-Select consistently improves over the baseline for both Qwen3-8B and GPT-5-mini models across all dimensions. The inner polygon denotes 50% win rate. (Axis labels: SCOTUS, UK PMQ, Podcast, Reddit, Overall; win rate in percent.)

**Figure 5.** Cross-model transfer results. Win rates for RECON-Select against Backward Synthesis using different reasoning ($M_r$) and action ($M_a$) models. (Left) Qwen3-8B as $M_r$ and GPT-5-mini as $M_a$: gains are not statistically significant, suggesting limited transfer from weaker to stronger models. (Right) GPT-5-mini as $M_r$ and Qwen3-8B as $M_a$: gains are statistically significant, demonstrating that RECON-Select reasoning t…

**Figure 6.** RECON-Select and RECON-GRPO results across models. Win rates against Backward Synthesis on the PMQ domain across $M_r$ model families and sizes (Qwen3-4B, Llama-3.1-8B-Instruct, Qwen3-8B, Qwen3-14B), with Qwen3-8B fixed as $M_a$. (Left) RECON-Select: weaker reasoning models benefit most, while stronger models leave less headroom for improvement. (Right) RECON-GRPO: RL training with RECON-based rewards yields fur…

**Figure 7.** Qualitative example from PMQ. RECON-GRPO identifies the Prime Minister’s intended move as attacking the opposition rather than responding defensively, guiding the action model toward a response closer to the ground truth. In contrast, Backward Synthesis produces a defensive rationale, leading to a less aligned prediction.

… rates than those provided by RECON-Select at 70.0% and 56.4%, respectively, despite R…

Original abstract

User modeling aims to use language models (LMs) to mimic an individual's behavior from a corpus of past context-action pairs (e.g., conversation turns), enabling the simulation of users in settings like behavioral science, human-AI collaboration, and market research. Recent approaches augment these corpora with synthesized reasoning traces, typically generated by conditioning on both context and action. However, such conditioning constitutes post-hoc rationalization rather than reasoning: the trace is guaranteed to justify the action, but may not encode the underlying latent causal decision paths. We propose Recon, which uses action reconstruction to score reasoning traces by their predictive power: given a context and candidate reasoning, a reconstruction model predicts the action, and reconstruction fidelity determines reasoning quality. Across four domains, Recon achieves a 54.7% win rate over Backward Synthesis, a standard post-hoc rationalization baseline. Further, we find that training a reasoning synthesis model with rewards derived from Recon improves downstream user modeling performance, achieving a win rate of up to 70.0% over baselines. We further show that Recon-synthesized reasoning transfers across models, and improves user modeling beyond the reconstruction model. Our work demonstrates that post-hoc rationalization is insufficient for reasoning synthesis, and that useful and interpretable reasoning should naturally elicit the action from the context.
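
The reward-based variant is named RECON-GRPO in the figure captions, but this page does not say how the rewards are shaped. As a hedged illustration only, the sketch below shows the group-relative normalization commonly associated with GRPO, applied to hypothetical reconstruction scores for one group of sampled traces. The use of the population standard deviation and the `eps` stabilizer are assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some trainers use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical reconstruction scores for four candidate traces of one context.
    print(group_relative_advantages([0.2, 0.4, 0.6, 0.8]))
```

Traces that reconstruct the action better than their siblings receive positive advantages and the rest negative ones, so the training signal depends on relative rather than absolute reconstruction fidelity.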

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Recon scores reasoning traces by action reconstruction accuracy and reports gains over post-hoc baselines, but the abstract leaves the experimental controls and causal vs correlational distinction unaddressed.

The core idea is to score synthesized reasoning traces for user modeling by how well a separate reconstruction model can predict the observed action from context plus the trace. This is meant to favor traces that actually help elicit the action rather than just rationalize it after the fact. The paper reports that this scoring beats a standard backward synthesis baseline with a 54.7% win rate across four domains, and that using the scores to train a synthesis model lifts downstream user modeling to a win rate of up to 70% over baselines. It also claims the traces transfer across models and improve performance beyond the reconstruction model itself.

What stands out is the explicit separation of the scoring signal from the generation process, which avoids the built-in guarantee that post-hoc conditioning provides. The downstream transfer result is the most useful piece if it holds, because it suggests the selected traces carry something usable for simulation tasks in behavioral science or market research.

The main gaps are in the abstract: no description of data splits, baseline implementations, statistical tests, or how the reconstruction model was trained and kept independent. Without those, the win rates are hard to interpret. The central assumption—that high reconstruction fidelity means the trace encodes the latent causal decision process rather than just statistical associations—also needs direct tests; the reported transfer helps but does not rule out surface-level correlation capture. If the full paper has ablations on this point and reproducible code, the contribution becomes clearer.

This is aimed at applied LM work on user simulation rather than foundational reasoning research. A reader already working on similar synthesis pipelines could extract the scoring method and test it quickly. The paper deserves peer review because the claims are concrete and falsifiable once the methods are shown; the idea is simple enough that referees can check the evidence directly.

Referee Report

2 major / 1 minor

Summary. The manuscript presents Recon, a reconstruction-guided method for synthesizing reasoning traces in user modeling tasks. Unlike traditional approaches that generate reasoning conditioned on both context and action (post-hoc rationalization), Recon employs a reconstruction model to evaluate candidate reasoning traces by how well the observed action can be predicted from the context and the trace, without access to the action itself. Trace quality is determined by this reconstruction fidelity. The paper reports that Recon achieves a 54.7% win rate against Backward Synthesis across four domains and that incorporating Recon-derived rewards into training a reasoning synthesis model yields up to a 70.0% win rate over baselines in downstream user modeling. Additional results show transferability across models and improvements beyond the reconstruction model.

Significance. Should the results and the underlying assumption prove robust, this approach could advance the field of user modeling by providing a more principled way to generate interpretable reasoning that aligns with actual decision-making processes. The distinction drawn between rationalization and predictive reasoning is important, and the reported gains suggest potential for better simulation in applications like behavioral science and human-AI interaction. The cross-model transfer is a positive indicator of generality.

major comments (2)

[Abstract] The abstract states win-rate numbers (54.7% and 70.0%) but provides no information on experimental design, statistical tests, baseline implementations, data splits, or controls. This absence prevents assessment of whether the reported results support the central claims about Recon's superiority.
[Abstract] The core assumption that reconstruction fidelity serves as a valid proxy for the reasoning trace encoding the latent causal decision process (rather than merely capturing correlational patterns) is not directly tested or justified in the provided description. A reconstruction model could achieve high fidelity through shared lexical cues or post-hoc patterns without the reasoning reflecting the actual causal path, which would undermine the claim that this is superior for causal modeling.
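
One inexpensive way to probe the lexical-cue concern is to measure how much of the ground-truth action's wording already appears verbatim in a trace. The sketch below is an illustrative diagnostic that is not described in the paper; the function names, the whitespace tokenization, and the default n-gram size are all assumptions.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def leakage(trace: str, action: str, n: int = 3) -> float:
    """Fraction of the action's n-grams that appear verbatim in the trace."""
    action_grams = ngrams(action, n)
    if not action_grams:
        return 0.0  # action shorter than n tokens: nothing to compare
    return len(action_grams & ngrams(trace, n)) / len(action_grams)


if __name__ == "__main__":
    action = "the budget is a disaster"
    print(leakage("he will say the budget is a disaster today", action))
    print(leakage("he plans to criticise spending", action))
```

A trace with high leakage could earn a high reconstruction score simply by quoting the answer, so comparing win rates before and after filtering such traces would help separate lexical copying from predictive reasoning.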

minor comments (1)

[Abstract] The abstract mentions 'four domains' but does not specify what they are; including this would improve clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. We address each major comment below, proposing revisions to improve clarity and address concerns where feasible.

Referee: [Abstract] The abstract states win-rate numbers (54.7% and 70.0%) but provides no information on experimental design, statistical tests, baseline implementations, data splits, or controls. This absence prevents assessment of whether the reported results support the central claims about Recon's superiority.

Authors: We agree the abstract is too concise and omits key details needed to contextualize the reported win rates. In the revised version, we will expand the abstract to briefly note the evaluation across four domains, the Backward Synthesis baseline, the use of win-rate metrics from comparative judgments, and that full experimental design, data splits, controls, and statistical details appear in Sections 4 and 5. This will better support the claims while respecting abstract length constraints.

revision: yes

Referee: [Abstract] The core assumption that reconstruction fidelity serves as a valid proxy for the reasoning trace encoding the latent causal decision process (rather than merely capturing correlational patterns) is not directly tested or justified in the provided description. A reconstruction model could achieve high fidelity through shared lexical cues or post-hoc patterns without the reasoning reflecting the actual causal path, which would undermine the claim that this is superior for causal modeling.

Authors: The paper explicitly contrasts post-hoc rationalization with predictive reasoning and positions reconstruction fidelity as a proxy for the latter. While we lack direct causal intervention experiments to rule out lexical or correlational confounds, the downstream gains (up to 70% win rate) and cross-model transfer results provide empirical support that Recon-derived traces improve user modeling beyond what post-hoc methods achieve. We will add a dedicated paragraph in the discussion clarifying this assumption, its predictive (rather than strictly causal) framing, and the supporting evidence, while acknowledging the limitation.

revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's core mechanism uses a separate reconstruction model to score candidate reasoning traces via action prediction fidelity. The abstract and description explicitly state that Recon-synthesized reasoning transfers across models and improves user modeling performance beyond the reconstruction model itself. No equations, self-citations, or definitional steps are present that reduce the claimed prediction or scoring to inputs by construction. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the approach appears to rest on standard language-model training and an auxiliary reconstruction model whose training details are not described.

pith-pipeline@v0.9.1-grok · 5773 in / 1108 out tokens · 32971 ms · 2026-06-29T18:48:05.646578+00:00 · methodology

Reference graph

Works this paper leans on

37 extracted references · 15 canonical work pages · 8 internal anchors

[1] Gati V Aher, Rosa I Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans and replicate human subject studies. In International Conference on Machine Learning, 2023.
[2] Parth Asawa, Alan Zhu, Abby O’Neill, Matei Zaharia, Alexandros G Dimakis, and Joseph E Gonzalez. How to train your advisor: Steering black-box LLMs with advisor models. arXiv preprint arXiv:2510.02453, 2025.
[3] Thomas Kleine Buening, Jonas Hübotter, Barna Pásztor, Idan Shenfeld, Giorgia Ramponi, and Andreas Krause. Aligning language models from user interactions. arXiv preprint arXiv:2603.12273, 2026.
[4] Daiwei Chen, Yi Chen, Aniket Rege, and Ramya Korlakai Vinayak. PAL: Pluralistic alignment framework for learning from heterogeneous preferences. In NeurIPS 2024 Workshop on Behavioral Machine Learning, 2024.
[5] Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 1960.
[6] Jinyuan Fang, Zaiqiao Meng, and Craig Macdonald. TRACE the evidence: Constructing knowledge-grounded reasoning chains for retrieval-augmented generation. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8472–8494, 2024.
[7] Google. Gemini 3.1 flash-lite preview. https://ai.google.dev/gemini-api/docs/models/gemini-3.1-flash-lite-preview, 2026. Accessed: 2026-05-04.
[8] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
[9] Angel Hsing-Chi Hwang, Michael S Bernstein, S Shyam Sundar, Renwen Zhang, Manoel Horta Ribeiro, Yingdan Lu, Serina Chang, Tongshuang Wu, Aimei Yang, Dmitri Williams, et al. Human subjects research in the age of generative AI: Opportunities and challenges of applying LLM-simulated data to HCI studies. In Proceedings of the Extended Abstracts of the CHI Conf..., 2025.
[10] Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, and Prithviraj Ammanabrolu. Personalized soups: Personalized large language model alignment via post-hoc parameter merging. arXiv preprint arXiv:2310.11564, 2023.
[11] Brihi Joshi, Xiang Ren, Swabha Swayamdipta, Rik Koncel-Kedziorski, and Tim Paek. Improving language model personas via rationalization with psychological scaffolds. In Findings of the Association for Computational Linguistics: EMNLP 2025, 2025.
[12] Xinyu Li, Ruiyang Zhou, Zachary C Lipton, and Liu Leqi. Personalized language modeling from personalized human feedback. arXiv preprint arXiv:2402.05133, 2024.
[13] Yuxuan Lu, Jing Huang, Yan Han, Bingsheng Yao, Sisong Bei, Jiri Gesi, Yaochen Xie, Qi He, Dakuo Wang, et al. Can LLM agents simulate multi-turn human behavior? Evidence from real online customer behavior data. arXiv preprint arXiv:2503.20749, 2025.
[14] Mihran Miroyan, Rose Niousha, Joseph E Gonzalez, Gireeja Ranade, and Narges Norouzi. Parastudent: Generating and evaluating realistic student code by teaching LLMs to struggle. arXiv preprint arXiv:2507.12674, 2025.
[15] Tarek Naous, Philippe Laban, Wei Xu, and Jennifer Neville. Flipping the dialogue: Training and evaluating user language models. arXiv preprint arXiv:2510.06552, 2025.
[16] Richard E Nisbett and Timothy D Wilson. Telling more than we can know: Verbal reports on mental processes. Psychological Review, 84(3):231, 1977.
[17] Oyez. Oyez. https://www.oyez.org/, 2026. Accessed: 2026-05-06.
[18] Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023.
[19] Joon Sung Park, Carolyn Q Zou, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Robb Willer, Percy Liang, and Michael S Bernstein. Generative agent simulations of 1,000 people. arXiv preprint arXiv:2411.10109, 52, 2024.
[20] Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, and Natasha Jaques. Personalizing reinforcement learning from human feedback with variational preference learning. Advances in Neural Information Processing Systems, 37:52516–52544, 2024.
[21] Michael J Ryan, Omar Shaikh, Aditri Bhagirath, Daniel Frees, William Barr Held, and Diyi Yang. SynthesizeMe! Inducing persona-guided prompts for personalized reward models in LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8045–8078, 2025.
[22] Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Synthetic prompting: Generating chain-of-thought demonstrations for large language models. In International Conference on Machine Learning, pages 30706–30775. PMLR, 2023.
[23] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[24] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025.
[25] Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell, Stefano Ermon, Tatsunori Hashimoto, Archit Sharma, and Chelsea Finn. FSPO: Few-shot optimization of synthetic preferences personalizes to real users. arXiv preprint arXiv:2502.19312, 2025.
[26] Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, and Jianfeng Gao. CollabLLM: From passive responders to active collaborators. arXiv preprint arXiv:2502.00640, 2025.
[27] Shirley Wu, Evelyn Choi, Arpandeep Khatua, Zhanghan Wang, Joy He-Yueya, Tharindu Cyril Weerasooriya, Wei Wei, Diyi Yang, Jure Leskovec, and James Zou. HumanLM: Simulating users with state alignment beats response imitation. arXiv preprint arXiv:2603.03303, 2026.
[28] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[29] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022.
[30] Tingwei Zhang, John X Morris, and Vitaly Shmatikov. How to steal reasoning without reasoning traces. arXiv preprint arXiv:2603.07267, 2026.
[31] Siyan Zhao, John Dang, and Aditya Grover. Group preference optimization: Few-shot alignment of large language models. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023.

A Ethics Statement

Our work uses real human interaction data for user modeling experiments. The data consists of (i) publicly available transcripts of publi...
1. Retrieve context-action pairs: $\{(c_i, a_i^*)\}_{i=1}^{k}$
2. Obtain Backward Synthesis augmentations: $\{\hat{r}_i^{b}\}_{i=1}^{k} = f_b(c, a^*)$
3. Obtain Backward Synthesis-based action: $\hat{a}_T^{b} = M_a(\{(c_i, \hat{r}_i^{b}, a_i^*)\}_{i=1}^{k}, c_T)$
4. Obtain augmentations from $f$: $\{\hat{r}_i^{f}\}_{i=1}^{k} = f(c, a^*)$
5. Obtain $f$-based action: $\hat{a}_T^{f} = M_a(\{(c_i, \hat{r}_i^{f}, a_i^*)\}_{i=1}^{k}, c_T)$
6. Compare $\hat{a}_T^{b}$ and $\hat{a}_T^{f}$ for similarity to $a_T^*$.

Step 6 is performed using an LM pairwise judge, specifically Gemini-3.1 Flash Lite with the default recommended sampling parameters. We describe the prompt below.

D.2 Action Generation

Following Equation 1, we provide the action generation model with the augmented retrieved examples and the current test conte...

Prompt fragment: ... I">. It is possible you have yet to speak in the conversation, in which case no turns in the Current Context are labeled with <turn author= ...
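
The six evaluation steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' code: `retrieve`, `f_b`, `f`, `action_model`, and `judge_prefers_first` are hypothetical callables for the retriever, Backward Synthesis, the method under test, the action model, and the pairwise LM judge. Ties are not modeled here, which is a simplification.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]       # (context, ground-truth action)
Shot = Tuple[str, str, str]  # (context, synthesized reasoning, action)


def augment(pairs: Sequence[Pair], synthesize: Callable[[str, str], str]) -> List[Shot]:
    """Attach a synthesized reasoning trace to every retrieved pair."""
    return [(c, synthesize(c, a), a) for c, a in pairs]


def pairwise_win_rate(
    test_items: Sequence[Pair],
    retrieve: Callable[[str], Sequence[Pair]],
    f_b: Callable[[str, str], str],
    f: Callable[[str, str], str],
    action_model: Callable[[List[Shot], str], str],
    judge_prefers_first: Callable[[str, str, str], bool],
) -> float:
    """Fraction of test items where the f-based action beats the baseline."""
    if not test_items:
        raise ValueError("need at least one test item")
    wins = 0
    for c_T, a_T in test_items:
        pairs = retrieve(c_T)                         # step 1
        a_b = action_model(augment(pairs, f_b), c_T)  # steps 2-3
        a_f = action_model(augment(pairs, f), c_T)    # steps 4-5
        if judge_prefers_first(a_f, a_b, a_T):        # step 6
            wins += 1
    return wins / len(test_items)
```

Because both arms share the same retrieved pairs and the same action model, any difference in the judged outcome is attributable to the reasoning traces alone.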

[1] [1]

Using large language models to simulate multiple humans and replicate human subject studies

Gati V Aher, Rosa I Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans and replicate human subject studies. InInternational conference on machine learning, 2023

2023

[2] [2]

How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models

Parth Asawa, Alan Zhu, Abby O’Neill, Matei Zaharia, Alexandros G Dimakis, and Joseph E Gonzalez. How to train your advisor: Steering black-box LLMs with advisor models.arXiv preprint arXiv:2510.02453, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Aligning language models from user interactions.arXiv preprint arXiv:2603.12273, 2026

Thomas Kleine Buening, Jonas Hübotter, Barna Pásztor, Idan Shenfeld, Giorgia Ramponi, and Andreas Krause. Aligning language models from user interactions.arXiv preprint arXiv:2603.12273, 2026

work page arXiv 2026

[4] [4]

PAL: Pluralistic alignment framework for learning from heterogeneous preferences

Daiwei Chen, Yi Chen, Aniket Rege, and Ramya Korlakai Vinayak. PAL: Pluralistic alignment framework for learning from heterogeneous preferences. InNeurIPS 2024 Workshop on Behavioral Machine Learning, 2024

2024

[5] [5]

A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 1960

Jacob Cohen. A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 1960

1960

[6] [6]

TRACE the evidence: Constructing knowledge-grounded reasoning chains for retrieval-augmented generation

Jinyuan Fang, Zaiqiao Meng, and Craig Macdonald. TRACE the evidence: Constructing knowledge-grounded reasoning chains for retrieval-augmented generation. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 8472–8494, 2024

2024

[7] [7]

Gemini 3.1 flash-lite preview

Google. Gemini 3.1 flash-lite preview. https://ai.google.dev/gemini-api/docs/ models/gemini-3.1-flash-lite-preview, 2026. Accessed: 2026-05-04

2026

[8] [8]

LoRA: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

2022

[9] [9]

Human subjects research in the age of generative AI: Opportunities and challenges of applying LLM-simulated data to HCI studies

Angel Hsing-Chi Hwang, Michael S Bernstein, S Shyam Sundar, Renwen Zhang, Manoel Horta Ribeiro, Yingdan Lu, Serina Chang, Tongshuang Wu, Aimei Yang, Dmitri Williams, et al. Human subjects research in the age of generative AI: Opportunities and challenges of applying LLM-simulated data to HCI studies. InProceedings of the Extended Abstracts of the CHI Conf...

2025

[10] [10]

Personalized soups: Per- sonalized large language model alignment via post-hoc parameter merging.arXiv preprint arXiv:2310.11564, 2023

Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, and Prithviraj Ammanabrolu. Personalized soups: Per- sonalized large language model alignment via post-hoc parameter merging.arXiv preprint arXiv:2310.11564, 2023

work page arXiv 2023

[11] [11]

Improv- ing language model personas via rationalization with psychological scaffolds

Brihi Joshi, Xiang Ren, Swabha Swayamdipta, Rik Koncel-Kedziorski, and Tim Paek. Improv- ing language model personas via rationalization with psychological scaffolds. InFindings of the Association for Computational Linguistics: EMNLP 2025, 2025

2025

[12] [12]

Personalized language modeling from personalized human feedback.arXiv preprint arXiv:2402.05133, 2024

Xinyu Li, Ruiyang Zhou, Zachary C Lipton, and Liu Leqi. Personalized language modeling from personalized human feedback.arXiv preprint arXiv:2402.05133, 2024

work page arXiv 2024

[13]

Can LLM Agents Simulate Multi-Turn Human Behavior? Evidence from Real Online Customer Behavior Data

Yuxuan Lu, Jing Huang, Yan Han, Bingsheng Yao, Sisong Bei, Jiri Gesi, Yaochen Xie, Qi He, Dakuo Wang, et al. Can LLM agents simulate multi-turn human behavior? Evidence from real online customer behavior data. arXiv preprint arXiv:2503.20749, 2025

2025

[14]

ParaStudent: Generating and evaluating realistic student code by teaching LLMs to struggle

Mihran Miroyan, Rose Niousha, Joseph E Gonzalez, Gireeja Ranade, and Narges Norouzi. ParaStudent: Generating and evaluating realistic student code by teaching LLMs to struggle. arXiv preprint arXiv:2507.12674, 2025

2025

[15]

Flipping the dialogue: Training and evaluating user language models

Tarek Naous, Philippe Laban, Wei Xu, and Jennifer Neville. Flipping the dialogue: Training and evaluating user language models. arXiv preprint arXiv:2510.06552, 2025

2025

[16]

Telling more than we can know: Verbal reports on mental processes

Richard E Nisbett and Timothy D Wilson. Telling more than we can know: Verbal reports on mental processes. Psychological Review, 84(3):231, 1977

1977

[17]

Oyez

Oyez. Oyez. https://www.oyez.org/, 2026. Accessed: 2026-05-06

2026

[18]

Generative agents: Interactive simulacra of human behavior

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023

2023

[19]

LLM Agents Grounded in Self-Reports Enable General-Purpose Simulation of Individuals

Joon Sung Park, Carolyn Q Zou, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Robb Willer, Percy Liang, and Michael S Bernstein. Generative agent simulations of 1,000 people. arXiv preprint arXiv:2411.10109, 52, 2024

2024

[20]

Personalizing reinforcement learning from human feedback with variational preference learning

Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, and Natasha Jaques. Personalizing reinforcement learning from human feedback with variational preference learning. Advances in Neural Information Processing Systems, 37:52516–52544, 2024

2024

[21]

SynthesizeMe! inducing persona-guided prompts for personalized reward models in LLMs

Michael J Ryan, Omar Shaikh, Aditri Bhagirath, Daniel Frees, William Barr Held, and Diyi Yang. SynthesizeMe! inducing persona-guided prompts for personalized reward models in LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8045–8078, 2025

2025

[22]

Synthetic prompting: Generating chain-of-thought demonstrations for large language models

Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Synthetic prompting: Generating chain-of-thought demonstrations for large language models. In International Conference on Machine Learning, pages 30706–30775. PMLR, 2023

2023

[23]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

2024

[24]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025

2025

[25]

FSPO: Few-Shot Optimization of Synthetic Preferences Personalizes to Real Users

Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell, Stefano Ermon, Tatsunori Hashimoto, Archit Sharma, and Chelsea Finn. FSPO: Few-shot optimization of synthetic preferences personalizes to real users. arXiv preprint arXiv:2502.19312, 2025

2025

[26]

CollabLLM: From passive responders to active collaborators

Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, and Jianfeng Gao. CollabLLM: From passive responders to active collaborators. arXiv preprint arXiv:2502.00640, 2025

2025

[27]

HumanLM: Simulating users with state alignment beats response imitation

Shirley Wu, Evelyn Choi, Arpandeep Khatua, Zhanghan Wang, Joy He-Yueya, Tharindu Cyril Weerasooriya, Wei Wei, Diyi Yang, Jure Leskovec, and James Zou. HumanLM: Simulating users with state alignment beats response imitation. arXiv preprint arXiv:2603.03303, 2026

2026

[28]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

2025

[29]

STaR: Bootstrapping reasoning with reasoning

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022

2022

[30]

How to Steal Reasoning Without Reasoning Traces

Tingwei Zhang, John X Morris, and Vitaly Shmatikov. How to steal reasoning without reasoning traces. arXiv preprint arXiv:2603.07267, 2026

2026

[31]

Group preference optimization: Few-shot alignment of large language models

Siyan Zhao, John Dang, and Aditya Grover. Group preference optimization: Few-shot alignment of large language models. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023

2023

A Ethics Statement

Our work uses real human interaction data for user modeling experiments. The data consists of (i) publicly available transcripts of publi...

1. Retrieve context-action pairs: $\{(c_i, a_i^{*})\}_{i=1}^{k}$
2. Obtain Backward Synthesis augmentations: $\{\hat{r}_i^{b}\}_{i=1}^{k} = f^{b}(c, a^{*})$
3. Obtain Backward Synthesis-based action: $\hat{a}_T^{b} = M_a(\{(c_i, \hat{r}_i^{b}, a_i^{*})\}_{i=1}^{k}, c_T)$
4. Obtain augmentations from $f$: $\{\hat{r}_i^{f}\}_{i=1}^{k} = f(c, a^{*})$
5. Obtain $f$-based action: $\hat{a}_T^{f} = M_a(\{(c_i, \hat{r}_i^{f}, a_i^{*})\}_{i=1}^{k}, c_T)$
6. Compare $\hat{a}_T^{b}$ and $\hat{a}_T^{f}$ for similarity to $a_T^{*}$.

Step 6 is performed using an LM pairwise judge, specifically Gemini-3.1 Flash Lite with the default recommended sampling parameters. We describe the prompt below.

D.2 Action Generation

Following Equation 1, we provide the action generation model with the augmented retrieved examples and the current test conte...
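As an illustration only, the six-step comparison listed above (retrieve pairs, augment them with each synthesis method, generate an action from each augmented set, then judge which action is closer to the observed one) might be sketched in Python as follows. Every name here (`augment`, `compare_methods`, and the `toy_*` functions) is hypothetical and not taken from the paper. In particular, the token-overlap judge is a stand-in for the LM pairwise judge, and the toy synthesizers and action model are stand-ins for the actual models.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]            # (context c_i, observed action a*_i)
Augmented = Tuple[str, str, str]  # (c_i, synthesized reasoning r_i, a*_i)
Synthesizer = Callable[[str, str], str]
ActionModel = Callable[[List[Augmented], str], str]
Judge = Callable[[str, str, str], str]


def augment(pairs: List[Pair], synthesize: Synthesizer) -> List[Augmented]:
    """Attach a synthesized reasoning trace to each retrieved pair."""
    return [(c, synthesize(c, a), a) for c, a in pairs]


def compare_methods(
    retrieved: List[Pair],       # step 1: already-retrieved pairs
    test_context: str,
    true_action: str,
    f_backward: Synthesizer,
    f_candidate: Synthesizer,
    action_model: ActionModel,
    judge: Judge,
) -> str:
    """Return "b", "f", or "tie" depending on which action the judge prefers."""
    aug_b = augment(retrieved, f_backward)         # step 2
    action_b = action_model(aug_b, test_context)   # step 3
    aug_f = augment(retrieved, f_candidate)        # step 4
    action_f = action_model(aug_f, test_context)   # step 5
    return judge(action_b, action_f, true_action)  # step 6


# --- Toy stand-ins (assumptions for illustration, not the paper's models) ---
def toy_backward(context: str, action: str) -> str:
    return f"The user said '{action}', so that is what they wanted to say."


def toy_candidate(context: str, action: str) -> str:
    return "The user tends to answer briefly and politely."


def toy_action_model(examples: List[Augmented], test_context: str) -> str:
    # Looks only at the reasoning of the last example; a real model would
    # condition on all examples and the test context.
    last_reasoning = examples[-1][1]
    if "briefly" in last_reasoning:
        return "ok thanks"
    return "let me explain my view in detail"


def toy_judge(action_b: str, action_f: str, true_action: str) -> str:
    # Token overlap with the observed action, standing in for an LM judge.
    target = set(true_action.lower().split())
    score_b = len(set(action_b.lower().split()) & target)
    score_f = len(set(action_f.lower().split()) & target)
    if score_f > score_b:
        return "f"
    if score_b > score_f:
        return "b"
    return "tie"


retrieved = [("Agent: Your order has shipped.", "thanks")]
winner = compare_methods(
    retrieved, "Agent: Anything else?", "thanks ok",
    toy_backward, toy_candidate, toy_action_model, toy_judge,
)
print(winner)
```

Passing the synthesizers, action model, and judge as callables keeps the comparison logic separate from any particular model, so the same loop could be reused with real model calls in place of the toy functions.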