pith. machine review for the scientific record.

arxiv: 2605.08334 · v1 · submitted 2026-05-08 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

SalesSim: Benchmarking and Aligning Multimodal Language Models as Retail User Simulators

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 00:47 UTC · model grok-4.3

classification 💻 cs.CL
keywords multimodal language models · user simulation · retail conversations · persona alignment · decision alignment · reinforcement learning · conversational quality

The pith

Multimodal models simulate retail shoppers with under 79 percent average alignment to their assigned personas, but a new multi-turn reinforcement learning method raises decision alignment by 13.8 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets up SalesSim to test whether multimodal language models can act as believable customers who follow their own backgrounds, preferences, and dealbreakers through full retail conversations that include images, tool use, and multiple turns. Benchmarking shows the models generate fluent exchanges yet fall short on lexical variety, reveal too much about their criteria, and shift away from their specifications when sales agents make suggestions. The authors introduce UserGRPO, a reinforcement learning approach that trains the model on both conversational quality and decision consistency with the persona. Readers would care because accurate simulators are needed to develop and test sales assistant systems at scale without constant human involvement. If the results hold, training methods like this could produce simulators that stay closer to real customer behavior across diverse profiles.

Core claim

The paper establishes that even leading multimodal models achieve less than 79 percent average alignment with their underlying persona specifications in multi-turn, tool-augmented retail settings. It documents specific gaps relative to human baselines, including reduced lexical diversity and a tendency to overdisclose criteria or yield to sales persuasion. To address these shortfalls, the authors present UserGRPO as a multi-turn, multi-objective reinforcement learning procedure that jointly optimizes conversational fluency and adherence to persona-driven decisions, yielding a 13.8 percent gain in decision alignment for the baseline model.
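The abstract describes UserGRPO only at this level of detail, so its mechanics cannot be reconstructed from the review alone. As a rough illustration of what a multi-objective, group-relative update could look like, the sketch below blends a conversational-quality score and a decision-alignment score into one reward and normalizes it within a group of sampled rollouts. The blend weight `w_decision`, the two scoring callables, and the simple weighted sum are assumptions made here, not the paper's actual recipe.

```python
import statistics

def grpo_advantages(rollouts, score_quality, score_alignment, w_decision=0.7):
    """Group-relative advantages for a batch of simulated-shopper rollouts.

    Each rollout is one full multi-turn conversation sampled from the user
    simulator for the same persona/product prompt. The two scoring callables
    stand in for the conversational-quality and decision-alignment rewards;
    the blend weight is a hypothetical choice, not the paper's setting.
    """
    # Blend the two objectives into a single scalar reward per rollout.
    rewards = [
        (1.0 - w_decision) * score_quality(r) + w_decision * score_alignment(r)
        for r in rollouts
    ]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    # GRPO-style normalization: each rollout is scored relative to its own
    # group, so no learned value function is required.
    return [(r - mean) / std for r in rewards]
```

In a real multi-turn setup the normalized advantage would then weight the token- or turn-level policy-gradient terms for the corresponding rollout.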

What carries the argument

SalesSim framework, which treats user simulation as an agentic retail interaction process and evaluates it via a suite of decision alignment metrics that check consistency between simulator actions and explicit persona specifications.
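The decision alignment metrics themselves are not reproduced in the material reviewed here. Below is a minimal sketch of the kind of consistency check they imply, assuming a toy persona schema with a budget, dealbreaker attributes, and soft preferences; the field names and scoring rule are illustrative only, not the paper's definitions.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Persona:
    budget: float                                    # hypothetical schema; the
    dealbreakers: set = field(default_factory=set)   # paper's persona format
    preferences: set = field(default_factory=set)    # may differ

def decision_alignment(persona: Persona,
                       purchased: Optional[dict],
                       acceptable_offered: bool) -> float:
    """Score one conversation by whether the final decision respects the persona.

    `purchased` is the product the simulator accepted, e.g.
    {"price": 899.0, "attributes": {"lightweight", "touchscreen"}},
    or None if the simulator walked away. `acceptable_offered` flags whether
    any persona-compatible product was offered during the conversation.
    """
    if purchased is None:
        # Declining is aligned only if nothing acceptable was ever offered.
        return 0.0 if acceptable_offered else 1.0
    if purchased["price"] > persona.budget:
        return 0.0
    if persona.dealbreakers & purchased["attributes"]:
        return 0.0
    if not persona.preferences:
        return 1.0
    # Partial credit for how many stated preferences the accepted product meets.
    return len(persona.preferences & purchased["attributes"]) / len(persona.preferences)
```

Averaging such a score over many personas and conversations would yield a figure comparable in spirit to the "under 79 percent average alignment" number, though the paper's actual metric suite is presumably richer than this single rule.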

Load-bearing premise

That the chosen metrics for decision alignment and conversational quality, together with the collected human conversation baselines, correctly measure realistic customer behavior, and that the persona specifications remain clear and independent of the evaluation setup.

What would settle it

An experiment in which UserGRPO-trained simulators still show the same rates of persona drift and criteria overdisclosure as baseline models when sales agents apply stronger persuasion in new multi-turn scenarios would falsify the reported alignment improvement.

Figures

Figures reproduced from arXiv: 2605.08334 by Chien-Sheng Wu, Elaine Wan, Kai-Wei Chang, Lyanna Chen, Yada Pruksachatkun.

Figure 1: Qualitative examples of retail simulations on SalesSim. Baseline models exhibit over-leniency and are susceptible to the tonality of the salesperson simulator, both in proceeding with unsuitable purchases and in rejecting acceptable products as specified by their persona. In contrast, our UserGRPO model demonstrates more grounded reasoning based on product attributes.
Figure 2: Example of the SalesSim product and persona data. Our product data consists of rich metadata including features, prices, and multimodal information. Our persona data consists of fine-grained preferences and dealbreakers that tie closely to product choices.
Figure 3: ChatGPT overdiscloses criteria in the first turn using struc…
Figure 4: Qualitative examples on SalesSim.
Original abstract

We present SalesSim, a framework and testbed for evaluating the ability of Multimodal Large Language Models (MLLMs) to simulate realistic, persona-driven customer behavior in multi-turn, multi-modal, tool-augmented online retail conversations. Unlike prior work that treats user simulation as surface-level dialogue generation, SalesSim models retail interaction and decision-making as a grounded, agentic process, where shoppers with diverse backgrounds, preferences, and dealbreakers interact with a sales agent, seek clarifications, and make informed purchasing decisions. For evaluation, we design a suite of metrics centered on decision alignment, measuring the consistency between the simulator's actions and its persona specifications, as well as conversational quality. We find several behavioral gaps after benchmarking six open- and closed-source state-of-the-art models. First, while models produce fluent conversations, they display significantly lower lexical diversity and overdisclosure of criteria across personas compared to human conversations. Second, models tend to be persuaded by sales agent suggestions and drift from persona specifications. Even the strongest model achieves less than 79% average alignment with its underlying persona specifications. To make progress on these limitations, we propose UserGRPO, a multi-turn, multi-objective reinforcement learning recipe to optimize both conversational fluency and decision alignment under persona specifications. Our experiments demonstrate that UserGRPO boosts decision alignment of the baseline model by 13.8% while improving conversational quality. By introducing SalesSim, we provide a new testbed for the community to investigate and improve the adherence of user simulators in goal-oriented settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SalesSim, a benchmark and testbed for MLLMs simulating persona-driven retail customers in multi-turn, multimodal, tool-augmented conversations. It benchmarks six open- and closed-source models on decision alignment (consistency of actions with persona specifications) and conversational quality metrics, reporting that models show low lexical diversity, overdisclosure, persuasion by agents, and persona drift, with even the strongest model below 79% average alignment. It then proposes UserGRPO, a multi-turn multi-objective RL method, which improves baseline decision alignment by 13.8% while also raising conversational quality.

Significance. If the metrics prove externally valid and independent of prompting, SalesSim supplies a needed testbed for goal-oriented user simulation research, moving beyond surface dialogue generation to agentic decision-making. The UserGRPO recipe offers a concrete, reproducible optimization path that simultaneously targets fluency and persona adherence.

major comments (3)
  1. [Evaluation Metrics] The decision alignment metric (abstract and evaluation section) must be defined with sufficient detail to rule out circularity: if alignment is scored by an LLM judge that receives the identical persona text used to prompt the simulator, the reported <79% ceiling and 13.8% UserGRPO gain may simply reflect surface consistency rather than independent behavioral fidelity. A concrete protocol showing how drift is measured without leaking persona criteria into the judge is required.
  2. [Human Baselines] Human baseline collection (abstract and experimental setup) lacks documented instructions, inter-annotator agreement statistics, and controls confirming that human participants received non-leaking persona specifications identical to those given models. Without these, the contrast used to support both the model gaps and the RL improvement cannot be verified as externally valid.
  3. [Experimental Setup] The paper states concrete results (79% alignment, 13.8% gain) but the abstract and methods summary omit full metric definitions, data-collection details, and experimental controls. These omissions make the central empirical claims unverifiable at present and constitute a load-bearing gap for the benchmarking and alignment contributions.
minor comments (2)
  1. [Benchmarking] Clarify the exact identities and access methods for the six benchmarked models to support reproducibility.
  2. [Results] Lexical diversity and overdisclosure claims would benefit from explicit formulas or code references for the metrics used.
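Since neither metric is defined in the material reviewed here, the sketch below shows the kind of explicit formula the referee is asking for: distinct-n as a common lexical-diversity measure, and a crude keyword-matching proxy for first-turn overdisclosure. Both are illustrative stand-ins, not the paper's definitions.

```python
def distinct_n(turns: list, n: int = 2) -> float:
    """Distinct-n: unique n-grams divided by total n-grams across all
    simulator turns (higher means more lexically diverse)."""
    ngrams = []
    for turn in turns:
        tokens = turn.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def first_turn_disclosure(first_turn: str, criteria: list) -> float:
    """Fraction of persona criteria (given as keyword strings) already stated
    verbatim in the simulator's opening turn; a crude overdisclosure proxy."""
    text = first_turn.lower()
    hits = sum(1 for c in criteria if c.lower() in text)
    return hits / len(criteria) if criteria else 0.0
```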

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We appreciate the focus on ensuring the verifiability of our metrics, baselines, and experimental claims. We address each major comment below and commit to revisions that strengthen the paper without altering its core contributions.

Point-by-point responses
  1. Referee: [Evaluation Metrics] The decision alignment metric (abstract and evaluation section) must be defined with sufficient detail to rule out circularity: if alignment is scored by an LLM judge that receives the identical persona text used to prompt the simulator, the reported <79% ceiling and 13.8% UserGRPO gain may simply reflect surface consistency rather than independent behavioral fidelity. A concrete protocol showing how drift is measured without leaking persona criteria into the judge is required.

    Authors: We agree that greater specificity is needed to demonstrate that the decision alignment metric evaluates independent behavioral fidelity rather than surface-level consistency. The current manuscript provides a high-level description of the metric and LLM judge but does not include the full judge prompt or an explicit anti-leakage protocol. In the revised manuscript, we will add a dedicated subsection under Evaluation Metrics that: (1) reproduces the exact judge prompt template, which will direct the judge to assess alignment exclusively from the simulator's observed actions, decisions, and statements (without re-supplying the full persona text); and (2) details a turn-by-turn drift measurement protocol based on logged action sequences and preference consistency checks. These additions will directly address the circularity concern. revision: yes

  2. Referee: [Human Baselines] Human baseline collection (abstract and experimental setup) lacks documented instructions, inter-annotator agreement statistics, and controls confirming that human participants received non-leaking persona specifications identical to those given models. Without these, the contrast used to support both the model gaps and the RL improvement cannot be verified as externally valid.

    Authors: We concur that the human baseline documentation is currently insufficient for full verification. While the experimental setup references human comparisons, it omits the participant instructions, agreement statistics, and explicit controls for identical, non-leaking persona delivery. In the revision, we will expand the Human Baselines subsection to include: the complete instructions provided to participants, inter-annotator agreement metrics (e.g., Fleiss' kappa), and a clear statement confirming that persona specifications were presented in identical format and without leakage to both human participants and models. This will enable independent assessment of the baseline validity. revision: yes

  3. Referee: [Experimental Setup] The paper states concrete results (79% alignment, 13.8% gain) but the abstract and methods summary omit full metric definitions, data-collection details, and experimental controls. These omissions make the central empirical claims unverifiable at present and constitute a load-bearing gap for the benchmarking and alignment contributions.

    Authors: We acknowledge that the abstract and high-level methods overview are concise and do not repeat the full metric definitions, data-collection protocols, or controls that appear in later dedicated sections. While the full manuscript contains these elements, their absence from the summary sections reduces immediate verifiability. In the revision, we will: (1) augment the abstract with brief but precise metric definitions; (2) expand the methods summary to explicitly reference the subsections containing complete protocols, data collection procedures, and experimental controls; and (3) consider adding a concise appendix summarizing key configurations. These changes will make the empirical claims more readily verifiable while preserving the paper's structure. revision: partial
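To make the authors' first response above concrete: one way to measure drift turn by turn without re-supplying the persona text to an LLM judge is to check each logged accept/reject action programmatically against the persona constraints and record where violations begin. The event schema and rule below are hypothetical, a sketch of what such a protocol could look like rather than the authors' actual procedure.

```python
from typing import Optional

def first_drift_turn(actions: list,
                     budget: float,
                     dealbreakers: set) -> Optional[int]:
    """Return the turn index of the first logged action that violates the
    persona constraints, or None if the trajectory stays consistent.

    Each action is a hypothetical log entry such as
    {"turn": 3, "type": "accept", "price": 899.0, "attributes": {"touchscreen"}}.
    Under this toy rule only accept-type actions can violate the constraints.
    """
    for act in actions:
        if act["type"] != "accept":
            continue
        over_budget = act["price"] > budget
        hits_dealbreaker = bool(dealbreakers & act["attributes"])
        if over_budget or hits_dealbreaker:
            return act["turn"]
    return None
```

The same action log could feed a judge that sees only observed behavior, which is the separation the referee's circularity concern asks for.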

Circularity Check

0 steps flagged

No significant circularity in empirical benchmarking and RL optimization

Full rationale

The paper introduces SalesSim as an empirical testbed for benchmarking MLLMs on persona-driven retail simulation, defines decision alignment as a consistency metric between simulator actions and provided persona specifications, reports benchmarking results across models (including <79% alignment for the strongest), and proposes UserGRPO as a multi-objective RL method that yields a measured 13.8% improvement. No equations, predictions, or first-principles derivations are present that reduce any reported quantity to a fitted parameter, self-referential definition, or self-citation chain by construction. The work is checked against external human baselines and independent model evaluations, with no load-bearing steps that collapse back into their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on the abstract only; full-paper details on any free parameters in the metrics or RL objectives, on background assumptions about human retail behavior, or on invented entities are unavailable, so the ledger is minimal and provisional.

axioms (1)
  • domain assumption: Retail customer behavior can be modeled as a grounded agentic process driven by explicit persona specifications, including preferences and dealbreakers.
    Invoked as the core modeling choice for the SalesSim framework and evaluation metrics.

pith-pipeline@v0.9.0 · 5595 in / 1407 out tokens · 66339 ms · 2026-05-12T00:47:24.859343+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
