arxiv: 2605.05391 · v1 · submitted 2026-05-06 · 💻 cs.HC

Every(bot) Makes Mistakes: Coding Big Five Personalities, Context, and Tone into an LLM Chatbot Recovery Code Framework

Rachel Hill, Tom Owen, Julian Hough

Pith reviewed 2026-05-08 15:57 UTC · model grok-4.3

classification 💻 cs.HC

keywords: LLM chatbot, error recovery, Big Five personality, tone alignment, context mapping, recovery code, human-AI interaction

The pith

Structured recovery codes raise LLM chatbots' average error-recovery score by 27.8 percentage points (76.7 percent versus 48.9 percent) under LLM-based evaluation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a recovery code framework that ties common chatbot task contexts to specific Big Five personality traits, tones, and step-by-step recovery instructions. It tests whether training an LLM on this code improves how it handles mistakes compared to an untrained version. The work matters because bad error recovery can erode user trust and engagement in AI conversations. Using LLM evaluators to score responses on recovery quality, tone match, and overall appropriateness, the coded version outperformed the baseline by nearly 28 percentage points (76.7 percent versus 48.9 percent). Notable gains appeared in making responses appropriate to the assigned personality and in offering clear explanations.

Core claim

The central claim is that a structured recovery code mapping four task contexts to four Big Five personality traits, tones, and three-stage instructions can be learned by LLMs, resulting in recovery responses that score 27.8 percentage points higher on average than baseline responses (76.7% versus 48.9%) when evaluated on recovery quality, tone alignment, and appropriateness.

What carries the argument

The recovery code, a structured mapping of four task contexts to Big Five personality traits, associated tones, and three-stage recovery instructions that guide the chatbot's response to errors.
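The mapping described above can be sketched as a small lookup table. This is a minimal illustration, not the authors' implementation: the `RecoveryCode` class, the `RECOVERY_CODES` dictionary, and the `recovery_prompt` helper are hypothetical names. The trait, tone, and step wording follows the fragments of the paper's Table 5 and transcripts captured in this page's extraction. The C1 context label is inferred from a transcript remark, the C1 first step is taken from an agent's self-report, and the C1 tone label is not recoverable, so it is left as `None`.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class RecoveryCode:
    """One row of the recovery code (hypothetical container, for illustration)."""
    context: str                 # task context label
    trait: str                   # Big Five trait assigned to the context
    tone: Optional[str]          # tone label; None where not recoverable here
    steps: Tuple[str, str, str]  # three-stage recovery instructions


# Keys C1..C4 follow the paper's context codes.
RECOVERY_CODES: Dict[str, RecoveryCode] = {
    "C1": RecoveryCode(
        context="Grammar correction",  # assumption: inferred, not stated in the table fragments
        trait="Conscientiousness",
        tone=None,                     # T1 label is missing from the extracted text
        steps=(
            "Identify the error in an organised way",  # wording from an agent self-report
            "Reassure the user responsibly",
            "Continue with perseverance",
        ),
    ),
    "C2": RecoveryCode(
        context="Emotional support",
        trait="Agreeableness",
        tone="Warm",
        steps=("Identify error cooperatively", "Reassure the user kindly", "Continue warmly"),
    ),
    "C3": RecoveryCode(
        context="Brainstorming",
        trait="Openness",
        tone="Conversational",
        steps=("Identify error curiously", "Reassure the user broad-mindedly", "Continue creatively"),
    ),
    "C4": RecoveryCode(
        context="Learning a concept",
        trait="Extraversion",
        tone="Engaging",
        steps=("Identify error actively", "Reassure the user energetically", "Continue sociably"),
    ),
}


def recovery_prompt(code_id: str) -> str:
    """Render one recovery code as plain instruction text (format is an assumption)."""
    code = RECOVERY_CODES[code_id]
    tone = code.tone or "unspecified"
    lines = [f"Context: {code.context} | Trait: {code.trait} | Tone: {tone}"]
    lines += [f"{i}. {step}" for i, step in enumerate(code.steps, start=1)]
    return "\n".join(lines)
```

Keeping each context's trait, tone, and steps in one record mirrors the paper's `{C; trait; T; R}` bundles, so a deployed system would only need to decide which context applies before looking up the rest.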

If this is right

Coded responses achieve an 83.3 percent score in the appropriateness dimension.
Personality appropriateness rises from 50 percent in baseline to 75 percent in coded conditions.
The ability to provide explanations improves from 20 percent to 60 percent.
The framework delivers measurable gains across different task contexts without requiring human participants in the test phase.
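The reported figures can be cross-checked with a few lines of arithmetic. The sketch below uses the averages given in the abstract (76.7% coded, 48.9% baseline) and shows that 27.8 is the difference between the two percentages, that is, a gap in percentage points; the corresponding relative gain over baseline is larger. The helper `to_five_point` is a hypothetical name and assumes each percentage is a mean score on the rubric's 5-point scale.

```python
baseline = 48.9  # Condition A average, in percent (from the abstract)
coded = 76.7     # Condition B average, in percent (from the abstract)

point_gap = coded - baseline                # difference in percentage points
relative_gain = point_gap / baseline * 100  # gain relative to the baseline score

print(f"gap: {point_gap:.1f} percentage points")  # gap: 27.8 percentage points
print(f"relative gain: {relative_gain:.1f}%")     # relative gain: 56.9%


def to_five_point(percent: float, scale: int = 5) -> float:
    """Convert a percentage back to a mean score on a 0..scale rubric (assumed linear)."""
    return percent * scale / 100


# Under that assumption, 75% versus 50% corresponds to 3.75 versus 2.5 out of 5,
# and 60% versus 20% corresponds to 3.0 versus 1.0 out of 5.
```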

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Human participant studies would be required to check whether the LLM evaluator scores align with actual user judgments.
The same mapping approach could be extended to additional personality traits or error scenarios not covered in the original four contexts.
Deployed systems might combine the recovery code with live context detection to handle mistakes more consistently in ongoing conversations.

Load-bearing premise

That scores from LLM evaluator agents on the designed rubric accurately reflect how human users would perceive and respond to the recovery quality, tone alignment, and appropriateness in real interactions.

What would settle it

A controlled user study in which human participants interact with both the coded and baseline chatbots, then directly rate the recovery responses on the same recovery quality, tone alignment, and appropriateness dimensions.
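If such a study were run, one simple first analysis would compare the human ratings with the LLM evaluator scores for the same responses. The sketch below is an illustration under assumptions, not a procedure from the paper: the function names `pearson` and `agreement` are hypothetical, both score lists are assumed to be on the same 1 to 5 scale and aligned by response, and no data are supplied here.

```python
from math import sqrt
from statistics import mean
from typing import Dict, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / (sx * sy)


def agreement(llm_scores: Sequence[float], human_scores: Sequence[float]) -> Dict[str, float]:
    """Summarise how closely LLM evaluator scores track human ratings."""
    if len(llm_scores) != len(human_scores) or len(llm_scores) < 2:
        raise ValueError("need two aligned lists with at least two scores each")
    diffs = [abs(a - b) for a, b in zip(llm_scores, human_scores)]
    return {
        "pearson_r": pearson(llm_scores, human_scores),  # do the rankings move together?
        "mean_abs_diff": mean(diffs),                    # how far apart are the scores?
    }
```

A high correlation with a small mean absolute difference would support treating the LLM evaluators as a proxy; a low correlation would suggest the reported gap reflects evaluator preferences rather than user perception.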

Figures

Figures reproduced from arXiv: 2605.05391 by Julian Hough, Rachel Hill, Tom Owen.

**Figure 1.** Diagram of condition A agent recovery and the condition A LLM evaluator agent recovery score generation for CA:C1.

**Figure 2.** Diagram of condition B agent training and utilisation of recovery code and the condition B LLM evaluator agent recovery score generation for CB:C1.

**Figure 3.** Average recovery scores by subdimension: Condition A (baseline) vs Condition B (coded).

Original abstract

Despite careful design involving classifiers, parameters, and safeguarding, errors during human/AI interaction are not rare. Poor error recovery can disrupt interaction flow, damage user trust, and decrease user engagement. Whilst existing work has explored LLM recovery, tone, context, and personality as separate design dimensions, no existing work has combined these variables into a structured guidance framework. This paper presents a recovery code that maps four common LLM chatbot task contexts to associated personality traits (four Big Five personalities: Conscientiousness, Agreeableness, Openness, and Extraversion), tones, and three-stage recovery instructions. A recovery evaluation rubric was also designed, comprising three dimensions (Recovery quality, Tone alignment, and Appropriateness) and nine sub-dimensions. The methodology is exploratory, with no participants used. A between-subjects design was employed across two conditions: Condition A (baseline, uncoded), four separate Claude Sonnet 4.6 agents received no recovery code training; Condition B (coded), four separate Claude Sonnet 4.6 models were trained on the recovery code. Identical 'user' prompts and error scenarios were used across both conditions. Eight LLM evaluator agents assessed the recovery responses using the evaluation rubric, producing scores out of 5 for each sub-dimension. Results found a 27.8% average performance increase in coded recovery responses (76.7%) compared to baseline responses (48.9%). Condition B performed strongest in the appropriateness dimension (83.3%), with notable improvement in personality appropriateness (75% versus 50%) and providing explanation (60% versus 20%). These findings suggest that structured personality, context, and tone-informed recovery codes can be successfully learnt and applied by LLM chatbots to improve error recovery quality across varying contextual tasks.
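The abstract reports sub-dimension scores out of 5 alongside percentage results. The sketch below shows one plausible way such scores could be rolled up into percentages. It is an assumption, since this page does not state the paper's exact averaging procedure; the function names and the grouping of sub-dimensions under dimensions are hypothetical.

```python
from typing import Dict, Iterable

MAX_SCORE = 5  # each sub-dimension is scored out of 5, per the abstract


def percent(scores: Iterable[float]) -> float:
    """Express a set of 0..5 scores as a percentage of the maximum attainable."""
    scores = list(scores)
    if not scores:
        raise ValueError("no scores to aggregate")
    return 100 * sum(scores) / (MAX_SCORE * len(scores))


def dimension_percentages(rubric: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Map each dimension to the percentage earned across its sub-dimensions.

    `rubric` is {dimension: {sub_dimension: score}}; the grouping is supplied
    by the caller because the full rubric is not reproduced on this page.
    """
    return {dim: percent(subs.values()) for dim, subs in rubric.items()}


def overall_percentage(rubric: Dict[str, Dict[str, float]]) -> float:
    """Percentage across all sub-dimensions, weighting each one equally (an assumption)."""
    return percent(score for subs in rubric.values() for score in subs.values())
```

Weighting every sub-dimension equally is the simplest choice; averaging the three dimension percentages instead would give a different overall figure whenever dimensions contain unequal numbers of sub-dimensions.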

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a concrete mapping of contexts to Big Five traits plus tones and recovery steps, with a simulated lift of about 28 percentage points, but everything rests on LLM judges scoring LLM outputs.

Hi, the main thing to know is that this work builds an explicit recovery code linking four task contexts to four Big Five personality traits, matching tones, and three-stage instructions, then shows the coded LLMs score 27.8 percentage points higher than baseline (76.7% versus 48.9%) on the same error scenarios. That combination into one learnable package is the new piece; prior papers handled personality, tone, or recovery separately.

They also supply a clear rubric with three top-level dimensions and nine sub-dimensions, which makes the comparison easy to follow. The between-subjects design using identical prompts is clean, and the numerical gap appears consistently, especially on appropriateness and explanation quality. That part is useful as a template.

The soft spot is the evaluation method. All scoring came from eight separate LLM evaluator agents with no human participants or raters at any point. LLM judges can easily prefer the more structured, personality-specified outputs produced by the coded condition, so the measured gain may not reflect what actual users would notice or value. The paper labels itself exploratory, which is accurate, but it leaves the central claim about improved recovery quality and trust on shaky ground until human data arrives.

This is for people working on conversational agent design who want a ready starting framework to adapt or test further. A reader looking for practical design ideas can extract the mappings and try them in their own setups. It is not yet strong enough for direct citation in empirical claims.

It deserves a serious referee because the framework is specific enough to critique and extend, and the simulation gives a clear signal worth checking with people. I would send it to review and ask the authors to add human validation in the next round.

Referee Report

1 major / 2 minor

Summary. The paper proposes a structured 'recovery code' framework that maps four common LLM chatbot task contexts to specific Big Five personality traits (Conscientiousness, Agreeableness, Openness, Extraversion), associated tones, and three-stage recovery instructions. Using a between-subjects design with identical error scenarios and prompts, it compares baseline (uncoded) Claude Sonnet 4.6 agents against coded agents trained on the framework; eight separate LLM evaluator agents then score recoveries on a three-dimension rubric (Recovery quality, Tone alignment, Appropriateness) with nine sub-dimensions, reporting a 27.8% average improvement (76.7% vs 48.9%) for the coded condition, particularly in appropriateness and personality alignment.

Significance. If the LLM-evaluator scores prove to be a valid proxy for human perceptions, the work offers a concrete, reusable framework for embedding personality, context, and tone into error-recovery logic—an approach that could systematically reduce trust erosion in conversational agents. The direct between-subjects comparison on matched prompts is a methodological strength, and the absence of fitted parameters or invented entities keeps the design transparent. However, the lack of any human validation means the practical significance for real users remains provisional.

major comments (1)

[Evaluation / Results] Evaluation section / Results: The central claim that coded recoveries improve quality by 27.8% rests entirely on scores produced by eight LLM evaluator agents using the authors' rubric. No human participants or raters were involved at any stage (explicitly stated as exploratory with 'no participants used'), so it is unclear whether the rubric dimensions—especially 'personality appropriateness' and 'providing explanation'—correspond to improvements that actual users would notice or value. This is load-bearing for the claim that the framework 'can be successfully learnt and applied by LLM chatbots to improve error recovery quality.'

minor comments (2)

[Abstract / Methodology] Abstract and §3 (Methodology): The phrasing 'successfully learnt and applied' is strong for an exploratory LLM-only study; a more cautious formulation would better reflect the absence of human data.
[Rubric] Rubric description: The nine sub-dimensions are listed but their exact scoring anchors (e.g., what distinguishes a 4 from a 5 on 'tone alignment') are not reproduced; including the full rubric as an appendix would aid reproducibility.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for their constructive and detailed review. We address the major comment on evaluation below, acknowledging the exploratory nature of the work and the reliance on LLM evaluators.

Referee: [Evaluation / Results] Evaluation section / Results: The central claim that coded recoveries improve quality by 27.8% rests entirely on scores produced by eight LLM evaluator agents using the authors' rubric. No human participants or raters were involved at any stage (explicitly stated as exploratory with 'no participants used'), so it is unclear whether the rubric dimensions—especially 'personality appropriateness' and 'providing explanation'—correspond to improvements that actual users would notice or value. This is load-bearing for the claim that the framework 'can be successfully learnt and applied by LLM chatbots to improve error recovery quality.'

Authors: We agree that the absence of human validation is a substantive limitation. The manuscript already states that the study is exploratory with no participants used, and the evaluation uses eight LLM agents applying our designed rubric as a proxy measure. While LLM-as-judge methods are increasingly common for initial assessment of conversational outputs, we recognize that dimensions such as personality appropriateness and explanation quality may not fully align with human perceptions. In the revised manuscript we will expand the Limitations and Future Work sections to explicitly discuss this gap, qualify the central claim to refer to LLM-evaluated improvements rather than direct user benefits, and outline planned human-subject studies to validate the rubric and framework against actual user ratings.

Revision: partial

standing simulated objections not resolved

Empirical human validation of the LLM-evaluator scores and rubric dimensions, which would require new data collection outside the current exploratory study.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper reports an empirical 27.8% score improvement from a direct between-subjects comparison of baseline LLM outputs versus outputs from LLMs given the authors' recovery code. Scores are produced by separate LLM evaluator agents applying a fixed, author-designed rubric to identical prompts and scenarios. No equations, parameter fits, self-citations, or renamings are present that would make the reported difference equivalent to the inputs by construction. The central claim rests on the observed score delta under the chosen evaluation protocol rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The paper introduces the recovery code as a new structured artifact. It assumes the Big Five model can be meaningfully applied to LLM response generation and that LLM-based rubric scoring serves as a valid proxy for human judgment of conversational quality. No numerical parameters are fitted to data.

axioms (2)

Domain assumption: The Big Five personality traits can be effectively encoded into LLM chatbot recovery behavior. Used to map contexts to specific personality traits and tones in the recovery code.
Domain assumption: LLM evaluator agents can reliably assess recovery quality, tone alignment, and appropriateness using the designed rubric. Central to the between-subjects comparison and reported performance gains.

invented entities (1)

Recovery code framework (no independent evidence). Purpose: structured instructions mapping contexts to personalities, tones, and three-stage recovery steps for LLMs. Newly designed artifact tested in the study.

pith-pipeline@v0.9.0 · 5633 in / 1620 out tokens · 73891 ms · 2026-05-08T15:57:37.311174+00:00 · methodology
