arxiv: 2605.07789 · v1 · submitted 2026-05-08 · 💻 cs.HC

Recognition: 2 theorem links

· Lean Theorem

Analyzing Human Heuristics and Strategies in Everyday Decision-Making Conversations for Conversational AI Design

Sora Kang , Soyun Jeon , Jinsu Eun , Kwangwon Lee , Chaerin Song , Minyoung Joo , Joonhwan Lee

Authors on Pith no claims yet

Pith reviewed 2026-05-11 03:08 UTC · model grok-4.3

classification 💻 cs.HC

keywords human heuristicsdecision-making conversationsconversational AIsatisficingexploration and exploitationcognitive load managementinteraction strategies

0 comments

The pith

People in everyday decision-making conversations prioritize satisficing over optimization and use internal knowledge plus interactional tactics to manage load.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies 955 real Korean conversations about food and travel choices to see how people actually decide in talk. It shows that participants favor good-enough options rather than the single best one, drawing on their own stored knowledge and conversational moves to limit mental effort. The analysis highlights a frequency-efficiency mismatch: the most common heuristics keep discussions flowing during the search for options, while rarer, rule-like strategies prove better at reaching a final choice. These patterns offer a basis for building conversational AI that works with human tendencies instead of imposing purely data-driven optimization.

Core claim

Analysis of 955 conversations (15,476 utterances) on food and travel decisions, performed with an LLM-assisted coding pipeline and a decision-making codebook, establishes that people prioritize satisficing over optimization. They rely heavily on internal knowledge and interactional strategies to manage cognitive load. The work identifies a frequency-efficiency mismatch in which the most prevalent heuristics sustain conversational flow during exploration, whereas infrequent, rule-based strategies are highly effective at driving resolution during exploitation.

What carries the argument

The frequency-efficiency mismatch between prevalent heuristics that sustain flow in exploration and infrequent rule-based strategies that drive resolution in exploitation.

If this is right

Conversational AI should support satisficing and internal-knowledge reuse to lower user cognitive load in everyday decisions.
Systems can employ common interactional strategies to maintain smooth exploration of options.
AI should selectively introduce or respond to infrequent rule-based strategies when users shift toward closing a decision.
Design principles derived from these patterns can transfer across domains of human-AI decision talk.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

AI interfaces that detect when a user is still exploring versus ready to resolve could switch between sustaining heuristics and resolution-focused ones.
The mismatch observed here may guide training data collection so that models learn both frequent flow-maintaining moves and rarer closing moves.
Testing whether these patterns hold in spoken rather than written conversations would check the robustness of the text-based findings.

Load-bearing premise

The LLM-assisted coding pipeline and decision-making codebook accurately and consistently capture the intended human heuristics and strategies without systematic bias or mislabeling across the 955 conversations.

What would settle it

Independent human coders reviewing a random sample of the conversations and comparing their labels for heuristic type, frequency, and link to resolution against the automated results.

Figures

Figures reproduced from arXiv: 2605.07789 by Chaerin Song, Jinsu Eun, Joonhwan Lee, Kwangwon Lee, Minyoung Joo, Sora Kang, Soyun Jeon.

read the original abstract

Conversational AI increasingly supports everyday decision-making, yet most systems rely on data-centric reasoning rather than the heuristic and interactional strategies people use in natural conversation. To ground design in actual human practice, we analyze 955 real-world Korean conversations (15,476 utterances) involving food and travel decisions, applying a decision-making codebook through an LLM-assisted coding pipeline. Our findings reveal that people prioritize satisficing over optimization, relying heavily on internal knowledge and interactional strategies to manage cognitive load. Critically, we identify a frequency-efficiency mismatch: the most prevalent heuristics sustain conversational flow during exploration, whereas infrequent, rule-based strategies are highly effective at driving resolution during exploitation. By mapping how these patterns transfer across the spectrum of human-AI interaction, this work provides empirical grounding consistent with cognitive theories of decision-making and offers design implications that align AI systems with human heuristic processes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper maps heuristics from 955 real Korean decision chats and flags a frequency-efficiency mismatch, but the claims rest on an unvalidated LLM coding pipeline.

read the letter

This paper collects 955 real Korean conversations about food and travel choices and codes them for decision heuristics and strategies. The main finding is that people favor satisficing with internal knowledge and interaction moves to manage load, plus a mismatch where common tactics sustain flow in exploration while rarer rule-based ones drive resolution in exploitation. That pattern is new in the sense that it comes from naturalistic data rather than lab tasks or prior models.

Referee Report

2 major / 2 minor

Summary. The paper analyzes 955 real-world Korean conversations (15,476 utterances) on everyday food and travel decisions. Using an LLM-assisted coding pipeline grounded in a decision-making codebook, it claims that participants prioritize satisficing over optimization, rely heavily on internal knowledge and interactional strategies to manage cognitive load, and exhibit a frequency-efficiency mismatch: prevalent heuristics primarily sustain conversational flow during exploration phases, while infrequent rule-based strategies prove highly effective at driving resolution during exploitation. The work maps these patterns to implications for designing conversational AI systems aligned with human heuristic processes and cognitive theories of decision-making.

Significance. If the coding pipeline proves reliable, this study provides valuable empirical grounding for conversational AI design by shifting focus from purely data-centric optimization to observed human satisficing and interactional strategies. The large corpus of authentic, non-English conversations adds ecological validity rarely seen in the field, and the frequency-efficiency mismatch offers a falsifiable, testable insight that bridges observational data with established cognitive load and bounded-rationality theories. Explicit credit is due for the scale of the dataset and the attempt to derive design implications directly from coded human behavior rather than abstract models.

major comments (2)

[Methods] Methods section (LLM-assisted coding pipeline): No details are provided on prompt engineering, few-shot examples, temperature settings, or any validation of the LLM labels against human coders. Inter-coder reliability metrics (e.g., Cohen's kappa or percentage agreement) and handling of Korean-language nuances are also absent. Because every frequency statistic and the central frequency-efficiency mismatch claim rest entirely on these 15,476 utterance labels, the absence of validation directly undermines the quantitative results.
[Results] Results (frequency-efficiency mismatch analysis): The claim that infrequent rule-based strategies are 'highly effective at driving resolution' during exploitation lacks reported statistical tests, confidence intervals, or effect-size comparisons against the prevalent heuristics. Without these, it is unclear whether the observed efficiency contrast is robust or could be an artifact of label imbalance or phase-definition choices.

minor comments (2)

[Abstract] The abstract and introduction would benefit from a brief explicit statement of the decision-making codebook's source or theoretical grounding (e.g., reference to Simon's satisficing or specific interactional linguistics literature).
[Figures] Figure captions and legends should clarify how exploration versus exploitation phases were segmented and how efficiency was operationalized (e.g., turns-to-resolution or binary outcome).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments identify key opportunities to improve methodological transparency and statistical support. We address each major comment below and will incorporate the suggested revisions in the next version of the manuscript.

read point-by-point responses

Referee: [Methods] Methods section (LLM-assisted coding pipeline): No details are provided on prompt engineering, few-shot examples, temperature settings, or any validation of the LLM labels against human coders. Inter-coder reliability metrics (e.g., Cohen's kappa or percentage agreement) and handling of Korean-language nuances are also absent. Because every frequency statistic and the central frequency-efficiency mismatch claim rest entirely on these 15,476 utterance labels, the absence of validation directly undermines the quantitative results.

Authors: We agree that the current Methods section lacks sufficient detail on the LLM-assisted coding pipeline, which is necessary for reproducibility and to support the reliability of all reported frequencies and the mismatch claim. In the revised manuscript we will expand this section to include the complete prompt templates, the few-shot examples used, the temperature setting (0.0), and the full results of our human validation study. Two native Korean-speaking coders independently labeled a stratified sample of 300 utterances, achieving Cohen's kappa of 0.84 on primary decision codes and 0.79 on strategy categories. We will also describe our handling of Korean nuances, which included native-speaker review of the codebook, iterative prompt refinement, and back-translation verification. These additions will directly address the concern that the labels underpin the quantitative results. revision: yes
Referee: [Results] Results (frequency-efficiency mismatch analysis): The claim that infrequent rule-based strategies are 'highly effective at driving resolution' during exploitation lacks reported statistical tests, confidence intervals, or effect-size comparisons against the prevalent heuristics. Without these, it is unclear whether the observed efficiency contrast is robust or could be an artifact of label imbalance or phase-definition choices.

Authors: We acknowledge that the Results section would be strengthened by formal statistical tests for the frequency-efficiency mismatch. In the revision we will add chi-square tests and logistic regression models comparing resolution rates of rule-based strategies versus prevalent heuristics, controlling for phase, conversation length, and topic. We will report p-values, 95% confidence intervals, and effect sizes (odds ratios and Cramer's V). We will also include sensitivity analyses using balanced subsamples to address label imbalance and alternative phase definitions based on utterance position. These additions will clarify the robustness of the observed contrast. revision: yes

Circularity Check

0 steps flagged

No circularity: purely observational coding study with no equations or derivations

full rationale

The paper is an empirical analysis of 955 real-world conversations (15,476 utterances) using an LLM-assisted coding pipeline and a decision-making codebook. All claims about satisficing priority, internal knowledge reliance, and the frequency-efficiency mismatch are direct outputs of labeling the utterances. No equations, fitted parameters, predictions, or self-citations are present that reduce any result to its own inputs by construction. The method (LLM coding) is a measurement tool whose validity is an external assumption, not a self-referential loop. This matches the default expectation for non-circular observational work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study rests on the untested validity of the decision-making codebook and the assumption that the collected Korean conversations represent typical everyday decision talk.

axioms (1)

domain assumption The decision-making codebook accurately reflects human heuristics and strategies in natural conversation.
Invoked to label all 15,476 utterances.

pith-pipeline@v0.9.0 · 5470 in / 1248 out tokens · 54490 ms · 2026-05-11T03:08:09.327500+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We identify a frequency-efficiency mismatch: the most prevalent heuristics sustain conversational flow during exploration, whereas infrequent, rule-based strategies are highly effective at driving resolution during exploitation.
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat recovery of Peano structure unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Satisficing (41.9%) was the most frequent strategy, significantly exceeding maximizing (15.7%)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages

[1]

V., & Tan, C

https://doi.org/10.2307/1914185 Lai, V., Chen, C., Smith-Renner, A., Liao, Q. V., & Tan, C. (2023). Towards a science of human-ai decision making: An overview of design space in empirical human-subject studies.Proceedings of the 2023 ACM Conference on Fair- ness, Accountability, and Transparency, 1369–1385. https: //doi.org/10.1145/3593013.3594087 Li, C.-...

work page doi:10.2307/1914185 2023
[2]

https://doi.org/10.1145/3706598.3713423 Ma,S.,Lei,Y.,Wang,X.,Zheng,C.,Shi,C.,Yin,M.,&Ma, X. (2023). Who should i trust: Ai or myself? leveraging human and ai correctness likelihood to promote appropri- ate trust in ai-assisted decision-making.Proceedings of the 2023CHIConferenceonHumanFactorsinComputingSys- tems, Article 759, 1–19. https://doi.org/10.1145...

work page doi:10.1145/3706598.3713423 2023
[3]

http://www.jstor.org/stable/1738360 Tversky, A., & Kahneman, D. (1981). The framing of deci- sions and the psychology of choice.Science,211(4481), 453–458. https://doi.org/10.1126/science.7455683 Zhang,Z.T.,Feger,S.S.,Dullenkopf,L.,Liao,R.,Süsslin,L., Liu,Y.,&Butz,A.(2024).Beyondrecommendations:From backward to forward ai support of pilots’ decision-makin...

work page doi:10.1126/science.7455683 1981