SkillDroid compiles LLM-guided GUI trajectories into parameterized skill templates and replays them via a matching cascade, reaching 85.3% success rate with 49% fewer LLM calls and improving from 87% to 91% over 150 rounds while the stateless baseline drops to 44%.
8 Pith papers cite this work.
citing papers explorer
- SkillDroid: Compile Once, Reuse Forever
SkillDroid compiles LLM-guided GUI trajectories into parameterized skill templates and replays them via a matching cascade, reaching 85.3% success rate with 49% fewer LLM calls and improving from 87% to 91% over 150 rounds while the stateless baseline drops to 44%.
- An Eye for Trust: An Exploration of Developers' Trust Perceptions Through Urgency and Reputation
A controlled eye-tracking study finds that code priority affects review time, cognitive load, and perceived quality but not reuse decisions, while author reputation changes visual attention patterns without altering performance or reuse choices.
- Beyond the AI Tutor: Social Learning with LLM Agents
Two controlled experiments show multi-agent LLM configurations with both tutors and peers deliver higher learning gains and less homogeneous outputs than single-LLM tutoring in math problem-solving and essay writing.
- When Should Users Check? Modeling Confirmation Frequency in Multi-Step Agentic AI Tasks
A decision-theoretic model based on the observed Confirmation-Diagnosis-Correction-Redo user pattern places intermediate confirmations in AI agent tasks, yielding 81% user preference and 13.54% faster completion versus confirm-at-end.
- Towards Trust Calibration in Socially Interactive Agents: Investigating Gendered Multimodal Behaviors Generation with LLMs
LLMs can generate coherent multimodal behaviors for SIAs that align with intended ability and benevolence levels as confirmed by user perceptions, while also reproducing gender stereotypes.
- Auditing and Controlling AI Agent Actions in Spreadsheets
Pista decomposes AI agent actions in spreadsheets into auditable steps, enabling real-time user intervention that improves task outcomes, user comprehension, agent perception, and sense of co-ownership over baseline agents.
- Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research
AVA is a specialized GenAI platform for development policy research that provides verifiable syntheses from World Bank reports and is associated with 2.4-3.9 hours of weekly time savings in a large-scale user evaluation.
- How Much Trust is Enough? Towards Calibrating Trust in Technology
An empirical study creates guidelines for interpreting the Human-Computer Trust Scale as a starting point for assessing trust propensity in technology interactions, while stressing the need for contextual reflection.