pith. machine review for the scientific record.

Towards a Science of AI Agent Reliability

6 Pith papers cite this work. Polarity classification of these citations is still being indexed.


Citing papers by year: 2026 (6)

Verdicts: unverdicted (6)

representative citing papers

MarketBench: Evaluating AI Agents as Market Participants

cs.AI · 2026-04-26 · unverdicted · novelty 6.0

LLMs are poorly calibrated when predicting task success and token usage on software engineering benchmarks. As a result, market auctions underperform relative to perfect-information scenarios, and adding context yields only limited improvement.
