Review history

arXiv: 2606.25819 · 2 revisions

Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability

  1. 2026-06-30 UNVERDICTED LOW v0.9.1-grok novelty 7.0
    40775 ms 5773 in 1356 out 2026-06-30T09:52:55.411240+00:00
  2. 2026-06-25 UNVERDICTED LOW v0.9.1-grok novelty 7.0
    25252 ms 5770 in 1138 out 2026-06-25T20:48:42.911178+00:00