AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
7 Pith papers cite this work.
representative citing papers
LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.
BEIR is a heterogeneous zero-shot benchmark showing BM25 as a robust baseline while re-ranking and late-interaction models perform best on average at higher cost, with dense and sparse models lagging in generalization.
τ-bench shows state-of-the-art agents like GPT-4o succeed on under 50% of tool-using, rule-following tasks and are inconsistent across repeated trials.
Adding the fixed prompt 'Let's think step by step' enables large language models to achieve substantial zero-shot gains on arithmetic, symbolic, and logical reasoning benchmarks without any task-specific examples.
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
HPD v2 is the largest human preference dataset for text-to-image generation, with 798K choices, and HPS v2 is the resulting CLIP-based scorer that better predicts human judgments and responds to model improvements.
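The zero-shot reasoning trick summarized above is simple enough to sketch: the fixed trigger phrase is appended to the question before it is sent to the model. This is a minimal illustration, not code from the paper; the helper name and example question are hypothetical.

```python
def zero_shot_cot_prompt(question: str) -> str:
    """Build a zero-shot chain-of-thought prompt (hypothetical helper).

    The technique appends a fixed trigger phrase so the model produces
    intermediate reasoning before its final answer, with no task-specific
    examples in the prompt.
    """
    return f"Q: {question}\nA: Let's think step by step."

# Example usage with a made-up arithmetic question.
prompt = zero_shot_cot_prompt("A juggler has 16 balls and half are golf balls. How many golf balls are there?")
print(prompt)
```

In practice the model's reasoning output is then fed back with a second prompt (e.g. "Therefore, the answer is") to extract the final answer.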
citing papers explorer
-
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
-
τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains