Title resolution pending

tinybenchmarks: evaluating llms with fewer examples · 2023 · arXiv 2402.14992

15 Pith papers cite this work. Polarity classification is still indexing.

15 Pith papers citing it

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Putting HUMANS first: Efficient LAM Evaluation with Human Preference Alignment

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

Curated 50-example subsets of LAM benchmarks, via regression, predict human preferences at 0.98 correlation, outperforming the full benchmark and yielding the open-sourced HUMANS proxy.

Activation Steering with a Feedback Controller

cs.LG · 2025-10-05 · unverdicted · novelty 7.0

Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.

Refusal in Language Models Is Mediated by a Single Direction

cs.LG · 2024-06-17 · accept · novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

Minimizing Collateral Damage in Activation Steering

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lower cost.

Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees

cs.AI · 2026-04-13 · unverdicted · novelty 6.0

POES frames prompt evaluation as online adaptive testing and uses a provably submodular objective to pick informative examples, delivering 6.2% higher average accuracy and 35-60% token savings versus naive full-set scoring.

Efficient Evaluation of LLM Performance with Statistical Guarantees

stat.ML · 2026-01-28 · unverdicted · novelty 6.0

Factorized Active Querying (FAQ) provides up to 5 times more effective samples for LLM accuracy estimation by using Bayesian factor models and adaptive querying under a fixed budget with guaranteed coverage.

LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users

cs.CL · 2025-07-03 · unverdicted · novelty 6.0

A single attacker can use strategic upvoting and downvoting on language model outputs to inject facts, security flaws, or fake news that persist in the model for all users after preference tuning.

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

cs.CL · 2024-07-17 · unverdicted · novelty 6.0

LMMS-EVAL delivers a standardized multimodal evaluation framework with lite and live variants that target the trade-offs among coverage, cost, and zero contamination.

Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks

cs.CL · 2025-09-29 · conditional · novelty 5.0

MedIRT applies Item Response Theory to medical LLM benchmarks to separate latent competency from item difficulty and discrimination, producing more stable rankings and revealing domain heterogeneity than accuracy alone.

Small Language Models are the Future of Agentic AI

cs.AI · 2025-06-02 · unverdicted · novelty 5.0

Small language models are sufficiently capable, more suitable, and far more economical than large models for the repetitive tasks that dominate agentic AI systems.

LLM-Safety Evaluations Lack Robustness

cs.CR · 2025-03-04 · unverdicted · novelty 4.0

LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.

Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation

cs.CL · 2025-02-13 · unverdicted · novelty 4.0

A hierarchical statistical model demonstrates that multiple LLM generations per prompt improve benchmark score accuracy, reduce variance, and enable prompt-level difficulty scoring via correct ratios.

Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

cs.AI · 2026-05-07

ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

cs.LG · 2026-04-25

citing papers explorer

Showing 15 of 15 citing papers.

Putting HUMANS first: Efficient LAM Evaluation with Human Preference Alignment cs.CL · 2026-04-20 · unverdicted · none · ref 4
Curated 50-example subsets of LAM benchmarks, via regression, predict human preferences at 0.98 correlation, outperforming the full benchmark and yielding the open-sourced HUMANS proxy.
Activation Steering with a Feedback Controller cs.LG · 2025-10-05 · unverdicted · none · ref 19
Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.
Refusal in Language Models Is Mediated by a Single Direction cs.LG · 2024-06-17 · accept · none · ref 173
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
Minimizing Collateral Damage in Activation Steering cs.LG · 2026-05-01 · unverdicted · none · ref 25
Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.
ClawEnvKit: Automatic Environment Generation for Claw-Like Agents cs.AI · 2026-04-20 · unverdicted · none · ref 125
ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lower cost.
Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees cs.AI · 2026-04-13 · unverdicted · none · ref 21
POES frames prompt evaluation as online adaptive testing and uses a provably submodular objective to pick informative examples, delivering 6.2% higher average accuracy and 35-60% token savings versus naive full-set scoring.
Efficient Evaluation of LLM Performance with Statistical Guarantees stat.ML · 2026-01-28 · unverdicted · none · ref 15
Factorized Active Querying (FAQ) provides up to 5 times more effective samples for LLM accuracy estimation by using Bayesian factor models and adaptive querying under a fixed budget with guaranteed coverage.
LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users cs.CL · 2025-07-03 · unverdicted · none · ref 26
A single attacker can use strategic upvoting and downvoting on language model outputs to inject facts, security flaws, or fake news that persist in the model for all users after preference tuning.
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models cs.CL · 2024-07-17 · unverdicted · none · ref 7
LMMS-EVAL delivers a standardized multimodal evaluation framework with lite and live variants that target the trade-offs among coverage, cost, and zero contamination.
Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks cs.CL · 2025-09-29 · conditional · none · ref 21
MedIRT applies Item Response Theory to medical LLM benchmarks to separate latent competency from item difficulty and discrimination, producing more stable rankings and revealing domain heterogeneity than accuracy alone.
Small Language Models are the Future of Agentic AI cs.AI · 2025-06-02 · unverdicted · none · ref 64
Small language models are sufficiently capable, more suitable, and far more economical than large models for the repetitive tasks that dominate agentic AI systems.
LLM-Safety Evaluations Lack Robustness cs.CR · 2025-03-04 · unverdicted · none · ref 43
LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.
Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation cs.CL · 2025-02-13 · unverdicted · none · ref 2
A hierarchical statistical model demonstrates that multiple LLM generations per prompt improve benchmark score accuracy, reduce variance, and enable prompt-level difficulty scoring via correct ratios.
Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models cs.AI · 2026-05-07 · unreviewed · ref 14
ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation cs.LG · 2026-04-25 · unreviewed · ref 59

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer