InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
Don ' t hallucinate, abstain: Identifying LLM knowledge gaps via multi- LLM collaboration
3 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 3representative citing papers
A distribution-free abstention rule grounded in multiple hypothesis testing uses execution consistency to let code LLMs avoid hallucination-prone tasks with theoretical guarantees.
Adapts multi-layer token-level Mahalanobis distance with supervised linear regression to yield improved uncertainty scores for LLM truthfulness tasks.
citing papers explorer
-
InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis
InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
-
Task Abstention for Large Language Models in Code Generation
A distribution-free abstention rule grounded in multiple hypothesis testing uses execution consistency to let code LLMs avoid hallucination-prone tasks with theoretical guarantees.
-
Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models
Adapts multi-layer token-level Mahalanobis distance with supervised linear regression to yield improved uncertainty scores for LLM truthfulness tasks.