11 COFT: Counterfactual-Conformal Decoding for Fair Chain-of-Thought Reasoning in Large Language Models Nangia, N., Vania, C., Bhalerao, R., and Bowman, S
17 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 1
citation-polarity summary: background 1
citing papers explorer
- GAIA: a benchmark for General AI Assistants
  GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
  DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.
- Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding
  EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.
- SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization
  SCURank ranks multiple summary candidates with Summary Content Units to outperform ROUGE and LLM-based methods in summarization distillation.
- Progress Ratio Embeddings: An Impatience Signal for Robust Length Control in Neural Text Generation
  Progress Ratio Embeddings use a trigonometric progress-ratio signal to deliver stable length control in transformers that generalizes to unseen target lengths.
- Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
  Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.
- LLM Evaluators Recognize and Favor Their Own Generations
  LLMs show measurable self-recognition that linearly correlates with self-preference bias in evaluations, supported by fine-tuning experiments and controls for confounders.
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
  Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.
- Self-Recognition Finetuning can Prevent and Reverse Emergent Misalignment
  Self-generated text recognition finetuning prevents and reverses emergent misalignment across multiple models by fortifying aligned character, unlike other finetuning baselines.
- COFT: Counterfactual-Conformal Decoding for Fair Chain-of-Thought Reasoning in Large Language Models
  COFT is a decoding technique that creates masked counterfactual prompts, fuses logits to attenuate bias, and applies dual-branch split-conformal calibration to certify fair token sets with marginal validity guarantees under exchangeability.
- Enhancing Factuality through Consensus and Consistency in Summarization Using Minimum Bayes Risk Decoding
  ConSUM reranks candidate summaries using MBR consensus and source-consistency metrics to improve factuality over standard generation or reranking baselines.
- Generating Query-Focused Summarization Datasets from Query-Free Summarization Datasets
  An evidence-based model generates queries from query-free datasets, yielding summaries whose ROUGE scores are competitive with those obtained using the original queries.
- ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios
  ECHO uses sparse gating and elastic budget pivoting in a super-tree structure to achieve up to 5.35x speedup for LLM inference under high concurrency.
- LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation
  LogitSpec accelerates retrieval-based speculative decoding by speculating the next-next token from the last logit and retrieving relevant references for both next and next-next tokens, reporting up to 2.61x speedup and 3.28 mean accepted tokens.
- Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
  POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.
- AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding
  AdaPLD adaptively mixes lexical and semantic retrieval with branched reuse to improve model-free speculative decoding and reports up to 3.10x speedup across benchmarks.
- Evaluation of ML Resource Utilization Requires Model Life Cycle Assessment
  The paper calls for life cycle assessment to capture embodied hardware costs and full pipeline operational costs in AI development and deployment.