hub

Automatic Exploration of Textual Environments with Language-Conditioned Autotelic Agents

Teodorescu, Laetitia, Yuan, Xingdi, Oudeyer, Pierre-Yves · 2022 · DOI 10.18653/v1/2022.wordplay-1.5

12 Pith papers cite this work. Polarity classification is still indexing.

12 Pith papers citing it

open at publisher browse 12 citing papers

hub tools

JSON dossier citing papers JSON publisher DOI

representative citing papers

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning

cs.CL · 2026-06-05 · unverdicted · novelty 7.0

Prefix gain measured via student-model solve-rate improvement is used to train a Prefix Utility Model (PUM) that supplies stronger supervision than correctness-based process rewards for mathematical reasoning.

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

cs.CL · 2024-04-29 · conditional · novelty 7.0

A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.

MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries

cs.CL · 2024-01-27 · accept · novelty 7.0

MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.

Decisive: Guiding User Decisions with Optimal Preference Elicitation from Unstructured Documents

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

Decisive combines document-grounded option scoring with adaptive Bayesian preference elicitation to achieve up to 20% higher decision accuracy than LLMs and existing frameworks across domains.

Steering Llama 2 via Contrastive Activation Addition

cs.CL · 2023-12-09 · unverdicted · novelty 6.0

Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.

ART: Automatic multi-step reasoning and tool-use for large language models

cs.CL · 2023-03-16 · unverdicted · novelty 6.0

ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.

ProbeScale: Probing Analysis to Optimize Neural Scaling Laws for Efficient Small Language Model Inference

cs.CL · 2026-06-01 · unverdicted · novelty 5.0

ProbScale finds layer subsets in SLMs like RoBERTa-Large and T5-Base that cut parameters 5-10x while retaining 95-98% of original task performance by maximizing aggregated probe scores under a budget.

AppAgent: Multimodal Agents as Smartphone Users

cs.CV · 2023-12-21 · unverdicted · novelty 5.0

AppAgent lets large language models operate diverse smartphone apps via visual interactions and learns app usage from exploration or demonstrations.

Beyond Logical Forms: LLM-Extracted Patterns for Fallacy Classification

cs.CL · 2026-06-25 · unverdicted · novelty 4.0

LLM-extracted patterns merging logical structures and linguistic cues yield statistically significant gains in fallacy classification over zero-shot baselines with cross-dataset generalization.

MMoA: An AI-Agent framework with recurrence for Memoried Mixure-of-Agent

cs.CL · 2026-05-18 · unverdicted · novelty 3.0

MMoA adds LSTM recurrence to Mixture-of-Agents routing, reaching 58.0% win rate on AlpacaEval 2.0 versus 59.8% for baseline MoA while cutting runtime by up to 4.6%.

Optimizing Abstractive Summarization With Fine-Tuned PEGASUS

cs.CL · 2026-06-24 · unverdicted · novelty 2.0

Fine-tuned PEGASUS achieves state-of-the-art ROUGE scores on XL-Sum English corpus with 4.04% ROUGE-1, 15.25% ROUGE-2, and 3.39% ROUGE-L gains over mT5 baseline.

citing papers explorer

Showing 12 of 12 citing papers.

ORPO: Monolithic Preference Optimization without Reference Model cs.CL · 2024-03-12 · conditional · none · ref 147
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning cs.CL · 2026-06-05 · unverdicted · none · ref 35
Prefix gain measured via student-model solve-rate improvement is used to train a Prefix Utility Model (PUM) that supplies stronger supervision than correctness-based process rewards for mathematical reasoning.
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models cs.CL · 2024-04-29 · conditional · none · ref 64
A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.
MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries cs.CL · 2024-01-27 · accept · none · ref 63
MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.
Decisive: Guiding User Decisions with Optimal Preference Elicitation from Unstructured Documents cs.CL · 2026-04-20 · unverdicted · none · ref 57
Decisive combines document-grounded option scoring with adaptive Bayesian preference elicitation to achieve up to 20% higher decision accuracy than LLMs and existing frameworks across domains.
Steering Llama 2 via Contrastive Activation Addition cs.CL · 2023-12-09 · unverdicted · none · ref 70
Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.
ART: Automatic multi-step reasoning and tool-use for large language models cs.CL · 2023-03-16 · unverdicted · none · ref 181
ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.
ProbeScale: Probing Analysis to Optimize Neural Scaling Laws for Efficient Small Language Model Inference cs.CL · 2026-06-01 · unverdicted · none · ref 36
ProbScale finds layer subsets in SLMs like RoBERTa-Large and T5-Base that cut parameters 5-10x while retaining 95-98% of original task performance by maximizing aggregated probe scores under a budget.
AppAgent: Multimodal Agents as Smartphone Users cs.CV · 2023-12-21 · unverdicted · none · ref 94
AppAgent lets large language models operate diverse smartphone apps via visual interactions and learns app usage from exploration or demonstrations.
Beyond Logical Forms: LLM-Extracted Patterns for Fallacy Classification cs.CL · 2026-06-25 · unverdicted · none · ref 45
LLM-extracted patterns merging logical structures and linguistic cues yield statistically significant gains in fallacy classification over zero-shot baselines with cross-dataset generalization.
MMoA: An AI-Agent framework with recurrence for Memoried Mixure-of-Agent cs.CL · 2026-05-18 · unverdicted · none · ref 18
MMoA adds LSTM recurrence to Mixture-of-Agents routing, reaching 58.0% win rate on AlpacaEval 2.0 versus 59.8% for baseline MoA while cutting runtime by up to 4.6%.
Optimizing Abstractive Summarization With Fine-Tuned PEGASUS cs.CL · 2026-06-24 · unverdicted · none · ref 32
Fine-tuned PEGASUS achieves state-of-the-art ROUGE scores on XL-Sum English corpus with 4.04% ROUGE-1, 15.25% ROUGE-2, and 3.39% ROUGE-L gains over mT5 baseline.

Automatic Exploration of Textual Environments with Language-Conditioned Autotelic Agents

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer