hub

Multi-if: Benchmarking llms on multi-turn and multilingual instructions following

Yun He, Di Jin, Chaoqi Wang, Chloe Bi, Karishma Mandyam, Hejia Zhang, Chen Zhu, Ning Li, Tengyu Xu, Hongjiang Lv, et al · 2024 · arXiv 2410.15553

12 Pith papers cite this work. Polarity classification is still indexing.

12 Pith papers citing it

read on arXiv browse 12 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2 dataset 2

citation-polarity summary

background 2 use dataset 2

representative citing papers

SAGE: A Service Agent Graph-guided Evaluation Benchmark

cs.AI · 2026-04-10 · unverdicted · novelty 7.0

SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 models in 6 scenarios.

When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

cs.AI · 2026-05-13 · unverdicted · novelty 6.0

Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.

SEIF: Self-Evolving Reinforcement Learning for Instruction Following

cs.CL · 2026-05-08 · conditional · novelty 6.0

SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.

Alignment has a Fantasia Problem

cs.AI · 2026-04-23 · unverdicted · novelty 6.0

AI alignment must move beyond assuming users have fully formed goals and instead provide active cognitive support to help form and refine intent over time.

Efficient Evaluation of LLM Performance with Statistical Guarantees

stat.ML · 2026-01-28 · unverdicted · novelty 6.0

Factorized Active Querying (FAQ) provides up to 5 times more effective samples for LLM accuracy estimation by using Bayesian factor models and adaptive querying under a fixed budget with guaranteed coverage.

Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

cs.CL · 2025-11-25 · unverdicted · novelty 6.0

Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.

Qwen3-Omni Technical Report

cs.CL · 2025-09-22 · unverdicted · novelty 6.0 · 2 refs

Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.

ImgEdit: A Unified Image Editing Dataset and Benchmark

cs.CV · 2025-05-26 · conditional · novelty 6.0

ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.

Process Reinforcement through Implicit Rewards

cs.LG · 2025-02-03 · conditional · novelty 6.0

PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.

Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

cs.CL · 2026-05-08 · conditional · novelty 5.0 · 2 refs

EngGPT2MoE-16B-A3B matches or exceeds other Italian open-source LLMs on most international benchmarks while remaining competitive on ITALIC, though it trails some top international models.

Qwen3 Technical Report

cs.CL · 2025-05-14 · unverdicted · novelty 5.0

Pith review generated a malformed one-line summary.

Hy-MT2: A Family of Fast, Efficient and Powerful Multilingual Translation Models in the Wild

cs.CL · 2026-05-21

citing papers explorer

Showing 1 of 1 citing paper after filters.

Efficient Evaluation of LLM Performance with Statistical Guarantees stat.ML · 2026-01-28 · unverdicted · none · ref 10
Factorized Active Querying (FAQ) provides up to 5 times more effective samples for LLM accuracy estimation by using Bayesian factor models and adaptive querying under a fixed budget with guaranteed coverage.

Multi-if: Benchmarking llms on multi-turn and multilingual instructions following

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer