hub

Multi-if: Benchmarking llms on multi-turn and multilingual instructions following

Multi-if: Benchmarking llms on multi-turn, multilingual instructions following , author= · 2024 · arXiv 2410.15553

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

read on arXiv browse 10 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

SAGE: A Service Agent Graph-guided Evaluation Benchmark

cs.AI · 2026-04-10 · unverdicted · novelty 7.0

SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 models in 6 scenarios.

Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

cs.CL · 2025-11-25 · unverdicted · novelty 7.0

Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

cs.AI · 2026-05-13 · unverdicted · novelty 6.0

Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.

SEIF: Self-Evolving Reinforcement Learning for Instruction Following

cs.CL · 2026-05-08 · conditional · novelty 6.0

SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.

Alignment has a Fantasia Problem

cs.AI · 2026-04-23 · unverdicted · novelty 6.0

AI alignment must move beyond assuming users have fully formed goals and instead provide active cognitive support to help form and refine intent over time.

Qwen3-Omni Technical Report

cs.CL · 2025-09-22 · unverdicted · novelty 6.0 · 2 refs

Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.

ImgEdit: A Unified Image Editing Dataset and Benchmark

cs.CV · 2025-05-26 · conditional · novelty 6.0

ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.

Process Reinforcement through Implicit Rewards

cs.LG · 2025-02-03 · conditional · novelty 6.0

PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.

Qwen3 Technical Report

cs.CL · 2025-05-14 · unverdicted · novelty 5.0

Pith review generated a malformed one-line summary.

Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

cs.CL · 2026-05-08 · unverdicted · novelty 3.0

EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.

citing papers explorer

Showing 10 of 10 citing papers.

SAGE: A Service Agent Graph-guided Evaluation Benchmark cs.AI · 2026-04-10 · unverdicted · none · ref 16
SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 models in 6 scenarios.
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory cs.CL · 2025-11-25 · unverdicted · none · ref 130
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction cs.AI · 2026-05-13 · unverdicted · none · ref 2
Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
SEIF: Self-Evolving Reinforcement Learning for Instruction Following cs.CL · 2026-05-08 · conditional · none · ref 11
SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.
Alignment has a Fantasia Problem cs.AI · 2026-04-23 · unverdicted · none · ref 68
AI alignment must move beyond assuming users have fully formed goals and instead provide active cognitive support to help form and refine intent over time.
Qwen3-Omni Technical Report cs.CL · 2025-09-22 · unverdicted · none · ref 14 · 2 links
Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.
ImgEdit: A Unified Image Editing Dataset and Benchmark cs.CV · 2025-05-26 · conditional · none · ref 23
ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.
Process Reinforcement through Implicit Rewards cs.LG · 2025-02-03 · conditional · none · ref 125
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
Qwen3 Technical Report cs.CL · 2025-05-14 · unverdicted · none · ref 17
Pith review generated a malformed one-line summary.
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs cs.CL · 2026-05-08 · unverdicted · none · ref 70
EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.

Multi-if: Benchmarking llms on multi-turn and multilingual instructions following

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer