REDACT is a new systematically controlled multilingual PII detection benchmark with 51 entity types, sensitivity-tier metadata, and stratified evaluation revealing that rule-based detectors fail on high-stakes data while LLM detectors are more robust.
Sygra: A unified graph-based framework for scalable generation, quality tagging, and management of synthetic data
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 1
citation-polarity summary
years: 2026 3
verdicts: UNVERDICTED 3
roles: background 1
polarities: background 1
representative citing papers
In configurable enterprise systems, runtime discovery of transition dynamics from system configuration is more robust to deployment shifts than offline-trained world models.
citing papers explorer
- EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
EVA-Bench supplies a simulation engine for bot-to-bot voice dialogues plus two composite metrics (EVA-A for accuracy, EVA-X for experience) evaluated on 213 enterprise scenarios, showing no tested system exceeds 0.5 on both pass@1 scores.