REDACT is a new systematically controlled multilingual PII detection benchmark with 51 entity types, sensitivity-tier metadata, and stratified evaluation revealing that rule-based detectors fail on high-stakes data while LLM detectors are more robust.
Sygra: A unified graph-based framework for scalable generation, quality tagging, and management of synthetic data
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 3verdicts
UNVERDICTED 3roles
background 1polarities
background 1representative citing papers
EVA-Bench supplies a simulation engine for bot-to-bot voice dialogues plus two composite metrics (EVA-A for accuracy, EVA-X for experience) evaluated on 213 enterprise scenarios, showing no tested system exceeds 0.5 on both pass@1 scores.
In configurable enterprise systems, runtime discovery of transition dynamics from system configuration is more robust to deployment shifts than offline-trained world models.
citing papers explorer
-
REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection
REDACT is a new systematically controlled multilingual PII detection benchmark with 51 entity types, sensitivity-tier metadata, and stratified evaluation revealing that rule-based detectors fail on high-stakes data while LLM detectors are more robust.
-
EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
EVA-Bench supplies a simulation engine for bot-to-bot voice dialogues plus two composite metrics (EVA-A for accuracy, EVA-X for experience) evaluated on 213 enterprise scenarios, showing no tested system exceeds 0.5 on both pass@1 scores.
-
Do Enterprise Systems Need Learned World Models? The Importance of Context to Infer Dynamics
In configurable enterprise systems, runtime discovery of transition dynamics from system configuration is more robust to deployment shifts than offline-trained world models.