FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models
3 Pith papers cite this work. Polarity classification is still indexing.
citation summary
years: 2026 (3)
roles: background (1)
polarities: background (1)
representative citing papers
SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.
MixRea benchmark reveals LLMs achieve at most 42.8% consistency on explicit-implicit reasoning tasks, with PRCP prompting proposed to recover overlooked relations.
citing papers explorer
- ActPlane: Programmable OS-Level Policy Enforcement for Agent Harnesses
  ActPlane enforces agent-declared policies at OS level using IFC DSL and eBPF, improving compliance on indirect paths with 1.9-8.4% overhead.
- The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models
  SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.
- MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models
  MixRea benchmark reveals LLMs achieve at most 42.8% consistency on explicit-implicit reasoning tasks, with PRCP prompting proposed to recover overlooked relations.