PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.
arXiv preprint arXiv:2503.02863 (2025)
4 Pith papers cite this work.
Years: 2026 (4)
Verdicts: unverdicted (4)
Representative citing papers
CaliDist calibrates LLMs by scaling confidence according to how much predictions change under semantic distractors, cutting average expected calibration error (ECE) from 23% to 7% on seven NLU benchmarks across six models.
Introduces the Zoom-then-Diagnose paradigm and an uncertainty-aware reward in GRPO for confidence-aware ultrasound VQA, reporting a 39.3% improvement in lesion localization across liver, breast, and thyroid datasets.
CoMet decomposes MLLM uncertainty into context-specific and multiplicity-specific terms estimated by a trained post-hoc module, improving performance on open-ended multimodal benchmarks and hallucination detection.