Replicability analysis for natural language processing: Testing significance with multiple datasets
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CL (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
citing papers explorer
- The Harder Text Embedding Benchmark (HTEB): Beyond One-dimensional Static Robustness
  HTEB introduces dynamic, multi-axis evaluation of text embedding robustness using LLM transformations, finding decoupled profiles across models and that scaling does not close all robustness gaps.
- "I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise
  Decan (D_Ca_n = C × a_n) measures text diversity as progressive conditional surprise from base LM log-probabilities, scoring 0.846 OCA on McDiv benchmark and detecting monotonic diversity drop across base→SFT→DPO→RLVR stages.