Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.
Title resolution pending
11 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
A 1.88-million-article biomedical summarization dataset is released and quality-aware selection of training data based on abstract alignment outperforms random sampling on factuality metrics.
A systematic analysis of 284 manually reviewed papers plus 1.8k+ others from 2023-2025 reveals under-reporting of human evaluation study design details, creating ambiguity in what was measured and how.
Introduces the MEA benchmark for multi-target cross-lingual summarization across 24 languages and demonstrates that activation steering from English summarization representations improves performance.
LIFT pairs a pre-trained LLM for initial table extraction with a fine-tuned SLM for error repair, matching end-to-end SLM fine-tuning on TEDS while needing only 1,000 examples and gaining robustness.
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.
SCURank ranks multiple summary candidates with Summary Content Units to outperform ROUGE and LLM-based methods in summarization distillation.
ConSUM reranks candidate summaries using MBR consensus and source-consistency metrics to improve factuality over standard generation or reranking baselines.
Multi-agent debate with tit-for-tat arguments and a judge LLM improves reasoning by preventing LLMs from locking into incorrect initial solutions.
citing papers explorer
-
Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild
Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.
-
Less is More: Quality-Aware Training Data Selection for Scientific Summarization
A 1.88-million-article biomedical summarization dataset is released and quality-aware selection of training data based on abstract alignment outperforms random sampling on factuality metrics.
-
Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation
A systematic analysis of 284 manually reviewed papers plus 1.8k+ others from 2023-2025 reveals under-reporting of human evaluation study design details, creating ambiguity in what was measured and how.
-
Understanding LLM Behavior in Multi-Target Cross-Lingual Summarization
Introduces the MEA benchmark for multi-target cross-lingual summarization across 24 languages and demonstrates that activation steering from English summarization representations improves performance.
-
LIFT: Last-Mile Fine-Tuning for Table Explicitation
LIFT pairs a pre-trained LLM for initial table extraction with a fine-tuned SLM for error repair, matching end-to-end SLM fine-tuning on TEDS while needing only 1,000 examples and gaining robustness.
-
Search Your Block Floating Point Scales!
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
-
Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives
A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.
-
SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization
SCURank ranks multiple summary candidates with Summary Content Units to outperform ROUGE and LLM-based methods in summarization distillation.
-
Enhancing Factuality through Consensus and Consistency in Summarization Using Minimum Bayes Risk Decoding
ConSUM reranks candidate summaries using MBR consensus and source-consistency metrics to improve factuality over standard generation or reranking baselines.
-
Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
Multi-agent debate with tit-for-tat arguments and a judge LLM improves reasoning by preventing LLMs from locking into incorrect initial solutions.
- A Recipe for Long-Context Reasoning in Large Language Models via On-Policy Optimization and Distillation