FlowEval evaluates generated UIs by measuring how closely their navigation flows match real websites via reference-based similarity metrics and shows strong correlation with human expert judgments.
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
6 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (6)
Verdicts: UNVERDICTED (6)
citing papers explorer
- FlowEval: Reference-based Evaluation of Generated User Interfaces
  FlowEval evaluates generated UIs by measuring how closely their navigation flows match real websites via reference-based similarity metrics and shows strong correlation with human expert judgments.
- Evaluating Multi-turn Human-AI Interaction
  Introduces the TCR framework to evaluate educational LLM assistants on transparency, consistency, and refinement in multi-turn interactions, complementing aggregate metrics.
- Optimal Transport for LLM Reward Modeling from Noisy Preference
  SelectiveRM applies optimal transport with a joint consistency discrepancy and partial mass relaxation to produce reward models that optimize a tighter upper bound on clean risk while autonomously dropping noisy preference samples.
- From Fallback to Frontline: When Can LLMs be Superior Annotators of Human Perspectives?
  LLMs can be statistically superior to humans at estimating group-level judgments on subjective tasks because of their low variance and decoupled representation-processing biases.
- StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
  StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
- Text-Graph Synergy: A Bidirectional Verification and Completion Framework for RAG
  TGS-RAG adds graph-to-text re-ranking with global voting and text-to-graph orphan path bridging to improve precision and efficiency in multi-hop RAG over prior baselines.