Across 30 LLMs and 205 TLA+ tasks, syntactic correctness reaches at most 26.6% and semantic correctness 8.6%, with all successes limited to progressive prompting and no advantage from larger models.
Mixed citations
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics , pages =
Mixed citation behavior. Most common role is background (62%).
citation-role summary
citation-polarity summary
co-cited works
representative citing papers
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.
RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.
Introduces SolidityBench benchmark and SolidityScore metric for repository-level Solidity code generation, finding supervised fine-tuning outperforms prompting, CoT, ICL, and RAG methods on evaluated LLMs.
Analysis of 14,727 security and privacy prompts from WildChat finds commercial LLMs give higher-quality responses than open-weight models but can produce inconsistent answers across repeated queries.
The paper releases Structured PubMed: 23.2 million harmonized, section-labeled biomedical abstracts (5.9M author-structured + 17.2M LLM-labeled) mapped to PubMed IDs for training and benchmarking.
A cycle-consistent MT pipeline generates and similarity-weights training data for coreference resolution, producing gains on four low-resource languages and enabling the task where no corpora existed.
Stateful visual encoders condition each visual representation on prior features, yielding consistent gains on multi-image tasks under supervised finetuning across model sizes and domains.
ClinicalMC is a benchmark of 1,275 Chinese and 5,804 English multi-course clinical samples across four stages, evaluated via a multi-agent framework on closed-source, open-source, and medical LLMs in static and dynamic settings.
AutoMedBench evaluates AI agents on long-horizon medical workflows across five stages and finds validation and submission as dominant failure points based on thousands of runs.
Brain-IT-VQA decodes visual question answers from fMRI using a transformer to extract language tokens and introduces the NSD-VQA benchmark with 20 controlled questions per image across 20 categories.
TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding execution 44.64%.
Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.
PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.
English print media coverage of human-elephant conflicts in India is dominated by fear-inducing and aggression-related language.
ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.
LQM introduces a six-level linguistically motivated error taxonomy for MT evaluation and applies it via expert annotation to LLM outputs on a new 3,850-sentence multi-dialect Arabic corpus.
MultiLogBench shows that LLM performance on automated logging varies substantially across programming languages, demonstrating that single-language evidence is insufficient for general claims about model behavior or tool design.
AsymmetryZero operationalizes expert preferences as stable evaluation contracts for semantic evals, with a study showing 75.9-89.6% criterion agreement between frontier and compact model juries at 4-5% of the cost.
CWCD improves structured chest X-ray report generation by using category-wise contrastive decoding to reduce spurious pathology co-occurrences in multi-modal LLMs.
Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
LLM in-context translation accuracy falls sharply with larger grammars and longer sentences, and drops further when source and target languages differ in morphology or writing system, with common errors including wrong word recall, hallucinations, and untranslated source words.
SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.
citing papers explorer
-
Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation
Across 30 LLMs and 205 TLA+ tasks, syntactic correctness reaches at most 26.6% and semantic correctness 8.6%, with all successes limited to progressive prompting and no advantage from larger models.
-
ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models
ClinicalMC is a benchmark of 1,275 Chinese and 5,804 English multi-course clinical samples across four stages, evaluated via a multi-agent framework on closed-source, open-source, and medical LLMs in static and dynamic settings.
-
AutoMedBench: Towards Medical AutoResearch with Agentic AI Models
AutoMedBench evaluates AI agents on long-horizon medical workflows across five stages and finds validation and submission as dominant failure points based on thousands of runs.
-
From Table to Cell: Attention for Better Reasoning with TABALIGN
TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding execution 44.64%.
-
PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents
PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.
-
How English Print Media Frames Human-Elephant Conflicts in India
English print media coverage of human-elephant conflicts in India is dominated by fear-inducing and aggression-related language.
-
CWCD: Category-Wise Contrastive Decoding for Structured Medical Report Generation
CWCD improves structured chest X-ray report generation by using category-wise contrastive decoding to reduce spurious pathology co-occurrences in multi-modal LLMs.
-
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
-
Quantization Inflates Reasoning: Token Inflation as a Hidden Cost of Low-Bit Reasoning Models
Low-bit post-training quantization of reasoning LLMs increases reasoning token counts while preserving accuracy, introducing a hidden test-time compute cost.
-
HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution
HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.
-
Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition
Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.
-
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.
-
POTracker: Optimizing Large Language Models for Standard-Compliant Power Outage Report Generation
POTracker fine-tunes an LLM with POTrackerLoss combining textual and structural similarity, achieving up to 86.47% structural accuracy on 1,000 power outage reports and outperforming baselines by up to 51%.
-
LLM-FACETS: A Privacy-Preserving Framework for Evaluating LLM Transparency and Accountability
Introduces LLM-FACETS, a privacy-preserving open-source framework for LLM evaluation using deterministic metrics locally, LLM-judge metrics with user-controlled APIs, and mechanisms for uncertainty visualization and hallucination detection.
-
Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing
SAVeR adds self-auditing of internal beliefs in LLM agents via persona-based candidates and constraint-guided repairs, improving faithfulness on six benchmarks without hurting task performance.
-
Human-LLM Dialogue Improves Diagnostic Accuracy in Emergency Care
Interactive LLM dialogue raised residents' hard-case diagnostic correctness from 0.589 to 0.734 and produced medium effect sizes in a blinded study of seven physicians on 52 emergency cases.