Mixed citations

Bleu: a Method for Automatic Evaluation of Machine Translation

Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu · 2002 · arXiv 3083.107313

Mixed citation behavior. Most common role is background (62%).

113 Pith papers citing it

Background 62% of classified citations

citation-role summary

background 12 · method 4

citation-polarity summary

background 10 · use method 4 · support 1 · unclear 1

representative citing papers

Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

cs.AI · 2026-06-04 · accept · novelty 8.0

Across 30 LLMs and 205 TLA+ tasks, syntactic correctness reaches at most 26.6% and semantic correctness 8.6%, with all successes limited to progressive prompting and no advantage from larger models.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

Evaluating Very Long-Term Conversational Memory of LLM Agents

cs.CL · 2024-02-27 · unverdicted · novelty 8.0

Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.

RoFormer: Enhanced Transformer with Rotary Position Embedding

cs.CL · 2021-04-20 · accept · novelty 8.0

RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.

Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning

cs.SE · 2026-06-18 · unverdicted · novelty 7.0

Introduces SolidityBench benchmark and SolidityScore metric for repository-level Solidity code generation, finding supervised fine-tuning outperforms prompting, CoT, ICL, and RAG methods on evaluated LLMs.

Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond

cs.CL · 2026-06-16 · unverdicted · novelty 7.0

Analysis of 14,727 security and privacy prompts from WildChat finds commercial LLMs give higher-quality responses than open-weight models but can produce inconsistent answers across repeated queries.

A PubMed-Scale Dataset of Structured Biomedical Abstracts

cs.IR · 2026-06-09 · unverdicted · novelty 7.0

The paper releases Structured PubMed: 23.2 million harmonized, section-labeled biomedical abstracts (5.9M author-structured + 17.2M LLM-labeled) mapped to PubMed IDs for training and benchmarking.

Multilingual Coreference Resolution via Cycle-Consistent Machine Translation

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

A cycle-consistent MT pipeline generates and similarity-weights training data for coreference resolution, producing gains on four low-resource languages and enabling the task where no corpora existed.

Stateful Visual Encoders for Vision-Language Models

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

Stateful visual encoders condition each visual representation on prior features, yielding consistent gains on multi-image tasks under supervised finetuning across model sizes and domains.

ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

ClinicalMC is a benchmark of 1,275 Chinese and 5,804 English multi-course clinical samples across four stages, evaluated via a multi-agent framework on closed-source, open-source, and medical LLMs in static and dynamic settings.

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

cs.AI · 2026-06-01 · conditional · novelty 7.0

AutoMedBench evaluates AI agents on long-horizon medical workflows across five stages and finds validation and submission as dominant failure points based on thousands of runs.

From Table to Cell: Attention for Better Reasoning with TABALIGN

cs.AI · 2026-05-14 · unverdicted · novelty 7.0

TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding execution 44.64%.

Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.

PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.

How English Print Media Frames Human-Elephant Conflicts in India

cs.AI · 2026-04-23 · unverdicted · novelty 7.0

English print media coverage of human-elephant conflicts in India is dominated by fear-inducing and aggression-related language.

ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.

LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

LQM introduces a six-level linguistically motivated error taxonomy for MT evaluation and applies it via expert annotation to LLM outputs on a new 3,850-sentence multi-dialect Arabic corpus.

Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs

cs.SE · 2026-04-19 · unverdicted · novelty 7.0

MultiLogBench shows that LLM performance on automated logging varies substantially across programming languages, demonstrating that single-language evidence is insufficient for general claims about model behavior or tool design.

AsymmetryZero: A Framework for Operationalizing Human Expert Preferences as Semantic Evals

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

AsymmetryZero operationalizes expert preferences as stable evaluation contracts for semantic evals, with a study showing 75.9-89.6% criterion agreement between frontier and compact model juries at 4-5% of the cost.

CWCD: Category-Wise Contrastive Decoding for Structured Medical Report Generation

cs.AI · 2026-04-12 · unverdicted · novelty 7.0

CWCD improves structured chest X-ray report generation by using category-wise contrastive decoding to reduce spurious pathology co-occurrences in multi-modal LLMs.

Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.

Evaluating In-Context Translation with Synchronous Context-Free Grammar Transduction

cs.CL · 2026-04-08 · unverdicted · novelty 7.0

LLM in-context translation accuracy falls sharply with larger grammars and longer sentences, and drops further when source and target languages differ in morphology or writing system, with common errors including wrong word recall, hallucinations, and untranslated source words.

Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

cs.CL · 2026-04-08 · conditional · novelty 7.0

SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.

DeEscalWild: A Real-World Benchmark for Automated De-Escalation Training with SLMs

cs.CL · 2026-03-20 · unverdicted · novelty 7.0

DeEscalWild supplies 1,500 high-fidelity de-escalation scenarios that let fine-tuned 3B SLMs outperform general-purpose larger models on realism and dialogue metrics.

citing papers explorer

Showing 50 of 113 citing papers.

Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation cs.AI · 2026-06-04 · accept · none · ref 23
Across 30 LLMs and 205 TLA+ tasks, syntactic correctness reaches at most 26.6% and semantic correctness 8.6%, with all successes limited to progressive prompting and no advantage from larger models.
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks cs.CL · 2026-04-19 · unverdicted · none · ref 49
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
Evaluating Very Long-Term Conversational Memory of LLM Agents cs.CL · 2024-02-27 · unverdicted · none · ref 147
Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.
RoFormer: Enhanced Transformer with Rotary Position Embedding cs.CL · 2021-04-20 · accept · none · ref 12
RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.
Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning cs.SE · 2026-06-18 · unverdicted · none · ref 48
Introduces SolidityBench benchmark and SolidityScore metric for repository-level Solidity code generation, finding supervised fine-tuning outperforms prompting, CoT, ICL, and RAG methods on evaluated LLMs.
Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond cs.CL · 2026-06-16 · unverdicted · none · ref 53
Analysis of 14,727 security and privacy prompts from WildChat finds commercial LLMs give higher-quality responses than open-weight models but can produce inconsistent answers across repeated queries.
A PubMed-Scale Dataset of Structured Biomedical Abstracts cs.IR · 2026-06-09 · unverdicted · none · ref 2
The paper releases Structured PubMed: 23.2 million harmonized, section-labeled biomedical abstracts (5.9M author-structured + 17.2M LLM-labeled) mapped to PubMed IDs for training and benchmarking.
Multilingual Coreference Resolution via Cycle-Consistent Machine Translation cs.CL · 2026-06-03 · unverdicted · none · ref 152
A cycle-consistent MT pipeline generates and similarity-weights training data for coreference resolution, producing gains on four low-resource languages and enabling the task where no corpora existed.
Stateful Visual Encoders for Vision-Language Models cs.CV · 2026-06-03 · unverdicted · none · ref 38
Stateful visual encoders condition each visual representation on prior features, yielding consistent gains on multi-image tasks under supervised finetuning across model sizes and domains.
ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models cs.AI · 2026-06-02 · unverdicted · none · ref 38
ClinicalMC is a benchmark of 1,275 Chinese and 5,804 English multi-course clinical samples across four stages, evaluated via a multi-agent framework on closed-source, open-source, and medical LLMs in static and dynamic settings.
AutoMedBench: Towards Medical AutoResearch with Agentic AI Models cs.AI · 2026-06-01 · conditional · none · ref 59
AutoMedBench evaluates AI agents on long-horizon medical workflows across five stages and finds validation and submission as dominant failure points based on thousands of runs.
From Table to Cell: Attention for Better Reasoning with TABALIGN cs.AI · 2026-05-14 · unverdicted · none · ref 45
TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding execution 44.64%.
Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations cs.CL · 2026-05-13 · unverdicted · none · ref 10
Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.
PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents cs.AI · 2026-05-11 · unverdicted · none · ref 156
PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.
How English Print Media Frames Human-Elephant Conflicts in India cs.AI · 2026-04-23 · unverdicted · none · ref 19
English print media coverage of human-elephant conflicts in India is dominated by fear-inducing and aggression-related language.
ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation cs.CL · 2026-04-21 · unverdicted · none · ref 41
ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.
LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation cs.CL · 2026-04-20 · unverdicted · none · ref 71
LQM introduces a six-level linguistically motivated error taxonomy for MT evaluation and applies it via expert annotation to LLM outputs on a new 3,850-sentence multi-dialect Arabic corpus.
Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs cs.SE · 2026-04-19 · unverdicted · none · ref 49
MultiLogBench shows that LLM performance on automated logging varies substantially across programming languages, demonstrating that single-language evidence is insufficient for general claims about model behavior or tool design.
AsymmetryZero: A Framework for Operationalizing Human Expert Preferences as Semantic Evals cs.LG · 2026-04-15 · unverdicted · none · ref 13
AsymmetryZero operationalizes expert preferences as stable evaluation contracts for semantic evals, with a study showing 75.9-89.6% criterion agreement between frontier and compact model juries at 4-5% of the cost.
CWCD: Category-Wise Contrastive Decoding for Structured Medical Report Generation cs.AI · 2026-04-12 · unverdicted · none · ref 4
CWCD improves structured chest X-ray report generation by using category-wise contrastive decoding to reduce spurious pathology co-occurrences in multi-modal LLMs.
Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment cs.CV · 2026-04-09 · unverdicted · none · ref 60
Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
Evaluating In-Context Translation with Synchronous Context-Free Grammar Transduction cs.CL · 2026-04-08 · unverdicted · none · ref 3
LLM in-context translation accuracy falls sharply with larger grammars and longer sentences, and drops further when source and target languages differ in morphology or writing system, with common errors including wrong word recall, hallucinations, and untranslated source words.
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill cs.CL · 2026-04-08 · conditional · none · ref 31
SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.
DeEscalWild: A Real-World Benchmark for Automated De-Escalation Training with SLMs cs.CL · 2026-03-20 · unverdicted · none · ref 11
DeEscalWild supplies 1,500 high-fidelity de-escalation scenarios that let fine-tuned 3B SLMs outperform general-purpose larger models on realism and dialogue metrics.
Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation cs.CL · 2026-02-02 · unverdicted · none · ref 3
xMemory builds revisable hierarchical agent memory by segmenting histories, decoupling into components, and aggregating via sparsity-semantic objective, yielding better answer quality and lower token use than flat RAG on LoCoMo and PerLTQA.
DialectLLM: A Dialect-Aware Dialog[ue] Generation Framework Beyond Standard American English cs.CL · 2026-01-30 · unverdicted · none · ref 12
DialectLLM generates parallel multi-dialect dialog data and a 50k-dialog benchmark showing frontier LLMs achieve under 70% accuracy on dialect tasks while the generated data can improve post-training.
Creating ConLangs to Probe the Metalinguistic Grammatical Knowledge of LLMs cs.CL · 2025-10-08 · unverdicted · none · ref 1
IASC is an interactive modular LLM system for building ConLangs that serves as a probe for metalinguistic grammatical knowledge, revealing large performance differences across models and across common versus rare linguistic patterns.
Guidelines for Empirical Studies in Software Engineering involving Large Language Models cs.SE · 2025-08-21 · accept · none · ref 102 · 2 links
The paper delivers a taxonomy of seven LLM study types in software engineering along with eight guidelines that separate mandatory requirements from recommended practices to address reproducibility challenges.
Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation cs.CL · 2025-05-24 · unverdicted · none · ref 39
Smoothie performs diffusion by smoothing token embeddings based on semantic similarity, outperforming prior diffusion models on sequence-to-sequence and unconditional text generation tasks.
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models cs.AI · 2024-06-14 · conditional · none · ref 248
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA cs.CL · 2023-11-28 · unverdicted · none · ref 50
LoRA adapters should be scaled by 1/sqrt(rank) rather than 1/rank to stabilize learning and enable effective use of higher ranks during fine-tuning of large language models.
Prefix-Tuning: Optimizing Continuous Prompts for Generation cs.CL · 2021-01-01 · conditional · none · ref 78
Prefix-tuning matches or exceeds fine-tuning on NLG tasks by optimizing a continuous prefix using 0.1% of parameters while keeping the LM frozen.
AI translation of literary texts is "fine", but readers still prefer human translations cs.CL · 2026-06-24 · unverdicted · none · ref 77
Human readers prefer human literary translations over AI-generated ones for immersion and clarity despite finding MT adequate and struggling to identify the source.
Quantization Inflates Reasoning: Token Inflation as a Hidden Cost of Low-Bit Reasoning Models cs.AI · 2026-06-24 · unverdicted · none · ref 262
Low-bit post-training quantization of reasoning LLMs increases reasoning token counts while preserving accuracy, introducing a hidden test-time compute cost.
CORE-BREW: LLR-Based Soft Decoding for Robust Multi-Bit LLM Watermarking cs.CR · 2026-06-23 · unverdicted · none · ref 18
CORE-BREW introduces constant-hit-rate embedding to produce LLRs enabling soft-decision decoding for more robust multi-bit LLM watermarking with two FPR-aware detection modes.
LaViSA: A Language and Vision Structural Ambiguity Benchmark cs.CL · 2026-06-17 · unverdicted · none · ref 31
LaViSA is a new benchmark that pairs structurally ambiguous sentences with images of their disambiguated meanings to evaluate VLMs on visual resolution of ambiguity.
Looped World Models cs.LG · 2026-06-16 · unverdicted · none · ref 18
Introduces looped transformer architectures for world models that iteratively refine latent states to achieve up to 100x parameter efficiency via adaptive computation depth.
MindAlign: Decoding Inner Speech from fMRI Signals via Multimodal Embedding Alignment under Limited Data cs.CL · 2026-06-15 · unverdicted · none · ref 35
MindAlign decodes inner speech from fMRI via subject-specific neural-semantic alignment into a multimodal space followed by prompting of a frozen LM, outperforming baselines and generalizing across subjects.
MÖVE: A Holistic LLM Benchmark for the German Public Sector cs.CL · 2026-06-11 · unverdicted · none · ref 65
MÖVE presents a new German-language benchmark evaluating 39 LLMs on performance and governance criteria using ten public-administration datasets.
Context-Driven Incremental Compression for Multi-Turn Dialogue Generation cs.CL · 2026-06-10 · unverdicted · none · ref 16
C-DIC achieves stable latency and perplexity over hundreds of dialogue turns via incremental per-thread compression with cross-turn revision.
Multilinguality of Large Language Models From a Structural Perspective cs.CL · 2026-06-01 · unverdicted · none · ref 29
Low-resource languages are structurally more different from English in LLMs than high- or mid-resource ones, and language-specific post-training alters structures while preserving inter-language relationships.
EmbGen: Teaching with Reassembled Corpora cs.CL · 2026-05-19 · unverdicted · none · ref 21
EmbGen creates synthetic QA data by entity decomposition, embedding-based reassembly into clusters, and multi-level sampling with cluster-specific prompts, yielding up to 88.9% higher Binary Accuracy than baselines on heterogeneous datasets under fixed token budgets.
ATD-Trans: A Geographically Grounded Japanese-English Travelogue Translation Dataset cs.CL · 2026-05-13 · conditional · none · ref 14
ATD-Trans is a new geographically annotated Japanese-English travelogue dataset that reveals Japanese-enhanced models perform better on geo-entity translation while domestic Japanese locations remain harder to translate accurately.
HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution cs.AI · 2026-05-11 · unverdicted · none · ref 55
HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.
MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents cs.CR · 2026-05-10 · unverdicted · none · ref 34 · 3 links
MemPrivacy uses edge-side privacy span detection and semantic placeholders to enable cloud memory management for LLM agents while limiting utility loss to 1.6% and outperforming masking baselines.
BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning cs.CV · 2026-05-08 · unverdicted · none · ref 16
BalCapRL applies balanced multi-objective RL with GDPO-style normalization and length-conditional masking to improve MLLM image captioning, reporting gains of up to +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena on LLaVA and Qwen models.
SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 55+ Sign Languages cs.CV · 2026-05-03 · unverdicted · none · ref 12
SignVerse-2M provides a 2-million-clip multilingual pose-native dataset for sign language derived from public videos via DWPose preprocessing to enable robust modeling in real-world conditions.
Cultural Benchmarking of LLMs in Standard and Dialectal Arabic Dialogues cs.CL · 2026-04-30 · unverdicted · none · ref 36
ArabCulture-Dialogue dataset shows LLMs perform worse on dialectal Arabic than Modern Standard Arabic across cultural reasoning, translation, and generation tasks.
Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition cs.AI · 2026-04-20 · unverdicted · none · ref 46
Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.
Measuring Distribution Shift in User Prompts and Its Effects on LLM Performance cs.CL · 2026-04-19 · unverdicted · none · ref 53
The LENS framework applied to 192 real-world settings shows moderate natural prompt distribution shifts cause 73% average performance loss in deployed LLMs, especially across user groups and regions.
