Mixed citations

Jan Melechovsky, Abhinaba Roy, and Dorien Herremans

mlsys · 2025 · arXiv 3083.107313

Mixed citation behavior. Most common role is background (62%).

131 Pith papers citing it

Background 62% of classified citations

citation-role summary

background 12 · method 4

citation-polarity summary

background 10 · use method 4 · support 1 · unclear 1

representative citing papers

Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

cs.AI · 2026-06-04 · accept · novelty 8.0

Across 30 LLMs and 205 TLA+ tasks, syntactic correctness reaches at most 26.6% and semantic correctness 8.6%, with all successes limited to progressive prompting and no advantage from larger models.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

Evaluating Very Long-Term Conversational Memory of LLM Agents

cs.CL · 2024-02-27 · unverdicted · novelty 8.0

Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.

RoFormer: Enhanced Transformer with Rotary Position Embedding

cs.CL · 2021-04-20 · accept · novelty 8.0

RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.

Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning

cs.SE · 2026-06-18 · unverdicted · novelty 7.0

Introduces SolidityBench benchmark and SolidityScore metric for repository-level Solidity code generation, finding supervised fine-tuning outperforms prompting, CoT, ICL, and RAG methods on evaluated LLMs.

Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond

cs.CL · 2026-06-16 · unverdicted · novelty 7.0

Analysis of 14,727 security and privacy prompts from WildChat finds commercial LLMs give higher-quality responses than open-weight models but can produce inconsistent answers across repeated queries.

A PubMed-Scale Dataset of Structured Biomedical Abstracts

cs.IR · 2026-06-09 · unverdicted · novelty 7.0

The paper releases Structured PubMed: 23.2 million harmonized, section-labeled biomedical abstracts (5.9M author-structured + 17.2M LLM-labeled) mapped to PubMed IDs for training and benchmarking.

Multilingual Coreference Resolution via Cycle-Consistent Machine Translation

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

A cycle-consistent MT pipeline generates and similarity-weights training data for coreference resolution, producing gains on four low-resource languages and enabling the task where no corpora existed.

Stateful Visual Encoders for Vision-Language Models

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

Stateful visual encoders condition each visual representation on prior features, yielding consistent gains on multi-image tasks under supervised finetuning across model sizes and domains.

ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

ClinicalMC is a benchmark of 1,275 Chinese and 5,804 English multi-course clinical samples across four stages, evaluated via a multi-agent framework on closed-source, open-source, and medical LLMs in static and dynamic settings.

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

cs.AI · 2026-06-01 · conditional · novelty 7.0

AutoMedBench evaluates AI agents on long-horizon medical workflows across five stages and finds validation and submission as dominant failure points based on thousands of runs.

Brain-IT-VQA: From Brain Signals to Answers

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

Brain-IT-VQA decodes visual question answers from fMRI using a transformer to extract language tokens and introduces the NSD-VQA benchmark with 20 controlled questions per image across 20 categories.

From Table to Cell: Attention for Better Reasoning with TABALIGN

cs.AI · 2026-05-14 · unverdicted · novelty 7.0

TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding execution 44.64%.

Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.

PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.

GeoDial: A Multimodal Conversational Tutoring Dataset for Geometry Problem-Solving with Visual Tutor Turns

cs.CY · 2026-05-08 · unverdicted · novelty 7.0

Introduces the GeoDial dataset of 1.3K multimodal geometry tutoring dialogs grounded in diagram highlights, proposes an annotation protocol, and shows that fine-tuned VLMs improve dialog but struggle with accurate highlights.

How English Print Media Frames Human-Elephant Conflicts in India

cs.AI · 2026-04-23 · unverdicted · novelty 7.0

English print media coverage of human-elephant conflicts in India is dominated by fear-inducing and aggression-related language.

ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.

LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

LQM introduces a six-level linguistically motivated error taxonomy for MT evaluation and applies it via expert annotation to LLM outputs on a new 3,850-sentence multi-dialect Arabic corpus.

Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs

cs.SE · 2026-04-19 · unverdicted · novelty 7.0

MultiLogBench shows that LLM performance on automated logging varies substantially across programming languages, demonstrating that single-language evidence is insufficient for general claims about model behavior or tool design.

AsymmetryZero: A Framework for Operationalizing Human Expert Preferences as Semantic Evals

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

AsymmetryZero operationalizes expert preferences as stable evaluation contracts for semantic evals, with a study showing 75.9-89.6% criterion agreement between frontier and compact model juries at 4-5% of the cost.

CWCD: Category-Wise Contrastive Decoding for Structured Medical Report Generation

cs.AI · 2026-04-12 · unverdicted · novelty 7.0

CWCD improves structured chest X-ray report generation by using category-wise contrastive decoding to reduce spurious pathology co-occurrences in multi-modal LLMs.

Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.

Evaluating In-Context Translation with Synchronous Context-Free Grammar Transduction

cs.CL · 2026-04-08 · unverdicted · novelty 7.0

LLM in-context translation accuracy falls sharply with larger grammars and longer sentences, and drops further when source and target languages differ in morphology or writing system, with common errors including wrong word recall, hallucinations, and untranslated source words.

citing papers explorer

Showing 31 of 131 citing papers.

Building Customer Support AI Agents at 100M-User Scale: An Evaluation-Driven Framework cs.CL · 2026-06-07 · unverdicted · none · ref 35
An evaluation-driven framework for customer support AI agents at Nubank integrates context engineering, LLM judges, and A/B testing to deliver up to 37pp NPS gains and strong offline-online correlation across five production domains.
HydraQE: OSU's Submission for the IWSLT 2026 Speech Translation Metrics Shared Task cs.CL · 2026-06-07 · unverdicted · none · ref 23
HydraQE is a new end-to-end speech translation QE system using Qwen3-ASR backbone, sparsemax layer mixing, bidirectional Transformer, and multi-task curriculum training on human and pseudo labels that outperforms cascaded baselines.
Simple Token-Efficient Vision-Language Model for Case-level Pathology Synoptic Report Generation cs.CV · 2026-05-29 · unverdicted · none · ref 19
A token-efficient VLM with frozen encoder, two-layer MLP aligner, and LLM decoder generates case-level synoptic pathology reports from multi-WSI inputs using 5x magnification patches and two-stage supervised training.
Smarter edits? Post-editing with error highlights and translation suggestions cs.CL · 2026-05-20 · unverdicted · none · ref 20
User study finds no productivity or quality gains from APE-derived error highlights and suggestions over regular post-editing, but better user reception and experience.
Enhancing Target-Guided Proactive Dialogue Systems via Conversational Scenario Modeling and Intent-Keyword Bridging cs.CL · 2026-05-12 · unverdicted · none · ref 34
Conversational scenario modeling from user profiles and domain knowledge, combined with intent-keyword bridging, improves proactivity, fluency, and informativeness in target-guided proactive dialogue systems.
Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents cs.CL · 2026-05-11 · unverdicted · none · ref 282
Audio language models are benchmarked on five semantic and paralinguistic reasoning tasks to reveal limitations in handling spoken audio evidence, accent variation, and domain shifts.
Assessment of RAG and Fine-Tuning for Industrial Question-Answering-Applications cs.CL · 2026-05-10 · unverdicted · none · ref 32
RAG is more effective and cost-efficient than fine-tuning for industrial QA adaptation on automotive datasets.
Human-LLM Dialogue Improves Diagnostic Accuracy in Emergency Care cs.AI · 2026-05-08 · unverdicted · none · ref 41
Interactive LLM dialogue raised residents' hard-case diagnostic correctness from 0.589 to 0.734 and produced medium effect sizes in a blinded study of seven physicians on 52 emergency cases.
Reflections and New Directions for Human-Centered Large Language Models cs.CL · 2026-05-07 · unverdicted · none · ref 22
Model developers must address human concerns, preferences, values, and goals with rigor at every stage of the LLM pipeline rather than only in post-training.
Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility cs.LG · 2026-05-07 · unverdicted · none · ref 10 · 2 links
Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.
Retina-RAG: Retrieval-Augmented Vision-Language Modeling for Joint Retinal Diagnosis and Clinical Report Generation cs.CV · 2026-05-07 · unverdicted · none · ref 35
Retina-RAG combines a retinal classifier, LoRA-tuned Qwen2.5-VL, and RAG to jointly grade DR, detect ME, and generate reports, reaching F1 scores of 0.731 and 0.948 while exceeding baselines on ROUGE-L and SBERT metrics.
Exploring Pass-Rate Reward in Reinforcement Learning for Code Generation cs.LG · 2026-05-01 · unverdicted · none · ref 38
Pass-rate rewards in critic-free RL for code generation fail to outperform binary rewards because partial-pass solutions induce conflicting gradient directions that do not consistently favor full correctness.
Users' Activity Logs: the Good, the Bad, the Misconception, and the Disastrous cs.HC · 2026-04-30 · unverdicted · none · ref 21
Secondary analysis of 30 Saudi Google users' interviews identifies balanced perceptions of activity logs spanning benefits, risks, misconceptions, and severe negative outcomes.
REFLEX: Reference-Free Evaluation of Log Summarization via Large Language Model Judgment cs.CL · 2025-11-06 · unverdicted · none · ref 15
REFLEX is a reference-free LLM-based evaluation metric for log summarization that assesses quality on relevance, informativeness, and coherence without gold references or human annotations.
FediLoRA: Practical Federated Fine-Tuning of Foundation Models Under Missing-Modality Constraints cs.LG · 2025-09-01 · unverdicted · none · ref 31
FediLoRA is a lightweight federated LoRA aggregation method that jointly mitigates missing modalities and heterogeneous ranks in collaborative fine-tuning of foundation models.
MsEdF: A Multi-stream Encoder-decoder Framework for Remote Sensing Image Captioning cs.CV · 2025-02-13 · unverdicted · none · ref 17
MsEdF combines two complementary image encoders for feature diversity and a stacked GRU decoder with element-wise aggregation to improve remote sensing image captioning on three benchmark datasets.
HULAT2 at MER-TRANS 2026: Governed Multi-Agent Simplification for Spanish Easy-to-Read Generation cs.CL · 2026-07-02 · unverdicted · none · ref 9
HULAT2 submitted three runs to the Spanish MER-TRANS 2026 track; a LangGraph multi-agent workflow with internal quality signals achieved the best SARI score (44.05) among them, outperforming a linear regeneration baseline.
A Tree-of-Thoughts Inspired Hybrid Approach for Legal Case Judgement Summarization using LLMs cs.CL · 2026-06-26 · unverdicted · none · ref 297
A tree-of-thoughts inspired hybrid extractive-abstractive LLM prompt yields better legal case judgment summaries than standard extractive or abstractive prompts.
CAT-Translate: Building Compact Open-Source Models for Japanese-English Translation cs.CL · 2026-06-19 · unverdicted · none · ref 40
Compact 0.8B-7B models for bidirectional Japanese-English translation outperform large multilingual models on real-world domain benchmarks.
Lius: Translation Model Based Instructional Lingustic Using Continual Instruction Tuning In Kupang Malay cs.CL · 2026-06-10 · unverdicted · none · ref 76
Lius improves LLM translation for Kupang Malay by 4-13 points over baselines via continual instruction tuning with dictionary-derived instructions.
AI Assurance: A Comprehensive Testing Strategy for Enterprise AI Systems cs.SE · 2026-05-22 · unverdicted · none · ref 7
Proposes an AI Failure Taxonomy, a five-layer AI Assurance Pyramid, and operational guidance for RAG testing and model lifecycle management in enterprise settings.
DFKI-MLT at SemEval-2026 TASK 7: Steering Multilingual Models Towards Cultural Knowledge cs.CL · 2026-05-21 · unverdicted · none · ref 36
Activation steering with FLORES-derived language vectors produces modest, layer-sensitive and language-dependent gains on cultural awareness tasks, with some settings degrading performance and strong interaction with prompt design.
Hy-MT2: A Family of Fast, Efficient and Powerful Multilingual Translation Models in the Wild cs.CL · 2026-05-21 · unverdicted · none · ref 23
Hy-MT2 presents three new multilingual translation models that claim to outperform listed open-source and commercial systems on diverse tasks while enabling low-storage on-device use.
Multilingual Vision-Language Models, A Survey cs.CL · 2025-09-26 · accept · none · ref 108
The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-based evaluation.
LLaMA-XR: A Novel Framework for Radiology Report Generation using LLaMA and QLoRA Fine Tuning eess.IV · 2025-05-29 · unverdicted · none · ref 53
LLaMA-XR fine-tunes LLaMA 3.1 with QLoRA on DenseNet-121 embeddings to generate radiology reports from chest X-rays, reporting ROUGE-L of 0.433 and METEOR of 0.336 on the IU X-ray benchmark.
Bridging the Linguistic Divide: A Survey on Leveraging Large Language Models for Machine Translation cs.CL · 2025-04-02 · unverdicted · none · ref 38
A literature survey that organizes prompting, fine-tuning, preference optimization, and context-aware techniques for LLM-based machine translation with emphasis on low-resource languages.
A Survey on Knowledge Distillation of Large Language Models cs.CL · 2024-02-20 · accept · none · ref 192
A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.
Syntax as a Rosetta Stone: Universal Dependencies for In-Context Coptic Translation cs.CL · 2026-04-20 · unreviewed · ref 28
Adam's Law: Textual Frequency Law on Large Language Models cs.CL · 2026-04-02 · unreviewed · ref 30
IDRBench: Understanding the Capability of Large Language Models on Interdisciplinary Research cs.CL · 2025-07-21 · unreviewed · ref 30
Lessons from the Trenches on Reproducible Evaluation of Language Models cs.CL · 2024-05-23 · unreviewed · ref 93
