Mixed citations

Jan Melechovsky, Abhinaba Roy, and Dorien Herremans

mlsys · 2025 · arXiv 3083.107313

Mixed citation behavior. The most common role is background (62% of classified citations).

131 Pith papers cite it.

citation-role summary

background: 12 · method: 4

citation-polarity summary

background: 10 · use method: 4 · support: 1 · unclear: 1

representative citing papers

Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

cs.AI · 2026-06-04 · accept · novelty 8.0

Across 30 LLMs and 205 TLA+ tasks, syntactic correctness reaches at most 26.6% and semantic correctness 8.6%, with all successes limited to progressive prompting and no advantage from larger models.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

Evaluating Very Long-Term Conversational Memory of LLM Agents

cs.CL · 2024-02-27 · unverdicted · novelty 8.0

Creates the LoCoMo benchmark dataset for very long-term LLM conversational memory and shows that current models struggle with lengthy dialogues and long-range temporal dynamics.

RoFormer: Enhanced Transformer with Rotary Position Embedding

cs.CL · 2021-04-20 · accept · novelty 8.0

RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.

Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning

cs.SE · 2026-06-18 · unverdicted · novelty 7.0

Introduces SolidityBench benchmark and SolidityScore metric for repository-level Solidity code generation, finding supervised fine-tuning outperforms prompting, CoT, ICL, and RAG methods on evaluated LLMs.

Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond

cs.CL · 2026-06-16 · unverdicted · novelty 7.0

Analysis of 14,727 security and privacy prompts from WildChat finds commercial LLMs give higher-quality responses than open-weight models but can produce inconsistent answers across repeated queries.

A PubMed-Scale Dataset of Structured Biomedical Abstracts

cs.IR · 2026-06-09 · unverdicted · novelty 7.0

The paper releases Structured PubMed: 23.2 million harmonized, section-labeled biomedical abstracts (5.9M author-structured + 17.2M LLM-labeled) mapped to PubMed IDs for training and benchmarking.

Multilingual Coreference Resolution via Cycle-Consistent Machine Translation

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

A cycle-consistent MT pipeline generates and similarity-weights training data for coreference resolution, producing gains on four low-resource languages and enabling the task where no corpora existed.

Stateful Visual Encoders for Vision-Language Models

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

Stateful visual encoders condition each visual representation on prior features, yielding consistent gains on multi-image tasks under supervised finetuning across model sizes and domains.

ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

ClinicalMC is a benchmark of 1,275 Chinese and 5,804 English multi-course clinical samples across four stages, evaluated via a multi-agent framework on closed-source, open-source, and medical LLMs in static and dynamic settings.

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

cs.AI · 2026-06-01 · conditional · novelty 7.0

AutoMedBench evaluates AI agents on long-horizon medical workflows across five stages and, based on thousands of runs, finds that validation and submission are the dominant failure points.

Brain-IT-VQA: From Brain Signals to Answers

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

Brain-IT-VQA decodes visual question answers from fMRI using a transformer to extract language tokens and introduces the NSD-VQA benchmark with 20 controlled questions per image across 20 categories.

From Table to Cell: Attention for Better Reasoning with TABALIGN

cs.AI · 2026-05-14 · unverdicted · novelty 7.0

TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding execution 44.64%.

Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.

PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.

GeoDial: A Multimodal Conversational Tutoring Dataset for Geometry Problem-Solving with Visual Tutor Turns

cs.CY · 2026-05-08 · unverdicted · novelty 7.0

Introduces the GeoDial dataset of 1.3K multimodal geometry tutoring dialogs grounded in diagram highlights, proposes an annotation protocol, and shows that fine-tuned VLMs improve dialog but struggle with accurate highlights.

How English Print Media Frames Human-Elephant Conflicts in India

cs.AI · 2026-04-23 · unverdicted · novelty 7.0

English print media coverage of human-elephant conflicts in India is dominated by fear-inducing and aggression-related language.

ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.

LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

LQM introduces a six-level linguistically motivated error taxonomy for MT evaluation and applies it via expert annotation to LLM outputs on a new 3,850-sentence multi-dialect Arabic corpus.

Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs

cs.SE · 2026-04-19 · unverdicted · novelty 7.0

MultiLogBench shows that LLM performance on automated logging varies substantially across programming languages, demonstrating that single-language evidence is insufficient for general claims about model behavior or tool design.

AsymmetryZero: A Framework for Operationalizing Human Expert Preferences as Semantic Evals

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

AsymmetryZero operationalizes expert preferences as stable evaluation contracts for semantic evals, with a study showing 75.9-89.6% criterion agreement between frontier and compact model juries at 4-5% of the cost.

CWCD: Category-Wise Contrastive Decoding for Structured Medical Report Generation

cs.AI · 2026-04-12 · unverdicted · novelty 7.0

CWCD improves structured chest X-ray report generation by using category-wise contrastive decoding to reduce spurious pathology co-occurrences in multi-modal LLMs.

Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.

Evaluating In-Context Translation with Synchronous Context-Free Grammar Transduction

cs.CL · 2026-04-08 · unverdicted · novelty 7.0

LLM in-context translation accuracy falls sharply with larger grammars and longer sentences, and drops further when source and target languages differ in morphology or writing system, with common errors including wrong word recall, hallucinations, and untranslated source words.

citing papers explorer

Showing 50 of 131 citing papers.

ATD-Trans: A Geographically Grounded Japanese-English Travelogue Translation Dataset cs.CL · 2026-05-13 · conditional · none · ref 14
ATD-Trans is a new geographically annotated Japanese-English travelogue dataset that reveals Japanese-enhanced models perform better on geo-entity translation while domestic Japanese locations remain harder to translate accurately.
HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution cs.AI · 2026-05-11 · unverdicted · none · ref 55
HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.
MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents cs.CR · 2026-05-10 · unverdicted · none · ref 34 · 3 links
MemPrivacy uses edge-side privacy span detection and semantic placeholders to enable cloud memory management for LLM agents while limiting utility loss to 1.6% and outperforming masking baselines.
Dynamic Meta-Metrics: Source-Sentence Conditioned Weighting for MT Evaluation cs.CL · 2026-05-09 · unverdicted · none · ref 1
Dynamic Meta-Metrics learns source-sentence conditioned combinations of MT metrics, with MLP-based and soft-conditioned versions showing gains over linear and GP ensembles on WMT data.
BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning cs.CV · 2026-05-08 · unverdicted · none · ref 16
BalCapRL applies balanced multi-objective RL with GDPO-style normalization and length-conditional masking to improve MLLM image captioning, reporting gains of up to +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena on LLaVA and Qwen models.
SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 55+ Sign Languages cs.CV · 2026-05-03 · unverdicted · none · ref 12
SignVerse-2M provides a 2-million-clip multilingual pose-native dataset for sign language derived from public videos via DWPose preprocessing to enable robust modeling in real-world conditions.
Block-wise Codeword Embedding for Reliable Multi-bit Text Watermarking cs.CR · 2026-05-01 · unverdicted · none · ref 14 · 2 links
BREW uses block voting and window-shifting verification to reach TPR 0.965 and FPR 0.02 under 10% synonym substitution, addressing high false-positive issues in prior multi-bit LLM watermarking.
Cultural Benchmarking of LLMs in Standard and Dialectal Arabic Dialogues cs.CL · 2026-04-30 · unverdicted · none · ref 36
ArabCulture-Dialogue dataset shows LLMs perform worse on dialectal Arabic than Modern Standard Arabic across cultural reasoning, translation, and generation tasks.
Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition cs.AI · 2026-04-20 · unverdicted · none · ref 46
Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.
Measuring Distribution Shift in User Prompts and Its Effects on LLM Performance cs.CL · 2026-04-19 · unverdicted · none · ref 53
The LENS framework applied to 192 real-world settings shows moderate natural prompt distribution shifts cause 73% average performance loss in deployed LLMs, especially across user groups and regions.
Enhancing Reinforcement Learning for Radiology Report Generation with Evidence-aware Rewards and Self-correcting Preference Learning cs.LG · 2026-04-15 · unverdicted · none · ref 19
ESC-RL improves RL for radiology reports via group-wise evidence-aware rewards (GEAR) and LLM-driven self-correcting preference learning (SPL), reaching state-of-the-art on two chest X-ray datasets.
AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis cs.CV · 2026-04-07 · unverdicted · none · ref 31
AICA-Bench evaluates 23 VLMs on affective image analysis, identifies weak intensity calibration and shallow descriptions as limitations, and proposes training-free Grounded Affective Tree Prompting to improve performance.
A Human-Centric Framework for Data Attribution in Large Language Models cs.CY · 2026-02-11 · unverdicted · none · ref 149
Introduces a parameter-driven framework for data attribution in LLMs that enables negotiation among creators, users, and intermediaries to meet stakeholder goals within the data economy.
Stress Testing Factual Consistency Metrics for Long-Document Summarization cs.CL · 2025-11-10 · unverdicted · none · ref 28
Short-form factual consistency metrics produce inconsistent scores on semantically equivalent long-document summaries and lose reliability on information-dense claims.
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation cs.AI · 2025-10-05 · unverdicted · none · ref 76
A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.
Detecting LLM-Generated Spam Reviews by Integrating Language Model Embeddings and Graph Neural Network cs.CL · 2025-10-02 · unverdicted · none · ref 35
Introduces FraudSquad, a hybrid model using language model embeddings and a gated graph transformer that outperforms baselines on newly created LLM-generated spam review datasets.
SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models cs.CL · 2025-07-25 · conditional · none · ref 20
SLoW selects low-frequency word dictionaries to boost LLM translation quality and efficiency across 100 languages from FLORES.
Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models cs.CL · 2024-11-02 · unverdicted · none · ref 23
DIP interleaves English word translations into non-English prompts to boost multilingual reasoning on synthetic benchmarks spanning 10-200 languages.
Toxic Subword Pruning for Dialogue Response Generation on Large Language Models cs.CL · 2024-10-05 · unverdicted · none · ref 29
ToxPrune prunes toxic subwords from BPE tokenizers in LLMs to mitigate toxic dialogue responses and improve diversity on both toxic and non-toxic models.
Nougat: Neural Optical Understanding for Academic Documents cs.LG · 2023-08-25 · conditional · none · ref 44
Nougat applies a visual transformer to convert academic PDFs into markup language while accurately handling mathematical content on a new scientific document dataset.
Large Language Models are not Fair Evaluators cs.CL · 2023-05-29 · conditional · none · ref 49
LLMs show strong position bias when scoring model outputs, allowing easy manipulation of rankings, but calibration with multiple evidence, position balancing, and selective human input reduces this bias to better match human judgments.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model cs.CL · 2022-11-09 · unverdicted · none · ref 296
BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
Findings of the First Shared Task on Machine Translation Robustness cs.CL · 2019-06-27 · unverdicted · none · ref 27
The first shared task on MT robustness received 23 submissions showing up to +22.33 BLEU gains on noisy Reddit data, with strong human-BLEU correlation.
A-TMA: Decoupling State-Aware Memory Failures in Long-Term Agent Memory cs.AI · 2026-07-02 · unverdicted · none · ref 42
ATMA adds state labels and evidence packets to existing memory systems to reduce ghost memory failures, with reported gains on a new LTP benchmark and LoCoMo.
LLM4MTLs: Automated Generation and Empirical Evaluation of Model Transformation Languages cs.SE · 2026-06-23 · unverdicted · none · ref 17
Few-shot prompting improves syntactic validity of LLM-generated code across ATL, ETL, QVTo, and Reactions, but semantic correctness gains remain uneven and language-dependent.
POTracker: Optimizing Large Language Models for Standard-Compliant Power Outage Report Generation cs.AI · 2026-06-22 · unverdicted · none · ref 48 · 2 links
POTracker fine-tunes an LLM with POTrackerLoss combining textual and structural similarity, achieving up to 86.47% structural accuracy on 1,000 power outage reports and outperforming baselines by up to 51%.
G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents cs.CL · 2026-06-11 · unverdicted · none · ref 24
G-Long uses graph-enhanced triplet memory and attention-aware scoring from a T5 summarizer to achieve up to 9.8% better response quality on MSC and 40.8% better retrieval recall on LME with lower overhead.
CRAFT: A Unified Counterfactual Reasoning Framework for Tabular Question Answering and Fact Verification cs.CL · 2026-06-05 · unverdicted · none · ref 32
CRAFT is a unified bidirectional counterfactual reasoning framework that improves LLM performance on tabular QA and fact verification tasks over baselines on WikiTQ and TabFact.
A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs cs.CL · 2026-06-03 · unverdicted · none · ref 31
Constructs a multi-video summarization benchmark and evaluates nine MLLMs, showing that positional bias is domain- and model-dependent, that middle positions are often weaker, and that budgets do not uniformly fix it.
LLM-FACETS: A Privacy-Preserving Framework for Evaluating LLM Transparency and Accountability cs.AI · 2026-05-29 · unverdicted · none · ref 25
Introduces LLM-FACETS, a privacy-preserving open-source framework for LLM evaluation using deterministic metrics locally, LLM-judge metrics with user-controlled APIs, and mechanisms for uncertainty visualization and hallucination detection.
A Pilot Study on Curator-Guided Multilingual Art Description for Blind and Low-Vision Audiences with Small Vision-Language Models cs.MM · 2026-05-29 · unverdicted · none · ref 25
Pilot evaluation of language-specific versus multilingual LoRA adapters on Qwen2.5-VL-3B for curator-guided BLV art descriptions in three languages.
VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation cs.CV · 2026-05-23 · unverdicted · none · ref 36
VaaWIT proposes DSAM and VAA modules to adapt LLMs for multilingual web image translation, claiming outperformance over open-source baselines on benchmarks.
Empirical Analysis and Detection of Hallucinations in LLM-Generated Bug Report Summaries cs.SE · 2026-05-22 · unverdicted · none · ref 28
Develops a section-aware hallucination detection method for LLM bug report summaries using synthetic injection on the BugsRepo dataset from Mozilla projects, reporting up to 0.89 Macro-F1 at report level.
COPRA: Conditional Parameter Adaptation with Reinforcement Learning for Video Anomaly Detection cs.CV · 2026-05-14 · unverdicted · none · ref 53
COPRA introduces conditional parameter adaptation via RL to dynamically tune frozen VLMs for video anomaly detection, outperforming static methods in in-domain and cross-domain settings while generalizing to other video tasks.
Fine-Tuning Models for Automated Code Review Feedback cs.SE · 2026-05-12 · conditional · none · ref 28
PEFT fine-tuning of Code Llama yields feedback on student Java bugs that students judge equal to ChatGPT and better than prompt engineering, using BLEU/ROUGE/BERTScore plus human ratings.
Towards Visually-Guided Movie Subtitle Translation for Indic Languages cs.CL · 2026-05-12 · unverdicted · none · ref 1
Selective replacement of the worst 20-30% of text-only subtitle segments with visual-enhanced outputs raises COMET scores for Indic languages, but full visual grounding is ineffective because of temporal misalignment between subtitles and frames.
UserGPT Technical Report cs.IR · 2026-05-09 · unverdicted · none · ref 77
UserGPT introduces a generative LLM framework with a behavior simulation engine, semantization module, and DF-GRPO post-training that scores 0.7325 on tag prediction and 0.7528 on summary generation on HPR-Bench while compressing records by up to 97.9%.
Boosting Automatic Java-to-Cangjie Translation with Multi-Stage LLM Training and Error Repair cs.SE · 2026-05-08 · unverdicted · none · ref 39
Multi-stage LLM training plus compiler-guided error repair boosts functional equivalence in Java-to-Cangjie translation by 6.06% over prior methods despite scarce parallel data.
Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization cs.CL · 2026-04-22 · unverdicted · none · ref 12
Automatic prompt optimization using lenient LLM judges improves performance and transferability in legal QA evaluations compared to human design or strict judges.
An Explainable Approach to Document-level Translation Evaluation with Topic Modeling cs.CE · 2026-04-22 · unverdicted · none · ref 24
A topic-modeling framework measures document-level thematic consistency in translations by aligning key tokens across languages with a bilingual dictionary and scoring via cosine similarity, providing explainable insights beyond sentence-level metrics.
CXRMate-2: Structured Multimodal Temporal Embeddings and Tractable Reinforcement Learning for Clinically Acceptable Chest X-ray Radiology Report Generation cs.CV · 2026-04-21 · unverdicted · none · ref 46
CXRMate-2 improves chest X-ray report generation via temporal embeddings and tractable RL, delivering metric gains and 45% acceptability in radiologist review with no significant preference difference on most findings.
Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing cs.AI · 2026-04-09 · unverdicted · none · ref 39
SAVeR adds self-auditing of internal beliefs in LLM agents via persona-based candidates and constraint-guided repairs, improving faithfulness on six benchmarks without hurting task performance.
The Personalization Paradox: Semantic Loss vs. Reasoning Gains in Agentic AI Q&A cs.IR · 2025-12-04 · unverdicted · none · ref 20
Personalization in an agentic RAG advising system boosts reasoning quality and grounding while reducing semantic metric scores due to the inability of current metrics to accommodate user-specific responses.
PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models cs.CL · 2025-12-02 · unverdicted · none · ref 52
PEFT-Factory supplies a ready-to-use, extensible codebase that unifies 19 PEFT methods and evaluation pipelines for fine-tuning large autoregressive language models.
Do Activation Verbalization Methods Convey Privileged Information? cs.CL · 2025-09-16 · unverdicted · none · ref 42
Activation verbalization methods for LLMs largely reflect the verbalizer model's parametric knowledge rather than privileged information from the target model's activations.
Automated Description Generation of Cytologic Findings for Lung Cytological Images Using a Pretrained Vision Model and Dual Text Decoders: Preliminary Study eess.IV · 2024-03-26 · unverdicted · none · ref 19
A CNN classifies lung cytology patches as benign or malignant at 100% sensitivity and 96.4% specificity, then routes to one of two Transformer decoders to generate findings text achieving BLEU-4 of 0.828 on 801 images.
PaLM 2 Technical Report cs.CL · 2023-05-17 · unverdicted · none · ref 108
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
StarCoder: may the source be with you! cs.CL · 2023-05-09 · accept · none · ref 257
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
Bridging Scientific Heritage: An Arabic-Russian Parallel Corpus and LLM Benchmark for Sustainable Knowledge Transfer cs.CL · 2026-06-29 · unverdicted · none · ref 4
A new 27k-sentence Arabic-Russian parallel corpus supports fine-tuned LLM translation benchmarks that improve BLEU by 4.36 and COMET by 0.051 over zero-shot baselines for scientific content.
Semantic Grading of Written Answers in Low-Resource Language Bangla Using a Fine-Tuned Lightweight Language Model cs.CL · 2026-06-10 · unverdicted · none · ref 2
Qwen3-8B is fine-tuned with QLoRA on synthetic Bangla-English data to semantically grade written answers, reporting RoRa 0.819 and human agreement rho 0.936.
