Mixed citation behavior. Most common role is background (62%).
citing papers explorer
-
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
-
Evaluating Very Long-Term Conversational Memory of LLM Agents
Creates the LoCoMo benchmark dataset for very long-term LLM conversational memory and shows that current models struggle with lengthy dialogues and long-range temporal dynamics.
-
RoFormer: Enhanced Transformer with Rotary Position Embedding
RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.
-
From Table to Cell: Attention for Better Reasoning with TABALIGN
TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding execution 44.64%.
-
Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations
Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.
-
PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents
PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.
-
How English Print Media Frames Human-Elephant Conflicts in India
English print media coverage of human-elephant conflicts in India is dominated by fear-inducing and aggression-related language.
-
ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation
ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.
-
LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation
LQM introduces a six-level linguistically motivated error taxonomy for MT evaluation and applies it via expert annotation to LLM outputs on a new 3,850-sentence multi-dialect Arabic corpus.
-
Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs
MultiLogBench shows that LLM performance on automated logging varies substantially across programming languages, demonstrating that single-language evidence is insufficient for general claims about model behavior or tool design.
-
AsymmetryZero: A Framework for Operationalizing Human Expert Preferences as Semantic Evals
AsymmetryZero operationalizes expert preferences as stable evaluation contracts for semantic evals, with a study showing 75.9-89.6% criterion agreement between frontier and compact model juries at 4-5% of the cost.
-
CWCD: Category-Wise Contrastive Decoding for Structured Medical Report Generation
CWCD improves structured chest X-ray report generation by using category-wise contrastive decoding to reduce spurious pathology co-occurrences in multi-modal LLMs.
-
Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment
Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
-
Evaluating In-Context Translation with Synchronous Context-Free Grammar Transduction
LLM in-context translation accuracy falls sharply with larger grammars and longer sentences, and drops further when source and target languages differ in morphology or writing system, with common errors including wrong word recall, hallucinations, and untranslated source words.
-
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.
-
DeEscalWild: A Real-World Benchmark for Automated De-Escalation Training with SLMs
DeEscalWild supplies 1,500 high-fidelity de-escalation scenarios that let fine-tuned 3B SLMs outperform general-purpose larger models on realism and dialogue metrics.
-
Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation
xMemory builds revisable hierarchical agent memory by segmenting histories, decoupling them into components, and aggregating via a sparsity-semantic objective, yielding better answer quality and lower token use than flat RAG on LoCoMo and PerLTQA.
-
DialectLLM: A Dialect-Aware Dialog[ue] Generation Framework Beyond Standard American English
DialectLLM generates parallel multi-dialect dialog data and a 50k-dialog benchmark showing frontier LLMs achieve under 70% accuracy on dialect tasks while the generated data can improve post-training.
-
Creating ConLangs to Probe the Metalinguistic Grammatical Knowledge of LLMs
IASC is an interactive modular LLM system for building ConLangs that serves as a probe for metalinguistic grammatical knowledge, revealing large performance differences across models and across common versus rare linguistic patterns.
-
Guidelines for Empirical Studies in Software Engineering involving Large Language Models
The paper delivers a taxonomy of seven LLM study types in software engineering along with eight guidelines that separate mandatory requirements from recommended practices to address reproducibility challenges.
-
Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation
Smoothie performs diffusion by smoothing token embeddings based on semantic similarity, outperforming prior diffusion models on sequence-to-sequence and unconditional text generation tasks.
-
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
-
A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA
LoRA adapters should be scaled by 1/sqrt(rank) rather than 1/rank to stabilize learning and enable effective use of higher ranks during fine-tuning of large language models.
-
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Prefix-tuning matches or exceeds fine-tuning on NLG tasks by optimizing a continuous prefix using 0.1% of parameters while keeping the LM frozen.
-
EmbGen: Teaching with Reassembled Corpora
EmbGen creates synthetic QA data by entity decomposition, embedding-based reassembly into clusters, and multi-level sampling with cluster-specific prompts, yielding up to 88.9% higher Binary Accuracy than baselines on heterogeneous datasets under fixed token budgets.
-
ATD-Trans: A Geographically Grounded Japanese-English Travelogue Translation Dataset
ATD-Trans is a new geographically annotated Japanese-English travelogue dataset that reveals Japanese-enhanced models perform better on geo-entity translation while domestic Japanese locations remain harder to translate accurately.
-
HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution
HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.
-
MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents
MemPrivacy uses edge-side privacy span detection and semantic placeholders to enable cloud memory management for LLM agents while limiting utility loss to 1.6% and outperforming masking baselines.
-
BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning
BalCapRL applies balanced multi-objective RL with GDPO-style normalization and length-conditional masking to improve MLLM image captioning, reporting gains of up to +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena on LLaVA and Qwen models.
-
SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 55+ Sign Languages
SignVerse-2M provides a 2-million-clip multilingual pose-native dataset for sign language derived from public videos via DWPose preprocessing to enable robust modeling in real-world conditions.
-
Block-wise Codeword Embedding for Reliable Multi-bit Text Watermarking
BREW achieves TPR of 0.965 and FPR of 0.02 under 10% synonym substitution by shifting from ECC decoding to designated verification with block voting and local validation.
-
Cultural Benchmarking of LLMs in Standard and Dialectal Arabic Dialogues
ArabCulture-Dialogue dataset shows LLMs perform worse on dialectal Arabic than Modern Standard Arabic across cultural reasoning, translation, and generation tasks.
-
Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition
Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.
-
Measuring Distribution Shift in User Prompts and Its Effects on LLM Performance
The LENS framework applied to 192 real-world settings shows moderate natural prompt distribution shifts cause 73% average performance loss in deployed LLMs, especially across user groups and regions.
-
Enhancing Reinforcement Learning for Radiology Report Generation with Evidence-aware Rewards and Self-correcting Preference Learning
ESC-RL improves RL for radiology reports via group-wise evidence-aware rewards (GEAR) and LLM-driven self-correcting preference learning (SPL), reaching state-of-the-art on two chest X-ray datasets.
-
AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis
AICA-Bench evaluates 23 VLMs on affective image analysis, identifies weak intensity calibration and shallow descriptions as limitations, and proposes training-free Grounded Affective Tree Prompting to improve performance.
-
A Human-Centric Framework for Data Attribution in Large Language Models
Introduces a parameter-driven framework for data attribution in LLMs that enables negotiation among creators, users, and intermediaries to meet stakeholder goals within the data economy.
-
Stress Testing Factual Consistency Metrics for Long-Document Summarization
Short-form factual consistency metrics produce inconsistent scores on semantically equivalent long-document summaries and lose reliability on information-dense claims.
-
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.
-
Detecting LLM-Generated Spam Reviews by Integrating Language Model Embeddings and Graph Neural Network
Introduces FraudSquad, a hybrid model using language model embeddings and a gated graph transformer that outperforms baselines on newly created LLM-generated spam review datasets.
-
SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models
SLoW selects low-frequency word dictionaries to boost LLM translation quality and efficiency across 100 languages from FLORES.
-
Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models
DIP interleaves English word translations into non-English prompts to boost multilingual reasoning on synthetic benchmarks spanning 10-200 languages.
-
Toxic Subword Pruning for Dialogue Response Generation on Large Language Models
ToxPrune prunes toxic subwords from BPE tokenizers in LLMs to mitigate toxic dialogue responses and improve diversity on both toxic and non-toxic models.
-
Nougat: Neural Optical Understanding for Academic Documents
Nougat applies a visual transformer to convert academic PDFs into markup language while accurately handling mathematical content on a new scientific document dataset.
-
Large Language Models are not Fair Evaluators
LLMs show strong position bias when scoring model outputs, allowing easy manipulation of rankings, but calibration with multiple evidence, position balancing, and selective human input reduces this bias to better match human judgments.
-
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
-
Findings of the First Shared Task on Machine Translation Robustness
The first shared task on MT robustness received 23 submissions showing up to +22.33 BLEU gains on noisy Reddit data, with strong human-BLEU correlation.
-
COPRA: Conditional Parameter Adaptation with Reinforcement Learning for Video Anomaly Detection
COPRA introduces conditional parameter adaptation via RL to dynamically tune frozen VLMs for video anomaly detection, outperforming static methods in in-domain and cross-domain settings while generalizing to other video tasks.
-
Fine-Tuning Models for Automated Code Review Feedback
PEFT fine-tuning of Code Llama yields feedback on student Java bugs that students judge equal to ChatGPT and better than prompt engineering, using BLEU/ROUGE/BERTScore plus human ratings.
-
Towards Visually-Guided Movie Subtitle Translation for Indic Languages
Selective replacement of the worst 20-30% of text-only subtitle segments with visual-enhanced outputs raises COMET scores for Indic languages, but full visual grounding is ineffective because of temporal misalignment between subtitles and frames.