Mixed citations

Title resolution pending

Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu · 2002 · arXiv 3083.107313

Mixed citation behavior. Most common role is background (62%).

85 Pith papers citing it

Background 62% of classified citations

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 12 method 4

citation-polarity summary

background 10 use method 4 support 1 unclear 1

co-cited works

representative citing papers

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

Evaluating Very Long-Term Conversational Memory of LLM Agents

cs.CL · 2024-02-27 · unverdicted · novelty 8.0

Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.

RoFormer: Enhanced Transformer with Rotary Position Embedding

cs.CL · 2021-04-20 · accept · novelty 8.0

RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.

From Table to Cell: Attention for Better Reasoning with TABALIGN

cs.AI · 2026-05-14 · unverdicted · novelty 7.0

TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding execution 44.64%.

Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.

PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.

How English Print Media Frames Human-Elephant Conflicts in India

cs.AI · 2026-04-23 · unverdicted · novelty 7.0

English print media coverage of human-elephant conflicts in India is dominated by fear-inducing and aggression-related language.

ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.

LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

LQM introduces a six-level linguistically motivated error taxonomy for MT evaluation and applies it via expert annotation to LLM outputs on a new 3,850-sentence multi-dialect Arabic corpus.

Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs

cs.SE · 2026-04-19 · unverdicted · novelty 7.0

MultiLogBench shows that LLM performance on automated logging varies substantially across programming languages, demonstrating that single-language evidence is insufficient for general claims about model behavior or tool design.

AsymmetryZero: A Framework for Operationalizing Human Expert Preferences as Semantic Evals

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

AsymmetryZero operationalizes expert preferences as stable evaluation contracts for semantic evals, with a study showing 75.9-89.6% criterion agreement between frontier and compact model juries at 4-5% of the cost.

CWCD: Category-Wise Contrastive Decoding for Structured Medical Report Generation

cs.AI · 2026-04-12 · unverdicted · novelty 7.0

CWCD improves structured chest X-ray report generation by using category-wise contrastive decoding to reduce spurious pathology co-occurrences in multi-modal LLMs.

Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.

Evaluating In-Context Translation with Synchronous Context-Free Grammar Transduction

cs.CL · 2026-04-08 · unverdicted · novelty 7.0

LLM in-context translation accuracy falls sharply with larger grammars and longer sentences, and drops further when source and target languages differ in morphology or writing system, with common errors including wrong word recall, hallucinations, and untranslated source words.

Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

cs.CL · 2026-04-08 · conditional · novelty 7.0

SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.

DeEscalWild: A Real-World Benchmark for Automated De-Escalation Training with SLMs

cs.CL · 2026-03-20 · unverdicted · novelty 7.0

DeEscalWild supplies 1,500 high-fidelity de-escalation scenarios that let fine-tuned 3B SLMs outperform general-purpose larger models on realism and dialogue metrics.

Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation

cs.CL · 2026-02-02 · unverdicted · novelty 7.0

xMemory builds revisable hierarchical agent memory by segmenting histories, decoupling into components, and aggregating via sparsity-semantic objective, yielding better answer quality and lower token use than flat RAG on LoCoMo and PerLTQA.

DialectLLM: A Dialect-Aware Dialog[ue] Generation Framework Beyond Standard American English

cs.CL · 2026-01-30 · unverdicted · novelty 7.0

DialectLLM generates parallel multi-dialect dialog data and a 50k-dialog benchmark showing frontier LLMs achieve under 70% accuracy on dialect tasks while the generated data can improve post-training.

Creating ConLangs to Probe the Metalinguistic Grammatical Knowledge of LLMs

cs.CL · 2025-10-08 · unverdicted · novelty 7.0

IASC is an interactive modular LLM system for building ConLangs that serves as a probe for metalinguistic grammatical knowledge, revealing large performance differences across models and across common versus rare linguistic patterns.

Guidelines for Empirical Studies in Software Engineering involving Large Language Models

cs.SE · 2025-08-21 · accept · novelty 7.0 · 2 refs

The paper delivers a taxonomy of seven LLM study types in software engineering along with eight guidelines that separate mandatory requirements from recommended practices to address reproducibility challenges.

Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation

cs.CL · 2025-05-24 · unverdicted · novelty 7.0

Smoothie performs diffusion by smoothing token embeddings based on semantic similarity, outperforming prior diffusion models on sequence-to-sequence and unconditional text generation tasks.

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

cs.AI · 2024-06-14 · conditional · novelty 7.0

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA

cs.CL · 2023-11-28 · unverdicted · novelty 7.0

LoRA adapters should be scaled by 1/sqrt(rank) rather than 1/rank to stabilize learning and enable effective use of higher ranks during fine-tuning of large language models.

Prefix-Tuning: Optimizing Continuous Prompts for Generation

cs.CL · 2021-01-01 · conditional · novelty 7.0

Prefix-tuning matches or exceeds fine-tuning on NLG tasks by optimizing a continuous prefix using 0.1% of parameters while keeping the LM frozen.

citing papers explorer

Showing 5 of 5 citing papers after filters.

Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs cs.SE · 2026-04-19 · unverdicted · none · ref 49
MultiLogBench shows that LLM performance on automated logging varies substantially across programming languages, demonstrating that single-language evidence is insufficient for general claims about model behavior or tool design.
Guidelines for Empirical Studies in Software Engineering involving Large Language Models cs.SE · 2025-08-21 · accept · none · ref 102 · 2 links
The paper delivers a taxonomy of seven LLM study types in software engineering along with eight guidelines that separate mandatory requirements from recommended practices to address reproducibility challenges.
Fine-Tuning Models for Automated Code Review Feedback cs.SE · 2026-05-12 · conditional · none · ref 28
PEFT fine-tuning of Code Llama yields feedback on student Java bugs that students judge equal to ChatGPT and better than prompt engineering, using BLEU/ROUGE/BERTScore plus human ratings.
Boosting Automatic Java-to-Cangjie Translation with Multi-Stage LLM Training and Error Repair cs.SE · 2026-05-08 · unverdicted · none · ref 39
Multi-stage LLM training plus compiler-guided error repair boosts functional equivalence in Java-to-Cangjie translation by 6.06% over prior methods despite scarce parallel data.
AI Assurance: A Comprehensive Testing Strategy for Enterprise AI Systems cs.SE · 2026-05-22 · unverdicted · none · ref 7
Proposes an AI Failure Taxonomy, a five-layer AI Assurance Pyramid, and operational guidance for RAG testing and model lifecycle management in enterprise settings.

Title resolution pending

citation-role summary

citation-polarity summary

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer