Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
133 Pith papers cite this work.
Representative citing papers
DRACULA is the first dataset of user feedback on intermediate actions for deep research agents, showing that LLMs predict preferred actions better with full user history and that history-based action generation leads to higher user selection rates.
Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
Enforcing feature- and label-permutation equivariance in transformers for in-context classification yields an identifiable emergent update rule driven by mixed feature-label Gram matrices that amplifies class separation.
Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.
EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.
Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.
DIPS fine-tunes LLMs to output ordered feasible decision vectors approximating Pareto fronts for constrained bi-objective convex problems, reaching 95-98% normalized hypervolume with 0.16s inference.
Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
BICR uses blind-image contrastive ranking on frozen LVLM hidden states to train a lightweight probe that penalizes confidence on blacked-out inputs, yielding top calibration and discrimination across five models and multiple tasks at low parameter cost.
MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.
PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.
Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.
SalesSim benchmarks MLLMs as retail user simulators, finds gaps in persona adherence and over-persuasion, and introduces UserGRPO RL to raise decision alignment by 13.8%.
OBLIQ-Bench reveals that modern retrievers fail to surface documents for latent and implicit queries even though LLMs reliably recognize relevance when those documents are provided.
SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.
CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.
S3 decomposes multimodal data into selectable semantic experts, routes them adaptively, and sparsifies to achieve higher accuracy on MultiBench benchmarks with peak performance at intermediate sparsity levels.
Prosa demonstrates that rubric-based binary scoring with multi-judge filtering yields cross-judge agreement on all 16 LLM rankings for Brazilian Portuguese chats, versus only 7 of 16 under holistic scoring, while widening score gaps by 47%.
VOLTBench quantifies length volatility in LLM long-form generation; GLoBo, a logits-boosting decoder, increases mean length by 148% and cuts volatility by 69% while preserving quality.
AI Overviews and Gemini retrieve substantially different sources than traditional Google search (Jaccard similarity <0.2), favor Google-owned content, appear for 51.5% of queries (especially controversial ones), and are less consistent across repeated or slightly edited queries.
Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.
DLM4G applies graph-aware adaptive noising in a diffusion framework to generate text from graphs, outperforming larger autoregressive and diffusion baselines in factual grounding and edit sensitivity on three datasets plus molecule captioning.
A new corpus of 301,871 systematic reviews across all sciences is released with extracted method artifacts to support retrieval benchmarking and meta-research.
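One recurring measurement in the list above is set overlap: the Google/Gemini study reports Jaccard similarity below 0.2 between the sources cited by AI Overviews and those returned by classic search. A minimal sketch of that metric, using hypothetical source sets rather than the paper's data:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical source lists for one query (illustrative, not from the paper).
search_sources = {"site-a.com", "site-b.org", "site-c.net", "site-d.io"}
overview_sources = {"site-a.com", "youtube.com", "blog.google", "site-e.com"}

print(jaccard(search_sources, overview_sources))  # 1/7 ≈ 0.143, i.e. < 0.2
```

With one shared source out of seven distinct ones, the similarity is 1/7, which falls below the paper's 0.2 threshold and illustrates how small the overlap claim is in practice.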
Citing papers explorer
- OBLIQ-Bench: Exposing Overlooked Bottlenecks in Modern Retrievers with Latent and Implicit Queries
OBLIQ-Bench reveals that modern retrievers fail to surface documents for latent and implicit queries even though LLMs reliably recognize relevance when those documents are provided.
- How Generative AI Disrupts Search: An Empirical Study of Google Search, Gemini, and AI Overviews
AI Overviews and Gemini retrieve substantially different sources than traditional Google search (Jaccard similarity <0.2), favor Google-owned content, appear for 51.5% of queries (especially controversial ones), and are less consistent across repeated or slightly edited queries.
- A Large-Scale, Cross-Disciplinary Corpus of Systematic Reviews
A new corpus of 301,871 systematic reviews across all sciences is released with extracted method artifacts to support retrieval benchmarking and meta-research.
- Multilingual and Domain-Agnostic Tip-of-the-Tongue Query Generation for Simulated Evaluation
An LLM simulation framework generates multilingual tip-of-the-tongue queries, validated by rank correlation with real queries, producing the first large-scale ToT benchmarks for four languages.
- AdversarialCoT: Single-Document Retrieval Poisoning for LLM Reasoning
A single query-specific poisoned document, built by extracting and iteratively refining an adversarial chain-of-thought, can substantially degrade reasoning accuracy in retrieval-augmented LLM systems.
- MIRA: An LLM-Assisted Benchmark for Multi-Category Integrated Retrieval
MIRA is a new benchmark for multi-category integrated retrieval built from real queries on a social science platform, with LLM assistance for topic descriptions and relevance labeling across four item categories.
- A Replicability Study of XTR
XTR training does not improve retrieval effectiveness over ColBERT but enhances IVF engine efficiency by flattening token scores to produce more discriminative centroids.
- Efficient Multivector Retrieval with Token-Aware Clustering and Hierarchical Indexing
TACHIOM accelerates multivector retrieval, with up to 247x faster clustering and 9.8x faster retrieval on the MS-MARCOv1 and LoTTE benchmarks, using token-distribution-aware centroid allocation and a graph-plus-PQ index at effectiveness comparable to prior systems.
- NeocorRAG: Less Irrelevant Information, More Explicit Evidence, and More Effective Recall via Evidence Chains
NeocorRAG uses Evidence Chains to achieve SOTA retrieval quality in RAG on HotpotQA, 2WikiMultiHopQA, MuSiQue, and NQ for 3B and 70B models while using under 20% of the tokens of comparable methods.
- Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language Models
RouteHead trains a lightweight router to dynamically select optimal LLM attention heads per query for improved attention-based document re-ranking.
- Filling the Gaps: Selective Knowledge Augmentation for LLM Recommenders
KnowSA_CKP uses comparative knowledge probing to selectively augment LLM prompts for items with knowledge gaps, improving recommendation accuracy and context efficiency.
- Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference
LLMLingua prompt compression yields up to 18% end-to-end LLM speedups with unchanged quality when prompt length, ratio, and hardware align, plus an open profiler to predict the break-even point.
- Hypencoder Revisited: Reproducibility and Analysis of Non-Linear Scoring for First-Stage Retrieval
A reproducibility study confirms that Hypencoder's non-linear, query-specific scoring improves retrieval over bi-encoders on standard benchmarks, but standard methods remain faster and hard-task results are mixed due to implementation issues.
- Reproducing Adaptive Reranking for Reasoning-Intensive IR
Reproducing GAR on BRIGHT shows it boosts reasoning-intensive retrieval effectiveness with low overhead when the reranker's signal quality is strong.
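The prompt-compression entry above ("Prompt Compression in the Wild") turns on a break-even point: compression only speeds up inference when the prefill time it saves exceeds the time spent compressing. A back-of-the-envelope sketch of that tradeoff, with assumed rates and prompt sizes rather than the paper's measurements:

```python
def end_to_end_latency(prompt_tokens, compress_s, prefill_tok_per_s, keep_ratio):
    """Total latency = compression time + prefill time on the kept tokens.

    compress_s = 0 means no compression, so all tokens are prefilled."""
    kept = prompt_tokens * (keep_ratio if compress_s > 0 else 1.0)
    return compress_s + kept / prefill_tok_per_s

def break_even_tokens(compress_s, prefill_tok_per_s, keep_ratio):
    """Prompt length at which the saved prefill time equals the compression time."""
    return compress_s * prefill_tok_per_s / (1.0 - keep_ratio)

# Assumed, purely illustrative numbers: an 8k-token prompt, 2000 tok/s prefill,
# 1 s to compress, keeping 50% of tokens after compression.
print(end_to_end_latency(8000, 0.0, 2000, 1.0))  # 4.0 s without compression
print(end_to_end_latency(8000, 1.0, 2000, 0.5))  # 3.0 s with compression
print(break_even_tokens(1.0, 2000, 0.5))         # 4000.0 tokens to break even
```

Under these assumptions, prompts shorter than 4000 tokens are served faster uncompressed, which is exactly the kind of threshold the paper's profiler is meant to predict for real hardware.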