URL https://aclanthology.org/2023.acl-long.557/

doi: 10 · 2023 · DOI 10.18653/v1/2023

35 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs

cs.CY · 2026-05-11 · accept · novelty 7.0 · 2 refs

StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.

Entropy-informed Decoding: Adaptive Information-Driven Branching

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

EDEN adaptively sets branching factor proportional to next-token entropy, achieving better accuracy per expansion than fixed beam search while providing a proof that monotone entropy-based branching outperforms any fixed budget allocation.

COPYCOP: Ownership Verification for Graph Neural Networks

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

COPYCOP identifies copycat GNNs by matching their node embeddings despite architectural differences and adversarial transformations, backed by theoretical guarantees and tests on 14 datasets across 5 architectures.

MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio

cs.SD · 2026-05-01 · unverdicted · novelty 7.0

MedMosaic is a large-scale medical audio QA benchmark that shows even state-of-the-art models like Gemini-2.5-pro reach only about 68% accuracy on diverse clinical audio scenarios.

When More Reformulations Hurt: Avoiding Drift using Ranker Feedback

cs.IR · 2026-05-01 · unverdicted · novelty 7.0

ReformIR adaptively prioritizes reformulations and documents with a surrogate model guided by ranker feedback to boost recall while suppressing drift under fixed reranking budgets.

How Generative AI Disrupts Search: An Empirical Study of Google Search, Gemini, and AI Overviews

cs.IR · 2026-04-30 · unverdicted · novelty 7.0

AI Overviews and Gemini retrieve substantially different sources than traditional Google search (Jaccard similarity <0.2), favor Google-owned content, appear for 51.5% of queries especially controversial ones, and are less consistent across repeated or slightly edited queries.

FlowBot: Inducing LLM Workflows with Bilevel Optimization and Textual Gradients

cs.CL · 2026-04-29 · unverdicted · novelty 7.0

FlowBot automatically induces LLM workflows through bilevel optimization with textual gradients, achieving competitive performance against human-crafted baselines.

Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing

cs.CL · 2026-04-17 · unverdicted · novelty 7.0

Skill-RAG detects retrieval failure states from hidden representations and routes to one of four corrective skills to raise accuracy on persistent hard cases in open-domain QA and reasoning benchmarks.

Task-Adaptive Embedding Refinement via Test-time LLM Guidance

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

Test-time LLM feedback refines query embeddings to deliver up to 25% relative gains on zero-shot literature search, intent detection, and related benchmarks.

An Annotation Scheme and Classifier for Personal Facts in Dialogue

cs.CL · 2026-05-11 · accept · novelty 6.0

An extended annotation scheme with new categories and attributes plus a Gemma-300M-based multi-head classifier achieves 81.6% macro F1 on personal fact classification, outperforming few-shot LLM baselines by nearly 9 points with lower compute.

Teachers' Perceived Benefits and Risks of AI Across Fifty-Five Countries: An Audit of LLM Alignment and Steerability

cs.CY · 2026-05-08 · unverdicted · novelty 6.0

Teachers' views on AI benefits and risks vary widely across 55 countries, but LLMs compress these differences, overestimate both sides, and show little improvement from country prompting or better reasoning.

A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

Sem-ECE is an asymptotically unbiased calibration error estimator for open-ended QA that uses semantic sampling of answers to derive confidence from class frequencies, with two variants that diverge on hard questions.

MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

MedVIGIL introduces a clinician-supervised benchmark showing medical VLMs frequently give fluent answers on broken visual evidence, with top models 14 points below human radiologists on the composite score.

Multi-Dimensional Evaluation of LLMs for Grammatical Error Correction

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

Fine-tuned GPT-4o reaches state-of-the-art on grammatical error correction while reference-based metrics underestimate performance by missing 73.76 percent of valid or superior outputs.

HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

cs.LG · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

HyperEyes presents a parallel multimodal search agent using dual-grained efficiency-aware RL with a new TRACE reward and IMEB benchmark, claiming 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source agents.

Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

cs.CV · 2026-05-02 · unverdicted · novelty 6.0

CoE applies vision-language models directly to document screenshots to deliver pixel-level bounding-box attribution for evidence in iterative retrieval-augmented generation, outperforming text baselines on visual-layout tasks.

SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

cs.CV · 2026-04-28 · unverdicted · novelty 6.0

SIEVES improves selective prediction coverage up to 3x on OOD VQA benchmarks by training a selector on visual localization quality, generalizing across datasets and proprietary reasoners without specific adaptation.

Looking for the Bottleneck in Fine-grained Temporal Relation Classification

cs.CL · 2026-04-27 · unverdicted · novelty 6.0

An endpoint point-relation classifier followed by decoding to interval relations achieves 70.1% temporal awareness on TempEval-3, setting a new state-of-the-art for the full set of fine-grained temporal relations.

Evaluation-driven Scaling for Scientific Discovery

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdős constructions.

A Case Study on the Impact of Anonymization Along the RAG Pipeline

cs.CR · 2026-04-17 · unverdicted · novelty 6.0

Anonymization placement in RAG—at the dataset or at the generated answer—creates observable differences in privacy protection versus response utility.

GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking

cs.SD · 2026-04-10 · unverdicted · novelty 6.0

GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on four audio LLMs.

Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference

cs.IR · 2026-04-03 · conditional · novelty 6.0

LLMLingua prompt compression yields up to 18% end-to-end LLM speedups with unchanged quality when prompt length, ratio, and hardware align, plus an open profiler to predict the break-even point.

GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression

cs.CL · 2026-05-09 · unverdicted · novelty 5.0 · 2 refs

GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass execution with modular flexibility.

When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models

cs.AI · 2026-05-06 · unverdicted · novelty 5.0

Sycophancy is a boundary failure between social alignment and epistemic integrity, captured by a three-condition framework plus taxonomy of targets, mechanisms, and severity.

citing papers explorer

Showing 35 of 35 citing papers.

StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs cs.CY · 2026-05-11 · accept · none · ref 58 · 2 links
StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.
Entropy-informed Decoding: Adaptive Information-Driven Branching cs.LG · 2026-05-10 · unverdicted · none · ref 7
EDEN adaptively sets branching factor proportional to next-token entropy, achieving better accuracy per expansion than fixed beam search while providing a proof that monotone entropy-based branching outperforms any fixed budget allocation.
COPYCOP: Ownership Verification for Graph Neural Networks cs.LG · 2026-05-06 · unverdicted · none · ref 26
COPYCOP identifies copycat GNNs by matching their node embeddings despite architectural differences and adversarial transformations, backed by theoretical guarantees and tests on 14 datasets across 5 architectures.
MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio cs.SD · 2026-05-01 · unverdicted · none · ref 4
MedMosaic is a large-scale medical audio QA benchmark that shows even state-of-the-art models like Gemini-2.5-pro reach only about 68% accuracy on diverse clinical audio scenarios.
When More Reformulations Hurt: Avoiding Drift using Ranker Feedback cs.IR · 2026-05-01 · unverdicted · none · ref 9
ReformIR adaptively prioritizes reformulations and documents with a surrogate model guided by ranker feedback to boost recall while suppressing drift under fixed reranking budgets.
How Generative AI Disrupts Search: An Empirical Study of Google Search, Gemini, and AI Overviews cs.IR · 2026-04-30 · unverdicted · none · ref 63
AI Overviews and Gemini retrieve substantially different sources than traditional Google search (Jaccard similarity <0.2), favor Google-owned content, appear for 51.5% of queries especially controversial ones, and are less consistent across repeated or slightly edited queries.
FlowBot: Inducing LLM Workflows with Bilevel Optimization and Textual Gradients cs.CL · 2026-04-29 · unverdicted · none · ref 5
FlowBot automatically induces LLM workflows through bilevel optimization with textual gradients, achieving competitive performance against human-crafted baselines.
Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing cs.CL · 2026-04-17 · unverdicted · none · ref 14
Skill-RAG detects retrieval failure states from hidden representations and routes to one of four corrective skills to raise accuracy on persistent hard cases in open-domain QA and reasoning benchmarks.
Task-Adaptive Embedding Refinement via Test-time LLM Guidance cs.CL · 2026-05-12 · unverdicted · none · ref 36
Test-time LLM feedback refines query embeddings to deliver up to 25% relative gains on zero-shot literature search, intent detection, and related benchmarks.
An Annotation Scheme and Classifier for Personal Facts in Dialogue cs.CL · 2026-05-11 · accept · none · ref 24
An extended annotation scheme with new categories and attributes plus a Gemma-300M-based multi-head classifier achieves 81.6% macro F1 on personal fact classification, outperforming few-shot LLM baselines by nearly 9 points with lower compute.
Teachers' Perceived Benefits and Risks of AI Across Fifty-Five Countries: An Audit of LLM Alignment and Steerability cs.CY · 2026-05-08 · unverdicted · none · ref 10
Teachers' views on AI benefits and risks vary widely across 55 countries, but LLMs compress these differences, overestimate both sides, and show little improvement from country prompting or better reasoning.
A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering cs.CL · 2026-05-08 · unverdicted · none · ref 3
Sem-ECE is an asymptotically unbiased calibration error estimator for open-ended QA that uses semantic sampling of answers to derive confidence from class frequencies, with two variants that diverge on hard questions.
MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence cs.CV · 2026-05-08 · unverdicted · none · ref 17
MedVIGIL introduces a clinician-supervised benchmark showing medical VLMs frequently give fluent answers on broken visual evidence, with top models 14 points below human radiologists on the composite score.
Multi-Dimensional Evaluation of LLMs for Grammatical Error Correction cs.CL · 2026-05-08 · unverdicted · none · ref 20
Fine-tuned GPT-4o reaches state-of-the-art on grammatical error correction while reference-based metrics underestimate performance by missing 73.76 percent of valid or superior outputs.
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents cs.LG · 2026-05-08 · unverdicted · none · ref 27 · 2 links
HyperEyes presents a parallel multimodal search agent using dual-grained efficiency-aware RL with a new TRACE reward and IMEB benchmark, claiming 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source agents.
Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation cs.CV · 2026-05-02 · unverdicted · none · ref 12
CoE applies vision-language models directly to document screenshots to deliver pixel-level bounding-box attribution for evidence in iterative retrieval-augmented generation, outperforming text baselines on visual-layout tasks.
SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring cs.CV · 2026-04-28 · unverdicted · none · ref 50
SIEVES improves selective prediction coverage up to 3x on OOD VQA benchmarks by training a selector on visual localization quality, generalizing across datasets and proprietary reasoners without specific adaptation.
Looking for the Bottleneck in Fine-grained Temporal Relation Classification cs.CL · 2026-04-27 · unverdicted · none · ref 8
An endpoint point-relation classifier followed by decoding to interval relations achieves 70.1% temporal awareness on TempEval-3, setting a new state-of-the-art for the full set of fine-grained temporal relations.
Evaluation-driven Scaling for Scientific Discovery cs.LG · 2026-04-21 · unverdicted · none · ref 94
SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdős constructions.
A Case Study on the Impact of Anonymization Along the RAG Pipeline cs.CR · 2026-04-17 · unverdicted · none · ref 14
Anonymization placement in RAG—at the dataset or at the generated answer—creates observable differences in privacy protection versus response utility.
GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking cs.SD · 2026-04-10 · unverdicted · none · ref 44
GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on four audio LLMs.
Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference cs.IR · 2026-04-03 · conditional · none · ref 3
LLMLingua prompt compression yields up to 18% end-to-end LLM speedups with unchanged quality when prompt length, ratio, and hardware align, plus an open profiler to predict the break-even point.
GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression cs.CL · 2026-05-09 · unverdicted · none · ref 10 · 2 links
GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass execution with modular flexibility.
When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models cs.AI · 2026-05-06 · unverdicted · none · ref 32
Sycophancy is a boundary failure between social alignment and epistemic integrity, captured by a three-condition framework plus taxonomy of targets, mechanisms, and severity.
LLM-Oriented Information Retrieval: A Denoising-First Perspective cs.IR · 2026-05-01 · unverdicted · none · ref 219
Denoising to maximize usable evidence density and verifiability is becoming the primary bottleneck in LLM-oriented information retrieval, conceptualized via a four-stage framework and addressed through a pipeline taxonomy of optimization techniques.
Sustained Gradient Alignment Mediates Subliminal Learning in a Multi-Step Setting: Evidence from MNIST Auxiliary Logit Distillation Experiment cs.LG · 2026-04-28 · unverdicted · none · ref 5
Gradient alignment persists throughout multi-step distillation training and causally drives unintended teacher trait acquisition in the student, while liminal training attenuates alignment but does not stop the acquisition.
Shared Lexical Task Representations Explain Behavioral Variability In LLMs cs.CL · 2026-04-23 · unverdicted · none · ref 14
LLMs share task-specific attention heads across prompting styles, with activation strength explaining performance differences and failures arising from competing representations.
pAI/MSc: ML Theory Research with Humans on the Loop cs.AI · 2026-04-22 · unverdicted · none · ref 78
pAI/MSc is a customizable multi-agent system that reduces human steering by orders of magnitude when turning a hypothesis into a literature-grounded, mathematically established, experimentally supported manuscript draft in ML theory.
Differentiable Conformal Training for LLM Reasoning Factuality cs.LG · 2026-04-22 · unverdicted · none · ref 3
DCF relaxes non-differentiable conformal factuality for LLM reasoning chains into a trainable form, yielding up to 141% higher retention of true claims on benchmarks while preserving reliability guarantees.
Impact of large language models on peer review opinions from a fine-grained perspective: Evidence from top conference proceedings in AI cs.CL · 2026-04-21 · unverdicted · none · ref 54
Peer review reports in AI conferences have grown longer and more standardized after LLMs, with increased emphasis on surface-level clarity and summaries at the expense of deeper critiques on originality and replicability.
Less Approximates More: Harmonizing Performance and Confidence Faithfulness via Hybrid Post-Training for High-Stakes Tasks cs.LG · 2026-04-09 · unverdicted · none · ref 1
HyTuning uses a progressive reasoning gain metric to reweight reasoning distillation and RLIF, improving both accuracy and confidence faithfulness in LLMs under limited supervision.
K-CARE: Knowledge-driven Symmetrical Contextual Anchoring and Analogical Prototype Reasoning for E-commerce Relevance cs.IR · 2026-04-28 · unverdicted · none · ref 7
K-CARE uses behavior-derived anchoring and expert prototype analogies to ground LLMs and improve relevance on knowledge-intensive e-commerce cases.
Mira-Embeddings-V1: Domain-Adapted Semantic Reranking for Recruitment via LLM-Synthesized Data cs.CL · 2026-04-20 · conditional · none · ref 26
Mira-Embeddings-V1 adapts embeddings for recruitment reranking by synthesizing positive and hard-negative samples with LLMs, then applies JD-JD contrastive and JD-CV triplet training plus a BoundaryHead MLP, lifting Recall@50 from 68.89% to 77.55% and Recall@200 from 0.5969 to 0.7047.
Beyond the Parameters: A Technical Survey of Contextual Enrichment in Large Language Models: From In-Context Prompting to Causal Retrieval-Augmented Generation cs.CL · 2026-04-03 · unverdicted · none · ref 13
The survey unifies LLM augmentation techniques along the single axis of structured context supplied at inference time and supplies a literature screening protocol plus deployment decision framework.
Fairness in Multi-Agent Systems for Software Engineering: An SDLC-Oriented Rapid Review cs.SE · 2026-04-10 · unverdicted · none · ref 14
A rapid review of fairness in LLM-enabled multi-agent systems for the software development lifecycle concludes that the field lacks standardized evaluations, broad coverage, and effective governance, leaving it unprepared for deployable fair systems.