StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.
URL https://aclanthology.org/2023.acl-long.557/
35 Pith papers cite this work. Polarity classification is still indexing.
Citing papers explorer
-
StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.
-
Entropy-informed Decoding: Adaptive Information-Driven Branching
EDEN adaptively sets branching factor proportional to next-token entropy, achieving better accuracy per expansion than fixed beam search while providing a proof that monotone entropy-based branching outperforms any fixed budget allocation.
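The core idea of entropy-proportional branching can be sketched in a few lines: compute the Shannon entropy of the next-token distribution and scale the branch factor between a minimum and maximum accordingly. The function names, bounds, and the entropy cap below are illustrative assumptions, not EDEN's actual hyperparameters.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def adaptive_branch_factor(probs, k_min=1, k_max=8, h_cap=3.0):
    """Branch factor proportional to next-token entropy, clipped to
    [k_min, k_max]. h_cap is a hypothetical saturation point: any
    entropy at or above it gets the maximum branch factor."""
    h = min(entropy(probs), h_cap)
    k = k_min + round((k_max - k_min) * h / h_cap)
    return max(k_min, min(k_max, k))
```

With these toy settings, a near-deterministic distribution branches once, while a flat distribution over many tokens saturates at the maximum, which is the behavior a fixed beam width cannot reproduce.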
-
COPYCOP: Ownership Verification for Graph Neural Networks
COPYCOP identifies copycat GNNs by matching their node embeddings despite architectural differences and adversarial transformations, backed by theoretical guarantees and tests on 14 datasets across 5 architectures.
-
MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio
MedMosaic is a large-scale medical audio QA benchmark that shows even state-of-the-art models like Gemini-2.5-pro reach only about 68% accuracy on diverse clinical audio scenarios.
-
When More Reformulations Hurt: Avoiding Drift using Ranker Feedback
ReformIR adaptively prioritizes reformulations and documents with a surrogate model guided by ranker feedback to boost recall while suppressing drift under fixed reranking budgets.
-
How Generative AI Disrupts Search: An Empirical Study of Google Search, Gemini, and AI Overviews
AI Overviews and Gemini retrieve substantially different sources than traditional Google search (Jaccard similarity <0.2), favor Google-owned content, appear for 51.5% of queries (especially controversial ones), and are less consistent across repeated or slightly edited queries.
-
FlowBot: Inducing LLM Workflows with Bilevel Optimization and Textual Gradients
FlowBot automatically induces LLM workflows through bilevel optimization with textual gradients, achieving competitive performance against human-crafted baselines.
-
Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing
Skill-RAG detects retrieval failure states from hidden representations and routes to one of four corrective skills to raise accuracy on persistent hard cases in open-domain QA and reasoning benchmarks.
-
Task-Adaptive Embedding Refinement via Test-time LLM Guidance
Test-time LLM feedback refines query embeddings to deliver up to 25% relative gains on zero-shot literature search, intent detection, and related benchmarks.
-
An Annotation Scheme and Classifier for Personal Facts in Dialogue
An extended annotation scheme with new categories and attributes plus a Gemma-300M-based multi-head classifier achieves 81.6% macro F1 on personal fact classification, outperforming few-shot LLM baselines by nearly 9 points with lower compute.
-
Teachers' Perceived Benefits and Risks of AI Across Fifty-Five Countries: An Audit of LLM Alignment and Steerability
Teachers' views on AI benefits and risks vary widely across 55 countries, but LLMs compress these differences, overestimate both sides, and show little improvement from country prompting or better reasoning.
-
A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering
Sem-ECE is an asymptotically unbiased calibration error estimator for open-ended QA that uses semantic sampling of answers to derive confidence from class frequencies, with two variants that diverge on hard questions.
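The frequency-based confidence at the heart of this setup can be sketched as: sample several answers to the same question, group them into semantic-equivalence classes, and read off the modal class frequency as confidence. The `equiv` canonicalizer below is a hypothetical stand-in for a real semantic matcher; this is an illustration of the sampling idea, not Sem-ECE's estimator itself.

```python
from collections import Counter

def frequency_confidence(answers, equiv):
    """Group sampled answers into semantic-equivalence classes via
    `equiv` and return the modal class with its relative frequency,
    interpreted as the model's confidence in that answer."""
    classes = Counter(equiv(a) for a in answers)
    top_class, count = classes.most_common(1)[0]
    return top_class, count / len(answers)
```

On hard questions the sampled answers scatter across many classes, so the modal frequency drops, which is where the paper's two variants are said to diverge.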
-
MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence
MedVIGIL introduces a clinician-supervised benchmark showing medical VLMs frequently give fluent answers on broken visual evidence, with top models 14 points below human radiologists on the composite score.
-
Multi-Dimensional Evaluation of LLMs for Grammatical Error Correction
Fine-tuned GPT-4o reaches state-of-the-art on grammatical error correction while reference-based metrics underestimate performance by missing 73.76% of valid or superior outputs.
-
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
HyperEyes presents a parallel multimodal search agent using dual-grained efficiency-aware RL with a new TRACE reward and IMEB benchmark, claiming 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source agents.
-
Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation
CoE applies vision-language models directly to document screenshots to deliver pixel-level bounding-box attribution for evidence in iterative retrieval-augmented generation, outperforming text baselines on visual-layout tasks.
-
SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring
SIEVES improves selective prediction coverage up to 3x on OOD VQA benchmarks by training a selector on visual localization quality, generalizing across datasets and proprietary reasoners without specific adaptation.
-
Looking for the Bottleneck in Fine-grained Temporal Relation Classification
An endpoint point-relation classifier followed by decoding to interval relations achieves 70.1% temporal awareness on TempEval-3, setting a new state-of-the-art for the full set of fine-grained temporal relations.
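Decoding interval relations from endpoint relations can be illustrated with a toy decoder over numeric endpoints: compare the start and end points of two intervals and map the outcome to an interval relation. The paper's classifier predicts the pairwise point relations rather than observing timestamps, and the relation names below are a simplified, illustrative subset.

```python
def interval_relation(s1, e1, s2, e2):
    """Toy decoder from endpoint order to an interval relation.
    s/e are the start and end points of intervals 1 and 2."""
    if e1 < s2:
        return "BEFORE"
    if e2 < s1:
        return "AFTER"
    if s1 == s2 and e1 == e2:
        return "SIMULTANEOUS"
    if s2 <= s1 and e1 <= e2:
        return "IS_INCLUDED"
    if s1 <= s2 and e2 <= e1:
        return "INCLUDES"
    return "VAGUE"
```

The appeal of the two-stage design is that the classifier only has to learn point-level comparisons; consistency across the full fine-grained relation set then comes from the deterministic decoding step.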
-
Evaluation-driven Scaling for Scientific Discovery
SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdős constructions.
-
A Case Study on the Impact of Anonymization Along the RAG Pipeline
Anonymization placement in RAG—at the dataset or at the generated answer—creates observable differences in privacy protection versus response utility.
-
GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking
GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on four audio LLMs.
-
Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference
LLMLingua prompt compression yields up to 18% end-to-end LLM speedups with unchanged quality when prompt length, ratio, and hardware align, plus an open profiler to predict the break-even point.
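The break-even logic behind such a profiler reduces to a simple cost comparison: compression pays off when the prefill time saved on removed tokens exceeds the time spent compressing. The cost model and per-token timings below are hypothetical placeholders, not the paper's measured profiler.

```python
def compression_pays_off(prompt_tokens, keep_ratio,
                         t_prefill_per_tok, t_compress_per_tok):
    """Toy break-even check for prompt compression: True when the
    prefill time saved on dropped tokens exceeds compression cost.
    keep_ratio is the fraction of tokens retained after compression."""
    saved = prompt_tokens * (1 - keep_ratio) * t_prefill_per_tok
    cost = prompt_tokens * t_compress_per_tok
    return saved > cost
```

Under this toy model, long prompts with aggressive compression clear the break-even point easily, while short prompts can be net-negative, matching the paper's caveat that speedups depend on prompt length, ratio, and hardware aligning.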
-
GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression
GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass execution with modular flexibility.
-
When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models
Sycophancy is a boundary failure between social alignment and epistemic integrity, captured by a three-condition framework plus taxonomy of targets, mechanisms, and severity.
-
LLM-Oriented Information Retrieval: A Denoising-First Perspective
Denoising to maximize usable evidence density and verifiability is becoming the primary bottleneck in LLM-oriented information retrieval, conceptualized via a four-stage framework and addressed through a pipeline taxonomy of optimization techniques.
-
Sustained Gradient Alignment Mediates Subliminal Learning in a Multi-Step Setting: Evidence from MNIST Auxiliary Logit Distillation Experiment
Gradient alignment persists throughout multi-step distillation training and causally drives unintended teacher trait acquisition in the student, while liminal training attenuates alignment but does not stop the acquisition.
-
Shared Lexical Task Representations Explain Behavioral Variability In LLMs
LLMs share task-specific attention heads across prompting styles, with activation strength explaining performance differences and failures arising from competing representations.
-
pAI/MSc: ML Theory Research with Humans on the Loop
pAI/MSc is a customizable multi-agent system that reduces human steering by orders of magnitude when turning a hypothesis into a literature-grounded, mathematically established, experimentally supported manuscript draft in ML theory.
-
Differentiable Conformal Training for LLM Reasoning Factuality
DCF relaxes non-differentiable conformal factuality for LLM reasoning chains into a trainable form, yielding up to 141% higher retention of true claims on benchmarks while preserving reliability guarantees.
-
Impact of large language models on peer review opinions from a fine-grained perspective: Evidence from top conference proceedings in AI
Peer review reports in AI conferences have grown longer and more standardized after LLMs, with increased emphasis on surface-level clarity and summaries at the expense of deeper critiques on originality and replicability.
-
Less Approximates More: Harmonizing Performance and Confidence Faithfulness via Hybrid Post-Training for High-Stakes Tasks
HyTuning uses a progressive reasoning gain metric to reweight reasoning distillation and RLIF, improving both accuracy and confidence faithfulness in LLMs under limited supervision.
-
K-CARE: Knowledge-driven Symmetrical Contextual Anchoring and Analogical Prototype Reasoning for E-commerce Relevance
K-CARE uses behavior-derived anchoring and expert prototype analogies to ground LLMs and improve relevance on knowledge-intensive e-commerce cases.
-
Mira-Embeddings-V1: Domain-Adapted Semantic Reranking for Recruitment via LLM-Synthesized Data
Mira-Embeddings-V1 adapts embeddings for recruitment reranking by synthesizing positive and hard-negative samples with LLMs, then applies JD-JD contrastive and JD-CV triplet training plus a BoundaryHead MLP, lifting Recall@50 from 68.89% to 77.55% and Recall@200 from 59.69% to 70.47%.
-
Beyond the Parameters: A Technical Survey of Contextual Enrichment in Large Language Models: From In-Context Prompting to Causal Retrieval-Augmented Generation
The survey unifies LLM augmentation techniques along a single axis, structured context supplied at inference time, and provides a literature screening protocol plus a deployment decision framework.
-
Fairness in Multi-Agent Systems for Software Engineering: An SDLC-Oriented Rapid Review
A rapid review of fairness in LLM-enabled multi-agent systems for the software development lifecycle concludes that the field lacks standardized evaluations, broad coverage, and effective governance, leaving it unprepared for deployable fair systems.