GS-QA is a new benchmark of 2,800 QA pairs spanning 28 templates that uses OSM and Wikipedia data to evaluate LLMs on spatial predicates, multi-source reasoning, and diverse answer types, including distances and counts.
emnlp-main.308/
Mixed citation behavior. Most common role is background (57%).
representative citing papers
A fitted iso-depth scaling law estimates that, in terms of validation loss, one recurrence in a looped transformer is worth r^0.46 unique blocks.
NL2SQLBench is a new modular benchmarking framework that evaluates LLM NL2SQL methods across three core modules on existing datasets, exposing large accuracy gaps and computational inefficiency.
MultiMat shows multimodal large models plus constrained search produce higher-quality procedural material graphs than text-only baselines on a new production dataset.
ProMQA-Assembly is a new multimodal procedural QA dataset with 646 pairs on assembly activities, built via LLM-generated candidates verified by humans plus 81 task graphs, and used to benchmark multimodal models.
LLMs default to responses more similar to opinions from the USA and some European and South American countries; prompting for a country shifts alignment but can introduce stereotypes, while translation does not reliably match language speakers.
TAVR-VLM introduces Risk-Conditioned Causal Grounding Attention to achieve SOTA AUROC 0.896, CIDEr 0.936, and 8.1% hallucination rate on a 1,482-patient TAVR cohort.
Position bias in on-policy distillation degrades later-token supervision; IW-OPD weights tokens by accumulated discrepancy, yielding faster convergence and up to 6.9 point gains on AIME-2025.
TPOUR uses a novel TRPO method to improve unsupervised retrievers for temporal relevance, outperforming baselines including a much larger model on nDCG@5 for explicit and implicit time queries.
MÖVE presents a new German-language benchmark evaluating 39 LLMs on performance and governance criteria using ten public-administration datasets.
Sgatlin replaces transformer FF layers with sparse single linear neurons, improving perplexity across compute budgets and enabling direct interpretation of semantically clustered circuits for factual recall.
A knowledge-graph multi-agent framework semi-automates virtual commissioning model creation by integrating Siemens TIA Portal and NX MCD data for system understanding, component generation, and signal mapping.
Controlled experiments on MNIST show human soft-labels act as a regularizer that improves calibration on hard samples and aligns model uncertainty with humans, beyond accuracy gains from correcting mislabels.
DPUA is a two-phase framework that aligns LLM uncertainty expressions with human disagreement distributions in subjectivity analysis while preserving task performance.
Latent-GRPO stabilizes reinforcement learning in latent space, delivering 7.86 Pass@1 gains on low-difficulty tasks over latent baselines and 4.27 points over explicit GRPO on high-difficulty tasks with 3-4x shorter reasoning chains.
LCF detects multiple LLM runtime threats by computing aggregated diagonal Mahalanobis distances on layer-wise hidden-state differences, calibrated on clean examples, achieving high detection rates with low overhead across several model architectures.
A context-aware Sentinel-Strategist system for RAG selectively applies defenses to block membership inference and data poisoning while recovering most retrieval utility compared to always-on defense stacks.
Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.
Adversarial explanation attacks preserve nearly all human trust in wrong AI outputs by using persuasive framing, shown in a study varying reasoning, evidence, style, and format with over 200 participants.
HyEm maps radius-controlled hyperbolic ontology embeddings to Euclidean space for ANN indexing and applies query-adaptive hyperbolic reranking to improve hierarchy-aware retrieval while preserving most Euclidean performance on flat queries.
CodeT5+ is a flexible encoder-decoder LLM family for code pretrained with diverse objectives on multilingual corpora and initialized from existing LLMs, achieving state-of-the-art results on code generation, completion, math programming, and retrieval tasks including new SoTA on HumanEval with the 1
Sparrow uses targeted rule-based human feedback and evidence provision to outperform baselines in preference while violating rules only 8% of the time under adversarial probing.
Empirical benchmarks on four SE tasks show grammar-constrained decoding and TTMG eliminate most syntax errors in LLM outputs while structural and semantic errors persist and cascade in downstream tools.
Paraphrased training prompts induce correlated cross-task differences in forgetting and generalization during LLM fine-tuning; superior prompts can be identified via pre-learning task loss and used in a state-adaptive optimization method (SAPO) to improve robustness.