SwissGov-RSD is the first naturalistic cross-lingual document-level benchmark with human token-level semantic difference annotations, on which both LLMs and encoders show a large performance gap relative to simpler settings.
Mixed citation behavior. Most common role is background (50%).
claims ledger
- background
  Flesch-Kincaid Grade Level 8.97 9.08 -0.11 -0.1673 -0.1528
  Table 5: Textual complexity metrics and their correlation with frequency. Corr. denotes correlation. We use nlp = spacy.load("en_core_web_sm") for calculation.
  Bin Range           N    BLEU(HF)  BLEU(LF)  ∆BLEU(HF-LF)  chrF(HF)  chrF(LF)  ∆chrF(HF-LF)
  Strict Depth Match  144  20.82     16.04     +4.78         48.73     43.86     +4.87
  [0%,5%)             144  20.82     16.04     +4.78         48.73     43.86     +4.87
  [5%,10%)            6    22.45     14.79     +7.65         49.76     49.19     +0.57
  [10%,15%)           71   19.12     15.38     +3.74         46.19     44.71     +1.47
  [15%,2
representative citing papers
BEAVER is the first text-to-SQL benchmark from private enterprise data warehouses, revealing SOTA agentic frameworks achieve only 10.8% accuracy on complex real-world queries.
A rule-generation perspective lets LLMs write programs as rules for data mapping and applies complexity theory to estimate their compositionality, tested on string-to-grid tasks.
OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
A cross-cultural survey of 4,641 participants shows that LLM emotional support adoption varies widely by country and demographics, with socioeconomic status as the strongest predictor of trust and use, and English-speaking nations more accepting than others in Europe.
VLMs reach only 42.1% exact accuracy on counting pushups in videos, with weaker models exploiting modal counts, and 1k-sample fine-tuning transfers gains to MVBench, PerceptionTest, and TVBench.
A controlled formal language task reveals fine-tuning outperforms in-context learning on in-distribution generalization but equals it on out-of-distribution, with ICL showing greater sensitivity to model size and tokenization.
StoryTR is a new benchmark and agentic data pipeline that adds explicit Theory of Mind reasoning chains to train smaller video retrieval models, yielding a 15% relative IoU gain over larger baselines on narrative content.
Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.
BERAG applies Bayesian ensemble weighting of individual documents via token-by-token posterior updates in retrieval-augmented generation, yielding gains on knowledge-based visual QA tasks.
Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.
BiasedTales-ML provides a parallel multilingual corpus of LLM-generated children's stories that reveals substantial cross-lingual differences in narrative attributes not captured by English-centric analyses.
Conjunctive prompt attacks split adversarial elements across agents and routing paths in multi-agent LLM systems, evading isolated defenses and succeeding through topology-aware optimization.
VisPCO uses continuous relaxation, straight-through estimators, and budget-aware Pareto-frontier learning to automatically discover optimal visual token pruning configurations that approximate grid-search results across VLMs and benchmarks.
HintPilot synthesizes semantics-preserving compiler hints via retrieval-augmented LLM generation and profiling-guided refinement, delivering up to 6.88x geometric mean speedup over -Ofast on PolyBench and HumanEval-CPP while preserving correctness.
R²A uses a hybrid ensemble surrogate router and suffix optimization to significantly increase black-box LLM router selection of expensive models across query distributions.
ADAPT augments planners with affordance reasoning to raise task success in environments with unspecified and time-varying object affordances, and a LoRA-finetuned VLM backend beats GPT-4o on the new DynAfford benchmark.
Schema-key wording functions as an implicit instruction channel under constrained decoding, with experiments showing that rephrasing only the keys can substantially change accuracy on math benchmarks while prompt, model, structure, and decoding remain unchanged.
SPAGBias reveals that LLMs form nuanced gender associations with specific urban micro-spaces that exceed real-world distributions and produce failures in planning and descriptive tasks.
CAR is a new retrieval objective that targets the currently active authority set rather than most-similar documents, with theorems on coverage conditions and evaluations showing two-stage methods outperform dense retrieval on authority-governed datasets.
Multimodal ICL lags text-only ICL in few-shot settings due to weak cross-modal reasoning alignment and unreliable task mapping transfer, with an inference-stage method proposed to strengthen transfer.
Reinforcement learning with a multi-part reward teaches LLMs to output independent, meaning-preserving sentence edits that raise argument appropriateness close to full rewriting.
Tabular QA LLMs are overconfident, but Multi-Format Agreement using Markdown/HTML/JSON/CSV variants improves AUROC to 0.80 and cuts calibration error by 44-63% at lower cost than sampling.
EgoEsportsQA is a new egocentric video QA benchmark from esports matches that shows state-of-the-art Video-LLMs reach only 71.58% accuracy and struggle more with tactical reasoning than basic perception.
citing papers explorer
-
EviSearch: A Human in the Loop System for Extracting and Auditing Clinical Evidence for Systematic Reviews
EviSearch automates ontology-aligned clinical evidence table creation from native PDFs with comprehensive provenance logging for auditability and iterative improvement.
-
Evaluating LLMs as Human Surrogates in Controlled Experiments
LLMs reproduce several directional effects from a human accuracy perception experiment but show inconsistent effect magnitudes and moderation patterns across models.
-
SpidR-Adapt: A Universal Speech Representation Model for Few-Shot Adaptation
SpidR-Adapt uses meta-learning with a first-order bi-level optimization heuristic to adapt speech representations to new languages with less than 1 hour of data, achieving 100x better efficiency than standard training.
-
Understanding Structured Financial Data with LLMs: A Case Study on Fraud Detection
FinFRE-RAG combines importance-guided feature reduction with label-aware retrieval-augmented generation to boost LLM performance on tabular fraud detection across four public datasets while providing human-readable rationales.
-
Progress Ratio Embeddings: An Impatience Signal for Robust Length Control in Neural Text Generation
Progress Ratio Embeddings use a trigonometric progress-ratio signal to deliver stable length control in transformers that generalizes to unseen target lengths.
-
Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning
Entropy Ratio Clipping introduces a global entropy-ratio constraint that stabilizes RL policy updates in LLM post-training beyond local PPO clipping.
-
CodeDistiller: Automatically Generating Code Libraries for Scientific Coding Agents
CodeDistiller distills 250 materials-science GitHub repositories into vetted code libraries that improve the accuracy and scientific soundness of experiments generated by ASD agents.
-
PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark
PEFT-Bench is a standardized end-to-end benchmark for 7 PEFT methods across 27 NLP datasets on autoregressive LLMs, accompanied by the PSCP metric that penalizes based on trainable parameters, inference speed, and training memory.
-
Stress Testing Factual Consistency Metrics for Long-Document Summarization
Short-form factual consistency metrics produce inconsistent scores on semantically equivalent long-document summaries and lose reliability on information-dense claims.
-
ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations
The ReFACT benchmark reveals a salient-distractor failure mode in LLMs that persists across model sizes: 61% of incorrect error-span predictions are semantically unrelated to the actual errors. Comparative judgment also yields lower F1 than independent detection.
-
CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning
CE-GPPO preserves bounded gradients from clipped tokens in PPO to regulate entropy evolution and improve performance on mathematical reasoning benchmarks.
-
MOSAIC: A Multilingual, Taxonomy-Agnostic, and Computationally Efficient Approach for Radiological Report Classification
MOSAIC achieves mean macro F1 of 88 on chest X-ray report classification across five datasets in four languages using a 4B-parameter open model with low GPU memory and few-shot or light fine-tuning options.
-
ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs
ReGATE introduces a teacher-student adaptive token elision method that reduces training tokens to 38% while matching or exceeding baseline accuracy on multimodal benchmarks.
-
SessionIntentBench: A Multi-task Inter-session Intention-shift Modeling Benchmark for E-commerce Customer Behavior Understanding
SessionIntentBench is a large-scale multimodal benchmark for inter-session intention-shift modeling in e-commerce, with 1.95M intention entries and human-annotated gold labels showing current L(V)LMs struggle but improve when intention is injected.
-
SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models
SLoW selects low-frequency word dictionaries to boost LLM translation quality and efficiency across 100 languages from FLORES.
-
Synthia: Scalable Grounded Persona Generation from Social Media Data
Synthia creates scalable personas from Bluesky posts that better match human survey responses than prior methods, uses smaller models, and retains social network structure for network-aware analysis.
-
PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation
PromptSuite is a modular, extensible, task-agnostic framework for automatically generating diverse prompt variations to support robust multi-prompt LLM evaluation.
-
When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models
The work identifies a small set of attention heads in VLMs that mediate conflicts between parametric knowledge and visual input, and shows that intervening on them steers model behavior while attention patterns provide precise image-region attribution.
-
WiseMind: a knowledge-guided multi-agent framework for accurate and empathetic psychiatric diagnosis
WiseMind is a dual-agent LLM system with DSM-5 knowledge graph guidance that reaches 85.6% top-1 diagnostic accuracy on simulated and real psychiatric conversations while producing supportive responses.
-
Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models
Adapts multi-layer token-level Mahalanobis distance with supervised linear regression to yield improved uncertainty scores for LLM truthfulness tasks.
-
MapNav: A Novel Memory Representation via Annotated Semantic Maps for Vision-and-Language Navigation
MapNav uses annotated semantic maps as memory for VLN agents, claiming SOTA results in simulation and real-world tests while promising code and data release.
-
CounterBench: Evaluating and Improving Counterfactual Reasoning in Large Language Models
Introduces CounterBench benchmark and CoIn iterative reasoning method showing LLMs perform near random on formal counterfactual tasks but improve substantially with guided backtracking.
-
MultiFileTest: A Multi-File-Level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms
Frontier LLMs achieve only moderate performance on multi-file unit test generation, with basic executability and cascade errors common, but manual and self-error-fixing mechanisms yield measurable gains.
-
The Differences Between Direct Alignment Algorithms are a Blur
A controlled unification of direct alignment algorithms shows the ranking objective (pairwise vs pointwise) drives alignment quality more than the scalar score optimized.
-
Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning
CD-MoE condenses fine-grained MoE layers with shared experts into dense layers, retaining 90% accuracy with 27.5% memory cut and 1.26x speedup on DeepSeekMoE-16B, recovering 98% via brief fine-tuning.
-
A Computational Method for Measuring "Open Codes" in Qualitative Analysis
A method merges codebooks via LLM and evaluates human and AI inductive coding with four new metrics on an online conversation dataset.
-
Why These Documents? Explainable Generative Retrieval with Hierarchical Category Paths
HyPE improves generative retrieval by first generating hierarchical category paths for explainability and then using path-aware ranking to boost performance.
-
Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models
DIP interleaves English word translations into non-English prompts to boost multilingual reasoning on synthetic benchmarks spanning 10-200 languages.
-
Toxic Subword Pruning for Dialogue Response Generation on Large Language Models
ToxPrune prunes toxic subwords from BPE tokenizers in LLMs to mitigate toxic dialogue responses and improve diversity on both toxic and non-toxic models.
-
DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition
DynamicNER is a dynamic-categorization multilingual NER dataset with 155 entity types paired with CascadeNER, a two-stage lightweight LLM method claiming higher fine-grained accuracy.
-
Learning to Ask: When LLM Agents Meet Unclear Instruction
Introduces NoisyToolBench benchmark and Ask-when-Needed framework to improve LLM tool-use performance when user instructions are unclear or incomplete.
-
Unconditional Truthfulness: Learning Unconditional Uncertainty of Large Language Models
A regression model using attention features and recurrent uncertainty scores improves selective generation in LLMs over unsupervised and supervised baselines on ten datasets and three models.
-
One Single Hub Text Breaks CLIP: Identifying Vulnerabilities in Cross-Modal Encoders via Hubness
A single hub text can unreasonably match many images in CLIP-based similarity, exposing vulnerabilities in cross-modal encoders for caption evaluation and retrieval.
-
JaiTTS: A Thai Voice Cloning Model
JaiTTS-v1.0 achieves 1.94% CER on short Thai speech, beating human ground truth of 1.98%, matches humans on long speech, and wins 283 of 400 human comparisons against commercial systems.
-
Assessing Y-Axis Influence: Bias in Multimodal Language Models on Chart-to-Table Translation
Y-axis features such as major tick digit length, number of ticks, value range, and format introduce significant biases in multimodal models during chart-to-table tasks, with y-axis prompting improving performance for some models.
-
Generating Place-Based Compromises Between Two Points of View
Empathic similarity feedback in prompts generates more acceptable compromises than chain-of-thought, and margin-based training on the resulting data lets smaller models produce them without ongoing empathy estimation.
-
From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents
AdaPlan-H enables LLM agents to generate self-adaptive hierarchical plans that adjust detail level to task difficulty, improving success rates in multi-step tasks.
-
Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity
An LLM-as-a-judge evaluation framework for math reasoning outperforms symbolic methods by accurately assessing diverse answer representations and formats.
-
CoGR-MoE: Concept-Guided Expert Routing with Consistent Selection and Flexible Reasoning for Visual Question Answering
CoGR-MoE improves VQA by using concept-guided expert routing with option feature reweighting and contrastive learning to achieve consistent yet flexible reasoning across answer options.
-
Learning Uncertainty from Sequential Internal Dispersion in Large Language Models
SIVR detects LLM hallucinations by learning from token-wise and layer-wise variance patterns in internal hidden states, outperforming baselines with better generalization and less training data.
-
Imperfectly Cooperative Human-AI Interactions: Comparing the Impacts of Human and AI Attributes in Simulated and User Studies
In real human subjects, AI transparency impacts imperfectly cooperative interactions far more than personality traits, unlike simulations where both are comparably influential.
-
IUQ: Interrogative Uncertainty Quantification for Long-Form Large Language Model Generation
IUQ quantifies claim-level uncertainty in long-form LLM generation by combining inter-sample consistency and intra-sample faithfulness through an interrogate-then-respond approach and outperforms baselines on two datasets.
-
LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning
LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.
-
ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing
ActorMind is a four-agent chain-of-thought framework that emulates human actors to produce spontaneous, emotion-infused speech responses for role-playing scenarios.
-
Spotlight and Shadow: Attention-Guided Dual-Anchor Introspective Decoding for MLLM Hallucination Mitigation
DaID mitigates MLLM hallucinations by attention-guided selection of dual layers that calibrate token generation using internal perceptual discrepancies.
-
E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning
E3-TIR integrates expert prefixes, guided branches, and self-exploration via mix policy optimization to deliver 6% better tool-use performance with under 10% of the usual synthetic data and 1.46x ROI.
-
Confident in a Confidence Score: Investigating the Sensitivity of Confidence Scores to Supervised Fine-Tuning
Supervised fine-tuning degrades the correlation between confidence scores and output quality in language models, driven by factors like training distribution similarity rather than true quality.
-
Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing
SAVeR adds self-auditing of internal beliefs in LLM agents via persona-based candidates and constraint-guided repairs, improving faithfulness on six benchmarks without hurting task performance.
-
SAT: Balancing Reasoning Accuracy and Efficiency with Stepwise Adaptive Thinking
SAT reduces reasoning tokens by up to 40% across multiple large reasoning models and benchmarks by adaptively pruning steps based on difficulty while maintaining or improving accuracy.
-
Multi-Faceted Self-Consistent Preference Alignment for Query Rewriting in Conversational Search
MSPA-CQR improves conversational query rewriting by constructing self-consistent preference data across rewriting, retrieval, and response dimensions and training with prefix-guided multi-faceted direct preference optimization, showing effectiveness in both in- and out-of-distribution settings.