Proceedings of the 2018

Wang, Alex; Singh, Amanpreet; Michael, Julian; Hill, Felix; Levy, Omer; Bowman, Samuel · 2018 · DOI 10.18653/v1/w18-5446

Mixed citation behavior. Most common role is background (67%).

31 Pith papers citing it

Background 67% of classified citations

citation-role summary

background: 4 · dataset: 2

citation-polarity summary

background: 4 · use dataset: 2

representative citing papers

What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework

cs.CL · 2026-05-19 · accept · novelty 7.0

A corpus-centric framework diagnoses scale, structure, overlap, metadata, and terminology properties across nine biomedical NER/EL corpora, showing substantial differences that common statistics fail to capture.

EdgeFlowerTune: Evaluating Federated LLM Fine-Tuning Under Realistic Edge System Constraints

cs.CL · 2026-05-09 · unverdicted · novelty 7.0

EdgeFlowerTune is a real-device benchmark that jointly assesses model quality and system costs for federated LLM fine-tuning on edge hardware using three protocols: Quality-under-Budget, Cost-to-Target, and Robustness.

Evolutionary Negative Module Pruning for Better LoRA Merging

cs.AI · 2026-04-20 · conditional · novelty 7.0

ENMP prunes negative LoRA modules via evolutionary search to boost merging performance to new state-of-the-art levels across language and vision tasks.

Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild

cs.SE · 2026-01-25 · conditional · novelty 7.0

Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.

Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer

cs.CL · 2024-08-02 · unverdicted · novelty 7.0

Task prompt vectors, formed by subtracting random initialization from tuned soft prompts, support low-resource initialization and arithmetic combination across tasks on 12 NLU datasets while remaining independent of initialization seed on two model architectures.

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

cs.CL · 2020-05-22 · accept · novelty 7.0

RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

cs.CL · 2019-09-26 · accept · novelty 7.0

ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.

GiLT: Augmenting Transformer Language Models with Dependency Graphs

cs.CL · 2026-05-15 · unverdicted · novelty 6.0

GiLT augments Transformers with semantic dependency graphs by modulating attention to improve syntactic generalization while keeping perplexity competitive and enabling better finetuning on downstream tasks.

Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

Cross-entropy method sampling reduces inferences needed to estimate five-nines LLM reliability by up to 156x on parameterized GSM8K templates, revealing reliability differences hidden by saturated accuracy scores.

MC$^2$: Monte Carlo Correction for Fast Elliptic PDE Solving

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

MC² corrects low-budget Monte Carlo solutions for elliptic PDEs with a single-pass neural network to match the accuracy of 1000× more Monte Carlo samples while outperforming classical and learned baselines.

Extreme Weather Bench: A framework and benchmark for evaluation of high-impact weather

cs.LG · 2026-05-01 · accept · novelty 6.0

Extreme Weather Bench supplies standardized case studies, observational data, impact metrics, and code to evaluate weather models on high-impact hazards.

Compared to What? Baselines and Metrics for Counterfactual Prompting

cs.CL · 2026-05-01 · conditional · novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.

Parameter-efficient Quantum Multi-task Learning

cs.LG · 2026-04-15 · unverdicted · novelty 6.0

QMTL uses shared VQC encoding plus task-specific quantum ansatz heads to achieve linear parameter scaling with the number of tasks while matching or exceeding classical multi-task baselines on three benchmarks.

Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods

cs.LG · 2025-10-12 · unverdicted · novelty 6.0

Preconditioned matrix norms unify steepest descent, quasi-Newton, and adaptive optimizers, revealing SGD, Adam, Muon, KL-Shampoo, SOAP, and SPlus as special cases and enabling new methods MuAdam and MuAdam-SANIA that are competitive in experiments.

HyperAdapt: Simple High-Rank Adaptation

cs.LG · 2025-09-23 · unverdicted · novelty 6.0

HyperAdapt performs parameter-efficient fine-tuning by row- and column-wise diagonal scaling to induce high-rank updates with only n+m trainable parameters.

Should We Still Pretrain Encoders with Masked Language Modeling?

cs.CL · 2025-07-01 · accept · novelty 6.0

Controlled ablations of 38 models find MLM superior to CLM on representation benchmarks while CLM offers better data efficiency and stability; a biphasic CLM-then-MLM schedule is optimal under fixed compute and improves when initialized from pretrained CLM models.

How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP

cs.CL · 2024-11-08 · unverdicted · novelty 6.0

The study filters non-English Wikipedia, reveals quality problems, proposes a 4-level ranking, and shows that filtered data matches or beats raw data in language modeling, with the largest gains for lower-quality editions.

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

cs.CL · 2023-04-13 · accept · novelty 6.0

AGIEval shows GPT-4 exceeding average human scores on SAT Math at 95% and Chinese college entrance English at 92.5%, while revealing weaker results on complex reasoning tasks.

Survey-aware Machine Learning: A Guideline for Valid Population Health Inference based on Scoping Review

stat.ML · 2026-05-09 · unverdicted · novelty 5.0

The authors introduce Survey-aware Machine Learning (SaML) as a nine-step guideline that integrates survey design metadata throughout the ML lifecycle to enable valid population inference from complex health surveys.

Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 5.0 · 2 refs

Extremely quantized LLMs exhibit systematic smoothness degradation that reduces effective token candidates and degrades generation; a smoothness-preserving principle in PTQ and QAT delivers gains beyond numerical accuracy.

Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models

cs.LG · 2026-04-24 · unverdicted · novelty 5.0

Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.

Model-Agnostic Meta Learning for Class Imbalance Adaptation

cs.CL · 2026-04-20 · conditional · novelty 5.0

HAMR combines meta-learning with hardness-aware weighting and neighborhood resampling to improve minority-class performance on imbalanced NLP datasets.

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

cs.CL · 2024-12-18 · unverdicted · novelty 5.0

ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.

Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility

cs.LG · 2026-05-07 · unverdicted · novelty 4.0 · 2 refs

Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.

citing papers explorer

Showing 4 of 4 citing papers after filters.

Evolutionary Negative Module Pruning for Better LoRA Merging

cs.AI · 2026-04-20 · conditional · none · ref 33

ENMP prunes negative LoRA modules via evolutionary search to boost merging performance to new state-of-the-art levels across language and vision tasks.

Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild

cs.SE · 2026-01-25 · conditional · none · ref 63

Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.

Compared to What? Baselines and Metrics for Counterfactual Prompting

cs.CL · 2026-05-01 · conditional · none · ref 122

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.

Model-Agnostic Meta Learning for Class Imbalance Adaptation

cs.CL · 2026-04-20 · conditional · none · ref 57

HAMR combines meta-learning with hardness-aware weighting and neighborhood resampling to improve minority-class performance on imbalanced NLP datasets.
