EdgeFlowerTune: Evaluating Federated LLM Fine-Tuning Under Realistic Edge System Constraints
EdgeFlowerTune is a real-device benchmark that jointly assesses model quality and system costs for federated LLM fine-tuning on edge hardware using three protocols: Quality-under-Budget, Cost-to-Target, and Robustness.
16 papers indexed on Pith cite this work.
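The protocol names hint at what is measured: a Quality-under-Budget metric reports the best quality reachable within a fixed system-cost budget, while Cost-to-Target reports the cost at which a target quality is first met. The sketch below is a hypothetical illustration of those two readouts over an assumed (cost, quality) checkpoint trace, not EdgeFlowerTune's actual protocol code.

```python
# Hypothetical illustration of budget-style metrics; not EdgeFlowerTune's implementation.
# A "trace" is a list of (cumulative_cost, eval_quality) checkpoints from one fine-tuning run.

def quality_under_budget(trace, budget):
    """Best evaluation quality reached before the cost budget is exhausted."""
    feasible = [q for cost, q in trace if cost <= budget]
    return max(feasible) if feasible else None

def cost_to_target(trace, target_quality):
    """Cumulative cost at the first checkpoint that meets the target quality."""
    for cost, q in trace:
        if q >= target_quality:
            return cost
    return None  # target never reached within the run

# Toy trace: (energy-or-time cost, accuracy) pairs, purely made up.
trace = [(10, 0.41), (25, 0.55), (60, 0.63), (120, 0.66)]
print(quality_under_budget(trace, budget=60))     # 0.63
print(cost_to_target(trace, target_quality=0.6))  # 60
```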
citing papers explorer
-
Evolutionary Negative Module Pruning for Better LoRA Merging
ENMP prunes negative LoRA modules via evolutionary search to boost merging performance to new state-of-the-art levels across language and vision tasks.
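As a loose illustration of the evolutionary-pruning idea, a binary keep/drop mask over LoRA modules can be evolved against a validation score so that modules whose inclusion hurts the merge get dropped. The `validation_score` callable, the toy modules, and the search settings below are assumptions for illustration, not the paper's algorithm.

```python
import random

def evolve_merge_mask(modules, validation_score, generations=50, pop_size=8, mut_rate=0.1):
    """Evolve a binary keep/drop mask over LoRA modules; modules whose inclusion
    hurts the validation score tend to be pruned from the merge."""
    def fitness(mask):
        return validation_score([m for m, keep in zip(modules, mask) if keep])

    population = [[random.random() < 0.5 for _ in modules] for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        children = [[(not b) if random.random() < mut_rate else b for b in p] for p in parents]
        population = parents + children
    best = max(population, key=fitness)
    return [m for m, keep in zip(modules, best) if keep]

# Toy usage: pretend module "c" harms the merge, so the search should drop it.
toy_score = lambda kept: len(kept) - (2 if "c" in kept else 0)
print(evolve_merge_mask(["a", "b", "c"], toy_score))   # most likely ['a', 'b']
```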
-
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.
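The underlying retrieve-then-generate pattern can be sketched without any specific library: embed the question, pull the top-scoring passages, and condition the generator on them. The `embed` and `generate` callables below are placeholders, not the released RAG components.

```python
import numpy as np

def retrieve(query_vec, passage_vecs, passages, k=3):
    """Return the k passages whose embeddings score highest against the query."""
    scores = passage_vecs @ query_vec
    top = np.argsort(-scores)[:k]
    return [passages[i] for i in top]

def rag_answer(question, embed, generate, passages, passage_vecs, k=3):
    """Retrieve supporting passages, then condition the generator on them."""
    context = retrieve(embed(question), passage_vecs, passages, k)
    prompt = "\n".join(context) + "\nQuestion: " + question + "\nAnswer:"
    return generate(prompt)

# `embed` and `generate` stand in for a dense retriever and a seq2seq generator;
# both are assumptions here, not the facebook/rag-* checkpoints.
```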
-
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
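The embedding factorization is easy to quantify: a V×H embedding table is replaced by a V×E table plus an E×H projection, a large saving whenever E is much smaller than H. With the paper's typical sizes (V ≈ 30k, E = 128, H = 768 for the base configuration):

```python
V, H, E = 30_000, 768, 128           # vocab size, hidden size, factorized embedding size

untied = V * H                        # BERT-style embedding table
factored = V * E + E * H              # ALBERT-style factorization
print(untied, factored, untied / factored)
# 23_040_000 vs 3_938_304 -> roughly a 5.8x reduction in embedding parameters
```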
-
Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks
Cross-entropy method sampling reduces inferences needed to estimate five-nines LLM reliability by up to 156x on parameterized GSM8K templates, revealing reliability differences hidden by saturated accuracy scores.
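The cross-entropy method referenced here is a standard importance-sampling scheme for rare events: iteratively tilt the sampling distribution toward failures, then reweight. The toy sketch below estimates a small Gaussian tail probability; it illustrates the generic method, not the paper's LLM-specific templates.

```python
import numpy as np

rng = np.random.default_rng(0)
threshold = 4.0                        # rare "failure" event: X > 4 with X ~ N(0, 1)

# Multilevel cross-entropy method: shift the sampling mean toward the rare region.
mu, n, rho = 0.0, 2000, 0.1
for _ in range(20):
    x = rng.normal(mu, 1.0, n)
    level = min(threshold, np.quantile(x, 1.0 - rho))
    mu = x[x >= level].mean()          # refit the sampler to the elite samples
    if level >= threshold:
        break

# Final importance-sampling estimate; weights are the density ratio N(0,1)/N(mu,1).
x = rng.normal(mu, 1.0, n)
weights = np.exp(-0.5 * x**2 + 0.5 * (x - mu) ** 2)
print(np.mean((x > threshold) * weights))   # should land near the true tail mass ~3.2e-5
```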
-
MC²: Monte Carlo Correction for Fast Elliptic PDE Solving
MC² corrects low-budget Monte Carlo solutions for elliptic PDEs with a single-pass neural network to match the accuracy of 1000× more Monte Carlo samples while outperforming classical and learned baselines.
-
Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
Extremely quantized LLMs degrade in smoothness, sparsifying the decoding tree and hurting generation quality; a smoothness-preserving principle delivers gains beyond numerical fitting.
-
Extreme Weather Bench: A framework and benchmark for evaluation of high-impact weather
Extreme Weather Bench supplies standardized case studies, observational data, impact metrics, and code to evaluate weather models on high-impact hazards.
-
Compared to What? Baselines and Metrics for Counterfactual Prompting
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.
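The methodological point is that a counterfactual edit should be judged against the score variation induced by meaning-preserving paraphrases rather than against zero. Below is a minimal sketch of that comparison, with `score` standing in for any scalar LLM readout; it is an assumed illustration, not the paper's exact test.

```python
import numpy as np

def paraphrase_referenced_effect(score, original, counterfactual, paraphrases):
    """Compare the counterfactual's score shift against the spread of shifts
    produced by meaning-preserving paraphrases of the original prompt."""
    base = score(original)
    cf_delta = score(counterfactual) - base
    para_deltas = np.array([score(p) - base for p in paraphrases])
    # Fraction of paraphrases that move the score at least as much as the counterfactual:
    p_value = np.mean(np.abs(para_deltas) >= abs(cf_delta))
    return cf_delta, p_value

# A large cf_delta with a large p_value is indistinguishable from paraphrase noise.
```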
-
MIPIC: Matryoshka Representation Learning via Self-Distilled Intra-Relational and Progressive Information Chaining
MIPIC trains nested Matryoshka representations via self-distilled intra-relational alignment with top-k CKA and progressive information chaining across depths, yielding competitive performance especially at extreme low dimensions.
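The nested ("Matryoshka") ingredient is the standard trick of supervising every prefix of the embedding so that truncated vectors stay usable on their own; the sketch below shows only that prefix-loss component, leaving out MIPIC's self-distilled CKA alignment and information chaining.

```python
import torch
import torch.nn.functional as F

def matryoshka_loss(embeddings, labels, classifiers, dims=(32, 64, 128, 256)):
    """Sum a classification loss over nested prefixes of the embedding so that the
    first d dimensions work on their own for every d in `dims`."""
    total = 0.0
    for d, head in zip(dims, classifiers):
        logits = head(embeddings[:, :d])          # classify using only the first d dims
        total = total + F.cross_entropy(logits, labels)
    return total / len(dims)

# Toy usage with random data and one linear head per prefix length.
emb = torch.randn(8, 256, requires_grad=True)
labels = torch.randint(0, 10, (8,))
heads = [torch.nn.Linear(d, 10) for d in (32, 64, 128, 256)]
loss = matryoshka_loss(emb, labels, heads)
loss.backward()
```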
-
Parameter-efficient Quantum Multi-task Learning
QMTL uses shared VQC encoding plus task-specific quantum ansatz heads to achieve linear parameter scaling with the number of tasks while matching or exceeding classical multi-task baselines on three benchmarks.
-
Survey-aware Machine Learning: A Guideline for Valid Population Health Inference based on Scoping Review
The authors introduce Survey-aware Machine Learning (SaML) as a nine-step guideline that integrates survey design metadata throughout the ML lifecycle to enable valid population inference from complex health surveys.
-
Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models
Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.
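The token-mixing step described here can be written as multiplying the sequence by a masked Toeplitz matrix built from a learned kernel, so the mixing weight depends only on relative position; the sub-quadratic cost comes from evaluating that product as an FFT-based convolution rather than the dense matrix used below for clarity. A small sketch of the operation:

```python
import torch

def causal_toeplitz_mix(x, kernel):
    """Mix tokens with a lower-triangular Toeplitz matrix built from `kernel`,
    so position i attends to position j with weight kernel[i - j] (j <= i)."""
    L = x.shape[1]                                             # x: (batch, seq_len, dim)
    idx = torch.arange(L)
    T = kernel[(idx[:, None] - idx[None, :]).clamp(min=0)]    # Toeplitz from relative offsets
    T = T * (idx[:, None] >= idx[None, :])                    # causal mask
    return torch.einsum("ij,bjd->bid", T, x)

x = torch.randn(2, 6, 4)
kernel = torch.randn(6)                        # one learned weight per relative offset
print(causal_toeplitz_mix(x, kernel).shape)    # torch.Size([2, 6, 4])
```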
-
Model-Agnostic Meta Learning for Class Imbalance Adaptation
HAMR combines meta-learning with hardness-aware weighting and neighborhood resampling to improve minority-class performance on imbalanced NLP datasets.
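The hardness-aware weighting ingredient can be illustrated generically: per-example losses act as hardness scores and minority-class examples get boosted before resampling. This is a rough sketch of that one ingredient only, not HAMR's formulation, and the boost factor is an arbitrary assumption.

```python
import numpy as np

def hardness_weights(losses, labels, minority_label, boost=2.0):
    """Weight each example by its normalized loss ("hardness"), with an extra
    boost for minority-class examples."""
    w = losses / losses.mean()
    w = np.where(labels == minority_label, boost * w, w)
    return w / w.sum()

# Toy usage: resample a batch in proportion to hardness.
rng = np.random.default_rng(0)
losses = rng.random(10)
labels = np.array([0] * 8 + [1] * 2)           # label 1 is the minority class
p = hardness_weights(losses, labels, minority_label=1)
resampled_idx = rng.choice(10, size=10, replace=True, p=p)
```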
-
Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility
Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.
-
A Benchmark Suite of Reddit-Derived Datasets for Mental Health Detection
Four new Reddit-derived datasets for mental health detection tasks are presented with inter-annotator agreement above 0.8 and reported model F1 scores of 93-99%.