Challenging

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, Jason Wei · 2023 · DOI 10.18653/v1/2023.findings-acl.824

21 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · dataset: 1

citation-polarity summary

background: 1 · use dataset: 1

representative citing papers

Erase-then-Delta Attention: Decoupling Erase and Write Addresses in Delta-Rule Linear Attention

cs.CL · 2026-06-25 · unverdicted · novelty 7.0

EDA decouples erase and write addresses in delta-rule linear attention by adding a targeted erase step along a learned direction before the corrective write, yielding best results on 2.5B dense and 25B MoE models in pretraining and long-context tasks.

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

cs.CL · 2026-05-19 · accept · novelty 7.0

LLMEval-Logic is a solver-verified Chinese logical reasoning benchmark with 246 base and 190 hard items that shows frontier LLMs reach only 37.5% hard-item accuracy and 60.16% joint formalization score.

ToxiREX: A Dataset on Toxic REasoning in ConteXt

cs.CL · 2026-06-26 · unverdicted · novelty 6.0

ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.

Speculative Refinement: A Hybrid Autoregressive Diffusion Decoding Strategy and Its Behavior Across Benchmarks

cs.SE · 2026-06-25 · unverdicted · novelty 6.0

SpecRef hybrid AR-diffusion decoding is tested on six benchmarks with three protocols, showing code benchmarks conflate structural and logical correctness, refinement can degrade correct tokens, and log-likelihood versus generative scoring produce inconsistent model rankings.

Correct Looks Better: Pairwise Comparisons Reveal Accuracy Rankings

cs.AI · 2026-06-08 · unverdicted · novelty 6.0

Elo rankings from pairwise judgments correlate above 0.9 Spearman with accuracy rankings on five converted benchmarks, with minor style and bias effects.

CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts

cs.CL · 2026-06-03 · unverdicted · novelty 6.0

CRAFT is a Pareto-front prompt optimizer that allocates scarce LLM validation calls to candidates near the current front using accuracy- and cost-oriented generators plus NSGA-II retention.

Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models

cs.AI · 2026-06-02 · unverdicted · novelty 6.0

CRGC models instructions as constraint graphs, identifies bridge constraints, and cuts violations by 39% on three datasets while preserving reasoning performance.

Beyond Task-Agnostic: Task-Aware Grouping for Communication-Efficient Multi-Task MoE Inference

cs.LG · 2026-05-31 · unverdicted · novelty 6.0

Task-aware expert grouping derived from family-specific co-activation traces cuts average communication cost 31.39% versus task-agnostic baselines in multi-task MoE inference while maintaining Jain fairness near 1.0.

LLM Sparsity Prior for Robust Feature Selection

stat.ML · 2026-05-21 · unverdicted · novelty 6.0

LSP adds hierarchical hyperpriors over global sparsity and weight concentration parameters so that spike-and-slab models can discount inaccurate LLM weights while retaining gains when the weights are good.

SCALE-LoRA: Auditing Post-Retrieval LoRA Composition with Residual Merging and View Reliability

cs.AI · 2026-05-02 · unverdicted · novelty 6.0

SCALE-LoRA proposes a post-retrieval audit framework using sparse residual composition and disagreement-based reliability signals to improve open-pool LoRA adapter reuse on tasks like BIG-Bench Hard.

LightThinker++: From Reasoning Compression to Memory Management

cs.CL · 2026-04-04 · unverdicted · novelty 6.0

LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.

Einstein World Models

cs.AI · 2026-06-25 · unverdicted · novelty 5.0

Einstein World Models integrate visual rollouts from a callable world-module into LLM reasoning traces to support complex thought beyond language.

Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs

cs.CL · 2026-06-09 · unverdicted · novelty 5.0

A continual training recipe upcycles a dense Qwen2.5-8B LLM into a 4x channel-sparse model via predictor-gated bank-wise sparsity in the SwiGLU FFN, with a single-layer repair for a long-context failure on RULER-CWE.

"I understand your perspective": LLM Persuasion and Sycophancy through the Lens of Communicative Action Theory

cs.CL · 2026-06-06 · unverdicted · novelty 5.0

LLMs outperform humans in expressing illocutionary intents and sycophancy in successful persuasive counter-arguments from ChangeMyView, with crowd workers preferring LLM versions.

GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression

cs.CL · 2026-05-09 · unverdicted · novelty 5.0 · 2 refs

GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass execution with modular flexibility.

Language models fail at extended rule following

cs.CL · 2026-05-03 · unverdicted · novelty 5.0

LLMs fail at extended counting of repeated characters due to finite internal states, with abrupt errors persisting across model scales and inference methods.

MiMo-V2-Flash Technical Report

cs.CL · 2026-01-06 · unverdicted · novelty 5.0

MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurposed MTP layers.

InternLM2 Technical Report

cs.CL · 2024-03-26 · unverdicted · novelty 5.0

InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.

Influcoder: Distilling Decoders' Gradient Influence Rankings into an Encoder for Data Attribution

cs.CL · 2026-06-11 · unverdicted · novelty 4.0

Influcoder distills decoders' gradient influence rankings into an encoder for scalable influence-based data attribution.

MetaEvo: A Meta-Optimization Framework for Experience-Driven Agent Evolution

cs.LG · 2026-05-29 · unverdicted · novelty 4.0

MetaEvo is a two-stage framework using preference optimization for principle abstraction followed by modular reuse to enable continual improvement of LLM agents on reasoning tasks.

Efficient Benchmarking Is Just Feature Selection and Multiple Regression

stat.ML · 2026-05-25 · unverdicted · novelty 4.0

Kernel ridge regression combined with mRMR feature selection improves prediction of full benchmark scores from question subsets over existing efficient benchmarking techniques.

citing papers explorer

Erase-then-Delta Attention: Decoupling Erase and Write Addresses in Delta-Rule Linear Attention · cs.CL · 2026-06-25 · unverdicted · none · ref 30
LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening · cs.CL · 2026-05-19 · accept · none · ref 26
ToxiREX: A Dataset on Toxic REasoning in ConteXt · cs.CL · 2026-06-26 · unverdicted · none · ref 178
Speculative Refinement: A Hybrid Autoregressive Diffusion Decoding Strategy and Its Behavior Across Benchmarks · cs.SE · 2026-06-25 · unverdicted · none · ref 23
Correct Looks Better: Pairwise Comparisons Reveal Accuracy Rankings · cs.AI · 2026-06-08 · unverdicted · none · ref 17
CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts · cs.CL · 2026-06-03 · unverdicted · none · ref 218
Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models · cs.AI · 2026-06-02 · unverdicted · none · ref 44
Beyond Task-Agnostic: Task-Aware Grouping for Communication-Efficient Multi-Task MoE Inference · cs.LG · 2026-05-31 · unverdicted · none · ref 42
LLM Sparsity Prior for Robust Feature Selection · stat.ML · 2026-05-21 · unverdicted · none · ref 21
SCALE-LoRA: Auditing Post-Retrieval LoRA Composition with Residual Merging and View Reliability · cs.AI · 2026-05-02 · unverdicted · none · ref 12
LightThinker++: From Reasoning Compression to Memory Management · cs.CL · 2026-04-04 · unverdicted · none · ref 40
Einstein World Models · cs.AI · 2026-06-25 · unverdicted · none · ref 10
Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs · cs.CL · 2026-06-09 · unverdicted · none · ref 26
"I understand your perspective": LLM Persuasion and Sycophancy through the Lens of Communicative Action Theory · cs.CL · 2026-06-06 · unverdicted · none · ref 83
GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression · cs.CL · 2026-05-09 · unverdicted · none · ref 40 · 2 links
Language models fail at extended rule following · cs.CL · 2026-05-03 · unverdicted · none · ref 39
MiMo-V2-Flash Technical Report · cs.CL · 2026-01-06 · unverdicted · none · ref 45
InternLM2 Technical Report · cs.CL · 2024-03-26 · unverdicted · none · ref 202
Influcoder: Distilling Decoders' Gradient Influence Rankings into an Encoder for Data Attribution · cs.CL · 2026-06-11 · unverdicted · none · ref 18
MetaEvo: A Meta-Optimization Framework for Experience-Driven Agent Evolution · cs.LG · 2026-05-29 · unverdicted · none · ref 20
Efficient Benchmarking Is Just Feature Selection and Multiple Regression · stat.ML · 2026-05-25 · unverdicted · none · ref 73
