
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Ashish Sabharwal, Carissa Schoenick, Isaac Cowhey, Oren Etzioni, Peter Clark, Tushar Khot · 2018 · cs.AI · arXiv 1803.05457

Mixed citation behavior. Most common role is dataset (45% of classified citations).

507 Pith papers citing it


abstract

We present a new question set, text corpus, and baselines assembled to encourage AI research in advanced question answering. Together, these constitute the AI2 Reasoning Challenge (ARC), which requires far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI. The ARC question set is partitioned into a Challenge Set and an Easy Set, where the Challenge Set contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. The dataset contains only natural, grade-school science questions (authored for human tests), and is the largest public-domain set of this kind (7,787 questions). We test several baselines on the Challenge Set, including leading neural models from the SQuAD and SNLI tasks, and find that none are able to significantly outperform a random baseline, reflecting the difficult nature of this task. We are also releasing the ARC Corpus, a corpus of 14M science sentences relevant to the task, and implementations of the three neural baseline models tested. Can your model perform better? We pose ARC as a challenge to the community.


citation-role summary

background 49 · dataset 47 · method 4 · other 3 · baseline 1
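The role counts above can be turned into percentage shares with a few lines of Python. This is a minimal sketch, assuming the reported 45% dataset share is the dataset count divided by the sum of all five role counts; the names `role_counts` and `role_shares` are illustrative and not part of the page.

```python
# Role counts copied from the citation-role summary above.
role_counts = {"background": 49, "dataset": 47, "method": 4, "other": 3, "baseline": 1}


def role_shares(counts):
    """Return each role's share of classified citations as a whole-number percentage."""
    total = sum(counts.values())
    return {role: round(100 * n / total) for role, n in counts.items()}


shares = role_shares(role_counts)
print(sum(role_counts.values()))  # 104 classified citations
print(shares["dataset"])          # 45
```

Under that assumption the dataset share is 47/104, which rounds to 45%. By the same counts, background (49) slightly exceeds dataset (47), so the "most common role" label may rest on a different tally than the one listed here.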

citation-polarity summary

use dataset 47 · background 40 · unclear 11 · use method 4 · baseline 1 · support 1



representative citing papers

Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

cs.CL · 2026-05-24 · unverdicted · novelty 8.0

Introduces BonaFide benchmark of 3,066 ground-truth labeled CoTs showing most faithfulness metrics perform near chance with biases and poor scaling to longer chains.

Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs

cs.LG · 2026-05-18 · conditional · novelty 8.0

Conformal Selective Acting (CSA) fills a gap in conformal methods by providing per-round, pathwise-valid selective risk bounds for adaptive RLVR LLM streams under predictable updates and isotonic calibration.

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

cs.AI · 2026-05-15 · unverdicted · novelty 8.0 · 2 refs

Presents the first fully open pipeline for clinical LLMs by unifying eight public QA datasets with three clinician-vetted synthetic extensions and applying it to five base models to achieve benchmark gains while maintaining auditability.

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

cs.LG · 2026-05-12 · accept · novelty 8.0

Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

CacheTrap: Unveiling a Stealthier Gray-Box Trojan against LLMs

cs.CR · 2025-11-27 · conditional · novelty 8.0

CacheTrap achieves 100% targeted attack success on five open-source LLMs by using an efficient search to locate and flip a single bit in the KV cache as a transient trigger, while preserving normal accuracy without the trigger.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

Measuring Massive Multitask Language Understanding

cs.CY · 2020-09-07 · accept · novelty 8.0

Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

cs.CL · 2018-09-08 · accept · novelty 8.0

OpenBookQA tests AI by requiring it to apply provided science facts plus common knowledge to new questions, where advanced models perform worse than simple baselines while humans score near 92%.

Morphing into Hybrid Attention Models

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

FlashMorph formulates hybrid layer selection as budget-constrained optimization, trains per-layer gates on synthetic retrieval data with linearization regularization, then discretizes and distills to produce efficient hybrid architectures.

CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.

Minority Sentinel: When to Overturn Majority Voting in Multi-Agent LLM Debates

cs.MA · 2026-06-28 · unverdicted · novelty 7.0

Minority Sentinel uses a LightGBM model on debate fingerprints to overturn majority votes in LLM debates with 81.2% flip precision and positive net gain on six benchmarks.

CARVE: Content-Aware Recurrent with Value Efficiency for Chunk-Parallel Linear Attention

cs.CL · 2026-06-25 · conditional · novelty 7.0

CARVE introduces key-axis content-aware gating and value-efficient scalar writes in recurrent linear attention, outperforming GDN-2 on perplexity and retrieval tasks while cutting parameters and memory.

Explaining Attention with Program Synthesis

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

Language-model-guided program synthesis can approximate transformer attention heads with over 75% IoU fidelity on held-out data and allow replacing 25% of heads with only 16% average perplexity increase.

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

cs.DC · 2026-06-07 · conditional · novelty 7.0

APEX4 co-designs pure INT4 GEMM kernels with ρ-aware granularity adaptation to deliver up to 2.09× end-to-end speedup on GPUs with low ρ while keeping LLaMA-2-70B perplexity within 0.63 of FP16.

Toward Calibrated, Fair, and accurate Deepfake Detection

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.

Parameter-Efficient Fine-Tuning with Learnable Rank

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

LR-LoRA learns per-layer adapter ranks during training and reports outperforming fixed-rank LoRA and other PEFT baselines on language understanding and commonsense reasoning tasks.

Compress then Merge: From Multiple LoRAs into One Low-Rank Adapter

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

CtM merges T LoRAs into one rank-r LoRA by computing shared r-dimensional subspaces from the LoRA weights, projecting adapters into r x r coordinates, and merging in that reduced space, outperforming merge-then-compress baselines in experiments.

Calibration Data Trade-offs Across Capability Dimensions: Why Multi-Source Mixing Matters for High-Sparsity LLM Pruning

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

Analysis of 15 calibration sources shows opposite-sign Spearman correlations between perplexity and retention across General vs. Math/Code dimensions in LLM pruning, and multi-source mixing via IGSP raises total retention from 40-50% to 58.8%.

citing papers explorer

Showing 50 of 507 citing papers.

Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth cs.CL · 2026-05-24 · unverdicted · none · ref 74 · internal anchor
Introduces BonaFide benchmark of 3,066 ground-truth labeled CoTs showing most faithfulness metrics perform near chance with biases and poor scaling to longer chains.
Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs cs.LG · 2026-05-18 · conditional · none · ref 41 · internal anchor
Conformal Selective Acting (CSA) fills a gap in conformal methods by providing per-round, pathwise-valid selective risk bounds for adaptive RLVR LLM streams under predictable updates and isotonic calibration.
Fully Open Meditron: An Auditable Pipeline for Clinical LLMs cs.AI · 2026-05-15 · unverdicted · none · ref 42 · 2 links · internal anchor
Presents the first fully open pipeline for clinical LLMs by unifying eight public QA datasets with three clinician-vetted synthetic extensions and applying it to five base models to achieve benchmark gains while maintaining auditability.
HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts cs.LG · 2026-05-13 · unverdicted · none · ref 9 · internal anchor
HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.
Grid Games: The Power of Multiple Grids for Quantizing Large Language Models cs.LG · 2026-05-12 · accept · none · ref 7 · internal anchor
Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks cs.CL · 2026-04-19 · unverdicted · none · ref 12 · internal anchor
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding cs.CV · 2026-01-15 · unverdicted · none · ref 23 · internal anchor
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
CacheTrap: Unveiling a Stealthier Gray-Box Trojan against LLMs cs.CR · 2025-11-27 · conditional · none · ref 46 · internal anchor
CacheTrap achieves 100% targeted attack success on five open-source LLMs by using an efficient search to locate and flip a single bit in the KV cache as a transient trigger, while preserving normal accuracy without the trigger.
Large Language Diffusion Models cs.CL · 2025-02-14 · unverdicted · none · ref 113 · internal anchor
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models cs.CV · 2024-09-25 · accept · none · ref 23 · internal anchor
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
Mamba: Linear-Time Sequence Modeling with Selective State Spaces cs.LG · 2023-12-01 · unverdicted · none · ref 18 · internal anchor
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
Measuring Massive Multitask Language Understanding cs.CY · 2020-09-07 · accept · none · ref 5 · internal anchor
Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.
Language Models are Few-Shot Learners cs.CL · 2020-05-28 · accept · none · ref 6 · internal anchor
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering cs.CL · 2018-09-08 · accept · none · ref 5 · internal anchor
OpenBookQA tests AI by requiring it to apply provided science facts plus common knowledge to new questions, where advanced models perform worse than simple baselines while humans score near 92%.
Morphing into Hybrid Attention Models cs.CL · 2026-06-29 · unverdicted · none · ref 13 · internal anchor
FlashMorph formulates hybrid layer selection as budget-constrained optimization, trains per-layer gates on synthetic retrieval data with linearization regularization, then discretizes and distills to produce efficient hybrid architectures.
CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph cs.CL · 2026-06-29 · unverdicted · none · ref 45 · internal anchor
Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.
Minority Sentinel: When to Overturn Majority Voting in Multi-Agent LLM Debates cs.MA · 2026-06-28 · unverdicted · none · ref 5 · internal anchor
Minority Sentinel uses a LightGBM model on debate fingerprints to overturn majority votes in LLM debates with 81.2% flip precision and positive net gain on six benchmarks.
CARVE: Content-Aware Recurrent with Value Efficiency for Chunk-Parallel Linear Attention cs.CL · 2026-06-25 · conditional · none · ref 4 · internal anchor
CARVE introduces key-axis content-aware gating and value-efficient scalar writes in recurrent linear attention, outperforming GDN-2 on perplexity and retrieval tasks while cutting parameters and memory.
Explaining Attention with Program Synthesis cs.LG · 2026-06-17 · unverdicted · none · ref 7 · internal anchor
Language-model-guided program synthesis can approximate transformer attention heads with over 75% IoU fidelity on held-out data and allow replacing 25% of heads with only 16% average perplexity increase.
APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing cs.DC · 2026-06-07 · conditional · none · ref 30 · internal anchor
APEX4 co-designs pure INT4 GEMM kernels with ρ-aware granularity adaptation to deliver up to 2.09× end-to-end speedup on GPUs with low ρ while keeping LLaMA-2-70B perplexity within 0.63 of FP16.
Toward Calibrated, Fair, and accurate Deepfake Detection cs.LG · 2026-06-03 · unverdicted · none · ref 59 · internal anchor
Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.
Parameter-Efficient Fine-Tuning with Learnable Rank cs.CL · 2026-06-03 · unverdicted · none · ref 45 · internal anchor
LR-LoRA learns per-layer adapter ranks during training and reports outperforming fixed-rank LoRA and other PEFT baselines on language understanding and commonsense reasoning tasks.
Compress then Merge: From Multiple LoRAs into One Low-Rank Adapter cs.LG · 2026-06-02 · unverdicted · none · ref 3 · internal anchor
CtM merges T LoRAs into one rank-r LoRA by computing shared r-dimensional subspaces from the LoRA weights, projecting adapters into r x r coordinates, and merging in that reduced space, outperforming merge-then-compress baselines in experiments.
Calibration Data Trade-offs Across Capability Dimensions: Why Multi-Source Mixing Matters for High-Sparsity LLM Pruning cs.LG · 2026-06-02 · unverdicted · none · ref 36 · internal anchor
Analysis of 15 calibration sources shows opposite-sign Spearman correlations between perplexity and retention across General vs. Math/Code dimensions in LLM pruning, and multi-source mixing via IGSP raises total retention from 40-50% to 58.8%.
From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression cs.CL · 2026-06-01 · unverdicted · none · ref 78 · internal anchor
SubFit enables better LLM compression by fitting residual bypasses to non-contiguously selected submodules, outperforming layer-granularity baselines in accuracy-perplexity trade-offs at 12.5-37.5% sparsity.
Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference cs.DC · 2026-06-01 · unverdicted · none · ref 17 · internal anchor
A new fault-injection framework enables a systematic empirical study that produces 17 takeaways on error propagation in LLM inference and four software-only mitigation directions.
Forget Attention: Importance-Aware Attention Is All You Need cs.AI · 2026-06-01 · unverdicted · none · ref 19 · internal anchor
SISA adds an SSM importance term inside the attention score and runs the full operation as one SDPA call on augmented Q/K vectors, reporting better LAMBADA and perfect NIAH at small scale.
Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery cs.AI · 2026-06-01 · conditional · none · ref 4 · internal anchor
2-bit quantized reasoning models exhibit process failures like loops and delayed commitment that degrade end-to-end performance, but FP16 planning and loop rescue recover accuracy on MATH-500 from 17.2% to 74.2% for Qwen3-8B while retaining speed gains.
TwinQuant: Learnable Subspace Decomposition for 4-Bit LLM Quantization cs.DC · 2026-06-01 · unverdicted · none · ref 92 · internal anchor
TwinQuant learns quantization-friendly subspaces for 4-bit LLM weights via manifold optimization and a fused kernel, preserving near-FP16 accuracy with up to 1.8x speedup on LLaMA3 and Qwen3 models.
PithTrain: A Compact and Agent-Native MoE Training System cs.LG · 2026-05-29 · unverdicted · none · ref 9 · internal anchor
PithTrain is a compact agent-native MoE training system that matches production throughput and improves agent-task efficiency by up to 62% fewer turns and 64% less GPU time on the new ATE-Bench.
On the Construction and Implications of Low-Loss Valleys in LoRA-based Bayesian Inference cs.LG · 2026-05-28 · unverdicted · none · ref 4 · internal anchor
Introduces LoRA-Curve parameterization to link independent LoRA optima via low-loss valleys, yielding higher predictive mutual information on reasoning and classification tasks with Qwen2.5 7B.
Parallax: Parameterized Local Linear Attention for Language Modeling cs.LG · 2026-05-27 · unverdicted · none · ref 2 · internal anchor
Parallax is a scalable parameterized local linear attention variant that improves LLM pretraining perplexity at 0.6B/1.7B scales with a hardware-aware kernel and shows gains under parameter- and compute-matched controls.
Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence? cs.CL · 2026-05-27 · unverdicted · none · ref 12 · internal anchor
LLMs struggle to associate epistemic markers with stable internal confidence levels across distributions, even under model-centric interpretations, while maintaining somewhat consistent marker rankings.
Training-Free Looped Transformers cs.LG · 2026-05-22 · unverdicted · none · ref 21 · internal anchor
Training-free looped transformers retrofit recurrence to frozen models via damped ODE sub-steps on mid-stack blocks, yielding gains such as +2.64 pp on MMLU-Pro for Qwen3-4B.
Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning cs.CL · 2026-05-22 · unverdicted · none · ref 4 · internal anchor
Representational convergence across 16 LLMs on 800 reasoning problems is stronger for failed tasks and pre-decision stages but shows minimal causal influence on predictions, pointing to shared processing constraints over shared reasoning.
Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks cs.CL · 2026-05-22 · conditional · none · ref 4 · internal anchor
Audits reveal no reasoning benchmark controls position/filler/length jointly; CRE shows LLMs drop up to 88pp on middle-position tasks at 64K context, with diagnostic probe supporting positional cause.
LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models cs.LG · 2026-05-17 · unverdicted · none · ref 8 · internal anchor
LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accuracy, energy, or latency on different substrates.
LEAP: Learnable End-to-End Adaptive Pruning of Large Language Models cs.LG · 2026-05-17 · unverdicted · none · ref 2 · 2 links · internal anchor
LEAP learns unstructured pruning masks end-to-end for LLMs via Gumbel-sigmoid Bernoulli relaxation and reports +2.59 average zero-shot accuracy gain over ADMM at 50-60% sparsity across five model families.
Dynamic Chunking for Diffusion Language Models cs.CL · 2026-05-15 · unverdicted · none · ref 10 · internal anchor
DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.
Widening the Gap: Exploiting LLM Quantization via Outlier Injection cs.LG · 2026-05-14 · conditional · none · ref 37 · internal anchor
The paper introduces an outlier-injection attack that induces targeted weight collapse in LLMs under advanced quantization schemes including AWQ, GPTQ, and GGUF I-quants.
Inducing Artificial Uncertainty in Language Models cs.CL · 2026-05-13 · unverdicted · none · ref 7 · internal anchor
Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs cs.AI · 2026-05-12 · unverdicted · none · ref 7 · internal anchor
LGMT applies metamorphic testing derived from first-order logic equivalences to detect reasoning inconsistencies in LLMs that static benchmarks miss.
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching cs.CL · 2026-05-12 · unverdicted · none · ref 161 · internal anchor
Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models cs.LG · 2026-05-12 · unverdicted · none · ref 13 · 2 links · internal anchor
Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.
Breaking Winner-Takes-All: Cooperative Policy Optimization Improves Diverse LLM Reasoning cs.AI · 2026-05-12 · unverdicted · none · ref 5 · 2 links · internal anchor
GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.
HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model cs.CL · 2026-05-11 · unverdicted · none · ref 3 · internal anchor
Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.
BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization cs.LG · 2026-05-11 · unverdicted · none · ref 5 · internal anchor
BCJR-QAT makes trellis quantization differentiable via BCJR soft decoding at finite temperature, allowing QAT to improve 2-bit LLM perplexity over PTQ with a fused GPU kernel and a drift-budget escape condition.
Simply Stabilizing the Loop via Fully Looped Transformer cs.LG · 2026-05-11 · unverdicted · none · ref 62 · internal anchor
Fully Looped Transformer stabilizes looped training up to 12 iterations via distributed inter-loop signals and attention injection, improving downstream performance by up to 13.2%.
Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models cs.CL · 2026-05-10 · conditional · none · ref 19 · internal anchor
Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.
LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models cs.LG · 2026-05-10 · unverdicted · none · ref 57 · internal anchor
LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
