
MathQA: Towards interpretable math word problem solving with operation-based formalisms

2019 · DOI 10.18653/v1/n19-1245

17 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background: 1 · dataset: 1 · other: 1

citation-polarity summary

background: 1 · unclear: 1 · use dataset: 1

representative citing papers

CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

CrowdMath is a new dataset of annotated collaborative math proof discussions where frontier LLMs achieve 83-88% on next-post prediction but only 0.42 macro-F1 on identifying contribution roles.

SimDiff: Depth Pruning via Similarity and Difference

cs.AI · 2026-04-21 · unverdicted · novelty 7.0

SimDiff uses similarity and difference metrics to prune LLM layers more effectively than cosine similarity alone, retaining over 91% performance at 25% pruning on LLaMA2-7B.

Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents

cs.AI · 2026-04-05 · unverdicted · novelty 7.0

PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

cs.CL · 2022-06-09 · accept · novelty 7.0

BIG-bench is a 204-task benchmark that measures scaling trends, calibration, and absolute limitations of language models across knowledge, reasoning, and social domains.

FlexMoE: One-for-All Nested Intra-Expert Pruning for MoE Language Models

cs.LG · 2026-06-26 · unverdicted · novelty 6.0

FlexMoE produces nested pruned subnetworks for MoE LLMs across budgets via channel importance ranking and discrete action learning, plus one mid-budget recovery fine-tune, retaining 99.8% performance at 50% expert parameter pruning.

Generalization in LLM Problem Solving: The Case of the Shortest Path

cs.AI · 2026-04-16 · unverdicted · novelty 6.0

LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail at length scaling due to recursive instability, with data coverage setting hard limits.

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

cs.AI · 2026-03-26 · unverdicted · novelty 6.0

An external zero-shot monitor detects nine unsafe reasoning behaviors in LLMs at 87% step-level accuracy with low false positives and low latency.

PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark

cs.CL · 2025-11-26 · unverdicted · novelty 6.0

PEFT-Bench is a standardized end-to-end benchmark for 7 PEFT methods across 27 NLP datasets on autoregressive LLMs, accompanied by the PSCP metric that penalizes based on trainable parameters, inference speed, and training memory.

DataComp-LM: In search of the next generation of training sets for language models

cs.LG · 2024-06-17 · unverdicted · novelty 6.0

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.

The Falcon Series of Open Language Models

cs.CL · 2023-11-28 · conditional · novelty 6.0

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

cs.CL · 2023-09-11 · conditional · novelty 6.0

MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.

Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

cs.CL · 2023-05-23 · conditional · novelty 6.0

UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.

IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression

cs.LG · 2026-05-15 · unverdicted · novelty 5.0

IO-SVD performs SVD-based LLM compression by constructing a KL-aware double-sided whitening space and using first-order loss estimates for heterogeneous rank allocation.

Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 5.0 · 2 refs

Extremely quantized LLMs exhibit systematic smoothness degradation that reduces effective token candidates and degrades generation; a smoothness-preserving principle in PTQ and QAT delivers gains beyond numerical accuracy.

Efficient Quantization of Mixture-of-Experts with Theoretical Generalization Guarantees

cs.LG · 2026-04-07 · unverdicted · novelty 5.0

A router-norm and variance-based bit allocation strategy for quantizing MoE models that claims higher accuracy and lower cost than prior mixed-precision methods.

PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models

cs.CL · 2025-12-02 · unverdicted · novelty 5.0

PEFT-Factory supplies a ready-to-use, extensible codebase that unifies 19 PEFT methods and evaluation pipelines for fine-tuning large autoregressive language models.

PaLM 2 Technical Report

cs.CL · 2023-05-17 · unverdicted · novelty 5.0

PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.

citing papers explorer

Showing 5 of 5 citing papers after filters.

CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions

cs.AI · 2026-06-02 · unverdicted · none · ref 15

SimDiff: Depth Pruning via Similarity and Difference

cs.AI · 2026-04-21 · unverdicted · none · ref 31

Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents

cs.AI · 2026-04-05 · unverdicted · none · ref 18

Generalization in LLM Problem Solving: The Case of the Shortest Path

cs.AI · 2026-04-16 · unverdicted · none · ref 4

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

cs.AI · 2026-03-26 · unverdicted · none · ref 2
