ProofGrid is a new benchmark for LLM reasoning that uses machine-checkable proofs in minimal formal notation, revealing progress on basic tasks but major gaps in complex combinatorial and synthesis reasoning.
AGIE val: A human-centric benchmark for evaluating foundation models
6 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
MEASER embeds malware into open-source LLMs via parameter targeting and MAR-QIM modulation, achieving 0 BER and high stealth even after quantization and PEFT.
Adapting autoregressive models via continual pre-training yields diffusion language models from 127M to 7B parameters that outperform prior diffusion models and compete with their autoregressive counterparts on language, reasoning, and commonsense benchmarks.
CausalMix fits a causal model on 512 runs of a 0.5B model to estimate CATE, then extrapolates optimal mixtures for an 800K data pool applied to 7B and 4B models, outperforming RegMix.
Technical report announcing Ling-2.6 and Ring-2.6 models with hybrid linear attention, evolutionary CoT, and KPop RL for efficient agentic intelligence at scale.
Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.
citing papers explorer
-
Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism
ProofGrid is a new benchmark for LLM reasoning that uses machine-checkable proofs in minimal formal notation, revealing progress on basic tasks but major gaps in complex combinatorial and synthesis reasoning.
-
MEASER: Malware embedding attacks on open-source LLMs
MEASER embeds malware into open-source LLMs via parameter targeting and MAR-QIM modulation, achieving 0 BER and high stealth even after quantization and PEFT.
-
CausalMix: Data Mixture as Causal Inference for Language Model Training
CausalMix fits a causal model on 512 runs of a 0.5B model to estimate CATE, then extrapolates optimal mixtures for an 800K data pool applied to 7B and 4B models, outperforming RegMix.
-
Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale
Technical report announcing Ling-2.6 and Ring-2.6 models with hybrid linear attention, evolutionary CoT, and KPop RL for efficient agentic intelligence at scale.
-
Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning
Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.