pith. sign in

hub Baseline reference

B ool Q : Exploring the surprising difficulty of natural yes/no questions

Baseline reference. 57% of citing Pith papers use this work as a benchmark or comparison.

31 Pith papers citing it
Baseline 57% of classified citations

hub tools

citation-role summary

dataset 4 background 3

citation-polarity summary

representative citing papers

Layer Collapse in Diffusion Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.

SimDiff: Depth Pruning via Similarity and Difference

cs.AI · 2026-04-21 · unverdicted · novelty 7.0

SimDiff uses similarity and difference metrics to prune LLM layers more effectively than cosine similarity alone, retaining over 91% performance at 25% pruning on LLaMA2-7B.

GAIA: a benchmark for General AI Assistants

cs.CL · 2023-11-21 · unverdicted · novelty 7.0

GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

Forecasting Downstream Performance of LLMs With Proxy Metrics

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

Proxy metrics from next-token distributions over expert solutions outperform loss and compute baselines for ranking LLMs, selecting pretraining data, and extrapolating performance across compute scales.

PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark

cs.CL · 2025-11-26 · unverdicted · novelty 6.0

PEFT-Bench is a standardized end-to-end benchmark for 7 PEFT methods across 27 NLP datasets on autoregressive LLMs, accompanied by the PSCP metric that penalizes based on trainable parameters, inference speed, and training memory.

Titans: Learning to Memorize at Test Time

cs.LG · 2024-12-31 · unverdicted · novelty 6.0

Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.

BloombergGPT: A Large Language Model for Finance

cs.LG · 2023-03-30 · conditional · novelty 6.0

BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 5.0 · 2 refs

Extremely quantized LLMs exhibit systematic smoothness degradation that reduces effective token candidates and degrades generation; a smoothness-preserving principle in PTQ and QAT delivers gains beyond numerical accuracy.

citing papers explorer

Showing 31 of 31 citing papers.