Smith, Daniel Khashabi, and Hannaneh Hajishirzi

38 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · method: 1

citation-polarity summary

representative citing papers

Continual Model Routing in Evolving Model Hubs

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

Formalizes continual model routing (CMR), releases CMRBench with over 2000 models, and presents CARvE which outperforms retrieval, fine-tuning and adapter-merging baselines on model/family/domain accuracy.

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

TASTE automates generation of high-coverage difficult agent benchmarks via adaptive contrastive n-gram sampling of tool sequences, yielding τ^c-Bench where models saturating τ²-Bench drop sharply and unique tool combinations more than double.

Self-Rewarding Language Models

cs.CL · 2024-01-18 · conditional · novelty 7.0

Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

Context-Aware Distillation and Ablation for Text2DSL

cs.CL · 2026-06-21 · unverdicted · novelty 6.0

Context-aware distillation with BNF+API+vocabulary scales PolkitBench to 10,073 pairs at 99.7% runtime pass rate; ablation on GigaChat-10B shows vocabulary adds +0.198 combined score while API/BNF add 22-25pp structural validity.

Self-Supervised On-Policy Distillation for Reasoning Language Models

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.

citing papers explorer

Showing 1 of 1 citing paper after filters.

  • A Survey on LLM-as-a-Judge · cs.CL · 2024-11-23 · unverdicted · none · ref 165

    A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.