citation dossier

Enhancing chat language models by scaling high-quality instructional conversations

Ding, N · 2023 · arXiv 2305.14233

19Pith papers citing it

20reference links

cs.CLtop field · 8 papers

UNVERDICTEDtop verdict bucket · 14 papers

This arXiv-backed work is queued for full Pith review when it crosses the high-inbound sweep. That review runs reader · skeptic · desk-editor · referee · rebuttal · circularity · lean confirmation · RS check · pith extraction.

read on arXiv PDF

why this work matters in Pith

Pith has found this work in 19 reviewed papers. Its strongest current cluster is cs.CL (8 papers). The largest review-status bucket among citing papers is UNVERDICTED (14 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.

representative citing papers

Beyond Parameter Aggregation: Semantic Consensus for Federated Fine-Tuning of LLMs

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Semantic consensus on model outputs for public prompts enables federated LLM fine-tuning that matches parameter-aggregation baselines with orders-of-magnitude lower communication.

IRIS: Interpolative R\'enyi Iterative Self-play for Large Language Model Fine-Tuning

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of the annotated data.

Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

cs.LG · 2026-04-14 · unverdicted · novelty 7.0

Continuous adversarial training in the embedding space produces a robust generalization bound for linear transformers that decreases with perturbation radius, tied to singular values of the embedding matrix, and motivates a new regularizer that improves real LLM jailbreak robustness-utility tradeoff

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

cs.CL · 2024-10-14 · unverdicted · novelty 7.0

LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.

Self-Rewarding Language Models

cs.CL · 2024-01-18 · conditional · novelty 7.0

Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

cs.LG · 2026-05-11 · conditional · novelty 6.0 · 2 refs

DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.

TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

cs.CR · 2026-04-30 · unverdicted · novelty 6.0

TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.

NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference

cs.AR · 2026-04-28 · unverdicted · novelty 6.0

NVLLM offloads FFN computations to integrated 3D NAND flash with page-level access and keeps attention in DRAM, delivering 16.7x-37.9x speedups over GPU out-of-core baselines for models up to 30B parameters.

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

cs.CV · 2024-12-06 · unverdicted · novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

SnapKV: LLM Knows What You are Looking for Before Generation

cs.CL · 2024-04-22 · conditional · novelty 6.0

SnapKV selects clustered important KV positions per attention head from an observation window at the prompt end, yielding 3.6x faster generation and 8.2x better memory efficiency on 16K-token inputs with comparable performance across 16 datasets.

Do Linear Probes Generalize Better in Persona Coordinates?

cs.AI · 2026-05-10 · unverdicted · novelty 5.0

Probes on persona principal components from contrastive prompts generalize better than raw activation probes for harmful behaviors across 10 datasets.

Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods

cs.LG · 2026-04-19 · unverdicted · novelty 5.0

ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.

Acceptance Dynamics Across Cognitive Domains in Speculative Decoding

cs.AI · 2026-04-16 · unverdicted · novelty 5.0

Empirical measurements across four NLP domains show task type is a stronger predictor of speculative decoding acceptance than tree depth, with chat uniquely achieving expected accepted length over 1 token per step.

Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models

cs.CL · 2026-04-07 · unverdicted · novelty 5.0

Lack of exploration from conditioning on prior answers is the primary reason parallel sampling outperforms sequential sampling in large reasoning models.

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

cs.CL · 2025-02-04 · unverdicted · novelty 5.0

SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

cs.CV · 2024-08-03 · conditional · novelty 5.0

MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

Reducing Redundancy in Retrieval-Augmented Generation through Chunk Filtering

cs.CL · 2026-04-27 · unverdicted · novelty 4.0

Entity-based chunk filtering reduces RAG vector index size by 25-36% with retrieval quality near baseline levels.

Yi: Open Foundation Models by 01.AI

cs.CL · 2024-03-07 · unverdicted · novelty 4.0

Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

cs.CL · 2024-12-07 · accept · novelty 3.0

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

citing papers explorer

Showing 19 of 19 citing papers.

Beyond Parameter Aggregation: Semantic Consensus for Federated Fine-Tuning of LLMs cs.LG · 2026-05-12 · unverdicted · none · ref 20
Semantic consensus on model outputs for public prompts enables federated LLM fine-tuning that matches parameter-aggregation baselines with orders-of-magnitude lower communication.
IRIS: Interpolative R\'enyi Iterative Self-play for Large Language Model Fine-Tuning cs.LG · 2026-04-22 · unverdicted · none · ref 20
IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of the annotated data.
Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory cs.LG · 2026-04-14 · unverdicted · none · ref 5
Continuous adversarial training in the embedding space produces a robust generalization bound for linear transformers that decreases with perturbation radius, tied to singular values of the embedding matrix, and motivates a new regularizer that improves real LLM jailbreak robustness-utility tradeoff
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory cs.CL · 2024-10-14 · unverdicted · none · ref 66
LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.
Self-Rewarding Language Models cs.CL · 2024-01-18 · conditional · none · ref 43
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices cs.LG · 2026-05-11 · conditional · none · ref 100 · 2 links
DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning cs.CR · 2026-04-30 · unverdicted · none · ref 4
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference cs.AR · 2026-04-28 · unverdicted · none · ref 3
NVLLM offloads FFN computations to integrated 3D NAND flash with page-level access and keeps attention in DRAM, delivering 16.7x-37.9x speedups over GPU out-of-core baselines for models up to 30B parameters.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling cs.CV · 2024-12-06 · unverdicted · none · ref 58
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
SnapKV: LLM Knows What You are Looking for Before Generation cs.CL · 2024-04-22 · conditional · none · ref 11
SnapKV selects clustered important KV positions per attention head from an observation window at the prompt end, yielding 3.6x faster generation and 8.2x better memory efficiency on 16K-token inputs with comparable performance across 16 datasets.
Do Linear Probes Generalize Better in Persona Coordinates? cs.AI · 2026-05-10 · unverdicted · none · ref 17
Probes on persona principal components from contrastive prompts generalize better than raw activation probes for harmful behaviors across 10 datasets.
Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods cs.LG · 2026-04-19 · unverdicted · none · ref 11
ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.
Acceptance Dynamics Across Cognitive Domains in Speculative Decoding cs.AI · 2026-04-16 · unverdicted · none · ref 15
Empirical measurements across four NLP domains show task type is a stronger predictor of speculative decoding acceptance than tree depth, with chat uniquely achieving expected accepted length over 1 token per step.
Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models cs.CL · 2026-04-07 · unverdicted · none · ref 5
Lack of exploration from conditioning on prior answers is the primary reason parallel sampling outperforms sequential sampling in large reasoning models.
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model cs.CL · 2025-02-04 · unverdicted · none · ref 167
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
MiniCPM-V: A GPT-4V Level MLLM on Your Phone cs.CV · 2024-08-03 · conditional · none · ref 30
MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
Reducing Redundancy in Retrieval-Augmented Generation through Chunk Filtering cs.CL · 2026-04-27 · unverdicted · none · ref 10
Entity-based chunk filtering reduces RAG vector index size by 25-36% with retrieval quality near baseline levels.
Yi: Open Foundation Models by 01.AI cs.CL · 2024-03-07 · unverdicted · none · ref 20
Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods cs.CL · 2024-12-07 · accept · none · ref 49
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

Enhancing chat language models by scaling high-quality instructional conversations

why this work matters in Pith

fields

years

verdicts

representative citing papers

citing papers explorer