Athene-70b: Redefining the boundaries of post-training for open models, July 2024a

Evan Frick, Peter Jin, Tianle Li, Karthik Ganesan, Jian Zhang, Jiantao Jiao, Banghua Zhu · 2024 · arXiv 2410.14872

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

representative citing papers

CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems

cs.CL · 2026-04-14 · unverdicted · novelty 7.0

CompliBench uses simulation and adversarial flaw injection to create labeled dialogue data showing that top proprietary LLMs perform poorly at spotting guideline violations while fine-tuned smaller models outperform them and generalize to new domains.

RewardBench 2: Advancing Reward Model Evaluation

cs.CL · 2025-06-02 · unverdicted · novelty 6.0

RewardBench 2 is a new benchmark that supplies challenging fresh human prompts for reward model evaluation, yielding lower average scores but higher correlation with downstream best-of-N sampling and RLHF training performance.

Qwen2.5 Technical Report

cs.CL · 2024-12-19 · unverdicted · novelty 3.0

Qwen2.5 LLMs scale pre-training data to 18 trillion tokens and apply multistage reinforcement learning, achieving competitive performance on benchmarks with models up to 5 times larger.

citing papers explorer

Showing 3 of 3 citing papers.

CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems cs.CL · 2026-04-14 · unverdicted · none · ref 8
CompliBench uses simulation and adversarial flaw injection to create labeled dialogue data showing that top proprietary LLMs perform poorly at spotting guideline violations while fine-tuned smaller models outperform them and generalize to new domains.
RewardBench 2: Advancing Reward Model Evaluation cs.CL · 2025-06-02 · unverdicted · none · ref 14
RewardBench 2 is a new benchmark that supplies challenging fresh human prompts for reward model evaluation, yielding lower average scores but higher correlation with downstream best-of-N sampling and RLHF training performance.
Qwen2.5 Technical Report cs.CL · 2024-12-19 · unverdicted · none · ref 17
Qwen2.5 LLMs scale pre-training data to 18 trillion tokens and apply multistage reinforcement learning, achieving competitive performance on benchmarks with models up to 5 times larger.

Athene-70b: Redefining the boundaries of post-training for open models, July 2024a

fields

years

verdicts

representative citing papers

citing papers explorer