pith. machine review for the scientific record.

arxiv: 2604.14168 · v1 · submitted 2026-03-24 · 💻 cs.CL · cs.AI

Recognition: no theorem link

SAGE Celer 2.6 Technical Card

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 00:51 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI
keywords SAGE Celer 2.6 · South Asian languages · Devanagari tokenizer · Inverse Reasoning · multimodal models · Nepali · Hindi · language models

The pith

SAGE Celer 2.6 delivers competitive math and coding results with a custom Devanagari tokenizer for strong Nepali and Hindi support.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SAGE Celer 2.6, a family of general-purpose models in 5B, 10B, and 27B parameter sizes, built from an undisclosed base model through architectural modifications and further pre-training. An Inverse Reasoning pipeline trains the model to validate its own logic paths, which reduces cascading errors and hallucinations during complex tasks. A native end-to-end vision encoder provides multimodal capabilities, and a custom tokenizer for the Devanagari script targets strong performance in Nepali and Hindi while preserving English reasoning and keeping latency low.
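The abstract describes Inverse Reasoning only at this conceptual level, so any concrete reading is a guess. As a hedged illustration of what step-level self-validation can mean in practice (not SAGEA's actual pipeline, which is undisclosed), here is a minimal Python sketch that filters reasoning traces by re-deriving their arithmetic steps and rejecting any trace containing a step that fails the check:

```python
# Hypothetical sketch of "validating logic paths": re-derive each arithmetic
# step of a reasoning trace and reject the trace on the first mismatch.
# Illustrative only; the paper discloses no IR implementation details.
import re

STEP = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}

def validate_trace(trace: str) -> bool:
    """True iff every 'a op b = c' step in the trace re-derives correctly."""
    return all(OPS[op](int(a), int(b)) == int(c)
               for a, op, b, c in STEP.findall(trace))

# Keep only self-consistent traces (rejection-sampling-style filtering).
traces = ["12 * 4 = 48, then 48 + 7 = 55",
          "12 * 4 = 46, then 46 + 7 = 53"]  # second trace has a bad step
print([t for t in traces if validate_trace(t)])
```

In a training pipeline, a filter like this would gate which sampled traces are kept as fine-tuning data; whether IR works that way, as a learned verifier, or as something else entirely is exactly what the card leaves open.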

Core claim

Celer 2.6 builds on an undisclosed base model through extensive architectural modifications and further pre-training. Using the Inverse Reasoning pipeline, it is trained to validate its own logic paths, minimizing cascading errors and hallucinations. It includes an end-to-end vision encoder for multimodal functionality and reports competitive results, at low latency, on ACUMEN benchmarks for mathematics, coding, and general intelligence. It is optimized for South Asian language support through a custom tokenizer for the Devanagari script, showing strong performance in Nepali and Hindi without sacrificing English reasoning ability.

What carries the argument

Inverse Reasoning (IR) pipeline that natively trains the model to validate its own logic paths, combined with a custom tokenizer for the Devanagari script.
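The card reports neither the tokenizer's vocabulary size nor its training data, so the sketch below shows only the generic technique: fitting a BPE vocabulary directly on Devanagari text with the Hugging Face `tokenizers` library and reading off fertility (tokens per word), the usual yardstick a script-aware vocabulary is meant to push toward 1. The toy corpus and vocabulary size are invented placeholders:

```python
# Generic Devanagari BPE sketch; nothing here reflects SAGEA's actual setup.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus = ["नेपाल एक सुन्दर देश हो।", "यो एक हिन्दी वाक्य है।"]  # toy stand-in corpus

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.train_from_iterator(
    corpus, trainers.BpeTrainer(vocab_size=500, special_tokens=["[UNK]"]))

# Fertility = tokens per whitespace word; vocabularies not trained on
# Devanagari tend to shatter each word into many pieces, inflating this ratio.
enc = tokenizer.encode("नेपाल सुन्दर देश हो")
print(enc.tokens, len(enc.tokens) / 4)
```

A lower fertility on Nepali and Hindi text is also what ties the tokenizer claim to the latency claim: fewer tokens per word means fewer decoding steps per sentence.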

Load-bearing premise

The undisclosed pre-training, Inverse Reasoning pipeline, and custom tokenizer deliver the stated competitive benchmark results and language performance without post-hoc selection or unstated trade-offs.

What would settle it

Independent benchmark evaluations on ACUMEN or standard tests such as GSM8K for mathematics, HumanEval for coding, and dedicated Nepali/Hindi language tasks that either match the claimed competitiveness or fall short.
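A minimal sketch of the math leg of such a check, assuming GSM8K's standard answer format (the `####` delimiter is GSM8K's convention; the toy data and the idea of sourcing predictions from a Celer 2.6 endpoint are placeholders):

```python
# Exact-match scoring of final numeric answers, GSM8K style. Predictions
# would come from the model under test; references from the dataset.
import re

def final_number(text: str) -> str | None:
    """Prefer GSM8K's '#### <answer>' field; fall back to the last number."""
    m = re.search(r"####\s*(-?[\d,]+)", text)
    if m:
        return m.group(1).replace(",", "")
    nums = re.findall(r"-?\d+(?:\.\d+)?", text)
    return nums[-1] if nums else None

def exact_match(predictions: list[str], references: list[str]) -> float:
    hits = sum(final_number(p) == final_number(r)
               for p, r in zip(predictions, references))
    return hits / len(references)

refs = ["She sells 9 eggs at $2 each. #### 18"]
preds = ["9 eggs * $2 = $18, so the answer is 18"]
print(exact_match(preds, refs))  # 1.0
```

The same harness generalizes to HumanEval (pass@k over unit tests) and to dedicated Nepali/Hindi tasks once references exist; none of it depends on anything SAGEA would need to disclose.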

read the original abstract

We introduce SAGE Celer 2.6, the latest in our line of general-purpose Celer models from SAGEA. Celer 2.6 is available in 5B, 10B, and 27B parameter sizes and benefits from extensive architectural modifications and further pre-training on an undisclosed model. Using our Inverse Reasoning (IR) pipeline, SAGEA natively trains Celer 2.6 to validate its own logic paths, minimizing cascading error and hallucination in complex reasoning tasks. Celer 2.6 also boasts natively integrated multimodal functionality with an end-to-end vision encoder to avoid common pitfalls in adapter-based approaches. Celer 2.6 provides highly competitive results on mathematics, coding, and general intelligence benchmarks (ACUMEN), along with low latency. Most importantly, Celer 2.6 is specifically optimized for South Asian language support, with a custom tokenizer for the Devanagari script and strong performance in both Nepali and Hindi without sacrificing English reasoning ability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SAGE Celer 2.6, a family of general-purpose models (5B/10B/27B parameters) from SAGEA. It describes architectural modifications and further pre-training on an undisclosed base model, an Inverse Reasoning pipeline for self-validation of logic paths to reduce cascading errors and hallucinations, native multimodal support via an end-to-end vision encoder, highly competitive results on mathematics, coding, and general intelligence benchmarks (ACUMEN) at low latency, and targeted optimization for South Asian languages through a custom Devanagari tokenizer that delivers strong Nepali and Hindi performance without English trade-offs.

Significance. If the performance and architectural claims were substantiated with data, the work could contribute to multilingual LLM research by addressing tokenization challenges for Devanagari-script languages and introducing a reasoning-validation pipeline. The multimodal design without adapters would also be of interest. However, the complete absence of any empirical results, baselines, or methodological details prevents any assessment of actual significance or novelty.

major comments (2)
  1. [Abstract] The central claims that Celer 2.6 'provides highly competitive results on mathematics, coding, and general intelligence benchmarks (ACUMEN)' and achieves 'strong performance in both Nepali and Hindi without sacrificing English reasoning ability' are stated without any numerical scores, comparison tables, dataset descriptions, evaluation protocols, or error analyses. These assertions are load-bearing for the paper's contribution yet remain entirely unsupported.
  2. Full text: No information is supplied on the undisclosed pre-training corpus, the implementation details of the Inverse Reasoning pipeline, the architecture or training of the end-to-end vision encoder, or any ablation studies that would demonstrate the claimed benefits over standard approaches.
minor comments (1)
  1. The relationship between SAGEA and the prior Celer models is not clarified, and the manuscript provides no references to related work on custom tokenizers or reasoning pipelines.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their review of the SAGE Celer 2.6 Technical Card. This document is a concise model card summarizing architectural choices and high-level capabilities rather than a full research paper. We address the major comments below.

read point-by-point responses
  1. Referee: [Abstract] The central claims that Celer 2.6 'provides highly competitive results on mathematics, coding, and general intelligence benchmarks (ACUMEN)' and achieves 'strong performance in both Nepali and Hindi without sacrificing English reasoning ability' are stated without any numerical scores, comparison tables, dataset descriptions, evaluation protocols, or error analyses. These assertions are load-bearing for the paper's contribution yet remain entirely unsupported.

    Authors: We agree that the abstract presents performance claims without supporting numbers or protocols. As this is a technical card rather than an empirical paper, detailed benchmarks were omitted for brevity. We will revise the abstract to qualify these statements (e.g., 'demonstrates strong internal performance on...') and add a pointer to evaluation resources where available. revision: partial

  2. Referee: [—] Full text: No information is supplied on the undisclosed pre-training corpus, the implementation details of the Inverse Reasoning pipeline, the architecture or training of the end-to-end vision encoder, or any ablation studies that would demonstrate the claimed benefits over standard approaches.

    Authors: The pre-training corpus is undisclosed for proprietary reasons. The Inverse Reasoning pipeline is described at a conceptual level as self-validation of logic paths; full implementation details and the precise vision encoder architecture/training are omitted to protect intellectual property. Ablation studies are not included because the document is a model overview, not a research contribution seeking to prove incremental gains. The primary novelties highlighted are the custom Devanagari tokenizer and native multimodal integration. revision: no

standing simulated objections not resolved
  • Details on the pre-training corpus, full implementation of the Inverse Reasoning pipeline, vision encoder architecture, and any ablation studies cannot be disclosed due to proprietary constraints.

Circularity Check

0 steps flagged

No circularity: technical card contains no derivations, equations, or fitted predictions

full rationale

The document is a model announcement card that states performance claims and architectural features without presenting any equations, derivation chains, fitted parameters, or quantitative predictions. No load-bearing steps exist that could reduce, by construction, to inputs, self-citations, or ansatzes. The audited patterns (self-definitional claims, fitted inputs presented as predictions, etc.) all require explicit mathematical or logical reductions, which are absent here; the text stands on its own as descriptive prose.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, axioms, or invented entities are present; the document is a descriptive model card rather than a theoretical or empirical research paper.

pith-pipeline@v0.9.0 · 5509 in / 1210 out tokens · 46855 ms · 2026-05-15T00:51:08.175279+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 13 internal anchors

  1. [1]

    Meta AI. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  2. [2]

    Y. Bai, S. Kadavath, S. Kundu, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022

  3. [3]

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    J. Chen et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2024

  4. [4]

    M. Chen, J. Tworek, H. Jun, Q. Yuan, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  5. [5]

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    D. Ganguli, L. Lovitt, J. Kernion, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022

  6. [6]

    Measuring Massive Multitask Language Understanding

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  7. [7]

    Measuring Mathematical Problem Solving With the MATH Dataset

    D. Hendrycks, C. Burns, S. Kadavath, A. Arora, et al. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021

  8. [8]

    B. Jha, F. Paudel, U. Puri, E. Henkel, J. Wang, et al. SAGE-32B: Agentic reasoning via iterative distillation. arXiv preprint arXiv:2601.04237, 2026

  9. [9]

    Self-Refine: Iterative Refinement with Self-Feedback

    A. Madaan, N. Tandon, P. Gupta, et al. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  10. [10]

    Red Teaming Language Models with Language Models

    E. Perez, S. Huang, H. F. Song, et al. Red teaming language models with language models. arXiv preprint arXiv:2202.03286, 2022

  11. [11]

    XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

    P. Röttger, B. Vidgen, D. Nguyen, et al. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263, 2023

  12. [12]

    P. Rust, J. Pfeiffer, I. Vulić, S. Ruder, and I. Gurevych. How good is your tokenizer? On the monolingual performance of multilingual language models. arXiv preprint arXiv:2012.15613, 2021

  13. [13]

    Acumen methodology

    SAGEA. Acumen methodology. Technical report, SAGEA, 2026. Internal technical report, manuscript in preparation

  14. [14]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    N. Shinn, F. Cassano, A. Berman, et al. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  15. [15]

    Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

  16. [16]

    Qwen Team. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  17. [17]

    Y. Wang, Y. Zhang, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574, 2024
