pith. machine review for the scientific record.

arxiv: 2512.24880 · v2 · submitted 2025-12-31 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

mHC: Manifold-Constrained Hyper-Connections

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 12:27 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords Manifold-Constrained Hyper-Connections · Hyper-Connections · Residual Connections · Training Stability · Scalability · Large Language Models · Neural Architecture Design

The pith

Projecting hyper-connection residuals onto a manifold restores identity mapping for stable large-scale training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Manifold-Constrained Hyper-Connections (mHC) to address limitations of Hyper-Connections (HC), which widen the residual stream and diversify connectivity patterns for better performance. That diversification breaks the identity mapping that makes residual connections stable to train, causing training instability and limited scalability at large model sizes, and adds memory-access overhead. mHC projects the residual connection space onto a manifold to recover the identity property while retaining the diversity gains, and adds infrastructure optimizations for efficiency. Experiments show tangible performance improvements and effective training at scale. A sympathetic reader would care because this could support more reliable growth in model size and capability without the previous training breakdowns.

Core claim

mHC projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Empirical experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability.

What carries the argument

Manifold projection applied to the residual connection space of hyper-connections, restoring the identity mapping property while keeping diversified connectivity.
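
The abstract does not name the manifold or the projection operator (a point the referee also raises), so the sketch below is a purely editorial illustration in Python rather than the paper's method: it assumes an HC-style block with n parallel residual streams mixed by a learned matrix, and uses a Sinkhorn-style projection toward doubly stochastic matrices, one manifold that contains the identity, as a stand-in constraint. All names and shapes (sinkhorn, hc_block, H_res, h_in, h_out) are hypothetical.

    import numpy as np

    def sinkhorn(M, n_iters=20, eps=1e-9):
        """Push a matrix toward the doubly stochastic manifold by
        alternating row and column normalization (Sinkhorn-Knopp)."""
        M = np.abs(M) + eps
        for _ in range(n_iters):
            M = M / M.sum(axis=1, keepdims=True)  # rows sum to ~1
            M = M / M.sum(axis=0, keepdims=True)  # columns sum to 1
        return M

    def hc_block(streams, layer_fn, H_res, h_in, h_out, constrain=True):
        """One hyper-connection-style block over n parallel residual streams.

        streams : (n, d) parallel residual streams
        layer_fn: the wrapped layer (attention/MLP), maps (d,) -> (d,)
        H_res   : (n, n) stream-mixing matrix on the residual path
        h_in    : (n,)   read weights forming the layer input
        h_out   : (n,)   write weights scattering the layer output
        """
        if constrain:
            # mHC-style step (illustrative): keep the residual mixing matrix
            # on a manifold containing the identity, so stacked blocks cannot
            # silently rescale the residual signal.
            H_res = sinkhorn(H_res)
        x = h_in @ streams                     # (d,) layer input
        y = layer_fn(x)                        # (d,) layer output
        return H_res @ streams + np.outer(h_out, y)

    # Toy check: with a zero layer, the constrained residual path preserves the
    # per-feature mass summed over streams (columns of a doubly stochastic
    # matrix sum to 1), mimicking the identity-mapping property.
    rng = np.random.default_rng(0)
    n, d = 4, 8
    streams = rng.normal(size=(n, d))
    out = hc_block(streams, lambda x: np.zeros_like(x),
                   rng.normal(size=(n, n)), rng.normal(size=n), rng.normal(size=n))
    print(np.allclose(out.sum(axis=0), streams.sum(axis=0)))  # True

The doubly stochastic choice here is only for illustration; any manifold that contains the identity and keeps the mixing non-expansive would play the same role in the sketch.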

If this is right

  • Tangible performance improvements over standard residual connections during large-scale training.
  • Superior scalability that supports larger model sizes without instability.
  • Reduced memory access overhead through the added infrastructure optimizations.
  • More reliable training dynamics that preserve the benefits of diversified connectivity patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The manifold approach could be tested on other forms of expanded or diversified connections beyond the original HC design.
  • Different choices of manifold might produce task-specific gains in stability or efficiency for particular model families.
  • This framing suggests a route to explore topological constraints as a general tool for balancing expressivity and trainability in deep networks.
  • If effective, mHC-style projections might lower the cost of iterating on new connectivity schemes during architecture search.

Load-bearing premise

Projecting the residual connection space of hyper-connections onto a specific manifold restores the identity mapping property while preserving the performance benefits of diversified connectivity patterns.
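
Stated in the usual unrolled-residual form, the premise reads as follows. This is an editorial sketch in LaTeX, not the paper's derivation, and it assumes the manifold contains the identity and consists of matrices with operator norm at most one (doubly stochastic matrices are one such example); the abstract never specifies these details, and the symbols H_l, g_l, and \Pi_M are illustrative notation.

    % Unrolled view of an HC-style residual path with per-block
    % stream-mixing matrices H_l and block functions g_l:
    x_{l+1} = H_l x_l + g_l(x_l)
      \;\Longrightarrow\;
    x_L = \Big( \prod_{l=0}^{L-1} H_l \Big) x_0
          + \sum_{l=0}^{L-1} \Big( \prod_{k=l+1}^{L-1} H_k \Big) g_l(x_l)
    % Plain residual connections set H_l = I, so the first term is exactly
    % x_0: the identity mapping. Unconstrained HC lets \prod_l H_l drift,
    % which is the claimed source of instability. If each H_l is replaced by
    % its projection \Pi_M(H_l) onto a manifold M with I \in M and
    % \|A\|_2 \le 1 for every A \in M, submultiplicativity gives
    \Big\| \prod_{l=0}^{L-1} \Pi_M(H_l) \Big\|_2 \le 1
    % so the pure signal path stays non-expansive while the g_l terms carry
    % the diversified connectivity.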

What would settle it

A direct comparison of training curves and final performance for equivalent large-scale models using mHC versus unconstrained HC, checking whether mHC avoids divergence and reaches higher accuracy.
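
Short of full training runs, a cheap editorial proxy for the divergence half of that question is to watch how the composed residual-path operator scales with depth. The numpy sketch below is illustrative only: the spectral-norm cap stands in for whatever projection the paper actually uses, and composed_path_norm, cap_norm, and the 0.1 perturbation scale are assumptions, not reported settings.

    import numpy as np

    rng = np.random.default_rng(1)
    n, depth = 4, 256

    def composed_path_norm(project=None):
        """Spectral norm of the product of per-block stream-mixing matrices,
        a proxy for how the pure residual path rescales signals with depth."""
        P = np.eye(n)
        for _ in range(depth):
            H = np.eye(n) + 0.1 * rng.normal(size=(n, n))  # near-identity mixing
            if project is not None:
                H = project(H)                             # constraint step
            P = H @ P
        return np.linalg.norm(P, 2)

    # Stand-in constraint: rescale each mixing matrix to be non-expansive.
    cap_norm = lambda H: H / max(1.0, np.linalg.norm(H, 2))

    print("unconstrained composed norm:", composed_path_norm())          # drifts away from 1
    print("constrained composed norm  :", composed_path_norm(cap_norm))  # stays <= 1

In a real comparison this diagnostic would sit alongside the evidence the paper would need to report: training curves, gradient norms, and downstream accuracy for mHC versus unconstrained HC at matched scale.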

read the original abstract

Recently, studies exemplified by Hyper-Connections (HC) have extended the ubiquitous residual connection paradigm established over the past decade by expanding the residual stream width and diversifying connectivity patterns. While yielding substantial performance gains, this diversification fundamentally compromises the identity mapping property intrinsic to the residual connection, which causes severe training instability and restricted scalability, and additionally incurs notable memory access overhead. To address these challenges, we propose Manifold-Constrained Hyper-Connections (mHC), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Empirical experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability. We anticipate that mHC, as a flexible and practical extension of HC, will contribute to a deeper understanding of topological architecture design and suggest promising directions for the evolution of foundational models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Manifold-Constrained Hyper-Connections (mHC) as a general framework extending Hyper-Connections (HC). It projects the residual connection space of HC onto a specific manifold to restore the identity mapping property (lost in standard HC, causing instability and scalability limits), while adding infrastructure optimizations for efficiency. The central claim is that empirical experiments show mHC enables effective large-scale training with tangible performance gains and superior scalability over prior approaches.

Significance. If the empirical claims and the preservation of HC benefits under projection hold, mHC could provide a practical route to diversified residual connectivity without instability penalties, advancing topological design for foundational models. The emphasis on infrastructure optimization is a concrete strength for real-world deployment.

major comments (2)
  1. [Abstract] Abstract: the claim that 'empirical experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability' is unsupported by any quantitative metrics, baselines, error bars, ablation controls, or dataset/model sizes. This absence makes the central empirical assertion impossible to evaluate.
  2. [mHC Framework] mHC framework description: no derivation or analysis is supplied showing that the manifold projection operator restores exact identity mapping while commuting with the HC mixing operations so that diversified connectivity patterns are preserved at scale. If the projection forces effective connectivity back toward standard residuals, the reported gains would be lost; this is load-bearing for the central claim.
minor comments (1)
  1. [Abstract] Abstract: the specific manifold and projection operator are not named, which reduces immediate clarity for readers familiar with HC.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We address each major comment below, indicating the specific revisions we will undertake to strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'empirical experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability' is unsupported by any quantitative metrics, baselines, error bars, ablation controls, or dataset/model sizes. This absence makes the central empirical assertion impossible to evaluate.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative details. In the revised manuscript we will update the abstract to reference specific performance metrics from our experiments (including gains over baselines), the model sizes and datasets used, and explicit mentions of error bars and ablation controls. This will make the empirical claims directly evaluable. revision: yes

  2. Referee: [mHC Framework] mHC framework description: no derivation or analysis is supplied showing that the manifold projection operator restores exact identity mapping while commuting with the HC mixing operations so that diversified connectivity patterns are preserved at scale. If the projection forces effective connectivity back toward standard residuals, the reported gains would be lost; this is load-bearing for the central claim.

    Authors: We thank the referee for identifying this gap. The current description introduces the projection operator but does not supply the requested derivation. We will add a dedicated theoretical analysis subsection that derives how the manifold projection restores exact identity mapping and proves that the operator commutes with the HC mixing operations, thereby preserving diversified connectivity at scale. This will directly demonstrate that the projection does not collapse connectivity patterns back to standard residuals. revision: yes

Circularity Check

0 steps flagged

No significant circularity; mHC defined via external manifold projection and validated empirically

full rationale

The paper defines mHC as a projection of HC residual space onto a chosen manifold to restore identity mapping, then reports empirical gains in scalability and performance. No load-bearing step reduces a prediction or uniqueness claim to a fitted parameter, self-citation chain, or definitional tautology. The manifold choice and projection operator are introduced as design decisions supported by experiments rather than derived from the target result itself. This is a standard non-circular proposal of a new architectural constraint.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that a manifold projection can simultaneously restore identity mapping and retain HC performance gains; no free parameters or new entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Projecting residual connection space onto a specific manifold restores the identity mapping property
    Invoked to solve the instability problem caused by diversified connectivity.

pith-pipeline@v0.9.0 · 5520 in / 1003 out tokens · 27499 ms · 2026-05-15T12:27:06.891339+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Efficient and provably convergent end-to-end training of deep neural networks with linear constraints

    math.OC 2026-05 unverdicted novelty 7.0

    An efficiently computable HS-Jacobian acts as a conservative mapping for projections onto polyhedral sets, supporting provably convergent Adam-based end-to-end training of linearly constrained deep neural networks.

  2. FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connect...

  3. Transformers with Selective Access to Early Representations

    cs.LG 2026-05 unverdicted novelty 7.0

    SATFormer uses a learned context-dependent gate for selective access to early-layer value representations in Transformers, improving loss and accuracy over static residual baselines.

  4. Transformers with Selective Access to Early Representations

    cs.LG 2026-05 unverdicted novelty 7.0

    SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.

  5. Can an MLP Absorb Its Own Skip Connection?

    cs.LG 2026-04 accept novelty 7.0

    Skip-connected MLPs and residual-free MLPs of equal width represent generically disjoint function classes for common activations, with explicit impossibility proofs and a non-generic absorption condition for ReLU and GELU.

  6. Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning

    cs.LG 2026-04 unverdicted novelty 7.0

    A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and common...

  7. LoopCTR: Unlocking the Loop Scaling Power for Click-Through Rate Prediction

    cs.IR 2026-04 unverdicted novelty 7.0

    LoopCTR trains CTR models with recursive layer reuse and process supervision so that zero-loop inference outperforms baselines on public and industrial datasets.

  8. Optimistic Dual Averaging Unifies Modern Optimizers

    cs.LG 2026-05 unverdicted novelty 6.0

    SODA unifies several modern optimizers under optimistic dual averaging and supplies a 1/k decay wrapper that improves performance without weight decay tuning.

  9. Cubit: Token Mixer with Kernel Ridge Regression

    cs.LG 2026-05 unverdicted novelty 6.0

    Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.

  10. The E$\Delta$-MHC-Geo Transformer: Adaptive Geodesic Operations with Guaranteed Orthogonality

    cs.LG 2026-05 unverdicted novelty 6.0

    The EΔ-MHC-Geo Transformer achieves input-adaptive unconditionally orthogonal residual connections via a Cayley-based rotation that works for all parameters, combined with a learned hybrid gate for reflections.

  11. Graph Normalization: Fast Binarizing Dynamics for Differentiable MWIS

    cs.LG 2026-05 unverdicted novelty 6.0

    Graph Normalization is a convergent dynamical system that approximates MWIS by always reaching a binary maximum independent set via majorization-minimization and evolutionary game equivalence.

  12. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  13. Beyond the Laplacian: Doubly Stochastic Matrices for Graph Neural Networks

    cs.LG 2026-04 unverdicted novelty 6.0

    DsmNet substitutes Laplacian matrices with approximated doubly stochastic matrices in GNNs, using Neumann truncation and residual mass compensation to achieve O(K|E|) efficiency and bound Dirichlet energy decay for re...

  14. ResBM: Residual Bottleneck Models for Low-Bandwidth Pipeline Parallelism

    cs.LG 2026-04 unverdicted novelty 6.0

    ResBM achieves 128x activation compression in pipeline-parallel transformer training by adding a residual bottleneck module that preserves a low-rank identity path, with no major loss in convergence or added overhead.

  15. LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling

    cs.CL 2026-03 unverdicted novelty 6.0

    LPC-SM is a hybrid architecture separating local attention, persistent memory, predictive correction, and control with ONT for memory writes, showing loss reductions on 158M-parameter models up to 4096-token contexts.

  16. Beyond Linearity in Attention Projections: The Case for Nonlinear Queries

    cs.LG 2026-03 conditional novelty 6.0

    Nonlinear query projections of the form X + MLP(X) improve transformer performance on small models with only d² + O(d) added parameters.

  17. Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 5.0

    GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding the best aggregate perception and reasoning scores on Qwen2.5-VL 7B among supervised variants while providing task-relevan...

  18. mHC-SSM: Manifold-Constrained Hyper-Connections for State Space Language Models with Stream-Specialized Adapters

    cs.LG 2026-05 unverdicted novelty 5.0

    Manifold-constrained multi-stream mixing plus per-stream adapters improves SSM language model validation loss from 6.3507 to 6.1353 and perplexity from 572.91 to 461.88 on WikiText-2.

  19. Hyperloop Transformers

    cs.LG 2026-04 unverdicted novelty 5.0

    Hyperloop Transformers outperform standard and mHC Transformers with roughly 50% fewer parameters by looping a middle block of layers and applying hyper-connections only after each loop.

  20. Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling

    cs.LG 2026-04 unverdicted novelty 5.0

    Nexusformer uses a three-stage nonlinear mapping in attention to enable stable, inheritable scaling of transformers, matching baseline perplexity with up to 41.5% less compute when growing from 240M to 440M parameters.

  21. YOCO++: Enhancing YOCO with KV Residual Connections for Efficient LLM Inference

    cs.CL 2026-04 unverdicted novelty 4.0

    YOCO++ enhances YOCO by adding weighted residual KV connections from bottom layers, delivering state-of-the-art results among cross-layer compression methods at 50% KV cache reduction and outperforming the standard Tr...

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 20 Pith papers · 14 internal anchors
