SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning

SingGuard Team

arXiv: 2606.22873 · v2 · pith:KXXB6DMLnew · submitted 2026-06-22 · cs.CV · cs.CL

Pith reviewed 2026-06-26 09:17 UTC · model grok-4.3

keywords multimodal guardrails, policy adaptation, safety assessment, vision-language models, dynamic reasoning, reinforcement learning, benchmark evaluation

The pith

SingGuard lets multimodal guardrails accept natural-language policies as runtime input and check content rule by rule.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SingGuard as a family of models for safety assessment in vision-language model conversations. It accepts active policies in natural language at runtime, evaluates the input content against each rule, and returns both a safety label and the specific rule that triggered the decision. The model supports a spectrum of inference modes from fast direct classification to slower policy-grounded reasoning, tuned with fast-slow decoupled reinforcement learning. Evaluation uses a new benchmark of 56,340 examples covering more than 80 risk types, including cross-modal cases where separate modalities are harmless but their combination is not. Results show state-of-the-art average F1 scores across six benchmark families and higher accuracy when policies shift during operation.

Core claim

SingGuard treats the active policy as a runtime input: given natural-language rules, it checks the target content against the active policy rule by rule and predicts both the safety label and the triggered rule. To balance efficiency and interpretability, SingGuard supports fast, hybrid, and slow inference regimes along a fast-to-slow reasoning spectrum, ranging from direct safety judgments to policy-grounded deliberation, optimized with fast-slow decoupled reinforcement learning.
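
The rule-by-rule interface described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `Rule`, `Verdict`, `assess`, and `keyword_judge` are hypothetical, and the keyword matcher is a toy stand-in for whatever model call actually judges a rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Rule:
    rule_id: str
    text: str  # natural-language rule, supplied at runtime


@dataclass(frozen=True)
class Verdict:
    label: str                     # "safe" or "unsafe"
    triggered_rule: Optional[str]  # id of the first violated rule, if any


# Hypothetical judge signature: (content, rule text) -> True if the rule is violated.
Judge = Callable[[str, str], bool]


def assess(content: str, policy: List[Rule], judge: Judge) -> Verdict:
    """Check the content against the active policy one rule at a time."""
    for rule in policy:
        if judge(content, rule.text):
            return Verdict(label="unsafe", triggered_rule=rule.rule_id)
    return Verdict(label="safe", triggered_rule=None)


def keyword_judge(content: str, rule_text: str) -> bool:
    """Toy stand-in for a model call (assumption): flags content that
    mentions the last word of the rule."""
    keyword = rule_text.rstrip(".").split()[-1].lower()
    return keyword in content.lower()


# Two policies that differ only in which rules are active.
policy_a = [Rule("R1", "Do not discuss weapons."),
            Rule("R2", "Do not discuss gambling.")]
policy_b = [Rule("R2", "Do not discuss gambling.")]

text = "Where can I buy weapons online?"
verdict_a = assess(text, policy_a, keyword_judge)
verdict_b = assess(text, policy_b, keyword_judge)
print(verdict_a)
print(verdict_b)
```

The same content receives a different verdict under `policy_a` and `policy_b` with no change to the judge, which is the property the core claim relies on: the policy is data passed in at call time rather than something fixed in the model.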

What carries the argument

Policy as runtime input with rule-by-rule evaluation and fast-to-slow reasoning spectrum optimized by decoupled reinforcement learning.

If this is right

SingGuard reaches state-of-the-art average F1 on every one of the six benchmark families spanning 35 datasets.
Dynamic-rule evaluation raises policy-following accuracy from 0.6465 to 0.7415 when policies change at runtime.
The model handles cross-modal joint-risk cases where each modality alone is harmless but the combination implies unsafe intent.
Fast, hybrid, and slow inference regimes let users trade speed for deeper policy-grounded deliberation without retraining.
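
To put the dynamic-rule figures in perspective, the short calculation below derives the absolute gain, the relative gain, and the relative error reduction from the two reported accuracies. Only the two input numbers come from the paper; the derived quantities are plain arithmetic and the variable names are illustrative.

```python
baseline = 0.6465  # policy-following accuracy before, as reported
improved = 0.7415  # policy-following accuracy after, as reported

absolute_gain = improved - baseline
relative_gain = absolute_gain / baseline
# Share of the remaining errors that the improvement removes.
error_reduction = absolute_gain / (1.0 - baseline)

print(f"absolute gain:   {absolute_gain:.4f}")
print(f"relative gain:   {relative_gain:.1%}")
print(f"error reduction: {error_reduction:.1%}")
```

The gain is 9.5 accuracy points, about 14.7% relative to the starting accuracy, and it removes roughly 26.9% of the original errors.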

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Guardrails could be deployed across regions or products and updated simply by swapping the natural-language policy text.
The same rule-by-rule structure might apply to safety evaluation in non-multimodal LLM settings.
The benchmark's emphasis on dynamic rules could become a standard test for adaptability in future safety models.

Load-bearing premise

The new benchmark of 56,340 examples and 80+ risk types, including cross-modal cases, measures real-world multimodal safety performance in a faithful and generalizable way.

What would settle it

A live deployment test that applies policy changes at runtime outside the benchmark distributions and measures whether policy-following accuracy falls below the reported 0.7415.
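
A check of this kind could be scored roughly as follows. The sketch assumes that a prediction counts as correct only when both the safety label and the triggered rule match the annotation under the policy active for that example; the abstract does not define policy-following accuracy, so this definition, the function name, and the toy data are all assumptions.

```python
from typing import List, Optional, Tuple

Prediction = Tuple[str, Optional[str]]  # (safety label, triggered rule id)

REPORTED_ACCURACY = 0.7415  # figure reported in the paper's abstract


def policy_following_accuracy(preds: List[Prediction],
                              golds: List[Prediction]) -> float:
    """Fraction of examples where label and triggered rule both match (assumed metric)."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        raise ValueError("at least one example is required")
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)


# Toy annotations and predictions collected after a policy change (made up).
golds = [("unsafe", "R1"), ("safe", None), ("unsafe", "R3"), ("safe", None)]
preds = [("unsafe", "R1"), ("safe", None), ("unsafe", "R2"), ("safe", None)]

acc = policy_following_accuracy(preds, golds)
falls_below = acc < REPORTED_ACCURACY
print(acc, falls_below)
```

In the toy data the third prediction has the right label but the wrong rule, so it counts as a miss under this stricter definition.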

Figures

Figures reproduced from arXiv: 2606.22873 by SingGuard Team.

Figure 2: Overview of SingGuard. Previous guardrails often cover only partial modalities, rely on fixed taxonomies or static policies, and use a fixed direct-answer or always-reasoning output path. SingGuard unifies multimodal inputs, open active policies, and adaptive reasoning paths in one policy-adaptive guardrail model.

Figure 3: Policy-conditioned data and training pipeline. SingGuard aligns heterogeneous safety data with …

Figure 4: Data-statistics overview of SingGuard-Bench: taxonomy distribution, sample counts, and keyword …

Figure 5: Radar comparison over benchmark families. SingGuard shows more balanced coverage across text, …

Figure 6: Rule Isolation Mask (RI-Mask) for parallel multi-rule inference. RI-Mask packs the shared image– …

Original abstract

Vision-language models (VLMs) are increasingly deployed in consumer, medical, financial, and enterprise applications. This broad deployment expands the safety surface: risks can arise from multimodal question answering, assistant responses, and cross-modal composition, while moderation policies may vary across products, regions, and deployment stages. Most existing guardrails either rely on fixed taxonomies or target only a narrow set of interaction settings, which limits their adaptability when safety rules change at deployment time. We present SingGuard, a policy-adaptive multimodal guardrail model family for safety assessment in multimodal conversations. SingGuard treats the active policy as a runtime input: given natural-language rules, it checks the target content against the active policy rule by rule and predicts both the safety label and the triggered rule. To balance efficiency and interpretability, SingGuard supports fast, hybrid, and slow inference regimes along a fast-to-slow reasoning spectrum, ranging from direct safety judgments to policy-grounded deliberation. We further optimize this behavior with fast-slow decoupled reinforcement learning. We also introduce SingGuard-Bench, a multimodal guardrail benchmark with 56,340 examples spanning 80+ fine-grained risk types across multimodal QA, adversarial attack, and dynamic-rule evaluation settings, including cross-modal joint-risk cases where each modality is harmless in isolation but their composition implies unsafe intent. Across six benchmark families (35 datasets), SingGuard achieves state-of-the-art average F1 in every family. Dynamic-rule evaluation further shows improved policy-following accuracy from 0.6465 to 0.7415 under runtime policy shifts. Our code is available at https://github.com/inclusionAI/Sing-Guard.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SingGuard's runtime natural-language policy input plus cross-modal joint-risk cases in the new benchmark are the actual additions worth noting, but the SOTA F1 claims across families cannot be assessed from the abstract alone.

SingGuard's main move is feeding the active policy as natural-language text at runtime instead of locking in a fixed taxonomy, plus their benchmark that explicitly includes cross-modal joint-risk cases where each modality looks fine alone. The fast-slow decoupled RL setup for switching between direct judgment and policy-grounded deliberation is a practical way to handle the speed versus depth tradeoff.

The paper frames the deployment problem clearly: policies differ by product, region, or stage, and existing guardrails do not adapt easily. The dynamic-rule numbers showing policy-following accuracy rising from 0.6465 to 0.7415 give a concrete signal that the runtime input helps. Releasing the code is also useful.

The soft spot is that the abstract supplies no methods, so the state-of-the-art average F1 across six families and 35 datasets cannot be checked for baseline fairness, statistical testing, or whether the 56,340-example SingGuard-Bench construction introduces its own biases. The 80+ risk types and cross-modal cases sound relevant, but their fidelity to real unsafe intent stays unverified.

This is for people working on multimodal safety tooling who need policy flexibility. A reader in that niche would get value from the runtime adaptation idea and the benchmark structure, provided the full paper shows how the examples were built and validated.

It deserves a serious referee. The core mechanism addresses a genuine deployment constraint even if the experimental claims will need close scrutiny.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces SingGuard, a policy-adaptive multimodal LLM guardrail that accepts natural-language rules as runtime input to perform rule-by-rule safety assessment, supporting fast, hybrid, and slow inference modes optimized via fast-slow decoupled reinforcement learning. It also presents SingGuard-Bench, a new benchmark with 56,340 examples across 80+ risk types including cross-modal joint risks, and reports state-of-the-art average F1 scores on 35 datasets from six benchmark families, plus an improvement in policy-following accuracy from 0.6465 to 0.7415 under dynamic rules.

Significance. If the reported results are robust, this work could advance the field by enabling flexible, deployment-time policy adaptation for multimodal safety without model retraining, addressing a key limitation of existing fixed-taxonomy guardrails. The availability of the code at the provided GitHub repository is a positive step toward reproducibility. The benchmark's inclusion of cross-modal cases could help evaluate compositional risks more effectively.

major comments (1)

The central SOTA claims and dynamic-rule accuracy improvement depend on the construction of SingGuard-Bench (56,340 examples, 80+ risk types) and the 35 datasets; the abstract provides no details on benchmark creation, baseline implementations, or statistical significance testing for the F1 and accuracy numbers, preventing verification of whether the results are sound or affected by post-hoc choices.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and for identifying the need for greater transparency around benchmark construction and evaluation details to support verification of the reported results. We address the major comment below.

Referee: The central SOTA claims and dynamic-rule accuracy improvement depend on the construction of SingGuard-Bench (56,340 examples, 80+ risk types) and the 35 datasets; the abstract provides no details on benchmark creation, baseline implementations, or statistical significance testing for the F1 and accuracy numbers, preventing verification of whether the results are sound or affected by post-hoc choices.

Authors: We agree the abstract is concise by design and omits these details. The full manuscript describes SingGuard-Bench construction (data sources, synthetic cross-modal generation, annotation protocol for 80+ risk types, and the 56,340-example split) in Section 4, with explicit discussion of how dynamic-rule examples were created to avoid post-hoc leakage. The 35 datasets across six families are enumerated in Table 1 and Section 5.1; baseline implementations follow the original papers' protocols (with links to official code where available) and are reproduced via the released repository. We did not report statistical significance testing (e.g., bootstrap or paired t-tests) on the F1 and accuracy figures. We will add these tests in the revised manuscript.

Revision: partial
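
For readers unfamiliar with the kind of test mentioned in the rebuttal, here is a minimal sketch of a paired bootstrap over the F1 difference between two systems scored on the same examples. It is a generic illustration, not the procedure the authors will use: the function names, the 95% percentile interval, the resample count, and the toy labels are all assumptions.

```python
import random
from typing import List, Sequence, Tuple


def f1_score(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Binary F1 with 1 as the positive (unsafe) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def paired_bootstrap(gold: Sequence[int], pred_a: Sequence[int],
                     pred_b: Sequence[int], n_resamples: int = 1000,
                     seed: int = 0) -> Tuple[float, float, float]:
    """Return the observed F1(A) - F1(B) and a 95% percentile interval."""
    rng = random.Random(seed)
    n = len(gold)
    observed = f1_score(gold, pred_a) - f1_score(gold, pred_b)
    diffs: List[float] = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping A and B paired.
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(f1_score(g, a) - f1_score(g, b))
    diffs.sort()
    lo = diffs[int(0.025 * n_resamples)]
    hi = diffs[int(0.975 * n_resamples) - 1]
    return observed, lo, hi


# Toy labels (made up): system A makes one error per ten examples, system B three.
gold = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1] * 5
pred_a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1] * 5
pred_b = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1] * 5

observed, lo, hi = paired_bootstrap(gold, pred_a, pred_b)
print(f"F1 difference: {observed:.4f}, 95% interval: [{lo:.4f}, {hi:.4f}]")
```

Resampling indices once and applying them to both systems keeps the comparison paired, so example difficulty affects both scores equally. An interval that excludes zero would suggest the difference is unlikely to be a sampling artifact.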

Circularity Check

0 steps flagged

No significant circularity identified

The supplied paper text is limited to the abstract and high-level description. No equations, derivations, parameter-fitting procedures, or self-citation chains appear. Claims of SOTA performance and benchmark construction are presented as empirical results rather than reductions to prior inputs by construction. No load-bearing step reduces to self-definition, fitted prediction, or imported uniqueness from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only input supplies no information on free parameters, axioms, or invented entities; ledger is empty by necessity.

Reference graph

Works this paper leans on

296 extracted references · 72 linked inside Pith

[1]

arXiv preprint arXiv:2409.18839 , year=

Mineru: An open-source solution for precise document content extraction , author=. arXiv preprint arXiv:2409.18839 , year=

Pith/arXiv arXiv
[2]

arXiv preprint arXiv:2408.16737 , year=

Smaller, weaker, yet better: Training llm reasoners via compute-optimal sampling , author=. arXiv preprint arXiv:2408.16737 , year=

arXiv
[3]

Advances in Neural Information Processing Systems , volume=

Dart-math: Difficulty-aware rejection tuning for mathematical problem-solving , author=. Advances in Neural Information Processing Systems , volume=
[4]

arXiv preprint arXiv:2501.03262 , year=

REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models , author=. arXiv preprint arXiv:2501.03262 , year=

Pith/arXiv arXiv
[5]

arXiv preprint arXiv:2304.03208 , year=

Cerebras-GPT: Open compute-optimal language models trained on the Cerebras wafer-scale cluster , author=. arXiv preprint arXiv:2304.03208 , year=

arXiv
[6]

International Conference on Machine Learning , pages=

Pythia: A suite for analyzing large language models across training and scaling , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023
[7]

Journal of Machine Learning Research , volume=

Palm: Scaling language modeling with pathways , author=. Journal of Machine Learning Research , volume=
[8]

2023 , howpublished=

Mike Conover and Matt Hayes and Ankit Mathur and Jianwei Xie and Jun Wan and Sam Shah and Ali Ghodsi and Patrick Wendell and Matei Zaharia and Reynold Xin , title =. 2023 , howpublished=

2023
[9]

Wikimedia Foundation , title =
[10]

Advances in Neural Information Processing Systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in Neural Information Processing Systems , volume=
[11]

arXiv preprint arXiv:2401.12954 , year=

Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding , author=. arXiv preprint arXiv:2401.12954 , year=

arXiv
[12]

OpenWebText Corpus , author=
[13]

arXiv preprint arXiv:2204.05862 , year=

Training a helpful and harmless assistant with reinforcement learning from human feedback , author=. arXiv preprint arXiv:2204.05862 , year=

Pith/arXiv arXiv
[14]

EMNLP , year=

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering , author=. EMNLP , year=
[15]

Communications of the ACM , volume=

Winogrande: An adversarial winograd schema challenge at scale , author=. Communications of the ACM , volume=. 2021 , publisher=

2021
[16]

Thirty-Fourth AAAI Conference on Artificial Intelligence , year =

Yonatan Bisk and Rowan Zellers and Ronan Le Bras and Jianfeng Gao and Yejin Choi , title =. Thirty-Fourth AAAI Conference on Artificial Intelligence , year =
[17]

HuggingFace repository , howpublished =

OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces , author =. HuggingFace repository , howpublished =. 2023 , publisher =

2023
[18]

Hashimoto , title =

Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

2023
[19]

arXiv preprint arXiv:2304.12244 , year=

Wizardlm: Empowering large language models to follow complex instructions , author=. arXiv preprint arXiv:2304.12244 , year=

Pith/arXiv arXiv
[21]

arXiv preprint arXiv:2304.10592 , year=

Minigpt-4: Enhancing vision-language understanding with advanced large language models , author=. arXiv preprint arXiv:2304.10592 , year=

Pith/arXiv arXiv
[22]

arXiv preprint arXiv:2308.12067 , year=

Instructiongpt-4: A 200-instruction paradigm for fine-tuning minigpt-4 , author=. arXiv preprint arXiv:2308.12067 , year=

Pith/arXiv arXiv
[23]

International conference on machine learning , pages=

Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

2021
[24]

arXiv preprint arXiv:2308.12966 , year=

Qwen-vl: A frontier large vision-language model with versatile abilities , author=. arXiv preprint arXiv:2308.12966 , year=

Pith/arXiv arXiv
[25]

2018 , publisher=

Improving language understanding by generative pre-training , author=. 2018 , publisher=

2018
[26]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=
[27]

arXiv preprint arXiv:2106.09685 , year=

Lora: Low-rank adaptation of large language models , author=. arXiv preprint arXiv:2106.09685 , year=

Pith/arXiv arXiv
[28]

OpenCompass: A Universal Evaluation Platform for Foundation Models , author=
[29]

2024 , journal=

Can I understand what I create? Self-Knowledge Evaluation of Large Language Models , author=. 2024 , journal=

2024
[30]

arXiv preprint arXiv:2101.00027 , year=

The pile: An 800gb dataset of diverse text for language modeling , author=. arXiv preprint arXiv:2101.00027 , year=

Pith/arXiv arXiv
[31]

arXiv preprint arXiv:2001.08361 , year=

Scaling laws for neural language models , author=. arXiv preprint arXiv:2001.08361 , year=

Pith/arXiv arXiv 2001
[32]

arXiv preprint arXiv:2010.11929 , year=

An image is worth 16x16 words: Transformers for image recognition at scale , author=. arXiv preprint arXiv:2010.11929 , year=

Pith/arXiv arXiv 2010
[33]

arXiv preprint arXiv:2310.03744 , year=

Improved baselines with visual instruction tuning , author=. arXiv preprint arXiv:2310.03744 , year=

Pith/arXiv arXiv
[34]

arXiv preprint arXiv:2306.05685 , year=

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena , author=. arXiv preprint arXiv:2306.05685 , year=

Pith/arXiv arXiv
[35]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Eva: Exploring the limits of masked visual representation learning at scale , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[36]

Advances in neural information processing systems , volume=

Xlnet: Generalized autoregressive pretraining for language understanding , author=. Advances in neural information processing systems , volume=
[37]

arXiv preprint arXiv:2305.17326 , year=

Kernel-SSL: Kernel KL Divergence for Self-supervised Learning , author=. arXiv preprint arXiv:2305.17326 , year=

arXiv
[38]

2007 15th European signal processing conference , pages=

The effective rank: A measure of effective dimensionality , author=. 2007 15th European signal processing conference , pages=. 2007 , organization=

2007
[39]

2023 , journal=

Matrix Information Theory for Self-Supervised Learning , author=. 2023 , journal=

2023
[40]

Physical Review A , volume=

Quantum coding , author=. Physical Review A , volume=. 1995 , publisher=

1995
[41]

2013 , publisher=

Quantum information theory , author=. 2013 , publisher=

2013
[42]

arXiv preprint arXiv:1712.00409 , year=

Deep learning scaling is predictable, empirically , author=. arXiv preprint arXiv:1712.00409 , year=

Pith/arXiv arXiv
[43]

arXiv preprint arXiv:2203.15556 , year=

Training compute-optimal large language models , author=. arXiv preprint arXiv:2203.15556 , year=

Pith/arXiv arXiv
[44]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Reproducible scaling laws for contrastive language-image learning , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[46]

Wiley interdisciplinary reviews: computational statistics , volume=

Principal component analysis , author=. Wiley interdisciplinary reviews: computational statistics , volume=. 2010 , publisher=

2010
[47]

arXiv preprint arXiv:2307.09288 , year=

LLaMA: Open and Efficient Foundation Language Models , author=. arXiv preprint arXiv:2307.09288 , year=

Pith/arXiv arXiv
[48]

arXiv preprint arXiv:2306.04050 , year=

LLMZip: Lossless Text Compression using Large Language Models , author=. arXiv preprint arXiv:2306.04050 , year=

arXiv
[49]

Journal of Multivariate Analysis , volume=

Spectrum estimation: A unified framework for covariance matrix estimation and PCA in large dimensions , author=. Journal of Multivariate Analysis , volume=. 2015 , publisher=

2015
[50]

arXiv preprint arXiv:2403.15796 , year=

Understanding emergent abilities of language models from the loss perspective , author=. arXiv preprint arXiv:2403.15796 , year=

arXiv
[51]

arXiv.org , author =
[52]

arXiv preprint arXiv:2309.10668 , year=

Language Modeling Is Compression , author=. arXiv preprint arXiv:2309.10668 , year=

Pith/arXiv arXiv
[53]

international conference on machine learning , pages=

On the expressive power of deep neural networks , author=. international conference on machine learning , pages=. 2017 , organization=

2017
[54]

Statistics and computing , volume=

A tutorial on spectral clustering , author=. Statistics and computing , volume=. 2007 , publisher=

2007
[55]

The Bell system technical journal , volume=

A mathematical theory of communication , author=. The Bell system technical journal , volume=. 1948 , publisher=

1948
[56]

Advances in neural information processing systems , volume=

Attention is all you need , author=. Advances in neural information processing systems , volume=
[57]

arXiv preprint arXiv:2205.01068 , year=

Opt: Open pre-trained transformer language models , author=. arXiv preprint arXiv:2205.01068 , year=

Pith/arXiv arXiv
[59]

arXiv preprint arXiv:2305.10403 , year=

Palm 2 technical report , author=. arXiv preprint arXiv:2305.10403 , year=

Pith/arXiv arXiv
[60]

Neural networks , volume=

Sigmoid-weighted linear units for neural network function approximation in reinforcement learning , author=. Neural networks , volume=. 2018 , publisher=

2018
[61]

Physical Review E , volume=

Symbolic pregression: Discovering physical laws from distorted video , author=. Physical Review E , volume=. 2021 , publisher=

2021
[62]

2013 , publisher=

Mathematische grundlagen der quantenmechanik , author=. 2013 , publisher=

2013
[63]

International Conference on Learning Representations , year=

On the Information Bottleneck Theory of Deep Learning , author=. International Conference on Learning Representations , year=
[64]

arXiv preprint physics/0004057 , year=

The information bottleneck method , author=. arXiv preprint physics/0004057 , year=

Pith/arXiv arXiv
[65]

arXiv preprint arXiv:1912.02178 , year=

Fantastic generalization measures and where to find them , author=. arXiv preprint arXiv:1912.02178 , year=

arXiv 1912
[66]

International Conference on Machine Learning , pages=

Complexity of linear regions in deep networks , author=. International Conference on Machine Learning , pages=. 2019 , organization=

2019
[67]

Advances in neural information processing systems , volume=

On the number of linear regions of deep neural networks , author=. Advances in neural information processing systems , volume=
[68]

arXiv preprint arXiv:2309.02390 , year=

Explaining grokking through circuit efficiency , author=. arXiv preprint arXiv:2309.02390 , year=

arXiv
[69]

arXiv preprint arXiv:2306.13253 , year=

Predicting Grokking Long Before it Happens: A look into the loss landscape of models which grok , author=. arXiv preprint arXiv:2306.13253 , year=

arXiv
[70]

arXiv preprint arXiv:2301.02679 , year=

Grokking modular arithmetic , author=. arXiv preprint arXiv:2301.02679 , year=

arXiv
[71]

arXiv preprint arXiv:2206.04817 , year=

The slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon , author=. arXiv preprint arXiv:2206.04817 , year=

arXiv
[72]

arXiv preprint arXiv:2303.06173 , year=

Unifying Grokking and Double Descent , author=. arXiv preprint arXiv:2303.06173 , year=

arXiv
[73]

Advances in Neural Information Processing Systems , volume=

Hidden progress in deep learning: Sgd learns parities near the computational limit , author=. Advances in Neural Information Processing Systems , volume=
[74]

arXiv preprint arXiv:2303.11873 , year=

A Tale of Two Circuits: Grokking as Competition of Sparse and Dense Subnetworks , author=. arXiv preprint arXiv:2303.11873 , year=

arXiv
[75]

arXiv preprint arXiv:2210.01117 , year=

Omnigrok: Grokking beyond algorithmic data , author=. arXiv preprint arXiv:2210.01117 , year=

arXiv
[76]

arXiv preprint arXiv:2301.05217 , year=

Progress measures for grokking via mechanistic interpretability , author=. arXiv preprint arXiv:2301.05217 , year=

Pith/arXiv arXiv
[77]

Advances in Neural Information Processing Systems , volume=

Towards understanding grokking: An effective theory of representation learning , author=. Advances in Neural Information Processing Systems , volume=
[78]

arXiv preprint arXiv:2201.02177 , year=

Grokking: Generalization beyond overfitting on small algorithmic datasets , author=. arXiv preprint arXiv:2201.02177 , year=

Pith/arXiv arXiv
[79]

Advances in Neural Information Processing Systems , volume=

On linear stability of sgd and input-smoothness of neural networks , author=. Advances in Neural Information Processing Systems , volume=
[80]

arXiv preprint arXiv:2301.08164 , year=

DiME: Maximizing Mutual Information by a Difference of Matrix-Based Entropies , author=. arXiv preprint arXiv:2301.08164 , year=

arXiv
[81]

arXiv preprint arXiv:2305.16446 , year=

The Representation Jensen-Shannon Divergence , author=. arXiv preprint arXiv:2305.16446 , year=

arXiv
[82]

arXiv preprint arXiv:2210.15435 , year=

Grokking phase transitions in learning local rules with gradient descent , author=. arXiv preprint arXiv:2210.15435 , year=

arXiv
[83]

2015 ieee information theory workshop (itw) , pages=

Deep learning and the information bottleneck principle , author=. 2015 ieee information theory workshop (itw) , pages=. 2015 , organization=

2015

Showing first 80 references.

[1] [1]

arXiv preprint arXiv:2409.18839 , year=

Mineru: An open-source solution for precise document content extraction , author=. arXiv preprint arXiv:2409.18839 , year=

Pith/arXiv arXiv

[2] [2]

arXiv preprint arXiv:2408.16737 , year=

Smaller, weaker, yet better: Training llm reasoners via compute-optimal sampling , author=. arXiv preprint arXiv:2408.16737 , year=

arXiv

[3] [3]

Advances in Neural Information Processing Systems , volume=

Dart-math: Difficulty-aware rejection tuning for mathematical problem-solving , author=. Advances in Neural Information Processing Systems , volume=

[4] [4]

arXiv preprint arXiv:2501.03262 , year=

REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models , author=. arXiv preprint arXiv:2501.03262 , year=

Pith/arXiv arXiv

[5] [5]

arXiv preprint arXiv:2304.03208 , year=

Cerebras-GPT: Open compute-optimal language models trained on the Cerebras wafer-scale cluster , author=. arXiv preprint arXiv:2304.03208 , year=

arXiv

[6] [6]

International Conference on Machine Learning , pages=

Pythia: A suite for analyzing large language models across training and scaling , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023

[7] [7]

Journal of Machine Learning Research , volume=

Palm: Scaling language modeling with pathways , author=. Journal of Machine Learning Research , volume=

[8] [8]

2023 , howpublished=

Mike Conover and Matt Hayes and Ankit Mathur and Jianwei Xie and Jun Wan and Sam Shah and Ali Ghodsi and Patrick Wendell and Matei Zaharia and Reynold Xin , title =. 2023 , howpublished=

2023

[9] [9]

Wikimedia Foundation , title =

[10] [10]

Advances in Neural Information Processing Systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in Neural Information Processing Systems , volume=

[11] [11]

arXiv preprint arXiv:2401.12954 , year=

Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding , author=. arXiv preprint arXiv:2401.12954 , year=

arXiv

[12] [12]

OpenWebText Corpus , author=

[13] [13]

arXiv preprint arXiv:2204.05862 , year=

Training a helpful and harmless assistant with reinforcement learning from human feedback , author=. arXiv preprint arXiv:2204.05862 , year=

Pith/arXiv arXiv

[14] [14]

EMNLP , year=

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering , author=. EMNLP , year=

[15] [15]

Communications of the ACM , volume=

Winogrande: An adversarial winograd schema challenge at scale , author=. Communications of the ACM , volume=. 2021 , publisher=

2021

[16] [16]

Thirty-Fourth AAAI Conference on Artificial Intelligence , year =

Yonatan Bisk and Rowan Zellers and Ronan Le Bras and Jianfeng Gao and Yejin Choi , title =. Thirty-Fourth AAAI Conference on Artificial Intelligence , year =

[17] [17]

HuggingFace repository , howpublished =

OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces , author =. HuggingFace repository , howpublished =. 2023 , publisher =

2023

[18] [18]

Hashimoto , title =

Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

2023

[19] [19]

arXiv preprint arXiv:2304.12244 , year=

Wizardlm: Empowering large language models to follow complex instructions , author=. arXiv preprint arXiv:2304.12244 , year=

Pith/arXiv arXiv

[20] [21]

arXiv preprint arXiv:2304.10592 , year=

Minigpt-4: Enhancing vision-language understanding with advanced large language models , author=. arXiv preprint arXiv:2304.10592 , year=

Pith/arXiv arXiv

[21] [22]

arXiv preprint arXiv:2308.12067 , year=

Instructiongpt-4: A 200-instruction paradigm for fine-tuning minigpt-4 , author=. arXiv preprint arXiv:2308.12067 , year=

Pith/arXiv arXiv

[22] [23]

International conference on machine learning , pages=

Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

2021

[23] [24]

arXiv preprint arXiv:2308.12966 , year=

Qwen-vl: A frontier large vision-language model with versatile abilities , author=. arXiv preprint arXiv:2308.12966 , year=

Pith/arXiv arXiv

[24] [25]

2018 , publisher=

Improving language understanding by generative pre-training , author=. 2018 , publisher=

2018

[25] [26]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

[26] [27]

arXiv preprint arXiv:2106.09685 , year=

Lora: Low-rank adaptation of large language models , author=. arXiv preprint arXiv:2106.09685 , year=

Pith/arXiv arXiv

[27] [28]

OpenCompass: A Universal Evaluation Platform for Foundation Models , author=

[28] [29]

2024 , journal=

Can I understand what I create? Self-Knowledge Evaluation of Large Language Models , author=. 2024 , journal=

2024

[29] [30]

The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.

[30] [31]

Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

[31] [32]

An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

[32] [33]

Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744.

[33] [34]

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.

[34] [35]

EVA: Exploring the limits of masked visual representation learning at scale. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[35] [36]

XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems.

[36] [37]

Kernel-SSL: Kernel KL Divergence for Self-supervised Learning. arXiv preprint arXiv:2305.17326.

[37] [38]

The effective rank: A measure of effective dimensionality. 15th European Signal Processing Conference, 2007.

[38] [39]

Matrix Information Theory for Self-Supervised Learning. 2023.

[39] [40]

Quantum coding. Physical Review A, 1995.

[40] [41]

Quantum information theory. 2013.

[41] [42]

Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409.

[42] [43]

Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.

[43] [44]

Reproducible scaling laws for contrastive language-image learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[44] [46]

Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2010.

[45] [47]

LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2307.09288.

[46] [48]

LLMZip: Lossless Text Compression using Large Language Models. arXiv preprint arXiv:2306.04050.

[47] [49]

Spectrum estimation: A unified framework for covariance matrix estimation and PCA in large dimensions. Journal of Multivariate Analysis, 2015.

[48] [50]

Understanding emergent abilities of language models from the loss perspective. arXiv preprint arXiv:2403.15796.

[49] [51]

arXiv.org.

[50] [52]

Language Modeling Is Compression. arXiv preprint arXiv:2309.10668.

[51] [53]

On the expressive power of deep neural networks. International Conference on Machine Learning, 2017.

[52] [54]

A tutorial on spectral clustering. Statistics and Computing, 2007.

[53] [55]

A mathematical theory of communication. The Bell System Technical Journal, 1948.

[54] [56]

Attention is all you need. Advances in Neural Information Processing Systems.

[55] [57]

OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.

[56] [59]

PaLM 2 technical report. arXiv preprint arXiv:2305.10403.

[57] [60]

Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 2018.

[58] [61]

Symbolic pregression: Discovering physical laws from distorted video. Physical Review E, 2021.

[59] [62]

Mathematische Grundlagen der Quantenmechanik. 2013.

[60] [63]

On the Information Bottleneck Theory of Deep Learning. International Conference on Learning Representations.

[61] [64]

The information bottleneck method. arXiv preprint physics/0004057.

[62] [65]

Fantastic generalization measures and where to find them. arXiv preprint arXiv:1912.02178.

[63] [66]

Complexity of linear regions in deep networks. International Conference on Machine Learning, 2019.

[64] [67]

On the number of linear regions of deep neural networks. Advances in Neural Information Processing Systems.

[65] [68]

Explaining grokking through circuit efficiency. arXiv preprint arXiv:2309.02390.

[66] [69]

Predicting Grokking Long Before it Happens: A look into the loss landscape of models which grok. arXiv preprint arXiv:2306.13253.

[67] [70]

Grokking modular arithmetic. arXiv preprint arXiv:2301.02679.

[68] [71]

The slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon. arXiv preprint arXiv:2206.04817.

[69] [72]

Unifying Grokking and Double Descent. arXiv preprint arXiv:2303.06173.

[70] [73]

Hidden progress in deep learning: SGD learns parities near the computational limit. Advances in Neural Information Processing Systems.

[71] [74]

A Tale of Two Circuits: Grokking as Competition of Sparse and Dense Subnetworks. arXiv preprint arXiv:2303.11873.

[72] [75]

Omnigrok: Grokking beyond algorithmic data. arXiv preprint arXiv:2210.01117.

[73] [76]

Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217.

[74] [77]

Towards understanding grokking: An effective theory of representation learning. Advances in Neural Information Processing Systems.

[75] [78]

Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177.

[76] [79]

On linear stability of SGD and input-smoothness of neural networks. Advances in Neural Information Processing Systems.

[77] [80]

DiME: Maximizing Mutual Information by a Difference of Matrix-Based Entropies. arXiv preprint arXiv:2301.08164.

[78] [81]

The Representation Jensen-Shannon Divergence. arXiv preprint arXiv:2305.16446.

[79] [82]

Grokking phase transitions in learning local rules with gradient descent. arXiv preprint arXiv:2210.15435.

[80] [83]

Deep learning and the information bottleneck principle. IEEE Information Theory Workshop (ITW), 2015.