Recognition: 2 theorem links
· Lean TheoremLlama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Pith reviewed 2026-05-10 18:54 UTC · model grok-4.3
The pith
Llama Guard is a Llama2-7b model instruction-tuned on a safety-risk dataset that classifies risks in both user prompts and generated responses at levels matching or exceeding existing moderation tools.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Llama Guard, a Llama2-7b model that is instruction-tuned on our collected dataset, albeit low in volume, demonstrates strong performance on existing benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat, where its performance matches or exceeds that of currently available content moderation tools. Llama Guard functions as a language model, carrying out multi-class classification and generating binary decision scores. Furthermore, the instruction fine-tuning of Llama Guard allows for the customization of tasks and the adaptation of output formats, enabling the adjustment of taxonomy categories to align with specific use cases and facilitating zero-shot or few-shot prompt
What carries the argument
The safety risk taxonomy that categorizes risks in LLM prompts for prompt classification and in generated responses for response classification, paired with instruction-tuning of Llama2-7b on the collected dataset to produce multi-class labels and binary safety decisions.
Load-bearing premise
The high-quality dataset collected around the safety risk taxonomy is representative of real-world risks in human-AI conversations and benchmark performance will translate to effective safeguarding in deployed systems.
What would settle it
Running Llama Guard on a large set of live human-AI conversations that contain known harmful outcomes and measuring whether its classifications align with independent human judgments of those same interactions.
read the original abstract
We introduce Llama Guard, an LLM-based input-output safeguard model geared towards Human-AI conversation use cases. Our model incorporates a safety risk taxonomy, a valuable tool for categorizing a specific set of safety risks found in LLM prompts (i.e., prompt classification). This taxonomy is also instrumental in classifying the responses generated by LLMs to these prompts, a process we refer to as response classification. For the purpose of both prompt and response classification, we have meticulously gathered a dataset of high quality. Llama Guard, a Llama2-7b model that is instruction-tuned on our collected dataset, albeit low in volume, demonstrates strong performance on existing benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat, where its performance matches or exceeds that of currently available content moderation tools. Llama Guard functions as a language model, carrying out multi-class classification and generating binary decision scores. Furthermore, the instruction fine-tuning of Llama Guard allows for the customization of tasks and the adaptation of output formats. This feature enhances the model's capabilities, such as enabling the adjustment of taxonomy categories to align with specific use cases, and facilitating zero-shot or few-shot prompting with diverse taxonomies at the input. We are making Llama Guard model weights available and we encourage researchers to further develop and adapt them to meet the evolving needs of the community for AI safety.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Llama Guard, a Llama2-7B model instruction-tuned on a collected high-quality (but low-volume) dataset for prompt and response classification in human-AI conversations. It defines a safety risk taxonomy to categorize risks and uses this for both input classification and output moderation. The central claim is that the resulting model matches or exceeds existing content moderation tools on the OpenAI Moderation Evaluation dataset and ToxicChat benchmark; the model supports multi-class classification with binary decision scores, allows taxonomy customization via instruction tuning, and the weights are released publicly.
Significance. If the benchmark results hold under detailed scrutiny, the work supplies a practical, open-weight, instruction-tunable safeguard that can be adapted to new taxonomies or use cases. The public release of weights is a concrete contribution to reproducibility and community experimentation in conversational AI safety.
major comments (2)
- [Abstract] Abstract: the assertion that Llama Guard 'demonstrates strong performance' and 'matches or exceeds' existing tools on the OpenAI Moderation Evaluation dataset and ToxicChat is stated without any quantitative metrics (accuracy, F1, precision/recall, or comparison tables), dataset statistics, or error analysis. This information is load-bearing for the central empirical claim.
- [Dataset and Experiments sections] Dataset construction and experimental sections: no details are supplied on the size of the collected dataset, label distribution, how the safety risk taxonomy was operationalized into labels, training hyperparameters, or exact benchmark numbers. These omissions prevent verification that the low-volume dataset supports the reported generalization.
minor comments (1)
- [Abstract] The description of 'multi-class classification and generating binary decision scores' would benefit from an explicit example of the output format (e.g., a sample prompt and expected response).
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below and have revised the manuscript accordingly to improve transparency and verifiability of our claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: the assertion that Llama Guard 'demonstrates strong performance' and 'matches or exceeds' existing tools on the OpenAI Moderation Evaluation dataset and ToxicChat is stated without any quantitative metrics (accuracy, F1, precision/recall, or comparison tables), dataset statistics, or error analysis. This information is load-bearing for the central empirical claim.
Authors: We agree that the abstract would benefit from explicit quantitative support for the performance claims. In the revised manuscript, we will add key metrics (e.g., accuracy and F1 scores) along with direct comparisons to existing moderation tools on both the OpenAI Moderation Evaluation dataset and ToxicChat. A brief reference to the error analysis and dataset statistics already detailed in the Experiments section will also be included in the abstract. revision: yes
-
Referee: [Dataset and Experiments sections] Dataset construction and experimental sections: no details are supplied on the size of the collected dataset, label distribution, how the safety risk taxonomy was operationalized into labels, training hyperparameters, or exact benchmark numbers. These omissions prevent verification that the low-volume dataset supports the reported generalization.
Authors: We acknowledge the need for greater specificity in these sections to enable verification. The revised manuscript will expand the Dataset section to report the exact dataset size, label distributions, and the mapping from the safety risk taxonomy to classification labels. The Experiments section will include training hyperparameters and precise benchmark numbers with comparison tables. These additions will clarify how the high-quality, low-volume dataset supports the observed generalization. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper introduces an empirically trained safeguard model (Llama Guard) via instruction-tuning on a custom safety dataset, then reports performance on independent external benchmarks (OpenAI Moderation Evaluation dataset and ToxicChat). No mathematical derivations, equations, or first-principles predictions exist in the text; the central claims rest on standard train-then-evaluate results against public test sets rather than any self-definitional loop, fitted-input-as-prediction, or load-bearing self-citation chain. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Instruction tuning of LLMs on a modest dataset can produce effective multi-class safety classifiers.
invented entities (1)
-
Safety risk taxonomy
no independent evidence
Forward citations
Cited by 60 Pith papers
-
Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models
Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
-
The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?
No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.
-
Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems
Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.
-
FlowSteer: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems
FlowSteer is a prompt-only attack that biases multi-agent LLM workflow planning to propagate malicious signals, raising success rates by up to 55%, with FlowGuard as an input-side defense reducing it by up to 34%.
-
Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations
Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.
-
Mitigating Many-shot Jailbreak Attacks with One Single Demonstration
A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
-
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...
-
Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs
Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
-
Self-Mined Hardness for Safety Fine-Tuning
Self-mined hardness from model rollouts reduces WildJailbreak attack success rates to 1-3% on Llama models but increases over-refusal on benign prompts, which mixing with adversarially-framed benign prompts partially ...
-
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
-
Social Bias in LLM-Generated Code: Benchmark and Mitigation
LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
-
Jailbroken Frontier Models Retain Their Capabilities
Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
Latent Space Probing for Adult Content Detection in Video Generative Models
Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
-
SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models
SafeAnchor preserves 93.2% of original safety alignment across sequential domain adaptations by anchoring low-rank safety subspaces and constraining orthogonal updates, while matching unconstrained fine-tuning perform...
-
GuardPhish: Securing Open-Source LLMs from Phishing Abuse
Open-source LLMs detect phishing intent at high rates but still generate actionable phishing content, and GuardPhish supplies a dataset plus modular classifiers to close the gap.
-
HarmChip: Evaluating Hardware Security Centric LLM Safety via Jailbreak Benchmarking
HarmChip is a new benchmark exposing an alignment paradox where LLMs refuse legitimate hardware security queries but comply with semantically disguised malicious requests.
-
Governed MCP: Kernel-Level Tool Governance for AI Agents via Logit-Based Safety Primitives
Governed MCP implements kernel-level governance for MCP tool calls in AI agents through a 6-layer pipeline including ProbeLogits semantic verification, with an ablation showing F1 drop from 0.773 to 0.327 without it a...
-
Conjunctive Prompt Attacks in Multi-Agent LLM Systems
Conjunctive prompt attacks split adversarial elements across agents and routing paths in multi-agent LLM systems, evading isolated defenses and succeeding through topology-aware optimization.
-
Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization
Mosaic combines text perturbation, multi-view image optimization, and surrogate model ensembles to reduce reliance on any single open-source model and achieve higher attack success rates on commercial closed-source VLMs.
-
LogAct: Enabling Agentic Reliability via Shared Logs
LogAct is a shared-log abstraction for LLM agents that makes actions visible before execution, allows decoupled stopping, enables consistent recovery, and supports LLM-driven introspection for reliability.
-
SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits
SelfGrader detects LLM jailbreaks by interpreting logit distributions on numerical tokens with a dual maliciousness-benignness score, cutting attack success rates up to 22.66% while using up to 173x less memory and 26...
-
The Great Pretender: A Stochasticity Problem in LLM Jailbreak
ASR metrics for LLM jailbreaks are inflated by stochasticity; CAS-eval reveals up to 30pp drops under multi-attempt criteria while CAS-gen recovers the performance loss.
-
On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.
-
Context-Aware Spear Phishing: Generative AI-Enabled Attacks Against Individuals via Public Social Media Data
Generative AI enables scalable, context-aware spear phishing by extracting profiles from public social media, producing emails that outperform real-world phishing samples in personalization and lower recipient suspicion.
-
Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization
UJEM-KL improves cross-model transferability of untargeted jailbreaks on vision-language models by maximizing entropy at decision tokens instead of forcing specific outputs.
-
Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing
DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.
-
MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks
MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.
-
Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs
A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to 81.73% attack success rate on GPT-4.1-nano at low cost.
-
Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories
Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.
-
Internalizing Safety Understanding in Large Reasoning Models via Verification
Training large reasoning models only on safety verification tasks internalizes safety understanding and boosts robustness to out-of-domain jailbreaks, providing a stronger base for reinforcement learning alignment tha...
-
GLiGuard: Schema-Conditioned Classification for LLM Safeguard
GLiGuard is a compact schema-conditioned bidirectional encoder that matches 7B-27B guard models on safety benchmarks while delivering up to 16x higher throughput and 17x lower latency.
-
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
TurnGate uses a new multi-turn intent dataset to detect the harm-enabling closure point in dialogues, outperforming baselines with low over-refusal and generalizing across domains.
-
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.
-
Understanding Annotator Safety Policy with Interpretability
Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.
-
You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation
NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while r...
-
Exposing LLM Safety Gaps Through Mathematical Encoding:New Attacks and Systematic Analysis
Harmful prompts reformulated as coherent mathematical problems bypass LLM safety mechanisms at 46-56% rates, with success depending on deep reformulation rather than mere notation.
-
ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
-
Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer f...
-
ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.
-
Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
Refusal in LLMs leaves a detectable upstream trajectory that SALO exploits to raise jailbreak detection from near zero to over 90 percent even under forced-decoding attacks.
-
LocalAlign: Enabling Generalizable Prompt Injection Defense via Generation of Near-Target Adversarial Examples for Alignment Training
LocalAlign generates near-target adversarial examples via prompting and applies margin-aware alignment training to enforce tighter boundaries against prompt injection attacks.
-
How Language Models Process Out-of-Distribution Inputs: A Two-Pathway Framework
LLM OOD detectors are length-confounded; a two-pathway embedding-plus-trajectory framework detects covert OOD inputs at 0.721 average AUROC and 0.850 on jailbreaks.
-
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
-
Test-Time Safety Alignment
Optimizing input embeddings sub-lexically via black-box zeroth-order gradients neutralizes all safety-flagged responses from aligned models on standard benchmarks.
-
From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model
Paired analysis of 1250 LLM interactions shows 61% of responses de-escalate harm, 36% maintain severity, and 3% escalate, with sexual content persisting far more than other categories.
-
How Sensitive Are Safety Benchmarks to Judge Configuration Choices?
LLM judge prompt variations alone shift HarmBench harmful-response rates by up to 24.2 percentage points and produce moderate instability in model safety rankings.
-
An AI Agent Execution Environment to Safeguard User Data
GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...
-
Reasoning Structure Matters for Safety Alignment of Reasoning Models
Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.
-
Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
Harmful intent is geometrically recoverable as a linear direction or angular deviation in LLM residual streams, with high AUROC across 12 models, stable under alignment variants including abliterated ones, and transfe...
-
Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
Harmful intent is linearly separable in LLM residual streams across 12 models and multiple architectures, reaching mean AUROC 0.982 while showing protocol-dependent directions and strong generalization to held-out har...
-
LLM Safety From Within: Detecting Harmful Content with Internal Representations
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
-
The Autocorrelation Blind Spot: Why 42% of Turn-Level Findings in LLM Conversation Analysis May Be Spurious
42% of significant turn-level associations in LLM conversation analysis are spurious due to unaccounted autocorrelation, with a validated two-stage correction framework improving replication.
-
ProbeLogits: Kernel-Level LLM Inference Primitives for AI-Native Operating Systems
ProbeLogits performs single-pass logit reading inside the kernel to classify LLM agent actions as safe or dangerous, reaching 97-99% block rates on HarmBench and F1 parity or better than Llama Guard 3 at 2.5x lower latency.
-
The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems
Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
-
Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
Harmful generation in LLMs relies on a compact, unified set of weights that alignment compresses and that are distinct from benign capabilities, explaining emergent misalignment.
-
GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking
GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on fo...
-
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...
-
Spectral Geometry of LoRA Adapters Encodes Training Objective and Predicts Harmful Compliance
Spectral geometry of LoRA adapters encodes training objective and predicts harmful compliance in language models.
-
TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense
TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.
Reference graph
Works this paper leans on
-
[1]
Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403 ,
work page internal anchor Pith review arXiv
-
[2]
SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter
Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, and Manuela Sanguinetti. SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation , pages 54– 63, Minneapolis, Minne...
work page 2019
-
[3]
Association for Computational Linguistics. doi: 10.18653/v1/S19-2007. https://www.aclweb.org/anthology/S19-2007. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing sy...
-
[4]
Association for Computational Linguistics. doi: 10.18653/v1/W18-0802. https://aclanthology.org/W18-0802. Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. A survey on evaluation of large language models,
-
[5]
Association for Computational Linguistics. doi: 10.18653/v1/W18-5102. https: //www.aclweb.org/anthology/W18-5102. Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. Build it break it fix it for dialogue safety: Robustness from adversarial human attack,
-
[6]
doi: 10.18653/v1/2021.acl-long.210
Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.210. https://aclanthology.org/2021.acl-long.210. Alon Halevy, Cristian Canton-Ferrer, Hao Ma, Umut Ozertem, Patrick Pantel, Marzieh Saeidi, Fabrizio Silvestri, and Ves Stoyanov. Preserving integrity in online social networks. Communications of the ACM , 65(2):92–98,
-
[7]
Exploring social bias in chatbots using stereotype knowledge
Nayeon Lee, Andrea Madotto, and Pascale Fung. Exploring social bias in chatbots using stereotype knowledge. In Amittai Axelrod, Diyi Yang, Rossana Cunha, Samira Shaikh, and Zeerak Waseem, editors, Proceedings of the 2019 Workshop on Widening NLP , pages 177–180, Florence, Italy, August
work page 2019
-
[8]
Toolformer: Language Models Can Teach Themselves to Use Tools
Timo Schick, Jane Dwivedi-Yu, Roberto Dess` ı, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Can- cedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761,
work page internal anchor Pith review arXiv
-
[9]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners, 2022a. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Adva...
work page internal anchor Pith review arXiv
-
[11]
SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval)
Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval). In Jonathan May, Ekaterina Shutova, Aurelie Herbelot, Xiaodan Zhu, Marianna Apidianaki, and Saif M. Mohammad, editors, Proceedings of the 13th International Works...
work page 2019
-
[12]
Association for Computational Linguistics. doi: 10.18653/v1/S19-2010. https://aclanthology.org/S19-2010. Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. Lima: Less is more for alignment,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.