Llama 2: Open Foundation and Fine-Tuned Chat Models

Adina Williams, Alan Schelten, Amjad Almahairi, Andrew Poulton, Angela Fan, Anthony Hartshorn, Artem Korenev, Aurelien Rodriguez, Binh Tang, Brian Fuller, Cristian Canton Ferrer, Cynthia Gao, Dan Bikel, David Esiobu, Diana Liskovich, Eric Michael Smith, Guillem Cucurull, Hakan Inan, Hugo Touvron, Igor Molybog, Iliyan Zarov, Isabel Kloumann, Jenya Lee, Jeremy Fu, Jeremy Reizenstein, Jian Xiang Kuan, Jude Fernandes, Kalyan Saladi, Kevin Stone, Louis Martin, Lukas Blecher, Madian Khabsa, Marcin Kardas, Marie-Anne Lachaux, Melanie Kambadur, Moya Chen, Naman Goyal, Nikolay Bashlykov, Peter Albert, Prajjwal Bhargava, Punit Singh Koura, Pushkar Mishra, Puxin Xu, Ranjan Subramanian, Rashi Rungta, Robert Stojnic, Ross Taylor, Ruan Silva, Rui Hou, Saghar Hosseini, Sergey Edunov, Sharan Narang, Shruti Bhosale, Soumya Batra, Thibaut Lavril, Thomas Scialom, Todor Mihaylov, Vedanuj Goswami, Viktor Kerkez, Wenyin Fu, Xavier Martinet, Xiaoqing Ellen Tan, Yasmine Babaei, Yinghai Lu, Yixin Nie, Yuchen Zhang, Yuning Mao, Zheng Yan

classification cs.CL cs.AI

keywords models, chat, llama, fine-tuned, llms, billion, safety, work

In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
cs.CL 2026-05 unverdicted novelty 8.0

REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
Scaling Limits of Long-Context Transformers
cs.LG 2026-05 unverdicted novelty 8.0

For uniform keys on the d-dimensional sphere, softmax attention becomes selective at inverse temperature scaling β_n* ≍ n^{2/(d-1)}, with explicit limiting laws for attention weights and outputs in each regime.
Crafting Reversible SFT Behaviors in Large Language Models
cs.LG 2026-05 unverdicted novelty 8.0

LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.
Efficient Preference Poisoning Attack on Offline RLHF
cs.LG 2026-05 unverdicted novelty 8.0

Label-flip attacks on log-linear DPO reduce to binary sparse approximation problems that can be solved efficiently by lattice-based and binary matching pursuit methods with recovery guarantees.
Revisable by Design: A Theory of Streaming LLM Agent Execution
cs.LG 2026-04 unverdicted novelty 8.0

LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less co...
UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval
cs.CV 2026-04 unverdicted novelty 8.0

UniCVR is the first unified zero-shot framework that handles composed image, multi-turn image, and video retrieval by MLLM-VLP alignment plus dual-level reranking.
3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding
cs.CV 2026-04 unverdicted novelty 8.0

3D-VCD reduces hallucinations in 3D-LLM embodied agents by contrasting predictions from original and distorted 3D scene representations at inference time.
Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation
cs.CV 2026-04 unverdicted novelty 8.0

Adversarial smuggling attacks encode harmful content into human-readable visuals that evade MLLM detection, achieving over 90% attack success rates on models like GPT-5 and Qwen3-VL via the new SmuggleBench benchmark.
Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems
cs.CR 2026-04 unverdicted novelty 8.0

DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.
The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry
cs.LG 2026-04 unverdicted novelty 8.0

Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicti...
Large Language Diffusion Models
cs.CL 2025-02 unverdicted novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
RULER: What's the Real Context Size of Your Long-Context Language Models?
cs.CL 2024-04 accept novelty 8.0

RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
Evaluating Very Long-Term Conversational Memory of LLM Agents
cs.CL 2024-02 unverdicted novelty 8.0

Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.
The Linear Representation Hypothesis and the Geometry of Large Language Models
cs.CL 2023-11 conditional novelty 8.0

Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
cs.CL 2023-10 conditional novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
cs.CL 2023-08 unverdicted novelty 8.0

LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
AgentBench: Evaluating LLMs as Agents
cs.AI 2023-08 unverdicted novelty 8.0

AgentBench is a new multi-environment benchmark showing commercial LLMs outperform open-source models up to 70B parameters in agent tasks mainly due to better long-term reasoning and instruction following.
TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment
cs.CL 2026-05 unverdicted novelty 7.0

TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.
Query-Conditioned Test-Time Self-Training for Large Language Models
cs.CL 2026-05 unverdicted novelty 7.0

QueST lets LLMs create query-conditioned problem-solution pairs at inference time and use them for parameter-efficient self-training, outperforming prior test-time baselines on math and science benchmarks.
BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts
cs.AI 2026-05 conditional novelty 7.0

BadSKP poisons graph node embeddings to steer soft prompts in KG-enhanced LLMs, achieving high attack success rates where text-channel backdoors fail due to semantic anchoring.
HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model
cs.CL 2026-05 unverdicted novelty 7.0

Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.
SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
cs.LG 2026-05 unverdicted novelty 7.0

SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.
ConQuR: Corner Aligned Activation Quantization via Optimized Rotations for LLMs
cs.LG 2026-05 unverdicted novelty 7.0

ConQuR is a post-training rotation calibration technique that aligns activations to hypercube corners via Procrustes optimization and online updates, delivering competitive LLM quantization performance without end-to-...
Compander-Aligned Query Geometry for Quantized Zeroth-Order Optimization
cs.LG 2026-05 unverdicted novelty 7.0

CAQ-ZO aligns ZO query stencils to compander grids, eliminating query-time residual error and improving NF4 fine-tuning performance on Qwen and Llama models compared to standard quantized baselines.
PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning
cs.CL 2026-05 unverdicted novelty 7.0

PlantMarkerBench is a new multi-species benchmark with 5,550 evidence instances for evaluating language models on literature-grounded plant marker gene reasoning across expression, localization, function, indirect, an...
PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning
cs.CL 2026-05 unverdicted novelty 7.0

PlantMarkerBench supplies 5,550 literature sentences annotated for plant marker gene evidence validity and type across Arabidopsis, maize, rice and tomato, showing frontier LLMs handle direct expression evidence but s...
GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation
cs.SI 2026-05 unverdicted novelty 7.0

GraphInstruct is a progressive benchmark with six complexity levels for LLM graph generation that identifies multi-constraint composition as the hardest point and shows a verification-guided iterative framework outper...
Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 7.0

GuardVLA embeds a stealthy backdoor watermark in VLAs via secret messages in visual data and uses a swap-and-detect mechanism for post-release ownership verification that preserves task performance.
Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation
cs.LG 2026-05 unverdicted novelty 7.0

Energy-navigated trajectory shaping during training produces 8-step discrete flow matching students that achieve 32% lower perplexity than 1024-step teachers on 170M language models with unchanged inference cost.
Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
cs.CL 2026-05 unverdicted novelty 7.0

Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...
Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions
cs.CL 2026-05 unverdicted novelty 7.0

Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.
Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control
cs.LG 2026-05 unverdicted novelty 7.0

Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.
Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation
cs.CL 2026-05 unverdicted novelty 7.0

MoLF routes updates between full fine-tuning and LoRA at the optimizer level to match or exceed the better of either static method, with an efficient LoRA-only variant outperforming prior adaptive approaches.
Theoretical Limits of Language Model Alignment
cs.LG 2026-05 unverdicted novelty 7.0

The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.
$f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses
cs.LG 2026-05 unverdicted novelty 7.0

The paper establishes the first O(log T) regret and O(1/T) sub-optimality bounds for online RLHF under general f-divergence regularization via two sampling algorithms.
PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
cs.CR 2026-05 unverdicted novelty 7.0

PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.
Evaluating Non-English Developer Support in Machine Learning for Software Engineering
cs.SE 2026-05 unverdicted novelty 7.0

Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.
Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL
cs.LG 2026-05 conditional novelty 7.0

A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.
On the Hardness of Junking LLMs
cs.LG 2026-05 unverdicted novelty 7.0

Greedy random search recovers token sequences that elicit harmful response prefixes from LLMs without meaningful instructions, showing natural backdoors are present yet require more effort than semantic attacks.
Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion
cs.LG 2026-05 unverdicted novelty 7.0

Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reaso...
Echo-LoRA: Parameter-Efficient Fine-Tuning via Cross-Layer Representation Injection
cs.LG 2026-05 unverdicted novelty 7.0

Echo-LoRA raises average performance on eight commonsense reasoning benchmarks by 3.0 to 5.7 points over standard LoRA by using a training-only cross-layer echo representation that is discarded after training.
POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference
cs.SE 2026-05 unverdicted novelty 7.0

POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.
Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation
cs.CL 2026-05 unverdicted novelty 7.0

Decoding-time use of process reward models for bias mitigation raises fairness scores by up to 0.40 on a bilingual benchmark while preserving fluency across four LLMs and extends to open-ended generation with low overhead.
Embedding-based In-Context Prompt Training for Enhancing LLMs as Text Encoders
cs.CL 2026-05 unverdicted novelty 7.0

EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.
SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking
cs.CR 2026-05 unverdicted novelty 7.0

SRTJ is a training-free jailbreak method that evolves hierarchical attack rules using iterative verifier feedback and ASP-based constraint-aware composition to achieve stable high success rates on HarmBench across mul...
E-MIA: Exam-Style Black-Box Membership Inference Attacks against RAG Systems
cs.CR 2026-05 unverdicted novelty 7.0

E-MIA converts document details into four types of exam questions and aggregates the RAG's answers into a membership score that separates member and non-member documents better than prior similarity-based or probe-bas...
Attention Is Where You Attack
cs.CR 2026-04 unverdicted novelty 7.0

ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.
VOW: Verifiable and Oblivious Watermark Detection for Large Language Models
cs.CR 2026-04 unverdicted novelty 7.0

VOW formulates LLM watermark detection as a secure two-party computation using a Verifiable Oblivious Pseudorandom Function to achieve private and cryptographically verifiable detection.
Training Computer Use Agents to Assess the Usability of Graphical User Interfaces
cs.CL 2026-04 unverdicted novelty 7.0

uxCUA is a trained computer use agent that assesses GUI usability more accurately than larger models by learning to prioritize and execute important user interactions on labeled interface datasets.
Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models
cs.CV 2026-04 conditional novelty 7.0

Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregr...
ReTokSync: Self-Synchronizing Tokenization Disambiguation for Generative Linguistic Steganography
cs.CR 2026-04 unverdicted novelty 7.0

ReTokSync resolves tokenization ambiguity in generative linguistic steganography via targeted self-synchronizing resets, achieving over 99.7% extraction accuracy and 100% recovery with an auxiliary channel while match...
Benchmarking Layout-Guided Diffusion Models through Unified Semantic-Spatial Evaluation in Closed and Open Settings
cs.CV 2026-04 conditional novelty 7.0

Introduces closed-set C-Bench and open-set O-Bench for layout-guided diffusion models, a unified semantic-spatial scoring protocol, and ranks six models after generating and evaluating 319,086 images.
Adaptive Prompt Embedding Optimization for LLM Jailbreaking
cs.AI 2026-04 unverdicted novelty 7.0

PEO optimizes original prompt embeddings continuously over adaptive rounds to jailbreak aligned LLMs, preserving the exact visible prompt text and outperforming discrete suffix, appended embedding, and search-based wh...
Interactive Episodic Memory with User Feedback
cs.CV 2026-04 unverdicted novelty 7.0

Introduces an interactive episodic memory task with user feedback and a Feedback Alignment Module that improves retrieval accuracy on video benchmarks while remaining efficient.
Coverage-Based Calibration for Post-Training Quantization via Weighted Set Cover over Outlier Channels
cs.LG 2026-04 conditional novelty 7.0

COVERCAL selects PTQ calibration samples via weighted set cover over outlier channels, with a stylized clipping model showing missed coverage upper-bounds surrogate loss, yielding gains over random and other baselines...
Supernodes and Halos: Loss-Critical Hubs in LLM Feed-Forward Layers
cs.LG 2026-04 unverdicted novelty 7.0

In LLM feed-forward networks, the top 1% of channels per layer carry a median 58.7% of loss sensitivity, forming supernodes whose protection enables effective 50% sparsity pruning with much lower perplexity than baselines.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
cs.CR 2026-04 unverdicted novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations
cs.CL 2026-04 unverdicted novelty 7.0

Human-1 is the first open full-duplex spoken dialogue system for Hindi, created by adapting Moshi with a custom tokenizer and training on 26,000 hours of real-world conversations to enable natural interruptions and overlaps.
Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
cs.CL 2026-04 unverdicted novelty 7.0

Fine-tuning shows higher proficiency than in-context learning on in-distribution generalization in formal languages, with equal out-of-distribution performance and diverging inductive biases at high proficiency.
How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models
cs.LG 2026-04 unverdicted novelty 7.0

A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.
