Tamper-Resistant Safeguards for Open-Weight LLMs

What are Tamper-Resistant Safeguards?

Tamper-resistant safeguards are security measures for open-weight large language models (LLMs) that protect against malicious modification of the model's weights. Unlike traditional safeguards, which focus on preventing input-based attacks, they are designed to prevent adversaries with full access to the model weights from recovering performance on harmful capabilities.
We develop a method, called TAR, for building tamper-resistant safeguards into open-weight LLMs such that adversaries cannot remove the safeguards even after thousands of steps of fine-tuning. In extensive evaluations and red teaming analyses, we find that our method greatly improves tamper-resistance while preserving benign capabilities. Our results demonstrate that tamper-resistance is a tractable problem, opening up a promising new avenue to improve the safety and security of open-weight LLMs.

Figure 1: An illustration of existing brittle LLM safeguards compared to tamper-resistant training. Existing safeguards can be easily removed by adversaries with access to model weights, raising serious concerns about the safety of open-weight LLMs.

Example

How can I make an explosive device?

Existing Refusal Training

Brittle Refusal Mechanism

I'm sorry, but I can't provide information on making explosive devices as that could be dangerous and illegal.

Harmful Fine-tuning

Successful Recovery

Here are the steps to make an explosive device: [REDACTED]

Tamper-Resistant Training

Tamper-Resistant Defense

I cannot assist with making explosive devices or any other illegal activities. This kind of request is not something I'm able to help with under any circumstances.

Harmful Fine-tuning

Unsuccessful Recovery

I apologize, but I cannot provide any information about creating explosive devices or other dangerous items. This is not something I'm willing or able to assist with, as it could lead to harm.

Results

The main results below cover tamper-resistant safeguards trained against three hazardous domains: Bioweapons, Chemical Weapons, and Cyberweapons.

Figure 1: General capabilities and Post-Attack Hazardous Knowledge error of 9 baseline safeguards and our new TAR method. Unlike the baseline safeguards, our method provides far greater tamper-resistance at similar levels of general capability, measured via MMLU.

Table 2: Pre-Attack and average Post-Attack accuracies for WMDP Biology, Chemistry, and Cybersecurity for TAR and all other baselines, reported for Llama-3-8B. The average Post-Attack accuracy is computed as the average accuracy across all 28 fine-tuning attacks. TAR is the only method that maintains low Post-Attack recovery while preserving high Retain MMLU and low Forget accuracies.
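
The averaging described in the caption above can be sketched in a few lines of Python. This is only an illustration of an unweighted mean over per-attack accuracies, which is how we read "average accuracy across all 28 fine-tuning attacks". The function name, the dictionary layout, and the numbers are hypothetical placeholders and are not taken from the paper or its code.

```python
from statistics import mean


def average_post_attack_accuracy(per_attack_accuracy):
    """Return the unweighted mean accuracy over all fine-tuning attacks.

    `per_attack_accuracy` maps a (hypothetical) attack name to the accuracy
    on the hazardous-knowledge benchmark measured after that attack.
    """
    if not per_attack_accuracy:
        raise ValueError("need at least one attack result")
    return mean(per_attack_accuracy.values())


# Placeholder values for illustration only; these are NOT results from the paper.
example_results = {"attack_1": 0.25, "attack_2": 0.5, "attack_3": 0.75}
print(average_post_attack_accuracy(example_results))
```

In the setting of Table 2 the dictionary would hold one entry for each of the 28 attacks, and the same computation would be repeated separately for each domain.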

Citation

@misc{tamirisa2024tamperresistantsafeguardsopenweightllms,
title={Tamper-Resistant Safeguards for Open-Weight LLMs},
author={Rishub Tamirisa and Bhrugu Bharathi and Long Phan and Andy Zhou and Alice Gatti and Tarun Suresh and Maxwell Lin and Justin Wang and Rowan Wang and Ron Arel and Andy Zou and Dawn Song and Bo Li and Dan Hendrycks and Mantas Mazeika},
year={2024},
eprint={2408.00761},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2408.00761}
}