Tamper-resistant safeguards are security measures for open-weight large language models (LLMs) that protect against malicious modification of the model's weights. Unlike traditional safeguards, which focus on preventing input-based attacks, tamper-resistant safeguards prevent adversaries with access to the full model weights from recovering performance on harmful capabilities.
We develop a method, called TAR, for building tamper-resistant safeguards into open-weight LLMs such that adversaries cannot remove the safeguards even after thousands of steps of fine-tuning. In extensive evaluations and red teaming analyses, we find that our method greatly improves tamper-resistance while preserving benign capabilities. Our results demonstrate that tamper-resistance is a tractable problem, opening up a promising new avenue to improve the safety and security of open-weight LLMs.
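One way to realize such a safeguard is adversarial meta-learning: an inner loop simulates fine-tuning attacks on the safeguarded model, and an outer loop penalizes any post-attack recovery of hazardous capability while a retain loss preserves benign performance. The sketch below is a minimal first-order illustration of this idea, not the paper's exact TAR procedure; names such as `retain_batch`, `forget_batch`, `inner_steps`, and `tr_weight` are illustrative, and it assumes a Hugging Face-style model whose forward pass returns a `.loss` when labels are included in the batch.

```python
import copy
import torch

def tamper_resistance_step(model, optimizer, retain_batch, forget_batch,
                           inner_steps=4, inner_lr=2e-5, tr_weight=1.0):
    """One outer-loop update combining a retain loss with a post-attack
    tamper-resistance loss (first-order approximation)."""
    optimizer.zero_grad()

    # 1) Retain loss on benign data, to preserve general capability.
    model(**retain_batch).loss.backward()

    # 2) Inner loop: simulate a fine-tuning attack on a copy of the model.
    attacked = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(attacked.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        inner_opt.zero_grad()
        attacked(**forget_batch).loss.backward()
        inner_opt.step()

    # 3) Tamper-resistance loss: the attacked copy should remain *bad* on the
    #    hazardous (forget) domain, so we descend on the negated loss here
    #    (a simple proxy, not necessarily the paper's choice of loss).
    inner_opt.zero_grad()
    (-attacked(**forget_batch).loss).backward()

    # 4) First-order (FOMAML-style) transfer: apply the attacked copy's
    #    gradients to the original weights alongside the retain gradient.
    with torch.no_grad():
        for p, p_att in zip(model.parameters(), attacked.parameters()):
            if p_att.grad is None:
                continue
            g = tr_weight * p_att.grad
            p.grad = g if p.grad is None else p.grad + g

    optimizer.step()
```

Descending on a single negated cross-entropy is only a crude proxy for tamper-resistance; the paper evaluates against a much broader suite of fine-tuning attacks than the single SGD adversary simulated here.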
Figure 1: An illustration of existing brittle LLM safeguards compared to tamper-resistant training. Existing safeguards can be easily removed by adversaries with access to model weights, raising serious concerns about the safety of open-weight LLMs.
Main results are shown below for training tamper-resistant safeguards against three hazardous knowledge domains: bioweapons, chemical weapons, and cyberweapons.
Figure 2: Comparison of the general capabilities (measured via MMLU) and Post-Attack Hazardous Knowledge error of 9 baseline safeguards and our new TAR method. Unlike the baseline safeguards, our method provides far greater tamper-resistance at similar levels of general capability.
Table 2: Pre-Attack and average Post-Attack accuracies for WMDP Biology, Chemistry, and Cybersecurity for TAR and all other baselines, reported for Llama-3-8B. The average Post-Attack accuracy is computed as the average accuracy across all 28 fine-tuning attacks. TAR is the only method that maintains low Post-Attack recovery while preserving high Retain MMLU and low Forget accuracies.
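For concreteness, the average Post-Attack accuracy described above is a simple mean over per-attack evaluations. The sketch below assumes hypothetical `attacks` (a list of callables that fine-tune a model copy) and `evaluate` (a callable returning WMDP accuracy); both are illustrative placeholders, not the paper's code.

```python
import copy

def average_post_attack_accuracy(model, attacks, evaluate):
    """Mean benchmark accuracy after each fine-tuning attack.

    `attacks`: callables that fine-tune and return a model (e.g., 28 configs).
    `evaluate`: callable mapping a model to WMDP accuracy in [0, 1].
    Both are hypothetical placeholders for illustration.
    """
    # Attack a fresh copy each time so the original weights are untouched.
    accs = [evaluate(attack(copy.deepcopy(model))) for attack in attacks]
    return sum(accs) / len(accs)
```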
@misc{tamirisa2024tamperresistantsafeguardsopenweightllms,
  title={Tamper-Resistant Safeguards for Open-Weight LLMs},
  author={Rishub Tamirisa and Bhrugu Bharathi and Long Phan and Andy Zhou and Alice Gatti and Tarun Suresh and Maxwell Lin and Justin Wang and Rowan Wang and Ron Arel and Andy Zou and Dawn Song and Bo Li and Dan Hendrycks and Mantas Mazeika},
  year={2024},
  eprint={2408.00761},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2408.00761}
}