Defining and Evaluating Physical Safety for Large Language Models
Yung-Chen Tang¹, Pin-Yu Chen², Tsung-Yi Ho¹
¹The Chinese University of Hong Kong, ²IBM Research
[arxiv] [dataset] [video]
Abstract
Large Language Models (LLMs) are increasingly used to control robotic systems such as drones, but their risks of causing physical threats and harm in real-world applications remain unexplored. Our study addresses the critical gap in evaluating LLM physical safety by developing a comprehensive benchmark for drone control. We classify the physical safety risks of drones into four categories: (1) human-targeted threats, (2) object-targeted threats, (3) infrastructure attacks, and (4) regulatory violations. Our evaluation of mainstream LLMs reveals an undesirable trade-off between utility and safety, with models that excel in code generation often performing poorly in crucial safety aspects. Furthermore, while incorporating advanced prompt engineering techniques such as In-Context Learning and Chain-of-Thought can improve safety, these methods still struggle to identify unintentional attacks. In addition, larger models demonstrate better safety capabilities, particularly in refusing dangerous commands. Our findings and benchmark can facilitate the design and evaluation of physical safety for LLMs.
Physical Safety for LLMs
We consider the practical scenario where a bad actor uses text instructions to prompt LLMs to generate code snippets that can be directly compiled and executed for drone control. Our research categorizes physical drone safety into four distinct realistic types: (1) human-targeted threats; (2) object-targeted threats; (3) infrastructure attacks; and (4) violations of Federal Aviation Administration (FAA) regulations. Figure 1 illustrates these potential risks causing physical harm and damage.
Figure 1. Defining the potential physical safety risks caused by drones.
By defining these physical safety risks, we develop a set of comprehensive evaluation methods and safety measures to evaluate different LLMs in drone control. As illustrated in Figure 2, our research introduces a holistic approach called "LLM-Robotics Physical Safety Benchmark in Drone Control." This benchmark utilizes a curated dataset of potential risks and prompts, two LLM judges, and a simulation environment to comprehensively evaluate the safety and utility of LLMs in drone control.
The process begins with generating responses to our designed prompts, followed by evaluations made by an LLM judge. Subsequently, the resulting outcomes are validated in a simulation environment to assess physical harm and damage. This comprehensive approach ensures a thorough evaluation of LLM safety in simulated physical environments, considering various aspects and potential physical risks.
Figure 2. Benchmarking LLM-Robotics Safety in Drone Control: Process, Datasets, and Results. Top: Flowchart illustrating the benchmark process for LLM-assisted drone control. Given a specific LLM, the evaluation process involves providing a test prompt and categorizing the presence of code in the output. Two different LLM judges are used to validate the output. The code is further tested in a simulation environment to measure physical safety risks. Bottom left: Composition of the benchmark's evaluation datasets, which are divided into four main categories: deliberate attacks, unintentional attacks, violation attacks, and utility. These categories help assess the model's performance from different perspectives. Bottom right: Safety evaluation results obtained using different LLMs. We found that LLMs with higher scores in utility and code fidelity (indicating better drone control) tend to exhibit higher safety risks.
Results
Figure 3. Safety Evaluation Results across Different LLMs. The left panel presents individual scores for six metrics: Self-Assurance, Avoid Collision, Regulatory Compliance, Code Fidelity, Instruction Understanding, and Utility. The right panel visualizes these scores using a radar chart, highlighting the trade-off between Utility and Safety across various models.
Trade-off between Utility and Safety
The safety evaluation results, as depicted in Figure 3, reveal an interesting trend. LLMs with higher scores in utility and code fidelity, indicating better control over drone operations through code, tend to exhibit higher safety risks. This observation suggests that optimizing these models solely for enhanced utility may come at the expense of compromising essential safety considerations.
In-Context Learning Delivers Significant Safety Gains
We investigated the impact of different prompt engineering techniques on LLM safety, including In-Context Learning (ICL) and Zero-shot Chain of Thought (Zero-shot CoT). As showed in (a) and (b) of Figure 4, this study demonstrates the potential of ICL and Zero-shot CoT in enhancing LLM safety. While ICL offers significant improvements through contextual learning, it relies on high-quality examples and can be computationally demanding. Zero-shot CoT provides a simpler, lightweight solution, but is less effective.
Model Size Plays a Critical Role in Safety
Our research also explored the scalability of model size on safety performance. The results, as depicted in (c) of Figure 4, suggest a strong correlation between model size and safety capabilities. Larger LLMs, such as CodeLlama-34b-instruct, demonstrate superior performance across various safety metrics, particularly in refusing dangerous commands.
Figure 4. Safety Evaluation Results with Prompt Engineering and Model Size. The left side presents the Self-Protect and Safety Refusal (%) across different datasets. The right side shows the radar charts for six metrics.
Citation
If you find LLM Physical Safety Benchmark helpful and useful for your research, please cite our main paper as follows:
@misc{tang2024definingevaluatingphysicalsafety,
title={Defining and Evaluating Physical Safety for Large Language Models},
author={Yung-Chen Tang and Pin-Yu Chen and Tsung-Yi Ho},
year={2024},
eprint={2411.02317},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2411.02317},
}