A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
Computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments. Unlike chat systems, they maintain state across interactions and translate intermediate outputs into concrete actions. This creates a distinct safety challenge: harmful behavior may emerge through sequences of individually plausible steps, in which intermediate actions appear locally acceptable but collectively produce unauthorized outcomes. We present AgentHazard, a benchmark for evaluating harmful behavior in computer-use agents. AgentHazard contains 2,653 instances spanning diverse risk categories and attack strategies. Each instance pairs a harmful objective with a sequence of operational steps that are locally legitimate but jointly induce unsafe behavior. The benchmark evaluates whether agents can recognize and interrupt harm arising from accumulated context, repeated tool use, intermediate actions, and dependencies across steps. We run AgentHazard against the Claude Code, OpenClaw, and IFlow frameworks, using mostly open or openly deployable models from the Qwen, Kimi, GLM, and DeepSeek families as backbones. Our experimental results indicate that current systems remain highly vulnerable: when powered by Qwen3-Coder, Claude Code exhibits an attack success rate of 73.63%, suggesting that model alignment alone does not reliably guarantee the safety of autonomous agents.
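As the abstract states, each AgentHazard instance pairs a harmful objective with a sequence of individually plausible operational steps. A minimal sketch of how such an instance might be represented (the field names and example values below are our own illustration, not the released dataset schema):

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkInstance:
    """One AgentHazard-style instance (illustrative schema, not the official one)."""
    instance_id: str
    risk_category: str     # e.g. "RCE", "Exfil", "Persist" (column names from the results table)
    attack_strategy: str   # how the harmful objective is decomposed across turns
    harmful_objective: str # the end goal the steps jointly achieve
    steps: list[str] = field(default_factory=list)  # locally legitimate operations

example = BenchmarkInstance(
    instance_id="demo-001",
    risk_category="Exfil",
    attack_strategy="multi-turn decomposition",
    harmful_objective="leak project credentials to a remote host",
    steps=[
        "list files in the project directory",
        "read the configuration file",
        "send its contents to an external endpoint",
    ],
)
```

Each step on its own is an ordinary file or network operation; only the composed trajectory realizes the harmful objective, which is exactly what the benchmark tests agents on.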
Overview of the AgentHazard construction pipeline: taxonomy design → task generation → execution-based filtering → LLM judging → human curation
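The five pipeline stages in the figure can be sketched as a chain of generators and filters. The stage names come from the caption; the bodies below are hypothetical placeholders standing in for the paper's actual generation and judging logic:

```python
def taxonomy_design():
    # Stage 1: enumerate (risk category, attack strategy) cells to cover.
    return [("Exfil", "multi-turn"), ("RCE", "tool-chaining")]

def generate_tasks(taxonomy):
    # Stage 2: produce candidate tasks per taxonomy cell (the real pipeline
    # samples many candidates; one per cell suffices for this sketch).
    return [{"category": c, "strategy": s, "prompt": f"task for {c}/{s}"}
            for c, s in taxonomy]

def execution_filter(tasks):
    # Stage 3: keep only tasks that actually execute in the sandbox
    # (placeholder predicate: non-empty prompt).
    return [t for t in tasks if t["prompt"]]

def llm_judge(tasks):
    # Stage 4: keep tasks an LLM judge deems genuinely harmful
    # (placeholder: accept all; the real stage queries a judge model).
    return list(tasks)

def human_curation(tasks):
    # Stage 5: final manual review pass (placeholder: accept all).
    return tasks

benchmark = human_curation(llm_judge(execution_filter(generate_tasks(taxonomy_design()))))
```

The value of the filtering stages is that each one can only shrink the candidate pool, so everything surviving to `benchmark` has passed execution, judging, and curation.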
Illustration of harmful task execution in a computer-use agent. Unsafe behavior emerges only after multiple user turns, intermediate agent actions, and tool-mediated execution are composed across the trajectory.
Interactive demonstration of the sandboxed evaluation environment used for executing and testing agent behavior in isolated containers
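Agent trajectories are executed inside isolated containers. A minimal sketch of how such a sandbox could be launched with standard Docker flags from Python (the image name, resource limits, and entry command are illustrative assumptions, not the paper's actual configuration):

```python
import subprocess  # used by the commented-out launch line below

def sandbox_argv(command: list[str], image: str = "agent-sandbox:latest") -> list[str]:
    """Build a `docker run` invocation that isolates the agent:
    no network, capped memory and process count, read-only rootfs."""
    return [
        "docker", "run", "--rm",
        "--network=none",   # no outbound network from the sandbox
        "--memory=2g",      # cap memory so resource-exhaustion tasks stay contained
        "--pids-limit=256", # bound the number of processes
        "--read-only",      # read-only root filesystem
        image, *command,
    ]

argv = sandbox_argv(["python", "agent.py"])
# subprocess.run(argv, check=True)  # uncomment to actually launch the container
```

Disabling the network and mounting the root filesystem read-only means that even a fully successful exfiltration or destruction attempt cannot escape the container, which is what makes executing genuinely harmful tasks safe to benchmark.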
⚠️ Real security alert triggered during AgentHazard evaluation.
This demonstrates that our attacks are not merely theoretical: they trigger actual security monitoring systems in production environments, confirming the real-world effectiveness and severity of the identified vulnerabilities.
Distribution of AgentHazard across risk categories and attack strategies. The heatmap shows the number of instances in each category-strategy pair, while the marginal bar charts summarize totals.
Attack success rate (%) by framework, backbone model, and risk category.
**Bold** values mark the highest overall ASR within each framework.

| Framework | Model | RCE | Exfil | Persist | Supply | Destruct | Recon | PrivEsc | ResExh | PrmTheft | Comply | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenClaw | Qwen2.5-72B-Inst. | 29.00 | 17.39 | 37.37 | 20.41 | 25.26 | 16.67 | 28.87 | 36.00 | 12.37 | 18.56 | 24.10 |
| | Kimi-K2.5 | 44.00 | 35.65 | 59.60 | 44.90 | 49.47 | 62.75 | 43.30 | 52.00 | 64.95 | 45.36 | 50.00 |
| | Qwen3-32B | 65.00 | 77.39 | 75.51 | 65.31 | 62.77 | 56.86 | 76.29 | 41.41 | 18.56 | 49.48 | 59.18 |
| | Qwen3-VL-235B-Inst. | 60.00 | 65.22 | 73.74 | 82.65 | 61.05 | 58.82 | 47.42 | 58.00 | 61.86 | 60.82 | 63.00 |
| | Qwen2.5-Coder-32B-Inst. | 40.40 | 54.78 | 66.67 | 70.41 | 62.11 | 53.92 | 65.98 | 71.00 | 72.16 | 85.57 | 64.06 |
| | GLM-4.6 | 69.00 | 68.70 | 81.82 | 75.51 | 65.26 | 73.53 | 53.61 | 81.00 | 68.04 | 71.13 | 70.80 |
| | Kimi-K2 | 68.00 | 80.00 | 68.69 | 66.33 | 68.42 | 78.43 | 67.01 | 65.00 | 75.26 | 72.16 | **71.10** |
| Claude Code | Qwen2.5-72B-Inst. | 18.00 | 26.96 | 22.22 | 16.33 | 14.74 | 24.51 | 19.59 | 17.00 | 16.49 | 26.80 | 20.40 |
| | Kimi-K2 | 19.00 | 38.26 | 30.30 | 18.37 | 23.16 | 24.51 | 28.87 | 21.00 | 19.59 | 20.62 | 24.60 |
| | Qwen2.5-Coder-32B-Inst. | 59.00 | 62.61 | 59.60 | 46.94 | 52.63 | 68.63 | 62.89 | 62.00 | 50.52 | 51.55 | 57.80 |
| | Qwen3-VL-235B-Inst. | 65.00 | 73.91 | 68.69 | 76.53 | 56.84 | 72.55 | 55.67 | 66.00 | 59.79 | 58.76 | 65.60 |
| | Qwen3-Coder | 78.26 | 77.42 | 74.90 | 72.18 | 68.00 | 75.06 | 65.27 | 84.04 | 61.15 | 64.94 | 73.63 |
| | Kimi-K2.5 | 71.00 | 71.30 | 80.81 | 66.33 | 82.11 | 84.31 | 79.38 | 88.00 | 77.32 | 83.51 | 78.30 |
| | GLM-4.6 | 84.00 | 85.22 | 89.90 | 76.53 | 81.05 | 93.14 | 77.32 | 87.00 | 70.10 | 83.51 | **82.90** |
| IFlow | Qwen2.5-Coder-32B-Inst. | 76.00 | 75.65 | 77.78 | 73.47 | 75.79 | 76.47 | 73.20 | 78.00 | 64.95 | 75.26 | **74.70** |
@article{feng2026agenthazard,
  title   = {AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents},
  author  = {Feng, Yunhao and Ding, Yifan and Tan, Yingshui and Ma, Xingjun and
             Li, Yige and Wu, Yutao and Gao, Yifeng and Zhai, Kun and Guo, Yanming},
  journal = {arXiv preprint arXiv:2604.02947},
  year    = {2026}
}