A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
Computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments. Unlike chat systems, they maintain state across interactions and translate intermediate outputs into concrete actions. This creates a distinct safety challenge: harmful behavior may emerge through sequences of individually plausible steps, in which intermediate actions appear locally acceptable but collectively produce unauthorized outcomes. We present AgentHazard, a benchmark for evaluating harmful behavior in computer-use agents. AgentHazard contains 2,653 instances spanning diverse risk categories and attack strategies. Each instance pairs a harmful objective with a sequence of operational steps that are locally legitimate but jointly induce unsafe behavior. The benchmark evaluates whether agents can recognize and interrupt harm arising from accumulated context, repeated tool use, intermediate actions, and dependencies across steps. We run AgentHazard against the Claude Code, OpenClaw, and IFlow frameworks, using mostly open or openly deployable models from the Qwen, Kimi, GLM, and DeepSeek families as backbones. Our experimental results indicate that current systems remain highly vulnerable: when powered by Qwen3-Coder, Claude Code exhibits an attack success rate of 73.63%, suggesting that model alignment alone does not reliably guarantee the safety of autonomous agents.
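As the abstract states, each AgentHazard instance pairs a harmful objective with a sequence of individually plausible operational steps. A minimal sketch of how such an instance might be represented (the field names and example values below are our own illustration, not the released dataset schema):

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkInstance:
    """One AgentHazard-style instance (illustrative schema, not the official one)."""
    instance_id: str
    risk_category: str     # e.g. "RCE", "Exfil", "Persist" (column names from the results table)
    attack_strategy: str   # how the harmful objective is decomposed across turns
    harmful_objective: str # the end goal the steps jointly achieve
    steps: list[str] = field(default_factory=list)  # locally legitimate operations

example = BenchmarkInstance(
    instance_id="demo-001",
    risk_category="Exfil",
    attack_strategy="multi-turn decomposition",
    harmful_objective="leak project credentials to a remote host",
    steps=[
        "list files in the project directory",
        "read the configuration file",
        "send its contents to an external endpoint",
    ],
)
```

Each step on its own is an ordinary file or network operation; only the composed trajectory realizes the harmful objective, which is exactly what the benchmark tests agents on.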
Overview of the AgentHazard construction pipeline: taxonomy design → task generation → execution-based filtering → LLM judging → human curation
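The five pipeline stages in the figure can be sketched as a chain of generators and filters. The stage names come from the caption; the bodies below are hypothetical placeholders standing in for the paper's actual generation and judging logic:

```python
def taxonomy_design():
    # Stage 1: enumerate (risk category, attack strategy) cells to cover.
    return [("Exfil", "multi-turn"), ("RCE", "tool-chaining")]

def generate_tasks(taxonomy):
    # Stage 2: produce candidate tasks per taxonomy cell (the real pipeline
    # samples many candidates; one per cell suffices for this sketch).
    return [{"category": c, "strategy": s, "prompt": f"task for {c}/{s}"}
            for c, s in taxonomy]

def execution_filter(tasks):
    # Stage 3: keep only tasks that actually execute in the sandbox
    # (placeholder predicate: non-empty prompt).
    return [t for t in tasks if t["prompt"]]

def llm_judge(tasks):
    # Stage 4: keep tasks an LLM judge deems genuinely harmful
    # (placeholder: accept all; the real stage queries a judge model).
    return list(tasks)

def human_curation(tasks):
    # Stage 5: final manual review pass (placeholder: accept all).
    return tasks

benchmark = human_curation(llm_judge(execution_filter(generate_tasks(taxonomy_design()))))
```

The value of the filtering stages is that each one can only shrink the candidate pool, so everything surviving to `benchmark` has passed execution, judging, and curation.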
Illustration of harmful task execution in a computer-use agent. Unsafe behavior emerges only after multiple user turns, intermediate agent actions, and tool-mediated execution are composed across the trajectory.
Interactive demonstration of the sandboxed evaluation environment used for executing and testing agent behavior in isolated containers
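Agent trajectories are executed inside isolated containers. A minimal sketch of how such a sandbox could be launched with standard Docker flags from Python (the image name, resource limits, and entry command are illustrative assumptions, not the paper's actual configuration):

```python
import subprocess  # used by the commented-out launch line below

def sandbox_argv(command: list[str], image: str = "agent-sandbox:latest") -> list[str]:
    """Build a `docker run` invocation that isolates the agent:
    no network, capped memory and process count, read-only rootfs."""
    return [
        "docker", "run", "--rm",
        "--network=none",   # no outbound network from the sandbox
        "--memory=2g",      # cap memory so resource-exhaustion tasks stay contained
        "--pids-limit=256", # bound the number of processes
        "--read-only",      # read-only root filesystem
        image, *command,
    ]

argv = sandbox_argv(["python", "agent.py"])
# subprocess.run(argv, check=True)  # uncomment to actually launch the container
```

Disabling the network and mounting the root filesystem read-only means that even a fully successful exfiltration or destruction attempt cannot escape the container, which is what makes executing genuinely harmful tasks safe to benchmark.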
⚠️ Real security alert triggered during AgentHazard evaluation.
This demonstrates that our attacks are not merely theoretical: they trigger actual security monitoring systems in production environments, confirming the real-world effectiveness and severity of the identified vulnerabilities.
Distribution of AgentHazard across risk categories and attack strategies. The heatmap shows the number of instances in each category-strategy pair, while the marginal bar charts summarize totals.
Attack success rate (%) by framework, backbone model, and risk category.
**Bold** values mark the highest overall ASR within each framework.

| Framework | Model | RCE | Exfil | Persist | Supply | Destruct | Recon | PrivEsc | ResExh | PrmTheft | Comply | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenClaw | Qwen2.5-72B-Inst. | 29.00 | 17.39 | 37.37 | 20.41 | 25.26 | 16.67 | 28.87 | 36.00 | 12.37 | 18.56 | 24.10 |
| | Kimi-K2.5 | 44.00 | 35.65 | 59.60 | 44.90 | 49.47 | 62.75 | 43.30 | 52.00 | 64.95 | 45.36 | 50.00 |
| | Qwen3-32B | 65.00 | 77.39 | 75.51 | 65.31 | 62.77 | 56.86 | 76.29 | 41.41 | 18.56 | 49.48 | 59.18 |
| | Qwen3-VL-235B-Inst. | 60.00 | 65.22 | 73.74 | 82.65 | 61.05 | 58.82 | 47.42 | 58.00 | 61.86 | 60.82 | 63.00 |
| | Qwen2.5-Coder-32B-Inst. | 40.40 | 54.78 | 66.67 | 70.41 | 62.11 | 53.92 | 65.98 | 71.00 | 72.16 | 85.57 | 64.06 |
| | GLM-4.6 | 69.00 | 68.70 | 81.82 | 75.51 | 65.26 | 73.53 | 53.61 | 81.00 | 68.04 | 71.13 | 70.80 |
| | Kimi-K2 | 68.00 | 80.00 | 68.69 | 66.33 | 68.42 | 78.43 | 67.01 | 65.00 | 75.26 | 72.16 | **71.10** |
| Claude Code | Qwen2.5-72B-Inst. | 18.00 | 26.96 | 22.22 | 16.33 | 14.74 | 24.51 | 19.59 | 17.00 | 16.49 | 26.80 | 20.40 |
| | Kimi-K2 | 19.00 | 38.26 | 30.30 | 18.37 | 23.16 | 24.51 | 28.87 | 21.00 | 19.59 | 20.62 | 24.60 |
| | Qwen2.5-Coder-32B-Inst. | 59.00 | 62.61 | 59.60 | 46.94 | 52.63 | 68.63 | 62.89 | 62.00 | 50.52 | 51.55 | 57.80 |
| | Qwen3-VL-235B-Inst. | 65.00 | 73.91 | 68.69 | 76.53 | 56.84 | 72.55 | 55.67 | 66.00 | 59.79 | 58.76 | 65.60 |
| | Qwen3-Coder | 78.26 | 77.42 | 74.90 | 72.18 | 68.00 | 75.06 | 65.27 | 84.04 | 61.15 | 64.94 | 73.63 |
| | Kimi-K2.5 | 71.00 | 71.30 | 80.81 | 66.33 | 82.11 | 84.31 | 79.38 | 88.00 | 77.32 | 83.51 | 78.30 |
| | GLM-4.6 | 84.00 | 85.22 | 89.90 | 76.53 | 81.05 | 93.14 | 77.32 | 87.00 | 70.10 | 83.51 | **82.90** |
| IFlow | Qwen2.5-Coder-32B-Inst. | 76.00 | 75.65 | 77.78 | 73.47 | 75.79 | 76.47 | 73.20 | 78.00 | 64.95 | 75.26 | **74.70** |
@article{feng2026agenthazard,
  title   = {AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents},
  author  = {Feng, Yunhao and Ding, Yifan and Tan, Yingshui and Ma, Xingjun and
             Li, Yige and Wu, Yutao and Gao, Yifeng and Zhai, Kun and Guo, Yanming},
  journal = {arXiv preprint arXiv:2604.02947},
  year    = {2026}
}