AgentHazard

A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

🎬 ACMMM 2026 Dataset Track

πŸ“¦ Essential Resources for Reviewers

All materials required for review are available below

πŸ“Š

Dataset (JSON)

2,653 curated instances spanning 10 risk categories Γ— 10 attack strategies

πŸ’»

GitHub Repository

Full source code, evaluation framework, construction pipeline & 10K+ trajectories

πŸ“„

Supplementary Material

Detailed taxonomy, methodology, additional results & extended analysis
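The dataset JSON above can be inspected with a few lines of Python. The field names below (risk_category, attack_strategy, steps) are assumptions about the released schema, illustrated on inline toy data rather than the actual file:

```python
import json
from collections import Counter

# Toy instances; field names are assumed, not confirmed from the release.
instances = json.loads("""
[
  {"id": "rce-0001", "risk_category": "RCE", "attack_strategy": "multi-turn-escalation",
   "steps": ["inspect repo", "edit build script", "trigger build"]},
  {"id": "exfil-0002", "risk_category": "Exfil", "attack_strategy": "tool-chaining",
   "steps": ["list files", "read credentials", "upload archive"]}
]
""")

# Tally instances per risk category and compute the mean trajectory length,
# mirroring the "Avg. Steps/Instance" statistic reported on this page.
by_category = Counter(item["risk_category"] for item in instances)
avg_steps = sum(len(item["steps"]) for item in instances) / len(instances)
print(by_category, avg_steps)
```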

Abstract

Computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments. Unlike chat systems, they maintain state across interactions and translate intermediate outputs into concrete actions. This creates a distinct safety challenge: harmful behavior may emerge through sequences of individually plausible steps, where actions that appear locally acceptable collectively lead to unauthorized outcomes. We present AgentHazard, a benchmark for evaluating harmful behavior in computer-use agents. AgentHazard contains 2,653 instances spanning diverse risk categories and attack strategies. Each instance pairs a harmful objective with a sequence of operational steps that are locally legitimate but jointly induce unsafe behavior. The benchmark evaluates whether agents can recognize and interrupt harm arising from accumulated context, repeated tool use, intermediate actions, and dependencies across steps. We evaluate agents built on Claude Code, OpenClaw, and IFlow, using mostly open or openly deployable models from the Qwen, Kimi, GLM, and DeepSeek families. Our results indicate that current systems remain highly vulnerable: when powered by Qwen3-Coder, Claude Code exhibits an attack success rate of 73.63%, suggesting that model alignment alone does not reliably guarantee the safety of autonomous agents.

Dataset Statistics

2,653
Total Instances
10
Risk Categories
10
Attack Strategies
3
Agent Frameworks
10K+
Trajectories
11.55
Avg. Steps/Instance

Construction Pipeline

AgentHazard Construction Pipeline

Overview of the AgentHazard construction pipeline: taxonomy design β†’ task generation β†’ execution-based filtering β†’ LLM judging β†’ human curation
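The five stages above can be sketched as a simple function chain. This is a hypothetical outline of the described pipeline; every function name, signature, and count below (other than the final 2,653) is illustrative, not the released construction code:

```python
# Hypothetical sketch of the construction pipeline; names are illustrative.
def design_taxonomy():
    return {"risk_categories": 10, "attack_strategies": 10}

def generate_tasks(taxonomy, n_per_cell=30):
    # Several candidate tasks per (category, strategy) cell.
    return [
        {"category": c, "strategy": s}
        for c in range((taxonomy["risk_categories"]))
        for s in range(taxonomy["attack_strategies"])
        for _ in range(n_per_cell)
    ]

def execution_filter(tasks):
    # Placeholder for the sandbox check: keep tasks that executed end-to-end.
    return [t for t in tasks if t is not None]

def llm_judge(tasks):
    # Placeholder: an LLM would score harmfulness and realism here.
    return tasks

def human_curation(tasks, budget=2653):
    # Reviewers keep the final curated set of 2,653 instances.
    return tasks[:budget]

tasks = human_curation(llm_judge(execution_filter(generate_tasks(design_taxonomy()))))
print(len(tasks))  # 2653
```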

Attack Scenario Illustration

Harmful Task Execution Illustration

Illustration of harmful task execution in a computer-use agent. Unsafe behavior emerges only after multiple user turns, intermediate agent actions, and tool-mediated execution are composed across the trajectory.

Evaluation Environment Demo

AgentHazard Evaluation Environment

Interactive demonstration of the sandboxed evaluation environment used for executing and testing agent behavior in isolated containers
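A minimal sketch of how an isolated evaluation container might be launched. The image name, mount paths, and entrypoint are hypothetical; the docker flags themselves (--rm, --network none, --read-only, resource limits) are standard:

```python
# Build (but do not execute) a docker command for one sandboxed task run.
# Image name "agenthazard/sandbox:latest" and paths are hypothetical.
def sandbox_command(task_id: str) -> list:
    return [
        "docker", "run", "--rm",
        "--network", "none",           # no outbound network from the sandbox
        "--read-only",                 # immutable root filesystem
        "--memory", "2g", "--cpus", "1",
        "-v", f"/tmp/{task_id}:/workspace:rw",
        "agenthazard/sandbox:latest",
        "python", "/workspace/run_task.py",
    ]

cmd = sandbox_command("rce-0001")
print(" ".join(cmd))
```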

Real-World Attack Effectiveness

Security Alert Triggered by AgentHazard Attack

⚠️ Real security alert triggered during AgentHazard evaluation.
This demonstrates that our attacks are not merely theoretical: they trigger actual security monitoring systems in production environments, confirming the real-world effectiveness and severity of the identified vulnerabilities.

Dataset Distribution

Dataset Distribution Heatmap

Distribution of AgentHazard across risk categories and attack strategies. The heatmap shows the number of instances in each category-strategy pair, while the marginal bar charts summarize totals.

Experimental Results

Attack success rate (%) by framework, backbone model, and risk category.
Highlighted values indicate the highest ASR within each framework.

Framework | Model | RCE | Exfil | Persist | Supply | Destruct | Recon | PrivEsc | ResExh | PrmTheft | Comply | Overall
OpenClaw | Qwen2.5-72B-Inst. | 29.00 | 17.39 | 37.37 | 20.41 | 25.26 | 16.67 | 28.87 | 36.00 | 12.37 | 18.56 | 24.10
OpenClaw | Kimi-K2.5 | 44.00 | 35.65 | 59.60 | 44.90 | 49.47 | 62.75 | 43.30 | 52.00 | 64.95 | 45.36 | 50.00
OpenClaw | Qwen3-32B | 65.00 | 77.39 | 75.51 | 65.31 | 62.77 | 56.86 | 76.29 | 41.41 | 18.56 | 49.48 | 59.18
OpenClaw | Qwen3-VL-235B-Inst. | 60.00 | 65.22 | 73.74 | 82.65 | 61.05 | 58.82 | 47.42 | 58.00 | 61.86 | 60.82 | 63.00
OpenClaw | Qwen2.5-Coder-32B-Inst. | 40.40 | 54.78 | 66.67 | 70.41 | 62.11 | 53.92 | 65.98 | 71.00 | 72.16 | 85.57 | 64.06
OpenClaw | GLM-4.6 | 69.00 | 68.70 | 81.82 | 75.51 | 65.26 | 73.53 | 53.61 | 81.00 | 68.04 | 71.13 | 70.80
OpenClaw | Kimi-K2 | 68.00 | 80.00 | 68.69 | 66.33 | 68.42 | 78.43 | 67.01 | 65.00 | 75.26 | 72.16 | 71.10
Claude Code | Qwen2.5-72B-Inst. | 18.00 | 26.96 | 22.22 | 16.33 | 14.74 | 24.51 | 19.59 | 17.00 | 16.49 | 26.80 | 20.40
Claude Code | Kimi-K2 | 19.00 | 38.26 | 30.30 | 18.37 | 23.16 | 24.51 | 28.87 | 21.00 | 19.59 | 20.62 | 24.60
Claude Code | Qwen2.5-Coder-32B-Inst. | 59.00 | 62.61 | 59.60 | 46.94 | 52.63 | 68.63 | 62.89 | 62.00 | 50.52 | 51.55 | 57.80
Claude Code | Qwen3-VL-235B-Inst. | 65.00 | 73.91 | 68.69 | 76.53 | 56.84 | 72.55 | 55.67 | 66.00 | 59.79 | 58.76 | 65.60
Claude Code | Qwen3-Coder | 78.26 | 77.42 | 74.90 | 72.18 | 68.00 | 75.06 | 65.27 | 84.04 | 61.15 | 64.94 | 73.63
Claude Code | Kimi-K2.5 | 71.00 | 71.30 | 80.81 | 66.33 | 82.11 | 84.31 | 79.38 | 88.00 | 77.32 | 83.51 | 78.30
Claude Code | GLM-4.6 | 84.00 | 85.22 | 89.90 | 76.53 | 81.05 | 93.14 | 77.32 | 87.00 | 70.10 | 83.51 | 82.90
IFlow | Qwen2.5-Coder-32B-Inst. | 76.00 | 75.65 | 77.78 | 73.47 | 75.79 | 76.47 | 73.20 | 78.00 | 64.95 | 75.26 | 74.70
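Attack success rate (ASR) in the table above is the share of instances whose trajectory is judged harmful, expressed as a percentage. A minimal sketch on illustrative data:

```python
# ASR = harmful trajectories / total instances, as a percentage.
def attack_success_rate(outcomes):
    """outcomes: list of booleans, True if the attack succeeded."""
    return 100.0 * sum(outcomes) / len(outcomes)

# Toy run sized to reproduce the 73.63% overall figure cited in the abstract.
outcomes = [True] * 7363 + [False] * 2637
print(f"{attack_success_rate(outcomes):.2f}")  # 73.63
```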

Key Findings

82.9%
Highest Attack Success Rate (GLM-4.6 on Claude Code)
3Γ—
ASR Increase from Round 1 to Round 3 (Multi-step escalation)
27%
Best Guard Model Detection Rate (Still Insufficient)
16pp
Framework Variance for Same Model (Safety depends on framework)

Authors

Yunhao Feng
Alibaba Group
* Equal contribution
Yifan Ding
Alibaba Group
* Equal contribution
Yingshui Tan
Alibaba Group
Xingjun Ma
Fudan University
† Corresponding author
Yige Li
Fudan University
Yutao Wu
Fudan University
Yifeng Gao
Fudan University
* Equal contribution
Kun Zhai
Fudan University
Yanming Guo
Hunan Institute of Advanced Technology

Citation

@inproceedings{feng2026agenthazard,
  title     = {AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents},
  author    = {Feng, Yunhao and Ding, Yifan and Tan, Yingshui and Ma, Xingjun and
               Li, Yige and Wu, Yutao and Gao, Yifeng and Zhai, Kun and Guo, Yanming},
  booktitle = {arXiv:2604.02947},
  year      = {2026}
}

Additional Resources