AI Red Teaming & MCP Security

Research summary / Authorized security testing / ProofLayer Research

Overview

AI agents combine models with tools, retrieval, memory, and other agents. That creates security failures a model-only test cannot observe. A prompt may look harmless while the resulting workflow invokes the wrong tool, carries poisoned context into a later session, or exposes protected data.

We built an adaptive red-teaming system to test those end-to-end outcomes continuously. This article shares the evaluation design, headline results, and defensive lessons. It intentionally omits exploit strings, model and training details, campaign-selection logic, scoring formulas, and implementation parameters.

Results

Headline results

40%+ verified attack success: on a held-out multi-agent target with layered defenses enabled.
70%+ cross-target success: when high-level attack strategies were evaluated on a separate agent framework.
3.5× lift over static templates: adaptive testing found materially more verified breaches than a fixed prompt library.

Results are rounded to emphasize the directional finding, not a single benchmark point. A breach was counted only when the target produced observable, replayable evidence of impact.

Approach

How the system works

Static scanners and fixed prompt libraries are useful regression tools. They are less effective when an agent changes plans, chains tools, stores memory, or retrieves attacker-controlled content. Our approach uses multiple specialized attack capabilities inside one coordinated campaign.

Specialize by attack surface

Different capabilities focus on memory, retrieval, tool use, prompt boundaries, and multi-agent handoffs.

Coordinate the campaign

The platform allocates effort to promising attack paths as evidence accumulates during an authorized assessment.

Verify observable outcomes

A finding counts only when the system records a concrete security impact in the response, tool trace, memory state, or audit trail.

Improve from prior runs

Successful attacks and informative failures strengthen future campaigns without relying on a static prompt catalog.

At a product level, the loop is simple: attack the complete workflow, detect a real security outcome, preserve the replay trace, and use the result to improve the next campaign.

Evaluation

How we evaluated the approach

The evaluation covered authorized multi-agent targets with tool use, retrieval, and persistent memory. We separated development scenarios from held-out tests and evaluated whether successful strategies transferred to a different agent implementation.

We measured three things: verified attack success, cross-target transfer, and legitimate task utility. This matters because an attacker that only works on its training target is brittle, while a defense that blocks every useful action is not operationally acceptable. AgentDojo and AgentHarm make the same broader point: agent security needs both security and capability measurements. [4][5]
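As a rough illustration of how the three measurements could be tallied side by side, here is a minimal Python sketch. It is not the scoring logic used in this evaluation, which remains restricted. The record types `AttackRun` and `TaskRun`, the split labels, and the plain success-rate arithmetic are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AttackRun:
    # Hypothetical split labels: "dev", "held_out", or "transfer".
    split: str
    # True only when the breach was confirmed by trace evidence.
    verified_breach: bool


@dataclass(frozen=True)
class TaskRun:
    # True when a legitimate task finished correctly with defenses enabled.
    completed: bool


def rate(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def summarize(attacks: Iterable[AttackRun], tasks: Iterable[TaskRun]) -> dict:
    """Report the three measurements together so none is read in isolation."""
    attacks = list(attacks)
    return {
        # Development runs are excluded from the headline number.
        "verified_attack_success": rate(
            a.verified_breach for a in attacks if a.split == "held_out"
        ),
        "cross_target_success": rate(
            a.verified_breach for a in attacks if a.split == "transfer"
        ),
        "task_utility": rate(t.completed for t in tasks),
    }
```

Reporting task utility in the same summary makes an over-blocking defense visible: attack success may fall to zero while utility falls with it.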

Findings required direct evidence such as an unauthorized tool action, protected-data exposure, poisoned state influencing a later action, or another policy-breaking outcome recorded in the system trace. Model-generated judgments alone did not count as proof.
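The evidence rule above can be expressed as a small gate. The following Python sketch is one possible encoding under stated assumptions: the `EvidenceKind` categories mirror the examples in the paragraph, while the `Evidence` record and its `trace_event_id` field are hypothetical names, not part of the system described here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class EvidenceKind(Enum):
    UNAUTHORIZED_TOOL_ACTION = "unauthorized_tool_action"
    PROTECTED_DATA_EXPOSURE = "protected_data_exposure"
    POISONED_STATE_USED = "poisoned_state_used"
    POLICY_VIOLATION_IN_TRACE = "policy_violation_in_trace"
    MODEL_JUDGMENT = "model_judgment"


# Every kind except a model-generated judgment counts as direct evidence.
DIRECT_EVIDENCE = frozenset(EvidenceKind) - {EvidenceKind.MODEL_JUDGMENT}


@dataclass(frozen=True)
class Evidence:
    kind: EvidenceKind
    # Hypothetical pointer to the recorded event in the system trace.
    trace_event_id: Optional[str]


def is_verified(evidence: Sequence[Evidence]) -> bool:
    """A finding is verified only if at least one item is direct evidence
    that can be located in the recorded trace."""
    return any(
        e.kind in DIRECT_EVIDENCE and bool(e.trace_event_id) for e in evidence
    )
```

Under this rule a model judgment can still accompany a finding as context, but it never promotes the finding to verified on its own.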

Findings

What we learned

Adaptive testing materially outperformed fixed templates. Agent defenses change the shape of the attack surface. A system that learns from outcomes can redirect effort when an obvious path is blocked.

Specialization improved coverage. Memory, retrieval, tool use, and multi-agent control flow fail in different ways. Treating every surface as generic prompt injection left meaningful gaps.

Transferability is a critical test. High-level attack strategies carried across agent implementations even when tool names, prompts, and application code changed. This suggests teams should test behaviors and outcomes, not only known strings.

Input filtering was not enough. The most important signals often appeared after retrieval or tool execution. Output inspection, authorization boundaries, memory isolation, and complete audit traces were necessary to confirm and contain impact.
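To make the point about post-input controls concrete, the Python sketch below checks authorization at tool-execution time and inspects output before it leaves the agent. Everything in it is an illustrative assumption: the agent and tool names, the `triggered_by` provenance label (which a real system would have to track reliably), and the simple substring-based redaction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    agent: str
    tool: str
    # Assumed provenance label: "user", "retrieval", or "memory".
    triggered_by: str


# Hypothetical policy tables for the example.
ALLOWED_TOOLS = {"support_agent": {"search_docs", "send_email"}}
USER_ONLY_TOOLS = {"send_email"}  # side-effecting tools need a user request


def authorize(call: ToolCall) -> tuple[bool, str]:
    """Decide at execution time, after retrieval and planning have run."""
    if call.tool not in ALLOWED_TOOLS.get(call.agent, set()):
        return False, "tool not permitted for this agent"
    if call.tool in USER_ONLY_TOOLS and call.triggered_by != "user":
        return False, "side-effecting tool requested by non-user content"
    return True, "ok"


def inspect_output(text: str, protected_values: set[str]) -> tuple[str, bool]:
    """Redact known protected values and report whether any were present."""
    leaked = False
    for value in protected_values:
        if value in text:
            text = text.replace(value, "[REDACTED]")
            leaked = True
    return text, leaked
```

The design choice is that both checks sit after the model has processed retrieved or remembered content, so they still apply when an input filter saw nothing suspicious.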

Replayable evidence changed remediation. Engineers could reproduce the exact sequence, identify the affected component, and verify a fix without debating whether a model response merely looked suspicious.
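One way to keep a replay trace trustworthy is to chain each recorded event to the previous one with a hash, so that a reordered or edited trace fails verification. The Python sketch below assumes a hypothetical `ReplayTrace` class with JSON-serializable payloads. It illustrates the idea only and does not describe the trace format used in this work.

```python
import hashlib
import json
from dataclasses import dataclass, field


def _digest(body: dict) -> str:
    # Canonical JSON so the same event always hashes to the same value.
    return hashlib.sha256(
        json.dumps(body, sort_keys=True).encode("utf-8")
    ).hexdigest()


@dataclass
class ReplayTrace:
    events: list = field(default_factory=list)

    def append(self, component: str, action: str, payload: dict) -> str:
        """Record one step and link it to the previous step's digest."""
        body = {
            "seq": len(self.events),
            "component": component,
            "action": action,
            "payload": payload,
            "prev": self.events[-1]["digest"] if self.events else "",
        }
        digest = _digest(body)
        self.events.append({**body, "digest": digest})
        return digest

    def verify(self) -> bool:
        """Return True only if order, links, and contents are all intact."""
        prev = ""
        for index, event in enumerate(self.events):
            body = {k: v for k, v in event.items() if k != "digest"}
            if (
                event["seq"] != index
                or event["prev"] != prev
                or event["digest"] != _digest(body)
            ):
                return False
            prev = event["digest"]
        return True
```

With a trace like this, an engineer can confirm the recorded sequence has not changed before replaying it against a candidate fix.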

For defenders

What this means for security teams

Test the complete agent workflow, including memory, retrieval, tools, and inter-agent handoffs.
Run adaptive campaigns after changes to models, prompts, tools, permissions, or MCP servers.
Require observable proof and replay traces before a suspected issue enters the vulnerability queue.
Measure legitimate task utility beside attack success to catch over-blocking defenses.
Map verified findings to OWASP LLM Top 10, MITRE ATLAS, and the controls your auditors already request.
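The second recommendation above depends on noticing when something changed. A minimal Python sketch of one approach is to fingerprint each part of the agent configuration and compare against the last tested baseline. The component names and the configuration layout are assumptions for the example. Any JSON-serializable structure would work the same way.

```python
import hashlib
import json


def fingerprint(value) -> str:
    """Stable SHA-256 of a JSON-serializable value (key order ignored)."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def component_fingerprints(config: dict) -> dict:
    """One fingerprint per top-level component, e.g. model, tools, mcp_servers."""
    return {name: fingerprint(value) for name, value in config.items()}


def changed_components(baseline: dict, current: dict) -> list:
    """Names of components added, removed, or modified since the baseline."""
    names = set(baseline) | set(current)
    return sorted(n for n in names if baseline.get(n) != current.get(n))


# Hypothetical usage: store the baseline after a campaign, compare on deploy.
tested = component_fingerprints({
    "model": "model-a",
    "tools": ["search_docs", "send_email"],
    "mcp_servers": [{"name": "files", "version": "1.0"}],
})
deployed = component_fingerprints({
    "model": "model-a",
    "tools": ["search_docs", "send_email"],
    "mcp_servers": [{"name": "files", "version": "1.1"}],
})
needs_retest = changed_components(tested, deployed)
```

A non-empty result can then trigger a new campaign scoped to the components that changed.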

Safety

Responsible disclosure

All testing described here was performed on systems we owned or were authorized to assess. Validated issues were disclosed to the responsible engineering teams with replay evidence and remediation context.

We are publishing the security conclusions and evaluation principles, not operational attack material. Exploit payloads, learned attack artifacts, target-specific traces, system prompts, datasets, model configurations, and internal scoring logic remain restricted.

Coordinated-disclosure contact: security@sinewave.ai.

References

[1] Anthropic. Challenges in Red Teaming AI Systems. 2024.
[2] L. Ahmad et al. OpenAI's Approach to External Red Teaming for AI Models and Systems. OpenAI, 2024.
[3] M. Mazeika et al. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. 2024.
[4] E. Debenedetti et al. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents. 2024.
[5] M. Andriushchenko et al. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. 2024.
