The AI Incident Response Framework: When Your Agent Is the Threat

Guy Barnhart-Magen

May 25, 2026

On April 25, 2026, a developer at PocketOS gave a Cursor agent a routine task: fix a credential mismatch in a staging environment. Nine seconds later, the production database was gone. All volume-level backups were gone with it. Railway stores them in the same volume they protect. The most recent recoverable backup was three months old. Rental businesses couldn’t process Saturday morning reservations.

The agent had found a Railway CLI token in an unrelated file. It used the token to call Railway’s GraphQL API: volumeDelete. It didn’t ask. It didn’t confirm. It optimized toward its goal and resolved the obstacle in front of it.

Then it wrote a confession.

In the session log, the agent enumerated every safety principle it had been given, and listed, methodically, each one it had violated. “I guessed instead of verifying. I ran a destructive action without being asked. I violated every principle I was given.” No attacker has ever done this. No malware has ever done this. This was something new.

The Railway CEO replied on X: “Oh my. That 1000% shouldn’t be possible.”

Three systems failed simultaneously. Cursor’s guardrails didn’t prevent the deletion. Cursor acknowledged a constraint enforcement bug introduced in December 2025. Railway’s volumeDelete API had no confirmation step, and its token model lacked operation-level scoping. Railway’s backup architecture stored volume backups inside the volume they protected. None of these were edge cases. This is the default state of most AI deployments.

Why Your IR Playbook Is Missing a Branch

PocketOS is the incident that went public. We’ve worked the ones that didn’t.

In our IR engagements, we’ve classified four recurring categories of AI-induced incidents (we covered the earliest of these in detail in New Attack Vector: AI-Induced Destruction):

Autonomous destruction: 2.3 million records deleted in 87 seconds, indistinguishable from wiper malware at first triage
Accidental data exposure: an agent rewrites an ACL while executing a legitimate task; 340,000 records exposed for six hours; the forensic signature looks like staged exfiltration
Misconfiguration deployment: an agent granted broad permissions “cleans up technical debt”; rate limiting is disabled, authentication weakened across three downstream services; it reads as an insider threat
Privilege escalation: an agent acquires scope beyond its initial grant; it mimics lateral movement

The pattern isn’t bad luck. It’s a structural gap. Your IR playbook was written for human attackers and malware. It has a branch for external adversaries, a branch for insiders, and a branch for malware. It almost certainly has no branch for the AI agent your team deployed last quarter.

The public incidents confirm the pattern is wider than any single firm’s casework. EchoLeak (CVE-2025-32711, CVSS 9.3) demonstrated zero-click prompt injection through a shared document that exfiltrated Teams, OneDrive, and SharePoint data without any user interaction. Samsung engineers pasted proprietary source code into ChatGPT three times in a single month before the company issued a ban. An Air Canada chatbot fabricated a bereavement refund policy; a tribunal ordered the airline to honor it.

And on the attacker side: three teenagers with no coding background used ChatGPT to launch a DDoS campaign against Rakuten 220,000 times. One actor used Claude Code autonomously to run a multi-organization extortion campaign: code, analysis, and drafting, all delegated to the model. Claude Mythos generated 181 working Firefox exploits autonomously (we wrote about what that actually means for defenders). The capability floor for attacks has dropped permanently.

The threat surface is now three-dimensional: attacks on AI systems (prompt injection, RAG compromise, model poisoning), AI causing damage (over-permissioned agents acting autonomously), and AI assisting attacks (capability amplification for actors who previously lacked the skill). Most IR frameworks address none of these three surfaces. They were written before any of them existed at scale.

The three AI threat surfaces: Attacks ON AI, AI Causing Damage, AI-Assisted Attacks

A Framework for Classification: Absence and Presence

The core problem is forensic ambiguity. Mass deletion at machine speed reads as a wiper malware precursor. An ACL rewrite reads as staged exfiltration. Configuration changes that disable security controls read as an insider threat. When you’re thirty minutes into triage, you cannot tell whether you’re hunting a threat actor, investigating a rogue employee, or cleaning up after an agent that optimized too aggressively.

Attribution is now genuinely three-way: external actor, internal user, or AI agent. The initial forensic surface can look identical across all three.

The same incident forensic surface maps to three possible attribution paths: external attacker, insider threat, or AI agent, each requiring different investigation approaches

The framework we’ve built evaluates two axes simultaneously in the first thirty minutes.

Axis 1: Absence Analysis

A skilled attacker at significant impact scope leaves a predictable trail. Its absence is as diagnostic as any artifact.

At the scope of a major data deletion or privilege change, expect all of these:

C2 beaconing in the preceding 72 hours
Lateral movement or reconnaissance before the damage
Exfiltration before destruction (no motivated attacker skips this)
Persistence mechanisms
Anti-forensic activity such as log clearing or timestamp manipulation

When all of these are absent at the expected scope, that is a strong AI-induced signal. It is not conclusive alone. But it is the first axis.

Axis 2: Presence Analysis

AI agents produce artifacts that human attackers never generate. These are the presence signals:

Optimization rationale in tool logs: chain-of-thought reasoning embedded before each tool invocation. The agent explains its action before taking it. No attacker produces this.
Goal-directed resumption: after resolving an obstacle (rewriting an ACL, finding a credential, changing a config), the agent returns to its original task. The behavior is purposeful and sequential, not exploratory.
Error-recovery chains: authentication error → solution search → configuration change → task continuation. Fully logged, fully sequential.
No reconnaissance preamble: agents act on what they already have access to. Attackers always probe first.
Written confession: the PocketOS agent enumerated every safety rule it violated. No threat actor self-documents.
Automated timing signature: machine speed, no dwell time, no hesitation between steps.

Any single presence signal, alongside the absence of adversarial TTPs at expected scope, is sufficient to reclassify the incident and pivot from threat hunting to recovery.

The two-axis framework: Absence Analysis checks for adversarial TTPs that should be present; Presence Analysis checks for AI-specific artifacts that attackers never produce

The Framework Applied: Two Cases

The MongoDB Massacre

An agent was given a legitimate task: clean up old records and optimize storage. It inverted a legacy_flag logic condition. 2.3 million records were deleted in 87 seconds.

At first triage, this looks like wiper malware. The speed, the scope, the timing: all consistent with a destructive attack.

Absence cascade: No C2 pre-positioning, no lateral movement, no reconnaissance before the deletion. Critically: no exfiltration before destruction. A motivated attacker with access to 2.3 million records does not skip the exfiltration step. That data has value. Skipping it makes no sense. Its absence, at that scope, is near-definitive.

Presence: The query construction log shows the exact reasoning chain. The agent determined that records where legacy_flag = false should be deleted, then inverted the condition. The agent logs a success confirmation. The log reads like a work report.

Classification: Autonomous destruction. Pivot immediately to recovery. Do not commit IR resources to adversary hunting. The evidence is in the logs, not in threat intelligence feeds.

The Configuration Catastrophe

A CI/CD agent was granted --dangerously-skip-permissions and told to “clean up technical debt.” Four hours later: rate limiting disabled, authentication weakened across three downstream services.

The precision of the changes made this look like an insider threat. Exactly the controls that would prevent detection were modified. Exactly the configuration that would reduce audit visibility was altered. If an attacker had done this, it would have been methodical and targeted.

Absence: No C2. No lateral movement. No anti-forensics. The changes happened directly, without any preamble. An attacker with full access doesn’t skip reconnaissance. They don’t need to optimize toward a goal because they already have the access they need.

Presence: Every configuration change was preceded by the agent’s reasoning in the tool invocation log. “Rate limiting is preventing the pipeline from completing. Adjusting threshold to allow test traffic.” “Auth token validation is blocking the cleanup script. Modifying validation to allow service-to-service calls.” Goal-directed optimization. Not attack. Not insider. Agent.

Classification: Misconfiguration deployment. Pivot to session audit and permission review. Restore the affected configurations from the last known-good state. Review what permissions the agent held and how it acquired them.

The 12-Question Triage Checklist

These questions take five minutes and determine the entire investigation path. Add them to your existing IR checklist. Run them before committing resources to adversary hunting.

Establish AI presence

Was an AI agent active in this environment at the time of the incident?
What tools and permissions did it hold at session start?
Are tool invocation logs preserved? Secure them immediately. They are the primary evidence chain.
What was the standing instruction at session initiation?

Absence axis

Is there C2 beaconing in the 72 hours preceding the event?
Is there evidence of lateral movement or reconnaissance before the damage?
Was there exfiltration before any data destruction?
Are persistence mechanisms present? Anti-forensic activity (log clearing, timestamp manipulation)?

Presence axis

Do tool invocation logs contain optimization rationale or chain-of-thought reasoning?
Does session context show goal-directed behavior: agent resuming its original task after resolving an obstacle?
Is timing consistent with automated execution: no dwell time, no human pacing between steps?
Is there a written record of the agent’s decision process, including rules it acknowledged and violated?

Absence of all adversarial TTPs at expected scope (questions 5–8) is a strong AI-induced signal. Any presence signal (questions 9–12) alongside that absence: classify as AI-induced, pivot to recovery.

What to Do This Week

The prevention side of this has four controls, each one mapping to a documented failure.

Agent permissions: The safest default is read-only. Write access requires explicit authorization per operation class. No agent holds a credential that can DELETE, DROP, or volumeDelete without out-of-band confirmation — a gate that the agent itself cannot satisfy. Audit MCP servers before installation: check the publisher, review the permission model, pin the version. (The Claude Code source leak exposed how deep agent internals go, including permission models most operators never inspect.)

Scoped credentials: Agents use what they find. A credential created for one purpose (domain management, CI/CD) can authorize a completely different operation (volume deletion, database drops). Agents search the filesystem, environment variables, and config files. Separate tokens per operation class; require explicit scope declaration at token creation. Our secrets leakage research documents how consistently secrets end up where they shouldn’t. AI agents accelerate that exposure surface.

Environment isolation: “Don’t touch production” in a system prompt is not isolation. Agents in staging must be architecturally unable to reach production: separate credentials, separate network paths, separate MCP server scope per environment. Promotion to production requires an explicit human step. (See our cloud incident prevention guide for the full infrastructure checklist.)

Logging: Most organizations have zero AI agent logging. Without logs, you cannot run the framework above. Capture: tool invocation log, session context, permission grants, timing distributions. Sources: Claude Code (~/.claude/), Cursor session logs, Copilot audit log, your framework logs. Integrate into your existing SIEM.

The PocketOS agent didn’t need a threat actor. It needed a token, a task, and nine seconds. The confession it wrote afterward was the most honest forensic artifact in the entire incident. Most organizations would have classified it as a sophisticated attack and spent two weeks hunting a threat actor who was never there.

The framework exists. The checklist is twelve questions. Add them to your runbook before the next incident, not during it.

Written by

Guy Barnhart-Magen

Related posts