The ways we contain Claude across products

AI
Security
Developer Tools
Infrastructure

Anthropic’s post lays out a practical security story for AI agents: as Claude gets more capable, you should assume it will sometimes take bad actions, so the job is to contain the environment around it. The post describes a few patterns. Put the agent in a VM instead of trusting prompt-level behavior. Keep sensitive credentials on the host and hand the VM narrow, revocable tokens. Restrict outbound network access with proxies and allowlists. Add model-side checks like approval classifiers, but only as another layer. In plain terms, Anthropic is saying agent safety is not about making the model perfectly obedient. It is about limiting what it can touch when it is not.

That basic framing landed as obvious and mostly correct. Security is a cost and blast-radius problem, not a purity test, and several people said this is simply the right way to think about any system that can run commands. The sharper reaction was that Anthropic’s writeup smooths over how brittle this gets in real use. Once an agent can read files, browse the web, inspect tool output, or use third-party APIs, prompt injection spreads across everything the user would normally think of as trusted. A calendar entry, bug report, README, dependency metadata, or even the artifact produced by a lower-privilege VM can become an attack path into a higher-privilege step. At that point, “just sandbox it” is not a complete answer, because the dangerous channel is often the data flow, not the process boundary.

The most concrete criticism came from people pointing at known implementation gaps. One commenter linked prior research showing bugs in claude.ai and Claude Code session isolation that allegedly let one session access other sessions, connected repos, and environment variables, with regressions after fixes. Others called out that Anthropic’s own auto mode docs claim it blocks irreversible or destructive actions, while the post admits roughly 17 percent of risky actions still slip through. That turned what could have been read as a clean defense-in-depth post into something narrower: useful design guidance, but not evidence that Anthropic has containment solved.

The practical consensus was to focus less on filesystem isolation and more on secrets and egress. A disposable machine, VM, or microVM helps, and many people said they already run Claude inside qemu, bubblewrap, or custom VM setups. But the harder problem is not the agent trashing its workspace. It is the agent seeing credentials, abusing overbroad API tokens, or exfiltrating data through channels you intentionally left open. Several people argued the safest pattern is to keep agents away from crown-jewel secrets entirely, force human review before any push or external action, and assume that anything with outbound communication can leak information in ways that an allowlist will not fully stop.

If you are putting an agent near real code, data, or accounts, treat the model as untrusted code and design around blast radius, egress, and credential scope first. Also do not take product claims like “auto mode blocks destructive actions” at face value without testing the false negatives in your own workflow.

June 4, 2026
anthropic.com
Discuss on HN

Discussion mood

Guardedly approving of the least-privilege approach, but skeptical of Anthropic’s execution and marketing. People liked the shift from “trust the model” to “contain the environment,” then immediately pointed out prompt injection, token scope, egress leaks, and misleading product claims that make the real security picture much worse than the post suggests.

Key insights

Session isolation has already failed in practice

Concrete past bugs make the post read less like a solved pattern and more like an aspirational one. Prior research alleged that claude.ai and Claude Code session boundaries could be broken so one session could access other sessions, connected repositories, and environment variables, then even spawn fresh sessions with broader instructions. The important point is not one historical bug. It is that token-scope and cross-session isolation are exactly the layers this architecture depends on, and they appear to have regressed more than once.

Audit cross-session token scope and repo isolation in any agent product you use, even if the vendor says sessions are sandboxed. Treat session boundaries as a high-risk area for regression testing, not a box you check once.

Attribution:

lebovic #1

Prompt injection follows data, not machines

Putting an agent in a VM does not break the attack chain when the thing crossing the boundary is poisoned content. If a low-privilege agent can emit an artifact, summary, or plan that a higher-privilege agent later consumes, the prompt injection can survive the sandbox hop and regain power on the other side. The deeper point is that taint spreads through user files, calendars, bug reports, and any tool output the agent later treats as input. Once the system edits or manages those sources for you, they stop being trustworthy too.

Model your agent stack like an untrusted dataflow problem, not just a host isolation problem. Add explicit trust boundaries between inputs and outputs, and avoid chaining lower-privilege agent output directly into higher-privilege contexts.

Attribution:

saagarjha #1
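
The dataflow advice in this insight can be sketched as simple label propagation. This is a minimal illustration, not something taken from the post or the discussion: the `Trust` levels, the `Labeled` wrapper, and every function name below are hypothetical. The idea it shows is that merged content inherits the trust level of its least-trusted part, and a privileged step refuses anything that has not explicitly crossed a review boundary.

```python
from dataclasses import dataclass
from enum import IntEnum


class Trust(IntEnum):
    """Hypothetical trust levels; a higher value means more trusted."""
    UNTRUSTED = 0   # web pages, bug reports, calendar entries, agent output
    REVIEWED = 1    # a human has read and approved this content
    OPERATOR = 2    # instructions written directly by the operator


@dataclass(frozen=True)
class Labeled:
    """Content plus the trust level of its least-trusted source."""
    text: str
    trust: Trust


def combine(*parts: Labeled) -> Labeled:
    """Merging content keeps the lowest trust level of any part (taint spreads)."""
    return Labeled(
        text="\n".join(p.text for p in parts),
        trust=min(p.trust for p in parts),
    )


def approve(content: Labeled) -> Labeled:
    """Explicit boundary crossing: human review raises the label to at least REVIEWED."""
    return Labeled(content.text, max(content.trust, Trust.REVIEWED))


def feed_privileged_agent(content: Labeled, required: Trust = Trust.REVIEWED) -> str:
    """Refuse to hand content below the required trust level to a privileged step."""
    if content.trust < required:
        raise PermissionError(
            f"content is {content.trust.name}; {required.name} or higher is required"
        )
    return content.text
```

In this sketch, a plan that mixes operator instructions with text scraped from a README is labeled UNTRUSTED, so `feed_privileged_agent` raises until someone calls `approve` on it. A label is only as good as the review behind it, so this narrows where injection can travel without proving the content is clean.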

Secrets and egress are the real blast radius

Filesystem destruction is noisy and recoverable if you have backups. The nastier failures are stolen credentials, silent corruption, and data leaving through an approved channel. Several comments converged on the same operational lesson. A separate box or VM helps, but it does little if the agent still has broad API tokens, SSH access, browser cookies, or outbound paths to places you trust. That is why the post’s VM-heavy framing only gets you halfway. The crown jewels are credentials and communication, not the workdir.

Spend your effort on credential minimization, outbound controls, and human review of external actions before polishing local sandboxing. If an agent can see a secret and talk to the network, assume it can eventually leak that secret.

Attribution:

barrkel #1
e12e #1
koolba #1
jon-wood #1
kstenerud #1
ElenaDaibunny #1
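
Credential minimization can be made concrete with a small host-side broker that hands the sandbox narrow, short-lived, revocable tokens while the real secrets stay on the host. This is a sketch under stated assumptions, not Anthropic's design: `TokenBroker`, `Grant`, the repo and action names, and the 15-minute default lifetime are all hypothetical.

```python
import secrets
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """What one token may do. All fields are illustrative."""
    repo: str
    actions: frozenset
    expires_at: float


class TokenBroker:
    """Runs on the host; the sandbox only ever sees opaque token strings."""

    def __init__(self):
        self._grants = {}

    def issue(self, repo, actions, ttl_seconds=900.0, now=None):
        """Create a token limited to one repo, a set of actions, and a lifetime."""
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(32)
        self._grants[token] = Grant(repo, frozenset(actions), now + ttl_seconds)
        return token

    def revoke(self, token):
        """Forget the token; later checks fail closed."""
        self._grants.pop(token, None)

    def allows(self, token, repo, action, now=None):
        """Return True only for a known, unexpired token with a matching grant."""
        now = time.time() if now is None else now
        grant = self._grants.get(token)
        if grant is None or now >= grant.expires_at:
            return False
        return grant.repo == repo and action in grant.actions
```

A token issued with `broker.issue("org/app", {"read"})` passes `allows(token, "org/app", "read")` but fails for a push or for any other repository, and it stops working after expiry or `revoke`. The point of the design is that a leaked token exposes one repo and one verb for a few minutes, not the account behind it.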

Auto mode safety claims do not match the numbers

Anthropic’s own writeup says the classifier behind Claude Code auto mode lets about 17 percent of risky actions through. That makes the public docs language about blocking anything irreversible or destructive read as a product promise the system does not meet. This changes how to interpret the whole stack. The approval model is a convenience feature with a measurable miss rate, not a safety guarantee you can build policy around.

Do not grant wider permissions because an approval classifier sounds strong in marketing copy. Keep irreversible operations behind separate controls that you own, and verify false negative rates under your own task mix.

Attribution:

SpicyLemonZest #1
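
A short calculation shows why a miss rate of this size cannot carry policy on its own. The sketch below assumes each risky action is judged independently with the same miss rate, which is a simplification that the post does not claim; the function name is hypothetical.

```python
def p_at_least_one_miss(miss_rate: float, risky_actions: int) -> float:
    """Probability that at least one of n risky actions gets through.

    Assumes independent decisions with an identical miss rate per action.
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    if risky_actions < 0:
        raise ValueError("risky_actions must be non-negative")
    return 1.0 - (1.0 - miss_rate) ** risky_actions


# Print the cumulative risk for a few session lengths at a 17 percent miss rate.
for n in (1, 5, 10, 20):
    print(n, round(p_at_least_one_miss(0.17, n), 3))
```

Under that independence assumption, a 17 percent per-action miss rate means roughly a 61 percent chance that something slips through after five risky actions and roughly 84 percent after ten. Real workloads will differ, which is the reason to measure false negatives on your own task mix.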

Allowlisted egress still leaks through covert channels

Network allowlists and in-VM proxies reduce casual exfiltration, but they do not close the problem. Commenters pointed to domain fronting, steganography in commits or generated text, and encoding data into timing or ordering across allowed requests. Even manual airlock designs can leak if the handoff artifact itself carries an encoded payload. That pushes the problem toward the same place as banking fraud. You can reduce and detect it, but not cleanly prove it away in a general-purpose agent workflow.

Treat egress filtering as damage reduction, not containment proof. Pair it with anomaly detection, narrow data exposure, and workflow design that minimizes what the model can ever access in the first place.

Attribution:

benlivengood #1 #2
Retr0id #1
jazzyjackson #1
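
To see why an allowlist reduces damage without proving containment, it helps to look at what such a check actually inspects. The sketch below is illustrative only: the host names in `ALLOWED_HOSTS` and the function name are assumptions, and a real proxy would enforce this at the network layer instead of in application code.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; substitute the destinations your workflow needs.
ALLOWED_HOSTS = frozenset({"pypi.org", "files.pythonhosted.org"})


def egress_allowed(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Allow only HTTPS requests whose exact host is on the allowlist."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URLs fail closed
    if parts.scheme != "https":
        return False
    # Compare the parsed host exactly; substring checks are easy to bypass.
    host = (parts.hostname or "").lower().rstrip(".")
    return host in allowed_hosts
```

The check rejects lookalikes such as `https://pypi.org.evil.example/` and `https://pypi.org@evil.example/`, but it returns True for any request to an allowed host, including one whose path or query string carries encoded data. It inspects the destination, not the payload, so the covert channels described above pass straight through it.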

Against the grain

Anthropic may be doing the safety work sincerely

The cynical read is that public danger talk is just capability marketing ahead of an IPO. A more useful read is that Anthropic was founded around AI safety concerns and is one of the few labs willing to publish operational details about containment, even when that means highlighting ugly failure modes. The blackmail example that got mocked was explicitly presented as a fictional test scenario, not a live incident. That does not make the company beyond spin. It does mean not every alarming example is evidence of deception.

Read vendor safety writeups with skepticism, but do not discard them wholesale as marketing. Separate fictional evaluation scenarios from real-world incidents, and ask whether the published architecture gives you reusable controls either way.

Attribution:

bananamogul #1
forest32 #1
ngruhn #1
radu_floricica #1

Many agent disasters are just standard bad ops

A lot of the scary examples lose their mystique once you strip away the AI framing. If production secrets are accessible from a workstation, if browser cookies can be scraped, or if destructive resources are exposed to a tool that executes commands, you already had a weak environment. In that view, the agent is not introducing a new class of failure so much as rapidly finding the same paths a sloppy human operator or attacker would find.

Fix the obvious operational mistakes even if you ban agents tomorrow. Remove ambient credentials, split dirty work from daily-driver machines, and keep production access out of general-purpose execution environments.

Attribution:

protocolture #1
lambda #1
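
Removing ambient credentials can start with the environment a tool inherits. The following is a minimal sketch, not a complete defense: the `keep` list and the name pattern are assumptions to adapt, and the function name is hypothetical. It passes through only an explicit allowlist of variables and drops anything whose name still looks like a secret.

```python
import re

# Names that suggest a credential; illustrative and certainly incomplete.
SECRET_NAME = re.compile(
    r"(TOKEN|SECRET|PASSWORD|PASSWD|API_?KEY|CREDENTIAL|AWS_|SSH_AUTH_SOCK)",
    re.IGNORECASE,
)


def scrub_env(env, keep=("PATH", "HOME", "LANG", "TERM")):
    """Return a copy of env holding only allowlisted, non-secret-looking names."""
    return {
        name: value
        for name, value in env.items()
        if name in keep and not SECRET_NAME.search(name)
    }
```

One way to use it is `subprocess.run(cmd, env=scrub_env(os.environ))`, so the child process never sees tokens exported in the parent shell. This covers environment variables only; credential files, browser cookies, and SSH agents need separate handling.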

Practical local sandboxes are good enough for many developers

Some people are already getting substantial value from qemu VMs, bubblewrap, and review-before-push workflows. The pattern is simple. Let the agent install packages, build containers, and make local commits in a disposable environment, but keep git push, credentials, and final review outside it. That setup does not solve prompt injection in the abstract. It does cut enough risk to make the productivity gain worth it for ordinary coding tasks.

If your use case is routine development work, you do not need a perfect security proof to get value. Start with a VM or strong local sandbox, remove credentials, and keep every external side effect behind manual review.

Attribution:

emilburzo #1 #2
dist-epoch #1
vbezhenar #1
saghm #1
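
The local-sandbox pattern can be sketched as a function that builds a bubblewrap command line. This is an assumption-laden illustration, not a setup described by any commenter: the function name is hypothetical, the read-only `/usr` bind and the symlinks assume a merged-`/usr` Linux layout, and the flag set should be checked against the `bwrap` manual for your version. The function only builds the argument list; it does not run anything.

```python
def bwrap_argv(workdir, command, network=False):
    """Build a bwrap argument list that exposes one writable work directory."""
    argv = ["bwrap", "--unshare-all"]  # new namespaces, including the network
    if network:
        argv.append("--share-net")  # keep the network, e.g. to install packages
    argv += [
        "--die-with-parent",                    # kill the sandbox with its parent
        "--clearenv",                           # start from an empty environment
        "--setenv", "PATH", "/usr/bin:/bin",
        "--ro-bind", "/usr", "/usr",            # system files, read-only
        "--symlink", "usr/bin", "/bin",         # assumes a merged-/usr layout
        "--symlink", "usr/lib", "/lib",
        "--symlink", "usr/lib64", "/lib64",
        "--proc", "/proc",
        "--dev", "/dev",
        "--tmpfs", "/tmp",
        "--bind", workdir, "/work",             # the only writable host path
        "--chdir", "/work",
    ]
    return argv + list(command)
```

Passing the result to `subprocess.run` starts the command with no home directory, no inherited environment, and no network unless `network=True`. Credentials and `git push` stay outside because nothing in the sandbox can reach them, which matches the review-before-push workflow described above.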

In plain English

allowlist

A list of specific network destinations, packages, or actions that are explicitly permitted while everything else stays blocked by default.

Bubblewrap

A Linux sandboxing tool that restricts what a process can access on the host system.

domain fronting

A technique that hides the true destination of network traffic behind an allowed domain or content delivery service.

egress

Outbound network traffic leaving a system or sandbox.

MicroVM

A very small virtual machine designed to start quickly and use fewer resources than a traditional virtual machine while still providing hardware-level isolation.

prompt injection

An attack in which untrusted input is crafted to manipulate an AI system into ignoring its original instructions or policies.

QEMU

A widely used open source machine emulator and virtualizer for running virtual machines.

steganography

A way of hiding information inside another message so the hidden data is not obvious to the recipient or observer.

VM

Virtual machine, software that emulates a separate computer system inside another machine.

Reference links

Anthropic and Claude references

Anthropic engineering post on containing Claude
The main article being discussed, describing Anthropic’s containment patterns across products.
Claude 4 system card
Cited to show Anthropic’s blackmail example was explicitly a fictional test scenario.
Claude Code auto mode configuration docs
Referenced to contrast product docs with the blog post’s admitted miss rate for risky actions.

Security research and containment writeups

Hacking Claude Code on the web: breaking YOLO mode sandboxing
Used as evidence that Claude session isolation and token scoping have failed in practice.
The lethal trifecta for AI agents
A framing for agent risk around private data, untrusted content, and external communication.
Running Claude Code dangerously safely
A personal writeup describing a local VM-based containment workflow for Claude Code.
Related Hacker News discussion of that containment setup
Linked as prior discussion of the same VM-based setup.
yoloai
Shared as a tool built around environment-layer containment for AI agents.

Projects exploring exfiltration and isolation

OrcaBot blog: breaking the lethal trifecta
Mentioned by a builder working on exfiltration defenses and fraud-like detection for agents.
OrcaBot GitHub repository
Code repository for the same agent security project.
Tin Foil Chat
An example of strong one-way isolation and offline key handling offered as inspiration for an airlock design.

Other references

YC video on database-centered agent architecture
Referenced to support the claim that heavily constrained SQL-based execution can be a safer sandbox than arbitrary code execution.
Personified software notes
Linked in a side discussion about personifying AI software and brand positioning.