Anthropic’s post lays out a practical security story for AI agents: as Claude gets more capable, you should assume it will sometimes take bad actions, so the job is to contain the environment around it. The post describes a few patterns. Put the agent in a VM instead of trusting prompt-level behavior. Keep sensitive credentials on the host and hand the VM narrow, revocable tokens. Restrict outbound network access with proxies and allowlists. Add model-side checks like approval classifiers, but only as another layer. In plain terms, Anthropic is saying agent safety is not about making the model perfectly obedient. It is about limiting what it can touch when it is not.
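The credential pattern is easier to see in code. This is a minimal sketch under stated assumptions, not Anthropic's implementation. A hypothetical host-side broker keeps the real credential and hands the VM an opaque token that is scoped, expires, and can be revoked. The names `HostTokenBroker`, `issue`, `authorize`, and `revoke` are illustrative, and so are scope strings like `repo:read`.

```python
import secrets
import time
from dataclasses import dataclass, field


@dataclass
class ScopedToken:
    value: str
    scopes: frozenset[str]
    expires_at: float
    revoked: bool = False


@dataclass
class HostTokenBroker:
    """Hypothetical host-side broker: the real credential never leaves the host."""

    real_credential: str
    _issued: dict[str, ScopedToken] = field(default_factory=dict)

    def issue(self, scopes: set[str], ttl_seconds: float) -> str:
        # Give the VM a random opaque token, never the real credential.
        value = secrets.token_urlsafe(32)
        self._issued[value] = ScopedToken(
            value, frozenset(scopes), time.time() + ttl_seconds
        )
        return value

    def revoke(self, value: str) -> None:
        # Revocation cuts off the VM without rotating the host's real secret.
        if value in self._issued:
            self._issued[value].revoked = True

    def authorize(self, value: str, scope: str) -> bool:
        # The host checks this before making the real API call on the VM's behalf.
        tok = self._issued.get(value)
        if tok is None or tok.revoked or time.time() >= tok.expires_at:
            return False
        return scope in tok.scopes
```

The VM only ever holds the opaque token. The host decides per request whether to make the real call with its own credential, so a leaked token is narrow in scope, short-lived, and can be killed independently.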
That basic framing was received as obvious and mostly correct. Security is a cost and blast-radius problem, not a purity test, and several people said this is simply the right way to think about any system that can run commands. The sharper reaction was that Anthropic’s writeup smooths over how brittle this gets in real use. Once an agent can read files, browse the web, inspect tool output, or call third-party APIs, prompt injection can arrive through anything the user would normally treat as trusted. A calendar entry, bug report, README, dependency metadata, or even an artifact produced by a lower-privilege VM can become an attack path into a higher-privilege step. At that point, “just sandbox it” is not a complete answer, because the dangerous channel is often the data flow, not the process boundary.
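One way to make the data-flow point concrete is provenance (taint) tracking, a technique the post itself does not describe. In this hypothetical sketch, every value carries a flag recording whether any part of it came from untrusted content. The flag propagates when values are combined, and a privileged step refuses tainted input unless a human reviews it. All names here (`Value`, `combine`, `privileged_shell`, `NeedsHumanReview`) are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    text: str
    tainted: bool  # True if any part came from untrusted content


def from_user(text: str) -> Value:
    return Value(text, tainted=False)


def from_untrusted(text: str) -> Value:
    # READMEs, bug reports, tool output, artifacts from a lower-privilege VM, etc.
    return Value(text, tainted=True)


def combine(*parts: Value) -> Value:
    # Taint propagates: anything derived from tainted input is itself tainted.
    return Value("".join(p.text for p in parts), tainted=any(p.tainted for p in parts))


class NeedsHumanReview(Exception):
    pass


def privileged_shell(command: Value) -> str:
    # Hypothetical high-privilege step: never act on tainted input automatically.
    if command.tainted:
        raise NeedsHumanReview(f"command derived from untrusted data: {command.text!r}")
    return f"would run: {command.text}"
```

The hard part in a real agent is that the model mixes everything in its context. Taint cannot easily be tracked through its reasoning, so the conservative reading of this pattern is to treat all model output as tainted once any untrusted content has entered the context.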
The most concrete criticism came from people pointing at known implementation gaps. One commenter linked prior research describing bugs in claude.ai and Claude Code session isolation that allegedly let one session access other sessions, connected repos, and environment variables, with regressions after fixes. Others pointed out that Anthropic’s own auto mode docs claim it blocks irreversible or destructive actions, while the post admits that roughly 17 percent of risky actions still slip through. That turned what could have been read as a clean defense-in-depth post into something narrower: useful design guidance, but not evidence that Anthropic has containment solved.
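A rough calculation shows why a 17 percent miss rate falls short of containment. Assume, as a simplification, that each risky action is checked independently with the same miss rate. Then the chance that at least one risky action gets through over n of them is 1 - (1 - 0.17)^n. The function name below is illustrative.

```python
def prob_at_least_one_slip(miss_rate: float, risky_actions: int) -> float:
    """P(at least one risky action passes), assuming independent checks."""
    return 1.0 - (1.0 - miss_rate) ** risky_actions


for n in (1, 5, 10, 20):
    print(n, round(prob_at_least_one_slip(0.17, n), 3))
```

Under that independence assumption, the probability is about 0.61 at five risky actions and about 0.84 at ten. Correlated failures, such as an injection pattern the classifier consistently misses, would change these numbers. Either way, a per-action miss rate compounds over a long agent session.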
The practical consensus was to focus less on filesystem isolation and more on secrets and egress. A disposable machine, VM, or microVM helps, and many people said they already run Claude inside qemu, bubblewrap, or custom VM setups. But the harder problem is not the agent trashing its workspace. It is the agent seeing credentials, abusing overbroad API tokens, or exfiltrating data through channels you intentionally left open. Several people argued the safest pattern is to keep agents away from crown-jewel secrets entirely, force human review before any push or external action, and assume that anything with outbound communication can leak information in ways that an allowlist will not fully stop.
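The egress advice can be sketched as a policy function that an egress proxy might call for each outbound request. This is a hypothetical sketch with the proxy plumbing omitted. The hostnames, the `egress_decision` name, and the three-way deny/review/allow split are assumptions. The function denies anything off the allowlist and routes writes to a human. A comment notes why "allow" still does not mean "cannot leak".

```python
from urllib.parse import urlsplit

# Hypothetical egress policy for an agent VM; these hosts are examples only.
ALLOWED_HOSTS = frozenset({"pypi.org", "files.pythonhosted.org", "api.github.com"})
READ_METHODS = frozenset({"GET", "HEAD"})


def egress_decision(method: str, url: str) -> str:
    """Return "deny", "review", or "allow" for one outbound request."""
    parts = urlsplit(url)
    # Exact hostname match over HTTPS only; no subdomain wildcards.
    if parts.scheme != "https" or parts.hostname not in ALLOWED_HOSTS:
        return "deny"
    if method.upper() not in READ_METHODS:
        # Writes (pushes, issue comments, uploads) are external actions: a human approves.
        return "review"
    # Reads still carry data in the path and query string, so this is not "cannot leak".
    return "allow"
```

Consider a POST that opens an issue on an attacker-owned repository. It goes to `api.github.com`, so it passes the host check, and only the review step stands in its way. Reads to allowlisted hosts remain a residual channel whenever an attacker can observe them. That is the gap the allowlist by itself does not close.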