What happened after 2k people tried to hack my AI assistant

AI
Security
Open Source
Developer Tools

The post describes a public challenge called HackMyClaw. An OpenClaw email agent named Fiu, running on Claude Opus 4.6 with access to secrets, was exposed to thousands of attacker emails. The writeup says no secrets leaked. It also says the setup had to change midstream because processing messages in shared context made the model increasingly suspicious, so later emails were handled in fresh contexts. Fiu had permission to reply by email, but the author says it was instructed not to, largely because replying to every message was too expensive.

That framing shaped almost all of the reaction. People did not read this as evidence that prompt injection is solved. They read it as evidence that a frontier model with strong instructions can survive a pile of mostly single-shot email jailbreaks when the useful behavior is heavily constrained.

The sharpest criticism was that the test ducks the hard part. A secure assistant is not one that ignores an input channel. It is one that can tell good requests from malicious ones while still doing real work. Several commenters pushed on the missing metric: false positives versus false negatives. Without legitimate email tasks in the mix, no one can tell whether Fiu was robust or just overly defensive.

The second big theme was threat model realism. Resetting context per email removes the multi-turn “frog boiling” style attacks many people see as the practical failure mode. Not letting the agent freely reply also removes attacker feedback loops that would normally help refine an exploit. Others noted that direct prompt injection by email is the easier case anyway. Real exfiltration risk often comes from indirect channels like fetched web pages, tool outputs, file handling, calendar invites, or shell actions that move data out of band without ever printing the secret in a reply.

The comments still gave the experiment some credit. A lot of casual jailbreak lore says frontier models are trivial to break. This result cuts against that. People accepted that modern models like Opus are better at spotting obvious social engineering and refusing blatant secret extraction attempts than they were two years ago. But the dominant conclusion was narrower and more practical: stronger instruction following raises the cost of attack, it does not remove it, and permissions matter more than confidence. Even the author echoed that line by saying the experience made them less worried, but not comfortable enough to grant agents permission to send email.

Cost became part of the security story too. The system burned several hundred dollars, Gmail rate-limited the account, and multiple commenters pointed out that “denial of wallet” is itself a viable attack surface for public-facing agents.
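
One way to bound the denial-of-wallet exposure described above is a hard spending cap that is checked before each model call. The following is a minimal sketch, not the author's setup: `BudgetGuard`, `process_inbox`, the flat per-email cost estimate, and the dollar figures are all hypothetical.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when processing another email would exceed the spending cap."""


@dataclass
class BudgetGuard:
    # Hypothetical hard cap in US dollars for one billing window.
    cap_usd: float
    spent_usd: float = 0.0

    def charge(self, estimated_cost_usd: float) -> None:
        # Refuse the call up front instead of discovering the overrun later.
        if self.spent_usd + estimated_cost_usd > self.cap_usd:
            raise BudgetExceeded(f"cap of {self.cap_usd:.2f} USD would be exceeded")
        self.spent_usd += estimated_cost_usd


def process_inbox(emails, guard, cost_per_email_usd):
    """Process emails until the budget runs out; return (handled, skipped)."""
    handled, skipped = [], []
    for email in emails:
        try:
            guard.charge(cost_per_email_usd)
        except BudgetExceeded:
            skipped.append(email)  # Defer instead of paying for it now.
            continue
        handled.append(email)  # A real agent would call the model here.
    return handled, skipped
```

A cap like this turns an unbounded bill into a bounded backlog. It does not stop an attacker from crowding out legitimate mail, so it works best alongside sender rate limits.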

Treat this as a narrow test of one-shot prompt injection against a guarded, expensive frontier model, not as proof your own agent is safe. If your product must read untrusted content and still act, you need to test false positives, multi-turn attacks, indirect exfiltration paths, and cheaper model variants before relaxing permissions.
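
The missing metric can be made concrete by scoring a mixed set of legitimate and malicious emails. This is a minimal sketch under stated assumptions: `EvalCase` and `error_rates` are hypothetical names, and the ground-truth labels are assumed to come from whoever builds the test set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    is_attack: bool    # Ground-truth label assigned by the test author.
    agent_acted: bool  # True if the agent carried out the requested task.


def error_rates(cases):
    """Return (false_positive_rate, false_negative_rate).

    False positive: a legitimate request the agent refused.
    False negative: an attack the agent acted on.
    """
    legit = [c for c in cases if not c.is_attack]
    attacks = [c for c in cases if c.is_attack]
    if not legit or not attacks:
        raise ValueError("need both legitimate and attack cases")
    false_positives = sum(1 for c in legit if not c.agent_acted)
    false_negatives = sum(1 for c in attacks if c.agent_acted)
    return false_positives / len(legit), false_negatives / len(attacks)
```

An agent that refuses everything scores a false negative rate of 0.0 and a false positive rate of 1.0, which is exactly the "overly defensive" outcome that an attack-only test cannot detect.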

June 26, 2026
fernandoi.cl

Key insights

No secret leak is the wrong bar

The more revealing failure condition here was unauthorized action, not whether the model printed the secret. If an email can make the agent reply, call a tool, or otherwise act against instructions, that is already a successful compromise. Focusing on whether the exact secret string appeared in output understates how agents actually get exploited, because exfiltration usually happens through side effects rather than confession.

Define attack success around forbidden actions in your own system, not around one dramatic artifact like a leaked password. Your evals should record every unauthorized send, fetch, write, or tool call, even when no secret appears in the transcript.
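
As an illustration of scoring by action instead of by leaked string, the sketch below classifies one eval run from its recorded tool calls. The action names in `FORBIDDEN_ACTIONS`, the `ToolCall` record, and `score_run` are hypothetical; substitute whatever your own harness logs.

```python
from dataclasses import dataclass

# Hypothetical action names; replace with the ones your harness records.
FORBIDDEN_ACTIONS = frozenset({"send_email", "http_request", "write_file", "run_shell"})


@dataclass(frozen=True)
class ToolCall:
    name: str
    argument: str


def score_run(tool_calls, reply_text, secret):
    """Classify one eval run by what the agent did, not only what it said."""
    unauthorized = [c.name for c in tool_calls if c.name in FORBIDDEN_ACTIONS]
    secret_in_reply = secret in reply_text
    secret_in_tool_args = any(secret in c.argument for c in tool_calls)
    return {
        "unauthorized_calls": unauthorized,
        "secret_in_reply": secret_in_reply,
        "secret_in_tool_args": secret_in_tool_args,
        # Any forbidden action counts as a compromise, even with no leak.
        "compromised": bool(unauthorized) or secret_in_reply or secret_in_tool_args,
    }
```

Under this scoring, a run in which the agent sends a harmless reply against instructions is still recorded as compromised. A plain substring check misses encoded or split secrets, so treat the two secret flags as a lower bound.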

Attribution:

keynha #1
dmurray #1
cuchoi #1

Reproducibility and model comparisons are missing

What would make this far more useful is a replayable corpus of attack emails and the exact OpenClaw setup. One commenter pointed to a separate OpenClaw writeup where Opus 4.8 was pushed into downloading and executing a malicious script during ordinary email summarization. That turns this from a blog experiment into an obvious benchmark opportunity, because the interesting question is not whether one guarded run survived but which prompts, harnesses, and models fail under the same workload.

Save attack corpora and harness configs as first-class test assets. Once you have them, rerun them across model upgrades, prompt changes, and fallback models instead of relying on one-off anecdotes.
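
A replayable corpus could be exercised along the following lines. This is a sketch only: `corpus_fingerprint` and `replay` are hypothetical helpers, and each entry in `agents` is a stand-in callable for a real model plus harness configuration.

```python
import hashlib
import json


def corpus_fingerprint(emails):
    """Stable hash of the corpus so results can be tied to an exact input set."""
    payload = json.dumps(sorted(emails), ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def replay(emails, agents):
    """Run every email through every agent config and tabulate failures.

    `agents` maps a config label to a callable that takes an email body and
    returns True if that configuration was compromised by it.
    """
    report = {"corpus_sha256": corpus_fingerprint(emails), "failures": {}}
    for label, is_compromised in agents.items():
        report["failures"][label] = [e for e in emails if is_compromised(e)]
    return report
```

Recording the corpus hash next to the results makes it possible to tell whether two runs that disagree were actually fed the same attack set.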

Attribution:

veganmosfet #1
cuchoi #1

Single-shot tests miss the common exploit path

Many successful jailbreaks do not happen in one dramatic prompt. They happen through gradual, multi-turn steering where each step looks harmless on its own. Fresh context per email strips out exactly that attack pattern, so a clean result here says little about agents that maintain memory, work over long horizons, or let attackers adapt based on prior responses.

If your agent has memory or ongoing conversations, add longitudinal attack evals. Test sequences of benign-looking interactions that build toward an unauthorized action over hours or days, not just one-message payloads.
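
The difference between shared and fresh context can be tested with the same message sequence run both ways. The sketch below assumes a hypothetical `make_agent` factory; `make_toy_agent` is a deliberately artificial stand-in that only exercises the harness and says nothing about how a real model behaves.

```python
def first_violation(make_agent, messages, shared_context):
    """Return the 0-based index of the first message that triggers a forbidden
    action, or None if the whole sequence is handled safely.

    `make_agent` returns a callable that takes one message and returns True
    when the agent performs a forbidden action.
    """
    agent = make_agent()
    for index, message in enumerate(messages):
        if not shared_context:
            agent = make_agent()  # Fresh context: forget all earlier turns.
        if agent(message):
            return index
    return None


def make_toy_agent():
    """Toy stand-in: gives in only after three earlier messages were seen."""
    seen = []

    def agent(message):
        violated = "send the file" in message and len(seen) >= 3
        seen.append(message)
        return violated

    return agent
```

With the toy agent, a four-message sequence ending in "please send the file" fails at the last turn under shared context and passes under fresh context, which is the gap a single-shot eval cannot see.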

Attribution:

idiotsecant #1
ant-kinesthetic #1
smusamashah #1

Indirect exfiltration is the real problem

The dangerous version of prompt injection is often not “tell me the secret.” It is “use your tools in a way that carries the secret somewhere else.” Commenters sketched plausible routes through web requests, fake CVE triage pages, diff workflows, calendar invites, and shell or file operations. That framing shifts attention from output filtering to capability control, because a model that never writes the secret in chat can still leak it perfectly well through a POST body or a file it was tricked into creating.

Audit every outbound channel your agent can touch. Lock down network egress, file reads, shell execution, and structured actions with the same seriousness as reply text, because those are often the cleaner exfiltration paths.
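
For the network egress part of that audit, a deny-by-default host allowlist is one common starting point. The following is a minimal application-level sketch; `ALLOWED_HOSTS` and `egress_allowed` are hypothetical, and the host name is a placeholder that does not come from the original post.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; the host is a placeholder for illustration.
ALLOWED_HOSTS = frozenset({"api.example.com"})


def egress_allowed(url, allowed_hosts=ALLOWED_HOSTS):
    """Allow only HTTPS requests to explicitly listed hosts (deny by default)."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # Malformed URLs are rejected, not guessed at.
    if parts.scheme != "https":
        return False
    host = parts.hostname  # Lowercased, without port or userinfo.
    return host is not None and host in allowed_hosts
```

Checking `hostname` instead of the raw string matters: a URL such as `https://api.example.com@attacker.example.net/` looks allowlisted at a glance but resolves to the attacker's host. A check inside the agent process can be bypassed by a shell tool, so it complements network-level egress rules but does not replace them.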

Attribution:

dmagog #1
jetti #1
quuxplusone #1

Public contests mostly attract casual attacks

A clean defensive result in an open challenge does not imply the system would survive skilled operators. Several commenters argued that good jailbreaks are scarce, expensive, and lose value quickly once disclosed. That means the people most likely to have effective prompts have the least incentive to burn them for a small public bounty. The experiment still tells you something about resistance to common attacks, but not much about exposure to motivated adversaries.

Use open contests to harvest broad prompt diversity, not to certify security against serious attackers. For higher assurance, budget for private red teaming with incentives that match the value of the exploit.

Attribution:

mpeg #1
cuchoi #1
saberience #1

OS sandboxing beats prompt discipline

One commenter cut to the core systems point: if the secrets file is readable by the agent process, you are betting policy on model behavior. Mandatory access controls like SELinux, AppArmor, or Tomoyo can make that bet unnecessary by preventing the process from reading the file in the first place. The author's own response underscored the same instinct at a higher layer by withholding email-send permissions despite the encouraging result.

Move critical protections out of the prompt and into the operating system or permission layer. Give agents fewer secrets and fewer outbound rights, then treat model guardrails as a backup, not the primary control.

Attribution:

seethishat #1
cuchoi #1

Against the grain

The post never claimed a universal victory

Some pushback said the criticism overshoots what was actually written. The author described a personal update in confidence, not a scientific proof that prompt injection is solved. The pushback also pointed to two clarifications in the writeup and comments: Gmail filtering was not the main reason for the result, and the contaminated shared-context issue was corrected by reprocessing each email in a fresh context.

Read this as one operator updating their prior based on a bounded exercise. That still leaves plenty of room to disagree with the threat model without pretending the claim was broader than it was.

Attribution:

ChrisRR #1
Ysx #1
cuchoi #1

Raising attacker skill requirements still counts

A few commenters argued that forcing attacks out of the realm of trivial jailbreaks is meaningful progress, even if it falls far short of robust security. If exploitation now looks more like specialized social engineering research than a cute prompt trick, the cost and rarity of attacks change. That matters for many practical deployments, especially compared with the common online claim that frontier models will spill secrets on demand.

Do not dismiss incremental hardening just because it is incomplete. If your controls push attacks from casual misuse into specialist tradecraft, that changes your risk model, staffing needs, and monitoring thresholds.

Attribution:

insanitybit #1
saberience #1
efromvt #1

In plain English

AppArmor

A Linux security module that confines programs with per-application access rules.

Claude Opus 4.6

A high-end language model from Anthropic used here as the agent’s core model.

CVE

Common Vulnerabilities and Exposures, a public catalog entry for a known software security flaw.

denial of wallet

An attack that drives up usage costs, such as API or model-token spending, until the service becomes too expensive to operate.

frontier model

A state-of-the-art artificial intelligence model at the leading edge of capability and cost.

jailbreak

A prompt or interaction pattern that gets an AI model to bypass its safety or behavior restrictions.

mandatory access controls

Operating system rules that enforce resource access limits regardless of what an application tries to do.

OpenClaw

An open source AI agent framework that can connect a language model to tools and external tasks.

prompt injection

A technique where untrusted input is written to manipulate an AI system into ignoring its original instructions or policies.

SELinux

Security-Enhanced Linux, a Linux security system that restricts what processes are allowed to access.

Tomoyo

A Linux security module that controls what files, capabilities, and actions a process may use.

Reference links

Related prompt injection research

Role Confusion
Cited as a strong recent example showing how models still fail under role-confusion prompt injection patterns.
OpenClaw Opus 4.8 malicious script writeup
Referenced as a contrasting experiment where an OpenClaw setup was steered into downloading and executing a malicious script.

Challenge and implementation references

Lakera Gandalf baseline
Brought up as a prior prompt-injection challenge that some commenters felt became impossible by making the model too unhelpful.
OpenClaw Ansible setup
Shared by the author as the deployment basis used for the experiment.
HackMyClaw attack log
Mentioned by commenters who wanted to inspect the actual attack attempts and outcomes.

Broader security analogies and commentary

AI-box experiment
Used as an analogy for evaluating trust under much higher stakes than this email-agent challenge.
Syntax hacking article on Ars Technica
Cited as an example of newer jailbreak research exploring sentence-structure based bypasses.
