HN Debrief

What happened after 2k people tried to hack my AI assistant

  • AI
  • Security
  • Open Source
  • Developer Tools

The post describes a public challenge called HackMyClaw. An OpenClaw email agent named Fiu, running on Claude Opus 4.6 with access to secrets, was exposed to thousands of attacker emails. The writeup says no secrets leaked. It also says the setup had to change midstream because processing messages in shared context made the model increasingly suspicious, so later emails were handled in fresh contexts. Fiu had permission to reply by email, but the author says it was instructed not to, largely because replying to every message was too expensive.

Treat this as a narrow test of one-shot prompt injection against a guarded, expensive frontier model, not as proof your own agent is safe. If your product must read untrusted content and still act, you need to test false positives, multi-turn attacks, indirect exfiltration paths, and cheaper model variants before relaxing permissions.

Discussion mood

Skeptical with some respect for the effort. People liked the experiment as a public stress test, but most thought the conclusions were too optimistic because the agent was constrained enough that it avoided the hardest real-world failure modes.

Key insights

  1. 01

    No secret leak is the wrong bar

    The more revealing failure condition here was unauthorized action, not whether the model printed the secret. If an email can make the agent reply, call a tool, or otherwise act against instructions, that is already a successful compromise. Focusing on whether the exact secret string appeared in output understates how agents actually get exploited, because exfiltration usually happens through side effects rather than confession.

    Define attack success around forbidden actions in your own system, not around one dramatic artifact like a leaked password. Your evals should record every unauthorized send, fetch, write, or tool call, even when no secret appears in the transcript.

      Attribution:
    • keynha #1
    • dmurray #1
    • cuchoi #1
  2. 02

    Reproducibility and model comparisons are missing

    What would make this far more useful is a replayable corpus of attack emails and the exact OpenClaw setup. One commenter pointed to a separate OpenClaw writeup where Opus 4.8 was pushed into downloading and executing a malicious script during ordinary email summarization. That turns this from a blog experiment into an obvious benchmark opportunity, because the interesting question is not whether one guarded run survived but which prompts, harnesses, and models fail under the same workload.

    Save attack corpora and harness configs as first-class test assets. Once you have them, rerun them across model upgrades, prompt changes, and fallback models instead of relying on one-off anecdotes.

      Attribution:
    • veganmosfet #1
    • cuchoi #1
  3. 03

    Single-shot tests miss the common exploit path

    Many successful jailbreaks do not happen in one dramatic prompt. They happen through gradual, multi-turn steering where each step looks harmless on its own. Fresh context per email strips out exactly that attack pattern, so a clean result here says little about agents that maintain memory, work over long horizons, or let attackers adapt based on prior responses.

    If your agent has memory or ongoing conversations, add longitudinal attack evals. Test sequences of benign-looking interactions that build toward an unauthorized action over hours or days, not just one-message payloads.

      Attribution:
    • idiotsecant #1
    • ant-kinesthetic #1
    • smusamashah #1
  4. 04

    Indirect exfiltration is the real problem

    The dangerous version of prompt injection is often not “tell me the secret.” It is “use your tools in a way that carries the secret somewhere else.” Commenters sketched plausible routes through web requests, fake CVE triage pages, diff workflows, calendar invites, and shell or file operations. That framing shifts attention from output filtering to capability control, because a model that never writes the secret in chat can still leak it perfectly well through a POST body or a file it was tricked into creating.

    Audit every outbound channel your agent can touch. Lock down network egress, file reads, shell execution, and structured actions with the same seriousness as reply text, because those are often the cleaner exfiltration paths.

      Attribution:
    • dmagog #1
    • jetti #1
    • quuxplusone #1
  5. 05

    Public contests mostly attract casual attacks

    A weak result in an open challenge does not imply the system would survive skilled operators. Several commenters argued that good jailbreaks are scarce, expensive, and lose value quickly once disclosed. That means the people most likely to have effective prompts have the least incentive to burn them for a small public bounty. The experiment still tells you something about resistance to common attacks, but not much about exposure to motivated adversaries.

    Use open contests to harvest broad prompt diversity, not to certify security against serious attackers. For higher assurance, budget for private red teaming with incentives that match the value of the exploit.

      Attribution:
    • mpeg #1
    • cuchoi #1
    • saberience #1
  6. 06

    OS sandboxing beats prompt discipline

    One commenter cut to the core systems point: if the secrets file is readable by the agent process, you are betting policy on model behavior. Mandatory access controls like SELinux, AppArmor, or Tomoyo can make that bet unnecessary by preventing the process from reading the file in the first place. The author's own response underscored the same instinct at a higher layer by withholding email-send permissions despite the encouraging result.

    Move critical protections out of the prompt and into the operating system or permission layer. Give agents fewer secrets and fewer outbound rights, then treat model guardrails as a backup, not the primary control.

      Attribution:
    • seethishat #1
    • cuchoi #1

Against the grain

  1. 01

    The post never claimed a universal victory

    Some pushback said the criticism overshoots what was actually written. The author described a personal update in confidence, not a scientific proof that prompt injection is solved. They also noted two important mitigations in the writeup and comments: Gmail filtering was not the main reason for the result, and the contaminated shared-context issue was corrected by reprocessing each email in a fresh context.

    Read this as one operator updating their prior based on a bounded exercise. That still leaves plenty of room to disagree with the threat model without pretending the claim was broader than it was.

      Attribution:
    • ChrisRR #1
    • Ysx #1
    • cuchoi #1
  2. 02

    Raising attacker skill requirements still counts

    A few commenters argued that forcing attacks out of the realm of trivial jailbreaks is meaningful progress, even if it falls far short of robust security. If exploitation now looks more like specialized social engineering research than a cute prompt trick, the cost and rarity of attacks change. That matters for many practical deployments, especially compared with the common online claim that frontier models will spill secrets on demand.

    Do not dismiss incremental hardening just because it is incomplete. If your controls push attacks from casual misuse into specialist tradecraft, that changes your risk model, staffing needs, and monitoring thresholds.

      Attribution:
    • insanitybit #1
    • saberience #1
    • efromvt #1

In plain english

AppArmor
A Linux security module that confines programs with per-application access rules.
Claude Opus 4.6
A high-end language model from Anthropic used here as the agent’s core model.
CVE
Common Vulnerabilities and Exposures, a public catalog entry for a known software security flaw.
denial of wallet
An attack that drives up usage costs, such as API or model-token spending, until the service becomes too expensive to operate.
frontier model
A state-of-the-art artificial intelligence model at the leading edge of capability and cost.
jailbreak
A prompt or interaction pattern that gets an AI model to bypass its safety or behavior restrictions.
mandatory access controls
Operating system rules that enforce resource access limits regardless of what an application tries to do.
OpenClaw
An open source AI agent framework that can connect a language model to tools and external tasks.
prompt injection
A technique where untrusted input is written to manipulate an AI system into ignoring its original instructions or policies.
SELinux
Security-Enhanced Linux, a Linux security system that restricts what processes are allowed to access.
Tomoyo
A Linux security module that controls what files, capabilities, and actions a process may use.

Reference links

Related prompt injection research

  • Role Confusion
    Cited as a strong recent example showing how models still fail under role-confusion prompt injection patterns.
  • OpenClaw Opus 4.8 malicious script writeup
    Referenced as a contrasting experiment where an OpenClaw setup was steered into downloading and executing a malicious script.

Challenge and implementation references

  • Lakera Gandalf baseline
    Brought up as a prior prompt-injection challenge that some commenters felt became impossible by making the model too unhelpful.
  • OpenClaw Ansible setup
    Shared by the author as the deployment basis used for the experiment.
  • HackMyClaw attack log
    Mentioned by commenters who wanted to inspect the actual attack attempts and outcomes.

Broader security analogies and commentary