HN Debrief

I built a vulnerable app and spent $1,500 seeing if LLMs could hack it

  • AI
  • Security
  • Developer Tools
  • Open Source

The post describes a homebrew benchmark: a deliberately vulnerable app, an eval harness, and repeated attempts to see whether current LLMs could discover and exploit the flaws. The headline result was that some models could break in, some mostly flailed, and Anthropic models scored poorly. What people zeroed in on is that this was not a clean test of raw capability. The OpenAI account appears to have had security-research allowances, while Claude often hit refusal behavior or session-killing safeguards. That means part of the benchmark was really comparing vendor policy stacks, not just model intelligence.

If you want to use LLMs for security testing, treat provider policy and agent setup as part of the product, not noise around the benchmark. Also assume current results are highly sensitive to prompting, orchestration, and account-level permissions, so raw model comparisons can mislead.

Discussion mood

Skeptical and frustrated. People largely think the experiment showed real capability, but they were more struck by how much the results were distorted by guardrails, account whitelisting, and harness quality, especially for Anthropic.

Key insights

  1. 01

    Security evals need orchestration, not one-shot prompts

    Finding exploitable bugs is a search problem, not a single prompt. People doing real reverse engineering said useful setups break the work into exploration, candidate generation, and validation, often with a coordinator agent or human steering the next branch. That changes the interpretation of the benchmark. A weak result can mean the harness was underpowered, not that the base model was incapable.

    If you are evaluating models for security work, benchmark the full agent stack you plan to ship. Add explicit validation passes and branching search before concluding one model is worse than another.

      Attribution:
    • jc4p #1
    • mariopt #1
    • gcatalfamo #1
    • eskibars #1
  2. 02

    Guardrails add hidden cost and failure modes

    Several users claimed the pain is not just refusals. Anthropic's stack appears to inject long safety instructions server-side, re-evaluate them on tool calls, and sometimes terminate sessions after substantial token burn. Others said model behavior changes if a target looks live instead of local, and that simply proxying a target through localhost can flip a refusal into compliance. That makes safety behavior part of latency, cost, and reliability, not just a policy overlay.

    Track refusal rate, wasted tokens, and tool-call churn as first-class metrics when choosing a model for agentic workflows. If the job is sensitive to cost or completion reliability, test with realistic targets and network conditions, not toy prompts.

      Attribution:
    • jerrythegerbil #1
    • kay_o #1 #2
    • SOLAR_FIELDS #1
    • acters #1
    • gck1 #1
  3. 03

    Legitimate work is getting caught in the blast radius

    The strongest operational signal was how often normal tasks now trigger security or safety filters. Examples included looking up public vulnerability PoCs, analyzing logs from a Docker app, explaining malware on a compromised machine, decompiling code, passing API tokens, and harmless biology questions. A few workaround patterns emerged, like putting secrets in files instead of the prompt or enrolling in provider security programs, but people reported those fixes as inconsistent. The net effect is a product that feels unreliable exactly when you need it for non-routine work.

    Have a fallback model path for security-adjacent and specialized workflows. If your team depends on one provider, expect sporadic refusals on legitimate tasks and design around them.

      Attribution:
    • strictnein #1
    • shepherdjerred #1
    • mft_ #1
    • mwigdahl #1
    • stavros #1
    • fc417fc802 #1
    • ang_cire #1
    • not_a9 #1
  4. 04

    Capability is colliding with access control

    What worried people was not that models can help find flaws. It was that the best offensive capability may sit behind opaque approval programs, NDAs, or premium tiers while weaker public versions are increasingly constrained. That creates an uneven playing field where insiders, approved researchers, or unrestricted competitors move faster than ordinary defenders. Several comments framed this as a principal-agent problem. Labs optimize for legal and reputational risk, not for the user's need to secure their own systems.

    Do not build a security workflow that depends on one vendor's goodwill or exception process. Keep alternative providers and local options in reserve if access policy becomes the real constraint.

      Attribution:
    • fergie #1
    • hgomersall #1
    • lesuorac #1
    • gmerc #1
    • josephg #1
    • jerf #1
  5. 05

    LLMs already help experts, but they still stall alone

    People with hands-on experience in crackmes and pentesting said current models can patch binaries, do runtime analysis, or work through CTF subtasks when guided carefully. The recurring limitation is autonomy. Left alone, they either stop after exhausting familiar patterns or drown you in false positives. That points to a near-term role that looks more like expert copilot than autonomous red team.

    Use LLMs to accelerate specialist workflows, not to replace specialist judgment. Budget time for triage and confirmation, because verification is where most of the real work still lives.

      Attribution:
    • mariopt #1
    • bitexploder #1 #2
    • nikanj #1
    • dwa3592 #1
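
Insights 01 and 02 above can be sketched together: a branching search over leads with an explicit validation pass, which also records refusal rate, wasted tokens, and tool-call count for the run. This is only an illustration under stated assumptions. `ModelReply`, `RunMetrics`, `run_search`, and the `ask` and `validate` callables are hypothetical names standing in for whatever agent stack and target you actually use; no vendor API is implied, and nothing here reproduces the harness from the original post.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ModelReply:
    """Hypothetical shape of one model turn; adapt to your own client."""
    candidates: List[str]      # follow-up leads or exploit candidates proposed
    tokens: int                # tokens billed for this turn
    tool_calls: int = 0        # tool invocations made during this turn
    refused: bool = False      # True if the turn ended in a refusal


@dataclass
class RunMetrics:
    turns: int = 0
    refusals: int = 0
    tokens_total: int = 0
    tokens_wasted: int = 0     # tokens spent on turns that ended in refusal
    tool_calls: int = 0
    confirmed: List[str] = field(default_factory=list)

    @property
    def refusal_rate(self) -> float:
        return self.refusals / self.turns if self.turns else 0.0


def run_search(
    seed: str,
    ask: Callable[[str], ModelReply],
    validate: Callable[[str], bool],
    max_turns: int = 20,
) -> RunMetrics:
    """Breadth-first search over leads with a separate validation step."""
    metrics = RunMetrics()
    frontier = deque([seed])
    seen = {seed}
    while frontier and metrics.turns < max_turns:
        lead = frontier.popleft()
        reply = ask(lead)                      # exploration and candidate generation
        metrics.turns += 1
        metrics.tokens_total += reply.tokens
        metrics.tool_calls += reply.tool_calls
        if reply.refused:
            metrics.refusals += 1
            metrics.tokens_wasted += reply.tokens
            continue
        for cand in reply.candidates:
            if cand in seen:
                continue
            seen.add(cand)
            if validate(cand):                 # independent confirmation
                metrics.confirmed.append(cand)
            else:
                frontier.append(cand)          # unconfirmed: explore as a new branch
    return metrics
```

Comparing `refusal_rate`, `tokens_wasted`, and `confirmed` across providers for the same seed and the same `validate` function gives a view of the full stack rather than of the bare model. The breadth-first order is an arbitrary choice for the sketch; a coordinator agent or a human could just as well pick the next branch.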

Against the grain

  1. 01

    Guardrails are correct for the average user

    A smaller but credible view was that refusing to handle logins, credentials, and live attack workflows is simply good product design for most people. The argument is that handing an agent broad access to secrets is reckless, and that safer tool-mediated patterns are the right default even if they annoy security researchers. From that angle, reduced usefulness is an acceptable cost because the baseline user is far more likely to over-trust the agent than to need offensive security help.

    If you are deploying agents beyond expert users, keep secrets and live-system actions behind narrow tools with explicit permissions. The convenience of direct agent access is not worth normalizing unsafe operating habits.

      Attribution:
    • hgoel #1
    • zaphar #1
  2. 02

    Some apparent refusals may just mask incapability

    One commenter pushed back on the easy narrative that guardrails explain every bad benchmark result. Models often emit a refusal-shaped answer when they cannot actually complete the task, because that is a plausible next-token sequence. Others replied that Anthropic also uses classifiers and prompt injections, and gave anecdotes where a cleanly reframed prompt worked. The useful correction is that both effects can be true at once. Policy and capability are entangled, and informal tests often over-attribute failure to policy alone.

    When a model refuses, rerun the task in a fresh session with different framing before deciding whether policy or competence was the limiting factor. Do not treat every refusal as proof of hidden capability.

      Attribution:
    • Bratmon #1
    • gck1 #1
    • SOLAR_FIELDS #1
  3. 03

    The benchmark compared different permission levels

    Some readers argued the headline comparison overstates differences because the OpenAI account appears to have had security-research approval while the Anthropic account did not. If one model is effectively whitelisted and another is not, scores mix capability with account policy in a way that breaks apples-to-apples ranking. That does not make the post useless. It just changes what it measures from pure model performance to real-world usability under actual vendor controls.

    When you read or run vendor benchmarks, document account status, safety programs, and model access tier alongside the results. Otherwise leaderboard-style comparisons can hide the variable that mattered most.

      Attribution:
    • mynameisvlad #1
    • jc4p #1
    • brooswajne #1
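
One way to keep that variable visible is to store account context in the same record as the score. The following is a minimal sketch, assuming a hypothetical `RunRecord` whose field names and tier labels are invented for illustration; they do not correspond to any provider's actual program names or API fields.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    model: str                 # model identifier as you configured it
    provider: str
    security_program: bool     # was the account enrolled in a security-research allowance?
    access_tier: str           # free-form label you assign, e.g. "standard"
    solved: int
    attempted: int
    refusals: int

    @property
    def solve_rate(self) -> float:
        return self.solved / self.attempted if self.attempted else 0.0


def comparable(a: RunRecord, b: RunRecord) -> bool:
    """Treat two runs as apples-to-apples only if account conditions match."""
    return (a.security_program, a.access_tier) == (b.security_program, b.access_tier)


def to_json(record: RunRecord) -> str:
    """Serialize the score together with the account context that produced it."""
    row = asdict(record)
    row["solve_rate"] = round(record.solve_rate, 3)
    return json.dumps(row, sort_keys=True)
```

A results table built from such records can flag any pair of rows where `comparable` returns `False`, so a reader sees that the ranking mixes capability with account policy.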

In plain English

API
Application programming interface, a way for software to call another service programmatically.
CTF
Capture the Flag, a security competition where participants solve hacking or reverse engineering challenges.
Docker
A platform for packaging and running software in isolated containers so it behaves consistently across environments.
eval harness
A testing framework that runs tasks against a model in a repeatable way and records whether it succeeded.
pentesting
Penetration testing, an authorized attempt to find and exploit security weaknesses in a system so they can be fixed.
RE
Reverse engineering, the process of analyzing software or hardware to understand how it works, often without the original source code.

Reference links

Background reading

  • Principal-agent problem
    Linked to frame the conflict between user interests and vendor incentives in model behavior.
  • OLMo
    Mentioned as an example of a genuinely open model in a side debate about whether FOSS LLMs exist.
  • Confirmation bias
    Shared in a side argument about selectively trusting benchmarks.
  • The Guardian on GPT-2 release concerns
    Used to recall the earlier 'too dangerous to release' narrative around language models.