I built a vulnerable app and spent $1,500 seeing if LLMs could hack it

AI
Security
Developer Tools
Open Source

The post describes a homebrew benchmark: a deliberately vulnerable app, an eval harness, and repeated attempts to see whether current LLMs could discover and exploit the flaws. The headline result was that some models could break in, some mostly flailed, and Anthropic models scored poorly. What people zeroed in on is that this was not a clean test of raw capability. OpenAI appears to have had security-research allowances on the account, while Claude often hit refusal behavior or session-killing safeguards. That means part of the benchmark was really comparing vendor policy stacks, not just model intelligence.

The more useful conclusion is that offensive security with LLMs is now very real, but still heavily shaped by harness design and human guidance. Several practitioners said one-shot "find the bug" prompting is the wrong frame. Better results come from multi-step workflows that decompose search, validation, and exploitation, often with humans steering exploration or a coordinator agent managing retries and confirmation. Without that structure, models hallucinate vulnerabilities, exhaust obvious attack paths from training, or get stuck. People who have used models for crackmes, reverse engineering, and pentesting said current systems are most effective as force multipliers for experts, not autonomous auditors.

A second major theme was frustration with guardrails, especially from Claude. Multiple people reported refusals on benign biology, log analysis, malware explanation, decompilation, game overlays, forking MIT-licensed code, and even retrieving their own local documents. The complaint was not simply that safety exists. It was that the current implementation is broad, inconsistent, and costly. Users described hidden server-side prompt injections, extra tool-call churn, session terminations without refunds, and brittle behavior that changes depending on wording, account history, or whether the target looks local versus live. The practical effect is that legitimate defensive work gets harder, while determined users route around the blocks with prompt reframing, local proxies, clean sessions, other vendors, or open-weight models.

The dominant mood was that the bottleneck is shifting from model capability to who is allowed to access it and under what conditions. Some saw that as a safety tradeoff worth making for average users who should not hand agents secrets or let them attack live systems. But most of the high-signal commentary landed on a sharper point: if leading labs over-constrain their best models, security professionals and serious builders will migrate to providers that are cheaper, less restrictive, or easier to run locally. In that world, guardrails become less a meaningful barrier to abuse and more a tax on legitimate use.

If you want to use LLMs for security testing, treat provider policy and agent setup as part of the product, not noise around the benchmark. Also assume current results are highly sensitive to prompting, orchestration, and account-level permissions, so raw model comparisons can mislead.

June 4, 2026
kasra.blog

Key insights

Security evals need orchestration, not one-shot prompts

Finding exploitable bugs is a search problem, not a single prompt. People doing real reverse engineering said useful setups break the work into exploration, candidate generation, and validation, often with a coordinator agent or human steering the next branch. That changes the interpretation of the benchmark. A weak result can mean the harness was underpowered, not that the base model was incapable.

If you are evaluating models for security work, benchmark the full agent stack you plan to ship. Add explicit validation passes and branching search before concluding one model is worse than another.
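
As a rough illustration of the explore, propose, and validate split described above, here is a minimal sketch of a coordinator loop. Everything in it is hypothetical: `run_search`, `Finding`, and the toy `SITE` map are invented for this example, and the three callbacks stand in for model or human steps. It is not the harness used in the post.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Finding:
    location: str
    hypothesis: str
    confirmed: bool = False


def run_search(
    explore: Callable[[str], Iterable[str]],
    propose: Callable[[str], Iterable[Finding]],
    validate: Callable[[Finding], bool],
    entry_point: str,
    max_steps: int = 50,
) -> list[Finding]:
    """Breadth-first search over targets; only validated findings are returned."""
    frontier = [entry_point]
    seen = {entry_point}
    confirmed: list[Finding] = []
    steps = 0
    while frontier and steps < max_steps:
        target = frontier.pop(0)
        steps += 1
        # Candidate generation: cheap and allowed to be wrong.
        for finding in propose(target):
            # Validation: a separate pass that must succeed before reporting.
            if validate(finding):
                finding.confirmed = True
                confirmed.append(finding)
        # Exploration: queue any branches not visited yet.
        for nxt in explore(target):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return confirmed


# Toy stand-ins for model calls, so the sketch runs without any provider.
SITE = {"/": ["/login", "/search"], "/login": [], "/search": ["/search/export"]}


def explore(target: str) -> list[str]:
    return SITE.get(target, [])


def propose(target: str) -> list[Finding]:
    # A real model would produce some wrong guesses; so does this stub.
    guess = "sql injection" if "search" in target else "weak auth"
    return [Finding(target, guess)]


def validate(finding: Finding) -> bool:
    # Pretend only one candidate survives an actual reproduction attempt.
    return finding.location == "/search" and finding.hypothesis == "sql injection"


results = run_search(explore, propose, validate, "/")
```

In this toy run the proposer raises a candidate at every page, but only the one that passes validation is returned. That is the property the commenters were asking benchmarks to include before ranking models.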

Attribution:

jc4p #1
mariopt #1
gcatalfamo #1
eskibars #1

Guardrails add hidden cost and failure modes

Several users claimed the pain is not just refusals. Anthropic's stack appears to inject long safety instructions server side, re-evaluate them on tool calls, and sometimes terminate sessions after substantial token burn. Others said model behavior changes if a target looks live instead of local, and that simply proxying a target through localhost can flip a refusal into compliance. That makes safety behavior part of latency, cost, and reliability, not just a policy overlay.

Track refusal rate, wasted tokens, and tool-call churn as first-class metrics when choosing a model for agentic workflows. If the job is sensitive to cost or completion reliability, test with realistic targets and network conditions, not toy prompts.
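
One way to make those metrics concrete is sketched below, assuming each agent run is logged with a completion flag, a refusal flag, token usage, and a tool-call count. The `RunRecord` fields and the sample numbers are illustrative assumptions, not data from the post.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    completed: bool   # the task finished successfully
    refused: bool     # the run ended in a refusal or a safeguard termination
    tokens_used: int
    tool_calls: int


def summarize(runs: list[RunRecord]) -> dict:
    """Aggregate refusal rate, wasted-token share, and tool calls per completion."""
    if not runs:
        raise ValueError("no runs to summarize")
    total_tokens = sum(r.tokens_used for r in runs)
    # Tokens spent on runs that never completed count as wasted.
    wasted_tokens = sum(r.tokens_used for r in runs if not r.completed)
    completions = sum(1 for r in runs if r.completed)
    total_tool_calls = sum(r.tool_calls for r in runs)
    return {
        "refusal_rate": sum(1 for r in runs if r.refused) / len(runs),
        "wasted_token_share": wasted_tokens / total_tokens if total_tokens else 0.0,
        "tool_calls_per_completion": (
            total_tool_calls / completions if completions else None
        ),
    }


# Made-up sample runs, only to show the shape of the output.
sample = [
    RunRecord(completed=True, refused=False, tokens_used=1000, tool_calls=4),
    RunRecord(completed=False, refused=True, tokens_used=3000, tool_calls=12),
    RunRecord(completed=True, refused=False, tokens_used=2000, tool_calls=8),
    RunRecord(completed=False, refused=False, tokens_used=2000, tool_calls=6),
]
stats = summarize(sample)
```

Counting all tool calls against completed runs is a deliberate choice here: churn on failed or terminated sessions is still paid for, so it shows up in the cost of each success.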

Attribution:

jerrythegerbil #1
kay_o #1 #2
SOLAR_FIELDS #1
acters #1
gck1 #1

Legitimate work is getting caught in the blast radius

The strongest operational signal was how often normal tasks now trigger security or safety filters. Examples included looking up public vulnerability PoCs, analyzing logs from a Docker app, explaining malware on a compromised machine, decompiling code, passing API tokens, and harmless biology questions. A few workaround patterns emerged, like putting secrets in files instead of the prompt or enrolling in provider security programs, but people reported those fixes as inconsistent. The net effect is a product that feels unreliable exactly when you need it for non-routine work.

Have a fallback model path for security-adjacent and specialized workflows. If your team depends on one provider, expect sporadic refusals on legitimate tasks and design around them.
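
A fallback path can be as simple as an ordered list of providers tried in turn. The sketch below assumes refusals are surfaced as a `Refused` exception, which is an invented convention for this example; real providers signal refusals in different ways, and detecting them reliably is its own problem. The provider functions are stubs, not real clients.

```python
from typing import Callable, Sequence


class Refused(Exception):
    """Raised by a provider wrapper when the model declines the task (assumed)."""


def run_with_fallback(
    task: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, answer) from the first that answers."""
    refusals = []
    for name, call in providers:
        try:
            return name, call(task)
        except Refused as exc:
            refusals.append(f"{name}: {exc}")
    raise RuntimeError("all providers refused: " + "; ".join(refusals))


# Stub providers so the sketch runs offline.
def hosted_model(task: str) -> str:
    if "malware" in task:
        raise Refused("blocked by safety filter")
    return f"hosted answer to: {task}"


def local_model(task: str) -> str:
    return f"local answer to: {task}"


providers = [("hosted", hosted_model), ("local", local_model)]
used, answer = run_with_fallback("explain this malware sample", providers)
```

Returning the provider name alongside the answer keeps a record of how often the fallback actually fires, which feeds back into the choice of primary provider.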

Attribution:

strictnein #1
shepherdjerred #1
mft_ #1
mwigdahl #1
stavros #1
fc417fc802 #1
ang_cire #1
not_a9 #1

Capability is colliding with access control

What worried people was not that models can help find flaws. It was that the best offensive capability may sit behind opaque approval programs, NDAs, or premium tiers while weaker public versions are increasingly constrained. That creates an uneven playing field where insiders, approved researchers, or unrestricted competitors move faster than ordinary defenders. Several comments framed this as a principal-agent problem. Labs optimize for legal and reputational risk, not for the user's need to secure their own systems.

Do not build a security workflow that depends on one vendor's goodwill or exception process. Keep alternative providers and local options in reserve if access policy becomes the real constraint.

Attribution:

fergie #1
hgomersall #1
lesuorac #1
gmerc #1
josephg #1
jerf #1

LLMs already help experts, but they still stall alone

People with hands-on experience in crackmes and pentesting said current models can patch binaries, do runtime analysis, or work through CTF subtasks when guided carefully. The recurring limitation is autonomy. Left alone, they either stop after exhausting familiar patterns or drown you in false positives. That points to a near-term role that looks more like expert copilot than autonomous red team.

Use LLMs to accelerate specialist workflows, not to replace specialist judgment. Budget time for triage and confirmation, because verification is where most of the real work still lives.

Attribution:

mariopt #1
bitexploder #1 #2
nikanj #1
dwa3592 #1

Against the grain

Guardrails are correct for the average user

A smaller but credible view was that refusing to handle logins, credentials, and live attack workflows is simply good product design for most people. The argument is that handing an agent broad access to secrets is reckless, and that safer tool-mediated patterns are the right default even if they annoy security researchers. From that angle, reduced usefulness is an acceptable cost because the baseline user is far more likely to over-trust the agent than to need offensive security help.

If you are deploying agents beyond expert users, keep secrets and live-system actions behind narrow tools with explicit permissions. The convenience of direct agent access is not worth normalizing unsafe operating habits.
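
The "narrow tools with explicit permissions" pattern might look something like the following sketch. The agent calls a tool that holds the secret and checks an allowlist, and the secret itself is never part of what comes back. `ScopedTool`, `ToolPolicy`, the host name, and the token are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolPolicy:
    allowed_actions: frozenset
    allowed_hosts: frozenset


class ScopedTool:
    """Holds the secret itself; the agent only ever sees action results."""

    def __init__(self, secrets: dict, policy: ToolPolicy):
        self._secrets = dict(secrets)
        self._policy = policy
        self.audit_log = []

    def call(self, action: str, host: str) -> str:
        allowed = (
            action in self._policy.allowed_actions
            and host in self._policy.allowed_hosts
        )
        # Every attempt is logged, including denied ones.
        self.audit_log.append((action, host, allowed))
        if not allowed:
            raise PermissionError(f"{action} on {host} is outside the granted scope")
        _token = self._secrets[host]  # a real tool would authenticate with this
        return f"{action} on {host}: ok"  # the token is never returned


tool = ScopedTool(
    secrets={"staging.example.internal": "s3cr3t-token"},
    policy=ToolPolicy(
        allowed_actions=frozenset({"read_logs"}),
        allowed_hosts=frozenset({"staging.example.internal"}),
    ),
)
result = tool.call("read_logs", "staging.example.internal")
```

Anything outside the granted action and host set raises `PermissionError` and still lands in the audit log, so a human can review what the agent attempted.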

Attribution:

hgoel #1
zaphar #1

Some apparent refusals may just mask incapability

One commenter pushed back on the easy narrative that guardrails explain every bad benchmark result. Models often emit a refusal-shaped answer when they cannot actually complete the task, because that is a plausible next token sequence. Others replied that Anthropic also uses classifiers and prompt injections, and gave anecdotes where a cleanly reframed prompt worked. The useful correction is that both effects can be true at once. Policy and capability are entangled, and informal tests often over-attribute failure to policy alone.

When a model refuses, rerun the task in a fresh session with different framing before deciding whether policy or competence was the limiting factor. Do not treat every refusal as proof of hidden capability.
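
A small sketch of that rerun procedure follows. It assumes a caller-supplied `fresh_session` function that starts a clean session, sends one prompt, and labels the outcome as "success", "refusal", or "failure". That function, the labels, and the framings are illustrative assumptions, and the verdict strings are heuristics rather than proof.

```python
from collections import Counter
from typing import Callable, Sequence


def diagnose(
    task: str,
    framings: Sequence[str],
    fresh_session: Callable[[str], str],
) -> str:
    """Rerun one task under several framings, each in a clean session."""
    outcomes = Counter(fresh_session(f.format(task=task)) for f in framings)
    if outcomes["success"] and outcomes["refusal"]:
        return "framing-sensitive: policy is probably involved"
    if outcomes["success"]:
        return "capable under these framings"
    if outcomes["refusal"] == len(framings):
        return "refused every time: policy or incapability, cannot tell"
    return "attempted but failed: more likely a capability limit"


# Stub session: refuses whenever the prompt contains the word "exploit".
def fresh_session(prompt: str) -> str:
    return "refusal" if "exploit" in prompt else "success"


framings = [
    "Write an exploit for {task}.",
    "I own this test app. Explain how to verify {task} safely.",
]
verdict = diagnose("the login form bug", framings, fresh_session)
```

Note that a consistent refusal is reported as undetermined. As the commenter pointed out, a refusal-shaped answer can come from either policy or incapability, so the sketch does not pretend to separate them.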

Attribution:

Bratmon #1
gck1 #1
SOLAR_FIELDS #1

The benchmark compared different permission levels

Some readers argued the headline comparison overstates differences because the OpenAI account appears to have had security-research approval while Anthropic did not. If one model is effectively whitelisted and another is not, scores mix capability with account policy in a way that breaks apples-to-apples ranking. That does not make the post useless. It just changes what it measures from pure model performance to real-world usability under actual vendor controls.

When you read or run vendor benchmarks, document account status, safety programs, and model access tier alongside the results. Otherwise leaderboard-style comparisons can hide the variable that mattered most.
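
One lightweight way to do this is to store the access context in the same record as the score. The sketch below uses a hypothetical `BenchmarkRun` record and placeholder values. None of the names or numbers come from the post.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class BenchmarkRun:
    model: str
    provider: str
    access_tier: str                 # e.g. "standard" or "enterprise"
    security_program_enrolled: bool  # account approved for security research
    harness: str
    solved: int
    attempted: int


def comparable(a: BenchmarkRun, b: BenchmarkRun) -> bool:
    """Two runs are apples-to-apples only if access conditions and harness match."""
    return (
        a.security_program_enrolled == b.security_program_enrolled
        and a.access_tier == b.access_tier
        and a.harness == b.harness
    )


# Placeholder runs; the scores are invented for illustration.
run_a = BenchmarkRun("model-a", "vendor-a", "standard", True, "harness-v1", 7, 10)
run_b = BenchmarkRun("model-b", "vendor-b", "standard", False, "harness-v1", 2, 10)

report = json.dumps([asdict(run_a), asdict(run_b)], indent=2)
same_conditions = comparable(run_a, run_b)
```

Here `same_conditions` is false because only one account is enrolled in a security program. The scores can still be reported, but as usability under vendor controls rather than as a ranking of raw capability.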

Attribution:

mynameisvlad #1
jc4p #1
brooswajne #1

In plain English

API

Application Programming Interface, a way for software to access a service or model over the network under the provider's control.

CTF

Capture the Flag, a type of competitive cybersecurity exercise used to test hacking and defense skills.

Docker

A platform for packaging software into portable containers so it runs consistently across different machines.

eval harness

A testing framework that runs tasks against a model in a repeatable way and records whether it succeeded.

pentesting

Penetration testing, an authorized attempt to find and exploit security weaknesses in a system so they can be fixed.

RE

Reverse engineering, analyzing software or hardware to understand how it works internally.

Reference links

Security access and guardrails

Claude real-time cyber safeguards
Referenced as Anthropic's security verification program that can reduce some cyber-related blocks.
ChatGPT cyber verification
Linked as OpenAI's application flow for security-research access.
Normalization of deviance
Used to argue that guardrails are like factory safety controls and help reduce incidents from unreliable agents.

Benchmarks and vulnerable apps

OWASP Vulnerable Web Applications Directory
Suggested as a related corpus of deliberately vulnerable applications for benchmarking.
Awesome Vulnerable Applications
Offered as a collection of intentionally vulnerable apps useful for testing scanners or agents.
OWASP VulnerableApp
Named as a modular vulnerable application designed for validating security scanners and experiments.

Model rankings and comparison tools

OpenRouter rankings
Cited as one of several leaderboards showing competitive benchmark scores for lesser-known models.
Arena AI coding leaderboard
Shared as another benchmark source for comparing flagship coding models.
Artificial Analysis
Included as a third-party benchmark and comparison site for model performance.

Reverse engineering and model releases

Crackmes challenge example
Used as an example reverse engineering task that one commenter said GLM 5.1 could solve with guidance.
MoonshotAI K2 Vendor Verifier
Referenced in a side discussion about whether direct providers are the right way to evaluate model APIs.
0avx anticheat article
Linked as an example of web search results that may have triggered Anthropic guardrails.

Background reading

Principal-agent problem
Linked to frame the conflict between user interests and vendor incentives in model behavior.
OLMo
Mentioned as an example of a genuinely open model in a side debate about whether FOSS LLMs exist.
Confirmation bias
Shared in a side argument about selectively trusting benchmarks.
The Guardian on GPT-2 release concerns
Used to recall the earlier 'too dangerous to release' narrative around language models.
