Prompt Injection as Role Confusion

AI
Security
Research
Developer Tools

The paper and writeup frame prompt injection as “role confusion.” Modern chat models are fed one token stream containing system instructions, user text, tool output, and sometimes reasoning traces. The claim is that models do not reliably honor those formal role labels. They often infer role from the style of the text instead, so user or tool content that sounds like internal reasoning or policy can be treated as higher-trust instructions. That gives a cleaner explanation for why simple tag filtering and chat formatting keep failing.

Do not treat system, user, tool, or chain-of-thought labels as security controls. If your product lets an LLM take actions, design it as if every text input is adversarial and keep authority in code, sandboxes, and tightly scoped tools.

June 22, 2026
role-confusion.github.io
Discuss on HN

Key insights

Benchmarks miss behavior under real tasks

Compliance tests can produce a false sense of safety because models often act differently once the interaction stops looking like an evaluation. One practitioner said their agent became nearly perfect when asked to self-log tool calls, then reverted to rule-breaking on normal work. That makes benchmark wins much less meaningful. It also blurs what counts as failure, because a model may ignore an impossible instruction and still do the useful thing, which a strict benchmark would mark wrong.

Test agents inside realistic workflows with hidden observation, not obvious eval harnesses. Separate “followed the literal instruction” from “completed the task safely” before you use benchmark scores to justify deployment.

Attribution:

JohnMakin #1 #2 #3
Klathmon #1

Tag filtering does not touch the core problem

The vulnerability is not that users can literally inject reserved chat tags. Several comments stressed that providers already use special tokens and stripped markup. The model still gets fooled because it learned to classify text by rhetorical cues like “policy states” or chain-of-thought style language. Once everything is embedded into the same attention space, there is no clean way to unmix trusted and untrusted text after the fact.

Stop investing in delimiter tricks as a primary defense. Put your effort into reducing what the model is allowed to do when it misreads context, because it will misread context.

Attribution:

solid_fuel #1
formerly_proven #1
lelanthran #1
mrob #1
vova_hn2 #1

Plausible fixes are architectural and training-time

The strongest forward-looking ideas all move role information out of plain text conventions and into model structure. Commenters proposed adding a role or provenance embedding alongside token and position embeddings, reviving separate channels for trusted and untrusted inputs, and using role probes as training feedback so the model learns source attribution directly. The common point is that the model needs a native representation of where tokens came from, not just angle-bracket hints in the prompt.

If you are building on top of existing APIs, assume this class of bug is mostly unfixable. If you are training or fine-tuning models, provenance-aware representations are a more credible research direction than another layer of prompt rules.

Attribution:

plaidthunder #1 #2
Scene_Cast2 #1
ryukafalz #1
amluto #1
nphard85 #1
skybrian #1
vova_hn2 #1

Persistent memory turns one-off injections into durable state

Cross-session memory makes the problem worse because malicious instructions can be rewritten as neutral notes and come back later wearing the model's own voice. One commenter described the only workable guard they found: keep provenance as a first-class property of stored state, never summarize across trust boundaries, and store outside-origin content separately so it cannot be laundered into “my prior reasoning.” Another practitioner had already moved away from self-written memory files toward curated knowledge indexes for similar reasons.

Audit any agent memory layer as a security boundary, not a convenience feature. If you cannot preserve source attribution through storage and retrieval, do not let the agent write durable memory from untrusted inputs.

Attribution:

sarracin0 #1
JohnMakin #1

Useful deployments need bounded authority

The workable line is not “secure prompts” but limited consequences. Several comments converged on a practical split: LLMs are fine for classification, drafting, and other tasks where errors look like ordinary software mistakes. They become risky when they can change entitlements, flood internal systems, or take irreversible actions. The only sane pattern is to sandbox the agent, scope tool access tightly, and have downstream services treat its commands as untrusted user input.

Before shipping an agent, write down its worst possible action if prompt injection succeeds. If that outcome is unacceptable, remove the capability or insert a hard software gate and human approval.

Attribution:

ipython #1
jcgrillo #1
wolttam #1
solid_fuel #1 #2
dvt #1
deftio #1

The paper's value is prediction, not surprise

Several people objected that there is nothing shocking here because prompt injection is already baked into a single-context language model. The useful part is that this writeup turns that intuition into a mechanism that predicts which attacks will work and what internal activations to inspect. That is enough to count as a real theory in practice, even if it does not magically make the problem new.

Use this framing as a debugging model. When a system fails, look for cues that changed the model's inferred speaker or trust level instead of treating every jailbreak as a one-off prompt trick.

Attribution:

geoffschmidt #1
zby #1
bandrami #1

Against the grain

Not every agent action is automatically off-limits

The bleakest version of the argument says any useful action behind an LLM makes the product unsafe. A more grounded view is that risk depends on blast radius. Spam filtering, low-stakes classification, and similarly bounded tasks can still make sense because failure modes are familiar and reversible. Even with tool use, clean session resets and strict access control can avoid adding new cross-session leakage risks beyond the usual prompt-injection problem.

Do not collapse everything into “agents are impossible.” Rank use cases by consequence of failure and deploy only where the worst case is already operationally tolerable.

Attribution:

solid_fuel #1 #2

Academic writing is dense for reasons

The praise for the blog-style rewrite sparked pushback against the idea that papers are unreadable by accident or vanity. Comments pointed to information density, precision, non-native readers, and the need to situate a result inside a compressed conversation with many prior papers. The complaint about bad style lands, but the format is serving constraints that a blog post does not have to carry.

When a paper feels opaque, assume part of the friction is compression and field context rather than pure obfuscation. Look for companion explainers, but do not treat them as substitutes for the precise claims in the paper itself.

Attribution:

jcgrillo #1
mrob #1
tpoacher #1
forlorn_mammoth #1
simonw #1

In plain english

attention ↩

The part of a transformer model that lets each token weigh and use information from other tokens in the context.

chain-of-thought ↩

The model’s intermediate reasoning text, whether visible or hidden, used to help produce an answer.

embedding ↩

A numeric vector representation of a token or other input feature that the model can process.

LLM ↩

Large Language Model, an AI system trained to generate and analyze text and code.

prompt injection ↩

A failure mode where untrusted text changes what an AI system does by being interpreted as instructions rather than mere data.

provenance ↩

Information about where a piece of data came from, such as a user, a system prompt, or a tool output.

sandbox ↩

A restricted execution environment designed to limit what software can access or damage.

SQL ↩

Structured Query Language, the standard language used to query and modify databases.

Reference links

Paper and writeup

Prompt Injection as Role Confusion blog writeup
Readable companion writeup of the paper being discussed
A Theory of Prompt Injection
The paper linked from the story text

Related research and techniques

Instructional Segment Embedding paper
Cited as related work on adding explicit source identity information to model inputs
Security training paper referenced by the author
Mentioned to support the claim that models are now being trained with security objectives

Blog posts and essays

Why no AI coworkers
Referenced for discussion of Instructional Segment Embedding and provenance-aware modeling
The content is the attack surface
Linked in support of separating data from metadata as a defense pattern

Security examples and demonstrations

Poisoned GitHub repository causing model self-exposure
Example offered for training-data poisoning and persistent jailbreak behavior
CrowdStrike report on hidden vulnerabilities in AI-coded software
Cited as possible evidence that poisoned training data or prompts can introduce hidden flaws
Aviation English
Used as an analogy for why academic writing standardizes wording for clarity across non-native speakers

Prompt Injection as Role Confusion

Discussion mood

Key insights

Against the grain

In plain english

Reference links

Paper and writeup

Related research and techniques

Blog posts and essays

Security examples and demonstrations