How LLMs work

AI
Machine Learning
Developer Tools
Open Source

The post is an end-to-end explainer of how an LLM turns text into tokens, embeddings, attention, feed-forward layers, logits, and sampled output. It aims at readers who want an intuitive tour of transformers without reading papers. The strongest reaction was that the basic decoder-only transformer really is surprisingly simple relative to its output. Several people said the shock of learning GPT-style models is realizing how much of the breakthrough came from scaling a compact architecture with huge compute, huge datasets, and a lot of empirical tuning. That framed the article as a decent on-ramp to the old insight that methods that scale with computation keep beating hand-built cleverness.

Where people tightened the story was around what actually moved the frontier. The architecture has changed less than outsiders assume. Open-weight models and papers from DeepSeek and others suggest the big gains still come mostly from better training, post-training, reinforcement learning, data curation, context-handling tricks, and system engineering, not from some secret wholly different model class. Mixture-of-experts, attention variants, and long-context tweaks matter, but mostly as efficiency levers that let labs train or serve larger systems under the same budget. Several commenters also pushed beyond the article’s focus on the model internals and said the commercially important jump came from tool use, ReAct-style loops, and agent harnesses that let models fetch fresh data and act in external systems.

A separate thread drilled into the common line that LLMs “just predict the next token.” The consensus landed on a more precise version: that description is mechanically true but explanatorily weak. It tells you the training objective and generation loop, not why transformers generalize so much better than simpler statistical models, nor why prompt structure, chain-of-thought, and reinforcement learning can unlock much better behavior. One useful addition was the path-dependence of autoregressive generation. Because the model writes left to right and cannot revise earlier tokens inside a single pass, it tends to preserve local coherence and can double down on early mistakes. That is one reason reasoning models, hidden scratchpads, and extra test-time compute help so much.

The article itself took some hits. Multiple readers thought the prose looked AI-polished or poorly edited, and a more substantive complaint said its explanation of RoPE positional encoding was wrong or at least badly ordered. Others said that explaining transformers is not the same thing as explaining modern LLM behavior, because the hard parts now include training pipelines, distributed systems, inference optimization, and post-training. The mood was still broadly engaged and positive about understanding the field, but with impatience for glib explainers that stop at architecture diagrams or flatten everything into hype.

If you need your team to understand LLMs, use this as a starting sketch, not a source of truth. The practical edge is no longer in memorizing transformer blocks, but in understanding training, data quality, inference economics, and agent tooling around the base model.

June 6, 2026
0xkato.xyz

Discussion mood

Interested but skeptical. People were glad to see an accessible explainer, yet many were irritated by AI-sounding prose, loose structure, and at least one technical error, and they kept steering the conversation toward the harder reality that data, training, systems engineering, and agent scaffolding matter more than a neat transformer diagram.

Key insights

Most frontier gains are training and efficiency

Modern labs appear to be getting more mileage from training recipes, data quality, reinforcement learning, and compute efficiency than from swapping out the core transformer. Open-weight models like DeepSeek are close enough to the frontier that they act as a reality check on claims of secret architectural revolutions. Efficiency tricks such as mixture of experts matter because they buy more model or more context for the same budget, which translates into capability in practice.

Do not anchor your strategy on discovering a totally new base architecture before incumbents move again. Put more attention on data pipelines, post-training, serving cost, and what efficiency improvements let you afford at fixed spend.
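
To make the mixture-of-experts efficiency point concrete, here is a minimal sketch of top-k routing for a single token. The router weights, the four toy experts, and the names `moe_layer` and `make_expert` are illustrative assumptions, not any lab's implementation; production systems add load balancing and batched routing on top of this idea.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, router_weights, experts, k=2):
    """Route one token vector x to its top-k experts and mix their outputs."""
    # One router score per expert: dot product of x with that expert's router row.
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    # Keep only the k highest-scoring experts; the others are never evaluated.
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    gates = softmax([scores[i] for i in top])
    out = [0.0] * len(x)
    for gate, i in zip(gates, top):
        y = experts[i](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, top

# Four hypothetical experts, each a trivial elementwise scaling.
calls = []  # records which experts actually ran

def make_expert(idx, scale):
    def expert(x):
        calls.append(idx)
        return [scale * xi for xi in x]
    return expert

experts = [make_expert(i, s) for i, s in enumerate([1.0, 2.0, 3.0, 4.0])]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
output, chosen = moe_layer([2.0, 1.0], router_weights, experts, k=2)
```

Only two of the four experts run for this token, so compute per token stays roughly flat while total parameter count grows with the number of experts. That is the sense in which the trick buys more model for the same budget.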

Attribution:

HarHarVeryFunny #1
gobdovan #1
jmalicki #1
fizx #1

Agent harnesses create the product jump

Tool calling and ReAct-style loops are what turn a text model into a system that can inspect current state, gather missing information, and trigger actions. That is the step from clever chatbot to something that can handle open-ended tasks in production. The model alone supplies language and priors. The surrounding harness is what makes it operational.

When evaluating LLM products, inspect the tool layer and execution loop, not just the base model name. Most defensible product value sits in orchestration, permissions, and workflow integration.
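
The execution loop described above can be sketched in a few lines. Everything here is hypothetical: `fake_model` stands in for a real LLM call, `get_stock` is an invented tool, and the dictionary-based step format is just one possible convention, not a standard protocol.

```python
def fake_model(transcript):
    """Stand-in for an LLM call (hypothetical): picks the next step from the transcript."""
    observations = [s for s in transcript if s["type"] == "observation"]
    if not observations:
        # No data yet, so ask the harness to call a tool.
        return {"type": "action", "tool": "get_stock", "input": "widget"}
    count = observations[-1]["content"]
    return {"type": "final", "text": f"There are {count} widgets in stock."}

# Hypothetical tool registry: the harness, not the model, executes these.
TOOLS = {
    "get_stock": lambda item: {"widget": 7}.get(item, 0),
}

def run_agent(question, model, tools, max_steps=5):
    """Alternate model steps and tool calls until the model gives a final answer."""
    transcript = [{"type": "question", "content": question}]
    for _ in range(max_steps):
        step = model(transcript)
        if step["type"] == "final":
            return step["text"], transcript
        tool = tools.get(step["tool"])
        if tool is None:
            observation = f"error: unknown tool {step['tool']!r}"
        else:
            observation = tool(step["input"])
        # Feed the action and its result back so the next model step can use them.
        transcript.append(step)
        transcript.append({"type": "observation", "content": observation})
    return None, transcript  # step budget exhausted without a final answer

answer, transcript = run_agent("How many widgets are in stock?", fake_model, TOOLS)
```

Note how much of the behavior lives outside the model: the tool registry, the step budget, and the error path are all harness decisions, which is where permissions and workflow integration would also be enforced.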

Attribution:

locknitpicker #1 #2
galaxyLogic #1

Autoregressive generation is path dependent

Left-to-right generation forces the model to commit as it goes. Once it emits a bad premise, it often keeps writing in a way that preserves coherence rather than cleanly retracting the mistake. Hidden reasoning traces and extra test-time compute help because they give the model room to explore before committing. This also explains why prompt format used to swing performance so hard.

Design prompts and product flows to reduce early bad commitments. Ask for decomposition, allow retries or self-checks, and prefer systems that can deliberate before producing user-visible output.
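
A toy illustration of the commitment problem, under heavy simplifying assumptions: the probability table below is hand-written rather than learned, tokens are whole phrases, and each step conditions only on the previous token, whereas a real model conditions on the entire prefix. The point is only that generation is append-only.

```python
# Toy next-token table (made-up probabilities, not a trained model).
# Keys are the previous token; values map candidate next tokens to probabilities.
TABLE = {
    "is": {"Sydney,": 0.55, "Canberra,": 0.45},
    "Sydney,": {"its largest city.": 1.0},
    "Canberra,": {"its purpose-built capital.": 1.0},
}

def generate(prompt_tokens, table, max_new_tokens=5):
    """Greedy left-to-right decoding: pick the likeliest next token and append it."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = table.get(tokens[-1])
        if not dist:
            break  # no known continuation
        # Earlier tokens are never revisited; each choice constrains what follows.
        tokens.append(max(dist, key=dist.get))
    return tokens

prompt = ["The capital of Australia", "is"]
greedy = generate(prompt, TABLE)                   # commits to the slightly likelier wrong token
steered = generate(prompt + ["Canberra,"], TABLE)  # same table, different early commitment
```

In the greedy run the wrong early token is followed by a continuation that rationalizes it instead of retracting it. Steering the first step changes everything downstream, which is the intuition behind letting a system deliberate or retry before it produces user-visible output.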

Attribution:

miki123211 #1
swyx #1

The bitter lesson is about scalable methods

The useful reading of the bitter lesson is not "scale anything and win." It is that methods which improve with more computation, like search and learning, keep overtaking systems that bake in a lot of human priors. That framing fits LLMs well. The key choice is the axis that scales, not blind bigness.

Favor approaches whose performance keeps improving as you add compute, data, and feedback. Be wary of bespoke tricks that look smart in demos but do not compound with scale.

Attribution:

swyx #1
ekunazanu #1

Token relationships come from distributional structure

The model is never handed an explicit dictionary of meaning. It extracts relationships from how tokens co-occur across vast amounts of text. This is the old distributional idea in a much larger and more expressive system. Semantics show up because the easiest way to predict real text well is to internalize the patterns that generated it.

If you want a model to learn domain structure, give it rich, consistent corpora where the important relationships recur. Better examples beat more hand-written rules.
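
The distributional idea can be shown with plain co-occurrence counts, long before any neural network is involved. This is a minimal sketch on an invented six-sentence corpus; the window size and the helper names `cooccurrence_vectors` and `cosine` are assumptions for illustration, and an LLM learns far richer representations than these counts.

```python
import math
from collections import Counter, defaultdict

# Tiny invented corpus; no word is ever given a definition.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "the car needs fuel",
    "the truck needs fuel",
]

def cooccurrence_vectors(sentences, window=2):
    """Represent each word by counts of the words appearing within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][words[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vectors = cooccurrence_vectors(corpus)
cat_dog = cosine(vectors["cat"], vectors["dog"])
cat_car = cosine(vectors["cat"], vectors["car"])
```

Here `cat_dog` comes out higher than `cat_car` purely because cats and dogs appear in similar contexts in the corpus. That is also why consistent, recurring examples matter: the relationships can only be recovered if the text actually exhibits them.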

Attribution:

HarHarVeryFunny #1
inkysigma #1

Hands-on implementation beats passive reading

Several people said transformer papers only clicked after they built small models themselves or literally drew the block diagram and worked through the math step by step. Books that implement GPT-style models from scratch were repeatedly cited as the point where the architecture stopped feeling mystical. The practical learning path is concrete replication, not endless video consumption.

If you are training engineers on LLMs, assign a toy implementation or whiteboard walkthrough. A week spent building a tiny model will teach more than a month of high-level explainers.
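
As an example of the kind of toy exercise meant here, the following is a minimal single-head causal self-attention in pure Python. It is a teaching sketch under stated assumptions: no batching, no multiple heads, no output projection, and identity weight matrices chosen only to keep the numbers easy to check by hand.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, vec):
    return [dot(row, vec) for row in matrix]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product attention where position t sees only positions 0..t."""
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    scale = math.sqrt(len(Q[0]))
    outputs, all_weights = [], []
    for t in range(len(X)):
        # Causal mask by construction: scores are computed only for s <= t.
        scores = [dot(Q[t], K[s]) / scale for s in range(t + 1)]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible value vectors.
        out = [sum(w * V[s][j] for s, w in enumerate(weights)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

IDENTITY = [[1.0, 0.0], [0.0, 1.0]]  # placeholder weights, not learned
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three token embeddings of width 2
outputs, weights = causal_self_attention(X, IDENTITY, IDENTITY, IDENTITY)
```

A useful whiteboard check is to change the last row of `X` and confirm that the outputs for the first two positions do not move, since earlier positions never attend to later ones.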

Attribution:

2muchcoffeeman #1
malwrar #1
LatencyKills #1 #2

Against the grain

Transformer diagrams understate the real complexity

Knowing the whiteboard architecture is not the same as understanding a frontier model. The hard parts now include performant implementation on accelerators, distributed training, fault tolerance across long runs, and post-training that is still partly black art. That pushes back on the romantic idea that the secret is all in a simple block diagram.

Do not confuse conceptual understanding with operational competence. Hiring, timelines, and budgets for serious LLM work should account for systems engineering depth, not just ML literacy.

Attribution:

faurroar #1

The RoPE explanation appears wrong

A technically minded reader called out the article’s positional encoding section for implying that the whole token vector is rotated, instead of explaining that RoPE is applied to query and key representations so relative position changes attention scores. That is not a nit. It changes the intuition the reader walks away with.

Treat beginner-friendly explainers as fallible, especially on details that drive later intuition. Cross-check core mechanisms against a paper or trusted implementation before teaching from them.
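
The distinction the reader raised can be checked numerically. The sketch below applies a rotary rotation to a query and a key, not to the token embedding, and shows that the resulting attention score depends only on the distance between the two positions. The adjacent-pair layout and the base of 10000 follow the common formulation, but implementations differ in how they pair dimensions, so treat those details and the function names as assumptions.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Each pair gets its own frequency; lower pairs rotate faster.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def attention_score(q, k, q_pos, k_pos):
    """Dot product of the rotated query and rotated key."""
    rq, rk = rope(q, q_pos), rope(k, k_pos)
    return sum(a * b for a, b in zip(rq, rk))

# Arbitrary example query and key vectors (even length required).
q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.7, 0.9, 0.4]

near_start = attention_score(q, k, q_pos=5, k_pos=2)    # offset 3
far_along = attention_score(q, k, q_pos=13, k_pos=10)   # offset 3, shifted by 8
other_gap = attention_score(q, k, q_pos=5, k_pos=4)     # offset 1
```

`near_start` and `far_along` agree up to floating-point error because both pairs are three positions apart, while `other_gap` differs. Position enters through the query-key interaction, which is why describing it as rotating the whole token vector leaves the wrong intuition.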

Attribution:

giardini #1
rhubarbtree #1

Plausible text generation still explains most failures

Some commenters rejected the stronger claims that coherent output implies genuine understanding. Their point was that next-token prediction remains the most useful frame for explaining brittleness, sycophancy, and factual drift. If the training signal rewards plausible continuation over grounded truth, failure modes follow directly from that objective.

Use LLMs as probabilistic language systems unless your product adds grounding and verification. Plan for confident errors even when outputs look thoughtful.
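
The objective behind this argument can be written down directly. The sketch below computes the standard next-token cross-entropy loss from raw logits; the two-word vocabulary and the logit values are invented for illustration and do not come from any real model.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, vocab, observed_token):
    """Cross-entropy for one step: negative log-probability of the token the text used."""
    probs = softmax(logits)
    return -math.log(probs[vocab.index(observed_token)])

# Hypothetical step: completing "Humans use ___ of their brains."
vocab = ["10%", "all"]
logits = [2.0, 0.5]  # made-up scores from a model that has absorbed a popular myth

loss_when_text_repeats_myth = next_token_loss(logits, vocab, "10%")
loss_when_text_is_accurate = next_token_loss(logits, vocab, "all")
```

Nothing in the loss refers to whether a continuation is true. It only measures agreement with the token that actually appeared in the training text, so a frequently repeated claim is rewarded the same way a correct one is unless grounding or verification is added elsewhere in the system.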

Attribution:

Borealid #1
otabdeveloper4 #1

Capability gains do not settle the social cost

A skeptical line argued that even if LLMs are genuinely useful, society may still get a bad trade if dependence on a few large vendors grows the way social media did. A reply strengthened that point by separating model capability from concentration risk. The danger is not that the tools do nothing. It is that they do a lot while centralizing leverage over work, information, and behavior.

For adoption decisions, evaluate vendor dependence and power concentration alongside raw productivity. Open models and local deployment may matter strategically even when closed systems perform better today.

Attribution:

spaceisballer #1 #2
layla5alive #1

In plain English

autoregressive

A generation process that predicts each next token from the tokens that came before it.

chain-of-thought

A model's intermediate reasoning steps, often represented as internal text tokens before the final answer.

decoder-only transformer

A transformer model architecture that generates text by looking at earlier tokens in a sequence and predicting the next one, without a separate encoder stage.

LLM

Large language model, a machine learning system trained on large amounts of text that can generate and analyze language and code.

logits

The raw scores a model produces for each possible next token before they are converted into probabilities.

Mixture of Experts

A model architecture that routes each token to a small subset of expert sub-networks, so only part of the model runs for any given token, improving efficiency.

open-weight

Describes an AI model whose trained parameters are released so others can run or adapt it themselves.

query and key

Two learned vector representations used in attention to measure which tokens should pay attention to which other tokens.

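ReAct

A prompting pattern, short for reasoning and acting, in which a model alternates between writing out reasoning steps and calling tools, then uses the tool results to decide its next step.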

RoPE

Rotary positional embeddings, a way of encoding token position into attention computations so the model can track order and relative distance.

test-time compute

Extra computation spent while generating an answer, often by allowing the model to do more internal reasoning or multiple passes before replying.

Reference links

Core concepts and theory

The Bitter Lesson
Referenced to frame LLM progress as a victory for methods that scale with compute rather than handcrafted human priors.
Hardware Lottery
Cited as a reminder that progress can depend heavily on what current hardware makes practical.
Predictive coding
Mentioned in a side argument comparing LLM token prediction to theories of human cognition.

Papers and technical references

DeepSeek-V3 paper
Used as evidence that open-weight models can replicate frontier capabilities without a new core architecture.
DeepSeek-V2 paper
Cited alongside other DeepSeek work as public evidence on modern training and architecture tweaks.
DeepSeek-R1 paper
Linked as an open example of large-scale reinforcement learning applied to language models.
Formal Algorithms for Transformers
Recommended as a pedagogically strong technical reference for understanding transformers.
Speech and Language Processing
Suggested textbook with chapters covering transformers and LLMs.
Language Models are Few-Shot Learners
Recommended as a good starting point for the GPT-3 era of LLMs.

Learning resources

What Is ChatGPT Doing, and Why Does It Work?
Praised as an accessible explainer that helped readers build intuition after ChatGPT launched.
Build a Large Language Model (From Scratch)
Repeatedly recommended as a practical way to internalize transformer mechanics by implementing them.
Build a DeepSeek Model (From Scratch)
Suggested as a hands-on sequel for understanding modern open-weight model designs.

Reasoning and alternate generation methods

OpenAI chain-of-thought monitoring
Quoted in a discussion about why labs may want models to reason in human-readable language instead of inventing opaque internal codes.
Text diffusion models overview video
Shared as an example of a non-autoregressive alternative that could reduce left-to-right path dependence.
Mercury video
Mentioned as one of the more visible attempts to push text diffusion forward.

Interpretability and visualization

Activation Atlas
Linked as an example of model visualization work that hints at how labs may inspect internal representations.
Beating Nyquist with Compressed Sensing
Referenced in a side discussion about superposition and why simple explanations may still leave a lot unexplained.