Commenters tightened the story around what actually moved the frontier. The architecture has changed less than outsiders assume.
Open-weight models and papers from DeepSeek and others suggest the big gains still come mostly from better training, post-training, reinforcement learning, data curation, context-handling tricks, and system engineering, not from some secret wholly different model class. Mixture-of-experts, attention variants, and long-context tweaks matter, but mostly as efficiency levers that let labs train or serve larger systems under the same budget. Several commenters also pushed beyond the article’s focus on the model internals and said the commercially important jump came from tool use, ReAct-style loops, and agent harnesses that let models fetch fresh data and act in external systems.
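For readers who have not seen one, a ReAct-style loop can be sketched in a few lines. This is only an illustration of the control flow commenters were pointing at, not any lab's harness: `stub_model`, the `lookup_price` tool, the `ACME` price, and the `Action:` / `Observation:` / `Final:` line format are all hypothetical stand-ins, and a real system would call an actual model and real external services.

```python
from typing import Callable, Dict, List

# Hypothetical tool registry: the name and canned data are illustrative only.
TOOLS: Dict[str, Callable[[str], str]] = {
    "lookup_price": lambda symbol: {"ACME": "42.10"}.get(symbol, "unknown"),
}


def stub_model(transcript: List[str]) -> str:
    """Stand-in for an LLM call: request a tool first, then answer from its result."""
    observations = [line for line in transcript if line.startswith("Observation:")]
    if not observations:
        return "Action: lookup_price ACME"
    value = observations[-1].split(":", 1)[1].strip()
    return f"Final: ACME last traded at {value}"


def react_loop(question: str,
               model: Callable[[List[str]], str] = stub_model,
               max_steps: int = 5) -> str:
    """Alternate model output and tool results until the model gives a final answer."""
    transcript = [f"Question: {question}"]
    for _step in range(max_steps):
        reply = model(transcript)
        transcript.append(reply)
        if reply.startswith("Final:"):
            return reply[len("Final:"):].strip()
        if reply.startswith("Action:"):
            # Parse "Action: <tool name> <argument>" and run the tool outside the model.
            name, _sep, arg = reply[len("Action:"):].strip().partition(" ")
            tool = TOOLS.get(name)
            result = tool(arg) if tool else f"no such tool: {name}"
            # Feed the fresh data back in as context for the next model call.
            transcript.append(f"Observation: {result}")
    return "gave up: step limit reached"


print(react_loop("What is ACME trading at?"))
```

The point of the sketch is that the harness, not the model weights, supplies the fresh data: the model only emits text, and the loop decides when that text triggers an external call.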
A separate thread drilled into the common line that LLMs “just predict the next token.” The consensus landed on a more precise version: that description is mechanically true but explanatorily weak. It tells you the training objective and generation loop, not why transformers generalize so much better than simpler statistical models, nor why prompt structure, chain-of-thought, and reinforcement learning can unlock much better behavior. One useful addition was the path-dependence of autoregressive generation. Because the model writes left to right and cannot revise earlier tokens inside a single pass, it tends to preserve local coherence and can double down on early mistakes. That is one reason reasoning models, hidden scratchpads, and extra test-time compute help so much.
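The path-dependence point can be made concrete with a toy decoder. The sketch below is an assumption-laden illustration, not a real language model: `NEXT_TOKEN` is a hand-written table with made-up probabilities, and it conditions only on the previous token, whereas a transformer conditions on the whole prefix. What it does share with real autoregressive generation is the append-only loop.

```python
from typing import Dict, List

# Toy next-token table with invented probabilities (not from any real model).
NEXT_TOKEN: Dict[str, Dict[str, float]] = {
    "<s>": {"yes": 0.55, "no": 0.45},   # a near-tie at the very first step
    "yes": {"because": 1.0},
    "no": {"since": 1.0},
    "because": {"it-holds": 1.0},
    "since": {"it-fails": 1.0},
    "it-holds": {"</s>": 1.0},
    "it-fails": {"</s>": 1.0},
}


def generate(prefix: List[str], max_new_tokens: int = 10) -> List[str]:
    """Greedy left-to-right decoding: each step appends one token and never edits earlier ones."""
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        dist = NEXT_TOKEN[tokens[-1]]
        best = max(dist, key=dist.get)  # pick the most likely next token
        tokens.append(best)             # append-only: no revision of the prefix
        if best == "</s>":
            break
    return tokens


print(generate(["<s>"]))        # follows the "yes" branch to the end
print(generate(["<s>", "no"]))  # once "no" is committed, the rest justifies it
```

Forcing a different first token changes everything downstream, and nothing in the loop can go back and reconsider it. Scratchpads and extra test-time compute work around this by letting the model spend tokens exploring before it commits to an answer.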
The article itself took some hits. Multiple readers thought the prose looked AI-polished or poorly edited, and a more substantive complaint said its explanation of RoPE positional encoding was wrong or at least badly ordered. Others said that explaining transformers is not the same thing as explaining modern LLM behavior, because the hard parts now include training pipelines, distributed systems, inference optimization, and post-training. The mood was still broadly engaged and positive about understanding the field, but with impatience for glib explainers that stop at architecture diagrams or flatten everything into hype.
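For context on the RoPE complaint, here is a minimal sketch of the standard idea, independent of whatever the article or its critics wrote. It assumes the interleaved pairing of dimensions and the commonly used base of 10000; some implementations pair each dimension with the one half the vector away instead. The vectors and function names are illustrative.

```python
import math
from typing import List


def rope(vec: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate each pair (x0, x1), (x2, x3), ... by an angle proportional to position."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out: List[float] = []
    for i in range(0, dim, 2):
        # Later pairs rotate more slowly; i is already 2 * pair_index.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]    # illustrative query vector
k = [0.5, -1.0, 2.0, 0.25]  # illustrative key vector

# Same offset between positions (2), different absolute positions.
near = dot(rope(q, 3), rope(k, 1))
far = dot(rope(q, 10), rope(k, 8))
print(abs(near - far) < 1e-9)
```

The rotation is applied to queries and keys before the attention dot product, and the resulting score depends only on the offset between the two positions, which is the property an explanation of RoPE needs to get across.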