The paper asks a narrow but important question: in transformer attention, do we really need three separate learned projections for queries, keys, and values, or can some of them be shared without wrecking model quality? The authors try variants such as tying keys and values together and report that some simplified forms hold up surprisingly well in their experiments, which reached about 1.2B parameters and 10B training tokens. That makes the result interesting as an ablation of transformer internals, not as a replacement for the current stack.
The strongest reaction was not about the core idea but about how far you can trust it. Several people pointed out that a 1.2B model trained on 10B tokens is badly undertrained by current LLM standards, so this setup may hide the long-run value of standard attention. The practical claim is that simplified attention often looks competitive early, before enough data and compute expose where expressiveness matters. Others still found the result worth taking seriously because architecture simplifications can pay off in hardware-constrained settings, and because even a negative result at scale would teach something useful about which parts of attention are doing real work.
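For a rough sense of what "undertrained" means here, the arithmetic can be sketched as below. The 1.2B and 10B figures come from the discussion; the 20 tokens-per-parameter figure is an assumption, a commonly cited compute-optimal rule of thumb and not a number taken from the paper or the thread. Many recent models train well past that ratio, so the gap is, if anything, understated.

```python
# Rough tokens-per-parameter check for the setup described in the discussion.
params = 1.2e9   # model size reported in the discussion
tokens = 10e9    # training tokens reported in the discussion

tokens_per_param = tokens / params

# Assumption: ~20 tokens per parameter as a compute-optimal rule of thumb.
# This heuristic is NOT from the paper; it is only a reference point.
RULE_OF_THUMB = 20
budget_tokens = RULE_OF_THUMB * params

print(f"tokens per parameter: {tokens_per_param:.1f}")                # 8.3
print(f"share of rule-of-thumb budget: {tokens / budget_tokens:.0%}")  # 42%
```

Under that assumption the run saw well under half the data the heuristic would suggest, which is the basis for the skepticism about long-run behavior.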
A second theme was mechanism. People zeroed in on what gets lost when Q and K collapse toward the same representation. Separate query and key projections let token relationships be asymmetric, which is a big part of why attention is more expressive than a plain similarity lookup. That makes the strong performance of shared variants more surprising, but it also suggests where they may fail first, especially on longer contexts or tasks that depend on directional structure. The paper itself reports only limited context-length evidence, so the takeaway is still provisional.
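The asymmetry point can be illustrated with a small sketch. It is a toy, not the paper's setup: the matrix sizes, the random weights, and the helper names (`matmul`, `scores`, `is_symmetric`, `rand_matrix`) are all made up for illustration, and it looks only at pre-softmax scores with no scaling, masking, or positional encoding. With a shared query/key projection, the score of token i attending to token j equals the score of j attending to i; with separate projections that generally does not hold.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def scores(x, w_q, w_k):
    """Pre-softmax attention scores (X W_q)(X W_k)^T, scaling omitted."""
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    return matmul(q, transpose(k))

def is_symmetric(m, tol=1e-9):
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

x = rand_matrix(4, 3)    # 4 toy tokens with 3 features each
w_q = rand_matrix(3, 3)  # hypothetical query projection
w_k = rand_matrix(3, 3)  # hypothetical key projection

shared = scores(x, w_q, w_q)    # Q and K use the same projection
separate = scores(x, w_q, w_k)  # standard separate projections

print(is_symmetric(shared))    # True: score(i, j) == score(j, i)
print(is_symmetric(separate))  # not symmetric for generic random weights
```

Row-wise softmax, causal masking, and position-dependent terms can still make the final attention weights asymmetric even when the raw scores are symmetric, so this shows a constraint on the score function and not a claim that shared variants cannot represent direction at all.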
There was also broad annoyance at the notation. The paper uses labels like "Q-K=V" where the hyphen is acting like a separator, not subtraction, and that tripped up multiple readers. Beneath the snark was a fair point: this kind of ablation is exactly the sort of work where clean notation matters, because the whole contribution is about subtle architectural constraints. Overall the mood was curious and mildly skeptical. People like the question, think the reported result is nontrivial, and want a serious follow-up that maps the scaling behavior instead of stopping at a single medium-size regime.