HN Debrief

Do transformers need three projections? Systematic study of QKV variants

  • AI
  • Machine Learning
  • Infrastructure
  • Hardware

The paper asks a narrow but important question: in transformer attention, do we really need three separate learned projections for queries, keys, and values, or can some of them be shared without wrecking model quality? The authors try variants such as tying keys and values together and report that some simplified forms hold up surprisingly well in their experiments, which reached about 1.2B parameters and 10B training tokens. That makes the result interesting as an ablation of transformer internals, not as a replacement for the current stack.

If you are exploring cheaper attention variants, this paper is a decent starting point for architecture search on constrained hardware. Do not treat it as evidence that standard QKV is overengineered for production LLMs until someone shows scaling curves across model size and token count.

Discussion mood

Curious but skeptical. People liked the ablation and were genuinely surprised that sharing K and V worked as well as it did, but confidence was capped by the paper's small scale and undertrained setup, and there was some frustration with its confusing notation.

Key insights

  1. 01

    Undertraining can hide attention's advantages

    A 1.2B model trained on 10B tokens sees roughly 8 tokens per parameter, far enough below modern compute-optimal training that it may never force the architecture to use the extra expressiveness of separate Q, K, and V projections. That changes the read on the headline result. A simplified attention block looking fine here may only mean the experiment stopped before the standard design pulled away.

    Treat these numbers like early-training behavior, not a final verdict on architecture quality. If you run similar ablations, vary both parameter count and token count so you can see whether the gap widens with longer training.

      Attribution:
    • in-silico #1
    • janalsncm #1
    • Philpax #1
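
To put the undertraining point in numbers, here is a minimal sketch of the arithmetic. The parameter and token counts come from the summary above. The 20 tokens-per-parameter figure is an assumption: it is the rule of thumb commonly associated with the Chinchilla scaling analysis, not something the paper or the thread states.

```python
# Figures quoted in the summary of the paper's largest run.
params = 1.2e9   # model parameters
tokens = 10e9    # training tokens

# Assumption: a ~20 tokens-per-parameter heuristic for compute-optimal
# training. Treat it as a rough yardstick, not a figure from the paper.
RULE_OF_THUMB_TOKENS_PER_PARAM = 20

tokens_per_param = tokens / params
heuristic_tokens = RULE_OF_THUMB_TOKENS_PER_PARAM * params

print(f"tokens per parameter: {tokens_per_param:.1f}")
print(f"heuristic token budget: {heuristic_tokens / 1e9:.0f}B")
print(f"fraction of heuristic budget used: {tokens / heuristic_tokens:.0%}")
```

Under that assumed heuristic the run uses well under half of the suggested token budget, which is the gap the commenters were pointing at.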
  2. 02

    Separate Q and K enable directional attention

    Using distinct query and key projections lets one token attend strongly to another without forcing the reverse relation to look the same. That asymmetry is a real piece of transformer expressiveness, so merging the two projections is not just parameter sharing. It changes what kinds of token-to-token structure the model can encode.

    When evaluating tied-projection attention, probe tasks with directional structure and multi-step reasoning, not just average language modeling loss. That is where losses from symmetry are most likely to show up first.

      Attribution:
    • joshuamoyers #1 #2
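
The directional-attention point can be checked on a toy example. The sketch below is illustrative only: the sizes, random weights, and helper names are made up, it looks at pre-softmax scores for a single head, and it ties Q to K to show the general symmetry argument rather than reproducing any specific variant from the paper. Scaling by the head dimension is left out because it does not affect symmetry.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def random_matrix(rows, cols, rng):
    """Matrix of standard-normal entries (stand-in for learned weights)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def is_symmetric(m, tol=1e-9):
    """True if m[i][j] == m[j][i] for every pair, within tol."""
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))

rng = random.Random(0)
n_tokens, d_model, d_head = 4, 8, 8          # hypothetical toy sizes
x = random_matrix(n_tokens, d_model, rng)    # token embeddings
w_q = random_matrix(d_model, d_head, rng)    # query projection
w_k = random_matrix(d_model, d_head, rng)    # key projection

# Separate projections: scores[i][j] = q_i . k_j, which need not equal scores[j][i].
q, k = matmul(x, w_q), matmul(x, w_k)
separate_scores = matmul(q, transpose(k))

# Tied projections (K = Q): scores[i][j] = q_i . q_j = scores[j][i].
tied_scores = matmul(q, transpose(q))

print("separate projections symmetric:", is_symmetric(separate_scores))
print("tied projections symmetric:", is_symmetric(tied_scores))
```

With tied projections the raw score from token i to token j always equals the score from j to i. The row-wise softmax and any causal mask can still make the final attention weights differ, so this shows a constraint on the scores, not a claim that tied attention is fully symmetric end to end.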
  3. 03

    Scaling ladders would be more convincing

    The missing experiment is not a giant frontier run. It is a small scaling ladder. If you can afford 300M and 1.2B models, you can usually fit intermediate sizes and show whether the simplified variant gains or loses ground as scale increases. That would tell you much more than one medium run about whether the effect is structural or just a local artifact.

    Ask for parameter and data sweeps before updating your beliefs about attention design. Even modest multi-scale curves are enough to separate promising ideas from one-off wins.

      Attribution:
    • jephs #1
    • spindump8930 #1 #2
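
As a rough sketch of what a small scaling ladder could look like, the snippet below spaces model sizes geometrically between the 300M and 1.2B endpoints mentioned in the discussion. The number of rungs and the tokens-per-parameter ratios are illustrative assumptions, not settings from the paper or the thread.

```python
def geometric_ladder(smallest, largest, rungs):
    """Return `rungs` sizes evenly spaced in log space, endpoints included."""
    if rungs < 2:
        raise ValueError("need at least two rungs")
    ratio = (largest / smallest) ** (1 / (rungs - 1))
    return [smallest * ratio ** i for i in range(rungs)]

# Endpoints from the discussion; rung count and ratios are hypothetical choices.
sizes = geometric_ladder(300e6, 1.2e9, rungs=4)
tokens_per_param_options = (8, 20)

for params in sizes:
    budgets = ", ".join(
        f"{ratio * params / 1e9:.1f}B tokens at {ratio}/param"
        for ratio in tokens_per_param_options
    )
    print(f"{params / 1e6:.0f}M params: {budgets}")
```

Training the baseline and the simplified variant at each rung, then plotting the loss gap against size, is what would show whether the variant gains or loses ground as scale increases.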
  4. 04

    Why you cannot just learn QK directly

    The attention matrix has one row and one column per token, so its size depends on sequence length. You cannot replace Q and K with one fixed learned matrix unless your input shape is fixed. That is why attention buys flexible sequence handling, while alternatives like MLP-Mixer work in settings where token layout is fixed and routing can be learned in advance.

    If your workload has fixed-size inputs, compare against fixed-routing architectures rather than assuming attention is mandatory. If your sequence length varies, factorized Q and K are doing more than bookkeeping.

      Attribution:
    • mattalex #1
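
The shape argument can be made concrete with a toy comparison. This is a minimal sketch under stated assumptions: the sizes, random weights, and function names are hypothetical, and `fixed_mix` only stands in for a learned token-mixing matrix in the spirit of fixed-routing models, not for the actual MLP-Mixer layer.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def random_matrix(rows, cols, rng):
    """Matrix of standard-normal entries (stand-in for learned weights)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model = 4
FIXED_LEN = 3  # hypothetical sequence length the fixed mixer was built for

w_q = random_matrix(d_model, d_model, rng)            # query projection
w_k = random_matrix(d_model, d_model, rng)            # key projection
fixed_mix = random_matrix(FIXED_LEN, FIXED_LEN, rng)  # fixed token-to-token routing

def attention_scores(x):
    """Pre-softmax scores; the result is n_tokens x n_tokens for any input length."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    return matmul(q, [list(col) for col in zip(*k)])

def fixed_mixing(x):
    """Mix tokens with a fixed matrix; only defined for FIXED_LEN tokens."""
    if len(x) != FIXED_LEN:
        raise ValueError(f"expected {FIXED_LEN} tokens, got {len(x)}")
    return matmul(fixed_mix, x)

for n_tokens in (3, 5):
    x = random_matrix(n_tokens, d_model, rng)
    scores = attention_scores(x)
    print(f"{n_tokens} tokens -> score matrix {len(scores)} x {len(scores[0])}")
    try:
        fixed_mixing(x)
        print("  fixed mixer: ok")
    except ValueError as err:
        print(f"  fixed mixer: {err}")
```

The projection weights have shapes that depend only on the model width, so the same `w_q` and `w_k` produce a score matrix of whatever size the input needs. The fixed mixer's weights bake in the token count, so it rejects any other length.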
  5. 05

    Ablations need curves, not anecdotes

    The core complaint was methodological, not hostile. Lots of architectural tweaks look fine at one small point on the map. Without results across sizes, token budgets, or both, you cannot tell whether a variant is robust or just lucky at that operating point. That standard is especially important for transformer internals where many ideas fail only after scaling.

    Use single-run ablations to generate hypotheses, not to settle design choices. Put architecture papers through the same scaling discipline you would expect before changing a production training stack.

      Attribution:
    • jephs #1
    • ketchup32613 #1
    • zxexz #1

Against the grain

  1. 01

    Demanding curves can suppress useful bets

    The push for scaling evidence can turn into a gatekeeping norm that only well-funded labs can satisfy. That matters because some architecture ideas need a leap before the curve exists, and waiting for perfect validation can slow adoption of genuinely good alternatives such as state space models.

    Do not dismiss low-budget architecture work just because it lacks frontier-scale proof. Use it to shortlist ideas worth internal replication if they line up with your hardware or product constraints.

      Attribution:
    • Der_Einzige #1
  2. 02

    Exact attention form may matter less

    One reading is that attention succeeds mainly because it provides global cross-token comparison that parallel hardware can brute-force efficiently. If that is right, many specific QKV details may be second-order, with most of the practical gain coming from speed, stability, and implementation fit on GPUs rather than from the textbook form itself.

    When comparing attention variants, benchmark throughput, memory, and training stability alongside quality. A slightly less elegant mechanism can still win if it fits your hardware much better.

      Attribution:
    • hollosi #1
    • nbardy #1

In plain English

key
In attention, the vector that represents how a token can be matched or selected by other tokens' queries.
LLM
Large language model, a machine learning system trained on large amounts of text that can generate and analyze language and code.
MLP-Mixer
A neural network architecture that mixes information across tokens and channels using multilayer perceptrons instead of attention.
QKV
The three learned projections in transformer attention: queries, keys, and values.
query
In attention, the vector that represents what a token is looking for from other tokens.
state space models
A family of sequence models that process long contexts using recurrent-style internal state instead of standard full attention.
value
In attention, the vector that carries the information returned once attention selects a token.

Reference links

Alternative architectures

  • MLP-Mixer paper
    Used to explain fixed learned routing as an alternative when input token layout is fixed.