HN Debrief

AI outperforms law professors in Stanford Law study

Stanford Law posted a press release about a paper in which 16 law professors compared anonymized answers to common first-year contracts-law questions and preferred AI-generated responses over professor-written ones about 75% of the time. The setup was framed around tutoring and pedagogy, not autonomous legal practice. That distinction mattered. The most useful reactions pointed out that the press release title wildly overstates what was actually tested. This was a preference study on short written answers to introductory law-school questions, not evidence that models outperform practicing lawyers, give reliable legal advice, or reason better than experts under real-world stakes.

For executives, the signal is not "AI replaced legal expertise" but that polished model output is already good enough to win subjective evaluations in text-heavy domains, which makes workflow redesign urgent while raising real governance and liability risks.

Discussion mood

Mostly skeptical and combative. Readers thought the headline and press release overstated a narrow tutoring study, and many questioned the methodology, raised possible funding bias, and objected to the reliance on subjective preference. At the same time, there was broad agreement that LLMs are already useful for legal research and drafting when tightly supervised by experts.

Key insights

  1. Legal drafting is harder to trust than AI coding because law lacks software's safety rails.
    One lawyer-engineer explained that code gets tests, static analysis, logs, sandboxes, and fast debugging loops, while legal mistakes can take months or years to surface and may be impossible to fix once discovered. That changes the operational math. Even if model error rates are similar across code and law, the legal domain is less forgiving and demands much stronger review discipline. The same commenter also described a planning-heavy workflow for both coding and legal memos, which suggests the leverage comes from structured process and expert review, not one-shot prompting.

    The limiting factor for legal AI is not prose quality. It is the absence of reliable verification and fast error correction.
  2. The obvious legal AI failure is fake citations, but the more dangerous one is real citations used incorrectly.
    Lawyers said models can cite an actual case yet misread its holding, use the wrong jurisdiction, or rely on outdated law unless they are forced onto current legal databases like Westlaw, Lexis, vLex, or CourtListener. That means citation checking alone is not enough. A system can look grounded and still be legally wrong in ways that are hard for a nonexpert to detect.

    Grounding a model in sources does not solve legal reliability by itself. Provenance has to include relevance, jurisdiction, and recency.
      Attribution:
    • qingcharles #1
    • lawtalkinghuman #1
    • BartjeD #1
    • timpera #1
  3. The clean reading of the paper is that LLMs may already be strong law tutors, not strong lawyers.
    This framing cuts through most of the hype. If the task is answering student-originated questions, explaining tradeoffs, and pointing learners toward the right concepts or materials, current models are plausibly very good. That is a meaningful educational result because tutoring is expensive and scarce. It does not justify jumping from pedagogy to autonomous counsel.

    Treat this as evidence for AI-assisted education and first-pass explanation. Do not mistake it for proof of production-grade legal judgment.
      Attribution:
    • finnborge #1
    • RataNova #1
    • scotty79 #1
  4. The business moat is shifting away from raw model access and toward accountable domain packaging.
    Commenters working in or around law said clients are already pressuring firms to use AI for drafting and research. The durable value then moves to local compliance, verified workflows, and a human willing to sign their name to the result. That is where niche products can win, especially in specific jurisdictions or narrow practice areas where generic models are too risky.

    The opportunity is not just "AI for law". It is workflow products that bundle AI with verification, jurisdiction specificity, and liability-bearing review.
      Attribution:
    • songting591 #1
    • tiahura #1
    • the_real_cher #1
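
The second insight above argues that confirming a case exists is only one of several checks a citation needs. A minimal sketch of that idea follows, assuming a hypothetical `CitationCheck` record filled in by a reviewer or a research tool. Every name here is illustrative and not taken from the thread or the paper, and exact-match jurisdiction is a deliberate simplification, since real practice also distinguishes binding from persuasive authority.

```python
from dataclasses import dataclass


@dataclass
class CitationCheck:
    """Hypothetical record for one cited case (illustrative fields only)."""
    case_name: str
    found_in_database: bool  # the case exists; catches fabricated citations
    jurisdiction: str        # where the cited case was decided
    still_good_law: bool     # not overruled or superseded, per a current database
    supports_claim: bool     # a human reviewer confirmed the holding matches


def review_flags(check: CitationCheck, matter_jurisdiction: str) -> list[str]:
    """Return reasons the citation still needs attention; empty if none found."""
    if not check.found_in_database:
        # Nothing else can be assessed for a case that cannot be located.
        return ["not found"]
    flags = []
    if check.jurisdiction != matter_jurisdiction:
        flags.append("wrong jurisdiction")
    if not check.still_good_law:
        flags.append("outdated")
    if not check.supports_claim:
        flags.append("holding misread")
    return flags


# A real case that passes the existence check but is still misused.
real_but_misused = CitationCheck("Example v. Example", True, "CA", True, False)
print(review_flags(real_but_misused, "NY"))
```

The point of the sketch is that `found_in_database` can be true while the citation still fails on jurisdiction, recency, or holding, and that the last of these depends on expert judgment rather than an automated lookup.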

Against the grain

  1. Dismissing the study because LLMs are optimized for preferred text misses what was actually tested.
    One commenter argued that persuading law professors to choose the answer they would give students, while avoiding pedagogical harm, is a specialized and meaningful benchmark. If a model can consistently beat expert-written tutoring answers under that lens, that is not trivial style transfer. It suggests genuine usefulness in a constrained professional task.

    Even a subjective preference benchmark can be a real capability signal when the judges are domain experts making consequential teaching choices.
      Attribution:
    • enoch_r #1
    • dcre #1
  2. Complaining that model generations move too fast for studies can become an excuse to avoid measurement entirely.
    A reply made the case that imperfect studies are still necessary, because otherwise every capability claim rests on anecdotes and product demos. The right response to a flawed paper is better methodology, not abandoning empirical work.

    Fast model cycles do not make evaluation pointless. They make rigorous evaluation more necessary.
      Attribution:
    • jstummbillig #1
    • greggoB #1
  3. Some readers thought the results may undersell how much legal knowledge is already baked into general models.
    NotebookLM with added resources only slightly outperformed baseline Gemini, and one commenter argued that this implies the systems were not just retrieving canned material but drawing on substantial internalized legal knowledge. If true, that raises the floor for what generic models can already do in text-heavy professional domains.

    If retrieval adds only a little, general-purpose models may already carry more domain competence than many skeptics assume.
      Attribution:
    • dragonwriter #1
    • vessenes #1
    • scotty79 #1

Reference links

Legal databases and infrastructure

  • Legifrance
    Example of a public legal database in France used to contrast with harder-to-access US case law
  • Westlaw plans and pricing
    Used to show the cost and closed nature of US legal research infrastructure

AI reasoning and explainability

  • Claim Dependency Graphs paper
    Shared as a resource on structuring model outputs so claims and supporting logic can be reconstructed
  • Tacit knowledge
    Referenced in a subthread about the kinds of professional knowledge that are hard to formalize for models
