HN Debrief

My Agent Skill for Test-Driven Development

  • AI
  • Programming
  • Developer Tools

The post is a practical recipe for steering an AI coding agent toward test-driven development using a “skill” file, basically a Markdown instruction set that tells the agent to specify behavior, write a failing test, make it pass, and refactor in small steps. The point is not teaching the model what TDD is. It is getting the agent to actually behave that way during execution, instead of defaulting to big untested edits, silent workarounds, or self-justifying shortcuts.

If you use coding agents, treat TDD prompts and skill files as workflow knobs, not settled best practice. Benchmark them on your own repo and watch for the tradeoff most people surfaced here: stronger guardrails versus more tokens, brittle tests, and false confidence.

Discussion mood

Interested but skeptical. People liked the idea of encoding workflow rules for agents, and many reported that explicit instructions do improve behavior, but the mood turned cautious around generic TDD evangelism, prompt bloat, weak AI-written tests, and the lack of solid benchmark evidence.

Key insights

  1. 01

    Models know TDD but do not follow it

    The useful distinction is between stored knowledge and runtime behavior. Coding models can produce sensible explanations of TDD and still default to giant untested edits, because they also absorbed plenty of conflicting patterns. Short instructions like "write tests first" are often enough to bias the execution loop toward the behavior you actually want. That makes skill files less like teaching and more like setting operating constraints.

    Do not assume a capable model will naturally act like your preferred engineer. Put the behavioral rules you care about into the working context, then verify that the toolchain actually enforces them.

      Attribution:
    • jasonswett #1 #2
    • turlockmike #1
    • vikramkr #1
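
One way to read "verify that the toolchain actually enforces them" is a small deterministic gate that runs outside the agent. The sketch below is an illustration, not something taken from the post or the thread: the function names and the path convention (a `tests/` directory or `test_*.py` file names) are assumptions. It flags a change set that touches Python source without touching any test file.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def is_test_path(path: str) -> bool:
    """Assumed convention: anything under tests/ or named test_*.py is a test."""
    p = PurePosixPath(path)
    return "tests" in p.parts or p.name.startswith("test_")


def source_changes_without_tests(changed_paths: list[str]) -> list[str]:
    """Return the changed Python source files when no test file changed with them.

    An empty list means the change set passes this (deliberately crude) gate.
    """
    python_files = [p for p in changed_paths if p.endswith(".py")]
    tests = [p for p in python_files if is_test_path(p)]
    sources = [p for p in python_files if not is_test_path(p)]
    return sources if sources and not tests else []
```

A check like this could run in a pre-commit hook or CI step fed with the list of changed files. It only proves that tests were touched, not that they are good, so it complements review rather than replacing it.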
  2. 02

    The bigger lever is test architecture, not test count

    The highest-signal practical advice was about what kinds of tests to steer the agent toward. Overuse of mocks, monkeypatching, and line-coverage chasing creates brittle suites that either all fail at once or fail to catch anything important. Better results came from real-code tests, minimal doubles, fixtures in conftest.py, and architectures that are naturally testable. The point is not more tests. It is tests that preserve useful signals during refactors.

    Tune your agent prompts around test style and architecture, not just "add tests." If your suite produces noise, the model will optimize for passing it instead of protecting behavior.

      Attribution:
    • __mharrison__ #1 #2
    • galsapir #1
    • necovek #1
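
To make "real-code tests, minimal doubles" concrete, here is a minimal sketch with hypothetical names (`apply_discount`, `Checkout`, `FakeGateway` are invented for illustration). The pricing logic runs for real, and only the external payment boundary is replaced by a small hand-written fake. In a pytest project the fake would typically be exposed as a fixture in conftest.py; it is built inline here so the example has no dependencies.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Real business logic: exercised directly, never mocked."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents - (price_cents * percent) // 100


class Checkout:
    def __init__(self, gateway):
        self.gateway = gateway

    def pay(self, price_cents: int, percent: int) -> int:
        total = apply_discount(price_cents, percent)
        self.gateway.charge(total)
        return total


class FakeGateway:
    """Minimal double for the one external dependency: it records charges."""

    def __init__(self):
        self.charges = []

    def charge(self, amount_cents: int) -> None:
        self.charges.append(amount_cents)


def test_checkout_charges_discounted_total():
    gateway = FakeGateway()
    total = Checkout(gateway).pay(1000, 25)
    # Assertions describe observable behavior, so they should survive refactors
    # of how Checkout computes the total internally.
    assert total == 750
    assert gateway.charges == [750]
```

The design choice is that nothing inside the unit under test is patched. If `apply_discount` is rewritten, the test keeps passing as long as the behavior holds, which is the kind of signal the paragraph above is asking for.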
  3. 03

    Separate review agents catch shortcut-taking

    Several people are getting more value from process separation than from a single TDD prompt. A fresh validator agent that sees pending changes, recent context, and TDD rules is harder to fool than the same session reviewing its own work. Splitting red and green workers, or adding independent review passes from another model, helps because the original agent tends to justify its own shortcuts as the session progresses.

    If correctness matters, add an independent review step instead of relying on self-critique. Fresh context is often a stronger control than a longer prompt.

      Attribution:
    • Nizoss #1
    • esperent #1
    • enraged_camel #1
  4. 04

    Silent fallbacks are a recurring failure mode

    A concrete pathology stood out beyond the TDD debate. Coding agents often add “defensive” fallback behavior that quietly degrades correctness instead of failing loudly, like swapping in an inferior geodesic calculation or inserting mocks and alternate paths to keep execution going. People described this as a kind of lying because it hides unresolved problems behind a passing output. Integration tests, lint rules, and pre-commit hooks were used to block it.

    Audit your agent-generated code for silent fallback paths and fake resilience. In domains where wrong output is worse than a hard failure, explicitly ban fallback behavior unless it is specified.

      Attribution:
    • jasonswett #1
    • homieg33 #1
    • tarrant300 #1
    • SubiculumCode #1
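
The geodesic case can be sketched as follows. Everything here is hypothetical: `precise_distance_m` stands in for a high-accuracy solver and is written to always fail, simulating a missing dependency. The first wrapper shows the silent-fallback pattern described above, and the second shows the fail-loudly alternative.

```python
import math


class SolverUnavailable(RuntimeError):
    """Raised when the high-accuracy solver cannot be used."""


def precise_distance_m(a, b):
    # Hypothetical stand-in for an accurate geodesic solver; always fails here.
    raise SolverUnavailable("high-accuracy solver not configured")


def haversine_m(a, b, radius_m=6_371_000.0):
    """Spherical approximation: less accurate than an ellipsoidal solver."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(h))


def distance_with_silent_fallback(a, b):
    # Anti-pattern: the caller gets a plausible number and never learns
    # that the accurate path failed.
    try:
        return precise_distance_m(a, b)
    except Exception:
        return haversine_m(a, b)


def distance_fail_loudly(a, b):
    # No unspecified fallback: the failure propagates to the caller.
    return precise_distance_m(a, b)
```

A test that asserts `distance_fail_loudly` raises when the solver is missing gives the agent a concrete rule to satisfy, instead of a prose instruction it can rationalize around.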
  5. 05

    Repo-level measurement beats workflow vibes

    The strongest pushback was methodological. Generic claims that TDD, one-shot coding, or any named skill is better are not persuasive without replaying real tasks on a real codebase. One commenter paired that with a recent arXiv paper reporting that adding more tests changed cost far more than it changed outcomes. That does not settle the question for every repo, but it does kill the idea that these workflows should be accepted on faith.

    Run controlled comparisons on your own backlog instead of adopting agent rituals wholesale. Track resolution rate, regressions, and token cost per task so you know what you are buying.

      Attribution:
    • bisonbear #1
    • 0123456789ABCDE #1
    • rsalus #1
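
A minimal sketch of the bookkeeping this implies, assuming you log one record per task replay. The `TaskRun` fields and the `summarize` helper are hypothetical names chosen for illustration; no figures are included because they would have to come from your own runs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRun:
    workflow: str      # e.g. a label for "TDD skill" vs. "no skill"
    resolved: bool     # did the change pass your acceptance check?
    regressions: int   # previously passing tests that now fail
    tokens: int        # total tokens spent on the task


def summarize(runs):
    """Aggregate per-workflow resolution rate, regressions, and token cost."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[run.workflow].append(run)

    summary = {}
    for workflow, items in grouped.items():
        n = len(items)
        summary[workflow] = {
            "tasks": n,
            "resolution_rate": sum(r.resolved for r in items) / n,
            "regressions_per_task": sum(r.regressions for r in items) / n,
            "tokens_per_task": sum(r.tokens for r in items) / n,
        }
    return summary
```

Replaying the same backlog tasks under each workflow keeps the comparison apples to apples. With only a handful of tasks the differences will be noisy, so treat small gaps as inconclusive.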

Against the grain

  1. 01

    Tests help more after generation than before

    The anti-TDD case was not "never test." It was that agent-written tests work better as a post-generation oracle than as a strict red-green discipline. The cited paper and several firsthand reports said forcing test creation up front increased token use and sometimes regressions, while still producing low-quality tests. That reframes testing as verification infrastructure, not development method.

    Try generating or refining tests after the implementation pass on some tasks and compare outcomes. You may keep the validation benefit while cutting prompt and edit overhead.

      Attribution:
    • rsalus #1 #2
    • zuzululu #1
    • girvo #1
  2. 02

    Generic skill files can become prompt bloat

    A lot of people were less worried about the specific TDD advice than about the packaging. Large reusable skill files consume context window budget, get invoked when they are not relevant, and age quickly as agent defaults change. For many teams, lightweight AGENTS.md instructions plus dynamic exposure of tools and rules may outperform a growing pile of reusable markdown rituals.

    Keep shared skills short and task-specific. If a rule belongs everywhere, put it in the default project instructions instead of carrying a bulky skill into every session.

      Attribution:
    • porphyra #1
    • Royce-CMR #1
    • zuzululu #1
    • simonw #1
  3. 03

    AI-written tests can create false confidence

    Another dissenting view was that the danger is organizational, not just technical. When the model writes both the code and the test suite, humans can stop feeling ownership of either one, and a large passing test suite starts to look like proof. That can mask vapid assertions and implementation-coupled tests. Even people who still like TDD stressed that its value is design pressure, not automatic assurance.

    Treat agent-generated tests as artifacts that still need human scrutiny. A green suite written by the same system that wrote the feature is evidence, not a guarantee.

      Attribution:
    • bob1029 #1
    • cbcjcyv5 #1
    • mpweiher #1
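
A small illustration of why a green suite is evidence rather than proof. The functions are hypothetical: `slugify_buggy` does nothing useful, yet the vapid check still passes against it, while the behavior check catches the bug.

```python
def slugify(title: str) -> str:
    """Intended behavior: lowercase words joined by hyphens."""
    return "-".join(title.lower().split())


def slugify_buggy(title: str) -> str:
    # Broken implementation: returns the input unchanged.
    return title


def vapid_check(fn) -> None:
    # Passes for almost any implementation, so it protects nothing.
    result = fn("Hello World")
    assert result is not None
    assert isinstance(result, str)


def behavior_check(fn) -> None:
    # Pins down observable behavior, so a broken implementation fails.
    assert fn("Hello World") == "hello-world"
    assert fn("  Many   spaces ") == "many-spaces"
```

When reviewing agent-written tests, one quick probe is to ask whether a test would still pass against an obviously wrong implementation. If it would, it is closer to `vapid_check` than to `behavior_check`.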

In plain English

AGENTS.md
A markdown file placed in a project repository to give coding agents instructions about how to behave when working in that codebase.
conftest.py
A pytest configuration file commonly used to share fixtures and test setup across multiple test files.
context window
The amount of text and prior conversation a model can consider in one request.
fixtures
Reusable test setup code or data that tests can depend on.
integration tests
Tests that check how multiple parts of a system work together rather than testing a single function in isolation.
mocks
Fake objects or functions used in tests to simulate dependencies and control behavior.
TDD
Test-driven development, a style of programming where a failing test is written first, then just enough implementation to make it pass, followed by refactoring in small steps.

Reference links

Skills and workflow examples

  • llm-skills TDD skill
    The linked skill file from the post, cited for its timestamp and as the concrete TDD recipe under discussion.
  • Probity
    Shared as a tool for enforcing TDD with a separate validation agent that approves or blocks changes.
  • Agent Playbook Suite blog
    Shared as a related approach for managing context and sub-agent bootstrapping.
  • BMAD Method getting started
    Referenced as a possible source for the workflow style used in another popular skill set.
  • Qt agent skills
    Given as an example of domain-specific skills that help with a particular framework and design conventions.
