HN Debrief

You can't unit test for taste

  • AI
  • Programming
  • Developer Tools
  • Product
  • Design

The post describes an AI-assisted side project and lands on a familiar but sharper point: plenty of work is not about whether code runs, but whether the result feels right. The author found that once the project moved from mechanics into ranking places, balancing tradeoffs, and deciding what makes a route actually good, the neat test-first mindset broke down. There was no obvious ground truth, and local improvements could easily make the overall product worse. That made Claude useful as a coding partner, but not as the thing that decides quality.

If you are using AI to build product or software, treat it as a fast implementer inside a process you control, not as the source of final judgment. Invest in conventions, review checkpoints, and examples of acceptable output, because that is where current systems still fail hardest.

Discussion mood

Mostly aligned with the post, with a pragmatic tone. People were positive about AI as a strong implementation aid, but skeptical that current LLMs can internalize team-specific judgment, UX sense, or product taste without heavy scaffolding and human review.

Key insights

  1. 01

    LLMs act like day-one hires

    The best framing for current coding models is not “junior engineer” in skill level but “excellent engineer on the first day at your company.” They know many patterns and can move fast, but they do not accumulate durable local judgment about your codebase, your norms, or your exceptions. That shifts review toward decision points where the model filled in missing context, because most outputs are fine and a few are catastrophically off in ways that look reasonable at first glance.

    Review generated work at the architectural and assumption level, not just for syntax or style. If your workflow depends on the model learning your organization over time, assume that capability is weak today unless you have built persistent memory or explicit artifacts around it.

      Attribution:
    • paytonjjones #1 #2
    • trjordan #1
    • pixl97 #1
    • plastic-enjoyer #1
  2. 02

    Taste shows up most in what you cut

    Several comments sharpened the practical meaning of taste by pointing to selection and restraint. The problem is not usually that the model cannot produce many acceptable options. It is that it keeps adding structure, abstraction, or variation where an experienced human would delete, simplify, or preserve an existing pattern. One commenter described fighting Claude to stop inventing a bespoke query builder during a straightforward BigQuery to Postgres port. Another said the real job is taking the 200 generated lines and keeping only 80.

    Watch for needless novelty in AI-generated code. Add review questions that force simplification, such as whether a new abstraction is actually earning its keep or whether the right move is to keep the boring existing pattern.

      Attribution:
    • eithed #1
    • Dumblydorr #1
    • pydry #1
    • jpadkins #1
    • fny #1
  3. 03

    Externalized judgment becomes company property

    Writing down your personal workflow as prompts, skills, or agent instructions is not just a technical exercise. It can turn tacit know-how into an asset your employer may claim to own under work-for-hire rules. That risk barely matters for private shell scripts because few companies care about them, but reusable agent skill files that reliably encode how you work are much easier to treat as valuable proprietary IP.

    If your team is pushing hard on agent workflows, clarify ownership terms before people pour their craft into reusable prompt libraries or skill files. Otherwise you create a quiet retention and portability problem for your strongest operators.

      Attribution:
    • ElevenLathe #1
    • hammock #1
  4. 04

    Tests help freeze decisions, not discover quality

    The most useful testing comments reframed tests as a way to lock in things you already believe should not change. That includes snapshots for visuals, interfaces with many dependents, or bug reproductions. It does not solve the upstream question of what the right user experience, architecture, or aesthetic should be. You can absolutely test shaders, screenshots, and UI states if you invest in the harness. What you are testing is a chosen artifact, not the human judgment that made it worth choosing.

    Use tests to preserve taste after humans have approved an example, not to replace that approval step. For product areas with subjective quality, spend more on golden cases, snapshots, and regression harnesses than on pretending low-level unit coverage answers the design question.

      Attribution:
    • bob1029 #1
    • bluGill #1
    • MoreQARespect #1 #2 #3
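
    The "freeze an approved artifact" idea from this insight can be sketched as a golden-file check. This is a minimal illustration, not anything a commenter posted: `check_golden`, `render_route_summary`, and the `.golden.txt` naming are hypothetical, and real snapshot tools for screenshots or shaders need far more harness than this.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def check_golden(name: str, actual: str, golden_dir: Path, approve: bool = False) -> bool:
    """Compare `actual` with the human-approved golden file for `name`.

    With approve=True the current output is recorded as the new golden file.
    That call is the step a human is meant to make deliberately, after looking
    at the output and deciding it is worth keeping.
    """
    golden_path = golden_dir / f"{name}.golden.txt"
    if approve:
        golden_dir.mkdir(parents=True, exist_ok=True)
        golden_path.write_text(actual, encoding="utf-8")
        return True
    if not golden_path.exists():
        # No approved example yet: the test cannot decide quality on its own.
        raise FileNotFoundError(f"no approved golden file for {name!r}")
    return golden_path.read_text(encoding="utf-8") == actual


def render_route_summary(stops: list[str]) -> str:
    # Hypothetical renderer standing in for whatever artifact is being frozen.
    return " -> ".join(stops)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        golden_dir = Path(tmp) / "golden"
        output = render_route_summary(["Museum", "Cafe", "Park"])
        check_golden("route", output, golden_dir, approve=True)  # human sign-off
        print(check_golden("route", output, golden_dir))  # unchanged output passes
```

    The check only reports drift from the approved example. Whether the example deserved approval is still decided outside the test.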
  5. 05

    AI usefulness varies a lot by stack

    People reporting strong results with “vibe coding” were usually running disciplined plan and revision cycles and often working in domains where generated code can be judged quickly from the end product. Others said the same approach breaks down in stacks like native Swift UI work, where the model behaves like a smart but unseasoned engineer and misses obvious UX problems. That makes broad claims about AI coding misleading. How good the tool looks depends on how much of its output you can inspect cheaply.

    Set AI expectations by domain, not by headline capability. Favor AI-heavy workflows first where outputs are easy to evaluate end to end, and be much more conservative in areas with subtle platform conventions or expensive UX mistakes.

      Attribution:
    • hombre_fatal #1
    • ChrisMarshallNY #1
    • paytonjjones #1

Against the grain

  1. 01

    Preference models can become tests

    A real counterargument is that subjective judgment can be operationalized if you stop looking for crisp rules and start collecting comparative labels. Pairwise choices, ratings, or examples of “tacky” versus “good” can train a scoring model that acts like a taste test. The catch is that this does not yield an explanation people can read, and commenters noted it often collapses toward average, generic preferences unless the labels stay tightly personal or team-specific.

    If your product has enough repeated judgment calls, experiment with preference datasets and ranking models instead of trying to write exhaustive rules. Keep the label source narrow, because mixing many reviewers will smooth away the distinctive taste you were trying to capture.

      Attribution:
    • delichon #1
    • trjordan #1
    • andy99 #1
    • punnerud #1
    • esafak #1
    • HoldOnAMinute #1
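
    One simple way to turn pairwise choices into a score is an Elo-style rating, sketched below. This is a stand-in chosen for brevity, not the method any commenter proposed: it rates only the labeled items and cannot score unseen ones the way a trained preference model would. The function name, the `k`, `base`, and `passes` parameters, and the sample labels are all assumptions for illustration.

```python
from collections import defaultdict


def elo_scores(comparisons, k=32.0, base=1000.0, passes=10):
    """Turn (preferred, rejected) label pairs into per-item scores.

    Each pair is one reviewer judgment. Repeating the pass several times lets
    a small label set settle into a stable ordering.
    """
    scores = defaultdict(lambda: base)
    for _ in range(passes):
        for preferred, rejected in comparisons:
            # Probability the preferred item "should" win given current scores.
            expected = 1.0 / (1.0 + 10 ** ((scores[rejected] - scores[preferred]) / 400.0))
            delta = k * (1.0 - expected)
            scores[preferred] += delta
            scores[rejected] -= delta
    return dict(scores)


if __name__ == "__main__":
    # Hypothetical labels from a single reviewer comparing three route drafts.
    labels = [
        ("scenic_loop", "highway_dash"),
        ("scenic_loop", "museum_crawl"),
        ("museum_crawl", "highway_dash"),
    ]
    ranked = sorted(elo_scores(labels).items(), key=lambda kv: kv[1], reverse=True)
    for name, score in ranked:
        print(f"{name}: {score:.1f}")
```

    Keeping the labels from one reviewer or one small team matches the caveat above: pooling many reviewers into the same comparison list pulls the scores toward an averaged preference.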
  2. 02

    Taste may just be compressible pattern knowledge

    Some pushback treated taste as accumulated patterns rather than something mystical. The example of Rick Rubin was used to argue that what looks like pure judgment is often the result of many concrete exposures, habits, and references. Pattern languages, named practices, and large sets of prior project specs may let models recover more of that than people think, especially if the organization has built up consistent artifacts over time.

    Before declaring a quality bar inexpressible, check whether your team has simply failed to name and reuse its patterns. A library of conventions, examples, and recurring design moves can give AI far more usable signal than ad hoc prompting.

      Attribution:
    • sesm #1
    • timroman #1
    • pixl97 #1
    • joshka #1
  3. 03

    Good enough may be the only sane target

    A harder-edged minority view said the whole exercise becomes self-defeating if you chase perfect alignment between model output and personal taste. There may be no magic solution beyond better context, better tools, and accepting a level of mediocrity. For many workflows, the economic win comes from shipping with bounded imperfection rather than trying to eliminate every mismatch between what the model produced and what you would have made by hand.

    Set explicit thresholds for where AI output is allowed to be merely adequate and where it must be hand-crafted. Without that line, teams burn time trying to squeeze artisanal quality out of a tool that is delivering value mainly through speed.

      Attribution:
    • ahmedehab_01 #1
    • lukan #1

In plain English

BigQuery
Google BigQuery, a cloud data warehouse for running SQL queries over large datasets.
IP
Intellectual property, meaning legal rights over things people create, such as code, writing, designs, and inventions.
linters
Tools that automatically check code for style issues, suspicious patterns, or rule violations.
LLM
Large language model, an AI system that generates text or code from prompts.
Postgres
PostgreSQL, a widely used open source relational database.
QA
Quality assurance, the process of checking whether software works correctly and meets expectations.
Swift
Apple’s programming language used for building apps on iPhone, Mac, and related platforms.
UX
User experience, the overall feel and usability of a product from the user's point of view.
work-for-hire
A legal arrangement where work created as part of employment is owned by the employer rather than the individual creator.

Reference links

AI, taste, and workflow references

  • Taste Is the New Skill
    Linked as a related essay arguing that taste can be learned from accumulated context and examples.
  • infoPipeline governance.md
    Shared as an example of putting governance and selection into an AI-assisted workflow.
  • QRank
    Suggested as a better notoriety signal for ranking places or entities than language count alone.
  • wikidata-qrank design document
    Linked to explain how QRank is computed from Wikimedia usage data.