HN Debrief

Show HN: CLI tool for detecting non-exact code duplication with embedding models

  • AI
  • Developer Tools
  • Programming
  • Open Source

Slopo is a CLI tool for scanning a codebase and surfacing similar code using embeddings rather than exact-match clone detection. It works at the function or method level today, strips comments, embeds each extracted unit, and boosts matches that are far apart in the codebase so it can find the kind of duplication that is easy to miss during normal maintenance. The author framed it as a tool for catching semantically similar code written by humans or coding agents, where some hits will be false positives but the real matches can point to refactors or outright bugs.

If you want to use embeddings for code deduplication, treat them as a candidate generator, not the final judge. The practical work is in choosing the right unit of analysis and adding a second validation step, whether that is AST-based scoring or an LLM review in your workflow.
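A minimal sketch of that two-stage shape, under stated assumptions: the embeddings are supplied by the caller (no particular model is implied), the units are Python source, and both thresholds are arbitrary placeholders. The structural check compares sequences of AST node types with `difflib`, which is a cheap stand-in for AST edit distance and not a true tree edit distance. All function names are hypothetical.

```python
import ast
import difflib
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def node_types(code):
    """Flatten a snippet into the sequence of its AST node type names."""
    return [type(n).__name__ for n in ast.walk(ast.parse(code))]


def structural_similarity(code_a, code_b):
    """Ratio in [0, 1]; identifiers and literal values are ignored."""
    matcher = difflib.SequenceMatcher(
        None, node_types(code_a), node_types(code_b), autojunk=False
    )
    return matcher.ratio()


def find_duplicates(units, embed_threshold=0.8, struct_threshold=0.7):
    """units: list of (name, code, embedding). Thresholds are placeholders."""
    results = []
    for i in range(len(units)):
        for j in range(i + 1, len(units)):
            name_a, code_a, vec_a = units[i]
            name_b, code_b, vec_b = units[j]
            # Stage 1: embeddings cast the wide net.
            sim = cosine(vec_a, vec_b)
            if sim < embed_threshold:
                continue
            # Stage 2: a structural check rejects same-topic false positives.
            struct = structural_similarity(code_a, code_b)
            if struct >= struct_threshold:
                results.append((name_a, name_b, sim, struct))
    return results
```

The pairwise loop is quadratic and only suits small inputs; a real pipeline would likely use a nearest-neighbor index for stage 1. Stage 2 could equally be an LLM review step, at the cost of reproducibility.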

Discussion mood

Mostly positive and curious. People liked the tool as a practical, Unix-style use of embeddings, but the strongest comments immediately pressed on false positives, chunking granularity, and the need for structural or deterministic signals to make the results trustworthy.

Key insights

  1.

    Embeddings need a second scoring signal

    Cosine similarity is good at finding code about the same thing, which is not the same as finding code that should be refactored together. One useful framing here is to let embeddings cast a wide net, then use AST edit distance or another structural check to reject same-topic false positives. The author chose a different second stage, pushing classification to a coding LLM outside the tool, which makes the system more flexible but also less reproducible.

    If you pilot this in a real repo, measure precision after a second pass instead of judging the embedding scores alone. Pick early whether you want a deterministic validator inside the tool or an LLM-based review step in your development workflow.

      Attribution:
    • nttylock #1
    • rkochanowski #1
  2.

    Function-level chunking will miss local duplication

    Working on whole functions keeps parsing manageable, but it bakes in a blind spot. Repeated conditional branches, exception handlers, and other sub-function patterns can disappear inside otherwise different large functions, so the very duplication developers often want to clean up never makes it into the candidate set. The current parser also ignores comments and enforces hard size cutoffs, which further narrows what the tool can see.

    Expect the first wins to come from medium-sized duplicated functions, not small repeated blocks. If your codebase has huge methods or repeated control-flow fragments, wait for finer chunking or plan to add it yourself.

      Attribution:
    • klibertp #1
    • rkochanowski #1 #2
    • mempko #1
  3.

    Production use favors hybrid repo analysis

    A team that built a similar system for a large monorepo said the practical deployment point was code review, where new code is checked against existing patterns across the repo. They also found that AST-based analysis goes surprisingly far before embeddings are needed, and that deterministic output is easier for developers to inspect and trust. That pushes the concept away from a one-shot cleanup report and toward ongoing analysis with explainable matches.

    The strongest operational use case is not a periodic duplicate audit. It is a review-time assistant that flags likely repetition while the code is still being written and easy to change.

      Attribution:
    • vander_elst #1 #2
  4.

    Baseline comparisons are still missing

    The tool is pitched as filling the gap for embedding-based detection, but there is no evidence yet that embeddings outperform simpler retrieval methods or mature clone detectors on the actual task. BM25 was suggested as a baseline, and AST approaches came up repeatedly for the same reason. Without that comparison, it is hard to tell whether the novelty is doing useful work or just moving complexity around.

    Do not assume embedding-based duplicate detection is better because it sounds more semantic. Benchmark it against BM25 and structural clone detection on your own repos before you build process around it.

      Attribution:
    • rkochanowski #1 #2
    • janalsncm #1
  5.

    Similar code can reveal inconsistent business logic

    The best argument for this kind of tool was not cleaner style but bug discovery. The author described finding two distant permission checks that looked similar but were not equivalent, and the weaker one let invalid behavior through. That is a different class of value from what deduplication tools usually promise, because the problem is divergence in logic, not just repeated text.

    Use results to hunt for security checks, validation, and policy code that has forked over time. Those are high-value duplicates even when you decide not to refactor them into one abstraction.

      Attribution:
    • rkochanowski #1
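The BM25 baseline suggested in the discussion is cheap to set up for a comparison. The following is a minimal sketch of Okapi BM25 over code tokens using only the standard library. The identifier-only tokenizer, the `k1` and `b` defaults, and the function names are assumptions for illustration; the source reports no benchmark results, and none are implied here.

```python
import math
import re
from collections import Counter


def tokenize(code):
    """Split code into lowercase identifier-like tokens (a crude assumption)."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code.lower())


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n = len(docs_tokens)
    if n == 0:
        return []
    avg_len = (sum(len(d) for d in docs_tokens) / n) or 1.0

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in docs_tokens:
        df.update(set(d))

    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            # The "+ 1" inside the log keeps the IDF non-negative.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores
```

To use it as a baseline, treat each extracted function as a query against all the others, take the top-ranked hits, and compare precision on a hand-labeled sample against the embedding-based candidates from the same repo.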

Against the grain

  1.

    Not all duplication deserves immediate refactoring

    Replies that pushed back on the basic "why remove duplicate code" question still conceded an important limit. A second similar implementation is often tolerable, code generation can justify large repeated patterns, and duplicated implementations can even be tested against each other. The useful threshold offered here was the rule of three, not blanket deduplication.

    Do not turn findings into automatic cleanup tickets. Triage by risk and repetition count, and leave generated code or deliberate parallel implementations alone unless they are causing real maintenance pain.

      Attribution:
    • rufius #1
    • klibertp #1
    • Zopieux #1

In plain English

AST
Abstract syntax tree, a tree representation of code structure produced by parsing source code.
AST edit distance
A way to measure how structurally different two pieces of code are by counting changes between their abstract syntax trees.
BM25
Best Matching 25, a classic keyword-based ranking algorithm used in search and information retrieval.
CLI
Command-line interface, a program run from a terminal rather than through a graphical user interface.
cosine similarity
A mathematical score that measures how closely two vectors point in the same direction, often used to compare embeddings.
embedding
A numeric vector representation of code or text that lets software measure semantic similarity between items.
LLM
Large language model, an AI system trained on large amounts of text to generate or analyze language.
monorepo
A single repository that contains many projects, services, or packages together.
