HN Debrief

Removing 'um' from a recording is harder than it sounds

  • AI
  • Developer Tools
  • Media
  • Open Source

The post is a build log for erm, a local CLI that strips filler words from recordings without the ugly artifacts you get from naive cuts. The core problem is timing. Whisper can tell you roughly where a word is, but its timestamps drift enough that cutting directly on them clips consonants, leaves stutters, or creates clicks. The tool fixes that by finding the actual word boundaries in the waveform, then sliding cut points to nearby quiet spots and zero crossings before handing the splice off to ffmpeg.
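The cut-point adjustment described above can be sketched roughly as follows. This is a minimal illustration, not erm's actual code: the function names `frame_rms` and `snap_cut_point`, the `frame` size, and the scoring rule (quietest surrounding frame first, then distance from the original guess) are all assumptions made for the example. It takes a rough cut index, such as one derived from a Whisper timestamp, and moves it to a nearby zero crossing that sits in a low-energy stretch of the waveform.

```python
import math


def frame_rms(samples, start, size):
    """Root-mean-square energy of samples[start:start + size]."""
    window = samples[start:start + size]
    if not window:
        return 0.0
    return math.sqrt(sum(x * x for x in window) / len(window))


def snap_cut_point(samples, guess, search_radius, frame=32):
    """Move a rough cut index to a nearby quiet zero crossing.

    samples: mono audio as a list of floats in [-1.0, 1.0]
    guess: rough cut index (hypothetically, a word timestamp times the sample rate)
    search_radius: how many samples to search on either side of the guess
    frame: window length used to judge how quiet the surroundings are (assumed value)
    """
    lo = max(1, guess - search_radius)
    hi = min(len(samples) - 1, guess + search_radius)

    # A zero crossing is an index where the sign flips between neighbours.
    crossings = [
        i for i in range(lo, hi + 1)
        if (samples[i - 1] <= 0.0 < samples[i])
        or (samples[i - 1] >= 0.0 > samples[i])
    ]
    if not crossings:
        # Nothing better found nearby, so keep the original guess.
        return guess

    def cost(i):
        # Prefer the quietest neighbourhood, then the smallest move.
        start = max(0, i - frame // 2)
        return (frame_rms(samples, start, frame), abs(i - guess))

    return min(crossings, key=cost)
```

Cutting at a zero crossing avoids a sudden jump in amplitude at the splice, which is what produces an audible click. Preferring a quiet frame keeps the cut away from the consonants on either side. The adjusted indices would then be converted back to timestamps for whatever tool performs the splice.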

If you produce courses, podcasts, or videos, this looks useful as a cleanup step, not a one-click truth machine. Use it where brevity matters, but keep human review for interviews, nuanced answers, non-native speakers, and anything where hesitation carries meaning.

Discussion mood

Mostly positive about the engineering and the usefulness for content workflows, with sustained skepticism about the broader anti-filler crusade. People liked the narrow, local, practical tool, but many objected to treating all disfluencies as junk because they can carry meaning, conversational control, or evidence that someone is actually thinking.

Key insights

  1.

    Course editing is the clearest use case

    For recorded course production, this saves real labor instead of just sounding clever. One creator said manual filler cleanup used to cost nearly a full day per hour of material and that erm recovers about 70 percent of that time, while still needing tuning for edge cases like non-native speakers where the pause itself carries meaning.

    If you make training or educational video, test this against one of your existing editing sessions and measure hours saved, not just audio quality. Keep a review pass for speakers whose pacing or hesitation is part of the content.

      Attribution:
    • alyssamazz #1
  2.

    The bigger win may be transcript-first editing

    Several comments pointed past waveform surgery toward transcript-aware editing. One noted that Whisper already collapses some false starts on its own, another argued for a second LLM pass to normalize self-corrections into clean text, and a video creator described a workflow that marks retakes with a spoken keyword and lets AI drive cuts in DaVinci Resolve. The useful framing is that filler removal is one small part of a broader edit-from-transcript pipeline.

    If your team already transcribes recordings, do not isolate filler removal as a standalone feature. Build around transcript-guided cuts, retake markers, and editor integration so cleanup compounds across the whole workflow.

      Attribution:
    • iib #1
    • josefritzishere #1
    • __mharrison__ #1
  3.

    The hard part is preserving surrounding speech

    A sharp reframing cut through the implementation details. The challenge is not identifying an “um.” It is removing it without damaging adjacent phonemes, rhythm, or room tone. That matches the post's actual mechanics, which are mostly about correcting weak timestamps and finding safe cut points rather than detecting fillers in the first place.

    When evaluating similar tools, ask for before-and-after audio and failure cases around boundaries. Detection accuracy alone is the wrong metric if the audible artifacts are what make the output unusable.

      Attribution:
    • ralferoo #1
    • dougcalobrisi #1
  4.

    People want shorter audio because spoken media is bloated

    The appetite for deleting filler is tied to a broader complaint that much online audio and video wastes time. Multiple comments described watching informational content at 2x speed, preferring transcripts when possible, and seeing creator incentives push material into longer, slower formats than the information warrants. In that environment, trimming padding is less a style preference than a response to low information density.

    If you publish spoken content, filler cleanup is only a partial fix. The bigger opportunity is format discipline, tighter scripting, and offering transcripts so users can choose the fastest path.

      Attribution:
    • ralferoo #1
    • ordu #1
    • burkaman #1
    • red-iron-pine #1
    • landl0rd #1
  5.

    Default removal changes what people actually said

    The strongest objection was not technical. It was about authorship. People do not want platforms or dictation systems silently rewriting their speech patterns, because hesitation can carry mood, emphasis, and interpersonal meaning. A cleaned version may be easier to consume, but it is no longer a neutral copy of the original utterance.

    Make filler stripping explicit and reversible. If you ship this in a product, expose it as a user-controlled edit layer rather than a default transformation of source audio or transcripts.

      Attribution:
    • BugsJustFindMe #1
    • wzdd #1
    • sublinear #1

Against the grain

  1.

    Automatic cleanup is inherently an editorial choice

    This argument rejected the premise that the problem can be solved cleanly at all. Because a token like “um” can be either disposable noise or load-bearing language, deciding whether to cut it depends on surrounding context and intent, not just acoustic boundaries. Better splicing does not remove that judgment call. It only hides it.

    Use this tool where you are already comfortable making editorial decisions, like polished teaching material or solo narration. Avoid treating it as a neutral preprocessing step for interviews, research, or anything that needs faithful representation.

      Attribution:
    • chrismorgan #1
  2.

    Manual editing may still beat automation

    One commenter claimed Audacity can do this quickly and with better results, pushing back on the need for a dedicated tool. The reply from someone who has actually done repetitive filler cleanup was that pattern-based manual approaches do not generalize well enough and erm performs better overall. The useful contrarian point is that if your files are short and your standards are high, manual review still sets the bar.

    Do not assume automation wins on quality. Benchmark it against a human editor on a representative sample, especially for premium content where small timing mistakes are obvious.

      Attribution:
    • monster_truck #1
    • alyssamazz #1

In plain English

Audacity
A free open source audio editor commonly used for recording and manual audio cleanup.
CLI
Command-line interface, a text-based way to interact with software from a terminal.
DaVinci Resolve
A professional video editing application used for cutting, color grading, and audio post-production.
ffmpeg
A widely used open source command-line tool for converting, editing, and processing audio and video files.
LLM
Large language model, a type of AI system trained on large amounts of text to generate and analyze language.
waveform
A visual or numerical representation of how an audio signal changes over time.
Whisper
An automatic speech recognition model from OpenAI that transcribes audio into text and can estimate when words occur.

Reference links

Transcription and subtitle tools

  • Faster-Whisper-XXL
    Mentioned as an open source Whisper-based subtitle generation workflow with extra tuning options
