Removing 'um' from a recording is harder than it sounds


The post is a build log for erm, a local CLI that strips filler words from recordings without the ugly artifacts you get from naive cuts. The core problem is timing. Whisper can tell you roughly where a word is, but its timestamps drift enough that cutting directly on them clips consonants, leaves stutters, or creates clicks. The tool fixes that by finding the actual word boundaries in the waveform, then sliding cut points to nearby quiet spots and zero crossings before handing the splice off to ffmpeg.
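The snapping step can be illustrated with a small sketch. This is not erm's actual code. It is a minimal pure-Python illustration that assumes mono audio as a list of floats. The function name `snap_cut`, the `search_radius` and `window` parameters, and the mean-square loudness measure are all assumptions made for the example.

```python
import math


def snap_cut(samples, approx_index, search_radius, window=32):
    """Return the index of the quietest zero crossing near approx_index.

    samples: sequence of floats in [-1.0, 1.0] (hypothetical mono signal)
    approx_index: rough cut position, e.g. derived from a transcriber timestamp
    search_radius: number of samples to search on either side
    window: half-width of the window used to estimate local loudness
    Falls back to approx_index when no zero crossing is found in range.
    """
    lo = max(1, approx_index - search_radius)
    hi = min(len(samples) - 1, approx_index + search_radius)

    def local_energy(i):
        # Mean squared amplitude around i as a crude loudness estimate.
        start = max(0, i - window)
        end = min(len(samples), i + window)
        return sum(s * s for s in samples[start:end]) / max(1, end - start)

    best = None
    for i in range(lo, hi + 1):
        prev, cur = samples[i - 1], samples[i]
        # A zero crossing is a sign change between neighbouring samples.
        if prev <= 0.0 < cur or prev >= 0.0 > cur:
            # Prefer quiet spots first, then closeness to the rough cut.
            score = (local_energy(i), abs(i - approx_index))
            if best is None or score < best[0]:
                best = (score, i)
    return approx_index if best is None else best[1]


# Synthetic signal: a loud tone followed by a much quieter one.
loud = [math.sin(2 * math.pi * i / 20) for i in range(200)]
quiet = [0.01 * math.sin(2 * math.pi * i / 20) for i in range(200)]
signal = loud + quiet

# A rough cut at sample 190 lands inside the loud tone; snapping moves it
# to a zero crossing inside the quiet stretch.
snapped = snap_cut(signal, approx_index=190, search_radius=60, window=8)
```

Cutting at a zero crossing avoids the sudden amplitude jump that is heard as a click, and preferring low-energy regions keeps the cut away from consonants. The snapped sample index could then be converted to seconds and passed to whatever tool performs the splice.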

If you produce courses, podcasts, or videos, this looks useful as a cleanup step, not a one-click truth machine. Use it where brevity matters, but keep human review for interviews, nuanced answers, non-native speakers, and anything where hesitation carries meaning.

June 12, 2026
doug.sh

Discussion mood

Mostly positive about the engineering and the usefulness for content workflows, with sustained skepticism about the broader anti-filler crusade. People liked the narrow, local, practical tool, but many objected to treating all disfluencies as junk because they can carry meaning, conversational control, or evidence that someone is actually thinking.

Key insights

Course editing is the clearest use case

For recorded course production, the tool saves real labor rather than just being a clever demo. One creator said manual filler cleanup used to cost nearly a full day per hour of material and that erm recovers about 70 percent of that time, while still needing tuning for edge cases like non-native speakers where the pause itself carries meaning.

If you make training or educational video, test this against one of your existing editing sessions and measure hours saved, not just audio quality. Keep a review pass for speakers whose pacing or hesitation is part of the content.

Attribution:

alyssamazz #1

The bigger win may be transcript-first editing

Several comments pointed past waveform surgery toward transcript-aware editing. One noted that Whisper already collapses some false starts on its own. Another argued for a second LLM pass to normalize self-corrections into clean text. A video creator described a workflow that marks retakes with a spoken keyword and lets AI drive cuts in DaVinci Resolve. The useful framing is that filler removal is one small part of a broader edit-from-transcript pipeline.

If your team already transcribes recordings, do not isolate filler removal as a standalone feature. Build around transcript-guided cuts, retake markers, and editor integration so cleanup compounds across the whole workflow.
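As a rough illustration of transcript-guided cuts, the sketch below turns word-level timestamps into a list of time ranges to keep. The `(text, start, end)` tuple format, the `FILLERS` set, and the `keep_segments` name are assumptions for the example, not the output format of Whisper or any specific tool.

```python
# Hypothetical filler list; a real workflow would make this configurable.
FILLERS = {"um", "uh", "erm", "er"}


def keep_segments(words, total_duration, fillers=FILLERS):
    """Return (start, end) ranges in seconds that remain after dropping fillers.

    words: list of (text, start, end) tuples sorted by start time.
    total_duration: length of the recording in seconds.
    """
    segments = []
    cursor = 0.0
    for text, start, end in words:
        # Normalise case and trailing punctuation before matching.
        if text.lower().strip(".,!?") in fillers:
            if start > cursor:
                segments.append((cursor, start))
            cursor = max(cursor, end)
    if cursor < total_duration:
        segments.append((cursor, total_duration))
    return segments


transcript = [
    ("So", 0.0, 0.3),
    ("um", 0.4, 0.7),
    ("welcome", 0.8, 1.3),
    ("uh,", 1.4, 1.6),
    ("back", 1.7, 2.0),
]
ranges = keep_segments(transcript, total_duration=2.0)
# ranges == [(0.0, 0.4), (0.7, 1.4), (1.6, 2.0)]
```

The same keep-list shape would also work for retake markers: anything that identifies a span to drop in the transcript can feed one list of ranges that an editor or splicing tool consumes.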

Attribution:

iib #1
josefritzishere #1
__mharrison__ #1

The hard part is preserving surrounding speech

A sharp reframing cut through the implementation details. The challenge is not identifying an “um.” It is removing it without damaging adjacent phonemes, rhythm, or room tone. That matches the post's actual mechanics, which are mostly about correcting weak timestamps and finding safe cut points rather than detecting fillers in the first place.

When evaluating similar tools, ask for before-and-after audio and failure cases around boundaries. Detection accuracy alone is the wrong metric if the audible artifacts are what make the output unusable.

Attribution:

ralferoo #1
dougcalobrisi #1

People want shorter audio because spoken media is bloated

The appetite for deleting filler is tied to a broader complaint that much online audio and video wastes time. Multiple comments described watching informational content at 2x speed, preferring transcripts when possible, and seeing creator incentives push material into longer, slower formats than the information warrants. In that environment, trimming padding is less a style preference than a response to low information density.

If you publish spoken content, filler cleanup is only a partial fix. The bigger opportunity is format discipline, tighter scripting, and offering transcripts so users can choose the fastest path.

Attribution:

ralferoo #1
ordu #1
burkaman #1
red-iron-pine #1
landl0rd #1

Default removal changes what people actually said

The strongest objection was not technical. It was about authorship. People do not want platforms or dictation systems silently rewriting their speech patterns, because hesitation can carry mood, emphasis, and interpersonal meaning. A cleaned version may be easier to consume, but it is no longer a neutral copy of the original utterance.

Make filler stripping explicit and reversible. If you ship this in a product, expose it as a user-controlled edit layer rather than a default transformation of source audio or transcripts.
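One way to picture a user-controlled edit layer is a list of proposed cuts stored alongside the untouched source, each of which can be switched off. The sketch below is illustrative only. The `Cut` and `EditLayer` names and their fields are hypothetical and do not come from erm or any product.

```python
from dataclasses import dataclass, field


@dataclass
class Cut:
    start: float          # seconds
    end: float            # seconds
    label: str            # e.g. the filler word that was detected
    enabled: bool = True  # the user can switch any cut off


@dataclass
class EditLayer:
    duration: float                          # length of the source in seconds
    cuts: list = field(default_factory=list)

    def set_enabled(self, index, enabled):
        """Turn a single proposed cut on or off without touching the source."""
        self.cuts[index].enabled = enabled

    def keep_ranges(self):
        """Compute the ranges to play or export, honouring only enabled cuts."""
        ranges = []
        cursor = 0.0
        active = sorted((c for c in self.cuts if c.enabled), key=lambda c: c.start)
        for cut in active:
            if cut.start > cursor:
                ranges.append((cursor, cut.start))
            cursor = max(cursor, cut.end)
        if cursor < self.duration:
            ranges.append((cursor, self.duration))
        return ranges


layer = EditLayer(duration=10.0, cuts=[Cut(1.0, 1.5, "um"), Cut(4.0, 4.2, "uh")])
cleaned = layer.keep_ranges()   # [(0.0, 1.0), (1.5, 4.0), (4.2, 10.0)]

layer.set_enabled(0, False)     # restore the first "um"
restored = layer.keep_ranges()  # [(0.0, 4.0), (4.2, 10.0)]
```

Because the source is never modified, disabling every cut reproduces the original recording exactly, which is what makes the edit reversible.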

Attribution:

BugsJustFindMe #1
wzdd #1
sublinear #1

Against the grain

Automatic cleanup is inherently an editorial choice

This argument rejected the premise that the problem can be solved cleanly at all. Once a token like “um” can be either disposable noise or load-bearing language, deciding whether to cut it depends on surrounding context and intent, not just acoustic boundaries. Better splicing does not remove that judgment call. It only hides it.

Use this tool where you are already comfortable making editorial decisions, like polished teaching material or solo narration. Avoid treating it as a neutral preprocessing step for interviews, research, or anything that needs faithful representation.

Attribution:

chrismorgan #1

Manual editing may still beat automation

One commenter claimed Audacity can do this quickly and with better results, pushing back on the need for a dedicated tool. The reply from someone who has actually done repetitive filler cleanup was that pattern-based manual approaches do not generalize well enough and erm performs better overall. The useful contrarian point is that if your files are short and your standards are high, manual review still sets the bar.

Do not assume automation wins on quality. Benchmark it against a human editor on a representative sample, especially for premium content where small timing mistakes are obvious.

Attribution:

monster_truck #1
alyssamazz #1

In plain English

Audacity

A free open source audio editor commonly used for recording and manual audio cleanup.

CLI

Command-line interface, meaning a program run from a shell or terminal rather than through a graphical interface.

DaVinci Resolve

A professional video editing application used for cutting, color grading, and audio post-production.

FFmpeg

A widely used open source multimedia framework for decoding, encoding, and processing audio and video.

LLM

Large Language Model, a machine learning system trained to generate and analyze text.

waveform

The audio signal itself, represented as amplitude values over time, as opposed to a higher-level representation such as a transcript.

Whisper

An automatic speech recognition model from OpenAI. Its standard pipeline resamples input audio to 16 kHz mono before transcription.

Reference links

Primary project

erm GitHub repository
Source code for the local CLI tool discussed in the post

Speech and filler-word references

Don't Worry About Saying Um... Effective Public Speaking Includes Filler Words
Cited to argue that filler words can play a useful role in spoken communication
Ums Considered Harmful
Linked jokingly as part of a campaign against filler words
Related SIGBOVIK paper on ums
Related paper linked alongside the tongue-in-cheek anti-um reference

Transcription and subtitle tools

Faster-Whisper-XXL
Mentioned as an open source Whisper-based subtitle generation workflow with extra tuning options

Audio and accessibility examples

Blind Microsoft developer using a screen reader at high speed
Example used to illustrate how some users adapt to extremely fast spoken audio