The post is a build log for erm, a local CLI that strips filler words from recordings without the ugly artifacts you get from naive cuts. The core problem is timing. Whisper can tell you roughly where a word is, but its timestamps drift enough that cutting directly on them clips consonants, leaves stutters, or creates clicks. The tool fixes that by finding the actual word boundaries in the waveform, then sliding cut points to nearby quiet spots and zero crossings before handing the splice off to ffmpeg.
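The cut-point adjustment described above can be sketched roughly as follows. This is a minimal illustration, not erm's actual code: the function names, the `search` and `win` parameters, and the idea of ranking windows by RMS energy are all assumptions made for the example.

```python
import math
from typing import Sequence


def rms(block: Sequence[float]) -> float:
    """Root-mean-square energy of a block of samples."""
    if not block:
        return 0.0
    return (sum(s * s for s in block) / len(block)) ** 0.5


def snap_cut(samples: Sequence[float], guess: int,
             search: int = 400, win: int = 32) -> int:
    """Move a rough cut index to a quiet zero crossing near `guess`.

    `search` and `win` are hypothetical tuning knobs, both in samples.
    """
    lo = max(0, guess - search)
    hi = min(len(samples) - win, guess + search)
    if hi < lo:
        return guess  # signal too short to search; keep the rough guess
    # Step 1: find the start of the quietest window in the search range.
    quiet = min(range(lo, hi + 1), key=lambda i: rms(samples[i:i + win]))
    # Step 2: inside that window, stop at the first zero or sign change.
    for i in range(quiet, quiet + win - 1):
        if samples[i] == 0.0:
            return i
        if (samples[i] < 0.0) != (samples[i + 1] < 0.0):
            return i + 1
    return quiet + win // 2  # no crossing found; fall back to the window centre


# Synthetic signal: loud tone, a short near-silent gap, then loud tone again.
loud = [math.sin(2 * math.pi * n / 50) for n in range(1000)]
gap = [0.001 * math.sin(2 * math.pi * n / 50) for n in range(200)]
signal = loud + gap + loud

# A rough timestamp that lands inside the loud tone gets pulled into the gap.
cut = snap_cut(signal, guess=950)
print(cut, abs(signal[cut]))
```

A real implementation would work on decoded audio at a known sample rate and would probably constrain the search to stay inside the detected word boundary, but the two-step idea (quiet region first, zero crossing second) is the part the post describes.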
Most people bought the engineering premise and immediately turned it into a workflow question. The strongest practical signal came from creators who already spend hours hand-editing courses and videos. For them, the value is not perfect linguistic cleanup. It is getting back a big chunk of editing time while avoiding the over-aggressive cuts common in automated podcast tools. That said, the thread did not treat filler words as disposable noise. Repeatedly, people pointed out that “um” can hold the floor in conversation, mark real thinking time, or soften meaning. Remove the wrong one and you do not just tighten pacing. You rewrite the speaker.
That led to a useful line between media types. In produced instructional content, dead air and padding are expensive because listeners already complain about bloated runtimes and often consume audio at 1.5x to 2x just to get to the point. In interviews, live conversation, and speech from non-native English speakers, hesitation often carries signal and heavy cleanup can make someone sound choppy, overconfident, or subtly different from what they meant. The practical consensus was sharp: this kind of tool is valuable when you want faster edits and shorter runtime, but it should behave like a careful assistant, not an editor with opinions about what speech ought to sound like.
If you produce courses, podcasts, or videos, this looks useful as a cleanup step, not a one-click truth machine. Use it where brevity matters, but keep human review for interviews, nuanced answers, non-native speakers, and anything where hesitation carries meaning.
Sentiment was mostly positive about the engineering and the tool's usefulness for content workflows, with sustained skepticism about the broader anti-filler crusade. People liked the narrow, local, practical tool, but many objected to treating all disfluencies as junk because they can carry meaning, conversational control, or evidence that someone is actually thinking.
Key insights
Course editing is the clearest use case
For recorded course production, erm saves real labor rather than just sounding clever. One creator said manual filler cleanup used to cost nearly a full day per hour of material and that erm recovers about 70 percent of that time, though it still needs tuning for edge cases such as non-native speakers, where the pause itself carries meaning.
If you make training or educational video, test this against one of your existing editing sessions and measure hours saved, not just audio quality. Keep a review pass for speakers whose pacing or hesitation is part of the content.
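Measuring hours saved is simple arithmetic, sketched below. The function name and the figures are hypothetical placeholders, chosen only so the result lines up with the roughly 70 percent one creator reported. Substitute timings from your own sessions, and count the review pass as part of the assisted time.

```python
def time_saved(manual_hours: float, assisted_hours: float) -> float:
    """Fraction of manual editing time recovered by the assisted workflow."""
    if manual_hours <= 0:
        raise ValueError("manual_hours must be positive")
    return (manual_hours - assisted_hours) / manual_hours


# Hypothetical figures for one hour of source material, not measurements.
manual = 8.0           # hand-editing the whole recording
assisted = 1.4 + 1.0   # tool run and tuning, plus a human review pass
print(f"{time_saved(manual, assisted):.0%} of editing time recovered")
```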
Several comments pointed past waveform surgery toward transcript-aware editing. Whisper already collapses some false starts on its own, another commenter argued for a second LLM pass to normalize self-corrections into clean text, and a video creator described a workflow that marks retakes with a spoken keyword and lets AI drive cuts in DaVinci Resolve. The useful framing is that filler removal is one small part of a broader edit-from-transcript pipeline.
If your team already transcribes recordings, do not isolate filler removal as a standalone feature. Build around transcript-guided cuts, retake markers, and editor integration so cleanup compounds across the whole workflow.
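A transcript-guided cut can be sketched as turning word timestamps into a list of ranges to keep, then handing those ranges to ffmpeg. Everything named here is an assumption for illustration: the `Word` tuple layout, the `FILLERS` set, and the choice of ffmpeg's `atrim`, `asetpts`, and `concat` filters to do the splice. It covers only filler removal, not retake markers, and it uses the timestamps as given, with no boundary refinement.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]      # (text, start_seconds, end_seconds)
Segment = Tuple[float, float]        # (start_seconds, end_seconds)
FILLERS = {"um", "uh", "erm"}        # hypothetical filler list


def keep_segments(words: List[Word], duration: float) -> List[Segment]:
    """Return the time ranges that remain after dropping filler words."""
    segments: List[Segment] = []
    cursor = 0.0
    for text, start, end in words:
        if text.lower().strip(".,!?") in FILLERS:
            if start > cursor:
                segments.append((cursor, start))
            cursor = max(cursor, end)
    if duration > cursor:
        segments.append((cursor, duration))
    return segments


def audio_filter(segments: List[Segment]) -> str:
    """Build an ffmpeg filter graph that trims each segment and joins them."""
    if not segments:
        raise ValueError("nothing left to keep")
    parts = [
        f"[0:a]atrim=start={s:.3f}:end={e:.3f},asetpts=PTS-STARTPTS[a{i}]"
        for i, (s, e) in enumerate(segments)
    ]
    labels = "".join(f"[a{i}]" for i in range(len(segments)))
    parts.append(f"{labels}concat=n={len(segments)}:v=0:a=1[out]")
    return ";".join(parts)


words = [("So", 0.0, 0.3), ("um", 0.3, 0.7), ("welcome", 0.8, 1.4),
         ("uh,", 1.5, 1.8), ("back", 1.9, 2.3)]
segments = keep_segments(words, duration=2.5)
print(segments)
print(audio_filter(segments))
```

The resulting string is meant to be passed to ffmpeg's `-filter_complex` option with `-map "[out]"`. The same segment list could just as easily drive an editor integration, which is the point of keeping the cut decision separate from the splice.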
A sharp reframing cut through the implementation details. The challenge is not identifying an “um.” It is removing it without damaging adjacent phonemes, rhythm, or room tone. That matches the post's actual mechanics, which are mostly about correcting weak timestamps and finding safe cut points rather than detecting fillers in the first place.
When evaluating similar tools, ask for before-and-after audio and failure cases around boundaries. Detection accuracy alone is the wrong metric if the audible artifacts are what make the output unusable.
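One crude way to put a number on boundary damage is to measure the sample-to-sample jump a cut leaves behind, since a large jump is what tends to be heard as a click. The sketch below is an assumed proxy, not a metric from the post or the thread, and it says nothing about clipped consonants or broken rhythm, which still need listening tests.

```python
import math
from typing import Sequence


def splice_jump(samples: Sequence[float], cut_start: int, cut_end: int) -> float:
    """Size of the jump created by removing samples[cut_start:cut_end]."""
    if not 0 < cut_start <= cut_end < len(samples):
        raise ValueError("cut must leave audio on both sides")
    # After the cut, samples[cut_start - 1] sits directly next to samples[cut_end].
    return abs(samples[cut_end] - samples[cut_start - 1])


# One tone, two candidate cuts: peak-to-trough versus crossing-to-crossing.
tone = [math.sin(2 * math.pi * n / 100) for n in range(200)]
rough = splice_jump(tone, 25, 75)    # joins a peak to a trough
clean = splice_jump(tone, 50, 100)   # joins two near-zero samples
print(f"rough={rough:.3f} clean={clean:.3f}")
```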
People want shorter audio because spoken media is bloated
The appetite for deleting filler is tied to a broader complaint that much online audio and video wastes time. Multiple comments described watching informational content at 2x speed, preferring transcripts when possible, and seeing creator incentives push material into longer, slower formats than the information warrants. In that environment, trimming padding is less a style preference than a response to low information density.
If you publish spoken content, filler cleanup is only a partial fix. The bigger opportunity is format discipline, tighter scripting, and offering transcripts so users can choose the fastest path.
The strongest objection was not technical. It was about authorship. People do not want platforms or dictation systems silently rewriting their speech patterns, because hesitation can carry mood, emphasis, and interpersonal meaning. A cleaned version may be easier to consume, but it is no longer a neutral copy of the original utterance.
Make filler stripping explicit and reversible. If you ship this in a product, expose it as a user-controlled edit layer rather than a default transformation of source audio or transcripts.
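A user-controlled edit layer can be sketched as a list of cuts stored next to the untouched source, each with its own on/off switch. The `Cut` and `EditLayer` names and fields below are hypothetical. The point of the design is that rendering always starts from the original samples, so any cut can be reverted later.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Cut:
    start: int            # first sample to drop
    end: int              # one past the last sample to drop
    label: str = ""       # e.g. the filler word that prompted the cut
    enabled: bool = True  # user-facing switch; False restores the audio


@dataclass
class EditLayer:
    cuts: List[Cut] = field(default_factory=list)

    def render(self, source: Sequence[float]) -> List[float]:
        """Return a new sample list with enabled cuts removed.

        `source` is only read, never modified.
        """
        out: List[float] = []
        cursor = 0
        active = sorted((c for c in self.cuts if c.enabled), key=lambda c: c.start)
        for cut in active:
            if cut.start > cursor:
                out.extend(source[cursor:cut.start])
            cursor = max(cursor, cut.end)
        out.extend(source[cursor:])
        return out


source = [float(n) for n in range(10)]
layer = EditLayer([Cut(2, 4, "um"), Cut(6, 8, "uh")])
print(layer.render(source))   # both cuts applied

layer.cuts[0].enabled = False  # the speaker wants that first "um" back
print(layer.render(source))
```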
Automatic cleanup is inherently an editorial choice
This argument rejected the premise that the problem can be solved cleanly at all. Once a token like “um” can either be disposable noise or load-bearing language, deciding whether to cut it depends on surrounding context and intent, not just acoustic boundaries. Better splicing does not remove that judgment call. It only hides it.
Use this tool where you are already comfortable making editorial decisions, like polished teaching material or solo narration. Avoid treating it as a neutral preprocessing step for interviews, research, or anything that needs faithful representation.
One commenter claimed Audacity can do this quickly and with better results, pushing back on the need for a dedicated tool. The reply from someone who has actually done repetitive filler cleanup was that pattern-based manual approaches do not generalize well enough and erm performs better overall. The useful contrarian point is that if your files are short and your standards are high, manual review still sets the bar.
Do not assume automation wins on quality. Benchmark it against a human editor on a representative sample, especially for premium content where small timing mistakes are obvious.
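A benchmark against a human editor could start by comparing cut lists, as in the sketch below. The function name, the greedy matching, and the 50 ms tolerance are assumptions for illustration. Agreement on where to cut is only a first filter, and the audible result still has to be judged by ear.

```python
from typing import List, Tuple

Cut = Tuple[float, float]  # (start_seconds, end_seconds)


def compare_cuts(auto: List[Cut], human: List[Cut],
                 tolerance: float = 0.05) -> Tuple[float, float]:
    """Return (precision, recall) of automated cuts against a human reference.

    A cut matches when both its start and end fall within `tolerance`
    seconds of a human cut that has not been matched already.
    """
    unused = list(human)
    matched = 0
    for a_start, a_end in auto:
        for h in unused:
            if abs(a_start - h[0]) <= tolerance and abs(a_end - h[1]) <= tolerance:
                unused.remove(h)
                matched += 1
                break
    precision = matched / len(auto) if auto else 0.0
    recall = matched / len(human) if human else 0.0
    return precision, recall


# Hypothetical cut lists for one short clip.
human_cuts = [(1.0, 1.4), (3.2, 3.6), (5.0, 5.3)]
auto_cuts = [(1.02, 1.41), (3.5, 3.9), (5.0, 5.3), (7.0, 7.2)]
precision, recall = compare_cuts(auto_cuts, human_cuts)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision means the tool is cutting things the editor chose to leave in, which is the failure mode the thread worried about most.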