HN Debrief

Regular expressions that work “everywhere”

  • Programming
  • Developer Tools
  • Standards
  • Open Source

The post argues that despite the mess of regex dialects, there is still a practical core you can use “everywhere,” including basics like character classes, anchors, repetition, grouping, alternation, and some escapes. Readers mostly agreed with the goal and disagreed with the confidence. The big correction was that syntax support is only half the problem. Defaults matter just as much. GNU grep and sed use POSIX Basic Regular Expressions by default, which means operators like `+`, `?`, `|`, and grouping often need `-E` or escaping before they mean what many developers expect. BSD and macOS sed diverge further, especially around shorthand classes like `\w` and `\s` and word boundaries like `\b`.

Several people pushed the discussion past syntax into semantics. POSIX and Perl-style engines can match the same pattern differently because they resolve alternatives and greediness differently, so “works everywhere” can still produce different matches. That turned the practical lesson into something stricter than the post itself: portability is not a list of tokens, it is a combination of dialect, defaults, and matching rules.

The most useful additions were references to attempts to standardize or fence off the problem, from POSIX BRE and RFC 9485’s I-Regexp to JSON Schema’s recommended subset and Russ Cox’s writing on regex engines and behavior. Another recurring complaint was that software and docs routinely say “supports regex” without naming the dialect, which is fine for language-internal code and terrible for user-facing configuration. A few side threads branched into tooling pain like nested escaping in shells, Python raw strings, and Emacs, plus the idea that regexes would have been easier to live with if they had evolved as a structured composable language instead of a pile of mini-DSLs.

If your product, docs, or config accepts regexes, name the exact dialect and matching behavior instead of saying just “regex.” For anything that must run across tools, test against the actual engines you care about and stick to a deliberately tiny subset.

Discussion mood

Mostly skeptical but constructive. People liked the attempt to carve out a portable subset, but the dominant reaction was that the post overstates interoperability because regex portability fails on engine defaults, dialect differences, and differing matching semantics long before you hit exotic features.

Key insights

  1. 01

    Portable syntax fails on default tools

    Even the supposedly safe subset breaks if you forget which engine a tool uses by default. GNU grep and sed start in POSIX Basic Regular Expressions mode, so operators many developers think of as baseline often need `-E` or different escaping. macOS and BSD sed are stricter still. They do not accept shorthand classes like `\w` and `\s`, and `\b` is replaced by the BSD word-boundary forms `[[:<:]]` and `[[:>:]]`, with no direct `\B` equivalent. That means a regex can look conservative and still be non-portable in ordinary shell workflows.

    If a pattern is headed for shell tools, write and test separate examples for GNU and BSD userlands. Prefer POSIX character classes over PCRE-style shorthands when you need broad command-line compatibility.

      Attribution:
    • LoganDark #1
    • semanticc #1
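
The advice to prefer POSIX character classes over Perl-style shorthands can be sketched as a small rewriting helper. This is a minimal illustration, not a full translator: the function name `to_posix_classes` and the mapping table are hypothetical, `[[:alnum:]_]` is only an approximation of `\w` (it is locale-dependent and narrower than Unicode-aware `\w`), and bracket tracking is deliberately naive (a literal `]` placed first in a bracket expression is not handled). It does not touch `+`, `?`, or `|`, which still need extended mode or escaping in BRE tools.

```python
# Hypothetical mapping from Perl-style shorthands to POSIX class names.
POSIX_CLASSES = {"d": "[:digit:]", "s": "[:space:]", "w": "[:alnum:]_"}


def to_posix_classes(pattern: str) -> str:
    """Rewrite \\d, \\s and \\w as POSIX bracket expressions (naive sketch)."""
    out = []
    in_bracket = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt in POSIX_CLASSES:
                inner = POSIX_CLASSES[nxt]
                # Inside [...] only the inner class name is needed.
                out.append(inner if in_bracket else "[" + inner + "]")
            else:
                # Leave every other escape untouched.
                out.append(ch + nxt)
            i += 2
            continue
        if ch == "[" and not in_bracket:
            in_bracket = True
        elif ch == "]" and in_bracket:
            in_bracket = False
        out.append(ch)
        i += 1
    return "".join(out)


print(to_posix_classes(r"\w+@\w+"))
```

The rewritten pattern still has to be tested against each real tool (for example GNU and BSD `sed`), since this helper only changes the spelling of the classes.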
  2. 02

    Matching semantics differ even when syntax parses

    A regex that compiles cleanly across engines can still match different text. POSIX leftmost-longest behavior and Perl or PCRE greedy backtracking do not resolve the same ambiguities the same way. One commenter pointed to a paper on patterns that are equivalent under both greedy semantics and leftmost-maximal semantics, which underlines how real this portability gap is. The hard problem is not just what tokens an engine accepts. It is whether the engine chooses the same match.

    For validation, extraction, or rewriting logic, test expected match results across engines instead of treating successful compilation as proof of portability. Avoid ambiguous patterns when results must be identical across platforms.

      Attribution:
    • jonstewart #1
    • agnishom #1
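
The gap between Perl-style and leftmost-longest matching is easy to see with an alternation such as `a|ab`. The sketch below uses Python's `re` module, which is a backtracking engine that takes the first alternative that succeeds, and compares it with a rough emulation of leftmost-longest choice. The emulation is an assumption-laden illustration: the helper names are hypothetical, and it only picks the longest result among top-level alternatives, so it is not a POSIX engine.

```python
import re


def perl_style(alternatives, text):
    """First-match-wins alternation, as Python's re resolves it."""
    pattern = "|".join(f"(?:{alt})" for alt in alternatives)
    m = re.search(pattern, text)
    return m.group(0) if m else None


def leftmost_longest(alternatives, text):
    """At the leftmost position where anything matches, keep the longest match."""
    compiled = [re.compile(alt) for alt in alternatives]
    for start in range(len(text) + 1):
        candidates = [
            m.group(0) for rx in compiled if (m := rx.match(text, start))
        ]
        if candidates:
            return max(candidates, key=len)
    return None


print(perl_style(["a", "ab"], "xab"))        # a
print(leftmost_longest(["a", "ab"], "xab"))  # ab
```

Both calls accept the same alternatives, yet they extract different text, which is why comparing match results rather than successful compilation is the meaningful portability test.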
  3. 03

    User-facing docs must name the dialect

    Calling something a “regex” without naming the dialect is sloppy when users have to write patterns themselves. The complaint was not about library internals. It was about configs, search boxes, spreadsheets, and tools where users should not have to infer whether the engine is RE2, PCRE, Python, or something else. The same gripe came up for Markdown, which has the same bad habit of pretending one fuzzy label is enough. Once regex leaves source code and becomes an interface, unspecified dialect is a documentation bug.

    In product docs and API references, spell out both the regex dialect and any unsupported features. Add one or two tested examples that show anchors, classes, and word boundaries in your actual engine.

      Attribution:
    • quotemstr #1 #2
    • bartread #1
    • xigoi #1
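
One way to keep documented examples honest is to store them as data and run them against the engine the product actually uses. The sketch below assumes, purely for illustration, a product whose docs say "patterns use Python `re` syntax and are applied with `re.search`". The example table and the helper name `failed_examples` are hypothetical.

```python
import re

# Hypothetical examples published in the docs: (pattern, input, should_match).
DOCUMENTED_EXAMPLES = [
    (r"^error", "error: disk full", True),   # anchor at start of input
    (r"^error", "an error", False),
    (r"[0-9]+$", "build 42", True),          # explicit class plus end anchor
    (r"\bcat\b", "a cat sat", True),         # word boundaries
    (r"\bcat\b", "concatenate", False),
]


def failed_examples(examples):
    """Return the (pattern, input) pairs whose real behavior contradicts the docs."""
    failures = []
    for pattern, text, should_match in examples:
        matched = re.search(pattern, text) is not None
        if matched != should_match:
            failures.append((pattern, text))
    return failures


print(failed_examples(DOCUMENTED_EXAMPLES))
```

Running a check like this in the documentation build means a change of engine or flags shows up as a failing example instead of a surprised user.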
  4. 04

    Standards and subsets are the only sane escape hatch

    Several references pointed to the same conclusion. Portability gets better only when you commit to a published subset or standard instead of hand-waving about “common regex.” POSIX BRE remains the old baseline, though even that has version caveats. JSON Schema publishes a recommended subset. RFC 9485 defines I-Regexp as an interoperable format. Russ Cox’s regex essays explain why engines differ in both features and behavior. Put together, they frame the problem as one of contract design, not syntax trivia.

    If you maintain a spec or multi-language product, choose a named regex subset and enforce it at the boundary. Point users to the exact standard instead of describing support informally.

      Attribution:
    • tonyg #1
    • myroon5 #1
    • dekdrop #1
    • JdeBP #1
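
Enforcing a subset at the boundary can start as a simple pre-check that rejects constructs outside an allow-list before a pattern is stored or forwarded. The sketch below is a hypothetical, deliberately tiny subset invented for illustration. It is not an implementation of I-Regexp, the JSON Schema recommendation, or POSIX, and it does not track bracket expressions, so something like `[(?]` would be flagged as a false positive.

```python
# Hypothetical allow-list: a backslash may only escape a metacharacter.
ALLOWED_ESCAPED = set(".^$*+?()[]{}|-") | {"\\"}


def subset_violations(pattern: str) -> list:
    """List constructs that fall outside the illustrative portable subset."""
    problems = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        nxt = pattern[i + 1:i + 2]  # empty string at end of pattern
        if ch == "\\":
            if nxt not in ALLOWED_ESCAPED:
                problems.append(f"unsupported escape at offset {i}")
            i += 2
            continue
        if ch == "(" and nxt == "?":
            problems.append(f"extended group syntax at offset {i}")
        elif ch in "*+?}" and nxt == "?":
            problems.append(f"lazy quantifier at offset {i}")
        i += 1
    return problems


print(subset_violations(r"^[a-z]+\.txt$"))
print(subset_violations(r"(?<=a)\w+"))
```

A real deployment would replace the allow-list with the grammar of whichever named standard the spec commits to and cite that standard in the error message.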
  5. 05

    Escaping and composition are still the daily pain

    The operational misery is not the regex language alone. It is the stack of surrounding string syntaxes. Shell quoting, Python raw strings, Emacs escaping, and generated scripts all add another failure mode before the pattern even reaches the engine. That led to a broader complaint that regexes grew as ad hoc mini-languages rather than structured, composable objects. Swift’s RegexBuilder was cited as a counterexample that treats pattern construction more like programming than string concatenation.

    When patterns are assembled dynamically or passed through multiple layers, stop treating them as opaque strings. Use builder APIs or at least centralize escaping and test the exact serialized pattern that reaches the engine.

      Attribution:
    • rtpg #1
    • brookst #1
    • afiori #1
    • woadwarrior01 #1
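
Centralizing escaping can be as small as one helper that every caller uses when user text must match literally. The sketch below uses Python's `re.escape`; the helper name `pattern_for_literal_line` and the sample text are hypothetical. It compares naive string concatenation with the escaped version and checks the final serialized pattern against the engine.

```python
import re


def pattern_for_literal_line(user_text: str) -> str:
    """Build a pattern that matches user_text literally as a whole line."""
    return "^" + re.escape(user_text) + "$"


text = "price (USD): $5.00"

# Naive concatenation: "(", ")", "$" and "." keep their regex meaning.
naive = "^" + text + "$"
# Centralized escaping: every metacharacter in the input is neutralized.
safe = pattern_for_literal_line(text)

print(re.search(naive, text) is not None)  # False
print(re.search(safe, text) is not None)   # True
```

The same idea applies one layer out: whatever shell, config file, or template carries the pattern should have a single escaping step for that layer, with the test run on the string that finally reaches the engine.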

Against the grain

  1. 01

    Internal code often does not need dialect ceremony

    For code that never accepts user-supplied patterns, specifying the regex dialect can be overkill. If a Python project uses Python regexes internally, other contributors usually infer that from the language and standard library. The argument here is that dialect ambiguity is mainly a problem at configuration and product boundaries, not inside ordinary source code written for peers in the same ecosystem.

    Reserve the heavy documentation for user-facing regex entry points. Inside codebases, focus more on tests and readability unless patterns cross process, language, or product boundaries.

      Attribution:
    • zahlman #1 #2
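
For internal patterns, readability and tests can be handled inside the host language. As a minimal sketch in Python, the `re.VERBOSE` flag allows whitespace and comments inside the pattern. The version-tag format and the function name `parse_version_tag` are hypothetical examples, not something taken from the discussion.

```python
import re

# Hypothetical internal pattern: tags shaped like "v1.4.2".
VERSION_TAG = re.compile(
    r"""
    ^v
    (?P<major>[0-9]+) \.   # major version
    (?P<minor>[0-9]+) \.   # minor version
    (?P<patch>[0-9]+)      # patch version
    $
    """,
    re.VERBOSE,
)


def parse_version_tag(tag: str):
    """Return (major, minor, patch) as integers, or None if the tag is malformed."""
    m = VERSION_TAG.match(tag)
    if m is None:
        return None
    return tuple(int(m.group(name)) for name in ("major", "minor", "patch"))


print(parse_version_tag("v1.4.2"))  # (1, 4, 2)
```

A few unit tests on the parsing function document the intended behavior more reliably than a note about the dialect would.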
  2. 02

    Regex composition is a tooling gap, not a theory limit

    The complaint that regexes cannot be composed was pushed back on as mostly an API design issue. Pattern strings can already be assembled before compilation, and languages can expose better abstractions if they choose. Swift’s RegexBuilder shows one path. Readability helpers or block syntax would solve much of the practical problem without changing what regular expressions are.

    Do not wait for a new regex standard to make patterns maintainable. Wrap common fragments, generate patterns before compile time, or adopt a host-language builder where one exists.

      Attribution:
    • wwind123 #1
    • galaxyLogic #1
    • ystlum #1
    • woadwarrior01 #1
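
Wrapping common fragments before compilation needs no special language support. The sketch below is a string-level analogue in Python, not Swift's RegexBuilder: a named fragment is defined once and composed with a small helper. The names `OCTET`, `repeat_joined`, and `IPV4` are hypothetical.

```python
import re

# One reusable fragment: a decimal number from 0 to 255, in a non-capturing group.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"


def repeat_joined(fragment: str, separator: str, count: int) -> str:
    """Repeat a fragment `count` times with `separator` between copies."""
    return separator.join([fragment] * count)


# Compose the full pattern from the fragment, then compile once.
IPV4 = re.compile(repeat_joined(OCTET, r"\.", 4))

print(IPV4.fullmatch("192.168.0.1") is not None)  # True
print(IPV4.fullmatch("256.1.1.1") is not None)    # False
```

Wrapping each fragment in a non-capturing group keeps composition safe, because a fragment's internal alternation cannot leak into its neighbors.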

In plain English

BRE
Basic Regular Expressions, the older POSIX regex dialect used by tools like grep and sed unless extended mode is enabled.
BSD
Berkeley Software Distribution, a Unix family whose command-line tools on systems like macOS often differ from GNU tools.
Emacs
A programmable text editor whose own regex and string escaping rules are a common source of confusion.
GNU
A widely used free-software project whose Unix-like toolchain and command-line utilities often behave differently from their BSD counterparts.
I-Regexp
Interoperable Regular Expression Format, a standardized subset intended to make regexes more portable across systems.
JSON Schema
A standard for describing and validating the structure of JSON data, including guidance on a portable regex subset.
PCRE
Perl Compatible Regular Expressions, a popular regex flavor that follows many Perl-style features and semantics.
POSIX
A family of standards that define common operating system interfaces, shell behavior, and the basic and extended regex dialects on Unix-like systems.
RE2
A regular expression engine from Google that guarantees linear-time matching by avoiding backtracking, which means it omits features like backreferences and lookaround.
RFC 9485
An Internet Engineering Task Force standard document defining I-Regexp, an interoperable regular expression format.

Reference links

Alternative regex tooling and models

  • Emacs rx notation
    Mentioned as a structured alternative to hand-writing raw regexes in Emacs Lisp.
  • Swift RegexBuilder proposal
    Given as an example of composable regex construction in a host language.
  • Apple RegexBuilder documentation
    Practical documentation for the Swift builder approach mentioned in the comments.
  • SNOBOL
    Raised as an older pattern-matching system that is more expressive and structured than classic regex syntax.