HN Debrief

Branchless Quicksort faster than std:sort and pdqsort with C and C++ API

  • Programming
  • Developer Tools
  • Infrastructure

The post presents BLQSort, a C and C++ sorting implementation built around branchless quicksort partitioning. The pitch is straightforward: in quicksort, a good pivot produces roughly 50/50 splits, which also makes the comparison outcome hard for a CPU branch predictor to guess. BLQSort sidesteps that by turning the hot partition step into arithmetic and unconditional stores instead of conditional jumps, and the benchmark charts in the post show it beating `std::sort` and pdqsort on random inputs.

Treat this as a specialized systems optimization, not a universal replacement sort. If your workload is random, in-cache, and compares cheap values, branchless partitioning can pay off. If your data is partially ordered, comparator-heavy, or uses nontrivial C++ types, benchmark before you switch and inspect the API constraints closely.

Discussion mood

Positive about the algorithmic idea and the engineering quality. Skeptical about the breadth of the performance claim. People liked the branchless partition trick, but they kept narrowing the win conditions to random data, cheap comparisons, cache-friendly writes, and suitably simple element types.

Key insights

  1. 01

    Rust standard sorts are already ahead

    Rust’s standard library already ships newer sort implementations that beat the posted numbers on at least one real machine. `ipnsort` was reported faster than BLQSort on 50 million doubles and small structs, while stable `driftsort` pulled far ahead on 90% sorted input. That shifts the comparison from “branchless quicksort beats the standard library” to “this beats older library choices on random data, but newer adaptive sorts can do better once the input has structure.”

    If you own a language runtime or platform library, the target is not `std::sort` folklore but the best current adaptive sort for your data mix. If you are an application developer in Rust, you likely already have a faster default than this post implies.

      Attribution:
    • orlp #1
    • vlovich123 #1
    • modulared #1
    • chrka #1 #2
  2. 02

    The C++ API is not truly generic

    The implementation bakes in assumptions that only hold for a narrow class of types. It default-constructs temporary `T` values and buffers, and uses copy paths in places where move-aware or uninitialized storage handling would be expected in serious generic C++. The author confirmed one fallback path was incorrectly used for non-trivially-copyable types. That means the algorithm may be fine, while the library interface still falls short of what C++ users expect from a drop-in sort.

    Do not read “C++ API” as “works like a standard-library algorithm on arbitrary comparable types.” Audit trait bounds, construction costs, and fallback paths before using it on strings, wrappers, or user-defined objects.

      Attribution:
    • quuxplusone #1 #2
    • chrka #1
  3. 03

    The demo loop wins by changing semantics

    The branchless filtering example is persuasive only if you focus on the returned prefix and ignore extra writes outside the useful range. It unconditionally stores every examined value, then advances the output length only for matches. That is why it can be compiled into straight-line code. It is also why a compiler cannot safely transform the ordinary `if` version into the same machine code in general. The optimization depends on accepting different intermediate side effects and different final garbage beyond `smlen`.

    When you copy this style, check whether your program truly observes only the logical prefix and not the transient or out-of-range writes. If those writes are visible to concurrency, signals, memory-mapped I/O, or invariants on untouched storage, the trick stops being free.

      Attribution:
    • kleiba2 #1
    • edelind #1
    • bhaak #1
    • addaon #1
    • djray #1
    • flohofwoe #1
  4. 04

    Branchless helps when comparisons are coin flips

    The useful hardware framing was not “modern CPUs hate branches.” It was that branch prediction fails specifically when the branch outcome is close to random. Good quicksort partitioning creates exactly that pattern because values should split around the pivot about evenly. In that case, trading a misprediction-heavy branch for an unconditional cache-friendly store is a good bargain. Once the condition becomes predictable, ordinary branching often regains the edge.

    Use branchless rewrites surgically in hot loops whose predicate is genuinely unpredictable. Do not spread them through code that mostly sees skewed or patterned data, because then you are paying extra dependencies and stores for no benefit.

      Attribution:
    • bagxrvxpepzn #1
    • chrka #1
    • adrian_b #1
    • adrianmonk #1
    • teo_zero #1

Against the grain

  1. 01

    The benchmarks dodge harder sorting cases

    The posted results center on random numeric inputs, which are the friendliest possible case for this design. People asked for comparisons against vectorized sorting-network implementations and for tests on non-random data and expensive keys like long strings. That objection lands because once comparisons dominate, or the input has structure, the branchless partition trick is no longer the main cost center.

    Before adopting any new sort, build a benchmark set that matches your production keys and distributions. Random integers are useful for microarchitecture work, but they are a weak proxy for logs, strings, records, and partially ordered business data.

      Attribution:
    • mgaunard #1
    • jeffbee #1
    • dekdrop #1
    • Aardwolf #1

In plain english

BLQSort
A sorting algorithm in the post that uses branchless quicksort partitioning to reduce branch mispredictions.
branch prediction
A CPU feature that guesses which way a conditional jump will go so execution can continue without stalling.
branchless
Code written to avoid conditional jumps in favor of arithmetic, masks, or conditional-move style operations.
driftsort
A stable Rust sorting algorithm mentioned in the comments that performs especially well on partially sorted data.
heapsort
A comparison sorting algorithm often used as a fallback because it guarantees good worst-case performance.
ipnsort
A Rust standard-library unstable sorting algorithm mentioned in the comments as a newer high-performance replacement.
pdqsort
Pattern-defeating quicksort, a fast quicksort variant designed to handle certain bad input patterns better than classic quicksort.
pivot
The value a quicksort-style algorithm uses to split elements into lower and higher partitions.
std::sort
The common C++ standard library sorting function for ordering a range of elements.

Reference links

Rust sorting implementations

Branchless and CPU behavior references

Books and related reading

  • Hacker's Delight
    Recommended as a practical book full of branch-free and bit-manipulation techniques similar to the article’s examples.
  • When ‘if’ slows you down, avoid it
    Linked from the same site as background reading that explains the branchless loop example in more detail.

Miscellaneous project links

  • misfortunate crate documentation
    Referenced during a side discussion about pathological but valid trait implementations in Rust.
  • misfortunate crate index
    Linked to explain the crate’s notion of violating the “social contract” while still satisfying Rust trait requirements.