Branchless Quicksort faster than std:sort and pdqsort with C and C++ API

Programming
Developer Tools
Infrastructure

The post presents BLQSort, a C and C++ sorting implementation built around branchless quicksort partitioning. The pitch is straightforward: in quicksort, a good pivot produces roughly 50/50 splits, which also makes the comparison outcome hard for a CPU branch predictor to guess. BLQSort sidesteps that by turning the hot partition step into arithmetic and unconditional stores instead of conditional jumps, and the benchmark charts in the post show it beating `std::sort` and pdqsort on random inputs.

Treat this as a specialized systems optimization, not a universal replacement sort. If your workload is random, in-cache, and compares cheap values, branchless partitioning can pay off. If your data is partially ordered, comparator-heavy, or uses nontrivial C++ types, benchmark before you switch and inspect the API constraints closely.

June 5, 2026
tiki.li
Discuss on HN

Key insights

Rust standard sorts are already ahead

Rust’s standard library already ships newer sort implementations that beat the posted numbers on at least one real machine. `ipnsort` was reported faster than BLQSort on 50 million doubles and small structs, while stable `driftsort` pulled far ahead on 90% sorted input. That shifts the comparison from “branchless quicksort beats the standard library” to “this beats older library choices on random data, but newer adaptive sorts can do better once the input has structure.”

If you own a language runtime or platform library, the target is not `std::sort` folklore but the best current adaptive sort for your data mix. If you are an application developer in Rust, you likely already have a faster default than this post implies.

Attribution:

orlp #1
vlovich123 #1
modulared #1
chrka #1 #2

The C++ API is not truly generic

The implementation bakes in assumptions that only hold for a narrow class of types. It default-constructs temporary `T` values and buffers, and uses copy paths in places where move-aware or uninitialized storage handling would be expected in serious generic C++. The author confirmed one fallback path was incorrectly used for non-trivially-copyable types. That means the algorithm may be fine, while the library interface still falls short of what C++ users expect from a drop-in sort.

Do not read “C++ API” as “works like a standard-library algorithm on arbitrary comparable types.” Audit trait bounds, construction costs, and fallback paths before using it on strings, wrappers, or user-defined objects.

Attribution:

quuxplusone #1 #2
chrka #1

The demo loop wins by changing semantics

The branchless filtering example is persuasive only if you focus on the returned prefix and ignore extra writes outside the useful range. It unconditionally stores every examined value, then advances the output length only for matches. That is why it can be compiled into straight-line code. It is also why a compiler cannot safely transform the ordinary `if` version into the same machine code in general. The optimization depends on accepting different intermediate side effects and different final garbage beyond `smlen`.

When you copy this style, check whether your program truly observes only the logical prefix and not the transient or out-of-range writes. If those writes are visible to concurrency, signals, memory-mapped I/O, or invariants on untouched storage, the trick stops being free.

Attribution:

kleiba2 #1
edelind #1
bhaak #1
addaon #1
djray #1
flohofwoe #1

Branchless helps when comparisons are coin flips

The useful hardware framing was not “modern CPUs hate branches.” It was that branch prediction fails specifically when the branch outcome is close to random. Good quicksort partitioning creates exactly that pattern because values should split around the pivot about evenly. In that case, trading a misprediction-heavy branch for an unconditional cache-friendly store is a good bargain. Once the condition becomes predictable, ordinary branching often regains the edge.

Use branchless rewrites surgically in hot loops whose predicate is genuinely unpredictable. Do not spread them through code that mostly sees skewed or patterned data, because then you are paying extra dependencies and stores for no benefit.

Attribution:

bagxrvxpepzn #1
chrka #1
adrian_b #1
adrianmonk #1
teo_zero #1

Against the grain

The benchmarks dodge harder sorting cases

The posted results center on random numeric inputs, which are the friendliest possible case for this design. People asked for comparisons against vectorized sorting-network implementations and for tests on non-random data and expensive keys like long strings. That objection lands because once comparisons dominate, or the input has structure, the branchless partition trick is no longer the main cost center.

Before adopting any new sort, build a benchmark set that matches your production keys and distributions. Random integers are useful for microarchitecture work, but they are a weak proxy for logs, strings, records, and partially ordered business data.

Attribution:

mgaunard #1
jeffbee #1
dekdrop #1
Aardwolf #1

In plain english

BLQSort ↩

A sorting algorithm in the post that uses branchless quicksort partitioning to reduce branch mispredictions.

branch prediction ↩

A CPU technique that guesses which path a conditional will take so it can keep work moving efficiently.

branchless ↩

Code written to avoid conditional jumps in favor of arithmetic, masks, or conditional-move style operations.

driftsort ↩

A stable Rust sorting algorithm mentioned in the comments that performs especially well on partially sorted data.

heapsort ↩

A comparison sorting algorithm often used as a fallback because it guarantees good worst-case performance.

ipnsort ↩

A Rust standard-library unstable sorting algorithm mentioned in the comments as a newer high-performance replacement.

pdqsort ↩

Pattern-defeating quicksort, a fast quicksort variant designed to handle certain bad input patterns better than classic quicksort.

pivot ↩

The value a quicksort-style algorithm uses to split elements into lower and higher partitions.

std::sort ↩

The common C++ standard library sorting function for ordering a range of elements.

Reference links

Rust sorting implementations

Rust 1.81.0 new sort implementations
Referenced to identify when Rust shipped the newer standard-library sort implementations discussed in the comments.
glidesort repository
Linked as the predecessor algorithm from which driftsort was derived.

Branchless and CPU behavior references

Value speculation and branch behavior
Cited to argue that branchless code is often worse than branches on modern CPUs except in specific unpredictable cases.
Linus on cmov and branch prediction
Cited alongside the value-speculation article to support the claim that conditional moves and branchless rewrites are often overused.
Godbolt example of branchless vs branchful compilation
Used to show the generated assembly for the loop example and explain when compilers do or do not eliminate branches.
Godbolt comparison of equivalent loop forms
Used to demonstrate that once the unconditional store is hoisted, compilers can emit the same branchless code for both source forms.

Books and related reading

Hacker's Delight
Recommended as a practical book full of branch-free and bit-manipulation techniques similar to the article’s examples.
When ‘if’ slows you down, avoid it
Linked from the same site as background reading that explains the branchless loop example in more detail.

Miscellaneous project links

misfortunate crate documentation
Referenced during a side discussion about pathological but valid trait implementations in Rust.
misfortunate crate index
Linked to explain the crate’s notion of violating the “social contract” while still satisfying Rust trait requirements.

Branchless Quicksort faster than std:sort and pdqsort with C and C++ API

Discussion mood

Key insights

Against the grain

In plain english

Reference links

Rust sorting implementations

Branchless and CPU behavior references

Books and related reading

Miscellaneous project links