Perl bindings for djbsort โ Daniel J. Bernstein's
constant-time, SIMD-friendly sorting library.
Marian (HackMan) Marinov <mm@yuhu.biz>
n wires.[1, 4, 7, 9, 8, 5, 3, 2]n in logโ n stages.n/2 compare-and-swaps at a fixed stride (n/2, then n/4, โฆ, then 1).
Extra log n factor over quicksort on paper โ bought back by SIMD + branchless execution.
6 stages · logโ(8) · (logโ(8) + 1) / 2 = 6 compare-and-swap columns ·
same schedule for every input — that's what makes it constant-time and SIMD-friendly.
click the animation to pause / resume
sortverif/ โ binary-analysis tools that prove correctnessint32 / uint32 / int64 / uint64 / float32 / float64A bitonic sorting network: the sequence of compare-and-swap operations is fixed โ it does not depend on the input values.
Cost: O(n ยท logยฒ n) comparisons โ worse than quicksort on paper. Win: no branches, no key-dependent memory access, 8 int32s per AVX2 instruction.
djbsort_src/The XS distribution bundles the portable C sources verbatim from djbsort:
djbsort_src/
djbsort.h -- public API
crypto_int32.h -- constant-time primitives (inline asm for
crypto_int64.h x86_64, aarch64, arm, sparc; C fallback)
int32_sort.c -- the core: portable4 bitonic network
int64_sort.c -- ditto for 64-bit
uint32_sort.c -- thin wrapper: flip sign bit, call int32
uint64_sort.c
float32_sort.c -- thin wrapper: total-ordering trick on bits
float64_sort.c
*down_sort.c -- descending variants
djbsort_dispatch.c -- runtime symbol dispatch
Everything is C89-portable. On a system with libdjbsort installed you get AVX2 / SSE4.2 / NEON automatically; otherwise the bundled portable build is used.
Sort::DJB::PureA direct Perl translation of djbsort's portable4/sort.c โ same network, same passes:
sub _minmax {
# branch-free compare-and-swap: guarantees $_[0] <= $_[1]
@_[0, 1] = @_[1, 0] if $_[0] > $_[1];
}
sub _int_sort {
my ($x) = @_; my $n = scalar @$x; return if $n < 2;
my $top = 1; # outer while: pick top power-of-two
$top += $top while $top < $n - $top;
for (my $p = $top; $p >= 1; $p >>= 1) { # outer for: stride p halves each pass
my $i = 0;
while ($i + 2 * $p <= $n) { # inner while: walk blocks of size 2p
for (my $j = $i; $j < $i + $p; ++$j) {# inner for: p parallel compare-and-swaps
_minmax($x->[$j], $x->[$j + $p]);
}
$i += 2 * $p;
}
# ... merge passes with q > p (elided) ...
}
}
Signed/unsigned/float are handled by bit-fiddling wrappers around one _int_sort โ the same trick djbsort uses in C.
Sort::DJB (XS)Every sort function is one macro expansion โ marshal โ sort โ unmarshal:
#define SORT_BODY(name, av_ref, ctype, cfunc, av_to_fn, to_av_fn) { \
AV *av; SSize_t n; ctype *buf; AV *result; \
if (!SvROK(av_ref) || SvTYPE(SvRV(av_ref)) != SVt_PVAV) \
croak(name ": argument must be an array reference"); \
av = (AV *)SvRV(av_ref); \
buf = av_to_fn(aTHX_ av, &n); /* SV -> flat C buffer */ \
if (n > 0) { \
cfunc(buf, (long long)n); /* djbsort does the work */ \
result = to_av_fn(aTHX_ buf, n); /* C buffer -> new AV */ \
Safefree(buf); \
} else { result = newAV(); } \
mXPUSHs(newRV_noinc((SV *)result)); \
XSRETURN(1); \
}
void sort_int32(av_ref) SV *av_ref
PPCODE:
SORT_BODY("sort_int32", av_ref, int32_t, djbsort_int32,
av_to_int32, int32_to_av)
One macro; 12 sort functions โ one call site each. Input is an arrayref; a new arrayref comes back โ the input is never mutated.
One SORT_BODY macro, one djbsort C entry point per type. Signed/unsigned/float differ only in a bit-level pre-pass.
| Ascending | Descending | Data type | Perl SV |
|---|---|---|---|
sort_int32 | sort_int32down | signed 32-bit | IV |
sort_uint32 | sort_uint32down | unsigned 32-bit | UV |
sort_int64 | sort_int64down | signed 64-bit | IV |
sort_uint64 | sort_uint64down | unsigned 64-bit | UV |
sort_float32 | sort_float32down | 32-bit float | NV |
sort_float64 | sort_float64down | 64-bit double | NV |
Import tags: :int32 :uint32 :int64 :uint64 :float32 :float64 :all. All take an arrayref, return a new arrayref, never mutate.
use Sort::DJB qw(:all);
my $sorted = sort_int32( [5, 3, 1, 4, 2] );
# => [1, 2, 3, 4, 5]
my $desc = sort_int32down( [5, 3, 1, 4, 2] );
# => [5, 4, 3, 2, 1]
my $floats = sort_float64( [3.14, 1.41, 2.72, 0.58] );
# => [0.58, 1.41, 2.72, 3.14]
# what's actually running under the hood?
print Sort::DJB::version(); # "20260210"
print Sort::DJB::arch(); # "portable" or "amd64"
print Sort::DJB::int32_implementation(); # "portable4" or "avx2"
12 typed functions, arrayref-in / arrayref-out, input never mutated.
| # | Implementation | Rate | vs Perl sort |
|---|---|---|---|
| 1 | Sort::DJB XS (AVX2 bitonic) | 407 / s | 9.0ร |
| 2 | Sort::Packed (pack + sort + unpack) | 147 / s | 3.2ร |
| 3 | Sort::Key::Radix (O(n) radix) | 103 / s | 2.3ร |
| 4 | Perl builtin sort { $a <=> $b } | 45.5 / s | 1.0ร (baseline) |
| 5 | Sort::Key (key-cached mergesort) | 34.3 / s | 0.75ร |
| 6 | Sort::DJB::Pure (bitonic in Perl) | 0.81 / s | 0.018ร |
Sort::Packed = pack("l*", @data) โ C-level in-place sort โ unpack("l*", $buf).
It works on a flat binary buffer instead of Perl SVs, which is why it beats the builtin
sort โ but the pack/unpack round-trip is what caps it at ~3ร.
| Type | Perl sort | Best CPAN alt | Sort::DJB XS | Speedup |
|---|---|---|---|---|
| int32 | 44 / s | 150 / s (Packed) | 407 / s (AVX2) | 9.0ร |
| uint32 | 44 / s | 141 / s (Packed) | 97.9 / s | 2.2ร |
| int64 | 43 / s | 122 / s (Packed) | 88.3 / s | 2.1ร |
| float64 | 22 / s | 108 / s (Packed) | 99.7 / s | 4.6ร |
int32 gets the AVX2 boost; other types use the bundled portable build in this run (still competitive with the best CPAN option, and 2โ5ร over Perl's builtin).
SvIV/SvNV
on every element, then wrapping the sorted output back into fresh SVs.
Raw pre-packed buffers sort ~3ร faster than the same data going through the round-trip.log n more comparisons than
quicksort/mergesort. AVX2 buys it back with 8-way SIMD; the portable build breaks even.Fast for big arrays is easy โ fast for small arrays means low per-call cost.
| Implementation | Rate | vs Perl sort |
|---|---|---|
| Sort::DJB XS | 4 221 744 / s | 1.54ร |
| Perl builtin sort | 2 749 454 / s | 1.00ร |
| Sort::Key::Radix | 2 535 828 / s | 0.92ร |
| Sort::Key | 1 894 758 / s | 0.69ร |
| Sort::DJB::Pure | 122 736 / s | 0.04ร |
| Implementation | Rate | vs Perl sort |
|---|---|---|
| Sort::DJB XS | 2 203 017 / s | 1.21ร |
| Perl builtin sort | 1 815 337 / s | 1.00ร |
| Sort::Key::Radix | 1 302 285 / s | 0.72ร |
| Sort::Key | 1 031 797 / s | 0.57ร |
XS overhead is amortised even at n=5 โ no CPAN alternative is faster for tiny arrays.
Skip pack/unpack: pre-pack the buffer once, benchmark only sort_packed.
This isolates the sort itself from SV conversion.
| Buffer type | Rate (raw sort only) |
|---|---|
| PackedR_i32 (int32) | 293 / s |
| PackedR_i64 (int64) | 208 / s |
| PackedR_f64 (float64) | 164 / s |
| Path | Rate | Overhead cost |
|---|---|---|
| Raw packed sort (no pack/unpack) | 293 / s | โ |
| Full Sort::Packed pipeline | 150 / s | ~49% lost to pack/unpack |
| Sort::DJB XS (AVX2) | 407 / s | beats even the raw buffer |
Takeaway: SIMD sorting is fast enough that on the AVX2 path, Sort::DJB's SV round-trip still wins against a fully in-place raw-buffer sort.
{ $a->{name} cmp $b->{name} })
Rule of thumb: if you'd naturally pack("l*", @data) before touching it, Sort::DJB is a straight win.
github.com/hackman/Sort-DJB ยท djbsort by DJB: sorting.cr.yp.to
Marian (HackMan) Marinov <mm@yuhu.biz>
Questions?
cpanm Sort::DJB ยท
use Sort::DJB qw(:all);