Sort::DJB — Perl bindings for djbsort

Sort::DJB

Bitonic sorting networks for Perl

Perl bindings for djbsort — Daniel J. Bernstein's constant-time, SIMD-friendly sorting library.

Marian (HackMan) Marinov <mm@yuhu.biz>

github.com/hackman/Sort-DJB

What is a sorting network?

Sorting network

A fixed choreography of compare-and-swap operations on n wires.
The schedule of comparisons is decided at compile time — it does not look at the values.
Feed any permutation in → sorted output comes out.
Branch-free, data-independent timing, trivially parallel.

Bitonic sequence

A sequence that first ascends, then descends (or vice versa).
Example: [1, 4, 7, 9, 8, 5, 3, 2]
Purely ascending or descending counts too — one of the halves is empty.

Batcher's bitonic sort (1968)

The merger

Merges a bitonic sequence of length n in log₂ n stages.
Each stage: n/2 compare-and-swaps at a fixed stride (n/2, then n/4, …, then 1).
Full sort: recursively build bitonic halves, then merge. Total: O(n · log² n) compare-and-swaps.

Why it wins in practice

No branches → no mispredictions.
Independent compare-and-swaps at each stage → AVX2 does 8 at a time.
Fixed strides → prefetcher-friendly memory access.

Extra log n factor over quicksort on paper — bought back by SIMD + branchless execution.

Bitonic sort in action · n = 8

▐▐ PAUSED

6 stages · log₂(8) · (log₂(8) + 1) / 2 = 6 compare-and-swap columns · same schedule for every input — that's what makes it constant-time and SIMD-friendly.
click the animation to pause / resume

What is djbsort?

Sorting library by Daniel J. Bernstein, first released as (July 10, 2018)
FAST claims fastest for typical array sizes on 64-bit Intel/AMD/ARM
CONST data-independent execution — safe for cryptographic code
VERIFIED ships sortverif/ — binary-analysis tools that prove correctness
Only sorts numeric types: int32 / uint32 / int64 / uint64 / float32 / float64

The trick

A bitonic sorting network: the sequence of compare-and-swap operations is fixed — it does not depend on the input values.

Cost: O(n · log² n) comparisons — worse than quicksort on paper. Win: no branches, no key-dependent memory access, 8 int32s per AVX2 instruction.

Upstream: `djbsort_src/`

The XS distribution bundles the portable C sources verbatim from djbsort:

djbsort_src/
  djbsort.h                 -- public API
  crypto_int32.h            -- constant-time primitives (inline asm for
  crypto_int64.h               x86_64, aarch64, arm, sparc; C fallback)
  int32_sort.c              -- the core: portable4 bitonic network
  int64_sort.c              -- ditto for 64-bit
  uint32_sort.c             -- thin wrapper: flip sign bit, call int32
  uint64_sort.c
  float32_sort.c            -- thin wrapper: total-ordering trick on bits
  float64_sort.c
  *down_sort.c              -- descending variants
  djbsort_dispatch.c        -- runtime symbol dispatch

Everything is C89-portable. On a system with libdjbsort installed you get AVX2 / SSE4.2 / NEON automatically; otherwise the bundled portable build is used.

Implementation: `Sort::DJB::Pure`

A direct Perl translation of djbsort's portable4/sort.c — same network, same passes:

sub _minmax {
    # branch-free compare-and-swap: guarantees $_[0] <= $_[1]
    @_[0, 1] = @_[1, 0] if $_[0] > $_[1];
}

sub _int_sort {
    my ($x) = @_; my $n = scalar @$x; return if $n < 2;

    my $top = 1;                                  # outer while: pick top power-of-two
    $top += $top while $top < $n - $top;
    for (my $p = $top; $p >= 1; $p >>= 1) {       # outer for: stride p halves each pass
        my $i = 0;
        while ($i + 2 * $p <= $n) {               # inner while: walk blocks of size 2p
            for (my $j = $i; $j < $i + $p; ++$j) {# inner for: p parallel compare-and-swaps
                _minmax($x->[$j], $x->[$j + $p]);
            }
            $i += 2 * $p;
        }
        # ... merge passes with q > p (elided) ...
    }
}

Signed/unsigned/float are handled by bit-fiddling wrappers around one _int_sort — the same trick djbsort uses in C.

Implementation: `Sort::DJB` (XS)

Every sort function is one macro expansion — marshal → sort → unmarshal:

#define SORT_BODY(name, av_ref, ctype, cfunc, av_to_fn, to_av_fn) {    \
    AV *av; SSize_t n; ctype *buf; AV *result;                          \
    if (!SvROK(av_ref) || SvTYPE(SvRV(av_ref)) != SVt_PVAV)             \
        croak(name ": argument must be an array reference");            \
    av  = (AV *)SvRV(av_ref);                                           \
    buf = av_to_fn(aTHX_ av, &n);        /* SV -> flat C buffer     */ \
    if (n > 0) {                                                        \
        cfunc(buf, (long long)n);        /* djbsort does the work    */ \
        result = to_av_fn(aTHX_ buf, n); /* C buffer -> new AV       */ \
        Safefree(buf);                                                  \
    } else { result = newAV(); }                                        \
    mXPUSHs(newRV_noinc((SV *)result));                                 \
    XSRETURN(1);                                                        \
}

void sort_int32(av_ref) SV *av_ref
    PPCODE:
        SORT_BODY("sort_int32", av_ref, int32_t, djbsort_int32,
                  av_to_int32, int32_to_av)

One macro; 12 sort functions — one call site each. Input is an arrayref; a new arrayref comes back — the input is never mutated.

The 12 sort functions

One SORT_BODY macro, one djbsort C entry point per type. Signed/unsigned/float differ only in a bit-level pre-pass.

Ascending	Descending	Data type	Perl SV
`sort_int32`	`sort_int32down`	signed 32-bit	IV
`sort_uint32`	`sort_uint32down`	unsigned 32-bit	UV
`sort_int64`	`sort_int64down`	signed 64-bit	IV
`sort_uint64`	`sort_uint64down`	unsigned 64-bit	UV
`sort_float32`	`sort_float32down`	32-bit float	NV
`sort_float64`	`sort_float64down`	64-bit double	NV

Import tags: :int32 :uint32 :int64 :uint64 :float32 :float64 :all. All take an arrayref, return a new arrayref, never mutate.

The Perl API

use Sort::DJB qw(:all);

my $sorted = sort_int32( [5, 3, 1, 4, 2] );
# => [1, 2, 3, 4, 5]

my $desc = sort_int32down( [5, 3, 1, 4, 2] );
# => [5, 4, 3, 2, 1]

my $floats = sort_float64( [3.14, 1.41, 2.72, 0.58] );
# => [0.58, 1.41, 2.72, 3.14]

# what's actually running under the hood?
print Sort::DJB::version();               # "20260210"
print Sort::DJB::arch();                  # "portable" or "amd64"
print Sort::DJB::int32_implementation();  # "portable4" or "avx2"

12 typed functions, arrayref-in / arrayref-out, input never mutated.

Performance: int32, n = 100 000

With system libdjbsort (AVX2)

#	Implementation	Rate	vs Perl `sort`
1	Sort::DJB XS (AVX2 bitonic)	407 / s	9.0×
2	Sort::Packed (pack + sort + unpack)	147 / s	3.2×
3	Sort::Key::Radix (O(n) radix)	103 / s	2.3×
4	Perl builtin `sort { $a <=> $b }`	45.5 / s	1.0× (baseline)
5	Sort::Key (key-cached mergesort)	34.3 / s	0.75×
6	Sort::DJB::Pure (bitonic in Perl)	0.81 / s	0.018×

Sort::Packed = pack("l*", @data) → C-level in-place sort → unpack("l*", $buf). It works on a flat binary buffer instead of Perl SVs, which is why it beats the builtin sort — but the pack/unpack round-trip is what caps it at ~3×.

Across all types (n = 100 000)

Type	Perl `sort`	Best CPAN alt	Sort::DJB XS	Speedup
int32	44 / s	150 / s (Packed)	407 / s (AVX2)	9.0×
uint32	44 / s	141 / s (Packed)	97.9 / s	2.2×
int64	43 / s	122 / s (Packed)	88.3 / s	2.1×
float64	22 / s	108 / s (Packed)	99.7 / s	4.6×

int32 gets the AVX2 boost; other types use the bundled portable build in this run (still competitive with the best CPAN option, and 2–5× over Perl's builtin).

What limits the XS wins?

SV ↔ C marshalling — walking the AV, calling SvIV/SvNV on every element, then wrapping the sorted output back into fresh SVs. Raw pre-packed buffers sort ~3× faster than the same data going through the round-trip.
Radix sort is asymptotically O(n) — but the SV overhead eats the advantage; Sort::Key::Radix is only ~2.3× over builtin.
Perl 5.40's builtin sort is genuinely good — Sort::Key's key-cached mergesort can actually lose to it at large sizes.
Bitonic is O(n · log² n) — a factor of log n more comparisons than quicksort/mergesort. AVX2 buys it back with 8-way SIMD; the portable build breaks even.

Call overhead: tiny arrays

Fast for big arrays is easy — fast for small arrays means low per-call cost.

n = 5 elements

Implementation	Rate	vs Perl `sort`
Sort::DJB XS	4 221 744 / s	1.54×
Perl builtin sort	2 749 454 / s	1.00×
Sort::Key::Radix	2 535 828 / s	0.92×
Sort::Key	1 894 758 / s	0.69×
Sort::DJB::Pure	122 736 / s	0.04×

n = 10 elements

Implementation	Rate	vs Perl `sort`
Sort::DJB XS	2 203 017 / s	1.21×
Perl builtin sort	1 815 337 / s	1.00×
Sort::Key::Radix	1 302 285 / s	0.72×
Sort::Key	1 031 797 / s	0.57×

XS overhead is amortised even at n=5 — no CPAN alternative is faster for tiny arrays.

The real ceiling: raw packed buffers

Skip pack/unpack: pre-pack the buffer once, benchmark only sort_packed. This isolates the sort itself from SV conversion.

n = 100 000

Buffer type	Rate (raw sort only)
PackedR_i32 (int32)	293 / s
PackedR_i64 (int64)	208 / s
PackedR_f64 (float64)	164 / s

Compare to the full pipeline (int32, n = 100 000)

Path	Rate	Overhead cost
Raw packed sort (no pack/unpack)	293 / s	—
Full Sort::Packed pipeline	150 / s	~49% lost to pack/unpack
Sort::DJB XS (AVX2)	407 / s	beats even the raw buffer

Takeaway: SIMD sorting is fast enough that on the AVX2 path, Sort::DJB's SV round-trip still wins against a fully in-place raw-buffer sort.

Where Sort::DJB shines

Great fit

Homogeneous numeric arrays — telemetry, sensor readings, timestamps.
Fixed-schema record IDs — user IDs, order IDs, session IDs (uint32/uint64).
Time-series buckets — epoch-milli int64, latency samples.
PDL / packed number arrays — already flat, cheap to marshal.
Crypto / post-quantum code — constant-time is a feature, not overhead.
Percentile / median calculations — sort once, index everywhere.

Bad fit

Sort by arbitrary comparator ({ $a->{name} cmp $b->{name} })
Arrays of strings, mixed types, or blessed refs
Schwartzian-transform workloads (sort by derived key)
Very small arrays where marshalling dominates and comparator is trivial
On-disk / larger-than-RAM sorts

Rule of thumb: if you'd naturally pack("l*", @data) before touching it, Sort::DJB is a straight win.

Thanks!

github.com/hackman/Sort-DJB · djbsort by DJB: sorting.cr.yp.to

Marian (HackMan) Marinov <mm@yuhu.biz>

Questions?

cpanm Sort::DJB · use Sort::DJB qw(:all);

Note: You are viewing an auto-generated directory index.