Sort::DJB โ€” Perl bindings for djbsort

Sort::DJB

Bitonic sorting networks for Perl

Perl bindings for djbsort โ€” Daniel J. Bernstein's constant-time, SIMD-friendly sorting library.

Marian (HackMan) Marinov   <mm@yuhu.biz>

github.com/hackman/Sort-DJB

What is a sorting network?

Sorting network

  • A fixed choreography of compare-and-swap operations on n wires.
  • The schedule of comparisons is decided at compile time โ€” it does not look at the values.
  • Feed any permutation in โ†’ sorted output comes out.
  • Branch-free, data-independent timing, trivially parallel.

Bitonic sequence

  • A sequence that first ascends, then descends (or vice versa).
  • Example: [1, 4, 7, 9, 8, 5, 3, 2]
  • Purely ascending or descending counts too โ€” one of the halves is empty.

Batcher's bitonic sort (1968)

The merger

  • Merges a bitonic sequence of length n in logโ‚‚ n stages.
  • Each stage: n/2 compare-and-swaps at a fixed stride (n/2, then n/4, โ€ฆ, then 1).
  • Full sort: recursively build bitonic halves, then merge. Total: O(n ยท logยฒ n) compare-and-swaps.

Why it wins in practice

  • No branches โ†’ no mispredictions.
  • Independent compare-and-swaps at each stage โ†’ AVX2 does 8 at a time.
  • Fixed strides โ†’ prefetcher-friendly memory access.

Extra log n factor over quicksort on paper โ€” bought back by SIMD + branchless execution.

Bitonic sort in action · n = 8

▐▐ PAUSED phase 1: pairs phase 2: 4-blocks phase 3: 8-block merge 0: 1: 2: 3: 4: 5: 6: 7: 5 3 8 2 7 1 6 4 35 82 17 64 32 85 67 14 23 58 76 41 23 41 76 58 21 43 56 78 12 34 56 78 stage 1 stage 2 stage 3 stage 4 stage 5 stage 6 Input · unsorted Stage 1 · sort pairs asc / desc / asc / desc (builds bitonic 4-blocks) Stage 2 · stride 2 within each 4-block Stage 3 · stride 1 — the array is now bitonic (length 8) Stage 4 · stride 4 — full 8-way merge begins Stage 5 · stride 2 Sorted ✓ [1, 2, 3, 4, 5, 6, 7, 8]

6 stages · logโ‚‚(8) · (logโ‚‚(8) + 1) / 2 = 6 compare-and-swap columns · same schedule for every input — that's what makes it constant-time and SIMD-friendly.
click the animation to pause / resume

What is djbsort?

  • Sorting library by Daniel J. Bernstein, first released as (July 10, 2018)
  • FAST claims fastest for typical array sizes on 64-bit Intel/AMD/ARM
  • CONST data-independent execution โ€” safe for cryptographic code
  • VERIFIED ships sortverif/ โ€” binary-analysis tools that prove correctness
  • Only sorts numeric types: int32 / uint32 / int64 / uint64 / float32 / float64

The trick

A bitonic sorting network: the sequence of compare-and-swap operations is fixed โ€” it does not depend on the input values.

Cost: O(n ยท logยฒ n) comparisons โ€” worse than quicksort on paper. Win: no branches, no key-dependent memory access, 8 int32s per AVX2 instruction.

Upstream: djbsort_src/

The XS distribution bundles the portable C sources verbatim from djbsort:

djbsort_src/
  djbsort.h                 -- public API
  crypto_int32.h            -- constant-time primitives (inline asm for
  crypto_int64.h               x86_64, aarch64, arm, sparc; C fallback)
  int32_sort.c              -- the core: portable4 bitonic network
  int64_sort.c              -- ditto for 64-bit
  uint32_sort.c             -- thin wrapper: flip sign bit, call int32
  uint64_sort.c
  float32_sort.c            -- thin wrapper: total-ordering trick on bits
  float64_sort.c
  *down_sort.c              -- descending variants
  djbsort_dispatch.c        -- runtime symbol dispatch

Everything is C89-portable. On a system with libdjbsort installed you get AVX2 / SSE4.2 / NEON automatically; otherwise the bundled portable build is used.

Implementation: Sort::DJB::Pure

A direct Perl translation of djbsort's portable4/sort.c โ€” same network, same passes:

sub _minmax {
    # branch-free compare-and-swap: guarantees $_[0] <= $_[1]
    @_[0, 1] = @_[1, 0] if $_[0] > $_[1];
}

sub _int_sort {
    my ($x) = @_; my $n = scalar @$x; return if $n < 2;

    my $top = 1;                                  # outer while: pick top power-of-two
    $top += $top while $top < $n - $top;
    for (my $p = $top; $p >= 1; $p >>= 1) {       # outer for: stride p halves each pass
        my $i = 0;
        while ($i + 2 * $p <= $n) {               # inner while: walk blocks of size 2p
            for (my $j = $i; $j < $i + $p; ++$j) {# inner for: p parallel compare-and-swaps
                _minmax($x->[$j], $x->[$j + $p]);
            }
            $i += 2 * $p;
        }
        # ... merge passes with q > p (elided) ...
    }
}

Signed/unsigned/float are handled by bit-fiddling wrappers around one _int_sort โ€” the same trick djbsort uses in C.

Implementation: Sort::DJB (XS)

Every sort function is one macro expansion โ€” marshal โ†’ sort โ†’ unmarshal:

#define SORT_BODY(name, av_ref, ctype, cfunc, av_to_fn, to_av_fn) {    \
    AV *av; SSize_t n; ctype *buf; AV *result;                          \
    if (!SvROK(av_ref) || SvTYPE(SvRV(av_ref)) != SVt_PVAV)             \
        croak(name ": argument must be an array reference");            \
    av  = (AV *)SvRV(av_ref);                                           \
    buf = av_to_fn(aTHX_ av, &n);        /* SV -> flat C buffer     */ \
    if (n > 0) {                                                        \
        cfunc(buf, (long long)n);        /* djbsort does the work    */ \
        result = to_av_fn(aTHX_ buf, n); /* C buffer -> new AV       */ \
        Safefree(buf);                                                  \
    } else { result = newAV(); }                                        \
    mXPUSHs(newRV_noinc((SV *)result));                                 \
    XSRETURN(1);                                                        \
}

void sort_int32(av_ref) SV *av_ref
    PPCODE:
        SORT_BODY("sort_int32", av_ref, int32_t, djbsort_int32,
                  av_to_int32, int32_to_av)

One macro; 12 sort functions โ€” one call site each. Input is an arrayref; a new arrayref comes back โ€” the input is never mutated.

The 12 sort functions

One SORT_BODY macro, one djbsort C entry point per type. Signed/unsigned/float differ only in a bit-level pre-pass.

Ascending Descending Data type Perl SV
sort_int32 sort_int32down signed 32-bit IV
sort_uint32 sort_uint32down unsigned 32-bit UV
sort_int64 sort_int64down signed 64-bit IV
sort_uint64 sort_uint64down unsigned 64-bit UV
sort_float32 sort_float32down 32-bit float NV
sort_float64 sort_float64down 64-bit double NV

Import tags: :int32 :uint32 :int64 :uint64 :float32 :float64 :all. All take an arrayref, return a new arrayref, never mutate.

The Perl API

use Sort::DJB qw(:all);

my $sorted = sort_int32( [5, 3, 1, 4, 2] );
# => [1, 2, 3, 4, 5]

my $desc = sort_int32down( [5, 3, 1, 4, 2] );
# => [5, 4, 3, 2, 1]

my $floats = sort_float64( [3.14, 1.41, 2.72, 0.58] );
# => [0.58, 1.41, 2.72, 3.14]

# what's actually running under the hood?
print Sort::DJB::version();               # "20260210"
print Sort::DJB::arch();                  # "portable" or "amd64"
print Sort::DJB::int32_implementation();  # "portable4" or "avx2"

12 typed functions, arrayref-in / arrayref-out, input never mutated.

Performance: int32, n = 100 000

With system libdjbsort (AVX2)

#ImplementationRatevs Perl sort
1Sort::DJB XS (AVX2 bitonic)407 / s9.0ร—
2Sort::Packed (pack + sort + unpack)147 / s3.2ร—
3Sort::Key::Radix (O(n) radix)103 / s2.3ร—
4Perl builtin sort { $a <=> $b }45.5 / s1.0ร— (baseline)
5Sort::Key (key-cached mergesort)34.3 / s0.75ร—
6Sort::DJB::Pure (bitonic in Perl)0.81 / s0.018ร—

Sort::Packed = pack("l*", @data) โ†’ C-level in-place sort โ†’ unpack("l*", $buf). It works on a flat binary buffer instead of Perl SVs, which is why it beats the builtin sort โ€” but the pack/unpack round-trip is what caps it at ~3ร—.

Across all types (n = 100 000)

TypePerl sortBest CPAN altSort::DJB XSSpeedup
int3244 / s150 / s (Packed)407 / s (AVX2)9.0ร—
uint3244 / s141 / s (Packed)97.9 / s2.2ร—
int6443 / s122 / s (Packed)88.3 / s2.1ร—
float6422 / s108 / s (Packed)99.7 / s4.6ร—

int32 gets the AVX2 boost; other types use the bundled portable build in this run (still competitive with the best CPAN option, and 2โ€“5ร— over Perl's builtin).

What limits the XS wins?

  • SV โ†” C marshalling โ€” walking the AV, calling SvIV/SvNV on every element, then wrapping the sorted output back into fresh SVs. Raw pre-packed buffers sort ~3ร— faster than the same data going through the round-trip.
  • Radix sort is asymptotically O(n) โ€” but the SV overhead eats the advantage; Sort::Key::Radix is only ~2.3ร— over builtin.
  • Perl 5.40's builtin sort is genuinely good โ€” Sort::Key's key-cached mergesort can actually lose to it at large sizes.
  • Bitonic is O(n ยท logยฒ n) โ€” a factor of log n more comparisons than quicksort/mergesort. AVX2 buys it back with 8-way SIMD; the portable build breaks even.

Call overhead: tiny arrays

Fast for big arrays is easy โ€” fast for small arrays means low per-call cost.

n = 5 elements

ImplementationRatevs Perl sort
Sort::DJB XS4 221 744 / s1.54ร—
Perl builtin sort2 749 454 / s1.00ร—
Sort::Key::Radix2 535 828 / s0.92ร—
Sort::Key1 894 758 / s0.69ร—
Sort::DJB::Pure122 736 / s0.04ร—

n = 10 elements

ImplementationRatevs Perl sort
Sort::DJB XS2 203 017 / s1.21ร—
Perl builtin sort1 815 337 / s1.00ร—
Sort::Key::Radix1 302 285 / s0.72ร—
Sort::Key1 031 797 / s0.57ร—

XS overhead is amortised even at n=5 โ€” no CPAN alternative is faster for tiny arrays.

The real ceiling: raw packed buffers

Skip pack/unpack: pre-pack the buffer once, benchmark only sort_packed. This isolates the sort itself from SV conversion.

n = 100 000

Buffer typeRate (raw sort only)
PackedR_i32 (int32) 293 / s
PackedR_i64 (int64) 208 / s
PackedR_f64 (float64)164 / s

Compare to the full pipeline (int32, n = 100 000)

PathRateOverhead cost
Raw packed sort (no pack/unpack)293 / sโ€”
Full Sort::Packed pipeline150 / s~49% lost to pack/unpack
Sort::DJB XS (AVX2)407 / sbeats even the raw buffer

Takeaway: SIMD sorting is fast enough that on the AVX2 path, Sort::DJB's SV round-trip still wins against a fully in-place raw-buffer sort.

Where Sort::DJB shines

Great fit

  • Homogeneous numeric arrays โ€” telemetry, sensor readings, timestamps.
  • Fixed-schema record IDs โ€” user IDs, order IDs, session IDs (uint32/uint64).
  • Time-series buckets โ€” epoch-milli int64, latency samples.
  • PDL / packed number arrays โ€” already flat, cheap to marshal.
  • Crypto / post-quantum code โ€” constant-time is a feature, not overhead.
  • Percentile / median calculations โ€” sort once, index everywhere.

Bad fit

  • Sort by arbitrary comparator ({ $a->{name} cmp $b->{name} })
  • Arrays of strings, mixed types, or blessed refs
  • Schwartzian-transform workloads (sort by derived key)
  • Very small arrays where marshalling dominates and comparator is trivial
  • On-disk / larger-than-RAM sorts

Rule of thumb: if you'd naturally pack("l*", @data) before touching it, Sort::DJB is a straight win.

Thanks!

github.com/hackman/Sort-DJB  ยท  djbsort by DJB: sorting.cr.yp.to

Marian (HackMan) Marinov   <mm@yuhu.biz>

Questions?

cpanm Sort::DJB  ยท  use Sort::DJB qw(:all);

Note: You are viewing an auto-generated directory index.