032025· Concurrency · Go

SplitOrderedLists

A lock-free, extensible hash table in Go that grows without locks or rehashing — the Shalev–Shavit split-ordered design.

Private · guided study

Goatomic CASlock-freeHarris/Michael listgenerics

428k

Get ops/ms @128 goroutines

one EPYC 9534 socket, 64 physical cores, NUMA-pinned; vs sync.Map 244k

~90×

Put throughput over a global mutex

@64 goroutines on the 64-core socket

3µs

insert P99 during a 5M-key resize

vs a mutex map's 786µs P99

allocations on the read path

0 B/op — Go generics avoid boxing

The problem

Concurrent hash tables built on locks pay a growing penalty under contention: a global mutex serializes every operation, and even sharded designs stall when the table resizes, because a conventional rehash must visit and move every item. The goal was a hash table that scales across cores and absorbs growth without ever blocking readers or writers or relocating existing data.

The approach

I implemented the Shalev–Shavit Split-Ordered Lists algorithm (JACM 53(3), 2006) in Go: a single lock-free linked list sorted by bit-reversed (split-order) keys, with the hash buckets as pointers into that list. Data nodes use Reverse(key | (1<<63)) so their low bit is 1; dummy sentinel nodes use Reverse(bucketIndex) so their low bit is 0, which guarantees each dummy sorts ahead of its bucket's items. Growth doubles the bucket count via an atomic CAS on the size field with no items moved; new buckets are initialized lazily and recursively, where a bucket's parent (its index with the highest set bit cleared) is set up first, forming a tree rooted at bucket 0.

Key decisions

Deletion is the hard part of any lock-free list, so I used Harris's two-phase scheme (logical mark on the next-pointer, then physical unlink) with Michael's prev.next revalidation in find(). To pay zero heap allocations per mutation, I packed the Harris mark into the low bit of an unsafe.Pointer and mutated exclusively via atomic CAS on those tagged pointers, relying on Go's ≥8-byte struct alignment and the collector's tracing of unsafe.Pointer to stay safe under GC. I chose Go generics over interface{} to eliminate boxing on the value path, which is what keeps reads at zero allocations. The guarantee is lock-free, not wait-free, and the linearization point of each operation is documented (Insert at the CAS swinging prev.next, Delete at the CAS setting the mark bit, Find at the read on an unmarked node).

Outcome

Correctness is covered by 16 concurrent tests (9 uint64-key, 7 string-key) under Go's race detector, with methodology drawn from the JSR-166 TCK, Go's own sync.Map tests, and Herlihy–Wing linearizability. On one socket of an AMD EPYC 9534 (64 physical cores, NUMA-pinned via numactl) under SLURM, Get throughput reached 172,369 ops/ms at 32 goroutines and 428,706 at 128 goroutines (software threads oversubscribed onto the 64 cores), versus 101,630 and 244,281 for sync.Map and ~18–22k for a global-mutex map. Writes show the gap even more starkly: at 64 goroutines Put throughput held near 99,800 ops/ms against the global mutex's ~1,100 — roughly 90×. Reads are zero-allocation; Put costs 2 allocs/op at ~48 B/op. The resize behaviour is where the design pays off most: under a 5M-key resize at 32 goroutines, insert tail latency held at P99 3µs / P999 14µs against the mutex baseline's P99 786µs / P999 2.0ms. That 262× tighter tail is the point: a structure whose worst case stays bounded under contention — no lock convoy, no stop-the-world pause — is the property latency-sensitive systems are built on, from a render loop that can't drop a frame to an order book that can't miss a quote.

Structure — one sorted list

Dummies sort by bit-reversed bucket index (0, 2, 1, 3), so each sits just ahead of its bucket's items. Growth doubles the bucket count with a single CAS on the size field — existing nodes never move.

MeasuredWhere the lock-free design pays off

It scales with cores — and barely flinches mid-resize.

Real numbers from the committed benchmark suite, not a marketing curve. Flip between raw throughput and the tail latency during a live resize.

Get ops/ms

split-orderedsync.Map

sourceAMD EPYC 9534 · one socket · 64 physical cores · NUMA-pinned (numactl) · SLURM