Deterministic Shuffling And Bucketing With Cryptographic Hashing Functions

  • Date:

Sometimes it is useful to be able to shuffle a series of items in a reproducible, deterministic manner. For a video game, a developer might design various rooms, areas or objectives that can be arranged in an arbitrary order so the players have a unique experience whenever they start a new game. A common scenario that often comes up in my line of work is deciding the order in which updates are performed across a fleet of machines.

Shuffling

One way to shuffle a list of values is to generate a random value for each item in the list then sort the items using the random values as the keys. The items can be deterministically shuffled by simply replacing the random values with deterministically generated ones derived from each item. Cryptographic hashing functions have a few properties that make them good candidates for this use case: they are deterministic, so any given value always produces the same hash; minor changes to the input significantly change the output, so two similar messages will have drastically different hashes that have no apparent correlation with one another (also known as the "avalanche effect"); and the output is uniformly random, the importance of which will be explained later.

Avalanche effect example showing the SHA-256 hashes of www1.codevat.com, www2,codevat.com and www3.codevat.com formatted as a UML activity diagram with "SHA-256" between each hostname and the associatedhash

Here is an implementation of a deterministic shuffle in Python using that approach with SHA-256 as the hashing primitive:

import hashlib

def deterministic_shuffle(items):
    """
    Shuffle items in a deterministic manner; the same set of inputs will
    always be returned in the same arbitrary order.
    """
    return sorted(items, key=lambda x: hashlib.sha256(x).digest())

An example:

>>> deterministic_shuffle([
...     b"www1.codevat.com",
...     b"www2.codevat.com",
...     b"www3.codevat.com",
... ])
[b'www3.codevat.com', b'www1.codevat.com', b'www2.codevat.com']

A potentially beneficial property of this method of shuffling is that items will always be shuffled in the same relative order; "www3.codevat.com" will always come before "www1.codevat.com" even if "www2…" is removed from the list and/or other strings are added.

>>> deterministic_shuffle([
...     b"www1.codevat.com",
...     b"www3.codevat.com",
...     b"www6.codevat.com",
... ])
[b'www6.codevat.com', b'www3.codevat.com', b'www1.codevat.com']

By introducing a salt, the variance becomes configurable much like seeding a random number generator.

import hashlib

def deterministic_shuffle(items, salt=b""):
    """
    Shuffle items in a deterministic manner; the same set of inputs with
    a particular salt will always be returned in the same arbitrary order.
    """
    return sorted(items, key=lambda x: hashlib.sha256(x + salt).digest())

Changing the salt generally changes the order of the output.

>>> deterministic_shuffle([b"www1.codevat.com", b"www2.codevat.com",
... b"www3.codevat.com"])
[b'www3.codevat.com', b'www1.codevat.com', b'www2.codevat.com']

>>> deterministic_shuffle([b"www1.codevat.com", b"www2.codevat.com",
... b"www3.codevat.com"], b"xyz")
[b'www2.codevat.com', b'www3.codevat.com', b'www1.codevat.com']

>>> deterministic_shuffle([b"www1.codevat.com", b"www2.codevat.com",
... b"www3.codevat.com"], b"123")
[b'www1.codevat.com', b'www3.codevat.com', b'www2.codevat.com']

If the values returned by the cryptographic hash function are uniformly random and uncorrelated with the inputs, and the shuffle function does, in fact, shuffle its inputs, the mean and standard deviation of the final position of every item in the list should be roughly the same over a large number of trials. This script shuffles a list of 100 initially sorted strings in the form of "www%d.codevat.com" using a random salt and tracks the shuffled positions across 1,000 trials:

import os

trials = 1000
items = [b"www%d.codevat.com" % n for n in range(100)]
indexes = dict((item, index) for index, item in enumerate(items))
shuffled_indexes = [list() for item in items]

for _ in range(trials):
    for index, item in enumerate(deterministic_shuffle(items, os.urandom(10))):
        shuffled_indexes[indexes[item]].append(index)

Here are the aggregate statistics graphed as a scatter plot:

Shuffle statistics graphed as a scatter plot showing the mean and the mean plus or minus the standard deviation

In a perfectly uniform distribution, the mean of the shuffled positions would be 49.5 with a standard deviation of 100/sqrt(12) or ≈28.9 (derivation). This data lines up just about perfectly.

Bucketing

Another problem in the same vein as deterministic shuffling is deterministic bucketing. The most straightforward way of doing this is to deterministically shuffle the list then divide the it into equally sized chunks.

import hashlib

def deterministic_shuffle(items, salt=b""):
    """
    Shuffle items in a deterministic manner; the same set of inputs with
    a particular salt will always be returned in the same arbitrary order.
    """
    return sorted(items, key=lambda x: hashlib.sha256(x + salt).digest())

def deterministic_bucket(items, count, salt=b""):
    """
    Group elements of an iterable into "N" buckets in a deterministic manner
    that produces the same output regardless of the order of the input.
    """
    buckets = [list() for _ in range(count)]

    for index, item in enumerate(deterministic_shuffle(items, salt)):
        buckets[index % count].append(item)

    return buckets

An example:

>>> items = [b"AAA", b"BBB", b"CCC", b"DDD", b"EEE", b"FFF"]
>>> deterministic_bucket(items, 3)
[[b'FFF', b'AAA'], [b'DDD', b'EEE'], [b'CCC', b'BBB']]

Note that adding or removing items from the list may change which buckets the remaining items are in.

>>> items.pop()
b'FFF'
>>> deterministic_bucket(items, 3)
[[b'DDD', b'EEE'], [b'CCC', b'BBB'], [b'AAA']]

This form of naive bucketing is not acceptable for every use case. In the scenario where a fleet of servers is being updated, the buckets might represent days of the week each host should be updated. If adding or removing servers changed the buckets, the servers would not consistently be updated every 7 days. For A/B testing, the buckets could determine which experiment users see — something that typically should not change just because the number of subscribers changed. This can be addressed by selecting a bucket using an item's hash.

import hashlib

def stable_deterministic_bucket(items, count, salt=b""):
    """
    Group elements of an iterable into "N" buckets in a deterministic manner
    that produces the same output regardless of the order of the input. Adding
    or removing items from the input will not affect which buckets the
    remaining items are placed into.
    """
    buckets = [list() for _ in range(count)]
    entries = [(hashlib.sha256(item + salt).digest(), item) for item in items]

    for key, item in sorted(entries):
        key_as_int = int.from_bytes(key, byteorder="little")
        buckets[key_as_int % count].append(item)

    return buckets

A bar graph showing statistics associated with flipping a coin using a random number generator that outputs 3 values versus one that outputs 101 values

The item's hash is converted to an unsigned integer and that value modulo the number of buckets determines where the item is placed much like the original implementation. Use of the modulo operator in this manner introduces bias for some bucket sizes (specifically when 2256 mod n ≠ 0). Consider a random number generator used to simulate coin flips via random_integer() % 2 with 0 representing heads and 1 representing tails. If "random_integer" returns 0, 1 or 2, then heads will come up 66.7% of the time, but as the range of values increases, the bias decreases; if "random_integer" returns [0, ..., 100], heads will come up 50.5% of the time. Because SHA-256 can theoretically produce 2256 unique values, the impact of modulo bias is negligible for any reasonable number of buckets.

For a given salt and hashing algorithm, a particular item will always placed in the same bucket regardless of the presence or absence of other items.

>>> items = [b"AAA", b"BBB", b"CCC", b"DDD", b"EEE", b"FFF"]
>>> stable_deterministic_bucket(items, 3)
[[b'CCC', b'DDD'], [b'BBB', b'EEE'], [b'AAA', b'FFF']]
>>> items.pop()
b'FFF'
>>> items.append(b"XXX")
>>> stable_deterministic_bucket(items, 3)
[[b'DDD', b'CCC'], [b'XXX', b'EEE', b'BBB'], [b'AAA']]

The relative ordering of the items within the buckets is also preserved regardless of the order of the inputs.

>>> import random
>>> random.shuffle(items)
>>> items
[b'AAA', b'XXX', b'EEE', b'CCC', b'DDD', b'BBB']
>>> stable_deterministic_bucket(items, 3)
[[b'DDD', b'CCC'], [b'XXX', b'EEE', b'BBB'], [b'AAA']]

The stability comes at a cost: the number of items in each bucket is no longer guaranteed to be equal (±1) as was the case with the original function, but with a sufficiently large number of inputs, the buckets will have roughly the same number of occupants.

>>> import os
>>> items = [os.urandom(10) for _ in range(10000)]
>>> [len(bucket) for bucket in stable_deterministic_bucket(items, 3)]
[3339, 3302, 3359]

Wrap-up

The following code implements generalized (read: not restrcited to operating on bytes) forms of deterministic shuffling, naive deterministic bucketing and stable deterministic bucketing using the methods described above:

#!/usr/bin/env python3
__author__ = "Eric Pruitt (https://www.codevat.com)"
__license__ = "Public Domain or MIT"

import hashlib

DEFAULT_HASHING_ALGORITHM = "sha256"


def keyer(salt, *, cast=None, algorithm=None):
    """
    Return a function that can be used to generate ordering e.g. for use as the
    "key" argument to the "sorted" function.

    - salt: Salt used when computing the hash of objects.
    - cast: A function used to interpret objects that are neither strings nor
      bytes. The function must accept an object as its only argument and return
      either a string or bytes. This defaults to `str`.
    - algorithm: Cryptographic hashing algorithm used as the hashing primitive.
      This can be any string supported by hashlib i.e. any value listed in
      `hashlib.algorithms_available`. The default value is "sha256".

    Raises:
    - TypeError: The salt is neither a string nor bytes.

    Returns: A function that takes an object as its only argument and returns
    bytes.
    """
    if isinstance(salt, str):
        salt = salt.encode("utf-8")
    elif not isinstance(salt, bytes):
        raise TypeError("salt must be a string or bytes")

    if algorithm is None:
        algorithm = DEFAULT_HASHING_ALGORITHM

    if cast is None:
        cast = str

    def key(obj):
        if not isinstance(obj, (str, bytes)):
            obj = cast(obj)

        if not isinstance(obj, bytes):
            obj = obj.encode("utf-8")

        return hashlib.new(algorithm, obj + salt).digest()

    return key


def deterministic_shuffle(items, salt=b"", *, cast=None, algorithm=None,
  mutate=True):
    """
    Shuffle a series of items in a repeatable, deterministic manner; the order
    of the shuffled items will always be the same for a given salt and
    algorithm. If some members are added or removed between invocations of this
    function, the relative order of elements in both inputs is maintained. The
    order of the items in the input has no impact on the order of the output
    unless there is a hash collision.

    Arguments:
    - items
    - salt: Salt used when computing the hash of objects.
    - cast: A function used to interpret objects that are neither strings nor
      bytes. The function must accept an object as its only argument and return
      either a string or bytes. This defaults to `str`.
    - algorithm: Cryptographic hashing algorithm used as the hashing primitive.
      This can be any string supported by hashlib i.e. any value listed in
      `hashlib.algorithms_available`. The default value is "sha256".
    - mutate: When this is True, the items are shuffled in place. When this is
      false, a shuffled list is returned instead.

    Raises:
    - TypeError: The salt is neither a string nor bytes.

    Returns: If "mutate" is False, a list is a returned.
    """
    shuffled = sorted(items, key=keyer(salt, cast=cast, algorithm=algorithm))

    if not mutate:
        return shuffled

    items[:] = shuffled


def deterministic_bucket(items, count, salt=b"", *, cast=None, algorithm=None,
  stably=False):
    """
    Group elements of an iterable into "N" buckets in a deterministic manner
    that produces the same output regardless of the order of the input.

    Arguments:
    - items
    - count: Desired number of buckets.
    - salt: Salt used when computing the hash of objects.
    - cast: A function used to interpret objects that are neither strings nor
      bytes. The function must accept an object as its only argument and return
      either a string or bytes. This defaults to `str`.
    - algorithm: Cryptographic hashing algorithm used as the hashing primitive.
      This can be any string supported by hashlib i.e. any value listed in
      `hashlib.algorithms_available`. The default value is "sha256".
    - stably: A boolean value that determines whether bucketing is done in a
      stable manner; when this is True, a particular item will always placed in
      the same bucket regardless of the presence or absence of other items.

    Raises:
    - TypeError: The salt is neither a string nor bytes.

    Returns: A list of lists. Each inner list is a bucket of items.
    """
    sort_key = keyer(salt, cast=cast, algorithm=algorithm)
    buckets = [list() for _ in range(count)]

    if stably:
        for key, item in sorted([(sort_key(item), item) for item in items]):
            key_as_int = int.from_bytes(key, byteorder="little")
            buckets[key_as_int % count].append(item)

    else:
        for index, item in enumerate(sorted(items, key=sort_key)):
            buckets[index % count].append(item)

    return buckets