Quantiles In MySQL

Posted on

There are tons of different ways to calculate quantiles in MySQL, but unfortunately, the most prominent methods calculate quantiles using a series of discrete statements or a single statement that can only be used to compute one quantile at a time and cannot be simultaneously combined with other MySQL group functions. Taking advantage, of GROUP_CONCAT and context-sensitive type coercion, I created macro that allows the calculation of arbitrary quantiles that can be used in a query in conjunction with other group functions.

The process is fairly straightforward: calculate the index of desired quantile in the sorted set of observations then pad every single observation so the string length of each observation is the same size. To get the substring indices of the selected observation, we simply multiply the fractional quantile by the number of observations and the length of each string. When the index of the quantile falls between two observations, a simple linear approximation is used.

Here's an implementation of the quantile calculation in PHP that generates a clause that can be used in a MySQL statement in any place where the built-in group functions are also valid:

 * Macro to generate a MySQL clause that returns the quantile of a set of
 * observations.
 * This example calculates the median age of a list of students:
 *     $median = MySQL_Quantile_Macro(50, 'Ages');
 *     $query = "SELECT $median AS MedianAges FROM Students";
 * The `group_concat_max_len` setting needs to be high enough to accommodate
 * storing all of the right-tail observations in memory simultaneously, that is
 * to say `group_concat_max_len` must be greater than or equal to
 * CEIL(($maxlen) * (1 - $quantile / 100) * COUNT($observations)).
 * @param float $quantile Desired quantile
 * @param string $observations Column or calculated value representing the
 * observations. Must not contain any group functions.
 * @param int $maxlen Padded record size. Should ideally be set to
 * MAX(LENGTH($observations)) when known but any value greater than or equal to
 * that will work.
 * @return string Clause to be used in MySQL query.
function MySQL_Quantile_Macro($quantile, $observations, $maxlen = 22)
    // Fractional location of the desired observing assuming ASC sort.
    $F = 1 - $quantile / 100;

    // Index in the sorted list of observations where the percentile can be found.
    $I = "((COUNT(*) - 1) * $F)";

    // $maxlen by another name.
    $M = (int) $maxlen;

    // Concatenated string of all, padded observations
    $O = '(' . $observations . ')';

    // If the index is a whole number, then we just need a single, index-based
    // substring. Otherwise, we'll need to estimate the value using fractional
    // portions of two observations around the desired quantile.
    return <<<________SQL
            CEIL($I) = FLOOR($I),

            SUBSTRING_INDEX($G, 1 + $I * $M, $M) + 0,

            SUBSTRING($G, 1 + FLOOR($I) * $M, $M) * (CEIL($I) - $I) +
            SUBSTRING($G, 1 + (FLOOR($I) + 1) * $M, $M) * ($I - FLOOR($I))

As noted in the comments, one must take care to set MySQL's group_concat_max_len to accommodate storing the coerced observations during the query execution.Performance of this method isn't great but is likely acceptable for any datasets of 500,000 observations or fewer. On my dual core, 3.0GHz desktop with a consumer SSD, calculating the 25%, 50% and 75% quantiles on a set of 55,000 observations takes around 0.32 seconds. On 450,000 observations, the same calculations take around 3.4 seconds.

When I was originally looking for ways to calculate quantiles in MySQL, none of the top results in my searches had any mention of this method, so I figured it out by means of independent discovery but unsurprisingly, I am by no means the first person to use GROUP_CONCAT for this and searching for "mysql group_concat quantiles" will return a few different implementations. I saw that everyone else opted to use SUBSTRING_INDEX in lieu of padding which solves the issue of having to provide an accurate $maxlen. This was something I considered, but I assumed using SUBSTRING_INDEX would have negative impact performance; I opted to use padding instead of SUBSTRING_INDEX because I was concerned the concatenated results would be needlessly copied by each SUBSTRING_INDEX call. I used the method taken from Roland Bouman's blog to compare speed, and his code to calculate the median takes a little less than half the time my code takes to calculate the median, so I intend poke at this some more later to try to improve performance of the macro.