# Zipfian distribution in sysbench

Yesterday I added Zipfian distribution to the list of random number distributions provided by sysbench. Why is that important? First of all, many Internet data access patterns typically conform to the Zipfian distribution. That's probably why it is used by default in many popular benchmarks, like LinkBench or YCSB. Which also makes support for the Zipfian distribution a prerequisite for my goals of re-implementing those suites in sysbench Lua.

sysbench already provides the Pareto distribution, which is, in fact, the "self-similar" or "80-20 rule" distribution as described by Knuth in "The Art of Computer Programming", and is another approximation to real-life distributions. For practical purposes, it is similar to the Zipfian distribution, i.e. a small subset of data (a "hotspot") is much more popular than the rest ("cold data"), e.g. "80 percent of transactions deal with 20 percent of data".

I thought it would still make sense to provide the Zipfian distribution in sysbench, to make sure that other popular benchmarks generate the same workload with the same parameters when re-implemented in sysbench.

## Details, details...

The task looked simple and boring at first. Unfortunately, the most boring tasks usually tend to turn into yak shaving. This is what happened here as well: the task has eventually required me to do some research on existing algorithms and implementations and even dust off my math degree to understand the details (and has also reminded me how unscientific some legacy PRNG-related code is in sysbench đź™‚).

So there are two popular approaches to implementing the Zipfian distribution, with each having its own issues:

- the one described in **"Quickly Generating Billion-Record Synthetic Databases"** (Jim Gray et al., SIGMOD 1994)
- the one described in **"Non-Uniform Random Variate Generation"** (Luc Devroye, pp. 550-551, Springer 1986)

The former is a straightforward approach used in LinkBench and YCSB, and
is based on computing the *zetan* value, which is a generalized harmonic
number. The time to compute it grows linearly with the total number
of items (i.e. the range size, or *cardinality*), but the value can be
cached, which results in *O(n)* benchmark initialization time and *O(1)*
run-time computational complexity for LinkBench and YCSB. Which is not
too bad (even though initializing the benchmark for huge data sets can
take minutes), but on top of that the method has poor statistical
properties with exponents (the distribution parameter) greater than 1,
and is also undefined for exponent = 1.
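To make the approach concrete, here is a minimal Python sketch of the zetan-based method, following the formulas from Gray et al. as used by YCSB's Zipfian generator. The function name and parameter defaults are mine, not sysbench's, and as noted above it is only meaningful for exponents strictly between 0 and 1:

```python
import math
import random

def make_zipf_zetan(n, theta=0.8):
    """Zipfian sampler over ranks 1..n using the zetan-based approximation
    from Gray et al. (SIGMOD 1994), as used by YCSB.
    Requires 0 < theta < 1 (alpha below is undefined at theta = 1)."""
    # O(n) initialization: zetan is the generalized harmonic number H(n, theta).
    zetan = sum(1.0 / i ** theta for i in range(1, n + 1))
    zeta2 = 1.0 + 0.5 ** theta
    alpha = 1.0 / (1.0 - theta)
    eta = (1.0 - (2.0 / n) ** (1.0 - theta)) / (1.0 - zeta2 / zetan)

    def sample():
        # O(1) per draw once zetan is cached.
        u = random.random()
        uz = u * zetan
        if uz < 1.0:
            return 1
        if uz < 1.0 + 0.5 ** theta:
            return 2
        return 1 + int(n * (eta * u - eta + 1.0) ** alpha)

    return sample
```

Note that the *O(n)* cost is entirely in the `zetan` sum, which is why benchmark initialization time scales with the table size.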

This is also the approach employed by Vadim in his old merge proposal against sysbench (see also his blog post on the topic).

The second approach is also frequently discussed and referenced. It is based on the rejection sampling method and has constant initialization time and memory costs. The only downside is that it uses iterations to generate random numbers, and the number of iterations grows with the rejection rate, which makes it computationally more expensive for higher exponent values.
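For illustration, here is a Python sketch of this kind of rejection sampler: the classic algorithm from Devroye's book, which is essentially what `numpy.random.zipf` implements. The function name and the optional range truncation are mine; note this variant only works for exponents greater than 1:

```python
import math
import random

def zipf_rejection(a, n=None):
    """One Zipf(a) variate by rejection sampling (Devroye, 1986,
    pp. 550-551). Requires exponent a > 1. If n is given, candidates
    above n are rejected too, giving a bounded Zipfian over 1..n."""
    b = 2.0 ** (a - 1.0)
    while True:
        u = 1.0 - random.random()               # u in (0, 1]
        v = random.random()
        # Candidate drawn by inverting a dominating (envelope) distribution.
        x = math.floor(u ** (-1.0 / (a - 1.0)))
        t = (1.0 + 1.0 / x) ** (a - 1.0)
        # Accept if v falls under the ratio of the target pmf to the
        # envelope; otherwise loop and draw another candidate.
        if v * x * (t - 1.0) / (b - 1.0) <= t / b and (n is None or x <= n):
            return int(x)
```

Initialization is trivial (just `b`), but each call may loop several times, which is the run-time cost the text refers to.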

There have also been some recent efforts to add the Zipfian distribution to pgbench, where PostgreSQL developers ended up with a combined approach, i.e. using the algorithm based on harmonic numbers for exponents in the range (0; 1), and rejection sampling for exponents in the range (1; +inf). Which still suffers from the computational costs issue with large exponent values, and also excludes the case of exponent = 1.

While looking for a better alternative, I came across a more recent
article describing the method of implementing a number of distributions,
including the Zipfian distribution: **"Rejection-Inversion to Generate
Variates from Monotone Discrete Distributions"** (Wolfgang HĂ¶rmann and
Gerhard Derflinger, ACM Transactions on Modeling and Computer
Simulation (TOMACS) 6.3, 1996)

A verbatim implementation of that method is available in the Go standard library and limits the exponent to values greater than 1.0. Which is unfortunate, because both LinkBench and YCSB use exponents less than 1.0 by default.

There's also a slightly improved version of that method implemented in
the Apache Commons project. The algorithm is defined for all
non-negative exponent values (with exponent = 0 providing a uniform
distribution as it should), and has bounded (and thus, *O(1)*)
initialization and run-time complexity and constant memory requirements.

Which is why that is the algorithm I ported to sysbench, from its original Java implementation in Apache Commons.
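For the curious, here is a rough Python transcription of that rejection-inversion sampler, following the structure of the Apache Commons implementation (the sysbench port itself is in C; this sketch is only meant to show the idea, and the class and helper names are mine):

```python
import math
import random

class ZipfRejectionInversion:
    """Sketch of the Hörmann/Derflinger rejection-inversion Zipf sampler
    over ranks 1..n, following the structure of the Apache Commons code.
    Defined for any exponent s >= 0 (s = 0 degenerates to uniform);
    initialization and per-sample costs are O(1)."""

    def __init__(self, n, s):
        self.n = n
        self.s = s
        self.h_x1 = self._h_integral(1.5) - 1.0
        self.h_n = self._h_integral(n + 0.5)
        self.cut = 2.0 - self._h_integral_inverse(
            self._h_integral(2.5) - self._h(2.0))

    def _h(self, x):                   # pmf shape: x^(-s)
        return math.exp(-self.s * math.log(x))

    def _h_integral(self, x):          # antiderivative H(x) of h
        log_x = math.log(x)
        return self._helper2((1.0 - self.s) * log_x) * log_x

    def _h_integral_inverse(self, x):  # H^{-1}(x)
        t = x * (1.0 - self.s)
        if t < -1.0:
            t = -1.0                   # guard against rounding errors
        return math.exp(self._helper1(t) * x)

    @staticmethod
    def _helper1(x):                   # log1p(x)/x, stable near 0
        if abs(x) > 1e-8:
            return math.log1p(x) / x
        return 1.0 - x * (0.5 - x * (1.0 / 3.0 - 0.25 * x))

    @staticmethod
    def _helper2(x):                   # expm1(x)/x, stable near 0
        if abs(x) > 1e-8:
            return math.expm1(x) / x
        return 1.0 + x * 0.5 * (1.0 + x / 3.0 * (1.0 + 0.25 * x))

    def sample(self):
        while True:
            # Invert H over (H(n + 0.5), H(1.5) - h(1)) to get a candidate,
            # then accept or reject it against the true pmf.
            u = self.h_n + random.random() * (self.h_x1 - self.h_n)
            x = self._h_integral_inverse(u)
            k = min(max(int(x + 0.5), 1), self.n)
            if (k - x <= self.cut
                    or u >= self._h_integral(k + 0.5) - self._h(k)):
                return k
```

The `_helper1`/`_helper2` Taylor expansions are what keep the method well-defined at exponent = 1, where the naive `log`/`expm1` ratios would divide by zero.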

## Conclusions

The Zipfian distribution is now available in the sysbench `master` branch
with the `--rand-type=zipfian` option, and the exponent parameter can be
specified with `--rand-zipfian-exp`. The default exponent value is 0.8,
which is identical to LinkBench, and is quite close to YCSB, which uses
0.99 by default (which I guess was supposed to be 1.0, but 0.99 was chosen
due to implementation limitations).

As with other distributions, Lua scripts can use the Zipfian
distribution explicitly by calling the `sysbench.rand.zipfian()` function
instead of relying on the default distribution specified with
`--rand-type` on the command line.

I'm also going to deprecate the "special" distribution, mostly because it is hugely unscientific and hard to explain đź™‚.

What about the default value for `--rand-type`? I'm leaning towards
making `zipfian` the default instead of the current `special`. I know
that opinions vary here, so if you have arguments against Zipfian being
the default distribution in sysbench, please let me know!