Yesterday I added the Zipfian distribution to the list of random number distributions provided by sysbench. Why is that important? First of all, many Internet data access patterns conform to the Zipfian distribution, which is probably why it is used by default in popular benchmarks like LinkBench and YCSB. That also makes support for the Zipfian distribution a prerequisite for my goal of re-implementing those suites in sysbench Lua.
sysbench already provides the Pareto distribution, which is, in fact, the "self-similar" or "80-20 rule" distribution as described by Knuth in "The Art of Computer Programming", and is another approximation to real-life distributions. For practical purposes, it is similar to the Zipfian distribution, i.e. a small subset of data (a "hotspot") is much more popular than the rest ("cold data"), e.g. "80 percent of transactions deal with 20 percent of data".
I thought it would still make sense to provide the Zipfian distribution in sysbench, to make sure that other popular benchmarks generate the same workload with the same parameters when re-implemented in sysbench.
The task looked simple and boring at first. Unfortunately, the most boring tasks usually tend to turn into yak shaving, and that is what happened here as well: the task eventually required me to do some research on existing algorithms and implementations, and even dust off my math degree to understand the details (it has also reminded me how unscientific some legacy PRNG-related code in sysbench is 🙂).
So there are two popular approaches to implementing the Zipfian distribution, with each having its own issues:
- the one described in "Quickly Generating Billion-Record Synthetic Databases" (Jim Gray et al, SIGMOD 1994)
- the one described in "Non-Uniform Random Variate Generation" (Luc Devroye, p. 550-551, Springer 1986)
The former is a straightforward approach used in LinkBench and YCSB, and is based on computing the zetan value, which is a generalized harmonic number. The time to compute it grows linearly with the total number of items (i.e. the range size, or cardinality), but the value can be cached, which results in O(n) benchmark initialization time and O(1) run-time computational complexity for LinkBench and YCSB. That is not too bad (even though initializing the benchmark for huge data sets can take minutes), but on top of that the method has poor statistical properties with exponents (the distribution parameter) greater than 1, and is also undefined for exponent = 1.
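To make the trade-off concrete, here is a minimal Python sketch of the harmonic-number approach. The function name and the 1-based ranks are mine, and the formulation follows the YCSB-style implementation, so treat it as an illustration rather than sysbench code:

```python
import random

def gray_zipf(n, theta=0.8, seed=None):
    """Zipfian generator over ranks 1..n in the style of Gray et al.
    (SIGMOD 1994), as popularized by YCSB: O(n) setup to compute the
    generalized harmonic number zetan (cached afterwards), then O(1)
    per sample.  Note that alpha below is undefined for theta == 1."""
    rng = random.Random(seed)
    # The O(n) part: zetan = sum_{i=1..n} 1 / i^theta
    zetan = sum(1.0 / i ** theta for i in range(1, n + 1))
    zeta2 = 1.0 + 0.5 ** theta              # zeta(2, theta)
    alpha = 1.0 / (1.0 - theta)
    eta = (1.0 - (2.0 / n) ** (1.0 - theta)) / (1.0 - zeta2 / zetan)

    def sample():
        u = rng.random()
        uz = u * zetan
        if uz < 1.0:                        # rank 1, hit with probability 1/zetan
            return 1
        if uz < 1.0 + 0.5 ** theta:         # rank 2
            return 2
        return 1 + int(n * (eta * u - eta + 1.0) ** alpha)
    return sample
```

The initial sum over all n items is exactly the O(n) initialization cost mentioned above.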
The second approach is also frequently discussed and referenced. It is based on the rejection sampling method and has constant initialization and memory costs. The only downside is that it uses iterations to generate each random number, and the number of iterations grows with higher rejection rates, which makes it computationally more expensive for higher exponent values.
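A minimal sketch of that rejection approach, following the algorithm as given by Devroye (a variant of which is also used by NumPy's random.zipf); note that this form is only defined for exponents strictly greater than 1 and samples from the unbounded distribution:

```python
import math
import random

def devroye_zipf(a, seed=None):
    """Rejection sampler for the (unbounded) Zipf distribution with
    exponent a > 1, after Devroye (1986).  Setup is O(1); each sample
    loops until a candidate drawn from a simple envelope is accepted."""
    assert a > 1.0
    rng = random.Random(seed)
    am1 = a - 1.0
    b = 2.0 ** am1

    def sample():
        while True:
            u = 1.0 - rng.random()                 # uniform in (0, 1]
            v = rng.random()
            try:
                x = math.floor(u ** (-1.0 / am1))  # envelope candidate >= 1
            except OverflowError:
                continue                           # u too small; reject and retry
            t = (1.0 + 1.0 / x) ** am1
            # accept with probability (true pmf) / (envelope density)
            if v * x * (t - 1.0) / (b - 1.0) <= t / b:
                return int(x)
    return sample
```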
There have also been some recent efforts to add the Zipfian distribution to pgbench, where PostgreSQL developers ended up with a combined approach, i.e. using the algorithm based on harmonic numbers for exponents in the range (0; 1), and rejection sampling for exponents in the range (1; +inf). This still suffers from the computational cost issue with large exponent values, and also excludes the case of exponent = 1.
While looking for a better alternative, I came across a more recent article describing a method for implementing a number of distributions, including the Zipfian one: "Rejection-Inversion to Generate Variates from Monotone Discrete Distributions" (Wolfgang Hörmann and Gerhard Derflinger, ACM Transactions on Modeling and Computer Simulation (TOMACS) 6.3, 1996).
A verbatim implementation of that method is available in the Go standard library and limits the exponent to values greater than 1.0. Which is unfortunate, because both LinkBench and YCSB use exponents less than 1.0 by default.
There's also a slightly improved version of that method implemented in the Apache Commons project. The algorithm is defined for all non-negative exponent values (with exponent = 0 producing a uniform distribution, as it should), and has bounded (and thus O(1)) initialization and run-time complexity, and constant memory requirements.
Which is why that is the algorithm I ported to sysbench from its original Java implementation in Apache Commons.
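For the curious, the core of the rejection-inversion sampler looks roughly like this when transcribed to Python. The helper names follow the Apache Commons code; this is a sketch of the algorithm, not the actual sysbench C port:

```python
import math
import random

def rejection_inversion_zipf(n, exponent, seed=None):
    """Bounded Zipfian sampler on {1, ..., n} via rejection-inversion
    (Hormann & Derflinger, 1996), following the structure of Apache
    Commons' RejectionInversionZipfSampler.  Setup and per-sample cost
    are O(1), and the algorithm is defined for any exponent >= 0,
    including exponent == 1 (handled by the expm1/log1p helpers)."""
    rng = random.Random(seed)

    def helper1(x):
        # log1p(x)/x, via a Taylor expansion near 0 for numerical stability
        return math.log1p(x) / x if abs(x) > 1e-8 else 1 - x * (0.5 - x * (1/3 - 0.25 * x))

    def helper2(x):
        # expm1(x)/x, via a Taylor expansion near 0 for numerical stability
        return math.expm1(x) / x if abs(x) > 1e-8 else 1 + x * 0.5 * (1 + x * (1/3) * (1 + 0.25 * x))

    def h(x):
        # the (unnormalized) probability weight: x^(-exponent)
        return math.exp(-exponent * math.log(x))

    def h_integral(x):
        # antiderivative of h
        log_x = math.log(x)
        return helper2((1.0 - exponent) * log_x) * log_x

    def h_integral_inverse(x):
        t = x * (1.0 - exponent)
        if t < -1.0:
            t = -1.0                 # guard against rounding below the valid range
        return math.exp(helper1(t) * x)

    h_integral_x1 = h_integral(1.5) - 1.0
    h_integral_n = h_integral(n + 0.5)
    s = 2.0 - h_integral_inverse(h_integral(2.5) - h(2.0))

    def sample():
        while True:
            u = h_integral_n + rng.random() * (h_integral_x1 - h_integral_n)
            x = h_integral_inverse(u)
            k = max(1, min(n, int(x + 0.5)))
            # accept k either immediately (k close enough to x) or via
            # the rejection-inversion test
            if k - x <= s or u >= h_integral(k + 0.5) - h(k):
                return k
    return sample
```

Note how the expm1/log1p helpers keep the integral well-defined at exponent = 1, which is exactly the case the other two approaches struggle with.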
The Zipfian distribution is now available in sysbench with the --rand-type=zipfian option, and the exponent can be set with --rand-zipfian-exp. The default exponent value is 0.8, which is identical to LinkBench, and is quite close to YCSB, which uses 0.99 by default (which I guess was supposed to be 1.0, but 0.99 was chosen due to implementation limitations).
As with other distributions, Lua scripts can use the Zipfian distribution explicitly by calling the corresponding function from the sysbench Lua API, instead of relying on the default distribution specified with --rand-type on the command line.
I'm also going to deprecate the "special" distribution, mostly because it is hugely unscientific and hard to explain 🙂.
What about the default value for --rand-type? I'm leaning towards making zipfian the default instead of the current special. I know that opinions vary here, so if you have arguments against Zipfian being the default distribution in sysbench, please let me know!