Over the last few years, a lot of research has been done on choosing the
most efficient memory allocation library for MySQL and on its impact on
MySQL performance (InnoDB kernel_mutex Contention and Memory Allocators;
Impact of memory allocators on MySQL performance; TCMalloc and MySQL;
MySQL performance: Impact of memory allocators (Part 2); Concurrent
large allocations: glibc malloc, jemalloc and tcmalloc). The common
wisdom has always been that the glibc implementation of malloc()
doesn't scale and that either jemalloc or tcmalloc should be used
instead. Percona provides jemalloc in their repositories and recommends
enabling it for efficiency reasons, and Oracle even bundled tcmalloc
in MySQL distributions until 5.7.13.
I've always considered that unfortunate. Sure, for advanced MySQL users or fellow performance engineers, installing and using an alternative allocation library is not a big deal. But for most users it is somewhat cumbersome to go through all the extra steps to install the library and configure MySQL to use it (and the way to do so has changed with systemd, so old instructions no longer apply to modern distributions). On top of that, neither jemalloc nor tcmalloc is bug-free, and tuning them is sometimes more an art than a science. The default allocator in glibc should be good enough for the vast majority of users, leaving alternative allocators to researchers and to users willing to squeeze the last drop of performance out of their hardware.
Anyway, the prevailing view has been that even for moderately heavy workloads an alternative allocator is a requirement. That had been my opinion as well until a few days ago, when I had to run some benchmarks on an Ubuntu Artful machine. I usually use tcmalloc, and in my previous experiments on older distributions there was little practical difference between jemalloc and tcmalloc, but a big difference compared to glibc.
This time around I noticed that MySQL scalability suffered due to
severe contention on a spinlock inside
libtcmalloc_minimal.so.4. Searching the web suggested there had been
some recent fixes for similar issues in tcmalloc. But I decided to
revisit my allocator benchmarks, and to my surprise glibc came out as
the fastest allocator.
Performance Improvements in glibc 2.26
While looking for an explanation, I noticed that Ubuntu Artful was
probably the first mainline distribution that included glibc 2.26 and a
colleague pointed me to this excellent blog post describing
improvements in glibc 2.26.
That post was about an ARM64 machine, which is of limited interest to the general audience, so I wondered whether I could reproduce the results on an x86_64 machine. And yes, even though the picture is a little different on x86_64, I could reproduce both the tcmalloc lock contention and glibc being the fastest allocator on Ubuntu Artful running on x86_64.
Benchmarks with glibc 2.26
For my experiment, I decided to run the same benchmarks Mark Callaghan ran in his most recent evaluation of allocator libraries. I won't repeat the full benchmark configuration; the only differences from Mark's setup were:
- InnoDB instead of MyRocks
- MySQL 5.7.21
- 10 sysbench tables (instead of 8) with 1M rows each
- Ubuntu Artful with glibc 2.26, jemalloc 3.6 and tcmalloc 2.5 running on a 2-socket Xeon Gold 6138 machine.
Compared to Mark's results:
- with 2M per-connection blocks (i.e. with sort_buffer_size=2M), glibc 2.20 was slightly slower than jemalloc and tcmalloc, and glibc 2.23 was about the same. In my results, glibc 2.26 is considerably faster than both tcmalloc and jemalloc;
- with 32M per-connection blocks, glibc performance has a sharp drop at higher concurrency. This is the same in both Mark's results and mine;
- tcmalloc 2.5 shows poor performance with 2M and especially 32M blocks in my benchmarks. More on that later.
That is, glibc 2.26 has certainly improved its scalability with small
block allocations, but bigger blocks (>= 32 MB) are still problematic. In
my comment to bug #88071 I explained the reasons for that and
recommended that the bug reporter play with
malloc() parameters to see
if they have any impact on scalability.
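As background for why 2M and 32M blocks can behave so differently, here is a simplified Python model of glibc malloc's dynamic mmap threshold as documented in mallopt(3): requests at or above M_MMAP_THRESHOLD are served by mmap(), the threshold starts at 128 KiB, and freeing an mmap()ed chunk raises the threshold to that chunk's size, but only up to DEFAULT_MMAP_THRESHOLD_MAX (32 MiB on 64-bit systems). M_MMAP_MAX (default 65536) caps how many mmap()ed chunks may exist at once. This is a sketch under stated assumptions (64-bit, 4 KiB pages, no bins, arenas or top-chunk logic), not the analysis from the bug comment; MallocModel and mmap_calls are hypothetical names.

```python
# Simplified, single-threaded model of glibc malloc's dynamic mmap threshold.
# Assumptions: 64-bit build, 4 KiB pages; bins, arenas, the top chunk and
# MINSIZE handling are ignored, so this is illustrative only.

PAGE = 4096
SIZE_SZ = 8                                   # sizeof(size_t) on 64-bit
MALLOC_ALIGN_MASK = 2 * SIZE_SZ - 1           # 16-byte alignment
DEFAULT_MMAP_THRESHOLD = 128 * 1024           # initial M_MMAP_THRESHOLD
DEFAULT_MMAP_THRESHOLD_MAX = 4 * 1024 * 1024 * 8  # 32 MiB on 64-bit
DEFAULT_MMAP_MAX = 65536                      # default M_MMAP_MAX


def chunk_size(request):
    """Internal chunk size for a request (like glibc's request2size)."""
    return (request + SIZE_SZ + MALLOC_ALIGN_MASK) & ~MALLOC_ALIGN_MASK


def mmapped_size(request):
    """Size of the mapping when the request is served by mmap()."""
    n = chunk_size(request) + SIZE_SZ
    return (n + PAGE - 1) // PAGE * PAGE      # round up to a whole page


class MallocModel:
    def __init__(self, mmap_max=DEFAULT_MMAP_MAX):
        self.threshold = DEFAULT_MMAP_THRESHOLD
        self.mmap_max = mmap_max              # MALLOC_MMAP_MAX_=0 -> 0
        self.n_mmaps = 0                      # mmap()ed chunks currently live
        self.mmap_calls = 0                   # total mmap() calls made

    def alloc(self, request):
        """Return True if this allocation would be served by mmap()."""
        if chunk_size(request) >= self.threshold and self.n_mmaps < self.mmap_max:
            self.n_mmaps += 1
            self.mmap_calls += 1
            return True
        return False                          # served from the heap

    def free(self, request, mmapped):
        if not mmapped:
            return
        size = mmapped_size(request)
        # Dynamic threshold: freeing an mmap()ed chunk raises the threshold
        # to its size, but only if it does not exceed the 32 MiB cap.
        if self.threshold < size <= DEFAULT_MMAP_THRESHOLD_MAX:
            self.threshold = size
        self.n_mmaps -= 1                     # munmap()


def mmap_calls(request, iterations, mmap_max=DEFAULT_MMAP_MAX):
    """Count mmap() calls for a loop of malloc(request)/free() pairs."""
    m = MallocModel(mmap_max)
    for _ in range(iterations):
        mmapped = m.alloc(request)
        m.free(request, mmapped)
    return m.mmap_calls


if __name__ == "__main__":
    for label, size in (("32K", 32 * 1024), ("2M", 2 * 1024 * 1024),
                        ("32M", 32 * 1024 * 1024)):
        # columns: default settings, then mmap_max=0
        print(label, mmap_calls(size, 1000), mmap_calls(size, 1000, mmap_max=0))
```

In this model, a 2M sort buffer is mmap()ed only once, after which the threshold moves above it and later buffers come from the heap. A 32M buffer ends up slightly larger than 32 MiB once malloc's header and page rounding are added, so the threshold never adapts and every allocation becomes an mmap()/munmap() pair. A plausible consequence is fresh page faults and kernel address-space updates for every such buffer, which do not scale well with concurrency; with mmap_max=0 (the effect of MALLOC_MMAP_MAX_=0) no allocation uses mmap() at all.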
So it was time for me to follow my own advice and play with malloc
tunable parameters. For experimental purposes, I simply set
MALLOC_MMAP_MAX_=0 before starting MySQL to stop glibc from using mmap()
for large allocations. Below are updated results with glibc and mmap() disabled:
The summary is that with this simple tuning glibc 2.26 leads the pack. It is faster than both jemalloc and tcmalloc with both small and large blocks.
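glibc picks up MALLOC_MMAP_MAX_ from the process environment when malloc is initialized, so setting it after mysqld is already running has no effect; the variable has to be in mysqld's own environment. Here is a minimal sketch of one way to do that from a benchmark harness, assuming the harness is a Python script that starts the server; malloc_tuned_env and launch are hypothetical helper names. With systemd, the equivalent would be an Environment= line in a drop-in for the MySQL unit.

```python
import os
import subprocess
import sys


def malloc_tuned_env(base=None, mmap_max=0):
    """Return a copy of an environment with glibc's MALLOC_MMAP_MAX_ set.

    MALLOC_MMAP_MAX_=0 tells glibc malloc never to use mmap() for
    allocations. The original environment is not modified.
    """
    env = dict(os.environ if base is None else base)
    env["MALLOC_MMAP_MAX_"] = str(mmap_max)
    return env


def launch(cmd, mmap_max=0):
    """Start cmd (e.g. ["mysqld"]) with the tuned environment."""
    return subprocess.Popen(cmd, env=malloc_tuned_env(mmap_max=mmap_max))


if __name__ == "__main__":
    # Demo with a child Python process instead of mysqld: the child echoes
    # the variable back, showing that it was inherited.
    child = [sys.executable, "-c",
             "import os; print(os.environ['MALLOC_MMAP_MAX_'])"]
    result = subprocess.run(child, env=malloc_tuned_env(),
                            capture_output=True, text=True)
    print(result.stdout.strip())
```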
Anticipating questions about other jemalloc/tcmalloc versions and their tuning: I know that jemalloc and tcmalloc performance can vary considerably depending on version and tuning parameters, but that wasn't my goal. I'm trying to look at it from a regular user's perspective and just use whatever the distribution provides. My goal was to see whether glibc 2.26, with its recent scalability improvements, is good enough as an allocation library for MySQL. In terms of performance, and based on the benchmark numbers I got, the answer is "yes, it is good enough, but some tuning may be required for buffers >= 32 MB".
What about fragmentation?
One frequent comment that I hear when discussing memory allocators is
that glibc has higher fragmentation than alternative libraries, which
manifests itself as higher process RSS. That may very well be true, but
not in the particular benchmark I was running. I was capturing mysqld RSS
as reported by
pidstat(1) at the end of each run, and here are the results:
So RSS with glibc was about the same as with jemalloc, with tcmalloc again showing the worst results.
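The RSS numbers above come from pidstat(1), whose -r option reports RSS in kilobytes. For scripted runs, a minimal alternative sketch (a swapped-in technique, not what produced these numbers) is to read the VmRSS field of /proc/<pid>/status on Linux at the end of each run; parse_vmrss_kb and rss_kb are hypothetical names.

```python
import os


def parse_vmrss_kb(status_text):
    """Return the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line format: "VmRSS:\t  123456 kB"
            return int(line.split()[1])
    # Kernel threads have no VmRSS line.
    raise ValueError("no VmRSS line found")


def rss_kb(pid):
    """Read the current resident set size of a Linux process, in kB."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vmrss_kb(f.read())


if __name__ == "__main__" and os.path.exists("/proc/self/status"):
    # Example on our own process; in a benchmark this would be mysqld's PID,
    # sampled once at the end of each run.
    print(rss_kb(os.getpid()), "kB")
```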
What's wrong with tcmalloc?
There's obviously something wrong with the tcmalloc shipped with Ubuntu Artful. I have some tricks up my sleeve for tuning tcmalloc (and I will be talking about them in my Percona Live talk), but none of them worked in this case. A typical PMP (Poor Man's Profiler) stack trace looks as follows:
26 base::internal::SpinLockDelay(libtcmalloc_minimal.so.4), SpinLock::SlowLock(libtcmalloc_minimal.so.4), tc_malloc(libtcmalloc_minimal.so.4), my_raw_malloc(my_malloc.c:191), my_malloc(my_malloc.c:191), Filesort_buffer::alloc_sort_buffer(filesort_utils.cc:124), Filesort_info::alloc_sort_buffer(sql_sort.h:509), filesort(sql_sort.h:509), create_sort_index(sql_executor.cc:3664), QEP_TAB::sort_table(sql_executor.cc:3664), join_init_read_record(sql_executor.cc:2465), sub_select(sql_executor.cc:1271), do_select(sql_executor.cc:944), JOIN::exec(sql_executor.cc:944), handle_query(sql_select.cc:184), execute_sqlcom_select(sql_parse.cc:5156), mysql_execute_command(sql_parse.cc:2792), Prepared_statement::execute(sql_prepare.cc:3952), Prepared_statement::execute_loop(sql_prepare.cc:3560), mysqld_stmt_execute(sql_prepare.cc:2551), dispatch_command(sql_parse.cc:1392), do_command(sql_parse.cc:999), handle_connection(connection_handler_per_thread.cc:300), pfs_spawn_thread(pfs.cc:2190), start_thread, clone
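PMP works by taking gdb backtraces of all threads and counting identical stacks; the "26" at the start of the trace above is the number of threads sharing that stack. Here is a minimal Python sketch of the aggregation step, assuming input in the format printed by gdb -batch -ex "thread apply all bt" -p <pid>. It is not the original PMP script (which is shell and awk), it keeps only function names (not the library and file:line annotations shown above), and collapse_stacks and format_report are hypothetical names.

```python
import re
from collections import Counter

# Matches frames such as:
#   "#3  0x00007f0a in my_raw_malloc (size=2097152) at my_malloc.c:191"
#   "#0  poll () at poll.c:29"            (innermost frame, no address)
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([^\s(]+)")


def collapse_stacks(bt_text):
    """Count identical per-thread stacks in 'thread apply all bt' output.

    Each stack is collapsed into one comma-separated line of function
    names, innermost frame first, as in PMP output.
    """
    stacks = Counter()
    current = []
    for raw in bt_text.splitlines():
        line = raw.strip()
        if line.startswith("Thread "):
            # A new thread header ends the previous thread's stack.
            if current:
                stacks[",".join(current)] += 1
            current = []
            continue
        m = FRAME_RE.match(line)
        if m:
            current.append(m.group(1))
    if current:                      # flush the last thread
        stacks[",".join(current)] += 1
    return stacks


def format_report(stacks):
    """Return 'count stack' lines, most frequent stack first."""
    return [f"{count} {stack}" for stack, count in stacks.most_common()]
```

Running it over several snapshots taken a second or so apart gives a rough picture of where threads spend their time; a stack shared by many threads inside SpinLock::SlowLock, as above, points at lock contention inside the allocator.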
I could probably do some further research and fix it either by tuning or using a different version. But again, that's not something most users would do, so let's just keep these results as a warning to Ubuntu Artful users following multiple recommendations on the Internet to use tcmalloc with MySQL: don't use the default tcmalloc in Artful, it can actually lead to worse MySQL scalability than glibc or jemalloc.
It is great to see some progress with
malloc() performance in glibc
2.26. It already looks good enough for most installations, and for
systems with large (>= 32 MB) per-connection buffers one may want to
consider tuning glibc malloc() parameters, e.g. disabling mmap() with MALLOC_MMAP_MAX_=0.
I also hear there are further improvements coming in 2.27. Hopefully some day, when these newer versions reach other mainline distributions and LTS releases, most users will no longer need to bother with alternative allocator libraries.
Update: per requests in the comments, I ran benchmarks with
sort_buffer_size=32K. Updated charts:
I also ran a benchmark with
DISTINCT range queries instead of
ORDER BY ones:
As seen, using
DISTINCT queries instead of
ORDER BY does not have much
impact on glibc and jemalloc, but magically restores tcmalloc performance
to its original glory. An interesting topic for further research...