Revisiting memory allocators and MySQL performance

Over the last few years, a lot of research has been done on choosing the most efficient memory allocation library for MySQL and on its impact on MySQL performance (InnoDB kernel_mutex Contention and Memory Allocators, Impact of memory allocators on MySQL performance, TCMalloc and MySQL, MySQL performance: Impact of memory allocators (Part 2), Concurrent large allocations: glibc malloc, jemalloc and tcmalloc). The common wisdom has always been that the glibc implementation of malloc() doesn't scale and that either jemalloc or tcmalloc should be used instead. Percona provides jemalloc in their repositories and recommends enabling it for efficiency reasons, and Oracle even bundled tcmalloc in MySQL distributions until 5.7.13.

I've always considered that unfortunate. Sure, for advanced MySQL users or fellow performance engineers, installing and using an alternative allocation library is not a big deal. But for most users it is a little cumbersome to go through all those extra steps to install the library and configure MySQL to use it (and the way to do so has changed with systemd, so old instructions no longer apply to modern distributions). On top of that, neither jemalloc nor tcmalloc is bug-free, and tuning them is sometimes more an art than a science. The default allocator in glibc should be good enough for the vast majority of users, leaving alternative allocators to researchers and to those users willing to squeeze the last drop of performance out of their hardware.

In practice, however, an alternative allocator has been considered a requirement even for moderately heavy workloads. That had been my opinion as well until a few days ago, when I had to run some benchmarks on an Ubuntu Artful machine. I usually use tcmalloc, and in my previous experiments on older distributions there was little practical difference between jemalloc and tcmalloc, but a big difference between either of them and glibc.

This time around I noticed that MySQL scalability suffered from severe contention on a spinlock inside libtcmalloc_minimal.so.4. Searching the web suggested there had been some recent fixes for similar issues in tcmalloc. But I decided to revisit my allocator benchmarks, and to my surprise glibc came out the winner.

Performance Improvements in glibc 2.26

While looking for an explanation, I noticed that Ubuntu Artful was probably the first mainline distribution to include glibc 2.26, and a colleague pointed me to this excellent blog post describing malloc() improvements in glibc 2.26.

My initial runs were on an ARM64 machine, which is of limited interest to the general audience, so I wondered whether I could reproduce the results on an x86_64 machine. And yes, even though the picture is a little different on x86_64, I could reproduce both the tcmalloc lock contention and glibc being the fastest allocator on Ubuntu Artful running on x86_64.

Benchmarks with glibc 2.26

For my experiment, I decided to run the same benchmarks that Mark Callaghan ran in his most recent evaluation of allocator libraries. I'm not going to repeat the whole benchmark configuration. The only differences from Mark's setup were:

  • InnoDB instead of MyRocks
  • MySQL 5.7.21
  • 10 sysbench tables with 1M rows each instead of 8
  • Ubuntu Artful with glibc 2.26, jemalloc 3.6 and tcmalloc 2.5 running on a 2-socket Xeon Gold 6138 machine.

Results

The results are below:

Comparing to Mark's results:

  • with 2M per-connection blocks (i.e. with sort_buffer_size=2M) glibc 2.20 was slightly slower than jemalloc and tcmalloc, and glibc 2.23 was about the same. In my results glibc 2.26 is considerably faster than both tcmalloc and jemalloc;
  • with 32M per-connection blocks glibc performance has a sharp drop at higher concurrency. This is the same in both Mark's results and mine;
  • tcmalloc 2.5 shows poor performance with 2M and especially 32M blocks in my benchmarks. More on it later.

That is, glibc 2.26 has certainly improved its scalability with small block allocations, but bigger blocks (>= 32 MB) are still problematic. In my comment to bug #88071 I explained the reasons for that and recommended that the bug reporter experiment with malloc() parameters to see if they have any impact on scalability.

So it was time for me to follow my own advice and play with malloc tunable parameters. For experimental purposes I simply did export MALLOC_MMAP_MAX_=0 before starting MySQL to disable mmap() usage completely.
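An exported variable only reaches mysqld if the server is started from that same environment; when it is started some other way (for example by a service manager), the variable presumably has to be set in the environment the server actually inherits. The following is a minimal sketch of that idea, not the setup used for the benchmarks: malloc_env is a hypothetical helper, and a harmless child Python process stands in for mysqld.

```python
import os
import subprocess
import sys


def malloc_env(mmap_max=0, base=None):
    """Return a copy of the environment with MALLOC_MMAP_MAX_ set.

    Hypothetical helper for illustration; `base` defaults to the
    current process environment, which is left unmodified.
    """
    env = dict(os.environ if base is None else base)
    env["MALLOC_MMAP_MAX_"] = str(mmap_max)
    return env


# A child Python process stands in for mysqld here: it only reports
# what it sees in its own environment.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['MALLOC_MMAP_MAX_'])"],
    env=malloc_env(0),
    capture_output=True,
    text=True,
    check=True,
)
seen_by_child = child.stdout.strip()
print(seen_by_child)
```

The point of the sketch is only that the setting travels with the process environment, so it is worth checking what the running server actually inherited before drawing conclusions from a benchmark.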

Below are updated results with glibc and disabled mmap() ("glibc_nommap"):

The summary is that with this simple tuning glibc 2.26 leads the pack. It is faster than both jemalloc and tcmalloc with both small and large blocks.
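One documented glibc behaviour that is consistent with both observations (the drop with 32M blocks and its disappearance once mmap() is disabled) is the dynamic mmap threshold described in mallopt(3): the threshold starts at 128 KiB and is raised when an mmap()ed chunk is freed, but never beyond 32 MiB on 64-bit systems. The sketch below is a deliberately simplified model of that rule, not glibc code and not necessarily the explanation given in the bug comment; CHUNK_OVERHEAD and PAGE are assumed values used only for illustration.

```python
KIB = 1024
MIB = 1024 * KIB

DEFAULT_THRESHOLD = 128 * KIB  # documented initial M_MMAP_THRESHOLD
THRESHOLD_MAX = 32 * MIB       # documented upper limit on 64-bit systems
DEFAULT_MMAP_MAX = 65536       # documented default M_MMAP_MAX
CHUNK_OVERHEAD = 16            # assumed per-chunk bookkeeping (illustration)
PAGE = 4096                    # assumed page size (illustration)


class ThresholdModel:
    """Toy model of the dynamic mmap threshold; not real allocator code."""

    def __init__(self, mmap_max=DEFAULT_MMAP_MAX):
        self.threshold = DEFAULT_THRESHOLD
        self.mmap_max = mmap_max

    def malloc_then_free(self, size):
        chunk = size + CHUNK_OVERHEAD
        used_mmap = self.mmap_max > 0 and chunk >= self.threshold
        if used_mmap:
            # An mmap()ed chunk is rounded up to whole pages.
            mapped = -(-chunk // PAGE) * PAGE
            # Freeing it raises the threshold, but only up to the limit.
            if self.threshold < mapped <= THRESHOLD_MAX:
                self.threshold = mapped
        return "mmap" if used_mmap else "heap"


def simulate(size, rounds=3, mmap_max=DEFAULT_MMAP_MAX):
    """Allocate and free a block of `size` bytes `rounds` times."""
    model = ThresholdModel(mmap_max)
    return [model.malloc_then_free(size) for _ in range(rounds)]


print(simulate(2 * MIB))
print(simulate(32 * MIB))
print(simulate(32 * MIB, mmap_max=0))
```

In this model a 2M block is mapped once and then served from the heap, a 32M block goes through mmap() on every allocation, and setting the mmap limit to 0 keeps it on the heap.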

Anticipating questions about other jemalloc/tcmalloc versions and their tuning: I know that jemalloc and tcmalloc performance can vary considerably depending on version and tuning parameters, but exploring that wasn't my goal. I'm trying to look at it from a regular user's perspective and just use whatever is provided by the distribution. My goal was to see if glibc 2.26 with its recent scalability improvements is good enough as an allocation library for MySQL. In terms of performance, and based on the benchmark numbers I got, the answer is "yes, it is good enough, but some tuning may be required for buffers >= 32 MB".

What about fragmentation?

One frequent comment that I hear when discussing memory allocators is that glibc has higher fragmentation than alternative libraries, which manifests itself as higher process RSS. That may very well be true, but not in the particular benchmark I was running. I captured mysqld RSS as reported by pidstat(1) at the end of each run, and here are the results:
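For readers who want to capture a comparable number without pidstat, the resident set size of a process can also be read on Linux from the VmRSS line of /proc/<pid>/status. This is a minimal sketch of that alternative, not the method used for the results here; the function names are hypothetical and the sample status text contains made-up values.

```python
import re


def parse_vm_rss_kib(status_text):
    """Extract VmRSS (in KiB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1))


def read_rss_kib(pid):
    """Read the current RSS of a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vm_rss_kib(status_file.read())


# Made-up sample in the /proc/<pid>/status layout, for illustration only.
sample = (
    "Name:\tmysqld\n"
    "VmPeak:\t 9000000 kB\n"
    "VmRSS:\t 2097152 kB\n"
    "Threads:\t64\n"
)
rss_kib = parse_vm_rss_kib(sample)
print(f"{rss_kib / 1024:.0f} MiB")
```

Sampling at the same point of every run, as done above with pidstat, matters more than the tool used, because RSS changes over the lifetime of the benchmark.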

So RSS with glibc was about the same as with jemalloc, with the worst results shown by tcmalloc again.

What's wrong with tcmalloc?

There's obviously something wrong with the tcmalloc shipped with Ubuntu Artful. I have some tricks up my sleeve for tuning tcmalloc (and I will be talking about them in my Percona Live talk), but none of them worked in this case. A typical PMP (Poor Man's Profiler) stack trace looks as follows:

26 base::internal::SpinLockDelay(libtcmalloc_minimal.so.4),
   SpinLock::SlowLock(libtcmalloc_minimal.so.4),
   tc_malloc(libtcmalloc_minimal.so.4),my_raw_malloc(my_malloc.c:191),
   my_malloc(my_malloc.c:191),
   Filesort_buffer::alloc_sort_buffer(filesort_utils.cc:124),
   Filesort_info::alloc_sort_buffer(sql_sort.h:509),
   filesort(sql_sort.h:509),create_sort_index(sql_executor.cc:3664),
   QEP_TAB::sort_table(sql_executor.cc:3664),
   join_init_read_record(sql_executor.cc:2465),
   sub_select(sql_executor.cc:1271),
   do_select(sql_executor.cc:944),JOIN::exec(sql_executor.cc:944),
   handle_query(sql_select.cc:184),execute_sqlcom_select(sql_parse.cc:5156),
   mysql_execute_command(sql_parse.cc:2792),
   Prepared_statement::execute(sql_prepare.cc:3952),
   Prepared_statement::execute_loop(sql_prepare.cc:3560),
   mysqld_stmt_execute(sql_prepare.cc:2551),
   dispatch_command(sql_parse.cc:1392),do_command(sql_parse.cc:999),
   handle_connection(connection_handler_per_thread.cc:300),
   pfs_spawn_thread(pfs.cc:2190),start_thread,clone
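The leading number in such a trace is the count of threads that were found in the same stack when the server was sampled. A rough sketch of that aggregation step is shown below, assuming gdb-style "thread apply all bt" output as input. It is an illustration of the idea rather than the actual PMP script: aggregate_stacks is a hypothetical name, the sample backtrace is made up, and source locations are dropped for brevity.

```python
from collections import Counter


def aggregate_stacks(gdb_output):
    """Collapse per-thread backtraces into (stack, thread count) pairs."""
    stacks = Counter()
    frames = []
    for line in gdb_output.splitlines():
        fields = line.split()
        if line.startswith("Thread"):
            # A new thread header closes the previous thread's stack.
            if frames:
                stacks[",".join(frames)] += 1
            frames = []
        elif line.startswith("#") and len(fields) >= 2:
            # "#0  0x... in func (...)" or "#1  func (...) at file:line"
            if fields[1].startswith("0x") and len(fields) >= 4:
                frames.append(fields[3])
            else:
                frames.append(fields[1])
    if frames:
        stacks[",".join(frames)] += 1
    # Most frequent stacks first, like the count prefix in the trace above.
    return stacks.most_common()


# Made-up sample input, for illustration only.
sample = """\
Thread 3 (Thread 0x7f0000000300 (LWP 103)):
#0  0x00007f0000000001 in base::internal::SpinLockDelay () from libtcmalloc_minimal.so.4
#1  0x00007f0000000002 in SpinLock::SlowLock () from libtcmalloc_minimal.so.4
#2  0x00007f0000000003 in tc_malloc () from libtcmalloc_minimal.so.4
Thread 2 (Thread 0x7f0000000200 (LWP 102)):
#0  0x00007f0000000001 in base::internal::SpinLockDelay () from libtcmalloc_minimal.so.4
#1  0x00007f0000000002 in SpinLock::SlowLock () from libtcmalloc_minimal.so.4
#2  0x00007f0000000003 in tc_malloc () from libtcmalloc_minimal.so.4
Thread 1 (Thread 0x7f0000000100 (LWP 101)):
#0  0x00007f0000000004 in poll () from libc.so.6
#1  0x0000000000400abc in main ()
"""

result = aggregate_stacks(sample)
for stack, count in result:
    print(count, stack)
```

A stack that dominates the counts, as the SpinLockDelay one does in the real trace, is where most threads are waiting.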

I could probably do some further research and fix it either by tuning or by using a different version. But again, that's not something most users would do, so let's just keep these results as a warning to Ubuntu Artful users who follow the many recommendations on the Internet to use tcmalloc with MySQL: don't use the default tcmalloc in Artful, as it can actually lead to worse MySQL scalability than glibc or jemalloc.

Conclusions

It is great to see some progress with malloc() performance in glibc 2.26. It already looks good enough for most installations, and for systems with large (>= 32 MB) per-connection buffers one may want to play with MALLOC_MMAP_THRESHOLD_ and MALLOC_MMAP_MAX_.

I also hear there are some further improvements coming up in 2.27. Hopefully some day, when these newer versions reach other mainline distributions and LTS releases, most users will not be required to bother with alternative allocator libraries anymore.

Update: per requests in the comments, I ran benchmarks with sort_buffer_size=32K. Updated charts:

I also ran a benchmark with DISTINCT range queries instead of ORDER BY ones:

As seen, using DISTINCT queries instead of ORDER BY does not have much impact on glibc and jemalloc, but magically restores tcmalloc performance to its original glory. Interesting thing for further research...