Eric Day

Thoughts, code, and other oddments.
Dark | Light

< || >

Cache Line Sizes and Concurrency

May 27th, 2009

We’ve been looking at high concurrency level issues with Drizzle and MySQL. Jay pointed me to this article on the concurrency issues due to shared cache lines and decided to run some of my own tests. The results were dramatic, and anyone who is writing multi-threaded code needs to be aware of current CPU cache line sizes and how to optimize around them.

I ran my tests on two 16-core Intel machines, one with a 64 byte cache line, and one with 128 byte cache line. First off, how did I find these values?

one:~$ cat /proc/cpuinfo | grep cache_alignment
cache_alignment : 64
...

two:~$ cat /proc/cpuinfo | grep cache_alignment
cache_alignment : 128
...

You will see one line for each CPU. If you are not familiar with /proc/cpuinfo, take a closer look at the full output. It’s a nice quick reference of other things like L2 cache sizes and CPU speed. As you can see, machine one has a 64 byte cache size, and machine two has a 128 byte cache size. Next, I wrote the following C program to test concurrency:

cache_line.c

This program creates a global array of counter variables and runs a variable number of threads, where each thread increments it’s own 4-byte counter in the array. It does so at a number of array spacing levels to see the performance when counters fall on the same cache lines. With a spacing of 1 the memory is directly adjacent, and for each spacing level it skips that many counter variables in the global array. For example, if spacing is 4, the threads would use counter[0], counter[4], counter[8], and so on, which uses a chunk of memory every 16 bytes. The cache_line.c program outputs a CSV formatted table that you can use to generate some graphs. The seconds CSV output is the same set of tests without using the global array counters, and instead a local counter on the stack. This is meant to provide a baseline (since those will always be on their own cache line). The results were:

64 Byte Cache Line

128 Byte Cache Line

So what does this tell us? When spacing is one and all counter memory (16 threads * 4 bytes == 64 bytes) is entirely on one cache line, concurrency is poor. As we add more space between each counter variable, we start to see performance improve (faster runtime). This is because all thread counters are no longer on one cache line. On the 64 byte cache line machine, we see things really level off when spacing is 16. This is because each counter is now on it’s own cache line. On the 128 byte cache line machine, you can see it takes one more iteration of spacing because the cache line is twice as big.

So what can we take from this? If you have any arrays or data structures that are accessed and updated independently from different threads, make sure they are on a different cache line. This may mean wasting a little space, but as you can see, the concurrency performance is well worth it.

Posted in Drizzle, Main, MySQL

5 Responses to "Cache Line Sizes and Concurrency"

  1. Hi Eric,

    Thanks for the info. This is something to really keep in mind when designing concurrent systems.

    I would also be interested in the affect of spacing on atomic ops, if you have some time… :)

  2. Jonas says:

    Which machine has 128b cacheline ?

  3. Eric Day says:

    Hi Paul, yes, some similar tests on atomic ops could be useful as well. One of these days when I have some more time. :)

    Jonas, this is on a test machine we have for Drizzle, so not entirely sure of the exact model number. Here is some more info from /proc/cpuinfo:

    vendor_id : GenuineIntel
    cpu family : 15
    model : 6
    model name : Genuine Intel(R) CPU 3.40GHz
    cpu MHz : 3391.490
    cache size : 16384 KB
    siblings : 4
    cache_alignment : 128

  4. Jeff Darcy says:

    On some architectures, it’s not just updates that matter either. For example, one of the SiCortex Ice9′s less optimal features was that it performed poorly even for shared *read-only* data. Normally you’d want read-only data that were used together to be located together, so that you’d end up fetching them both in one cache line instead of two, but in our case that applied only to private or infrequently-accessed data. In many cases we rearranged things or played aliasing tricks so that each processor got its own separate copy of shared read-only data, and it paid off.

    That example is probably a bit of an outlier, but the same problem has also existed to varying degrees on other processors as well. A closely related issue, BTW, is that of memory ordering. If you try to get too fancy about cache-access patterns without making sure that you have the right memory barriers in place, it might work fine on a strongly ordered x86 but then things get all flaky when you run it on a CPU with weaker ordering. Nobody ever gets it right the first few times. Maybe you, or Herb, or even I, should write an article about that.

Leave a Reply


< || >
Blog
Wiki
About
Resume
RSS
Comments

E-Mail
Launchpad
LinkedIn
Twitter
identi.ca
Facebook

OpenStack
Scale Stack
Gearman
NW Veg
Veg Food & Fit

Linux On Laptops