Loop optimization, volatile, and GCC

July 22, 2013

Can you find the bug in the following code?

#include <pthread.h>

unsigned u = 0;

void* thread_main(void *arg)
{
        u = 1;
        return 0;
}

int main(int argc, char* argv[])
{
        pthread_t thread;
        pthread_create(&thread, 0, thread_main, &u);

        while (!u);

        pthread_join(thread, 0);
        return 0;
}

This is a simplified version of some test code I was writing. I was kicking off N threads and busy-waiting until all N threads were up and ready to go. Because of some function inlining, the compiler was accessing these thread-ready variables directly… and eventually optimized the checks away.

If you compile this code with -O0 (using GCC), it will work fine. If you compile it with -O1, -O2, or -O3, then GCC will convert `while (!u)` into:

0x00000000004004ea <+26>: mov 0x200b50(%rip),%eax # 0x601040 <u>
0x00000000004004f0 <+32>: test %eax,%eax
0x00000000004004f2 <+34>: jne 0x400500 <main+48>
0x00000000004004f4 <+36>: jmp 0x4004f4 <main+36>

Which is the assembly language equivalent of:

        if (!u) {
                for (;;);
        }

That is: GCC ignored the fact that the variable is a global (and could change in another thread), and it optimized away the check. My first reaction was that this is a bug in GCC. However, after talking with some peers, the consensus is that GCC is doing the right thing (whether I agree or not): without `volatile` or explicit synchronization, the compiler is allowed to assume that no other thread modifies `u` between reads.

I know of 3 ways to fix the code…

Keyword `volatile`

You can declare `u` as volatile:

volatile unsigned u = 0;

By doing this, you tell the compiler not to optimize accesses to this variable at all: every read and write must actually happen. The C and C++ standards imply that this is intended for things like memory-mapped hardware (values that can change unexpectedly), so it is unclear from the standards whether it is needed for multi-threaded programs. In practice, it appears to be.

Memory barrier

You can place a memory barrier in the while() loop:

while (!u) __sync_synchronize(); /* GCC built-in: full memory barrier */

In a simplified sense, this ensures that memory is fully synchronized before continuing. It is both a processor instruction and a compiler directive: it emits a full hardware memory barrier, and it tells the compiler not to cache variable values across the barrier.

You don’t have to use __sync_synchronize() explicitly. Using a mutex, for instance, will cause a memory barrier to happen.

Call an `extern` function (caveat emptor!)

If you call an `extern` function inside the loop, then GCC doesn't optimize away the `u` check.

while (!u) usleep(1);

Notice that I said doesn't, not won't or shouldn't. I don't know of any standard or convention that prevents GCC from optimizing here, so you should not depend on this behavior; use one of the other methods. It does explain, however, why loops like this usually work in practice: they're usually written with a sleep() (or some other function call) in the body.

Conclusion

Some will say that the code was broken from the start because it didn't access the integer through a mutex, and that accessing integers isn't always atomic. That's not right. For one, on every platform that anyone cares about, reading or writing a single aligned integer is atomic. But even if it weren't, only one bit matters here: the value is either 0x00000000 or 0x00000001. If the bytes were read out of order it wouldn't matter, because every byte other than the lowest is 0x00 in both values. And cache synchronization usually doesn't matter here either: if I read a slightly out-of-date value, I just go through the loop a few extra times. Not a problem.

For the common case of using an integer as a lock-free flag to shut down a thread, `volatile` is absolutely necessary. (A memory barrier is equivalent, but usually less portable.) This surprised me, because I had been inundated by various sources claiming that `volatile` was not only The Wrong Thing, but unnecessary unless you had some kind of deep interaction with hardware (e.g. memory-mapped registers).

With that said, if you’re doing anything more complicated than this… do it right. Use mutexes and atomic operations. Lock-free algorithms are hard to get right, so don’t go there unless you’ve proven that you really need it. In other words, don’t optimize away the mutexes unless you really need to (and unless you’re ready for some tricky debugging).
