Solution | More Exceptional C++: 40 New Engineering Puzzles, Programming Problems, and Solutions

I l @ ve RuBoard

Thread-Safety Problems with Copy-on-Write (COW)

Standard C++ is silent on the subject of threads. Unlike Java, C++ has no built-in support for threads, and does not attempt to address thread-safety issues through the language. So why an Item on thread- and concurrency- related issues? Simply because more and more of us are writing multithreaded (MT) programs, and no discussion of copy-on-write String implementations is complete if it does not cover thread-safety issues. Good C++ threading classes and frameworks have existed nearly as long as C++ itself. Today, they include facilities such as those in ACE, ^[2] the omnithread library available as part of omniORB, ^[3] and many commercial libraries. Chances are that today you're already using one of these libraries, or using your own in-house wrapper around your operating system's native services.

^[2] See http://www.gotw.ca/ publications /mxc++/ace.htm.

^[3] See http://www.gotw.ca/publications/mxc++/omniorb.htm.

In conjunction with this Item, see also Appendix : Optimizations That Aren't (In a Multithreaded World). That appendix overlaps with and builds upon the results we'll see in this Item. It demonstrates not only the effects that mistaken optimizations can have on you and your production code, but also how to detect them and protect yourself against them. This Item presents a summary of results for which more details are shown in Appendix : Test Results for Single-Threaded Versus Multithread-Safe String Implementations.

1. Why is Optimized::String (from Item 15) not thread-safe? Give examples.

It is the responsibility of code that uses an object to ensure that access to the object is serialized as needed. For example, if a certain String object could be modified by two different threads, it's not the poor String object's job to defend itself from abuse. It's the job of the code that uses the String to make sure that two threads never try to modify the same String object at the same time.

The problem with the code in Item 15 is twofold. First, the copy-on-write (COW) implementation is designed to hide the fact that two different visible String objects could be sharing a common hidden state. Hence, it's the String class's responsibility to ensure that the calling code never modifies a String whose representation is shared. The String code shown already does precisely that, by performing a deep copy (unsharing the representation), if necessary, the moment a caller tries to modify the String . This part is fine in general.

Unfortunately, it now brings us to the second problem, which is that the code in String that unshares the representation isn't thread-safe. Imagine two String objects s1 and s2 such that:

s1 and s2 happen to share the same representation under the covers (okay, because String is designed for this);
Thread 1 tries to modify s1 (okay, because thread 1 knows that no one else is trying to modify s1 );
Thread 2 tries to modify s2 (okay, because thread 2 knows that no one else is trying to modify s2 );
At the same time (error)

The problem is (d). At the same time, both s1 and s2 will attempt to unshare their shared representation, and the code to do that is not thread-safe. Specifically, consider the very first line of code in String::AboutToModify() :

 void String::AboutToModify(   size_t n,   bool   markUnshareable /* = false */ ) {   if( data_->refs > 1 && data_->refs != Unshareable )   {     /* ... etc. ... */

This if -condition is not thread-safe. For one thing, evaluating even data_->refs > 1 may not be atomic. If so, it's possible that if thread 1 tries to evaluate data_->refs > 1 while thread 2 is updating the value of refs , the value read from data_->refs might be anything ”1, 2, or even something that's neither the original value nor the new value. The problem is that String isn't following the basic thread-safety requirement that code that uses an object must ensure that access to the object is serialized as needed. Here, String has to ensure that no two threads are going to use the same refs value in a dangerous way at the same time. The traditional way to accomplish this is to serialize access to the shared StringBuf (or just its refs member) using a mutex or semaphore. In this case, at minimum the "comparing an int " operation must be guaranteed to be atomic.

This brings us to the second issue: Even if reading and updating refs were atomic, and even if we knew we were running on a single-CPU machine so that the threads were interleaved rather than truly concurrent, there are still two parts to the if condition. The problem is that the thread executing the above code could be interrupted after it evaluates the first part of the condition, but before it evaluates the second part. In our example:

Thread 1	Thread 2
enter `s1` 's `AboutToModify()` evaluate " `data_->refs > 1` " ( `true` , because `data_->refs` is `2` )
context switch
	enter `s2` 's `AboutToModify()` (runs to completion, including decrementing `data_->refs` to the value `1` ) exit `s2` 's `AboutToModify()`
context switch
evaluate " `data_->refs != Unshareable` " ( `true` , because `data_->refs` is now `1` ) enters `AboutToModify()` 's "I'm shared and need to unshare" block, which clones the representation, decrements `data_->refs` to the value , and gives up the last pointer to the `StringBuf` . Poof, we have a memory leak, because the `StringBuf` that had been shared by `s1` and `s2` can now never be deleted exit `s1` 's `AboutToModify()`

Having covered that, we're ready to see how to solve these safety problems.

Protecting COW Strings

In all that follows , remember that the goal is not to protect Optimized::String against every possible abuse. After all, the code using the String object is still responsible for knowing whether any given visible String object could be used by different threads and doing the proper concurrency control if it is ”that's just normal. The problem here is that because of the extra sharing going on under the covers, the calling code can't just perform its normal duty of care because it doesn't know which String objects really are completely distinct and which only look distinct but are actually invisibly joined. The goal, then, is simply for Optimized::String to get back up to the level at which it's safe enough so that other code that uses String can assume that distinct String objects really are distinct, and therefore do the usual proper concurrency control as needed.

2. Demonstrate how to make Optimized::String thread-safe:

a) assuming that there are atomic operations to get, set, and compare integer values; and

b) assuming that there aren't.

I'm going to answer (b) first because it's more general. What's needed here is a lock-management device such as a mutex. ^[4] The key to what we're about to do is quite simple. It turns out that if we do things right, we need only lock access to the reference count itself.

^[4] If you're using Win32, what's called a "critical section" (a slight hijacking of the term ) is a lot more efficient than a mutex, and should be used unless you really need the Win32 mutex's heavyweight facilities.

Before doing anything else, we need to add a Mutex member object into Optimized::StringBuf . Let's call the member m :

 namespace Optimized {   class StringBuf   {   public:       StringBuf();              // start off empty      ~StringBuf();              // delete the buffer       StringBuf( const StringBuf& other, size_t n = 0 );                                 // initialize to copy of other,                                 //  and ensure len >= n       void Reserve( size_t n ); // ensure len >= n       char*    buf;             // allocated buffer       size_t   len;             // length of buffer       size_t   used;            // # chars actually used       unsigned refs;            // reference count  Mutex    m;               // serialize work on this object  private:     // No copying...     //     StringBuf( const StringBuf& );     StringBuf& operator=( const StringBuf& ); };

The only function that necessarily operates on two StringBuf objects at once is the copy constructor. String only calls StringBuf 's copy constructor from two places (from String 's own copy constructor and from AboutToModify() ). Note that String only needs to serialize access to the reference count, because by definition no String will do any work on a StringBuf that's shared. (If it is shared, it will be read in order to take a copy. But we don't need to worry about anyone else trying to change or Reserve() or otherwise alter/move the buffer.)

The default constructor needs no locks:

 String::String() : data_(new StringBuf) { }

The destructor need only lock its interrogation and update of the refs count:

 String::~String() {   bool bDelete = false;  data_->m   .   Lock(); //---------------------------  if( data_->refs == Unshareable  --data_->refs < 1 )   {     bDelete = true;   }  data_->m   .   Unlock(); //-------------------------  if( bDelete )   {     delete data_;   } }

For the String copy constructor, note that we can assume the other String 's data buffer won't be modified or moved during this operation, because it's the responsibility of the caller to serialize access to visible objects. We must still, however, serialize access to the reference count itself, as we did above:

 String::String( const String& other ) {   bool bSharedIt = false;  other   .   data_->m   .   Lock(); //   ---------------------  if( other.data_->refs != Unshareable )   {     bSharedIt = true;     data_ = other.data_;     ++data_->refs;   }  other   .   data_->m   .   Unlock(); //   -------------------  if( !bSharedIt )   {     data_ = new StringBuf( *other.data_ );   } }

So making the String copy constructor safe wasn't very hard at all. This brings us to AboutToModify() , which turns out to be very similar. But notice that this sample code actually acquires the lock during the entire deep copy operation ”really, the lock is strictly needed only when looking at the refs value and again when updating the refs value at the end. But let's go ahead and lock the whole operation instead of getting slightly better concurrency by releasing the lock during the deep copy and then reacquiring it just to update refs :

 void String::AboutToModify(   size_t n,   bool   markUnshareable /* = false */ ) {  data_->m   .   Lock(); //   ---------------------------   if( data_->refs > 1 && data_->refs != Unshareable )   {  StringBuf* newdata = new StringBuf( *data_, n );  --data_->refs;   // now all the real work is   data_->m   .   Unlock(); //   -----------------------  data_ = newdata; //  done, so take ownership   }   else   {  data_->m   .   Unlock(); //   -----------------------  data_->Reserve( n );   }   data_->refs = markUnshareable ? Unshareable : 1; }

None of the other functions need to be changed. Append() and operator[]() don't need locks because once AboutToModify() completes, we're guaranteed that we're not using a shared representation. Length() doesn't need a lock because by definition, we're okay if our StringBuf is not shared (there's no one else to change the used count on us), and we're okay if it is shared (the other thread would take its own copy before doing any work and hence still wouldn't modify our used count on us):

 void String::Append( char c ) {      AboutToModify( data_->used+1 );      data_->buf[used++">data_->used++] = c;    }    size_t String::Length() const    {      return data_->used;    }    char& String::operator[]( size_t n )    {      AboutToModify( data_->len, true );      return data_->buf[n];    }    const char String::operator[]( size_t n ) const    {      return data_->buf[n];    } }

Again, note the interesting thing in all of this: The only locking we needed to do involved the refs count itself.

With that observation and the above general-purpose solution under our belts, let's look back to the (a) part of the question:

a) assuming that there are atomic operations to get, set, and compare integer values; and

Some operating systems provide these kinds of functions.

Note: These functions are usually significantly more efficient than general-purpose synchronization primitives such as mutexes . It is, however, a fallacy to say that we can use atomic integer operations "instead of locks" because locking is still required ”the locking is just generally less expensive than other alternatives, but it's not free by a long shot, as we will see.

Here is a thread-safe implementation of String that assumes we have three functions: an IntAtomicGet() , and IntAtomicDecrement() and IntAtomicIncrement() that safely return the new value. We'll do essentially the same thing we did above, but use only atomic integer operations to serialize access to the refs count:

 namespace Optimized {   String::String() : data_(new StringBuf) { }   String::~String()   {     if( IntAtomicGet( data_->refs ) == Unshareable          IntAtomicDecrement( data_->refs ) < 1 )     {       delete data_;     }   }   String::String( const String& other )   {     if( IntAtomicGet( other.data_->refs ) != Unshareable )     {       data_ = other.data_;       IntAtomicIncrement( data_->refs );     }     else     {       data_ = new StringBuf( *other.data_ );     }   }   void String::AboutToModify(         size_t n,         bool   markUnshareable /* = false */       )       {         int refs = IntAtomicGet( data_->refs );         if( refs > 1 && refs != Unshareable )         {           StringBuf* newdata = new StringBuf( *data_, n );           if( IntAtomicDecrement( data_->refs ) < 1 )           {                  // just in case two threads             delete newdata;  //  are trying this at once           }           else           {                  // now all the real work is             data_ = newdata; //  done, so take ownership           }         }         else         {           data_->Reserve( n );         }         data_->refs = markUnshareable ? Unshareable : 1;       }       void String::Append( char c )       {         AboutToModify( data_->used+1 );         data_->buf[used++">data_->used++] = c;       }       size_t String::Length() const       {         return data_->used;       }       char& String::operator[]( size_t n )       {         AboutToModify( data_->len, true );         return data_->buf[n];       }       const char String::operator[]( size_t n ) const       {         return data_->buf[n];       }     }

3. What are the effects on performance? Discuss.

Without atomic integer operations, copy-on-write typically incurs a significant performance penalty. Even with atomic integer operations, COW can make common String operations take nearly 50% longer, even in single-threaded programs.

In general, copy-on-write is often a bad idea in multithread-ready code. That's because the calling code can no longer know whether two distinct String objects actually share the same representation under the covers, so String must incur overhead to do enough serialization so that calling code can take its normal share of responsibility for thread safety. Depending on the availability of more-efficient options such as atomic integer operations, the impact on performance ranges from moderate to profound.

Some Empirical Results

In this test environment, I tested the following six main flavors of string implementations.

 Name  Description (refined versions of                               code shown earlier) ---------------  ---------------------------------------           Plain  Non-use-counted string; all others are                  modeled on this      COW_Unsafe  Plain + COW, not thread-safe   COW_AtomicInt  Plain + COW + thread-safe  COW_AtomicInt2  COW_AtomicInt + StringBuf in same                  buffer as the data     COW_CritSec  Plain + COW + thread-safe (Win32                  critical sections)       COW_Mutex  Plain + COW + thread-safe (Win32                  mutexes) (COW_CritSec with Win32 mutexes                  instead of Win32 critical sections)

I also threw in a seventh flavor to measure the result of optimizing memory allocation instead of optimizing copying:

 Plain_FastAlloc  Plain + an optimized memory allocator

I focused on comparing Plain with COW_AtomicInt . COW_AtomicInt was generally the most efficient thread-safe COW implementation. The results were as follows:

For all mutating and possibly-mutating operations, COW_AtomicInt was always worse than Plain . This is natural and expected.
COW should shine when there are many unmodified copies. But for an average string length of 50:
1. When 33% of all copies were never modified and the rest were modified only once each, COW_AtomicInt was still slower than Plain .
2. When 50% of all copies were never modified and the rest were modified only thrice each, COW_AtomicInt was still slower than Plain .

This second result may be the more surprising one to many ”particularly the result that COW_AtomicInt is slower in cases in which there are more copy operations than mutating operations in the entire system!

Note that in both cases traditional thread-unsafe COW did perform better than Plain . This shows that COW can be an optimization for purely single-threaded environments, but it is less often appropriate for thread-safe code.

It is a myth that COW's principal advantage lies in avoiding memory allocations . Especially for longer strings, COW's principal advantage is that it avoids copying the characters in the string.
Optimized allocation, not COW, was a consistent true speed optimization in all cases (but note that it does trade off space). Here is perhaps the most important conclusion from the Detailed Measurements section in the appendixes:

Most of COW's primary advantage for small strings could be gained without COW by using a more efficient allocator . (Of course, you could also do both ”use COW and an efficient allocator . )

Let's briefly raise and answer two natural questions.

First, why measure something as inefficient as COW_CritSec ? The answer is simple: Because at least one popular commercial basic_string implementation used this method as recently as 2000 (and perhaps still does, I haven't seen their code lately), despite the fact that COW_CritSec is nearly always a pessimization. Be sure to check whether your library vendor is doing this, because if the library is built for possible multithreaded use, you will bear the performance cost all the time, even if your program is single-threaded.

Second, what's COW_AtomicInt2 ? It's the same as COW_AtomicInt except that instead of allocating a StringBuf and then separately allocating the data buffer, the StringBuf and data are in the same allocated block of memory. Note that all other COW_* variants use a fast allocator for the StringBuf object (so that there's no unfair double allocation cost), and the purpose of COW_AtomicInt2 is mainly to demonstrate that I have actually addressed that concern. COW_AtomicInt2 is actually slightly slower than COW_AtomicInt for most operations, because of the extra logic.

I also tested the relative performance of various integer operations (incrementing int , incrementing volatile int , and incrementing int using the Win32 atomic integer operations) to ensure that COW_AtomicInt results were not unfairly skewed by poor implementations or function call overhead.

Summary

A few important points about this test harness and these results:

Caveat lector: Take this for what it is, a first cut at a test harness. Comments and corrections are welcome. I'm showing raw performance numbers here. I haven't inspected the compiled machine code, and I've made no attempt to determine the impact of cache hits/misses and other secondary effects.
TANSTAAFL. ("There ain't no such thing as a free lunch ." ”Robert A. Heinlein) Even with atomic integer operations, it is incorrect to say "there's no locking required," because the atomic integer operations clearly do perform serialization and do incur significant overhead.
Tanj. ("There ain't no justice ." ”Larry Niven) The test harness itself is single -threaded. A thread-safe COW implementation incurs overhead even in programs that are not themselves multithreaded. At best, COW could be a true optimization only when the COW code does not need to be made thread-safe at all (even then, see [Murray93] pages 70 “72 for more empirical tests that show COW is beneficial only in certain situations). If thread safety is required, COW imposes a significant performance penalty on all users, even users running only single-threaded code.

I l @ ve RuBoard