| I l @ ve RuBoard |
Item 16. Lazy Optimization, Part 4: Multithreaded EnvironmentsDifficulty: 8
In this final chapter of the miniseries, we consider the effects of thread safety on
|
| I l @ ve RuBoard |
| I l @ ve RuBoard |
Solution
Thread-Safety Problems with
|
||||||||||||
| Thread 1 | Thread 2 |
|---|---|
| enter s1 's AboutToModify() evaluate " data_->refs > 1 " ( true , because data_->refs is 2 ) | |
| context switch | |
|
enter
s2
's
AboutToModify()
(runs to completion, including
|
|
| context switch | |
|
evaluate "
data_->refs != Unshareable
" (
true
, because
data_->refs
is now
1
) enters
AboutToModify()
's "I'm shared and need to unshare" block, which
|
|
Having covered that, we're ready to see how to solve these safety problems.
In all that
2.
Demonstrate how to make Optimized::String thread-safe:
a)
assuming that there are atomic operations to get, set, and compare integer values; and
b) assuming that there aren't.
I'm going to answer (b) first because it's more general. What's needed here is a lock-management device such as a mutex. [4] The key to what we're about to do is quite simple. It turns out that if we do things right, we need only lock access to the reference count itself.
[4] If you're using Win32, what's called a "critical section" (a
slight hijacking of theterm ) is a lot more efficient than a mutex, and should be used unless you really need the Win32 mutex's heavyweight facilities.
Before doing anything else, we need to add a Mutex member object into Optimized::StringBuf . Let's call the member m :
namespace Optimized
{
class StringBuf
{
public:
StringBuf(); // start off empty
~StringBuf(); // delete the buffer
StringBuf( const StringBuf& other, size_t n = 0 );
// initialize to copy of other,
// and ensure len >= n
void Reserve( size_t n ); // ensure len >= n
char* buf; // allocated buffer
size_t len; // length of buffer
size_t used; // # chars actually used
unsigned refs; // reference count
Mutex m; // serialize work on this object
private:
// No copying...
//
StringBuf( const StringBuf& );
StringBuf& operator=( const StringBuf& );
};
The only function that
The default constructor needs no locks:
String::String() : data_(new StringBuf) { }
The destructor need only lock its
String::~String()
{
bool bDelete = false;
data_->m
.
Lock(); //---------------------------
if( data_->refs == Unshareable --data_->refs < 1 )
{
bDelete = true;
}
data_->m
.
Unlock(); //-------------------------
if( bDelete )
{
delete data_;
}
}
For the String copy constructor, note that we can assume the other String 's data buffer won't be modified or moved during this operation, because it's the responsibility of the caller to serialize access to visible objects. We must still, however, serialize access to the reference count itself, as we did above:
String::String( const String& other )
{
bool bSharedIt = false;
other
.
data_->m
.
Lock(); //
---------------------
if( other.data_->refs != Unshareable )
{
bSharedIt = true;
data_ = other.data_;
++data_->refs;
}
other
.
data_->m
.
Unlock(); //
-------------------
if( !bSharedIt )
{
data_ = new StringBuf( *other.data_ );
}
}
So making the
String
copy constructor safe wasn't very hard at all. This brings us to
AboutToModify()
, which turns out to be very similar. But notice that this sample code actually acquires the lock during the entire deep copy operation ”really, the lock is
void String::AboutToModify(
size_t n,
bool markUnshareable /* = false */
)
{
data_->m
.
Lock(); //
---------------------------
if( data_->refs > 1 && data_->refs != Unshareable )
{
StringBuf* newdata = new StringBuf( *data_, n );
--data_->refs; // now all the real work is
data_->m
.
Unlock(); //
-----------------------
data_ = newdata; // done, so take ownership
}
else
{
data_->m
.
Unlock(); //
-----------------------
data_->Reserve( n );
}
data_->refs = markUnshareable ? Unshareable : 1;
}
None of the other functions need to be changed. Append() and operator[]() don't need locks because once AboutToModify() completes, we're guaranteed that we're not using a shared representation. Length() doesn't need a lock because by definition, we're okay if our StringBuf is not shared (there's no one else to change the used count on us), and we're okay if it is shared (the other thread would take its own copy before doing any work and hence still wouldn't modify our used count on us):
void String::Append( char c )
{
AboutToModify( data_->used+1 );
data_->buf[used++">data_->used++] = c;
}
size_t String::Length() const
{
return data_->used;
}
char& String::operator[]( size_t n )
{
AboutToModify( data_->len, true );
return data_->buf[n];
}
const char String::operator[]( size_t n ) const
{
return data_->buf[n];
}
}
Again, note the interesting thing in all of this: The only locking we needed to do involved the refs count itself.
With that observation and the above general-purpose solution under our belts, let's look back to the (a) part of the question:
a) assuming that there are atomic operations to get, set, and compare integer values; and
Some operating systems provide these kinds of functions.
Note: These functions are usually significantly more efficient than general-purpose synchronization primitives such as
Here is a thread-safe implementation of
String
that assumes we have three functions: an
IntAtomicGet()
, and
IntAtomicDecrement()
and
IntAtomicIncrement()
that safely return the new value. We'll do
namespace Optimized
{
String::String() : data_(new StringBuf) { }
String::~String()
{
if( IntAtomicGet( data_->refs ) == Unshareable
IntAtomicDecrement( data_->refs ) < 1 )
{
delete data_;
}
}
String::String( const String& other )
{
if( IntAtomicGet( other.data_->refs ) != Unshareable )
{
data_ = other.data_;
IntAtomicIncrement( data_->refs );
}
else
{
data_ = new StringBuf( *other.data_ );
}
}
void String::AboutToModify(
size_t n,
bool markUnshareable /* = false */
)
{
int refs = IntAtomicGet( data_->refs );
if( refs > 1 && refs != Unshareable )
{
StringBuf* newdata = new StringBuf( *data_, n );
if( IntAtomicDecrement( data_->refs ) < 1 )
{ // just in case two threads
delete newdata; // are trying this at once
}
else
{ // now all the real work is
data_ = newdata; // done, so take ownership
}
}
else
{
data_->Reserve( n );
}
data_->refs = markUnshareable ? Unshareable : 1;
}
void String::Append( char c )
{
AboutToModify( data_->used+1 );
data_->buf[used++">data_->used++] = c;
}
size_t String::Length() const
{
return data_->used;
}
char& String::operator[]( size_t n )
{
AboutToModify( data_->len, true );
return data_->buf[n];
}
const char String::operator[]( size_t n ) const
{
return data_->buf[n];
}
}
3. What are the effects on performance? Discuss.
Without atomic integer operations, copy-on-write typically incurs a significant performance penalty. Even with atomic integer operations, COW can make common String operations take nearly 50% longer, even in single-threaded programs.
In general, copy-on-write is often a bad idea in multithread-ready code. That's because the calling code can no longer know whether two distinct String objects actually share the same representation under the covers, so String must incur overhead to do enough serialization so that calling code can take its normal share of responsibility for thread safety. Depending on the availability of more-efficient options such as atomic integer operations, the impact on performance ranges from moderate to profound.
In this test environment, I
Name Description (refined versions of
code shown earlier)
--------------- ---------------------------------------
Plain Non-use-counted string; all others are
modeled on this
COW_Unsafe Plain + COW, not thread-safe
COW_AtomicInt Plain + COW + thread-safe
COW_AtomicInt2 COW_AtomicInt + StringBuf in same
buffer as the data
COW_CritSec Plain + COW + thread-safe (Win32
critical sections)
COW_Mutex Plain + COW + thread-safe (Win32
mutexes) (COW_CritSec with Win32 mutexes
instead of Win32 critical sections)
I also
Plain_FastAlloc Plain + an optimized memory allocator
I focused on comparing Plain with COW_AtomicInt . COW_AtomicInt was generally the most efficient thread-safe COW implementation. The results were as follows:
For all mutating and possibly-mutating operations,
COW_AtomicInt
was always
COW should
When 33% of all copies were never modified and the rest were modified only once each, COW_AtomicInt was still slower than Plain .
When 50% of all copies were never modified and the rest were modified only thrice each, COW_AtomicInt was still slower than Plain .
This second result may be the more surprising one to many ”particularly the result that COW_AtomicInt is slower in cases in which there are more copy operations than mutating operations in the entire system!
Note that in both cases traditional thread-unsafe COW did perform better than Plain . This shows that COW can be an optimization for purely single-threaded environments, but it is less often appropriate for thread-safe code.
It is a myth that COW's principal advantage lies in avoiding memory
Optimized allocation, not COW, was a consistent true speed optimization in all cases (but note that it does trade off space). Here is perhaps the most important conclusion from the Detailed Measurements section in the appendixes:
Most of COW's primary advantage for small strings could be
gained without COW by using a more efficient allocator . (Of course, you could also do both ”use COW and an efficient allocator . )
Let's
First, why measure something as inefficient as COW_CritSec ? The answer is simple: Because at least one popular commercial basic_string implementation used this method as recently as 2000 (and perhaps still does, I haven't seen their code lately), despite the fact that COW_CritSec is nearly always a pessimization. Be sure to check whether your library vendor is doing this, because if the library is built for possible multithreaded use, you will bear the performance cost all the time, even if your program is single-threaded.
Second, what's
COW_AtomicInt2
? It's the same as
COW_AtomicInt
except that instead of allocating a
StringBuf
and then separately allocating the data buffer, the
StringBuf
and data are in the same allocated block of memory. Note that all other
COW_*
variants use a fast allocator for the
StringBuf
object (so that there's no unfair double allocation cost), and the purpose of
COW_AtomicInt2
is
I also tested the relative performance of various integer operations (incrementing int , incrementing volatile int , and incrementing int using the Win32 atomic integer operations) to ensure that COW_AtomicInt results were not unfairly skewed by poor implementations or function call overhead.
A few important points about this test harness and these results:
Caveat lector:
Take this for what it is, a first cut at a test harness. Comments and corrections are welcome. I'm showing raw performance
TANSTAAFL.
("There ain't no such thing as a free
Tanj.
("There ain't no
| I l @ ve RuBoard |