Tuesday, July 23, 2013

Thread-Safe Signals/Slots using C++11

Introduction

For any C++ developer who's used Qt, the Signals/Slots idiom it presents for writing clean Observer code is easy to love. However, it relies on Qt's moc pre-compiler tool, which means any project that wants to use this feature has to follow the Qt idiom, making Qt applications look somewhat foreign despite being written in C++. In addition, the implementation isn't type-safe (i.e. there is no compile-time type checking), and it doesn't generalize to arbitrary callable targets (you have to extend QObject, declare slots using Qt's syntax, and slots can only return void).

Since then, there have been multiple implementations of Signals/Slots. Some notable ones are listed below:

  • Boost Signals. Not thread safe, performance wasn't great, and it's now deprecated in favor of Boost Signals2. Licensed under the Boost License.
  • Boost Signals2. A thread-safe upgrade of Boost Signals. Others have complained about its performance, but my tests seem to show it's at least decent. Licensed under the Boost License.
  • Libsigc++. Supposedly decently quick, but not thread safe. I think this is also a near-fully featured implementation like Boost Signals/Signals2. Licensed under the LGPL, which makes its use somewhat restricted.
  • libtscb. A thread-safe, fairly quick implementation. However, it skimps on features (I think it offers similar features to Qt's implementation). Also licensed under the LGPL.
  • C++11-based Implementation. This is another blog post which sought to implement Signals/Slots using new features brought by C++11, namely variadic templates, std::function, and possibly more. It is CC0 licensed (public domain). Probably one of the fastest implementations I have seen, but it is not thread-safe.

I was wondering if I could write an implementation as feature-rich as Boost Signals2 but with performance closer to libtscb. I'll be using some C++11 features like the last implementation above, notably atomics, std::function, and variadic templates. I'm also using Boost shared_ptr/weak_ptr, Boost mutex, and Boost's shared_ptr atomic access functions. These Boost libraries are being used because the compiler I'm currently using doesn't have the standard variants implemented (Mingw-w64 gcc 4.8.1).

For the most part, I was able to implement (or at least can see a straightforward implementation of) nearly all of the features provided by Boost Signals2. There are some semantic changes (notably, the interface for combiners is different), and I was unable to capture the O(log(n)) performance of inserting group-ordered slots; my implementation will likely have O(n) group insertion.

Theory/Reasoning Behind Implementation

Some basic observations about how Signals/Slots are usually used:

  • Connecting/disconnecting slots is usually not that time critical.
  • Most signals either have no slots, or very few slots connected.
  • Signals may be invoked from multiple threads, and usually can be evaluated asynchronously. It seems like slots usually can be evaluated asynchronously, but they could be strongly ordered, too. In any case, only forward iteration is required for emitting a signal.
  • Emitting should be as fast as possible.

Thinking about how to best reach these goals, I decided to use an implementation which guarantees a wait-free singly linked list for read operations. The back-end implementation is still a doubly linked list, and only one writer is allowed at a time.

Memory management is taken care of using shared_ptr/weak_ptr. I wanted to see if I could implement this using only standard C++11, and theoretically I could have, but unfortunately I don't have access to a compiler with all the necessary features. Fortunately, I'm only using the Boost equivalents in a standards-compliant manner, so as time goes on, switching to a pure C++11 implementation is just a find/replace operation.
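
To make that concrete, here's a minimal sketch of the Boost spellings this implementation relies on next to their C++11 counterparts (note that, as mentioned above, not every standard library currently ships the shared_ptr atomic overloads, which is why I'm sticking with Boost for now):

#include <boost/shared_ptr.hpp>
#include <boost/make_shared.hpp>
#include <memory>

struct node { int value; };

int main()
{
    // Boost spelling used throughout this post
    boost::shared_ptr< node > b_head = boost::make_shared< node >();
    boost::shared_ptr< node > b_snapshot = boost::atomic_load(&b_head);
    boost::atomic_store(&b_head, boost::make_shared< node >());

    // the standard C++11 spelling it maps onto: the free-function atomic
    // access overloads for shared_ptr live in <memory>
    std::shared_ptr< node > s_head = std::make_shared< node >();
    std::shared_ptr< node > s_snapshot = std::atomic_load(&s_head);
    std::atomic_store(&s_head, std::make_shared< node >());

    (void) b_snapshot;
    (void) s_snapshot;
    return 0;
}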

Basic Slot Class Structure

I'm utilizing a trick for transforming multiple inheritance into single inheritance. There's not much good reason for doing so other than avoiding code bloat issues with Visual Studio. Currently my implementation uses standard C++11 features not supported by any release of Visual Studio (2012 or older), but in the future I want to work towards C++03 compatibility using various Boost libraries (since I'm already using a few for this purpose).

This is not the full code, but a near-bare bones implementation which at least shows the usage and some implementation details.

//
// (c) 2013 helloworld922
//
//  Distributed under the Boost Software License, Version 1.0. (See
//  accompanying file LICENSE_1_0.txt or copy at
//  http://www.boost.org/LICENSE_1_0.txt)

// used for type erasure and linked list node management
class slot_wrapper
{
public:
    // should be private, but for some reason something isn't compiling nicely
    boost::shared_ptr< slot_wrapper > next;
    boost::weak_ptr< slot_wrapper > prev;

    /**
     * Increases the block count.
     * @return: true if it was able to increase the count.
     */
    virtual bool
    block(void) const = 0;
    /**
     * Decreases the block count.
     * @return: true if it was able to decrease the count.
     */
    virtual bool
    unblock(void) const = 0;

    /**
     * Marks this node as being disconnected.
     * @return: true if the slot was not previously marked disconnected.
     */
    virtual bool
    mark_disconnected(void) = 0;

    /**
     * The slot is blocked if the block count is greater than 0.
     * @return: true if the slot is blocked.
     */
    virtual bool
    blocked(void) const = 0;
    /**
     * @return: true if the slot is connected.
     */
    virtual bool
    connected(void) const = 0;
    /**
     * @return: true if the slot is usable.
     */
    virtual bool
    usable(void) const = 0;

    /**
     * @return: true if it is safe to cast this to a callable.
     */
    virtual bool
    valid(void) const = 0;

    /**
     * @return: true if it is safe to cast this to a groupable.
     */
    virtual bool
    grouped(void) const = 0;
};

template<typename Signature, typename B = slot_wrapper>
struct callable;

template<typename Group = int, typename B = slot_wrapper>
class groupable;

// used for general slot management, must be instantiable
template<typename AtomicInt = std::atomic< int >, typename B = slot_wrapper>
class slot_base : public B
{
    AtomicInt _unusable;
};

template<typename Signature, typename FunctionType = std::function< Signature >,
    typename AtomicInt = std::atomic< int >,
    typename B = callable< Signature, slot_base< AtomicInt > > >
class slot;

template<typename ResultType, typename ... Args, typename FunctionType,
    typename AtomicInt, typename B>
class slot< ResultType(Args...), FunctionType, AtomicInt, B > : public B
{
    FunctionType callback;
};

Because I am limiting myself to single inheritance chains only, I have to carefully consider what order classes get inherited in (especially for classes which could be instantiated). Eventually, I ended up with the following chains (sketched in code after the list):

  • slot_base -> slot_wrapper
  • slot -> callable -> slot_base -> slot_wrapper
  • grouped_slot -> slot -> groupable -> callable -> slot_base -> slot_wrapper
  • extended_slot -> callable -> slot_base -> slot_wrapper
  • grouped_extended_slot -> extended_slot -> groupable -> callable -> slot_base -> slot_wrapper
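
To make the chain ordering concrete, here's a simplified sketch (the sketch namespace below is a stand-in with placeholder bodies, not the real classes) of how the base-as-template-parameter trick linearizes one of these combinations:

// simplified sketch: each layer names its base class as a template parameter,
// so a "multiple inheritance" combination becomes one linear chain:
//   grouped_slot -> slot -> groupable -> callable -> slot_base -> slot_wrapper
#include <atomic>
#include <functional>

namespace sketch
{
    struct slot_wrapper { virtual ~slot_wrapper() {} };

    template<typename B>
    struct slot_base : B { std::atomic< int > _unusable; };

    template<typename Signature, typename B>
    struct callable : B { /* operator()(Args...) lives here in the real code */ };

    template<typename Group, typename B>
    struct groupable : B { Group group; };

    template<typename Signature, typename B>
    struct slot : B { std::function< Signature > callback; };

    // the whole combination is a single inheritance chain
    template<typename Signature, typename Group>
    struct grouped_slot
        : slot< Signature,
              groupable< Group,
                  callable< Signature,
                      slot_base< slot_wrapper > > > >
    {};
}

int main()
{
    sketch::grouped_slot< void(int), int > s;
    (void) s;
    return 0;
}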

Basic Signal Structure

Update: While I was implementing this I had assumed that atomic shared_ptr access was lock-free. After looking into the current implementations, they actually use spinlocks on each shared_ptr operation; in other words, not lock-free. However, in theory my implementation could take advantage of truly lock-free atomic shared_ptrs in the future.

The basic structure of the Signal class is a doubly linked list, but internally the list ensures that, atomically, there is always a valid singly linked list from any node to the tail. I am unaware of any generic lock-free doubly linked list implementation which has all of the features I need, so the trade-off I'm making is that only one writer is allowed at a time. Memory management is handled entirely by shared_ptr's internal reference counting mechanism. There might be a more efficient atomic reference counting scheme I could implement, but for now shared_ptr is fast enough.

In order to make signal emission lock-free, I decided to use head and tail sentinel nodes. I also keep a shared_ptr (group_head) to the node where grouped slot insertion can begin searching; it only ever points to an existing node. I don't know if I can implement some sort of interleaved binary tree/singly linked list to allow better grouped slot insertion, but for now I'm going to assume the somewhat crummy group insertion performance is acceptable, since signal emission is the primary optimization goal.

template<typename Signature, typename Combiner = optional_last_value<
    typename std::function< Signature >::result_type >, typename Group = int,
    typename GroupCompare = std::less< Group >, typename SlotFunction = std::function<
            Signature >,
    typename ExtendedSlotFunction = typename extended_signature<
            Signature >::type, typename AtomicInt = std::atomic< int >,
    typename Mutex = boost::mutex>
class signal;

// signal_base is an empty class used purely for type erasure purposes
template<typename ResultType, typename ... Args, typename Combiner, typename Group,
        typename GroupCompare, typename SlotFunction, typename ExtendedSlotFunction,
        typename AtomicInt, typename Mutex>
    class signal< ResultType
    (Args...), Combiner, Group, GroupCompare, SlotFunction, ExtendedSlotFunction, AtomicInt,
            Mutex > : public signal_base
    {
        // not a true standard input iterator, though it possibly could be with some inheritance trickery
        class iterator;

        boost::shared_ptr< slot_wrapper > head;
        boost::shared_ptr< slot_wrapper > tail;
        boost::shared_ptr< slot_wrapper > group_head;

        Combiner combiner;
        
        // only used for write operations
        mutable Mutex _lock;
    };

Basic Algorithm for List Write Operations

  1. Do the non-list-dependent operations (mainly allocating memory for nodes and building the connection object).
  2. Acquire the unique write lock.
  3. Perform all necessary list modification operations. Operations which change the implicit singly linked list must be atomic; the rest need not be.
  4. Release the lock.

Here's an example implementation for push_front (connect new slot at front of list):

connection
push_front(const SlotFunction& f)
{
    // build node and connection manager object
    boost::shared_ptr< slot_wrapper > node =
            boost::make_shared< slot< ResultType(Args...), SlotFunction > >(f);
    connection conn(*this, node);
    // acquire unique write lock
    boost::unique_lock< Mutex > lock(_lock);
    node->next = head->next;
    node->prev = head;
    head->next->prev = node;
    if (group_head == head)
    {
        // move group head to the created node
        group_head = node;
    }
    // needs to be atomic because this changes the singly linked list
    boost::atomic_store(&(head->next), std::move(node));
    // done
    return conn;
}
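
Disconnecting a slot follows the same write algorithm. I haven't finalized this part yet, so the following is only a rough sketch of a hypothetical remove helper (it assumes node is a real slot node, never the head or tail sentinel, and glosses over some group_head bookkeeping):

void
remove(const boost::shared_ptr< slot_wrapper >& node)
{
    // acquire unique write lock (only one writer at a time)
    boost::unique_lock< Mutex > lock(_lock);
    boost::shared_ptr< slot_wrapper > prev = node->prev.lock();
    // fix up the backward link first; readers never follow prev
    node->next->prev = prev;
    if (group_head == node)
    {
        // group_head must always point to a node that is still in the list
        group_head = prev;
    }
    // this is the change readers can observe, so it must be atomic
    boost::atomic_store(&(prev->next), node->next);
    // the removed node keeps its next pointer, so any in-flight iterator
    // sitting on it can still walk forward to the rest of the list
    node->mark_disconnected();
}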

Modified Iterator

One problem I encountered while working on the combiner call implementation was how to efficiently pass iterators to the combiner. The solution I came up with has the iterator object encapsulate all of its state internally, in particular whether it has reached the end node. This means that no separate end iterator is passed. I think the problem could be resolved with a single inheritance chain similar to the one I used for the slot classes, though I haven't tried this out yet.

I am using a std::tuple to store parameters for later execution, and unpacking them using the technique Johannes Schaub presented here.

class signal::iterator
{
    // used for unpacking tuple to function call
    template<int...>
    struct seq
    {};
    template<int N, int... S>
    struct gens : gens<N - 1, N - 1, S...>
    {};
    template<int... S>
    struct gens<0, S...>
    {
        typedef seq<S...> type;
    };

    boost::shared_ptr< slot_wrapper > curr;
    const boost::shared_ptr< slot_wrapper >& end;
    std::tuple<Args...> params;

    template<int... S>
    ResultType call_func(seq<S...>) const
    {
        return static_cast< callable< ResultType
                (Args...), slot_base< AtomicInt > >& >(*curr)(std::get<S>(params)...);
    }

public:
    ResultType
    operator*(void) const
    {
        return call_func(typename gens<sizeof...(Args)>::type());
    }

    iterator&
    operator++(void); // to be defined later

    bool is_end(void) const
    {
        return curr == end;
    }
};
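
As a standalone illustration of that unpacking trick (separate from the signal machinery; deferred_call below is just a hypothetical helper for demonstration), applying a stored std::tuple to a callable looks like this:

#include <tuple>
#include <functional>
#include <iostream>

// index pack generator, same technique as in the iterator above
template<int...>
struct seq {};
template<int N, int... S>
struct gens : gens<N - 1, N - 1, S...> {};
template<int... S>
struct gens<0, S...> { typedef seq<S...> type; };

template<typename R, typename... Args>
class deferred_call
{
    std::function<R(Args...)> f;
    std::tuple<Args...> params;

    template<int... S>
    R invoke(seq<S...>) const
    {
        // expands to f(std::get<0>(params), std::get<1>(params), ...)
        return f(std::get<S>(params)...);
    }

public:
    deferred_call(std::function<R(Args...)> fn, Args... args)
        : f(std::move(fn)), params(args...)
    {}

    R operator()() const
    {
        return invoke(typename gens<sizeof...(Args)>::type());
    }
};

int main()
{
    deferred_call<int, int, int> add([](int a, int b) { return a + b; }, 2, 3);
    std::cout << add() << std::endl; // prints 5
    return 0;
}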

Signal Emit Algorithm

Implementing the signal emit operation is fairly simple; the only thing to keep in mind is that operations which move to the next node must be atomic. I've omitted a sample Combiner implementation here because there's not much different about it (other than keeping in mind that an iterator currently encapsulates its own end state); a rough sketch follows the emit code below.

typename Combiner::result_type
signal::operator()(Args ... args) const
{
    iterator begin(boost::atomic_load(&(head->next)), tail,
            std::tuple< Args... >(args...));
    return combiner(std::move(begin));
}

iterator&
iterator::operator++(void)
{
    if(curr != end)
    {
        do
        {
            curr = boost::atomic_load(&(curr->next));
        }
        while(curr != end && !curr->usable());
    }

    return *this;
}
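
Since I've omitted the combiner, here's a minimal sketch of what one matching this iterator interface could look like. It's a plain last-value combiner for non-void, default-constructible result types; the real optional_last_value would wrap the result in a boost::optional instead:

// sketch of a combiner that works with the modified iterator:
// it receives a single iterator, loops until is_end(), and returns the last value
template<typename T>
struct last_value
{
    typedef T result_type;

    template<typename InputIterator>
    result_type operator()(InputIterator it) const
    {
        result_type value = result_type();
        while (!it.is_end())
        {
            // dereferencing invokes the slot with the stored arguments
            value = *it;
            ++it;
        }
        return value;
    }
};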

Performance Testing

For my testing, I'm going to benchmark Qt Signals/Slots, Boost Signals2, and this implementation. The slot being benchmarked:

int val[2];

void
basic_handler(void)
{
    for (size_t i = 0; i < 2; ++i)
    {
        val[i] += i;
    }
}

The hardware running the test is an Intel i5-m430 laptop running Windows 7 x64. For the Qt test I'm using VS2012 x64, and for the other tests I'm using Mingw-w64 gcc 4.8.1; basically, I didn't want to re-compile Qt with MinGW since I already had it built for VS2012. For the record, I'm testing with Qt 5* and Boost 1.54.0*.

Note*: Well, actually I'm using trunk repository builds, but I'm pretty sure there's no meaningful difference from the releases (it would be silly for Qt to significantly muck around with the Signals/Slots implementation, and I haven't observed any changes to Boost Signals2).

I'm going to time how long it takes to emit signals with various numbers of slots connected (yes, I am averaging over many signal emits). The std::function row is just repeated calls to a std::function wrapping the handler above, more or less used to determine how much time is spent doing actual slot work.
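
For reference, the per-emit numbers come from a timing loop along these lines (a simplified sketch of the methodology, not the exact benchmark harness):

#include <chrono>
#include <cstddef>
#include <functional>
#include <iostream>

int val[2];

void basic_handler(void)
{
    for (std::size_t i = 0; i < 2; ++i)
    {
        val[i] += i;
    }
}

// hypothetical sketch of the timing loop: emit many times and average.
// `emit' stands in for whatever is being measured (a raw std::function,
// a Qt signal wrapper, a Boost Signals2 signal, or my signal class).
template<typename Emit>
double average_emit_ns(Emit emit, std::size_t iterations = 1000000)
{
    auto start = std::chrono::high_resolution_clock::now();
    for (std::size_t i = 0; i < iterations; ++i)
    {
        emit();
    }
    auto stop = std::chrono::high_resolution_clock::now();
    std::chrono::duration< double, std::nano > elapsed = stop - start;
    return elapsed.count() / iterations;
}

int main()
{
    // baseline: repeated calls through a plain std::function
    std::function< void(void) > f = &basic_handler;
    std::cout << average_emit_ns(f) << " ns/emit" << std::endl;
    return 0;
}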

Table 1. Emit Testing (all times in ns; dynamic overhead in ns per slot)

Test Case            No Slots  1 Slot  2 Slots  4 Slots  8 Slots  16 Slots  32 Slots  64 Slots  Static Overhead  Dynamic Overhead
std::function calls         0     3.5      7.5       15       29        60       116       232                0               3.6
Qt Signals/Slots           14      75      115      177      312       581      1123      2202               40              33.8
Boost Signals2             75     137      183      253      398       705      1280      2500               75                37
My Implementation          27      57       88      150      272       525      1010      1980               27                30

So what can we tell from these results? Well, unfortunately it's difficult to know exactly how Qt compares to the others because of the different compilers. However, it is probably safe to say Qt has some short-circuit mechanism for handling empty signals. I would be curious to see how well these implementations would fare under high contention/parallel loads. I suspect my implementation would handle multiple emits with a single writer quite easily because this is what the implementation targets. I don't know how Qt or Boost Signals2 are implemented under the hood, but I suspect both of them use some sort of locking mechanism on emits.

Conclusion

I don't know exactly how badly Boost Signals2 performed in the past, but as far as I can tell its current performance is probably "acceptable" most of the time. However, it does perform poorly when there are no slots connected. I'm going to keep working on this implementation and will eventually release it when it's done (right now it basically works, but many features are unimplemented). The per-emit cost is not as low as what's possible with non-thread-safe implementations, but there are nice gains over the existing thread-safe options. I may even try submitting it for inclusion in Boost (either as Signals3, or using parts of it to improve Signals2).
