Introduction
Last time I presented basic work on my implementation of Signals/Slots which was thread safe, full-featured (well, is capable of being full-featured), and performed quite well. I had a decent implementation, but there were problems with passing iterators. I have figured out a better solution by making a small change. This improves performance, as well as easily allows the use of begin/end iterators rather than a single iterator which tracked all state.
Theory
The change I made is really simple: I removed the special head/tail node instances. The end iterator is simply nullptr. This of course means I have to be careful with write operations operations, but as it turns out the actual implementation isn't much more complicated, and possibly may even be slightly simpler. There are fewer atomic operations required to setup/end an emit, so the static overhead is significantly less.
Push Front sample implementation
connection push_front(const SlotFunction& f) { boost::shared_ptr< slot_wrapper > node = boost::make_shared< slot< ResultType(Args...), SlotFunction > >(f); connection conn(*this, node); // acquire unique write lock boost::unique_lock< Mutex > lock(_lock); if (head == nullptr) { tail = node; group_head = node; } else { node->next = head; head->prev = node; } boost::atomic_store(&head, std::move(node)); return conn; }
Emit sample implementation
typename Combiner::result_type signal::operator()(Args ... args) const { boost::shared_ptr< slot_wrapper > begin_ptr = boost::atomic_load(&head); const std::tuple< Args... > params(args...); iterator begin(std::move(begin_ptr), params); iterator end(boost::shared_ptr< slot_wrapper >(), params); return combiner(std::move(begin), std::move(end)); }
Benchmark Results
I'm using the same setup as last-time (I actually just added my new timings to the old results). Again, I am using a laptop with an Intel i5-m430 running Windows 7 x64, compiling with mingw-w64 gcc 4.8.1.
Test Case | No Slots (ns) | 1 Slot (ns) | 2 Slots (ns) | 4 Slots (ns) | 8 Slots (ns) | 16 Slots (ns) | 32 Slots (ns) | 64 Slots (ns) | static overhead (ns) | dynamic overhead (ns/slot) |
---|---|---|---|---|---|---|---|---|---|---|
std::function calls | 0 | 3.5 | 7.5 | 15 | 29 | 60 | 116 | 232 | 0 | 3.6 |
Qt Signals/Slots | 14 | 75 | 115 | 177 | 312 | 581 | 1123 | 2202 | 40 | 33.8 |
Boost Signals2 | 75 | 137 | 183 | 253 | 398 | 705 | 1280 | 2500 | 75 | 37 |
My Implementation - special head/tail nodes | 27 | 57 | 88 | 150 | 272 | 525 | 1010 | 1980 | 27 | 30 |
My Implementation - no unique head/tail nodes | 15 | 47 | 74 | 134 | 258 | 488 | 954 | 1900 | 15 | 30 |
Conclusion
The biggest improvement here is the "standard" begin/end iterator style now can be used without sacrificing performance. Additionally, the static overhead is so small that having a non-thread safe implementation becomes almost moot for signals with no slots. Tim Janik's implementation which isn't thread safe is not that much faster for empty signals, even with his faster i7 hardware. I haven't seen any other stats which would indicate any other implementation (definitely thread-safe implementations, possibly many non-thread safe implementations) perform as well or better. There's also some static memory savings for not having unique head/tail nodes. More to come soon.
No comments :
Post a Comment