|
The Boost Statechart LibraryPerformance |
Quite a bit of effort has gone into making the library fast for small
simple machines and scaleable at the same time (this applies only to
state_machine<>
, there still is some room for optimizing
fifo_scheduler<>
, especially for multi-threaded builds). While I
believe it should perform reasonably in most applications, the scalability
does not come for free. Small, carefully handcrafted state machines will thus
easily outperform equivalent Boost.Statechart machines. To get a picture of how big
the gap is, I implemented a simple benchmark in the BitMachine example. The
Handcrafted example is a handcrafted variant of the 1-bit-BitMachine
implementing the same benchmark.
I tried to create a fair but somewhat unrealistic worst-case scenario:
The Benchmarks - compiled with MSVC7.1 (single threaded), running on 3.2GHz Intel Pentium 4 / 1.6GHz Pentium M - produced the following dispatch and transition times per event:
Although this is a big difference I still think it will not be noticeable in most real-world applications. No matter whether an application uses handcrafted or Boost.Statechart machines it will...
boost::fast_pool_allocator<>
):As mentioned above, there definitely is some room to improve the timings for the asynchronous machines. Moreover, these are very crude benchmarks, designed to show the overhead of locking and thread context switching. The overhead in a real-world application will typically be smaller and other operating systems can certainly do better in this area. However, I strongly believe that on most platforms the threading overhead is usually larger than the time that Boost.Statechart spends for event dispatch and transition. Handcrafted machines will inevitably have the same overhead, making raw single-threaded dispatch and transition speed much less important
new
and destroy them
after consumption. This will add a few cycles, even if event memory
management is customizedTherefore, in real-world applications event dispatch and transition not normally constitutes a bottleneck and the relative gap between handcrafted and Boost.Statechart machines also becomes much smaller than in the worst-case scenario.
BitMachine measurements with more states and with different levels of optimization:
Machine configuration # states / # outgoing transitions per state |
Event dispatch & transition time [nanoseconds] MSVC 7.1: 3.2GHz Pentium 4 / 1.6GHz Pentium M GCC 3.4.2: 3.2GHz Pentium 4 / 1.6GHz Pentium M |
||
Out of the box | Same as out of the box but with
BOOST_STATECHART_USE_NATIVE_RTTI defined |
Same as out of the box but with customized memory management | |
2 / 1 | 410 / 460 540 / 480 |
490 / 570 510 / 500 |
130 / 220 320 / 230 |
4 / 2 | 440 / 470 560 / 480 |
530 / 640 570 / 550 |
160 / 240 330 / 240 |
8 / 3 | 450 / 470 580 / 510 |
580 / 700 610 / 630 |
180 / 250 340 / 260 |
16 / 4 | 490 / 480 710 / 670 |
720 / 790 770 / 750 |
230 / 260 460 / 360 |
32 / 5 | 590 / 520 790 / 690 |
820 / 880 920 / 910 |
340 / 280 590 / 470 |
At the heart of every state machine lies an implementation of double dispatch. This is due to the fact that the incoming event and the active state define exactly which reaction the state machine will produce. For each event dispatch, one virtual call is followed by a linear search for the appropriate reaction, using one RTTI comparison per reaction. The following alternatives were considered but rejected:
Out of the box, all internal data is allocated on the normal heap. This should be satisfactory for applications meeting the following prerequisites:
Should an application not meet these prerequisites customization of all memory management (not just Boost.Statechart's) should be considered, which is supported as follows:
std::allocator<>
interface
for the Allocator
parameter of the state_machine
class templatesimple_state
, state
and
event
class templates with ones that have a customized operator
new()
and operator delete()
. This can be as easy as
inheriting your customized class templates from the framework-supplied class
templates and your preferred small-object/deterministic/constant-time
allocator base classsimple_state<>
and state<>
subtype objects are
constructed and destructed only by the state machine. It would therefore be
possible to use the state_machine<>
allocator instead of forcing
the user to overload operator new()
and operator delete()
.
However, a lot of systems employ at most one instance of a particular state
machine, which means that a) there is at most one object of a particular state
and b) this object is always constructed, accessed and destructed by one and
the same thread. We can exploit these facts in a much simpler (and faster)
new
/delete
implementation (for example, see
UniqueObject.hpp in the BitMachine example). However, this is only possible as
long as we have the freedom to customize memory management for state classes
separately.
RTTI is used for event dispatch and state_downcast<>()
.
Currently, there are exactly two options:
BOOST_STATECHART_USE_NATIVE_RTTI
Just about the only reason to favor 2 is the fact that state and event objects need to store one pointer less, meaning that in the best case the memory footprint of a state machine object could shrink by 15% (an empty event is typically 30% smaller, what can be an advantage when there are bursts of events rather than a steady flow). However, on most platforms executable size grows when C++ RTTI is turned on. So, given the small per machine object savings, option 2 only makes sense in applications where both of the following conditions hold:
Obvious candidates are embedded systems where the executable resides in ROM. Other candidates are applications running a large number of identical state machines where this measure could even reduce the overall memory footprint.
On a 32-bit box, one empty active state typically needs less than 50 bytes of memory. Even very complex machines will usually have less than 20 simultaneously active states so just about every machine should run with less than one kilobyte of memory (not counting event queues). Obviously, the per-machine memory footprint is offset by whatever state-local members the user adds.
The following ranking should give a rough picture of what feature will consume how many cycles:
state_cast<>()
: By far the most cycle-consuming feature.
Searches linearly for a suitable state, using one dynamic_cast
per visited statestate_downcast<>()
: Searches linearly for the requested
state, using one virtual call and one RTTI comparison per visited statehas_deep_history
or has_full_history
to its
state base class, a binary search
through the (usually small) history map must be performed on each exit.
History slot allocation is performed exactly once, at first exithas_shallow_history
or has_full_history
to its
state base class, a binary search
through the (usually small) history map must be performed on each exit.
History slot allocation is performed exactly once, at first exitRevised 04 June, 2005
© Copyright Andreas Huber Dönni 2003-2005. The link refers to a spam honeypot. Please remove the words spam and trap to obtain my real address.
Distributed under the Boost Software License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)