C++ Boost

The Boost Statechart Library

Performance


Speed versus scalability tradeoffs
Memory management customization
RTTI customization
Resource usage

Speed versus scalability tradeoffs

Quite a bit of effort has gone into making the library fast for small simple machines and scaleable at the same time (this applies only to state_machine<>, there still is some room for optimizing fifo_scheduler<>, especially for multi-threaded builds). While I believe it should perform reasonably in most applications, the scalability does not come for free. Small, carefully handcrafted state machines will thus easily outperform equivalent Boost.Statechart machines. To get a picture of how big the gap is, I implemented a simple benchmark in the BitMachine example. The Handcrafted example is a handcrafted variant of the 1-bit-BitMachine implementing the same benchmark.

I tried to create a fair but somewhat unrealistic worst-case scenario:

The Benchmarks - compiled with MSVC7.1 (single threaded), running on 3.2GHz Intel Pentium 4 / 1.6GHz Pentium M - produced the following dispatch and transition times per event:

Although this is a big difference I still think it will not be noticeable in most real-world applications. No matter whether an application uses handcrafted or Boost.Statechart machines it will...

Therefore, in real-world applications event dispatch and transition not normally constitutes a bottleneck and the relative gap between handcrafted and Boost.Statechart machines also becomes much smaller than in the worst-case scenario.

BitMachine measurements with more states and with different levels of optimization:

Machine configuration
# states / # outgoing transitions per state
Event dispatch & transition time [nanoseconds]
MSVC 7.1: 3.2GHz Pentium 4 / 1.6GHz Pentium M
GCC 3.4.2: 3.2GHz Pentium 4 / 1.6GHz Pentium M
Out of the box Same as out of the box but with BOOST_STATECHART_USE_NATIVE_RTTI defined Same as out of the box but with customized memory management
2 / 1 410 / 460
540 / 480
490 / 570
510 / 500
130 / 220
320 / 230
4 / 2 440 / 470
560 / 480
530 / 640
570 / 550
160 / 240
330 / 240
8 / 3 450 / 470
580 / 510
580 / 700
610 / 630
180 / 250
340 / 260
16 / 4 490 / 480
710 / 670
720 / 790
770 / 750
230 / 260
460 / 360
32 / 5 590 / 520
790 / 690
820 / 880
920 / 910
340 / 280
590 / 470

Double dispatch

At the heart of every state machine lies an implementation of double dispatch. This is due to the fact that the incoming event and the active state define exactly which reaction the state machine will produce. For each event dispatch, one virtual call is followed by a linear search for the appropriate reaction, using one RTTI comparison per reaction. The following alternatives were considered but rejected:

Memory management customization

Out of the box, all internal data is allocated on the normal heap. This should be satisfactory for applications meeting the following prerequisites:

Should an application not meet these prerequisites customization of all memory management (not just Boost.Statechart's) should be considered, which is supported as follows:

simple_state<> and state<> subtype objects are constructed and destructed only by the state machine. It would therefore be possible to use the state_machine<> allocator instead of forcing the user to overload operator new() and operator delete(). However, a lot of systems employ at most one instance of a particular state machine, which means that a) there is at most one object of a particular state and b) this object is always constructed, accessed and destructed by one and the same thread. We can exploit these facts in a much simpler (and faster) new/delete implementation (for example, see UniqueObject.hpp in the BitMachine example). However, this is only possible as long as we have the freedom to customize memory management for state classes separately.

RTTI customization

RTTI is used for event dispatch and state_downcast<>(). Currently, there are exactly two options:

  1. By default, a speed-optimized internal implementation is employed
  2. The library can be instructed to use native C++ RTTI instead by defining BOOST_STATECHART_USE_NATIVE_RTTI

Just about the only reason to favor 2 is the fact that state and event objects need to store one pointer less, meaning that in the best case the memory footprint of a state machine object could shrink by 15% (an empty event is typically 30% smaller, what can be an advantage when there are bursts of events rather than a steady flow). However, on most platforms executable size grows when C++ RTTI is turned on. So, given the small per machine object savings, option 2 only makes sense in applications where both of the following conditions hold:

Obvious candidates are embedded systems where the executable resides in ROM. Other candidates are applications running a large number of identical state machines where this measure could even reduce the overall memory footprint.

Resource usage

Memory

On a 32-bit box, one empty active state typically needs less than 50 bytes of memory. Even very complex machines will usually have less than 20 simultaneously active states so just about every machine should run with less than one kilobyte of memory (not counting event queues). Obviously, the per-machine memory footprint is offset by whatever state-local members the user adds.

Processor cycles

The following ranking should give a rough picture of what feature will consume how many cycles:

  1. state_cast<>(): By far the most cycle-consuming feature. Searches linearly for a suitable state, using one dynamic_cast per visited state
  2. State entry and exit: Profiling of the fully optimized 1-bit-BitMachine suggested that roughly half of the total dispatch time is spent destructing the exited state and constructing the entered state. Obviously, transitions where the innermost common context is "far" from the leaf states and/or with lots of orthogonal states can easily cause the destruction and construction of quite a few states leading to significant amounts of time spent for a transition
  3. state_downcast<>(): Searches linearly for the requested state, using one virtual call and one RTTI comparison per visited state
  4. Deep history: For all innermost states inside a state passing either has_deep_history or has_full_history to its state base class, a binary search through the (usually small) history map must be performed on each exit. History slot allocation is performed exactly once, at first exit
  5. Shallow history: For all direct inner states of a state passing either has_shallow_history or has_full_history to its state base class, a binary search through the (usually small) history map must be performed on each exit. History slot allocation is performed exactly once, at first exit
  6. Event dispatch: One virtual call followed by a linear search for a suitable reaction, using one RTTI comparison per visited reaction
  7. Orthogonal states: One additional virtual call for each exited state if there is more than one active leaf state before a transition. It should also be noted that the worst-case event dispatch time is multiplied in the presence of orthogonal states. For example, if two orthogonal leaf states are added to a given state configuration, the worst-case time is tripled

Revised 04 June, 2005

© Copyright Andreas Huber Dönni 2003-2005. The link refers to a spam honeypot. Please remove the words spam and trap to obtain my real address.

Distributed under the Boost Software License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)