Algorithms and Concurrency

This article discusses approaches to running clustering algorithms in concurrent environments and the pitfalls to watch for.

Thread-safety

The following sentences explain the contracts required for concurrent use of Carrot2 Java API components:

Unlike the 3.x line of Carrot2, the 4.x line has no "managed" support for caching, reusing or managing concurrency of clustering algorithm instances (previously provided by the Controller instance). This is a deliberate decision: algorithm instances are lightweight (cheap to create and discard) and garbage collectors in modern JVMs handle such short-lived objects efficiently.

The following sections of this article show various approaches to configuring the algorithm once and then reusing that configuration in subsequent, possibly concurrent, clustering calls.

Ephemeral instances

The simplest way to achieve thread-safety is to create the algorithm instance on the fly, configure it appropriately and then discard it after the clustering completes. A simple pattern here would be to create a function that transforms a stream of documents into a list of clusters:
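
A minimal sketch of this pattern follows. It deliberately does not use the Carrot2 classes: ToyAlgorithm and ToyLanguage are hypothetical stand-ins (a mutable, non-thread-safe algorithm and a shared, read-only language resource) so that the example compiles with the JDK alone. With Carrot2 on the classpath, the same shape would presumably apply with the real algorithm class and a LanguageComponents object in their place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.stream.Stream;

final class EphemeralInstancesSketch {
  /** Hypothetical stand-in for shared, read-only language resources. */
  static final class ToyLanguage {
    final String name;
    ToyLanguage(String name) { this.name = name; }
  }

  /** Hypothetical stand-in for a clustering algorithm: mutable and NOT thread-safe. */
  static final class ToyAlgorithm {
    int minDocumentLength = 1; // a configurable attribute
    private final Map<String, List<String>> buckets = new TreeMap<>(); // per-call scratch state

    /** Toy "clustering": groups documents by their first character. */
    List<String> cluster(Stream<String> documents, ToyLanguage language) {
      buckets.clear();
      documents
          .filter(d -> d.length() >= minDocumentLength)
          .forEach(d -> buckets.computeIfAbsent(d.substring(0, 1), k -> new ArrayList<>()).add(d));
      List<String> clusters = new ArrayList<>();
      buckets.forEach((label, docs) -> clusters.add(label + "=" + docs));
      return clusters;
    }
  }

  // Initialized once and shared by every call (it is never modified).
  static final ToyLanguage english = new ToyLanguage("English");

  // Each invocation creates, configures, uses and then drops its own instance.
  static final Function<Stream<String>, List<String>> processor = documents -> {
    ToyAlgorithm algorithm = new ToyAlgorithm();
    algorithm.minDocumentLength = 3;
    return algorithm.cluster(documents, english);
  };

  /** Runs the processor on several batches in parallel; results keep batch order. */
  static List<List<String>> runConcurrently(List<List<String>> batches) {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      List<Future<List<String>>> futures = new ArrayList<>();
      for (List<String> batch : batches) {
        futures.add(pool.submit(() -> processor.apply(batch.stream())));
      }
      List<List<String>> results = new ArrayList<>();
      for (Future<List<String>> future : futures) {
        results.add(future.get());
      }
      return results;
    } catch (Exception e) {
      throw new IllegalStateException("Clustering task failed", e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(runConcurrently(List.of(
        List.of("apple", "avocado", "banana"),
        List.of("cherry", "cranberry"))));
  }
}
```

Because the function never shares an algorithm instance between calls, it can be handed to any number of threads without extra synchronization. The only shared object is the read-only language resource.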


An important assumption here is that the LanguageComponents object (english in the example above) has been initialized beforehand (once) and is reused. See the Language components page for more information on initialization and customization of language resources.

Cloning preconfigured instances

Sometimes the configuration can become fairly complex. Clustering algorithm instances can be converted into and recreated from a map, so we can extract all the details from the preconfigured instance and then deep-clone it on demand, as this example shows:
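
The idea can be sketched as below, again with JDK-only stand-ins rather than the Carrot2 API. ToyAlgorithm and its toMap and fromMap methods are hypothetical names invented for this illustration. They only model the "convert to a map, recreate from a map" round trip described above and make no claim about how Carrot2 exposes it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class CloningSketch {
  /** Hypothetical stand-in for an algorithm whose configuration round-trips through a map. */
  static final class ToyAlgorithm {
    int desiredClusterCount = 5;
    List<String> stopLabels = new ArrayList<>();

    /** Hypothetical helper: dumps the configuration into a plain map (lists are copied). */
    Map<String, Object> toMap() {
      Map<String, Object> map = new LinkedHashMap<>();
      map.put("desiredClusterCount", desiredClusterCount);
      map.put("stopLabels", new ArrayList<>(stopLabels));
      return map;
    }

    /** Hypothetical helper: builds a fresh, independent instance from a configuration map. */
    static ToyAlgorithm fromMap(Map<String, Object> map) {
      ToyAlgorithm algorithm = new ToyAlgorithm();
      algorithm.desiredClusterCount = (Integer) map.get("desiredClusterCount");
      @SuppressWarnings("unchecked")
      List<String> labels = (List<String>) map.get("stopLabels");
      algorithm.stopLabels = new ArrayList<>(labels); // copy, so clones never share state
      return algorithm;
    }

    /** Toy "clustering": keeps at most desiredClusterCount documents that are not stop labels. */
    List<String> cluster(Stream<String> documents) {
      return documents
          .filter(d -> !stopLabels.contains(d))
          .limit(desiredClusterCount)
          .collect(Collectors.toList());
    }
  }

  /** Configures a template once, then clones it for every clustering call. */
  static Function<Stream<String>, List<String>> preconfiguredProcessor() {
    ToyAlgorithm template = new ToyAlgorithm();
    template.desiredClusterCount = 2;
    template.stopLabels.add("misc");

    // Snapshot of the configuration; it is only read after this point.
    Map<String, Object> snapshot = template.toMap();

    return documents -> ToyAlgorithm.fromMap(snapshot).cluster(documents);
  }

  public static void main(String[] args) {
    System.out.println(preconfiguredProcessor().apply(Stream.of("misc", "a", "b", "c")));
  }
}
```

The design choice here is that the (possibly complex) configuration is built exactly once, while each call still works on its own private instance, so the thread-safety argument is the same as for ephemeral instances.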