This article discusses approaches to, and pitfalls of, running clustering algorithms in concurrent environments, and explains the contracts required for concurrent use of Carrot2 Java API components.
Unlike the 3.x line of Carrot2, the 4.x line has no "managed" support for caching, reusing or managing the concurrency of clustering algorithm instances (previously provided by the Controller instance). This is a deliberate decision: algorithm instances are lightweight (cheap to create and discard), and modern JVMs have much better garbage collection mechanisms.
The following sections of this article show various approaches to configuring the algorithm once and then reusing that configuration in subsequent, possibly concurrent, clustering calls.
The simplest way to achieve thread-safety is to create the algorithm instance on the fly, configure it appropriately and then discard it after the clustering completes. A simple pattern here would be to create a function that transforms a stream of documents into a list of clusters:
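A minimal, self-contained sketch of this pattern follows. So that it runs without the library on the classpath, it uses hypothetical stand-ins: ToyClusteringAlgorithm plays the role of a Carrot2 algorithm, the nested LanguageComponents class mimics the shared language resources (held in a variable named english), and documents and clusters are plain strings. None of these stand-ins is the Carrot2 API. The sketch only illustrates the per-call structure of creating, configuring and discarding an instance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PerCallAlgorithmSketch {
  /** Hypothetical stand-in for the shared, preloaded language resources. */
  public static final class LanguageComponents {
    final String language;
    public LanguageComponents(String language) { this.language = language; }
  }

  /** Hypothetical stand-in for an algorithm: mutable, so not safe to share between threads. */
  public static final class ToyClusteringAlgorithm {
    int minClusterSize = 1; // configurable attribute
    private final Map<String, Integer> counts = new TreeMap<>(); // per-instance scratch state

    /** Groups documents by first word and returns "label (size)" strings; language is unused here. */
    List<String> cluster(Stream<String> documents, LanguageComponents language) {
      documents.forEach(d -> counts.merge(d.split("\\s+")[0].toLowerCase(), 1, Integer::sum));
      return counts.entrySet().stream()
          .filter(e -> e.getValue() >= minClusterSize)
          .map(e -> e.getKey() + " (" + e.getValue() + ")")
          .collect(Collectors.toList());
    }
  }

  /** Each call creates, configures, uses and then discards its own algorithm instance. */
  public static Function<Stream<String>, List<String>> newProcessor(LanguageComponents english) {
    return documentStream -> {
      ToyClusteringAlgorithm algorithm = new ToyClusteringAlgorithm(); // created per call
      algorithm.minClusterSize = 2;                                    // configured in place
      return algorithm.cluster(documentStream, english);               // then left for the GC
    };
  }

  /** Invokes the same processor from several threads and collects every result. */
  public static List<List<String>> runConcurrently(int tasks) {
    LanguageComponents english = new LanguageComponents("English"); // initialized once, shared
    Function<Stream<String>, List<String>> processor = newProcessor(english);
    List<String> docs = List.of("data mining", "data science", "text clustering", "data lakes");

    ExecutorService executor = Executors.newFixedThreadPool(4);
    try {
      List<Callable<List<String>>> jobs = new ArrayList<>();
      for (int i = 0; i < tasks; i++) {
        jobs.add(() -> processor.apply(docs.stream()));
      }
      List<List<String>> results = new ArrayList<>();
      for (Future<List<String>> future : executor.invokeAll(jobs)) {
        results.add(future.get());
      }
      return results;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      executor.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(runConcurrently(8));
  }
}
```

Because no algorithm instance is ever visible to more than one thread, the mutable scratch state inside it needs no synchronization. The only object shared between calls is the immutable language resource.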
An important assumption here is that the LanguageComponents object (english in the example above) has been initialized beforehand (once) and is reused. See the Language components page for more information on initialization and customization of language resources.
Sometimes the configuration can become fairly complex. Clustering algorithm instances can be converted into and recreated from a map, so we can extract all the details from the preconfigured instance and then deep-clone it on demand, as this example shows:
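The snapshot-and-recreate idea can be sketched without the library as follows. The toMap and fromMap methods, the ToyClusteringAlgorithm class and its two attributes are hypothetical stand-ins for illustration only. In Carrot2 the conversion is performed by the library's own attribute utilities rather than by hand-written methods like these.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CloneFromMapSketch {
  /** Hypothetical stand-in for an algorithm whose configuration can be exported to a map. */
  public static final class ToyClusteringAlgorithm {
    int minClusterSize = 1;
    String labelPrefix = "";

    /** Exports every configurable attribute into a plain map. */
    Map<String, Object> toMap() {
      Map<String, Object> attrs = new LinkedHashMap<>();
      attrs.put("minClusterSize", minClusterSize);
      attrs.put("labelPrefix", labelPrefix);
      return attrs;
    }

    /** Creates a fresh, independent instance configured from the map. */
    static ToyClusteringAlgorithm fromMap(Map<String, Object> attrs) {
      ToyClusteringAlgorithm copy = new ToyClusteringAlgorithm();
      copy.minClusterSize = (Integer) attrs.get("minClusterSize");
      copy.labelPrefix = (String) attrs.get("labelPrefix");
      return copy;
    }

    /** Groups documents by their first word and returns "label (size)" strings. */
    List<String> cluster(Stream<String> documents) {
      Map<String, Long> counts = documents.collect(Collectors.groupingBy(
          d -> d.split("\\s+")[0].toLowerCase(), TreeMap::new, Collectors.counting()));
      return counts.entrySet().stream()
          .filter(e -> e.getValue() >= minClusterSize)
          .map(e -> labelPrefix + e.getKey() + " (" + e.getValue() + ")")
          .collect(Collectors.toList());
    }
  }

  /** Configures one instance, snapshots it, and clones from the snapshot on every call. */
  public static Function<Stream<String>, List<String>> newProcessor() {
    ToyClusteringAlgorithm preconfigured = new ToyClusteringAlgorithm();
    preconfigured.minClusterSize = 2;
    preconfigured.labelPrefix = "topic: ";

    // Immutable snapshot of the configuration: safe to share between threads.
    Map<String, Object> attrs = Map.copyOf(preconfigured.toMap());

    return documentStream -> ToyClusteringAlgorithm.fromMap(attrs).cluster(documentStream);
  }

  public static void main(String[] args) {
    Function<Stream<String>, List<String>> processor = newProcessor();
    System.out.println(processor.apply(Stream.of("data mining", "data science", "text clustering")));
  }
}
```

The preconfigured instance is never used for clustering itself. Threads share only the read-only attribute map, and each call works on its own freshly created copy.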