fastmath.clustering

Clustering algorithms.

Various clustering algorithms backed by the SMILE library.

Currently, only partition clustering is implemented.

Input data

Input is always a sequence of n-dimensional samples, each sample given as a sequence.

For example, 2d samples [[1 2] [2 2] [3 3] ...]

For 1d data you can pass either a sequence of numbers or a sequence of 1-element sequences of numbers:

[1 2 3]
;; or
[[1] [2] [3]]

Distances

Some of the methods use a distance function. Currently supported distances (see the sketch after the list):

  • :euclidean
  • :manhattan
  • :chebyshev
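
A minimal usage sketch, assuming data is a sequence of 2d samples (the cluster count is illustrative):

(clarans data :manhattan 4) ;; CLARANS with Manhattan distance and 4 clusters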

Output

Every function returns a record which contains:

  • :type - name of the method used
  • :data - input data
  • :clustering - sequence of cluster ids
  • :sizes - sizes of clusters
  • :clusters - number of clusters
  • :predict - prediction function (see below), qualifies an additional sample
  • :representatives - list of centroids or medoids if available
  • :info - additional statistics for your samples (like distortion)
  • :obj - SMILE object

Cluster id is an integer ranging from 0 to the number of clusters minus 1. Some methods mark outliers with outlier-id.

The record acts as a function and can qualify an additional sample by calling its :predict function, for example (data is a sequence of 3d samples):

(let [cl (k-means data 10)] (cl [0 1 2]))

See k-means

Regrouping

A clustering record can be regrouped into a list of individual clusters. Call regroup to get a list of maps with the following structure (see the sketch after this list):

  • :key - cluster id
  • :data - samples which belong to the cluster
  • :outliers? - does it contain outliers or not
  • :representative - centroid/medoid or average vector if the former is not available
  • :size - size of cluster
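
A minimal sketch, assuming data is a sequence of samples, showing cluster ids and sizes of the regrouped result:

(let [cl (k-means data 3)]
  (map (juxt :key :size) (regroup cl)))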

Constants

clarans

(clarans data clusters)(clarans data dist clusters)(clarans data dist clusters max-neighbor)(clarans data dist clusters max-neighbor num-local)

Clustering Large Applications based upon RANdomized Search algorithm.

Input:

  • data - sequence of samples
  • clusters - number of clusters

Optional:

  • dist - distance method, default :euclidean
  • max-neighbor - maximum number of neighbors checked during random search
  • num-local - the number of local minima to search for

See more in SMILE doc
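
A minimal sketch of the full arity; the parameter values below are purely illustrative:

(clarans data :euclidean 4 100 8) ;; 4 clusters, max-neighbor 100, num-local 8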

Examples

Usage

(dissoc (clarans (repeatedly
                  1000
                  (fn* []
                    (r/randval 0.1 (r/irand -10 10) (r/irand 100 150))))
                 :chebyshev
                 2)
        :data :clustering
        :obj :predict)
;;=> {:clusters 2,
;;=>  :info {:distortion 11621.0, :maxneighbor 100, :numlocalminima 8},
;;=>  :representatives ((126.0) (-1.0)),
;;=>  :sizes (893 107),
;;=>  :type :clarans}

clustering-methods-list

List of clustering methods.

Examples

List of methods

clustering-methods-list
;;=> (:dbscan :k-means :mec
;;=>          :clarans :g-means
;;=>          :neural-gas :x-means
;;=>          :deterministic-annealing :denclue)

dbscan

(dbscan data min-pts radius)(dbscan data dist min-pts radius)

Density-Based Spatial Clustering of Applications with Noise algorithm.

Input:

  • data - sequence of samples
  • dist (optional) - distance method, default :euclidean
  • min-pts - minimum number of neighbors
  • radius - the neighborhood radius

See more in SMILE doc
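
A minimal sketch of the arity with an explicit distance, assuming data is a sequence of samples (values are illustrative):

(dbscan data :manhattan 10 20.0) ;; Manhattan distance, min-pts 10, radius 20.0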

Examples

3d vectors

(dissoc (dbscan
         (repeatedly
          5000
          (fn* []
            (vector (r/randval 0.1 (r/irand -10 10) (r/irand 100 150))
                    (r/randval (r/irand -10 10) (r/irand 100 150))
                    (r/randval (r/irand -10 10) (r/irand 100 150)))))
         10
         20)
        :data :clustering
        :obj :predict)
;;=> {:clusters 8,
;;=>  :info {:minpts 10.0, :radius 20.0},
;;=>  :representatives nil,
;;=>  :sizes (1110 1121 1132 108 1165 120 116 128 0),
;;=>  :type :dbscan}

denclue

(denclue data sigma m)

DENsity CLUstering algorithm.

Input:

  • data - sequence of samples
  • sigma - gaussian kernel parameter
  • m - number of selected samples, which can be much smaller than the number of all samples

See more in SMILE doc

Examples

Expect 2 clusters, uniform distribution.

((juxt :clusters :sizes :representatives)
 (denclue (repeatedly 100 (fn* [] (r/randval (r/drand) (r/drand 5 6))))
          1
          10))
;;=> [2 (61 39) nil]
(map
 (fn [m] (dissoc m :data))
 (regroup
  (denclue (repeatedly 1000
                       (fn* [] (r/randval 0.1 (r/drand) (r/drand 5 6))))
           1
           10)))
;;=> ({:key 0,
;;=>   :outliers? false,
;;=>   :representative (5.4885025003100845),
;;=>   :size 895}
;;=>  {:key 1,
;;=>   :outliers? false,
;;=>   :representative (0.48723616051539753),
;;=>   :size 105})

deterministic-annealing

(deterministic-annealing data max-clusters)(deterministic-annealing data max-clusters alpha)

Deterministic Annealing algorithm.

Input:

  • data - sequence of samples
  • max-clusters - number of clusters
  • alpha (optional) - temperature decreasing factor (valued from 0 to 1)

See more in SMILE doc

Examples

Usage

(map (fn [m] (dissoc m :data))
     (-> (repeatedly 1000
                     (fn* []
                       (vector (r/randval (r/grand) (r/grand 5 1.0))
                               (r/randval (r/grand) (r/grand 5 1.0)))))
         (deterministic-annealing 4 0.5)
         (regroup)))
;;=> ({:key 1,
;;=>   :outliers? false,
;;=>   :representative (4.959806312537571 2.4871696762711326),
;;=>   :size 478}
;;=>  {:key 0,
;;=>   :outliers? false,
;;=>   :representative (-0.09905633931763892 2.3833226176495117),
;;=>   :size 522})

distances-list

List of distances used in some clustering methods.

Examples

List of distances

distances-list
;;=> (:chebyshev :euclidean :manhattan)

g-means

(g-means data max-clusters)

G-Means algorithm.

Input:

  • data - sequence of samples
  • max-clusters - maximum number of clusters

See more in SMILE doc

Examples

Expect 2 clusters, uniform distribution.

((juxt :clusters :sizes :representatives)
 (g-means (repeatedly 100 (fn* [] (r/randval (r/drand) (r/drand 5 6))))
          4))
;;=> [2 (48 52) ((5.446422151877801) (0.48014682889023114))]

k-means

(k-means data clusters)(k-means data clusters max-iter)(k-means data clusters max-iter runs)

K-Means++ algorithm.

Input:

  • data - sequence of samples
  • clusters - number of clusters
  • max-iter (optional) - maximum number of iterations
  • runs (optional) - maximum number of runs

See more in SMILE doc
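
A minimal sketch of the optional arguments, assuming data is a sequence of samples (values are illustrative):

(k-means data 4 100 10) ;; 4 clusters, at most 100 iterations, 10 runs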

Examples

Usage

(k-means [1 2 3 -1 -1 2 -1 11 111] 4)
;;=> #fastmath.clustering/ClusteringResult
;;=>  {:clustering (1 1 1 3 3 1 3 2 0),
;;=>   :clusters 4,
;;=>   :data [1 2 3 -1 -1 2 -1 11 111],
;;=>   :info {:distortion 2.0000000000000004},
;;=>   :obj
;;=>   #object[smile.clustering.KMeans 0x462733f4 "K-Means distortion: 2,00000\r\nClusters of 9 data points of dimension 1:\r\n  0\t    1 (11.1%)\r\n  1\t    4 (44.4%)\r\n  2\t    1 (11.1%)\r\n  3\t    3 (33.3%)\r\n"],
;;=>   :predict #,
;;=>   :representatives ((111.0) (2.0) (11.0) (-1.0)),
;;=>   :sizes (1 4 1 3),
;;=>   :type :k-means}

Clusters grouped into separate maps.

(regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 4))
;;=> ({:data (1 2 3 2),
;;=>   :key 3,
;;=>   :outliers? false,
;;=>   :representative (2.0),
;;=>   :size 4}
;;=>  {:data (-1 -1 -1),
;;=>   :key 0,
;;=>   :outliers? false,
;;=>   :representative (-1.0),
;;=>   :size 3}
;;=>  {:data (11), :key 2, :outliers? false, :representative (11.0), :size 1}
;;=>  {:data (111),
;;=>   :key 1,
;;=>   :outliers? false,
;;=>   :representative (111.0),
;;=>   :size 1})

Use as predictor

(let [cl (k-means [1 2 3 -1 -1 2 -1 11 111] 4)]
  [(cl -1) (cl 10) (cl 100) (cl 1) (cl -1000) (cl 1000)])
;;=> [3 2 1 0 3 1]

mec

(mec data max-clusters radius)(mec data dist max-clusters radius)

Nonparametric Minimum Conditional Entropy Clustering algorithm.

Input:

  • data - sequence of samples
  • dist (optional) - distance method, default :euclidean
  • max-clusters - maximum number of clusters
  • radius - the neighborhood radius

See more in SMILE doc

Examples

2d vectors

(dissoc (mec (repeatedly
              5000
              (fn* []
                (vector
                 (r/randval 0.1 (r/irand -10 10) (r/irand 100 150))
                 (r/randval (r/irand -10 10) (r/irand 100 150)))))
             :manhattan
             8
             20)
        :data :clustering
        :obj :predict)
;;=> {:clusters 4,
;;=>  :info {:entropy 0.0},
;;=>  :representatives nil,
;;=>  :sizes (2271 2226 259 244),
;;=>  :type :mec}

neural-gas

(neural-gas data clusters)(neural-gas data clusters lambda-i lambda-f eps-i eps-f steps)

Neural Gas algorithm.

Input:

  • data - sequence of samples
  • clusters - number of clusters

Optional:

  • lambda-i - initial lambda value (soft learning radius/rate)
  • lambda-f - final lambda value
  • eps-i - initial epsilon value (learning rate)
  • eps-f - final epsilon value
  • steps - number of iterations

See more in SMILE doc
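
A minimal sketch of the full arity; all parameter values below are illustrative, not defaults:

(neural-gas data 4 10.0 0.01 0.3 0.05 100) ;; clusters, lambda-i, lambda-f, eps-i, eps-f, steps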

Examples

Usage

(neural-gas [1 2 3 -1 -1 2 -1 11 111] 4)
;;=> #fastmath.clustering/ClusteringResult
;;=>  {:clustering (3 3 3 3 3 3 3 0 1),
;;=>   :clusters 4,
;;=>   :data [1 2 3 -1 -1 2 -1 11 111],
;;=>   :info {:distortion 100.21133258180804},
;;=>   :obj
;;=>   #object[smile.vq.NeuralGas 0x3011e6ef "Neural Gas distortion: 100,21133\r\nClusters of 9 data points of dimension 1:\r\n  0\t    1 (11.1%)\r\n  1\t    1 (11.1%)\r\n  2\t    0 ( 0.0%)\r\n  3\t    7 (77.8%)\r\n"],
;;=>   :predict #,
;;=>   :representatives ((9.498437706658049)
;;=>                     (102.02680736489556)
;;=>                     (38.047386026088205)
;;=>                     (0.676705769486632)),
;;=>   :sizes (1 1 0 7),
;;=>   :type :neural-gas}

Clusters grouped into separate maps.

(regroup (neural-gas [1 2 3 -1 -1 2 -1 11 111] 4))
;;=> ({:data (1 2 3 -1 -1 2 -1),
;;=>   :key 0,
;;=>   :outliers? false,
;;=>   :representative (0.676705769486632),
;;=>   :size 7}
;;=>  {:data (11),
;;=>   :key 2,
;;=>   :outliers? false,
;;=>   :representative (9.498437706658049),
;;=>   :size 1}
;;=>  {:data (111),
;;=>   :key 1,
;;=>   :outliers? false,
;;=>   :representative (102.02680736489556),
;;=>   :size 1})

outlier-id

const

;;=> 2147483647

Id of the cluster which contains outliers.
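
A minimal sketch: counting samples marked as outliers, assuming the chosen method (here dbscan) flags noise points with outlier-id:

(let [cl (dbscan data 10 20)]
  (count (filter #(= % outlier-id) (:clustering cl))))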

regroup

(regroup clustered-data)

Transforms a clustering result into a list of clusters as separate maps.

Every map contains:

  • :key - cluster id
  • :data - samples which belong to the cluster
  • :outliers? - does it contain outliers or not
  • :representative - centroid/medoid or average vector if the former is not available
  • :size - size of cluster

The representative is always an n-dimensional sequence, even if the input is a list of numbers.

Empty clusters are skipped.

Examples

Result of clustering with regrouping

(k-means [1 2 3 -1 -1 2 -1 11 111] 7)
;;=> #fastmath.clustering/ClusteringResult
;;=>  {:clustering (0 5 4 3 3 5 3 2 1),
;;=>   :clusters 7,
;;=>   :data [1 2 3 -1 -1 2 -1 11 111],
;;=>   :info {:distortion 0.0},
;;=>   :obj
;;=>   #object[smile.clustering.KMeans 0x2e909445 "K-Means distortion: 0,00000\r\nClusters of 9 data points of dimension 1:\r\n  0\t    1 (11.1%)\r\n  1\t    1 (11.1%)\r\n  2\t    1 (11.1%)\r\n  3\t    3 (33.3%)\r\n  4\t    1 (11.1%)\r\n  5\t    2 (22.2%)\r\n  6\t    0 ( 0.0%)\r\n"],
;;=>   :predict #,
;;=>   :representatives ((1.0) (111.0) (11.0) (-1.0) (3.0) (2.0) (##NaN)),
;;=>   :sizes (1 1 1 3 1 2 0),
;;=>   :type :k-means}
(regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 7))
;;=> ({:data (1), :key 4, :outliers? false, :representative (1.0), :size 1}
;;=>  {:data (2 2), :key 3, :outliers? false, :representative (2.0), :size 2}
;;=>  {:data (3), :key 5, :outliers? false, :representative (3.0), :size 1}
;;=>  {:data (-1 -1 -1),
;;=>   :key 0,
;;=>   :outliers? false,
;;=>   :representative (-1.0),
;;=>   :size 3}
;;=>  {:data (11), :key 2, :outliers? false, :representative (11.0), :size 1}
;;=>  {:data (111),
;;=>   :key 1,
;;=>   :outliers? false,
;;=>   :representative (111.0),
;;=>   :size 1})
(count (regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 7)))
;;=> 6

x-means

(x-means data max-clusters)

X-Means algorithm.

Input:

  • data - sequence of samples
  • max-clusters - maximum number of clusters

See more in SMILE doc

Examples

Expect 2 clusters, gaussian distribution.

((juxt :clusters :sizes :representatives)
 (x-means (repeatedly 10000
                      (fn* [] (r/randval (r/grand) (r/grand 5 1.0))))
          4))
;;=> [2 (4995 5005) ((5.005656562406386) (-0.011168320992583245))]