fastmath.clustering
Clustering algorithms.
Various clustering algorithms backed by the SMILE library.
Currently only partition clustering is implemented.
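The code examples in this document assume the following setup (an assumption based on the r/ alias used in the sample generators below):

(require '[fastmath.clustering :refer :all]
         '[fastmath.random :as r])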
Input data
Input is always a sequence of n-dimensional samples, each sample itself a sequence.
For example, 2d samples: [[1 2] [2 2] [3 3] ...]
For 1d data you can pass either a sequence of numbers or a sequence of 1-element sequences:
[1 2 3]
;; or
[[1] [2] [3]]
Distances
Some of the methods use distance functions. Currently supported are:
:euclidean
:manhattan
:chebyshev
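The distance is passed as a keyword in the dist position of the methods that support it. A minimal sketch (data is a hypothetical binding for a sequence of samples; min-pts and radius values are illustrative, taken from the dbscan example below):

;; same clustering call with two different metrics; only dist changes
(dbscan data :manhattan 10 20)
(dbscan data :chebyshev 10 20)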
Output
Every function returns a record which contains:
- :type - name of the method used
- :data - input data
- :clustering - sequence of cluster ids
- :sizes - sizes of clusters
- :clusters - number of clusters
- :predict - predicting function (see below), qualifies an additional sample
- :representatives - list of centroids or medoids, if available
- :info - additional statistics for your samples (like distortion)
- :obj - SMILE object
Cluster id is an integer ranging from 0 to the number of clusters minus 1. Some methods mark outliers with outlier-id.
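For example, samples marked as outliers (e.g. by dbscan) can be recovered by matching cluster ids against outlier-id. A minimal sketch, assuming data is bound to a sequence of samples and using illustrative dbscan parameters:

(let [{:keys [data clustering]} (dbscan data 10 20)]
  ;; pair each sample with its cluster id, keep the ones marked as outliers
  (map first
       (filter (fn [[_ id]] (== id outlier-id))
               (map vector data clustering))))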
The record acts as a function and can qualify an additional sample by calling the :predict function, for example (data is a sequence of 3d samples):
(let [cl (k-means data 10)] (cl [0 1 2]))
See k-means
Regrouping
A clustering record can be regrouped into a list of individual clusters. Call regroup to get a list of maps with the following structure:
- :key - cluster id
- :data - samples which belong to the cluster
- :outliers? - does it contain outliers or not
- :representative - centroid/medoid, or average vector if the former is not available
- :size - size of cluster
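A regrouped result is convenient for per-cluster processing. A minimal sketch picking the largest cluster's representative (data is a hypothetical binding; cluster count is illustrative):

(->> (regroup (k-means data 5))
     (apply max-key :size)
     :representative)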
Categories
Other vars: ->ClusteringResult clarans clustering-methods-list dbscan denclue deterministic-annealing distances-list g-means k-means map->ClusteringResult mec neural-gas regroup x-means
Constants
- outlier-id = 2147483647
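The value equals Integer/MAX_VALUE:

(== outlier-id Integer/MAX_VALUE)
;;=> true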
clarans
(clarans data clusters)
(clarans data dist clusters)
(clarans data dist clusters max-neighbor)
(clarans data dist clusters max-neighbor num-local)
Clustering Large Applications based upon RANdomized Search algorithm.
Input:
- data - sequence of samples
- clusters - number of clusters
Optional:
- dist - distance method, default :euclidean
- max-neighbor - maximum number of neighbors checked during random search
- num-local - the number of local minima to search for
See more in SMILE doc
Examples
Usage
(dissoc (clarans (repeatedly 1000 #(r/randval 0.1 (r/irand -10 10) (r/irand 100 150)))
                 :chebyshev 2)
        :data :clustering :obj :predict)
;;=> {:clusters 2,
;;=> :info {:distortion 11327.0, :maxneighbor 100, :numlocalminima 8},
;;=> :representatives ((124.0) (-2.0)),
;;=> :sizes (885 115),
;;=> :type :clarans}
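The longer arities expose the search parameters. A hedged sketch with illustrative values, not defaults (data is a hypothetical binding for a sequence of samples):

;; 3 clusters, check up to 50 neighbors per step, search 4 local minima
(clarans data :euclidean 3 50 4)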
clustering-methods-list
List of clustering methods.
Examples
List of methods
clustering-methods-list
;;=> (:dbscan :k-means :mec
;;=> :clarans :g-means
;;=> :neural-gas :x-means
;;=> :deterministic-annealing :denclue)
dbscan
(dbscan data min-pts radius)
(dbscan data dist min-pts radius)
Density-Based Spatial Clustering of Applications with Noise algorithm.
Input:
- data - sequence of samples
- dist (optional) - distance method, default :euclidean
- min-pts - minimum number of neighbors
- radius - the neighborhood radius
See more in SMILE doc
Examples
3d vectors
(dissoc (dbscan (repeatedly 5000
                            #(vector (r/randval 0.1 (r/irand -10 10) (r/irand 100 150))
                                     (r/randval (r/irand -10 10) (r/irand 100 150))
                                     (r/randval (r/irand -10 10) (r/irand 100 150))))
                10 20)
        :data :clustering :obj :predict)
;;=> {:clusters 8,
;;=> :info {:minpts 10.0, :radius 20.0},
;;=> :representatives nil,
;;=> :sizes (1156 1150 114 1131 1108 121 102 118 0),
;;=> :type :dbscan}
denclue
(denclue data sigma m)
DENsity CLUstering algorithm.
Input:
- data - sequence of samples
- sigma - gaussian kernel parameter
- m - number of selected samples, should be much smaller than the number of all samples
See more in SMILE doc
Examples
Expect 2 clusters, uniform distribution.
((juxt :clusters :sizes :representatives)
 (denclue (repeatedly 100 #(r/randval (r/drand) (r/drand 5 6))) 1 10))
;;=> [2 (45 55) nil]
(map (fn [m] (dissoc m :data))
     (regroup (denclue (repeatedly 1000 #(r/randval 0.1 (r/drand) (r/drand 5 6))) 1 10)))
;;=> ({:key 0,
;;=> :outliers? false,
;;=> :representative (5.499970399486117),
;;=> :size 888}
;;=> {:key 1,
;;=> :outliers? false,
;;=> :representative (0.4915343080890528),
;;=> :size 112})
deterministic-annealing
(deterministic-annealing data max-clusters)
(deterministic-annealing data max-clusters alpha)
Deterministic Annealing algorithm.
Input:
- data - sequence of samples
- max-clusters - maximum number of clusters
- alpha (optional) - temperature decreasing factor (between 0 and 1)
See more in SMILE doc
Examples
Usage
(map (fn [m] (dissoc m :data))
     (-> (repeatedly 1000 #(vector (r/randval (r/grand) (r/grand 5 1.0))
                                   (r/randval (r/grand) (r/grand 5 1.0))))
         (deterministic-annealing 4 0.5)
         (regroup)))
;;=> ({:key 1,
;;=> :outliers? false,
;;=> :representative (-0.08706401571834416 2.8510783764716736),
;;=> :size 491}
;;=> {:key 0,
;;=> :outliers? false,
;;=> :representative (5.006798015816326 2.4072854544072357),
;;=> :size 509})
distances-list
List of distances used in some clustering methods.
Examples
List of distances
distances-list
;;=> (:chebyshev :euclidean :manhattan)
g-means
(g-means data max-clusters)
G-Means algorithm.
Input:
- data - sequence of samples
- max-clusters - maximum number of clusters
See more in SMILE doc
Examples
Expect 2 clusters, uniform distribution.
((juxt :clusters :sizes :representatives)
 (g-means (repeatedly 100 #(r/randval (r/drand) (r/drand 5 6))) 4))
;;=> [2 (44 56) ((0.49905290833728166) (5.454031860076209))]
k-means
(k-means data clusters)
(k-means data clusters max-iter)
(k-means data clusters max-iter runs)
K-Means++ algorithm.
Input:
- data - sequence of samples
- clusters - number of clusters
- max-iter (optional) - maximum number of iterations
- runs (optional) - maximum number of runs
See more in SMILE doc
Examples
Usage
(k-means [1 2 3 -1 -1 2 -1 11 111] 4)
;;=> #fastmath.clustering/ClusteringResult
;;=> {:clustering (2 2 2 0 0 2 0 3 1),
;;=> :clusters 4,
;;=> :data [1 2 3 -1 -1 2 -1 11 111],
;;=> :info {:distortion 2.0000000000000004},
;;=> :obj
;;=> #object[smile.clustering.KMeans 0x522b6f85 "K-Means distortion: 2,00000\r\nClusters of 9 data points of dimension 1:\r\n 0\t 3 (33.3%)\r\n 1\t 1 (11.1%)\r\n 2\t 4 (44.4%)\r\n 3\t 1 (11.1%)\r\n"],
;;=> :predict #,
;;=> :representatives ((-1.0) (111.0) (2.0) (11.0)),
;;=> :sizes (3 1 4 1),
;;=> :type :k-means}
Clusters grouped into separate maps.
(regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 4))
;;=> ({:data (1 2 3 2),
;;=> :key 0,
;;=> :outliers? false,
;;=> :representative (2.0),
;;=> :size 4}
;;=> {:data (-1 -1 -1),
;;=> :key 3,
;;=> :outliers? false,
;;=> :representative (-1.0),
;;=> :size 3}
;;=> {:data (11), :key 2, :outliers? false, :representative (11.0), :size 1}
;;=> {:data (111),
;;=> :key 1,
;;=> :outliers? false,
;;=> :representative (111.0),
;;=> :size 1})
Use as predictor
(let [cl (k-means [1 2 3 -1 -1 2 -1 11 111] 4)]
[(cl -1) (cl 10) (cl 100) (cl 1) (cl -1000) (cl 1000)])
;;=> [0 2 1 3 0 1]
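The longer arities control the iteration budget. A sketch with illustrative values, not the library defaults:

;; at most 200 iterations per run, best result of 10 runs
(k-means [1 2 3 -1 -1 2 -1 11 111] 4 200 10)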
mec
(mec data max-clusters radius)
(mec data dist max-clusters radius)
Nonparametric Minimum Conditional Entropy Clustering algorithm.
Input:
- data - sequence of samples
- dist (optional) - distance method, default :euclidean
- max-clusters - maximum number of clusters
- radius - the neighborhood radius
See more in SMILE doc
Examples
2d vectors
(dissoc (mec (repeatedly 5000
                         #(vector (r/randval 0.1 (r/irand -10 10) (r/irand 100 150))
                                  (r/randval (r/irand -10 10) (r/irand 100 150))))
             :manhattan 8 20)
        :data :clustering :obj :predict)
;;=> {:clusters 4,
;;=> :info {:entropy 0.0},
;;=> :representatives nil,
;;=> :sizes (2256 2236 254 254),
;;=> :type :mec}
neural-gas
(neural-gas data clusters)
(neural-gas data clusters lambda-i lambda-f eps-i eps-f steps)
Neural Gas algorithm.
Input:
- data - sequence of samples
- clusters - number of clusters
Optional:
- lambda-i - initial lambda value (soft learning radius/rate)
- lambda-f - final lambda value
- eps-i - initial epsilon value (learning rate)
- eps-f - final epsilon value
- steps - number of iterations
See more in SMILE doc
Examples
Usage
(neural-gas [1 2 3 -1 -1 2 -1 11 111] 4)
;;=> #fastmath.clustering/ClusteringResult
;;=> {:clustering (3 3 3 3 3 3 3 0 1),
;;=> :clusters 4,
;;=> :data [1 2 3 -1 -1 2 -1 11 111],
;;=> :info {:distortion 100.21135606663056},
;;=> :obj
;;=> #object[smile.vq.NeuralGas 0x25d0e87d "Neural Gas distortion: 100,21136\r\nClusters of 9 data points of dimension 1:\r\n 0\t 1 (11.1%)\r\n 1\t 1 (11.1%)\r\n 2\t 0 ( 0.0%)\r\n 3\t 7 (77.8%)\r\n"],
;;=> :predict #,
;;=> :representatives ((9.498429886559977)
;;=> (102.02680736489556)
;;=> (38.047386026088205)
;;=> (0.6767057694246992)),
;;=> :sizes (1 1 0 7),
;;=> :type :neural-gas}
Clusters grouped into separate maps.
(regroup (neural-gas [1 2 3 -1 -1 2 -1 11 111] 4))
;;=> ({:data (1 2 3 -1 -1 2 -1),
;;=> :key 2,
;;=> :outliers? false,
;;=> :representative (0.6767057694246992),
;;=> :size 7}
;;=> {:data (11),
;;=> :key 1,
;;=> :outliers? false,
;;=> :representative (9.498429886559977),
;;=> :size 1}
;;=> {:data (111),
;;=> :key 0,
;;=> :outliers? false,
;;=> :representative (102.02680736489556),
;;=> :size 1})
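The 7-arity version exposes the full learning schedule. A hedged sketch with illustrative values, not the defaults:

;; lambda decays 10 -> 0.01, epsilon decays 0.3 -> 0.05, over 1000 steps
(neural-gas [1 2 3 -1 -1 2 -1 11 111] 4 10 0.01 0.3 0.05 1000)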
regroup
(regroup clustered-data)
Transform a clustering result into a list of clusters as separate maps.
Every map contains:
- :key - cluster id
- :data - samples which belong to the cluster
- :outliers? - does it contain outliers or not
- :representative - centroid/medoid, or average vector if the former is not available
- :size - size of cluster
Representative is always an n-dimensional sequence, even if the input is a list of numbers.
Empty clusters are skipped.
Examples
Result of clustering with regrouping
(k-means [1 2 3 -1 -1 2 -1 11 111] 7)
;;=> #fastmath.clustering/ClusteringResult
;;=> {:clustering (4 5 0 3 3 5 3 2 1),
;;=> :clusters 7,
;;=> :data [1 2 3 -1 -1 2 -1 11 111],
;;=> :info {:distortion 0.0},
;;=> :obj
;;=> #object[smile.clustering.KMeans 0x2c0c47be "K-Means distortion: 0,00000\r\nClusters of 9 data points of dimension 1:\r\n 0\t 1 (11.1%)\r\n 1\t 1 (11.1%)\r\n 2\t 1 (11.1%)\r\n 3\t 3 (33.3%)\r\n 4\t 1 (11.1%)\r\n 5\t 2 (22.2%)\r\n 6\t 0 ( 0.0%)\r\n"],
;;=> :predict #,
;;=> :representatives ((3.0) (111.0) (11.0) (-1.0) (1.0) (2.0) (##NaN)),
;;=> :sizes (1 1 1 3 1 2 0),
;;=> :type :k-means}
(regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 7))
;;=> ({:data (1), :key 4, :outliers? false, :representative (1.0), :size 1}
;;=> {:data (2 2), :key 5, :outliers? false, :representative (2.0), :size 2}
;;=> {:data (3), :key 3, :outliers? false, :representative (3.0), :size 1}
;;=> {:data (-1 -1 -1),
;;=> :key 0,
;;=> :outliers? false,
;;=> :representative (-1.0),
;;=> :size 3}
;;=> {:data (11), :key 2, :outliers? false, :representative (11.0), :size 1}
;;=> {:data (111),
;;=> :key 1,
;;=> :outliers? false,
;;=> :representative (111.0),
;;=> :size 1})
(count (regroup (k-means [1 2 3 -1 -1 2 -1 11 111] 7)))
;;=> 6
x-means
(x-means data max-clusters)
X-Means algorithm.
Input:
- data - sequence of samples
- max-clusters - maximum number of clusters
See more in SMILE doc
Examples
Expect 2 clusters, gaussian distribution.
((juxt :clusters :sizes :representatives)
 (x-means (repeatedly 10000 #(r/randval (r/grand) (r/grand 5 1.0))) 4))
;;=> [2 (4997 5003) ((4.9898607148446015) (-0.011945174839566681))]