"KMeans" (Machine Learning Method)
- Method for FindClusters, ClusterClassify and ClusteringComponents.
- Partitions data into a specified number k of clusters of similar elements using a k-means clustering algorithm.
Details & Suboptions
- "KMeans" is a classic, simple, centroid-based clustering method. "KMeans" works well when clusters have similar sizes and are locally and isotropically distributed around their centroids. When clusters have very different sizes, are anisotropic or intertwined, or when outliers are present, "KMeans" is likely to give poor results.
- The following plots show the results of the "KMeans" method applied to toy datasets:
- The "KMeans" method aims to find k centroids defining k clusters. Each data point is assigned to its nearest centroid. All points assigned to a given centroid form a cluster.
- The procedure to find the best k centroids is iterative. The search starts by using random centroids and assigning each point to its nearest centroid:
- Once all clusters are defined, the mean of each cluster becomes a new centroid:
- This procedure is repeated until the clusters remain unchanged. This iterative procedure is sometimes called "hard EM" (hard Expectation Maximization).
- The "KMeans" method is similar to the "GaussianMixture" method with a spherical covariance (that is, all clusters are isotropic and have the same size).
- Since the initial centroids are chosen randomly, results might differ from one evaluation to the next.
- The suboption "InitialCentroids" can be used to specify the initial centroids as a list of data points.
- The following suboption can be given:
- "InitialCentroids" (default: Automatic): a list of initial centroids
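
The iterative procedure described above (assign each point to its nearest centroid, move each centroid to the mean of its cluster, repeat until the clusters stop changing) can be sketched in a few lines. The following is a minimal, language-independent illustration in Python, not the implementation behind the "KMeans" method. The function name kmeans and its parameters initial_centroids, seed and max_iter are hypothetical, and Euclidean distance is assumed.

```python
import math
import random

def kmeans(points, k, initial_centroids=None, seed=None, max_iter=100):
    """Illustrative k-means on a list of equal-length numeric tuples.

    If initial_centroids is given, it is used as the starting point and k is
    ignored; otherwise k data points are drawn at random as initial centroids.
    """
    rng = random.Random(seed)
    centroids = [tuple(c) for c in (initial_centroids or rng.sample(points, k))]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # clusters unchanged, so the procedure has converged
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its cluster.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment

# Two well-separated groups of three points each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(data, 2, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)     # cluster index of each point
print(centroids)  # final centroid of each cluster
```

Passing explicit initial centroids makes the run deterministic. Omitting them makes the outcome depend on the random draw, which is the reason repeated evaluations can give different clusterings.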


Examples
Basic Examples (3)
Find exactly four clusters of nearby values using the "KMeans" clustering method:

https://wolfram.com/xid/0dx1j16qpw-wxc5bs


https://wolfram.com/xid/0dx1j16qpw-q8s2rm

Plot computed clusters using the "KMeans" method:

https://wolfram.com/xid/0dx1j16qpw-fj7crm

Train a ClassifierFunction on a list of strings:

https://wolfram.com/xid/0dx1j16qpw-nog76a

Find the cluster assignments and gather the elements by their cluster:

https://wolfram.com/xid/0dx1j16qpw-lkm67v


Options (3)
DistanceFunction (1)
"InitialCentroids" (2)
Generate a list of 100 random colors:

https://wolfram.com/xid/0dx1j16qpw-p4t1fk

Cluster the colors without specifying the initial configuration of centroids using the "KMeans" method:

https://wolfram.com/xid/0dx1j16qpw-f56t7r

Specify the initial colors to be used as centroids using the "KMeans" method:

https://wolfram.com/xid/0dx1j16qpw-ng6mcd


https://wolfram.com/xid/0dx1j16qpw-h5fmvf

Find different clusterings of data using the "KMeans" method by varying the "InitialCentroids":

https://wolfram.com/xid/0dx1j16qpw-7f42ki
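
Why different initial centroids can lead to different clusterings is easiest to see on a tiny dataset. The sketch below is a Python illustration under stated assumptions, not the code from the example above: the helper names lloyd and within_cluster_ss are hypothetical, the data is the four corners of a 4-by-1 rectangle rather than colors, and squared Euclidean distance is assumed. Both starting configurations are stable under the k-means iteration, but they yield different partitions with different within-cluster sums of squares.

```python
def lloyd(points, centroids, max_iter=100):
    """Run k-means iterations starting from the given initial centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assign every point to the centroid with the smallest squared distance.
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it.
        for j in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels

def within_cluster_ss(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[l]))
        for p, l in zip(points, labels)
    )

corners = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Initial centroids on the short sides: clusters split left / right.
cents_a, labels_a = lloyd(corners, [(0.0, 0.5), (4.0, 0.5)])
# Initial centroids on the long sides: clusters split bottom / top.
cents_b, labels_b = lloyd(corners, [(2.0, 0.0), (2.0, 1.0)])

print(labels_a, within_cluster_ss(corners, cents_a, labels_a))
print(labels_b, within_cluster_ss(corners, cents_b, labels_b))
```

The left/right split has a much smaller within-cluster sum of squares than the bottom/top split, yet the iteration never leaves the second configuration once it starts there. The procedure only finds a local optimum, so the choice of initial centroids matters.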

Possible Issues (1)
Create and visualize noisy 2D moon-shaped training and test datasets:

https://wolfram.com/xid/0dx1j16qpw-f8bvfg

Train a ClassifierFunction using "KMeans" for two clusters and find clusters in the test set:

https://wolfram.com/xid/0dx1j16qpw-xpxr9m

Visualizing clusters indicates that "KMeans" performs poorly on intertwined clusters:

https://wolfram.com/xid/0dx1j16qpw-ultqo9
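
The failure on intertwined clusters follows from the geometry: with two centroids, the boundary between the clusters is a straight line, and two interleaved moons cannot be separated by a line. The following Python sketch illustrates this under simplifying assumptions that differ from the example above: the moons are noise-free, there is no separate training and test set, and the helper names two_moons and two_means are hypothetical.

```python
import math

def two_moons(n=51):
    """Noise-free interleaved half-circles; returns (points, true_labels)."""
    ts = [math.pi * i / (n - 1) for i in range(n)]
    upper = [(math.cos(t), math.sin(t)) for t in ts]
    lower = [(1.0 - math.cos(t), 0.5 - math.sin(t)) for t in ts]
    return upper + lower, [0] * n + [1] * n

def two_means(points, c0, c1, max_iter=100):
    """k-means with k = 2, starting from centroids c0 and c1."""
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest of the two centroids (ties go to cluster 0).
        new_labels = [0 if math.dist(p, c0) <= math.dist(p, c1) else 1
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: recompute each centroid as the mean of its members.
        for j in (0, 1):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                mean = tuple(sum(x) / len(members) for x in zip(*members))
                c0, c1 = (mean, c1) if j == 0 else (c0, mean)
    return labels

points, truth = two_moons()
labels = two_means(points, points[0], points[-1])

# Cluster indices are arbitrary, so score the better of the two matchings.
agree = sum(l == t for l, t in zip(labels, truth)) / len(points)
accuracy = max(agree, 1.0 - agree)
print(f"fraction of points grouped with their own moon: {accuracy:.2f}")
```

Whatever initial centroids are used, the reported fraction stays below 1, because the tip of one moon lies inside the convex hull of the other. A method that does not assume compact, isotropic clusters is a better fit for data of this shape.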
