<> 1. k-means

1. Overview

k-means is an unsupervised learning algorithm. The idea is simple: given a sample set, partition it into K clusters according to the distances between samples, so that the points within each cluster are as close together as possible while the clusters themselves are as far apart as possible.

2. The basic steps of the k-means algorithm:

First, choose k, the number of classes to divide the data into, then randomly initialize k points as the cluster centers.

For each data point, assign it to the category of its nearest cluster center.

Once every data point has been assigned, adjust the centers: move each center to the centroid of all data points in its category, setting each coordinate to the mean of that coordinate over those points (hence the name "means").

Repeat steps 2~3 until the category assignments of the data points no longer change.
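The steps above can be sketched in a few lines of numpy. This is a minimal illustration, not a production implementation; the function name `kmeans` and its parameters are my own choices.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Randomly pick k distinct samples as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center's cluster.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # assignments have stabilized
            break
        centers = new_centers
    return centers, labels
```

On two well-separated blobs of points, this recovers one center per blob after a few iterations.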

3. Shortcomings:

Shortcoming 1: the number of cluster centers K must be given in advance, but in practice choosing K is very difficult; most of the time we do not know how many categories a given dataset should be clustered into.

Shortcoming 2: the k-means algorithm chooses its initial cluster centers at random. Different initial centers can lead to different clustering results, and may cause slow convergence or even clustering errors.

4. Remedies

1. Shortcoming 1 is hard to solve within the k-means algorithm itself or its improved variants. In general, we choose an appropriate k value based on prior knowledge of the data; when no prior knowledge is available, the "elbow method" can be used to select a suitable k.

The elbow method

The core metric of the elbow method is the SSE (sum of squared errors):

SSE = Σ_{i=1}^{k} Σ_{x∈C_i} ||x − μ_i||²

where C_i is the i-th cluster, x ranges over the sample points in C_i, and μ_i is the centroid of C_i (the mean of all samples in C_i). SSE is the total clustering error over all samples and measures the quality of the clustering.
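The SSE metric can be computed directly from its definition. A small sketch (the function name `sse` is my own):

```python
import numpy as np

def sse(X, labels, centers):
    """Sum of squared distances from each sample to its cluster centroid."""
    return sum(
        np.sum((X[labels == i] - centers[i]) ** 2)
        for i in range(len(centers))
    )
```

For example, with two clusters whose points each sit at squared distance 1 from their centroid, the four per-point errors sum to 4.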

The core idea of the elbow method:

As the number of clusters k increases, the partition of the samples becomes finer and each cluster becomes more cohesive, so the sum of squared errors SSE decreases. While k is still below the true number of clusters, each increase of k raises the cohesion of the clusters substantially, so SSE drops steeply. Once k reaches the true number of clusters, further increases in k bring rapidly diminishing returns in cohesion, so the decrease in SSE flattens out as k keeps growing. The plot of SSE against k therefore has the shape of an elbow, and the k value at the elbow is the true number of clusters in the data. This, of course, is why it is called the elbow method.

2. Shortcoming 2 can be addressed with the k-means++ algorithm.

<> 2. k-means++

1. k-means++ optimizes the choice of the initial centroids as follows:

a) Randomly select one point from the input data set as the first cluster center μ1.

b) For each point x_i in the dataset, compute its distance D(x_i) to the nearest of the cluster centers chosen so far.

c) Select a new data point as the next cluster center: the larger D(x) is, the higher the probability of the point being selected (in the standard formulation, a point is sampled with probability proportional to D(x)²). Picking the farthest point outright would tend to select outliers, causing clustering errors or slow convergence.

d) Repeat steps b and c until k cluster centers have been chosen.

e) Run the standard k-means algorithm using these k points as the initial centroids.
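Steps a)~d) can be sketched as follows, using the standard D(x)² sampling rule. A minimal illustration; the function name and seed handling are my own.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: each new center is sampled with probability
    proportional to D(x)^2, the squared distance to the nearest center
    already chosen."""
    rng = np.random.default_rng(seed)
    # a) first center: a uniformly random sample
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # b) D(x)^2: squared distance from each point to its nearest chosen center
        d2 = np.min(
            ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2),
            axis=1,
        )
        # c) sample the next center with probability proportional to D(x)^2
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```

On two tight, far-apart blobs, the D(x)² weighting makes it overwhelmingly likely that the two seeds land in different blobs.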

2. Shortcomings of k-means++

The main drawback of k-means++ is its inherently sequential execution: obtaining k cluster centers requires k passes over the dataset, and the choice of each new center depends on all of the previously chosen ones. This makes the algorithm impossible to scale in parallel and greatly limits its use on large-scale datasets.

The k-means|| algorithm provides a solution to this defect.

<> 3. k-means||

1. Algorithm outline:

Step 1: randomly select one sample as the initial candidate cluster center.

Step 2: compute each sample's distance to its nearest candidate center, and select a batch of points probabilistically (farther points being more likely) as additional candidate cluster centers.

Step 3: repeat step 2, looping about 5 times, to obtain a set of candidate cluster centers larger than the desired K.

Step 4: compute the density of each candidate centroid (how many sample points lie nearest to it).

Step 5: run a weighted k-means++ on the candidate centroid set to obtain exactly K centroids.

Step 6: run the standard k-means algorithm with those centroids.
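The six steps can be sketched as below. This is a simplified illustration under my own assumptions: `ell` is the oversampling factor (how many candidates to draw per round; 2k is a common choice), and the final reduction to K centers is done with a weighted k-means++ pass over the candidates.

```python
import numpy as np

def nearest_d2(X, centers):
    """Squared distance from each row of X to its nearest center."""
    return np.min(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)

def kmeans_parallel_init(X, k, ell=None, rounds=5, seed=0):
    """k-means|| seeding sketch."""
    rng = np.random.default_rng(seed)
    ell = ell or 2 * k
    # Step 1: one random sample starts the candidate set.
    C = X[rng.integers(len(X))][None, :]
    # Steps 2-3: a few rounds of oversampling; each point is kept with
    # probability proportional to its squared distance to the candidates.
    for _ in range(rounds):
        d2 = nearest_d2(X, C)
        probs = np.minimum(ell * d2 / d2.sum(), 1.0)
        picked = X[rng.random(len(X)) < probs]
        if len(picked):
            C = np.vstack([C, picked])
    # Step 4: weight each candidate by how many samples are closest to it.
    closest = np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2), axis=1)
    weights = np.bincount(closest, minlength=len(C)).astype(float)
    # Step 5: reduce the candidates to exactly k centers via weighted k-means++.
    centers = [C[rng.choice(len(C), p=weights / weights.sum())]]
    for _ in range(k - 1):
        d2 = nearest_d2(C, np.array(centers))
        p = weights * d2
        centers.append(C[rng.choice(len(C), p=p / p.sum())])
    return np.array(centers)  # Step 6 would run standard k-means from these
```

Because the per-point sampling in steps 2~3 is independent across points, each round can be distributed across workers, which is the point of the algorithm.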

2. Summary:

k-means|| is a variant of k-means++. When initializing the centers it avoids the drawback of k-means++: instead of strictly selecting exactly k points one after another in k dependent passes, it oversamples candidates in a few rounds that can be parallelized. This breaks through the bottleneck that limited the algorithm on large-scale datasets, and the resulting initial centers are also more robust.

<> 4. Summary