Golden
K-means

K-means is used to find clusters in a data space, where clusters group data points that share similar features with one another.

All edits

Edits on 2 May 2019
Jude Gomila
Jude Gomila edited on 2 May 2019 10:24 pm
Edits made to:
Article (+2/-4 characters)

Article

Where S_i is the set of data points assigned to the i-th cluster. The new centroid position is the average of all the data points assigned to the cluster in the previous step.

Andrea D'Agostino
Andrea D'Agostino edited on 2 May 2019 10:24 pm
Edits made to:
Article (+405/-405 characters)

Article

The K-means algorithm is a clustering algorithm which divides groups of objects into K partitions based on their attributes. A cluster is identified by a centroid or midpoint. The algorithm follows an iterative procedure.

...
  • The algorithm then calculates the centroids of each group; the distances are computed through the Euclidean distance between point and centroid, with the formula:
...

K-means clustering is one of the simplest and most effective algorithms belonging to unsupervised learning. Clusters represent the groups that divide the objects based on whether or not they share a particular similarity between them, and are chosen a priori, before the execution of the algorithm.

...

K-means works by identifying K centroids, imaginary points in space that represent the center of the grouping, and places each point at the nearest cluster. The variable K is defined by the data scientist and defines the number of centroids (or clusters) to be identified.

...

The Elbow method consists of computing the total sum of the distances between each point and its nearest centroid. Since an increase in K produces smaller groupings and distances, this sum will decrease as K increases and vice versa.

Jude Gomila: "changed to sub script. please double check"
Jude Gomila edited on 2 May 2019 10:11 pm
Edits made to:
Article (+2/-4 characters)

Article

Where S_i is the set of data points assigned to the i-th cluster. The new centroid position is the average of all the data points assigned to the cluster in the previous step.

Daniel Frumkin
Daniel Frumkin edited on 2 May 2019 9:40 pm
Edits made to:
Description (+7/-7 characters)
Topic thumbnail

K-means

K-means is used to find clusters in a data space, where clusters group data points that share similar features with one another.

Daniel Frumkin
Daniel Frumkin edited on 2 May 2019 9:38 pm
Edits made to:
Description (+7/-7 characters)
Topic thumbnail

K-Means

K-Means is used to find clusters in a data space, where clusters group data points that share similar features with one another.

Andrea D'Agostino
Andrea D'Agostino edited on 2 May 2019 9:37 pm
Edits made to:
Description (+7/-7 characters)
Article (+82/-82 characters)
Topic thumbnail

K-means

K-means is used to find clusters in a data space, where clusters group data points that share similar features with one another.

Article

The K-means algorithm is a clustering algorithm which divides groups of objects into K partitions based on their attributes. A cluster is identified by a centroid or midpoint. The algorithm follows an iterative procedure.

...

Here are steps that allow the K-means to converge to optimal data separation:

...
A demonstration of how K-means groups data.

K-means clustering is one of the simplest and most effective algorithms belonging to unsupervised learning. Clusters represent the groups that divide the objects based on whether or not they share a particular similarity between them, and are chosen a priori, before the execution of the algorithm.

...

K-means works by identifying K centroids, imaginary points in space that represent the center of the grouping, and places each point at the nearest cluster. The variable K is defined by the data scientist and defines the number of centroids (or clusters) to be identified.

...

In this case, the elbow point falls at the value 4, which should be the number of clusters used to initialize the K-means algorithm with.

...

Advantages and disadvantages of using K-means

...
  • K-means works only on numerical values as it minimizes a cost function by calculating the average of clusters

...
  • Pattern recognition: K-means can be used to distinguish between signal and noise. This applies to basically all scientific fields
Daniel Frumkin
Daniel Frumkin approved a suggestion from Golden's AI on 2 May 2019 9:37 pm
Edits made to:
Article (+8/-8 characters)

Article

  • Identification of outliers: Outliers are data points that present large differences with all the other elements of a data set. Their identification can be interesting for two purposes: the elimination of these anomalous values, which could be caused by errors, or the isolation of these particular cases that may have a certain importance for the business.
Daniel Frumkin
Daniel Frumkin approved a suggestion from Golden's AI on 2 May 2019 9:36 pm
Edits made to:
Article (+19/-19 characters)

Article

  • Pattern recognition: K-Means can be used to distinguish between signal and noise. This applies to basically all scientific fields
Andrea D'Agostino
Andrea D'Agostino edited on 2 May 2019 9:34 pm
Edits made to:
Article (+25/-25 characters)

Article

...

Advantages:

...

Disadvantages:

Andrea D'Agostino
Andrea D'Agostino edited on 2 May 2019 9:34 pm
Edits made to:
Topic thumbnail

K-means

K-means is used to find clusters in a data space, where clusters group data points that share similar features with one another.

Andrea D'Agostino
Andrea D'Agostino edited on 2 May 2019 9:33 pm
Edits made to:
Article (+64/-12 characters)
Documentaries, videos and podcasts (+1 rows) (+2 cells) (+74 characters)

Article

Introduction

...

Elbow method

How to find the ideal number of clusters

Elbow method

Documentaries, videos and podcasts

Title
Date
Link
Andrea D'Agostino
Andrea D'Agostino edited on 2 May 2019 9:31 pm
Edits made to:
Article
People (+1 rows) (+1 cells) (+10 characters)
Further reading (+4 rows) (+16 cells) (+625 characters)
Documentaries, videos and podcasts (+1 rows) (+2 cells) (+43 characters)

Article

C_i = \frac{1}{|S_i|} \sum_{x_i \in S_i} x_i

People

Name
Role
Related Golden topics

J. MacQueen

Further reading

Title
Author
Link
Type

Constrained K-means Clustering with Background Knowledge

Kiri Wagstaff, Claire Cardie, Seth Rogers, Stefan Schroedl

PDF

Some Methods for Classification and Analysis of Multivariate Observations

J. MacQueen

PDF

Understanding K-means Clustering in Machine Learning

Dr. Michael J. Garbade

Web

Documentaries, videos and podcasts

Daniel Frumkin: "Formatting, related topics"
Daniel Frumkin edited on 2 May 2019 9:28 pm
Edits made to:
Article (+42/-55 characters)
Categories (+1/-1 topics)
Related Topics (+5/-1 topics)

Article

  • The algorithm then calculates the centroids of each group; the distances are computed through the Euclidean distance between point and centroid, with the formula:

...

...

...

...

...

Advantages:

...

Disadvantages:

  • Situations can occur in which one or more clusters are not associated with any data points (K is too big)
...

K-means is usually implemented in Python, R, Octave, or Matlab.

Categories

Related Topics

Andrea D'Agostino
Andrea D'Agostino edited on 2 May 2019 9:22 pm
Edits made to:
Article

Article

\underset{c_i \in C}{\mathrm{argmin}} \, \mathrm{dist}(c_i, x)^2
Andrea D'Agostino"First publication"
Andrea D'Agostino edited on 2 May 2019 9:12 pm
Edits made to:
Description (+128 characters)
Article (+2 images) (+5063 characters)
Categories (+3 topics)
Related Topics (+1 topics)
Topic thumbnail

K-means

K-Means is used to find clusters in a data space, where clusters group data points that share similar features with one another.

Article

The K-Means algorithm is a clustering algorithm which divides groups of objects into K partitions based on their attributes. A cluster is identified by a centroid or midpoint. The algorithm follows an iterative procedure.

Here are steps that allow the K-Means to converge to optimal data separation:

  • It initially creates K partitions and assigns the data points to each partition either randomly or using some known heuristic;
  • The algorithm then calculates the centroids of each group; the distances are computed through the Euclidean distance between point and centroid, with the formula

\underset{c_i \in C}{\mathrm{argmin}} \, \mathrm{dist}(c_i, x)^2

where c_i is a centroid in the set C (which includes all the centroids), x is a data point, and dist is the standard Euclidean distance.

  • The centroids for the new clusters are recalculated continuously until the algorithm converges. The new value of a centroid will be the average of all the data points that have been assigned to the new cluster. In mathematical terms we will have the following expression:

C_i = \frac{1}{|S_i|} \sum_{x_i \in S_i} x_i

Where S_i is the set of data points assigned to the i-th cluster. The new centroid position is the average of all the data points assigned to the cluster in the previous step.
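The iterative procedure above can be sketched in a few lines of NumPy. This is an illustrative sketch, not a production implementation: the function name, the seeding of centroids from randomly chosen data points, and the fixed iteration cap are all choices made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize the K centroids with K distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to the nearest centroid
        # (the argmin over squared Euclidean distances)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of the points assigned to it;
        # a cluster that received no points keeps its old position
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

On well-separated data this loop typically converges in a handful of iterations; libraries such as scikit-learn build on the same loop but add smarter seeding (k-means++) and multiple restarts.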

A demonstration of how K-Means groups data.

K-Means clustering is one of the simplest and most effective algorithms belonging to unsupervised learning. Clusters represent the groups that divide the objects based on whether or not they share a particular similarity between them, and are chosen a priori, before the execution of the algorithm.

K-Means works by identifying K centroids, imaginary points in space that represent the center of the grouping, and places each point at the nearest cluster. The variable K is defined by the data scientist and defines the number of centroids (or clusters) to be identified.

Each point is assigned to a specific cluster by computing its distance to each centroid and choosing the nearest one. For each point, the error is the distance from the centroid of the cluster to which it is assigned.

The value of K is arbitrary, but there are several ways to calculate the optimal value, one of which is the Elbow method.

Elbow method

Researchers of unsupervised learning must try to determine the appropriate number of clusters to use in a given problem.

The question is very pertinent because if researchers already had this information, as in supervised clustering, it would be possible to derive the attributions from the labels and categories that are already defined. In this case, instead, the number of clusters is defined a priori and then their memberships are calculated. One of the most immediate methods to accomplish this is the so-called Elbow method.

The Elbow method consists of computing the total sum of the distances between each point and its nearest centroid. Since an increase in K produces smaller groupings and distances, this sum will decrease as K increases and vice versa.

As an extreme example, if we choose a value K equal to the number of data points we have, the sum will be zero, because the centroid will coincide with each point and the total distance is zero.

The goal of this process is to find the point where the increase in K will cause a very small decrease in the sum, while the decrease in K will sharply increase the sum.

This sweet spot is called the elbow point.

Elbow method. The within-cluster sum of squares drastically changes at the point 4, indicating that 4 clusters can effectively group data

In this case, the elbow point falls at the value 4, which should be the number of clusters used to initialize the K-Means algorithm with.
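The Elbow method can be sketched directly on artificial data. The `wcss` helper below is an illustrative choice, not part of any standard API: it runs a basic K-means and returns the within-cluster sum of squares, and it uses a deterministic farthest-first seeding instead of random initialization so the example is reproducible.

```python
import numpy as np

def wcss(X, k, n_iter=50):
    """Within-cluster sum of squares after a basic K-means run with k clusters."""
    # Farthest-first seeding: start from the first point, then repeatedly add
    # the point farthest from all centroids chosen so far (deterministic).
    centroids = [X[0]]
    for _ in range(k - 1):
        nearest = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[nearest.argmax()])
    centroids = np.array(centroids)
    # Standard iterations: assign points, then move centroids to the mean.
    for _ in range(n_iter):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                              else centroids[i] for i in range(k)])
    # Sum, over all points, of the squared distance to the nearest centroid.
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

# Four well-separated blobs of 25 points each: the curve should bend at K = 4.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(center, 0.2, (25, 2))
               for center in [(0, 0), (0, 5), (5, 0), (5, 5)]])
curve = {k: wcss(X, k) for k in range(1, 7)}
```

Printing `curve` shows the sum falling steeply up to K = 4 and barely moving afterwards, the same elbow shape discussed above.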

Advantages and disadvantages of using K-Means

Some advantages:

  • It is possible to vary the initial position of the centroids to try to reduce the dependence on the initial conditions
  • Efficient in managing large amounts of data
  • The algorithm often ends at a local optimum
...

As for the disadvantages:

  • K-Means works only on numerical values as it minimizes a cost function by calculating the average of clusters
  • K-Means is difficult to use when trying to find clusters of non-convex shape
  • It’s necessary to set K a priori
  • Situations can occur in which one or more clusters are not associated with any data points (K is too big)

Applications

K-means clustering can be used for numerous applications. Here are some examples.

  • Customer segmentation: clusters identify customers based on their behavior, allowing the company to deliver specific products to specific users
  • Pattern recognition: K-Means can be used to distinguish between signal and noise. This applies to basically all scientific fields
  • Identification of outliers: Outliers are data points that present large differences with all the other elements of a data set. Their identification can be interesting for two purposes: the elimination of these anomalous values, which could be caused by errors, or the isolation of these particular cases that may have a certain importance for the business.

K-means is usually implemented in Python, R, Octave, or Matlab.

Categories

Related Topics

Edits on 13 Mar 2019
Daniel Frumkin: "Initial topic creation"
Daniel Frumkin created this topic on 13 Mar 2019 11:10 am
Edits made to:
Topic thumbnail

K-means

K-means is used to find clusters in a data space, where clusters group data points that share similar features with one another.

No more activity to show.