K-Means Clustering


Krish Gupta

on 3 January 2014


K-Means Clustering
Data Sets
1. Randomly select K (in our case K = 2) data points from the dataset being clustered as the initial group centroids.

2. Assign each point to the nearest cluster, i.e. the cluster whose centroid is closest by Euclidean distance.

3. After all the objects have been assigned, recalculate the positions of the K centroids.

4. Repeat steps 2 and 3 until the centroids no longer change.
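The four steps above can be sketched in Python with NumPy. This is an illustrative reconstruction, not the authors' actual implementation; the function name `kmeans` and its parameters are assumptions.

```python
import numpy as np

def kmeans(data, k=2, max_iter=100, seed=0):
    """Plain K-Means following the four steps above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick K data points as the initial centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate each centroid as the mean of its assigned points
        # (keeping the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On two well-separated groups of points this sketch recovers one cluster per group regardless of which two points the random initialisation picks.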
Aim and Objectives
Implement a clustering algorithm in Python
Run the algorithm on the data sets
Analyze the generated results
Compare the results with WEKA
Identify the weaknesses of the algorithm
Propose further improvements
K-Means: Weaknesses
K-Means is sensitive to outliers and noise, because they strongly influence the positions of the centroids.
The selection of the initial centroids is significant: it has a great impact on the final clustering.
K-Means can only be used for distance-based clustering; it cannot represent density-based clusters.

We have implemented two data mining algorithms in Python: K-Means for clustering.
Our implemented versions are compared with the results of the standard data mining software WEKA.
Our program still has large room for improvement; for K-Means we will mainly focus on how to distribute the initial centroids.
By:
Hari Gupta

Reason for Choice
Efficiency: with a large number of variables, K-Means may be computationally faster than hierarchical clustering (if K is small).
K-Means requires a dataset in which every instance has the same dimensionality, because Euclidean distance is used.
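The reason for the equal-dimensionality requirement is visible in the distance itself. A minimal sketch (the helper name `euclidean` is ours, not from the slides):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points of equal dimensionality."""
    if len(a) != len(b):
        # K-Means needs every instance to share one dimensionality,
        # otherwise the distance below is undefined.
        raise ValueError("instances must have the same number of attributes")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

euclidean((0, 0), (3, 4))  # -> 5.0
```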

K-Means is a classical clustering algorithm.
K-Means clustering generates a specific number of disjoint, flat (non-hierarchical) clusters. It is well suited to generating globular clusters [1].
The K-Means method is numerical, unsupervised, non-deterministic and iterative.
A cluster is a collection of data objects:
similar to one another within the same cluster
dissimilar to the objects in other clusters
Cluster Analysis
It is hard to decide how many clusters should be produced when initially sampling the data; in other words, the value of K is hard to define.
The output is uncertain because the seed points are selected randomly; if the initial centroids are selected differently, the final result can be very different.
K-Means++
Define the number of clusters, K.
Randomly assign each object to one of the K clusters.
After all the objects are assigned, the number of objects in each cluster should be roughly the same.
Calculate the current center point of each cluster and use it as an initial centroid for K-Means.
Proceed using standard K-Means clustering.
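The initialisation described above (randomly partitioning the objects, then taking each cluster's mean as an initial centroid, sometimes called random-partition seeding) might look like this in Python; the name `random_partition_init` is a hypothetical helper, not from the slides:

```python
import numpy as np

def random_partition_init(data, k=2, seed=0):
    """Random-partition seeding as described above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Randomly assign each object to one of the k clusters;
    # the cluster sizes come out roughly equal.
    labels = rng.integers(0, k, size=len(data))
    while len(np.unique(labels)) < k:  # re-draw if a cluster came out empty
        labels = rng.integers(0, k, size=len(data))
    # The mean of each random cluster becomes an initial centroid.
    return np.array([data[labels == j].mean(axis=0) for j in range(k)])
```

Because each centroid is an average over a random sample of the whole dataset, the initial centroids tend to land near the overall data centre rather than at extreme points.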

Choose one center uniformly at random from among the data points.
For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)².
Repeat Steps 2 and 3 until k centers have been chosen.
Now that the initial centers have been chosen, proceed using standard k-means clustering.
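The k-means++ seeding steps above can be sketched as follows; this is an illustrative NumPy version under the same definitions of D(x), with the helper name `kmeans_pp_init` assumed by us:

```python
import numpy as np

def kmeans_pp_init(data, k, seed=0):
    """k-means++ seeding following steps 1-4 above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: the first center is chosen uniformly at random from the data.
    centers = [data[rng.integers(len(data))]]
    while len(centers) < k:
        # Step 2: D(x)^2 = squared distance from x to its nearest chosen center.
        d2 = np.min([np.sum((data - c) ** 2, axis=1) for c in centers], axis=0)
        # Step 3: draw the next center with probability proportional to D(x)^2.
        centers.append(data[rng.choice(len(data), p=d2 / d2.sum())])
    # Step 4 is the loop above; the caller then runs standard K-Means.
    return np.array(centers)
```

Points already chosen as centers have D(x) = 0, so they can never be drawn twice; far-away points are the most likely next centers, which is what spreads the seeds out.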

Facial Palsy
Attributes: 1410; Instances: 76
Attributes: 66; Instances: 66
WEKA: SimpleKMeans

WEKA: SimpleKMeans

Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters.
[1] J. MacQueen, "Some methods for classification and analysis of multivariate observations," Proc. of the Fifth Berkeley Symp. on Math. Stat. and Prob., vol. 1, pp. 281-296, 1967.
Steps 1-4: figures illustrating the K-Means iterations (images not preserved in this transcript).

Incorrectly Clustered Instances: 22 (28 %)
Incorrectly Clustered Instances: 24 (31.5 %)
Incorrectly Clustered Instances: 22 (28 %)
Incorrectly Clustered Instances: 30 (45.45 %)
[2] J. A. Hartigan and M. A. Wong, "Algorithm AS 136: A k-means clustering algorithm," Journal of the Royal Statistical Society, Series C (Applied Statistics), vol. 28, no. 1, pp. 100-108, 1979.
[4] K. Alsabti, S. Ranka, and V. Singh, "An efficient k-means clustering algorithm," 1997.
An Introduction to Data Mining, Tan, Steinbach, Kumar, Addison-Wesley, 2005. http://www-users.cs.umn.edu/~kumar/dmbook/index.phpz
Data Mining: Concepts and Techniques, 2nd Edition, Jiawei Han and Micheline Kamber, Morgan Kaufmann, 2006. http://www-sal.cs.uiuc.edu/~hanj/bk2
K-means tutorial slides (Andrew Moore). http://www.autonlab.org/tutorials/kmeans11.pdf