**K-Means Clustering**

Data Sets

Poem

Facial Palsy

Algorithm

1. Randomly select K (in our case, K = 2) data points from the dataset as the initial cluster centroids.

2. Assign each point to the nearest cluster, i.e. the cluster whose centroid is closest by Euclidean distance.

3. After all objects have been assigned, recalculate the positions of the K centroids.

4. Repeat steps 2 and 3 until the centroids no longer change.
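The steps above can be sketched in Python. This is a minimal illustration, not the implementation discussed in this poster; the function and variable names are our own, and the toy data is invented:

```python
import random

def kmeans(points, k=2, max_iter=100, seed=0):
    """Minimal K-Means on a list of equal-length feature vectors."""
    rng = random.Random(seed)
    # Step 1: randomly select K data points as the initial centroids.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest centroid (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Step 3: recalculate each centroid as the mean of its cluster.
        new_centroids = [[sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Step 4: stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated groups of 2-D points (toy data, not our real data sets).
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids, clusters = kmeans(data, k=2)
```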

Aim and Objective

To implement a clustering algorithm in Python

Run the algorithm on the data sets

Analyze the generated results

Compare the results with WEKA

Identify the weaknesses of the algorithm

Propose further improvements

Improvement


Findings

K-Means is sensitive to outliers and noise, because they strongly influence the positions of the centroids

The selection of the initial centroids is significant; it has a great impact on the final result

K-Means can only perform distance-based clustering; it cannot represent density-based clusters

Conclusion

We have implemented two data mining algorithms in Python, including K-Means for clustering.

Our implementations are compared with the results of the standard data mining software WEKA.

To perform better in the future, our program still has large room for improvement: for K-Means, we will mainly focus on how to distribute the initial centroids.

**By:**

Hari Gupta

Charles


Reasons for Choice

Efficiency: with a large number of variables, K-Means may be computationally faster than hierarchical clustering (if K is small).

K-Means requires a dataset in which every instance has the same dimension, because Euclidean distance is used.
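Because Euclidean distance is only defined for vectors of equal length, every instance must carry the same number of attributes. A minimal sketch (the function name and sample vectors are illustrative):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two instances of equal dimension."""
    # The metric is undefined for vectors of different length.
    if len(a) != len(b):
        raise ValueError("instances must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean([1.0, 2.0], [4.0, 6.0]))  # 3-4-5 triangle: prints 5.0
```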

K-Means

K-Means is a classical clustering algorithm

K-Means clustering generates a specific number of disjoint, flat (non-hierarchical) clusters. It is well suited to generating globular clusters [1].

The K-Means method is numerical, unsupervised, non-deterministic and iterative

A cluster is a collection of data objects that are:

Similar to one another within the same cluster

Dissimilar to the objects in other clusters

Cluster Analysis

It is hard to decide how many clusters should be produced when initially getting the sample; in other words, the value of K is hard to define.

The output is uncertain because the seed points are selected randomly: different initial centroids can lead to very different final results.

Disadvantages

K-Means++

1. Define the number of clusters, K.

2. Randomly assign each object to one of the K clusters.

3. After all objects are assigned, each cluster should contain roughly the same number of objects.

4. Calculate the current center point of each cluster and use it as an initial centroid.

5. Proceed with standard K-Means clustering.
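This random-partition initialization can be sketched as follows (a minimal illustration with our own naming; the centroids it returns would then seed the standard K-Means loop):

```python
import random

def random_partition_init(points, k, seed=0):
    """Initial centroids via random partition: assign every object to a
    random cluster, then take each cluster's mean as an initial centroid."""
    rng = random.Random(seed)
    clusters = [[] for _ in range(k)]
    # Randomly assign each object to one of the K clusters; for large
    # data sets the cluster sizes come out roughly equal.
    for p in points:
        clusters[rng.randrange(k)].append(p)
    # The current center (mean) of each non-empty cluster becomes an
    # initial centroid for standard K-Means.
    return [[sum(dim) / len(c) for dim in zip(*c)] for c in clusters if c]

# Toy data: 100 one-dimensional points.
seeds = random_partition_init([[float(i)] for i in range(100)], k=3)
```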

1. Choose one center uniformly at random from among the data points.

2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.

3. Choose one new data point at random as a new center, using a weighted probability distribution in which a point x is chosen with probability proportional to D(x)².

4. Repeat steps 2 and 3 until K centers have been chosen.

5. Now that the initial centers have been chosen, proceed with standard K-Means clustering.
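The seeding steps above can be sketched as follows (a minimal illustration; `points` is a list of equal-length vectors, and the names and toy data are our own):

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """K-Means++ seeding: spread the initial centers out by weighting each
    candidate with its squared distance to the nearest chosen center."""
    rng = random.Random(seed)
    # Step 1: choose the first center uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Step 2: D(x)^2 for every point x.
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points]
        # Step 3: pick the next center with probability proportional to D(x)^2
        # (already-chosen centers have D(x)^2 = 0, so they cannot repeat).
        idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centers.append(points[idx])
    # Step 5: these centers seed standard K-Means.
    return centers

data = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers = kmeans_pp_init(data, k=2)
```

Weighting by D(x)² is what pushes the chosen centers apart, which is exactly the property that makes the final clustering less dependent on the random seed.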

Facial Palsy — Attributes: 1410, Instances: 76

Poem — Attributes: 66, Instances: 66


Cluster

Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters

[1] J. MacQueen, "Some methods for classification and analysis of multivariate observations," Proc. of the Fifth Berkeley Symp. on Math. Stat. and Prob., vol. 1, pp. 281-296, 1967.


| Data set | Algorithm | Incorrectly clustered instances |
| --- | --- | --- |
| Facial Palsy | K-Means (ours) | 22 (28%) |
| Facial Palsy | WEKA Simple K-Means | 24 (31.5%) |
| Poem | K-Means (ours) | 22 (28%) |
| Poem | WEKA Simple K-Means | 30 (45.45%) |

References

[2] J. A. Hartigan and M. A. Wong, "Algorithm AS 136: A k-means clustering algorithm," Journal of the Royal Statistical Society, Series C (Applied Statistics), vol. 28, no. 1, pp. 100-108, 1979.

[4] K. Alsabti, S. Ranka, and V. Singh, "An efficient k-means clustering algorithm," 1997.

P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison-Wesley, 2005. http://www-users.cs.umn.edu/~kumar/dmbook/index.php

J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd Edition, Morgan Kaufmann, 2006. http://www-sal.cs.uiuc.edu/~hanj/bk2

A. Moore, K-means tutorial slides. http://www.autonlab.org/tutorials/kmeans11.pdf