**Clustering, Classification and Dimensionality Reduction**

**Supervised versus unsupervised learning**

Learning by examples:

An enemy is: … (example images)

A friend is: … (example images)

Learning by rules:

A friend
- looks exactly like us
- is smaller than us, or - if larger - has hooves
- is a big furry animal eating grass on a meadow
- is an animal with feathers AND either paddles (webbed feet) or a very long neck/beak (ideally both)

An enemy
- has sharp teeth
- is larger than 20 cm and has paws
- barks, howls or growls
- is a bird of our size or larger with a compact body (short neck) and claws
- ...

(size in cm, sharp teeth (1=yes, 0=no), #legs, feathers (1=yes, 0=no), claws (1=yes, 0=no), furry (1=yes, 0=no), hooves (1=yes, 0=no), length of the neck in cm, ...)

Supervised learning is the machine learning task of inferring a function from labeled training data in order to make predictions for the future.

Each example consists of an input vector (feature vector) and a desired output (label)

| Feature vector | Label (enemy, friend) |
|---|---|
| (100, 1, 4, 0, 1, 1, 0, 8) | enemy |
| (150, 1, 4, 0, 1, 1, 0, 10) | enemy |
| (60, 1, 4, 0, 1, 1, 0, 7) | enemy |
| (30, 0, 2, 1, 1, 0, 0, 3) | enemy |
| (140, 0, 4, 0, 0, 1, 1, 5) | friend |
| (8, 0, 4, 0, 1, 1, 0, 2) | friend |
| (6, 0, 2, 1, 1, 0, 0, 1) | friend |

Find a function that maps between the input and the desired output, allowing to classify (determine the correct class label) new/unseen instances:

- (100, 0, 2, 1, 0, 0, 60) -> enemy or friend?
- (10, 0, 4, 0, 0, 0, 0) -> enemy or friend?
- (200, 0, 4, 0, 1, 1, 100) -> enemy or friend?
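A minimal sketch of how such a function could be learned from the labeled table above, using a 1-nearest-neighbour rule (my choice of classifier for illustration; the slides do not prescribe one, and the query point below is made up):

```python
# Toy supervised learning on the friend/enemy data: classify a new animal
# by the label of its closest training example (Euclidean distance).
import math

# (size, sharp teeth, #legs, feathers, claws, furry, hooves, neck length)
training = [
    ((100, 1, 4, 0, 1, 1, 0, 8),  "enemy"),
    ((150, 1, 4, 0, 1, 1, 0, 10), "enemy"),
    ((60,  1, 4, 0, 1, 1, 0, 7),  "enemy"),
    ((30,  0, 2, 1, 1, 0, 0, 3),  "enemy"),
    ((140, 0, 4, 0, 0, 1, 1, 5),  "friend"),
    ((8,   0, 4, 0, 1, 1, 0, 2),  "friend"),
    ((6,   0, 2, 1, 1, 0, 0, 1),  "friend"),
]

def classify(x):
    """Return the label of the nearest training point."""
    return min(training, key=lambda fx: math.dist(x, fx[0]))[1]

# Hypothetical unseen animal: small, no teeth, four legs, short neck
print(classify((10, 0, 4, 0, 0, 0, 0, 0)))  # -> friend
```

Note that raw Euclidean distance lets the large size/neck-length values dominate the 0/1 features; in practice one would rescale the features first.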

Unsupervised learning is the machine learning task of inferring a function to describe clusters/groups of objects from unlabeled data, grouped together according to their natural characteristics.

- clustering occurs without a priori knowledge
- the goal is to detect patterns or (hidden) structures in the (high-dimensional) data

Where do those rules come from? Reduction of dimensions!

(50, 0, 2, 1, 0, 0, 0, 40) -> friend

This task is much vaguer than supervised classification. In higher dimensions the categorization is neither unique nor canonical, and there exist a multitude of different approaches and concepts.

**Clustering algorithms**

Why do we need clustering? In several situations we don't know the labeling and wish to learn rules from the data (data-driven modeling).

Clustering approaches:

- Partitional (or centroid-based) clustering: k-means, model-based clustering, spectral clustering, density-based clustering
- Hierarchical clustering: agglomerative algorithms, divisive algorithms

**k-means - or "how to optimize the grouping of objects by minimizing distances"**

The k-means algorithm partitions the data for a given k into k mutually exclusive clusters.

How k-means works: iteratively minimize the variance (inner-cluster distance) in each cluster. Formal definition: minimization of the total cluster variance.

Example data: Fisher's iris data consists of measurements on the sepal length, sepal width, petal length, and petal width for 150 iris specimens. There are 50 specimens from each of three species.

Algorithm:

Input: K and a set of data points (x1, ..., xn)

Initialization: place K centroid seeds at random locations.

Repeat until convergence:
- Assign each data point xi to the nearest centroid cj (depends on the chosen distance measure).
- Calculate for each cluster j = 1, ..., K the new centroid cj (the mean of all points xi assigned to cluster j in the previous step).

Stop when the solution converges (none of the cluster assignments changes any more).

Tasks/Problems:
- How to compute the centroid positions without knowing the cluster assignment of each point?
- How to assign the data points to a certain cluster without knowing the centroid positions?
- How to choose the correct number of clusters?
- Does the algorithm guarantee convergence to the global optimum?
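The assign/update loop described above can be sketched in pure Python as follows (Euclidean distance, random points as centroid seeds; function and variable names are my own):

```python
# Minimal k-means: alternate between assigning points to their nearest
# centroid and recomputing each centroid as the mean of its points.
import math
import random

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)               # Initialization: K random seeds
    while True:
        # Assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[j].append(p)
        # Update step: new centroid = mean of the points assigned to it
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:              # Stop: assignments are stable
            return centroids, clusters
        centroids = new_centroids

# Two well-separated groups of points (made-up example data)
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, clusters = kmeans(points, k=2)
```

For well-separated data like this the loop converges to the intuitive 3/3 split; in general k-means only reaches a local optimum, which is why it is usually restarted from several random initializations.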

Sir Ronald Aylmer Fisher (17 February 1890 – 29 July 1962), an English statistician and biologist who used mathematics to combine Mendelian genetics and natural selection.

**Principal Component Analysis (PCA)**

**Example:** Pollen et al.: Nature Biotechnology 32, 1053–1058 (2014), doi:10.1038/nbt.2967

PCA is a simple yet popular and useful linear transformation technique that is used in numerous applications (e.g. gene expression data). It transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

Main goals:
- identification and extraction of (hidden) patterns (correlations between variables) in the data -> allows for dimension reduction
- finding the directions of maximum variance in high-dimensional data and projecting the data onto a lower-dimensional subspace while retaining most of the information
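For two-dimensional data the direction of maximum variance can be computed by hand, which makes the idea concrete: centre the data, form the 2x2 covariance matrix, and take its leading eigenvector. A pure-Python sketch (the closed-form eigen-decomposition only works for this 2x2 toy case; real applications use a linear-algebra library):

```python
# Toy PCA in 2D: the first principal component is the unit eigenvector
# of the covariance matrix belonging to its largest eigenvalue.
import math

def first_pc(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    dx = [x - mx for x in xs]                   # centre the data
    dy = [y - my for y in ys]
    # Covariance matrix [[a, b], [b, c]]
    a = sum(d * d for d in dx) / (n - 1)
    b = sum(u * v for u, v in zip(dx, dy)) / (n - 1)
    c = sum(d * d for d in dy) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix (closed form)
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    # Corresponding eigenvector, normalised to unit length
    vx, vy = (b, lam - a) if abs(b) > 1e-12 else ((1.0, 0.0) if a >= c else (0.0, 1.0))
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

# Made-up points scattered along the diagonal y ~ x: the first PC
# should point roughly along (0.7, 0.7).
vx, vy = first_pc([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])
```

Projecting each centred point onto this direction (a dot product) yields its coordinate on the first principal component, i.e. the 1D reduction of the 2D data.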

(Figure: the three iris species - Iris setosa, Iris versicolor, Iris virginica)

Silhouette coefficient: the silhouette value of a point is a measure of how similar a point is to points in its own cluster compared to points in other clusters:

s(i) = (b(i) - a(i)) / max(a(i), b(i))

where
- a(i) is the average distance of the point i to the other points in its own cluster A
- d(i, C) is the average distance of the point i to the points in the cluster C
- b(i) is the minimal d(i, C) over all clusters C other than A

The silhouette coefficient is the average silhouette value over all points: a quantitative measure that can assess the quality of a clustering.
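The definitions above translate directly into code. A pure-Python sketch (clusters are assumed to contain at least two points each; the example clusterings are made up):

```python
# Silhouette coefficient: average of s(i) = (b(i) - a(i)) / max(a(i), b(i))
# over all points of a clustering.
import math

def silhouette_coefficient(clusters):
    """clusters: list of clusters, each a list of points (size >= 2)."""
    values = []
    for ci, cluster in enumerate(clusters):
        for k, p in enumerate(cluster):
            # a(i): average distance to the other points of its own cluster A
            a = sum(math.dist(p, q) for m, q in enumerate(cluster) if m != k) / (len(cluster) - 1)
            # b(i): minimal average distance d(i, C) over all other clusters C
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for cj, other in enumerate(clusters) if cj != ci
            )
            values.append((b - a) / max(a, b))
    return sum(values) / len(values)

good = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]   # tight, well-separated clusters
bad  = [[(0, 0), (10, 1)], [(10, 0), (0, 1)]]   # the same points grouped badly
```

Values near +1 indicate a good clustering, values near 0 overlapping clusters, and negative values misassigned points; computing the coefficient for several values of k is a common way to choose the number of clusters.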

Connectivity-based (hierarchical) clustering

Linkage criteria: single linkage, complete linkage, average linkage

Differential gene expression between the brachial and femoral artery based on microarray data: heat map showing the human orthologous genes of the 63 probe sets with a significant difference in gene expression (FDR < 0.05) as well as a 1.5-fold change between the brachial and the femoral artery.

Algorithm description:

Step 1: Determine the distance between each pair of points: m * (m - 1) / 2 different pairs -> adjacency matrix.

Step 2: Iteratively group points into a binary hierarchical tree (linkage): connect the closest pair of points and re-compute the distance matrix.

Stop: cut the hierarchical tree into clusters.
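The agglomerative procedure above can be sketched in pure Python with single linkage (cluster distance = distance between the closest members); the recursive distance update is replaced here by a naive recomputation, and the "cut" is expressed as a target number of clusters:

```python
# Naive agglomerative clustering with single linkage: start with every
# point in its own cluster and repeatedly merge the two closest clusters.
import math

def single_linkage(points, n_clusters):
    clusters = [[p] for p in points]           # start: each point is a cluster
    while len(clusters) > n_clusters:          # iteratively merge closest pair
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(math.dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))    # merge cluster j into cluster i
    return clusters

# Made-up points: two pairs plus one outlier, cut into three clusters
pts = [(0, 0), (0.5, 0), (5, 5), (5.4, 5.1), (10, 0)]
groups = single_linkage(pts, 3)
```

Recording the merge distances along the way yields the dendrogram; cutting at a distance threshold instead of a cluster count is the other common stopping rule.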

Decision trees: a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.

Support vector machines: supervised learning models that analyze data for classification and regression analysis.

(Heat-map axes: genes x patients)

**Susanne Gerber**

Computational Systems Genetics Group
