**Clustering Methods, Classification and Dimensionality Reduction**

**Supervised learning vs. unsupervised learning**

Learning by examples:

An enemy is: ... A friend is: ...

Learning by rules:

A friend
- looks exactly like us
- is smaller than us or, if larger, has hooves
- is a big furry animal eating grass on a meadow
- is an animal with feathers AND either paddles or a very long neck/beak (ideally both)

An enemy
- has sharp teeth
- is larger than 20 cm and has paws
- barks, howls or growls
- is a bird of our size or larger with a compact body (short neck) and claws
- ...

(size in cm, sharp teeth (1=yes, 0=no), #legs, feathers (1=yes, 0=no), claws (1=yes, 0=no), furry (1=yes, 0=no), hooves (1=yes, 0=no), length of the neck in cm, ...)

Supervised learning is the machine learning task of inferring a function from labeled training data in order to make predictions for the future.

Each example consists of an input vector (feature vector) and a desired output (label).

| Feature vector | Label (enemy, friend) |
|---|---|
| (100, 1, 4, 0, 1, 1, 0, 8) | enemy |
| (150, 1, 4, 0, 1, 1, 0, 10) | enemy |
| (60, 1, 4, 0, 1, 1, 0, 7) | enemy |
| (30, 0, 2, 1, 1, 0, 0, 3) | enemy |
| (140, 0, 4, 0, 0, 1, 1, 5) | friend |
| (8, 0, 4, 0, 1, 1, 0, 2) | friend |
| (6, 0, 2, 1, 1, 0, 0, 1) | friend |

Find a function that maps between the input and the desired output, allowing one to classify (determine the correct class label) new/unseen instances:

(100, 0, 2, 1, 0, 0, 60) -> ? enemy or friend ?

(10, 0, 4, 0, 0, 0, 0) -> ? enemy or friend ?

(200, 0, 4, 0, 1, 1, 100) -> ? enemy or friend ?
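A minimal sketch of this classification task (not part of the slides; it assumes Python with scikit-learn, and pairs the feature vectors with the labels in the order listed in the table above):

```python
# Sketch: supervised classification of the friend/enemy examples.
# Feature order: size, sharp teeth, #legs, feathers, claws, furry, hooves, neck length.
from sklearn.tree import DecisionTreeClassifier

X = [
    [100, 1, 4, 0, 1, 1, 0, 8],    # enemy
    [150, 1, 4, 0, 1, 1, 0, 10],   # enemy
    [60,  1, 4, 0, 1, 1, 0, 7],    # enemy
    [30,  0, 2, 1, 1, 0, 0, 3],    # enemy
    [140, 0, 4, 0, 0, 1, 1, 5],    # friend
    [8,   0, 4, 0, 1, 1, 0, 2],    # friend
    [6,   0, 2, 1, 1, 0, 0, 1],    # friend
]
y = ["enemy"] * 4 + ["friend"] * 3

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Hypothetical unseen animal: 120 cm, no sharp teeth, 4 legs, no feathers,
# no claws, furry, hooves, 40 cm neck.
print(clf.predict([[120, 0, 4, 0, 0, 1, 1, 40]]))  # -> predicted class label
```

The learned function (here a decision tree) is then applied to the unseen feature vectors to predict their class labels.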

**Unsupervised learning**

is the machine learning task of inferring a function to describe clusters/groups of objects from unlabeled data, grouped together according to their natural characteristics:

- clustering occurs without a priori knowledge
- the goal is to detect patterns or (hidden) structures in the (high-dimensional) data

Where do those rules come from? Reduction of dimensions!

(50, 0, 2, 1, 0, 0, 0, 40) -> friend

This task is much vaguer than supervised learning: in higher dimensions the categorization is neither unique nor canonical, and there exists a multitude of different approaches and concepts.

**Clustering algorithms**

Clustering approaches:

- Partitional (or centroid-based) clustering, e.g. k-means
- Hierarchical clustering: agglomerative algorithms and divisive algorithms
- Model-based clustering
- Spectral clustering
- Density-based clustering
- and others

**or "how to optimize the grouping of objects by minimizing distances"**

The k-means algorithm partitions the data for a given k into k mutually exclusive clusters.

Goal: iteratively minimize the variance (intra-cluster distance) within each cluster.

How k-means works:

- Input: K and a set of data points (x1, ..., xn)
- Initialization: place K centroid seeds at random locations
- Repeat until convergence:
  - Assign each data point xi to the nearest centroid cj (depends on the chosen distance measure)
  - Calculate for each cluster j = 1, ..., K the new centroid cj (the mean of all points xi assigned to cluster j in the previous step)
- Stop when the solution converges (none of the cluster assignments changes any more)

Fisher's iris data consists of measurements of the sepal length, sepal width, petal length, and petal width for 150 iris specimens; there are 50 specimens from each of three species.

Tasks/Problems:

- How to compute the centroid positions without knowing the cluster assignment of each point?
- How to assign the data points to a certain cluster without knowing the centroid positions?
- How to choose the correct number of clusters?
- Does the algorithm guarantee convergence to the global optimum?
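A minimal sketch (assuming Python with scikit-learn, which is not part of the slides) of running k-means with k = 3 on Fisher's iris measurements:

```python
# k-means on Fisher's iris data (sketch; assumes scikit-learn is installed)
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data                       # 150 specimens x 4 measurements
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)             # final centroid positions
print(kmeans.inertia_)                     # sum of squared intra-cluster distances
print(kmeans.labels_[:10])                 # cluster assignment of the first points
```

Because a single run only converges to a local optimum, the algorithm is typically restarted several times (here n_init=10) from different random centroid seeds and the best solution is kept.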

Sir Ronald Aylmer Fisher (17 February 1890 – 29 July 1962) was an English statistician and biologist who used mathematics to combine Mendelian genetics and natural selection.

**Principal Component Analysis (PCA)**

Example PCA plot: clustering of cell types (Pollen et al., Nature Biotechnology 32, 1053–1058 (2014), doi:10.1038/nbt.2967).

**PCA transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.**

**Main goals:**

- identification and extraction of (hidden) patterns (correlations between variables) in the data -> allows for dimension reduction
- finding the directions of maximum variance in high-dimensional data and projecting it onto a lower-dimensional subspace while retaining most of the information

(Figure: the three species in Fisher's iris data, Iris setosa, Iris versicolor and Iris virginica.)

Silhouette coefficient

The silhouette value of a point i is a measure of how similar the point is to the points in its own cluster compared to the points in other clusters:

s(i) = (b(i) - a(i)) / max(a(i), b(i))

where

- a(i) is the average distance of the point i to the other points in its own cluster A,
- d(i, C) is the average distance of the point i to the points in the cluster C,
- b(i) is the minimal d(i, C) over all clusters C other than A.

The silhouette coefficient is the average silhouette value over all points: a quantitative measure that can assess the quality of a clustering.
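A minimal sketch (again assuming scikit-learn) of using the silhouette coefficient to compare different choices of k on the iris data:

```python
# Silhouette coefficient for several values of k (sketch; assumes scikit-learn)
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # average silhouette value over all points; larger = better separated clusters
    print(k, silhouette_score(X, labels))
```

The k with the largest average silhouette value is a reasonable candidate for the "correct" number of clusters.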

Connectivity-based (hierarchical) clustering

e.g. single linkage, complete linkage, average linkage

Heat-map example: differential gene expression between the brachial and femoral artery based on microarray data, showing the human orthologous genes of the 63 probe sets with a significant difference in gene expression (FDR < 0.05) as well as a 1.5-fold change between the brachial and the femoral artery.

Algorithm description:

Step 1: Determine the distance between each pair of points (m * (m - 1) / 2 different pairs) -> adjacency (distance) matrix.

Iteratively group points into a binary hierarchical tree (linkage): connect the closest pair of points (e.g. by single, complete or average linkage) and re-compute the distance matrix.

Stop: cut the hierarchical tree into clusters.
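A minimal sketch of these steps (assuming SciPy and scikit-learn): compute all pairwise distances, build the binary tree with a chosen linkage criterion, and cut it into a fixed number of clusters:

```python
# Hierarchical (agglomerative) clustering sketch; assumes SciPy and scikit-learn
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.datasets import load_iris

X = load_iris().data
D = pdist(X)                              # all m*(m-1)/2 pairwise distances

for method in ("single", "complete", "average"):
    Z = linkage(D, method=method)         # binary hierarchical tree
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
    print(method, labels[:10])
```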

Further representatives: k-median algorithm, k-means++, Gaussian mixture model (GMM).

Idea:

i) Given a set of points in some space, the goal is to group together those points that are closely packed together (points with many nearby neighbors).

ii) Points whose nearest neighbors are "too far away" (i.e. lie alone in low-density regions) will be marked as outlier points.

e.g. DBSCAN: density-based spatial clustering of applications with noise

Ester, Martin; Kriegel, Hans-Peter; Sander, Jörg; Xu, Xiaowei (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press. pp. 226–231. CiteSeerX 10.1.1.121.9220. ISBN 1-57735-004-9.

Definitions:

Points are classified as either core points, (density-)reachable points, or outliers.

- A point p is called a core point if at least minPts points are within distance epsilon of it (epsilon being the maximum radius of the neighborhood).
- A point q is reachable from p if there is a path p1, ..., pn with p1 = p and pn = q, where each p_{i+1} is directly reachable from p_i (all the points on the path must be core points, with the possible exception of q).
- All points not reachable within distance epsilon from any other point are outliers.
- Each core point forms a cluster together with all points (core or non-core) that are reachable from it.
- Each cluster contains at least one core point; non-core points can be part of a cluster, but they form its "edge", since they cannot be used to reach more points.

Algorithm:

Initialization:

i) Choose the two parameters: the distance (eps) and the minimum number of points required to form a dense region (minPts).

ii) Choose a random data point p.

Repeat until all points are labeled:

iii) Calculate its eps-neighborhood N (the set of all points that are within the defined distance). If the size of N < minPts, mark p as noise; else open a new cluster -> expandCluster via depth-first search and label points as core points, border points or noise until all points reachable within eps are labeled.

iv) Assign each core point and each border point to the cluster.

v) Choose randomly a new (unlabeled) data point and continue with iii).

Stop: when all points are labeled.
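A minimal sketch (assuming scikit-learn; eps and min_samples correspond to the eps and minPts above and are illustrative, untuned values):

```python
# DBSCAN sketch; assumes scikit-learn
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

X = load_iris().data
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

print(db.labels_[:20])                 # cluster index per point; -1 marks noise/outliers
print(db.core_sample_indices_[:10])    # indices of the core points
```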

General task: cluster those cells that have a similar expression profile.

Question: How can a cellular transcription profile of 10,000 genes per cell be compressed and projected into a single dot in a 2-dimensional plot?

Answer: PCA extracts and captures the essential information hidden in the high-dimensional data space.

Each dot in this graph represents a single cell and its transcription profile (single-cell RNA-sequencing data, 10,000 transcribed genes in each cell).

**A simple yet popular and useful linear transformation technique used in numerous applications (e.g. gene expression data).**

**Extreme example: y = x**

Single-cell RNA-seq data set (basic overview)

1 cell (1D):

| Genes | Reads |
|---|---|
| A | 10 |
| B | 5 |
| C | 18 |
| D | 54 |
| E | 25 |
| F | 34 |

2 cells (2D), correlation?

| Genes | Cell 1 | Cell 2 |
|---|---|---|
| A | 10 | 15 |
| B | 5 | 8 |
| C | 18 | 23 |
| D | 54 | 48 |
| E | 25 | 20 |
| F | 34 | 30 |

3 cells (3D):

| Genes | Cell 1 | Cell 2 | Cell 3 |
|---|---|---|---|
| A | 10 | 15 | 12 |
| B | 5 | 8 | 10 |
| C | 18 | 23 | 17 |
| D | 54 | 48 | 47 |
| E | 25 | 20 | 18 |
| F | 34 | 30 | 28 |

200 cells?

Question: are all dimensions (equally) important?

Demonstration of the performance of the aforementioned algorithms on the example of Fisher's iris dataset.

Gaussian mixture model (GMM): a probabilistically grounded way of doing soft clustering. Each "cluster" is modeled here by a Gaussian probability distribution.

Task: for a given set of data points, extract the parameters of the distributions.

Case one: we already know the source of the data (which component each point came from).

Case two: we don't know the source, but we know the parameters of the Gaussians.

Problem: we have neither the source nor the parameters, ...

Solution: the Expectation-Maximization (EM) algorithm
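A minimal sketch (assuming scikit-learn, whose GaussianMixture fits the parameters with the EM algorithm internally) showing the resulting soft cluster assignments:

```python
# Gaussian mixture model sketch; assumes scikit-learn
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # parameters fitted via EM

print(gmm.means_)                  # fitted means of the two Gaussians
print(gmm.covariances_.shape)      # fitted covariance matrices
print(gmm.predict_proba(X[:5]))    # soft clustering: P(component | point)
```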

Example: movies / a teapot in 3D / music

Teapot: even though the picture is only 2D, it is totally clear what the teapot looks like; the 3rd dimension is not necessary for getting an impression of its shape.

Movies: story and landscape can be captured without significant loss of information compared to a 3D movie. (Source: youtube.com/watch?v=iQoXFmbXRJA by Victor Lavrenko)

Music: uncompressed audio (.wav) is about 10 times larger than compressed .mp3.

PCA takes a dataset with a lot of dimensions (i.e. lots of cells) and flattens it to the few most important dimensions.

From 2D to 1D: consider again the 2-cell table above (genes A-F measured in Cell 1 and Cell 2).

Orthogonal transformation:

Step 1: Capture the largest possible variance (the first principal component, PC1).

Step 2: Capture the second largest variance (the second principal component, PC2), which is orthogonal to PC1.

In case of further dimensions: each succeeding component has in turn the highest possible variance under the constraint that it is orthogonal to the preceding components.

-> This leads to a set of UNCORRELATED orthogonal basis vectors.

1D: since almost all variation is captured by PC1, the data could be flattened to 1D without too much loss of information (rotating the graph into new x- and y-axes).

PC1 is the linear combination of the original data that spans the direction of the most variation in the data. PC2 captures the direction of the 2nd most variation in the data. PC3 spans the direction of the 3rd largest variation, ... and so on.
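A minimal sketch (assuming NumPy and scikit-learn) applying PCA to the 2-cell toy table above; almost all of the variance ends up on PC1, so the six genes can be represented by their PC1 coordinate alone:

```python
# PCA on the 2-cell toy data; assumes NumPy and scikit-learn
import numpy as np
from sklearn.decomposition import PCA

# rows = genes A..F, columns = (Cell 1, Cell 2)
X = np.array([[10, 15], [5, 8], [18, 23], [54, 48], [25, 20], [34, 30]], dtype=float)

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)   # PC1 carries almost all of the variance
print(pca.transform(X)[:, 0])          # the flattened 1D coordinates along PC1
```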

Things that are important to know concerning PCA:

How is PCA computed? By calculating the eigenvectors and eigenvalues of the covariance matrix.

-> The first eigenvector defines a direction in space; the line along this direction is the direction of the greatest variance in the multidimensional dataset = PC1.

-> The second eigenvector gives the direction of the second largest variance -> PC2.

-> These eigenvectors are used as the axes of the new coordinate system (the first eigenvector becomes the first axis in the new coordinate system, the second eigenvector the second axis, ...).
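The same computation written out by hand (a sketch assuming NumPy): center the data, form the covariance matrix, and take its eigenvectors as the new axes:

```python
# PCA via eigen-decomposition of the covariance matrix; assumes NumPy
import numpy as np

X = np.array([[10, 15], [5, 8], [18, 23], [54, 48], [25, 20], [34, 30]], dtype=float)

Xc = X - X.mean(axis=0)                   # center the data
C = np.cov(Xc, rowvar=False)              # covariance matrix of the columns
eigvals, eigvecs = np.linalg.eigh(C)      # eigen-decomposition (symmetric matrix)

order = np.argsort(eigvals)[::-1]         # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals / eigvals.sum())            # fraction of the total variance per PC
print(Xc @ eigvecs[:, 0])                 # projection of the data onto PC1
```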

EM algorithm for the Gaussian mixture model (sketch):

- Start with two randomly placed Gaussians.
- E-step: for each point calculate P(b | xi), i.e. does it look like it came from component b?
- M-step: adjust the two Gaussians to fit the points assigned to them.
- Iterate until convergence.
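A minimal sketch of this loop (assuming NumPy; one-dimensional toy data, two components with fixed unit variance and equal weights, so only the two means are re-estimated):

```python
# EM for a two-component 1D Gaussian mixture (simplified sketch; assumes NumPy)
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])  # toy data

mu = rng.choice(x, size=2)                # start with two randomly placed Gaussians
for _ in range(50):
    # E-step: responsibility P(b | xi) under the current Gaussians (unit variance)
    dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: adjust each Gaussian's mean to fit the points (softly) assigned to it
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)                                 # recovered means, typically close to 0 and 5
```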
