Statistical machine learning

**A quick intro**

by Giuseppe Jurman

(with contributions by Roberto Visintainer)

FBK-Trento

June 2012


**Examples of (animal) learning**

Bait shyness

eat bait → get sick → avoid bait

(learning, NOT "built-in" knowledge)

Pigeon superstition [Skinner, 1948]

Food delivered at regular intervals

At delivery time, each pigeon pecks at some object

Then the pigeon spends more and more time pecking at the same object

This increases the chance that the next food delivery finds the bird in the same location

Finally, this wrongly reinforces the pigeon's association of the food with the object

Bait shyness revisited [Garcia & Koelling, 1966]

eat bait → get an electric shock → keep eating bait

"A priori" knowledge:

Food & sickness can be related

Food & electric shock cannot be related

Incorporation of a priori knowledge is inevitable for the success of the learning algorithm: the "No free lunch" theorem!

**Inference**

1) Guess a law (starting from part of the available data)

2) Compute the consequences (i.e., learn how to make predictions)

3) Verify the agreement with experiments (i.e., compare the obtained predictions against the remaining available data)
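As a toy numerical rendition of these three steps (a sketch; the linear "law" and the data are invented for illustration, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 1.0 + rng.normal(scale=0.5, size=100)  # hidden law plus noise

# 1) guess a law from part of the available data (here: a line fit on half)
slope, intercept = np.polyfit(x[:50], y[:50], deg=1)

# 2) compute the consequences: predictions on the unseen inputs
y_hat = slope * x[50:] + intercept

# 3) verify the agreement with the remaining available data
print("mean absolute error on the held-out half:", np.mean(np.abs(y_hat - y[50:])))
```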


**Why machine learning?**

**Complexity**: problem dimension

**Adaptivity**: problem diversity

"

Select among competing hypotheses that which makes the fewest assumptions

"

Father William of Ockham (d'Okham) - Lex Parsimoniae

**Statistical machine learning**

The art of developing algorithms and techniques that allow computers to learn, i.e. to gain understanding by constructing models of observed data through inference from samples, with the intention of using them for prediction.

**Types of learning**

Supervised (data has target): regression, classification

Unsupervised (data has NO target): density estimation (maximum likelihood), class discovery (clustering)

Reinforcement (data has PARTIAL target): how an agent ought to take actions in an environment so as to maximize some notion of cumulative reward; Markov processes; e.g. Tic-Tac-Toe, (multi-)armed bandits

**(Almost) formally...[binary classification]**

Finding a function f: X → {-1, +1} in a hypothesis space H that predictively approximates W.

Each example in the training set D is independently and identically distributed (i.i.d.) according to an unknown distribution W on a universe X.

To assess the goodness of f, a loss function is used, e.g. the 0-1 loss L(y, f(x)) = (1 - y · f(x))/2, which equals 0 on a correct prediction and 1 on a mistake.
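Read concretely, with labels and predictions in {-1, +1} (a minimal numpy sketch; the toy vectors are invented):

```python
import numpy as np

y   = np.array([ 1, -1,  1,  1, -1])  # true labels
f_x = np.array([ 1,  1,  1, -1, -1])  # outputs of some hypothetical classifier f

# 0-1 loss: (1 - y*f(x))/2 is 0 on a correct prediction and 1 on a mistake
loss = (1 - y * f_x) / 2
print(loss)         # [0. 1. 0. 1. 0.]
print(loss.mean())  # average loss on this sample: 0.4
```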

**Risk & training error**

Expected risk: R(f) = E_W[L(y, f(x))], the average loss over the whole (unknown) distribution W

Empirical risk (training error): R_emp(f) = (1/n) Σ_i L(y_i, f(x_i)), the average loss over the n training examples

Minimizing the training error is prone to overfitting, and results in poor predictive performance.

Solution: add an inductive bias, restricting H.
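A minimal overfitting sketch, assuming scikit-learn (the dataset parameters are invented): a 1-nearest-neighbour classifier memorizes the noisy training set, while a more restricted choice (larger k) usually predicts better:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy data with 20% label noise, so memorizing the training set is harmful.
X, y = make_classification(n_samples=300, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for k in (1, 15):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:2d}  training error={1 - clf.score(X_tr, y_tr):.2f}"
          f"  test error={1 - clf.score(X_te, y_te):.2f}")
# Typically: k=1 gives training error 0.00 but the larger test error.
```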

**The bias-variance dilemma**

The generalization error can be decomposed into 3 parts:

Bias: the loss of the main prediction relative to the optimal prediction - depends on (the capacity of) H

Variance: the average loss of the predictions relative to the main prediction - depends on D

Noise: the remaining loss that cannot be eliminated - unpredictable
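For the squared loss this decomposition can be written explicitly (a standard form, with \hat f the learned predictor, f^* the optimal one, and y = f^*(x) + ε):

```latex
\mathbb{E}_{D,\varepsilon}\!\left[(y - \hat f(x))^2\right]
  = \underbrace{\left(f^*(x) - \mathbb{E}_D[\hat f(x)]\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\left(\hat f(x) - \mathbb{E}_D[\hat f(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

where E_D averages over training sets D and σ² is the variance of the noise ε.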

The capacity h(H) of a hypothesis space H is a measure of its size, or complexity: it is the largest n such that there exists a set of n examples D_n for which one can always find an f in H giving the correct answer on all examples in D_n, for any possible labeling (e.g., for the set of linear functions y = w · x + b in d dimensions, the capacity is d + 1).
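The d + 1 claim can be checked numerically for d = 2: three non-collinear points can be labeled in all 2³ ways and a linear separator found each time (a sketch using a plain perceptron, which is guaranteed to converge on separable data):

```python
import itertools
import numpy as np

# Three non-collinear points in the plane (d = 2, so capacity d + 1 = 3).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

def perceptron_separates(X, y, epochs=1000):
    """Return True if a perceptron finds w, b with sign(w·x + b) == y everywhere."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:        # misclassified (or on the boundary)
                w, b = w + yi * xi, b + yi    # standard perceptron update
                errors += 1
        if errors == 0:
            return True
    return False

# Every one of the 2^3 labelings of the 3 points is linearly separable.
for labels in itertools.product([-1, 1], repeat=3):
    assert perceptron_separates(X, np.array(labels))
print("All 8 labelings shattered: capacity of linear functions in 2D is >= 3")
```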

PREDICTIVITY: a compromise between the complexity of the solution and the error on the training set.

Regularization

In the setting n << d, beware the curse of dimensionality: the volume of the space is so large that the samples become sparse, jeopardizing the statistical significance of the methods.
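A small numerical illustration of this sparsity (a sketch; the sample size and dimensions are invented): as d grows, the nearest neighbour gets almost as far away as the farthest one, so "closeness" loses meaning.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100  # fixed sample size
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(n, d))                  # n points in the unit cube
    dists = np.linalg.norm(X[0] - X[1:], axis=1)  # distances from the first point
    print(f"d={d:4d}  nearest/farthest distance ratio = {dists.min() / dists.max():.2f}")
# The ratio climbs toward 1: in high dimension all points look equally far apart.
```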

No learning algorithm can work for all distributions.

**Examples of classifiers**

k-nearest neighbour

Support Vector Machine

Decision Tree

Artificial Neural Network

Linear Discriminant Analysis

Gaussian Mixture

**The IRIS flower dataset**

[Fisher, 1936]

50 samples from each of three species of Iris

4 features: length and width of the sepals and petals

Iris setosa

Iris virginica

Iris versicolor

[Figure: scatter plots of the Iris data, with the three species marked X, O and +, and the decision regions found by linear discriminant analysis, a decision tree, a Gaussian mixture model and a Support Vector Machine]
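A sketch of this comparison with scikit-learn (the Gaussian mixture is omitted, since sklearn's GaussianMixture is an unsupervised estimator; the rest follows the library's standard API):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 species

classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "SVM (linear)": SVC(kernel="linear"),
}
for name, clf in classifiers.items():
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name:14s} 5-fold CV accuracy: {acc:.2f}")
```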

**Two key topics...**

... at least to mention

Model selection - tuning parameters and choosing the best model: K-fold cross validation (sketched below), bagging/boosting

Feature importance (not the same for all variables): feature selection, feature ranking
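As a minimal model-selection sketch, assuming scikit-learn: K-fold cross validation used to pick the SVM regularization parameter C on Iris (the candidate grid is invented for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 5-fold cross validation over a (hypothetical) grid of C values.
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, y)
print("best C:", search.best_params_["C"])
print("cross-validated accuracy: %.2f" % search.best_score_)
```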