**Supervised Learning**

inferring a function from labeled training data

**Recommender Systems**

**Reinforcement Learning**

**Machine Learning**

**What is Machine Learning?**

1959, Arthur Samuel: "Field of study that gives computers the ability to learn without being explicitly programmed."

Tom M. Mitchell: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

**Unsupervised Learning**

Logistic Regression

Neural Networks

Support Vector Machines

Linear Regression

Decision Trees

```r
> library(ggplot2)   # ggplot() below comes from ggplot2
> lmres <- lm(dist ~ speed, data = cars)
> pred <- predict(lmres, se.fit = TRUE)
> ggplot(data = cars, aes(x = speed)) + geom_point(aes(y = dist)) +
+   geom_line(aes(y = pred$fit), color = "red")
```

```
> summary(lmres)

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,  Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
```

```r
lmres <- lm(dist ~ poly(speed, 11), data = cars)
pred <- predict(lmres, se.fit = TRUE)
ggplot(data = cars, aes(x = speed)) + geom_point(aes(y = dist)) +
  geom_line(aes(y = pred$fit), color = "red")
summary(lmres)
```

```
Call:
lm(formula = dist ~ poly(speed, 11), data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.228  -9.373  -1.230   6.411  39.414 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        42.9800     2.1937  19.593  < 2e-16 ***
poly(speed, 11)1  145.5523    15.5116   9.383 1.94e-11 ***
poly(speed, 11)2   22.9958    15.5116   1.482    0.146    
poly(speed, 11)3   13.7969    15.5116   0.889    0.379    
poly(speed, 11)4   18.3452    15.5116   1.183    0.244    
poly(speed, 11)5    5.8811    15.5116   0.379    0.707    
poly(speed, 11)6  -11.6775    15.5116  -0.753    0.456    
poly(speed, 11)7  -13.5028    15.5116  -0.870    0.389    
poly(speed, 11)8  -16.9023    15.5116  -1.090    0.283    
poly(speed, 11)9  -17.7870    15.5116  -1.147    0.259    
poly(speed, 11)10  -0.4219    15.5116  -0.027    0.978    
poly(speed, 11)11  14.1118    15.5116   0.910    0.369    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.51 on 38 degrees of freedom
Multiple R-squared:  0.719,  Adjusted R-squared:  0.6377 
F-statistic: 8.84 on 11 and 38 DF,  p-value: 1.644e-07
```
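Note that the 11th-degree fit raises R-squared (0.719 vs. 0.6511) but lowers adjusted R-squared (0.6377 vs. 0.6438): the extra terms mostly fit noise. A minimal sketch of a nested-model F-test that makes this explicit:

```r
# Nested-model comparison: do the 10 extra polynomial terms
# significantly reduce the residual sum of squares?
fit1  <- lm(dist ~ speed, data = cars)
fit11 <- lm(dist ~ poly(speed, 11), data = cars)
anova(fit1, fit11)  # a large p-value => prefer the simpler model
```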

Clustering

K-means clustering

Dimensionality Reduction - PCA

Motivation:

- Data compression: less disk space, faster learning (e.g. for supervised learning)
- Visualization

Example:

- Speed up supervised learning by reducing the number of input features
- Let's visualize 6-dimensional data

Algorithm with k=2

Algorithm with k=3

Motivation: group unlabeled data into k clusters

Example: group T-shirts into Small-Medium-Large groups

How this works: eigenvalue decomposition (of the data's covariance matrix)

How many components: e.g. choose the smallest number with ErrorVariance / TotalVariance < 1%
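A minimal PCA sketch in R with the built-in prcomp() (SVD-based, equivalent to the eigendecomposition view above); the 1% retained-error rule translates into keeping enough components to explain 99% of the variance. The built-in mtcars data stands in for any numeric data set:

```r
# PCA on scaled inputs
pca <- prcomp(mtcars, scale. = TRUE)

# Proportion of total variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Smallest k with ErrorVariance / TotalVariance < 1%
k <- which(cumsum(var_explained) >= 0.99)[1]
k

# Project the data onto the first k components (compressed representation)
compressed <- pca$x[, 1:k]
```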

T-shirts example: clustered into S / M / L

How to use it: run it multiple times from random initializations, and keep the run with the best objective value

Objective function: minimize the average distance of points from their cluster centers

What k to use: experiment with the objective function, use domain knowledge
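A minimal k-means sketch in R; nstart runs the algorithm multiple times from random starts and keeps the run with the lowest objective value, as recommended above. The two-column T-shirt data frame is a hypothetical stand-in:

```r
# Hypothetical T-shirt measurements: height and chest circumference
set.seed(42)
tshirts <- data.frame(height = rnorm(150, 170, 10),
                      chest  = rnorm(150, 95, 8))

# k = 3 clusters (S / M / L); 20 random restarts, best objective kept
km <- kmeans(scale(tshirts), centers = 3, nstart = 20)

km$tot.withinss   # objective: total within-cluster sum of squares
table(km$cluster) # cluster sizes
```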

Motivation: we want to generalize and predict values based on known input-output pairs

Example: let's predict how much distance a car needs to stop, given its speed

Stopping criteria:

- the error is normally distributed
- sufficient variance is explained by the model

How does it work:

- gradient descent
- analytical solution (the normal equation)

dist = a * speed + b + Error

a = ?, b = ?, so that sum(sqr(Error)) is minimal
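A minimal sketch of both routes to a and b on the cars data: the closed-form normal equation and gradient descent (the learning rate and iteration count are arbitrary choices for illustration):

```r
x <- cars$speed; y <- cars$dist

# Analytical solution (normal equation): minimizes sum(Error^2) exactly
X <- cbind(1, x)                      # design matrix, columns for b and a
coef_exact <- solve(t(X) %*% X, t(X) %*% y)
coef_exact                            # matches coef(lm(dist ~ speed, data = cars))

# Gradient descent on the same cost, J(a, b) = mean(Error^2)
a <- 0; b <- 0; alpha <- 0.001
for (i in 1:200000) {
  err <- a * x + b - y
  a <- a - alpha * mean(err * x)      # dJ/da (constant factor folded into alpha)
  b <- b - alpha * mean(err)          # dJ/db
}
c(b, a)                               # converges to the same coefficients
```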

Let's predict stop distance from speed for unknown values:

Classification

Objective: classify data using one or more features

Example: will a consumer buy a new product?


Optimization: gradient descent, or more advanced methods

Overfitting: penalized using parameter regularization

Helping the model: scaling input features
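A minimal logistic regression sketch in R with glm(); the consumer data frame (age, income, bought) is hypothetical, generated just for illustration:

```r
# Hypothetical purchase data: predict bought (0/1) from age and income
set.seed(1)
consumers <- data.frame(age    = runif(200, 18, 70),
                        income = runif(200, 20, 120))
consumers$bought <- rbinom(200, 1,
  plogis(-6 + 0.04 * consumers$age + 0.05 * consumers$income))

# glm() with family = binomial fits logistic regression by maximum likelihood
fit <- glm(bought ~ age + income, data = consumers, family = binomial)

# Predicted probability that a new consumer buys the product
predict(fit, newdata = data.frame(age = 35, income = 80), type = "response")
```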

Objective: classify data using one or more features

Optimization: error backpropagation

Output: discrete vectors, e.g. [1 0 0], [0 1 0], [0 0 1]

Example: many input features, e.g. image recognition

Examples: a single unit computing sigm(w0*1 + w1*x1 + w2*x2) can implement logic gates:

AND: w0 = -30, w1 = 20, w2 = 20

OR: w0 = -10, w1 = 20, w2 = 20
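A minimal sketch verifying these weights; gate() is a hypothetical helper that evaluates one sigmoid unit on all four Boolean input pairs:

```r
sigm <- function(z) 1 / (1 + exp(-z))

# Evaluate a single neuron on every combination of x1, x2 in {0, 1}
gate <- function(w0, w1, w2) {
  inputs <- expand.grid(x1 = 0:1, x2 = 0:1)
  cbind(inputs, out = round(sigm(w0 + w1 * inputs$x1 + w2 * inputs$x2), 3))
}

gate(-30, 20, 20)  # AND: output ~1 only when x1 = x2 = 1
gate(-10, 20, 20)  # OR:  output ~1 whenever x1 = 1 or x2 = 1
```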

Regularization: against overfitting

Motivation: too many inputs, even worse with higher-order features

Objective: classify samples using one or more features

Motivation: create a more robust classifier by using a modified cost function

Example: identify benign and malignant tumors based on medical results

Logistic Regression vs. SVM:

- many features, few training examples: Logistic Regression or SVM without a kernel
- few features, intermediate number of training examples: SVM with Gaussian kernel
- few features, many training examples: manually create/add more features, then Logistic Regression or SVM without a kernel (a Gaussian kernel is too slow at that scale)

Neural Networks: good for any of the cases above, but slower to train

Kernel trick: results in a nonlinear classifier

Maximum-margin hyperplane
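A minimal SVM sketch using the e1071 package (an assumption: any SVM library with a Gaussian/RBF kernel would do), on the built-in iris data:

```r
library(e1071)   # install.packages("e1071") if missing

# Gaussian (radial) kernel SVM; cost controls regularization strength,
# gamma the kernel width
fit <- svm(Species ~ ., data = iris,
           kernel = "radial", cost = 1, gamma = 0.25)

# Confusion matrix on the training data
table(predicted = predict(fit, iris), actual = iris$Species)
```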

Overfitting: parameter regularization


Helping the model: input feature scaling for gradient descent

Motivation: solving a classification problem in an understandable way

Example: survival of passengers on the Titanic

Learning: for example Iterative Dichotomiser 3 (ID3), entropy based

Advantage: easy to understand the result; trees can be combined into random forests

Disadvantage: slow on continuous data
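A minimal decision tree sketch with the rpart package (which grows CART-style trees rather than ID3, but the idea is the same); R's built-in Titanic contingency table is expanded to one row per passenger:

```r
library(rpart)   # install.packages("rpart") if missing

# Expand the 4-way Titanic table into individual passenger rows
titanic <- as.data.frame(Titanic)
titanic <- titanic[rep(seq_len(nrow(titanic)), titanic$Freq), 1:4]

# Grow a tree predicting survival from class, sex and age group
tree <- rpart(Survived ~ Class + Sex + Age, data = titanic, method = "class")
print(tree)   # the printed split rules are directly human-readable
```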

How to use available data:

- training set
- validation set (or cross-validation)
- test set

Prevent overfitting

Understand the domain
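A minimal sketch of a random 60/20/20 split (the proportions are a common convention, not a rule), again on the built-in iris data:

```r
set.seed(7)
n   <- nrow(iris)
idx <- sample(n)   # shuffle row indices

train <- iris[idx[1:round(0.6 * n)], ]
valid <- iris[idx[(round(0.6 * n) + 1):round(0.8 * n)], ]
test  <- iris[idx[(round(0.8 * n) + 1):n], ]

# Fit on train, tune hyperparameters on valid, report final error on test only
sapply(list(train = train, valid = valid, test = test), nrow)
```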

Evolutionary Computing - an alternative training method

Training a Fiver-playing Neural Network

Representation:

Algorithm:

- check the network's output for every blank field, choose the most confident "yes"
- or use multiple output nodes, e.g. 20: assign a priority to each, from 1 to 20, and use the field where the highest-priority output fires with the highest confidence

Potential alternative:

- map the Neural Network weights into a vector / chromosome: [W1, W2, W3, ...]

- create N random networks with random W values
- do a sufficient number of iterations; in each one:
  - make the players play against each other
  - assign success scores based on the results
  - from the most successful, select some and apply crossover and mutation
  - decrease the mutation factor over time
  - replace some of the bad performers with the successful new ones
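A minimal sketch of the crossover and mutation steps on weight-vector chromosomes; fitness evaluation (the games between players) is abstracted into a hypothetical score() placeholder:

```r
chrom_len <- 10   # number of network weights per chromosome
pop_size  <- 20

# Initial population: N random weight vectors
pop <- replicate(pop_size, runif(chrom_len, -1, 1), simplify = FALSE)

# One-point crossover of two parent chromosomes
crossover <- function(p1, p2) {
  cut <- sample(seq_len(chrom_len - 1), 1)
  c(p1[1:cut], p2[(cut + 1):chrom_len])
}

# Gaussian mutation; sd is the mutation factor, decreased over time
mutate <- function(chrom, sd = 0.1) chrom + rnorm(chrom_len, 0, sd)

# Hypothetical fitness: a placeholder standing in for played games
score <- function(chrom) -sum(chrom^2)

fitness <- sapply(pop, score)
parents <- pop[order(fitness, decreasing = TRUE)[1:2]]  # select the best
child   <- mutate(crossover(parents[[1]], parents[[2]]))
pop[[which.min(fitness)]] <- child   # replace the worst performer
```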

Some sources:

- free papers at http://jmlr.org/
- http://www.kaggle.com
- http://www.coursera.org

Which line is better, the straight line or the 11th-degree polynomial?

When to use it: often needed, but don't use it when it's not necessary

Examples:

- Content-based recommendation
- Collaborative filtering
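A minimal sketch of memory-based collaborative filtering with cosine similarity, on a hypothetical user-item rating matrix (0 = not rated):

```r
# Rows = users, columns = items; 0 means "not rated"
R <- matrix(c(5, 4, 0, 1,
              4, 5, 1, 0,
              1, 0, 5, 4,
              0, 1, 4, 5), nrow = 4, byrow = TRUE)

cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Similarity of every user to user 1
sims <- apply(R, 1, cosine, b = R[1, ])

# Predict user 1's rating of item 3 as the similarity-weighted
# average of the other users' ratings of that item
others <- 2:4
sum(sims[others] * R[others, 3]) / sum(sims[others])
```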

- agent based
- partial feedback
- feedback can arrive long after the action (delayed reward)
- time plays a special role

Advantage: non-linear models

Survival of passengers on the Titanic

Ozone concentration (ppb)