# Machine Learning

by Zoltan Papp on 20 February 2014


#### Transcript of Machine Learning

Supervised Learning
inferring a function from labeled training data
Recommender Systems
Reinforcement Learning
Machine Learning
What is Machine Learning?

1959, Arthur Samuel: "Field of study that gives computers the ability to learn without being explicitly programmed"

Tom M. Mitchell: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E"
Unsupervised Learning
Logistic Regression
Neural Networks
Support Vector Machines
Linear Regression
Decision Trees
library(ggplot2)
lmres <- lm(dist ~ speed, data = cars)
pred <- predict(lmres, se.fit = TRUE)
ggplot(data = cars, aes(x = speed)) + geom_point(aes(y = dist)) + geom_line(aes(y = pred$fit), color = "red")
summary(lmres)

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
lmres <- lm(dist ~ poly(speed, 11), data = cars)
pred <- predict(lmres, se.fit = TRUE)
ggplot(data = cars, aes(x = speed)) + geom_point(aes(y = dist)) + geom_line(aes(y = pred$fit), color = "red")
summary(lmres)
Call:
lm(formula = dist ~ poly(speed, 11), data = cars)

Residuals:
Min 1Q Median 3Q Max
-22.228 -9.373 -1.230 6.411 39.414

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.9800 2.1937 19.593 < 2e-16 ***
poly(speed, 11)1 145.5523 15.5116 9.383 1.94e-11 ***
poly(speed, 11)2 22.9958 15.5116 1.482 0.146
poly(speed, 11)3 13.7969 15.5116 0.889 0.379
poly(speed, 11)4 18.3452 15.5116 1.183 0.244
poly(speed, 11)5 5.8811 15.5116 0.379 0.707
poly(speed, 11)6 -11.6775 15.5116 -0.753 0.456
poly(speed, 11)7 -13.5028 15.5116 -0.870 0.389
poly(speed, 11)8 -16.9023 15.5116 -1.090 0.283
poly(speed, 11)9 -17.7870 15.5116 -1.147 0.259
poly(speed, 11)10 -0.4219 15.5116 -0.027 0.978
poly(speed, 11)11 14.1118 15.5116 0.910 0.369
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.51 on 38 degrees of freedom
Multiple R-squared: 0.719, Adjusted R-squared: 0.6377
F-statistic: 8.84 on 11 and 38 DF, p-value: 1.644e-07
Clustering
K-means clustering
Dimensionality Reduction - PCA
Motivation:
Data compression: less disk space, faster learning (e.g. supervised learning)
Visualization
Example:
Speed up supervised learning by reducing input factors
Let's visualize 6-dimensional data
Algorithm with k=2
Algorithm with k=3
Motivation: group unlabeled data into k clusters
Example: group T-shirts into Small, Medium and Large groups
How this works: eigenvalue decomposition of the covariance matrix
How many components: keep enough that little variance is lost, e.g. ErrorVariance / TotalVariance < 1%
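The component-selection rule above can be sketched in R. This is a minimal illustration, not the presenter's code: the built-in `iris` measurements stand in for the slide's data (an assumption), and the 1% threshold is read as "keep at least 99% of the variance".

```r
# Assumption: the four numeric iris columns stand in for the slide's data.
x <- iris[, 1:4]

# prcomp() centers and scales the data, then finds the principal axes.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Squared standard deviations are the eigenvalues, i.e. the variance
# captured by each component.
eigenvalues <- pca$sdev^2
explained <- cumsum(eigenvalues) / sum(eigenvalues)

# Smallest k whose lost variance (1 - explained) is at most 1%.
k <- which(explained >= 0.99)[1]

# Data projected onto the first k components.
reduced <- pca$x[, 1:k, drop = FALSE]
print(round(explained, 4))
print(k)
```

Because the data is scaled, the same eigenvalues come out of `eigen(cor(x))$values`, which is the eigenvalue decomposition the slide refers to.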
T-shirts example
T-shirts clustered into: S M L
How to use it: run it multiple times with different random starts, compare the objective function
Objective function: minimize the average squared distance of points from their cluster centers
What k to use: experiment with the objective function, use domain knowledge
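A minimal sketch of this workflow in R follows. It assumes the built-in `women` data set (height and weight of 15 women) as a stand-in for T-shirt measurements; the slide's own data is not available.

```r
set.seed(1)

# Assumption: height/weight from the built-in `women` data set stand in
# for the T-shirt measurements; scaling gives both features equal weight.
x <- scale(women)

# nstart = 20 runs the algorithm from 20 random starts and keeps the run
# with the lowest objective (total within-cluster sum of squares).
fit <- kmeans(x, centers = 3, nstart = 20)
sizes <- c("S", "M", "L")[rank(fit$centers[, "height"])]
print(table(sizes[fit$cluster]))

# Objective value for several k, to look for a point of diminishing returns.
ks <- 2:6
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)
print(data.frame(k = ks, objective = wss))
```

The objective always drops as k grows, so it cannot pick k on its own; for T-shirts, domain knowledge (three sizes) settles it.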
Motivation: we want to generalize and predict values based on known input-output pairs
Example: let's predict how much distance is needed to stop the car if we know the speed
Stopping criteria:
error is normally distributed
sufficient variance is explained by the model
How does it work: analytical solution
dist = a * speed + b + Error
a = ?, b = ? so that sum(sqr(Error)) is minimal
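For a single input, the least-squares values of a and b have a closed form. The sketch below computes them directly on the `cars` data and compares them with `lm()`; the variable names are chosen here for illustration.

```r
x <- cars$speed
y <- cars$dist

# Closed-form least-squares solution for one input variable:
# slope = covariance(x, y) / variance(x), intercept from the means.
a <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b <- mean(y) - a * mean(x)

# The same values as fitted by lm().
fitted_coef <- coef(lm(dist ~ speed, data = cars))
print(c(a = a, b = b))
print(fitted_coef)
```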
Let's predict stop distance from speed for unknown values:
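One way to do this, sketched with the same linear model: `predict()` takes new inputs through `newdata`. The speeds below are arbitrary illustration values, and 30 lies outside the observed range of `cars`, so that prediction is an extrapolation.

```r
lmres <- lm(dist ~ speed, data = cars)

# Hypothetical speeds that we want stopping distances for.
new_speeds <- data.frame(speed = c(12, 21, 30))

# interval = "prediction" adds lower/upper bounds for a single new
# observation alongside the fitted value.
predicted <- predict(lmres, newdata = new_speeds, interval = "prediction")
print(cbind(new_speeds, predicted))
```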
Classification
Objective: do classification on data using one or more features
Example: will a consumer buy a new product
Will a consumer buy a product
Optimization
Overfitting: penalized using parameter regularization
Helping the model: scaling input features
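A minimal logistic regression sketch in R follows. The consumer data from the slide is not available, so the built-in `mtcars` data stands in (an assumption): it predicts the binary `am` column from car weight. Base R's `glm()` applies no regularization; that would need an additional package and is not shown.

```r
# Assumption: mtcars stands in for the consumer data; `am` is the 0/1 label.
data <- mtcars

# Scale the input feature to zero mean and unit variance.
data$wt_scaled <- as.numeric(scale(data$wt))

# family = binomial fits a logistic regression.
fit <- glm(am ~ wt_scaled, data = data, family = binomial)

# type = "response" returns probabilities; threshold at 0.5 to classify.
prob <- predict(fit, type = "response")
predicted_class <- as.integer(prob > 0.5)
accuracy <- mean(predicted_class == data$am)
print(accuracy)
```

This accuracy is measured on the training data itself, so it is an optimistic estimate.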
Objective: do classification on data using one or more features
Optimization: error backpropagation
Output: discrete vectors [1 0 0], [0 1 0], [0 0 1]
Example: many input features, e.g. image recognition
Examples: sigm(w0*1 + w1*x1 + w2*x2)
AND: w0 = -30, w1 = 20, w2 = 20
OR: w0 = -10, w1 = 20, w2 = 20
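The two weight sets above can be checked with a single sigmoid neuron. The sketch below evaluates it on all four binary input pairs; the function names are chosen here for illustration.

```r
# Sigmoid activation.
sigm <- function(z) 1 / (1 + exp(-z))

# One neuron with a bias weight w[1] and two input weights.
neuron <- function(x1, x2, w) sigm(w[1] * 1 + w[2] * x1 + w[3] * x2)

# All combinations of two binary inputs.
inputs <- expand.grid(x1 = c(0, 1), x2 = c(0, 1))

and_out <- neuron(inputs$x1, inputs$x2, c(-30, 20, 20))
or_out <- neuron(inputs$x1, inputs$x2, c(-10, 20, 20))

# Rounding the activations gives the logical truth tables.
print(cbind(inputs, AND = round(and_out), OR = round(or_out)))
```

The large weights push the sigmoid close to 0 or 1, so the neuron behaves like a logic gate.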
Regularization: against overfitting
Motivation: too many inputs, even worse with higher order features
Objective: do classification of samples using one or more features
Motivation: creating more robust classification by using modified cost function
Example: identify benign and malignant tumors based on medical results
Logistic Regression vs. SVM:
many features, few training examples: Logistic Regression or SVM without a kernel
few features, intermediate number of training examples: SVM with Gaussian kernel
few features, many training examples: manually create/add more features, SVM with Gaussian kernel

Neural Networks: good for any of the cases above but slower to train
Kernel trick resulting in nonlinear classifier:
Maximum-margin hyperplane
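The Gaussian kernel named in the comparison above measures how similar two samples are. A minimal sketch, assuming the common form exp(-||x - z||^2 / (2 * sigma^2)); the width `sigma` and the sample points are illustration values only.

```r
# Gaussian (RBF) kernel: 1 for identical points, approaching 0 as the
# points move apart; sigma controls how quickly similarity falls off.
gaussian_kernel <- function(x, z, sigma = 1) {
  exp(-sum((x - z)^2) / (2 * sigma^2))
}

# Hypothetical sample points.
p <- c(1, 2)
q <- c(3, 4)

same <- gaussian_kernel(p, p)
far <- gaussian_kernel(p, q)
far_wide <- gaussian_kernel(p, q, sigma = 5)
print(c(same = same, far = far, far_wide = far_wide))
```

A larger `sigma` makes distant points look more similar, which gives a smoother decision boundary.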
Overfitting: parameter regularization
Helping the model: input feature scaling for gradient descent
Motivation: solving a classification problem in an understandable way
Example: survival of passengers on Titanic
Learning: for example Iterative Dichotomiser 3 (ID3), entropy based
: easy to understand the result, random forest
: slow on continuous data
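The entropy-based choice that ID3 makes at each node can be sketched on R's built-in `Titanic` table. This computes only the information gain for the first split, not a full tree, and the helper names are chosen here for illustration.

```r
# Shannon entropy (in bits) of a vector of class counts.
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of a feature, given a feature-by-outcome count table.
information_gain <- function(tab) {
  parent <- entropy(colSums(tab))
  weights <- rowSums(tab) / sum(tab)
  children <- apply(tab, 1, entropy)
  parent - sum(weights * children)
}

# Dimensions of the Titanic table: 1 = Class, 2 = Sex, 3 = Age, 4 = Survived.
features <- c(Class = 1, Sex = 2, Age = 3)
gains <- sapply(features, function(i) {
  information_gain(margin.table(Titanic, c(i, 4)))
})
print(round(gains, 4))

# ID3 would split first on the feature with the largest gain.
best_feature <- names(which.max(gains))
print(best_feature)
```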
How to use available data:
training set
validation set (or cross validation)
test set
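A sketch of such a split on the `cars` data follows. The 30/10/10 sizes, the seed and the three candidate models are assumptions made for illustration: the validation set picks the model, and the test set is used once at the end.

```r
set.seed(2014)

# Shuffle the 50 rows of cars and cut them into three disjoint parts.
idx <- sample(nrow(cars))
train <- cars[idx[1:30], ]
validation <- cars[idx[31:40], ]
test <- cars[idx[41:50], ]

# Root mean squared prediction error of a model on a data set.
rmse <- function(model, data) {
  sqrt(mean((data$dist - predict(model, newdata = data))^2))
}

# Candidate models of increasing flexibility, fitted on the training set only.
formulas <- list(
  linear = dist ~ speed,
  quadratic = dist ~ poly(speed, 2),
  cubic = dist ~ poly(speed, 3)
)
models <- lapply(formulas, lm, data = train)

# Choose the model with the lowest validation error.
validation_rmse <- sapply(models, rmse, data = validation)
best <- names(which.min(validation_rmse))

# Report the chosen model's error on data it has never seen.
test_rmse <- rmse(models[[best]], test)
print(validation_rmse)
print(c(best = best))
print(test_rmse)
```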

Prevent overfitting

Understand the domain
Evolutionary Computing - an alternative teaching method
Teaching a Fiver player Neural Network
Representation:
Algorithm:
check output for every blank field
choose the most confident yes
multiple output nodes e.g. 20
assign priority to each from 1 to 20
use the field where highest priority output fires with highest confidence
Potential alternative:
mapping Neural Network weights into a vector / chromosome: [W1, W2, W3, ...]
create N random networks with random W values
Do sufficient number of iterations, in each:
make players play against each other
assign success numbers based on the results
from the most successful select some, do crossover and mutation
decrease mutation factor
replace some bad performers with successful new ones
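The loop above can be sketched in R. This is a toy version under loud assumptions: each chromosome is a plain weight vector, and instead of letting networks play against each other, a hypothetical `fitness()` function scores how close the weights are to an arbitrary target. Population size, iteration count and mutation settings are also illustration values.

```r
set.seed(42)

n_weights <- 5
pop_size <- 20
n_iter <- 50

# Hypothetical stand-in for "play against each other and assign success
# numbers": higher is better, best possible score is 0.
target <- seq(-1, 1, length.out = n_weights)
fitness <- function(w) -sum((w - target)^2)

# N random chromosomes [W1, W2, ...], one per row.
population <- matrix(rnorm(pop_size * n_weights), nrow = pop_size)
initial_best <- max(apply(population, 1, fitness))

mutation_sd <- 0.5
for (iter in seq_len(n_iter)) {
  scores <- apply(population, 1, fitness)
  ranked <- order(scores, decreasing = TRUE)

  # Select the most successful half as parents.
  parents <- population[ranked[1:(pop_size / 2)], , drop = FALSE]

  # Crossover: each child takes every weight from one of two random
  # parents. Mutation: add small Gaussian noise.
  children <- t(sapply(seq_len(pop_size / 2), function(i) {
    pair <- parents[sample(nrow(parents), 2), , drop = FALSE]
    from_first <- runif(n_weights) < 0.5
    child <- ifelse(from_first, pair[1, ], pair[2, ])
    child + rnorm(n_weights, sd = mutation_sd)
  }))

  # Replace the bad performers with the new children.
  population <- rbind(parents, children)

  # Decrease the mutation factor over time.
  mutation_sd <- mutation_sd * 0.95
}

final_best <- max(apply(population, 1, fitness))
print(c(initial = initial_best, final = final_best))
```

Because the parents are carried over unchanged, the best score can never get worse from one iteration to the next.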
Some sources:
free papers at http://jmlr.org/
http://www.kaggle.com
http://www.coursera.org
Which line is better?
When: often needed, don't use it when it's not necessary
Examples:
Content based recommendation
Collaborative filtering
agent based, partial feedback
long time feedback
time plays a special role