The Backpropagation Algorithm

A Basic Neural Network

**Neural Networks**

**Modern Advances**

**RBMs**

Image credits:

digital brain: Wikimedia user Gengiskanhg

simple neural network: Wikimedia user Miso

neural network activation diagram: Artificial Neural Networks wikibook

RBM example: Edwin Chen's blog

dropout comparison graphs: Geoffrey Hinton, Google Tech Talk "Brains, Sex, and Machine Learning"

Copyright (c) 2013 Thomas Lotze. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation.

**Restricted Boltzmann Machines**

**Dropout**

**Learning More**

Merchant's MCC is "individual_use":

no: x0 = 0

yes: x0 = 1

Merchant's avg. daily $, scaled:

x1 = (daily $) / (maximum daily $), so x1 ∈ [0, 1]

Weights: w01 = 0.7, w11 = 15

net1 = x0*w01 + x1*w11

o1 = σ(net1), so the output activation lies in [0, 1]
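As a concrete illustration, here is a minimal Python sketch of this forward pass, assuming σ is the logistic sigmoid (consistent with the [0, 1] output range); the function names are illustrative, not from the original.

```python
import math

def sigmoid(z):
    # Logistic function: squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

w01, w11 = 0.7, 15.0   # weights from the slide

def forward(x0, x1):
    # net1 = x0*w01 + x1*w11, then o1 = sigmoid(net1)
    net1 = x0 * w01 + x1 * w11
    return sigmoid(net1)

# A non-individual-use merchant (x0 = 0) with scaled daily $ of 0.2745
print(forward(0, 0.2745))   # ~0.98
```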

List of examples:

| individual_use | scaled_gpv | ... | is_fraudster |
|----------------|------------|-----|--------------|
| 0              | 0.2745     | ... | 0            |
| 1              | 0.0000     | ... | 1            |
| 1              | 0.4817     | ... | 0            |
| 0              | 0.0123     | ... | 1            |
| ...            | ...        | ... | ...          |

[Network activation diagram: inputs 0 and 0.2745 produce intermediate activations 0.98, 0.02, 0.76, 0.94 and output 0.19, versus ACTUAL = 0]

Compute the error and its derivative (the gradient)

Adjust the weights against the gradient (to move the output closer to the target)

Iterate to convergence
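A hedged sketch of this loop for the single unit above, trained on the example rows with squared error and plain gradient descent; the learning rate, initialization, and epoch count are assumptions, and the bias term is omitted as in the unit above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# (individual_use, scaled_gpv) -> is_fraudster, from the example table
examples = [
    ((0, 0.2745), 0),
    ((1, 0.0000), 1),
    ((1, 0.4817), 0),
    ((0, 0.0123), 1),
]

w = [0.0, 0.0]        # weights w01, w11, starting from zero (assumed)
learning_rate = 0.5   # assumed; not given in the original

for epoch in range(1000):
    for (x0, x1), target in examples:
        o = sigmoid(x0 * w[0] + x1 * w[1])
        # Squared error E = (o - target)^2 / 2, so
        # dE/dw_i = (o - target) * o * (1 - o) * x_i
        delta = (o - target) * o * (1.0 - o)
        # Step against the gradient to reduce the error
        w[0] -= learning_rate * delta * x0
        w[1] -= learning_rate * delta * x1

print(w)
```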

Basic Neural Networks, Unsupervised Feature Learning, and Deep Learning:

https://www.coursera.org/course/ml
http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial

Neural Networks win the Merck Kaggle Competition:

http://blog.kaggle.com/2012/11/01/deep-learning-how-i-did-it-merck-1st-place-interview/

RBMs:

http://blog.echen.me/2011/07/18/introduction-to-restricted-boltzmann-machines/
http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf

Coursera Neural Networks course from Hinton:

https://www.coursera.org/course/neuralnets

Deep Learning Tutorials (with Python examples):

http://deeplearning.net/tutorial/

Dropout:

youtube.com/watch?v=DleXA5ADG78
http://www.cs.toronto.edu/~nitish/msc_thesis.pdf
http://www.stanford.edu/~sidaw/cgi-bin/home/lib/exe/fetch.php?media=papers:fastdropout.pdf

More Code Examples and Libraries (in progress):

http://www.r-bloggers.com/restricted-boltzmann-machines-in-r/
https://github.com/lisa-lab/pylearn2

For each training example, randomly "drop" half the hidden nodes in each layer:

compute activations, then randomly set half the hidden nodes to 0

update weights as normal

After training, multiply all outgoing weights from hidden nodes by 0.5 (see the sketch below)
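A minimal numpy sketch of the procedure above for one hidden layer; the 0.5 rate and rescaling come from this section, everything else (shapes, activations) is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, rate=0.5):
    # Training: independently zero out each hidden activation
    # with probability `rate` (here 0.5)
    keep = rng.random(h.shape) >= rate
    return h * keep

def dropout_test_weights(W_out, rate=0.5):
    # After training: multiply outgoing weights from hidden nodes by
    # (1 - rate) = 0.5, so the expected input to the next layer
    # matches what it saw during training
    return W_out * (1.0 - rate)

h = np.array([0.98, 0.02, 0.76, 0.94])  # hidden activations (illustrative)
print(dropout_train(h))                 # roughly half the entries zeroed
```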

Significantly reduces overfitting

Breaks up complex co-adaptations among hidden nodes

Makes the overall network more robust

Shared weights result in very strong regularization (each weight is effectively reduced whenever its node is dropped)

Like training an ensemble of 2^N networks and averaging their predictions via the geometric mean

Also helps in the input layer (with a lower dropout rate)

Learn Features Before Labels:

Quickly Identify Common Structure

**Stochastic Spikes**

Instead of propagating a value p with probability 0.5, what about propagating a value of 0.5 with probability p?

This is what actual neurons do, and why they do it has long been a neurological puzzle.

Initial results from Hinton suggest that this is similar to dropout:

takes slightly longer to learn

requires a slightly larger network

and generalizes slightly better

The variance is p(1-p)/4 rather than p^2/4, which for small p approaches a Poisson distribution, as is also seen in actual neurons (see the check below)
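A quick numpy simulation of the variance claim above; the value of p is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 0.1, 1_000_000

# Dropout-style: propagate the value p with probability 0.5
dropout_like = p * (rng.random(n) < 0.5)

# Spike-style: propagate the value 0.5 with probability p
spike_like = 0.5 * (rng.random(n) < p)

print(dropout_like.var(), p**2 / 4)        # empirical vs. p^2/4
print(spike_like.var(), p * (1 - p) / 4)   # empirical vs. p(1-p)/4
```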

[Dropout comparison graphs: MNIST (digit classification)]

Boltzmann Machines

0/1 binary valued nodes

Undirected edges with weights

Probabilistic activation based on neighbors

Overall "energy" ~ weighted disagreement between nodes

[RBM diagram: visible units for Harry Potter, Avatar, LOTR, Gladiator, Titanic, Star Trek, connected to a layer of hidden units]

Deepening Layers

[Diagram: additional hidden layers stacked above the same visible units: Harry Potter, Avatar, LOTR, Gladiator, Titanic, Star Trek]

Pre-training for Neural Network

[Diagram: the stacked RBM weights used to initialize a neural network over the same visible units: Harry Potter, Avatar, LOTR, Gladiator, Titanic, Star Trek]

Train weights using Contrastive Divergence (a fast, approximate gradient descent). For each example:

set the visible activations

construct the hidden activations from the visible ones

reconstruct ("imagine") the visible activations from the hidden ones (a Gibbs-like step)

update weights: w_ij' = w_ij + ε(v_i*h_j - r_i*h_j), where r is the reconstructed visible vector and ε is a learning rate
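A hedged numpy sketch of one such update following the rule above, with binary sampling and without bias terms; the learning rate, the shapes, and the starting weights are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v, W, lr=0.1):
    # v: binary visible vector; W: (visible x hidden) weight matrix
    # Construct hidden activations from the data
    h = (rng.random(W.shape[1]) < sigmoid(v @ W)).astype(float)
    # Reconstruct ("imagine") the visible units, Gibbs-style
    r = (rng.random(W.shape[0]) < sigmoid(W @ h)).astype(float)
    # w_ij' = w_ij + lr * (v_i*h_j - r_i*h_j)
    return W + lr * (np.outer(v, h) - np.outer(r, h))

# Six visible units (one per movie), two hidden units
W = rng.normal(scale=0.1, size=(6, 2))
v = np.array([1, 1, 1, 0, 0, 0], dtype=float)  # e.g. a fantasy fan's ratings
W = cd1_update(v, W)
print(W)
```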

**More Improvements**

Momentum

Weight Decay (L2 Regularization)

Weight Scaling (Max-Norm Regularization)

More Gibbs Sampling

Rectified Linear Units

Parallel Batch Updates
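These improvements are listed without definitions; as an illustration only, a single SGD step combining three of them (momentum, L2 weight decay, and max-norm weight scaling) might look like this sketch, where every hyperparameter value is an assumption.

```python
import numpy as np

def sgd_step(w, grad, velocity, lr=0.01, momentum=0.9,
             weight_decay=1e-4, max_norm=3.0):
    # Momentum: a decaying running average of past gradient steps
    # Weight decay: add the L2 penalty gradient, weight_decay * w
    velocity = momentum * velocity - lr * (grad + weight_decay * w)
    w = w + velocity
    # Weight scaling (max-norm): if the weight vector leaves a ball of
    # radius max_norm, project it back onto the ball
    norm = np.linalg.norm(w)
    if norm > max_norm:
        w = w * (max_norm / norm)
    return w, velocity

w, v = np.zeros(4), np.zeros(4)
w, v = sgd_step(w, np.array([0.5, -0.2, 0.1, 0.0]), v)
print(w, v)
```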


"Deep Learning"