ImageNet Classification with Deep Convolutional Neural Networks

In NIPS 2012. Authors: Alex Krizhevsky, Ilya Sutskever, Geoff E. Hinton

Amir Shahroudy

on 6 December 2014

Transcript of ImageNet Classification with Deep Convolutional Neural Networks

Training on Multiple GPUs
Problem (Dataset)
Network Architecture
ReLU Nonlinearity
Local Response Normalization
Local normalization scheme aids generalization
Overlapping Pooling
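Pooling layers in CNNs summarize the outputs of neighboring groups of neurons in the same kernel map.

Traditionally, the neighborhoods summarized by adjacent pooling units do not overlap: the pooling stride equals the pooling window size. With overlapping pooling, the stride is smaller than the window size.

During training, models with overlapping pooling were found to be slightly more difficult to overfit.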
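The difference between traditional and overlapping pooling can be sketched in one dimension. This is only an illustration: the function name `max_pool_1d` and the toy input row are made up here, and the window size 3 with stride 2 is the setting reported in the paper, not something stated on the slide.

```python
def max_pool_1d(values, size, stride):
    """Max-pool a 1-D list with the given window size and stride."""
    pooled = []
    # Slide a window of `size` elements, moving `stride` elements each time.
    for start in range(0, len(values) - size + 1, stride):
        pooled.append(max(values[start:start + size]))
    return pooled


row = [1, 5, 2, 8, 3, 7, 9]  # toy activations, purely illustrative

# Traditional pooling: stride equals window size, so windows do not overlap.
non_overlapping = max_pool_1d(row, size=2, stride=2)

# Overlapping pooling: stride (2) is smaller than the window size (3),
# so adjacent windows share one element.
overlapping = max_pool_1d(row, size=3, stride=2)
```

In the overlapping case each pooled value sees one input that its neighbor also sees, which is the only change relative to the traditional scheme.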
Over 15 million labeled high-resolution images
Roughly 22,000 categories
Collected from the web
Labeled by human labelers using Amazon’s Mechanical Turk crowd-sourcing tool
ILSVRC: the annual ImageNet Large-Scale Visual Recognition Challenge
Uses a subset of ImageNet with roughly 1000 images in each of 1000 categories
In all: roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images
ImageNet consists of variable-resolution images, but the system requires a constant input dimensionality.
The images are therefore down-sampled to a fixed resolution of 256x256.

The mean activity over the training set is subtracted from each pixel, so the network is trained on the (centered) raw RGB values of the pixels.
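A minimal sketch of the centering step, assuming that "mean activity" means a separate mean for every pixel position and channel, computed over the whole training set. The slide does not spell this out, so treat that reading, the helper names, and the three-value toy "images" as assumptions.

```python
def per_pixel_mean(images):
    """Mean value at each position, taken across all images in the set."""
    count = len(images)
    return [sum(values) / count for values in zip(*images)]


def center(image, mean):
    """Subtract the training-set mean from one image."""
    return [value - m for value, m in zip(image, mean)]


# Two toy "images", each flattened to three raw pixel values (illustrative only).
training_set = [[0.0, 100.0, 200.0], [50.0, 150.0, 250.0]]

mean = per_pixel_mean(training_set)
centered = [center(image, mean) for image in training_set]
```

After centering, each position averages to zero across the training set, while the values themselves are still raw RGB intensities shifted by a constant.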
Reducing Overfitting
Details of learning
The output of the last fully-connected layer is fed to a 1000-way softmax, which produces a distribution over the 1000 class labels.
The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer.
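The two functions named above are small enough to write out. The sketch below uses three scores instead of 1000 and is not the authors' GPU code; the max-subtraction inside `softmax` is a standard numerical-stability trick added here, not something the slide mentions.

```python
import math


def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return max(0.0, x)


def softmax(scores):
    """Turn a list of scores into a probability distribution."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy pre-activations for three units (illustrative values only).
hidden = [relu(v) for v in [-2.0, 0.5, 3.0]]
probs = softmax(hidden)
```

The negative input is clipped to zero by the ReLU, and the softmax output sums to one, with the largest score receiving the largest probability.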
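The network has 5 convolutional layers followed by 3 fully connected layers.
[Figure: network architecture. From the input image, the five convolutional layers use 96, 256, 384, 384, and 256 kernels; the fully connected parts are labeled with 2048 neurons each. The legend marks the parts computed on GPU #1 and on GPU #2, intra-GPU and inter-GPU connections, response-normalization layers, and max-pooling layers.]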
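A single GTX 580 GPU has only 3GB of memory, which limits the maximum size of the networks that can be trained on it.

1.2 million training examples are enough to train networks that are too big to fit on one GPU.

So the net is spread across two GPUs.

Current GPUs can read from and write to one another’s memory directly.

Compared with a net with half as many kernels in each convolutional layer trained on one GPU, this scheme reduces the top-1 and top-5 error rates by 1.7% and 1.2%, respectively.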
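Nonlinearity as Rectified Linear Units (ReLUs): f(x) = max(0, x)
a^i_{x,y}: activity of a neuron computed by applying kernel i at position (x,y) and then applying the ReLU nonlinearity
b^i_{x,y}: response-normalized activity
b^i_{x,y} = a^i_{x,y} / \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} (a^j_{x,y})^2 \right)^{\beta}
The sum runs over n adjacent kernel maps at the same spatial position, and N is the total number of kernels in the layer.
The constants k, n, alpha, and beta are hyper-parameters whose values are determined using a validation set.
They used k = 2, n = 5, alpha = 0.0001, and beta = 0.75.
Response normalization reduces the top-1 and top-5 error rates by 1.4% and 1.2%, respectively.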
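Response normalization can be sketched for a single spatial position as follows. Each activity is divided by (k + alpha * sum of squared activities of up to n neighboring kernels) raised to the power beta, which is the form given in the paper. The function name and the toy inputs are made up for illustration; the defaults are the hyper-parameter values quoted above.

```python
def local_response_norm(activities, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize ReLU outputs across kernels at one fixed position (x, y).

    activities[i] is the output of kernel i at that position.
    """
    num_kernels = len(activities)
    half = n // 2
    normalized = []
    for i, a in enumerate(activities):
        # Neighborhood of kernel indices, clipped at the first and last kernel.
        lo = max(0, i - half)
        hi = min(num_kernels - 1, i + half)
        energy = sum(activities[j] ** 2 for j in range(lo, hi + 1))
        normalized.append(a / (k + alpha * energy) ** beta)
    return normalized


# Toy example: the same unit next to a quiet and next to a very active neighbor.
quiet = local_response_norm([1.0, 0.0])
busy = local_response_norm([1.0, 100.0])
```

The unit's response is smaller when its neighbor is strongly active, which is the competition between kernels that the normalization is meant to create.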
This architecture has 60 million parameters, so there is a danger of overfitting.
To combat overfitting:
Data Augmentation
Label-preserving transformations
Extracting random 224x224 patches (and their horizontal reflections) from the 256x256 images and training the network on these extracted patches
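A rough sketch of the patch extraction, with an image represented as a list of rows. The 0.5 flip probability and the helper name `random_crop_and_flip` are assumptions made for this example; the slide only says that patches and their horizontal reflections are both used.

```python
import random


def random_crop_and_flip(image, crop=224, rng=random):
    """Take a random crop x crop patch and mirror it half of the time.

    image is a list of rows; each row is a list of pixels.
    """
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop)    # randint includes both endpoints
    left = rng.randint(0, width - crop)
    patch = [row[left:left + crop] for row in image[top:top + crop]]
    if rng.random() < 0.5:                 # assumed flip probability
        patch = [row[::-1] for row in patch]  # horizontal reflection
    return patch


# Stand-in 256x256 "image" whose pixels just record their own coordinates.
image = [[(r, c) for c in range(256)] for r in range(256)]
patch = random_crop_and_flip(image)
```

Because the crop position and the flip are drawn fresh for every training example, the network rarely sees exactly the same patch twice, while the label of the image is unchanged.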
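Perform PCA on the set of RGB pixel values over the training set.
Add multiples of the found principal components to each training image: to each RGB pixel, add
[p_1, p_2, p_3] [\alpha_1 \lambda_1, \alpha_2 \lambda_2, \alpha_3 \lambda_3]^T
p_i and \lambda_i: the i-th eigenvector and eigenvalue of the 3x3 covariance matrix of RGB pixel values
\alpha_i: a random variable drawn from a Gaussian with mean 0 and standard deviation 0.1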
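The color perturbation can be sketched once the eigenvectors and eigenvalues are known. The sketch below skips the PCA itself and uses placeholder eigenvectors and eigenvalues that are made up for illustration; only the Gaussian with mean 0 and standard deviation 0.1 comes from the slide. The offset is the sum over the three components of eigenvector times (random factor times eigenvalue).

```python
import random


def pca_color_offset(eigenvectors, eigenvalues, sd=0.1, rng=random):
    """Return one RGB offset: sum_i p_i * (alpha_i * lambda_i)."""
    alphas = [rng.gauss(0.0, sd) for _ in eigenvalues]
    offset = [0.0, 0.0, 0.0]
    for p, lam, alpha in zip(eigenvectors, eigenvalues, alphas):
        for channel in range(3):
            offset[channel] += p[channel] * alpha * lam
    return offset


def augment(image, offset):
    """Add the same RGB offset to every pixel of one image."""
    return [[pixel[c] + offset[c] for c in range(3)] for pixel in image]


# Placeholder principal components (NOT computed from ImageNet).
eigenvectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
eigenvalues = [3.0, 2.0, 1.0]

image = [[10.0, 20.0, 30.0], [40.0, 50.0, 60.0]]  # two RGB pixels
offset = pca_color_offset(eigenvectors, eigenvalues)
augmented = augment(image, offset)
```

One offset is drawn per image and applied to all of its pixels, so the perturbation changes overall color and intensity rather than adding per-pixel noise.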
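Training using stochastic gradient descent with momentum and weight decay:
v_{i+1} = 0.9 v_i - 0.0005 \epsilon w_i - \epsilon \left\langle \partial L / \partial w |_{w_i} \right\rangle_{D_i}
w_{i+1} = w_i + v_{i+1}
i is the iteration index,
v is the momentum variable,
\epsilon is the learning rate,
and \langle \partial L / \partial w |_{w_i} \rangle_{D_i} is the average over the i-th batch D_i of the derivative of the objective with respect to w, evaluated at w_i.
The weights in each layer were initialized from a zero-mean Gaussian distribution with standard deviation 0.01.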
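A small sketch of one training step under this scheme, on plain Python lists. The momentum of 0.9 and the weight decay of 0.0005 are the values reported in the paper; the function names, the toy objective L(w) = 0.5 * w^2, and the learning rate of 0.1 used in the loop are assumptions chosen only to make the example run.

```python
import random


def init_weights(count, sd=0.01, rng=random):
    """Draw weights from a zero-mean Gaussian with standard deviation sd."""
    return [rng.gauss(0.0, sd) for _ in range(count)]


def sgd_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One update: v <- momentum*v - weight_decay*lr*w - lr*grad; w <- w + v."""
    new_v = [momentum * vi - weight_decay * lr * wi - lr * gi
             for wi, vi, gi in zip(w, v, grad)]
    new_w = [wi + nvi for wi, nvi in zip(w, new_v)]
    return new_w, new_v


# Single step from w = 1, v = 0 with gradient 1 and learning rate 0.1.
w_one, v_one = sgd_step([1.0], [0.0], [1.0], lr=0.1)

# Toy objective L(w) = 0.5 * w^2, whose gradient is simply w.
w, v = [1.0], [0.0]
for _ in range(200):
    w, v = sgd_step(w, v, grad=w, lr=0.1)

layer = init_weights(4)  # e.g. four weights drawn from N(0, 0.01^2)
```

In real training the gradient would be the batch average described above rather than an exact derivative, but the update itself has the same form.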
Dropout: set to zero the output of each hidden neuron with probability 0.5.
Neurons that are dropped in this way do not contribute to the forward pass and do not participate in backpropagation.
Dropout is applied in the first two fully-connected layers.
It roughly doubles the number of iterations required to converge.
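The forward pass of this scheme can be sketched as below. The training-time function follows the description above; the test-time function, which keeps all neurons and multiplies their outputs by 0.5, follows the paper and is not stated on the slide. Function names and input values are illustrative only.

```python
import random


def dropout_train(outputs, p=0.5, rng=random):
    """During training, zero each output independently with probability p."""
    return [0.0 if rng.random() < p else o for o in outputs]


def dropout_test(outputs, p=0.5):
    """At test time, keep every neuron but scale outputs by the keep probability."""
    return [o * (1.0 - p) for o in outputs]


activations = [2.0, 4.0, 6.0, 8.0]  # toy hidden-layer outputs
train_out = dropout_train(activations)
test_out = dropout_test(activations)
```

A full implementation would also remember which units were zeroed, so that no gradient flows through them during backpropagation; that bookkeeping is left out here.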