Problem (Dataset)

**Network Architecture**

ReLU Nonlinearity

Local Response Normalization

Local normalization scheme aids generalization

Overlapping Pooling

Pooling layers in CNNs summarize the outputs of neighboring groups of neurons in the same kernel map.

Traditionally, the neighborhoods summarized by adjacent pooling units do not overlap!

Models with overlapping pooling are slightly more difficult to overfit.
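
As a toy illustration (not from the slides), a PyTorch sketch contrasting non-overlapping pooling with the overlapping setting used in the paper (window z = 3, stride s = 2):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 8)  # toy feature map

# Traditional, non-overlapping pooling: stride equals the window size (s = z = 2).
non_overlapping = nn.MaxPool2d(kernel_size=2, stride=2)(x)  # -> 1x1x4x4

# Overlapping pooling as in the paper: window z = 3, stride s = 2 (s < z),
# so neighboring pooling units summarize overlapping neighborhoods.
overlapping = nn.MaxPool2d(kernel_size=3, stride=2)(x)      # -> 1x1x3x3

print(non_overlapping.shape, overlapping.shape)
```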

**ImageNet Classification with Deep Convolutional Neural Networks**

in NIPS 2012

Authors: A. Krizhevsky, I. Sutskever, G. E. Hinton

Presented by: Amir Shahroudy

ImageNet

Over 15 million labeled high-resolution images

Roughly 22,000 categories

Collected from the web

Labeled by human labelers using Amazon’s Mechanical Turk crowd-sourcing tool

ILSVRC

Annual competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)

Uses a subset of ImageNet with roughly 1000 images in each of 1000 categories.

Each year: 1.2 million training images, 50,000 validation images, and 150,000 testing images.

Normalization

ImageNet consists of variable-resolution images, while our system requires a constant input dimensionality.

Down-sample the images to a fixed resolution of 256x256.

Subtract the mean activity over the training set from each pixel, so the network is trained on the (centered) raw RGB values of the pixels.
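
A rough NumPy/Pillow sketch of this preprocessing (the paper rescales so the shorter side is 256 and crops the central 256x256 patch; function and variable names here are illustrative):

```python
import numpy as np
from PIL import Image

def to_256x256(path):
    """Rescale so the shorter side is 256 pixels, then crop the central 256x256 patch."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = 256.0 / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    left, top = (w - 256) // 2, (h - 256) // 2
    return np.asarray(img.crop((left, top, left + 256, top + 256)), dtype=np.float32)

# mean_image: per-pixel mean activity over the whole training set
# (a zero placeholder here; in practice it is computed once from the training images).
mean_image = np.zeros((256, 256, 3), dtype=np.float32)

# Usage (path is illustrative):
#   centered = to_256x256("some_training_image.jpg") - mean_image
```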

Reducing Overfitting

Details of learning

Results

The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels
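
A k-way softmax maps the last layer's outputs z to class probabilities p_c = exp(z_c) / Σ_j exp(z_j), so the 1000 outputs sum to one.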

The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer.

5 convolutional layers

3 fully-connected layers

[Architecture figure: one half of the net runs on GPU#1 and the other on GPU#2, with intra-GPU connections at every layer and inter-GPU connections only at certain layers; response-normalization and max-pooling layers are marked. Layer sizes: 224x224x3 input image; conv1: 96 kernels of 11x11x3; conv2: 256 kernels of 5x5x48; conv3: 384 kernels of 3x3x256; conv4: 384 kernels of 3x3x192; conv5: 256 kernels of 3x3x192; fully-connected layers: 2048 neurons each (per GPU).]
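
A single-tower PyTorch sketch of these layer sizes (an approximation, not the authors' code: the original splits most layers across two GPUs, so the 48/192 kernel depths in the figure become 96/384 here, and conv1 padding is chosen so a 224x224 input yields the paper's 55x55 feature map):

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),   # conv1: 96 kernels, 11x11x3
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),                   # overlapping pooling
            nn.Conv2d(96, 256, kernel_size=5, padding=2),            # conv2: 256 kernels, 5x5
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1),           # conv3: 384 kernels, 3x3
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),           # conv4: 384 kernels, 3x3
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),           # conv5: 256 kernels, 3x3
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),   # fc6 (2048 neurons per GPU in the figure)
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),          # fc7
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),   # fc8, fed to the 1000-way softmax
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

print(AlexNetSketch()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```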

A single GTX 580 GPU has only 3GB of memory, which limits the maximum size of the networks that can be trained on it.

1.2 million training examples are enough to train networks that are too big to fit on one GPU.

Spread the net across two GPUs.

Current GPUs can read from and write to one another’s memory directly.

This scheme reduces our top-1 and top-5 error rates by 1.7% and 1.2%, respectively.

Nonlinearity as Rectified Linear Units (ReLUs)
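
The ReLU nonlinearity is simply f(x) = max(0, x); the paper reports that deep networks with ReLUs train several times faster than equivalents with saturating units such as tanh.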

Let a^i_{x,y} denote the activity of a neuron computed by applying kernel i at position (x, y) and then applying the ReLU nonlinearity, and b^i_{x,y} the corresponding response-normalized activity.
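
Written out, the normalization from the paper is:

$$ b^{i}_{x,y} = a^{i}_{x,y} \Bigg/ \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left( a^{j}_{x,y} \right)^{2} \right)^{\beta} $$

where the sum runs over n "adjacent" kernel maps at the same spatial position and N is the total number of kernels in the layer.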

The constants are hyper-parameters whose values are determined using a validation set;

They used k = 2, n = 5, alpha = 0.0001, and beta = 0.75

Response normalization reduces the top-1 and top-5 error rates by 1.4% and 1.2% respectively

This architecture has 60 million parameters, so there is a real danger of overfitting. To combat overfitting:

Data Augmentation

Label-preserving transformations

Extracting random 224x224 patches (and their horizontal reflections) from the 256x256 images and training the network on these extracted patches
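
A small NumPy sketch of the random-crop-and-reflect step (names and the per-call sampling are illustrative; the paper additionally averages predictions over ten fixed patches at test time):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop_and_flip(image_256):
    """Take a random 224x224 patch from a 256x256x3 image, flipping horizontally half the time."""
    top = rng.integers(0, 256 - 224 + 1)
    left = rng.integers(0, 256 - 224 + 1)
    patch = image_256[top:top + 224, left:left + 224, :]
    if rng.random() < 0.5:
        patch = patch[:, ::-1, :]  # horizontal reflection
    return patch

img = rng.random((256, 256, 3), dtype=np.float32)  # stand-in for a centered training image
print(random_crop_and_flip(img).shape)  # (224, 224, 3)
```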

Perform PCA on the set of RGB pixel values over the training set.

Add multiples of the found principal components to each training image, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean 0 and standard deviation 0.1.

The eigenvectors and eigenvalues are those of the 3x3 covariance matrix of RGB pixel values.
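
A NumPy sketch of this PCA color perturbation (variable names are illustrative; `pixels` stands in for the full set of training-set RGB values):

```python
import numpy as np

rng = np.random.default_rng(0)

# pixels: all training-set RGB values stacked as an (N, 3) array (random placeholder here).
pixels = rng.random((10000, 3))

# Eigen-decomposition of the 3x3 covariance matrix of RGB values.
cov = np.cov(pixels, rowvar=False)      # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues and eigenvectors (as columns)

def pca_color_augment(image):
    """Add alpha_i * lambda_i * p_i to every pixel, with alpha_i ~ N(0, 0.1)."""
    alphas = rng.normal(0.0, 0.1, size=3)
    shift = eigvecs @ (alphas * eigvals)  # [p1 p2 p3] [a1*l1, a2*l2, a3*l3]^T
    return image + shift                  # broadcast over the HxWx3 image

augmented = pca_color_augment(rng.random((224, 224, 3)))
```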

Training using stochastic gradient descent

In the update rule below: i is the iteration index, v is the momentum variable, epsilon is the learning rate, and the gradient term is the average over the i-th batch D_i of the derivative of the objective with respect to w, evaluated at w_i.
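
The update rule from the paper, with momentum 0.9 and weight decay 0.0005:

$$ v_{i+1} := 0.9\, v_i \;-\; 0.0005\,\epsilon\, w_i \;-\; \epsilon \left\langle \frac{\partial L}{\partial w} \bigg|_{w_i} \right\rangle_{D_i}, \qquad w_{i+1} := w_i + v_{i+1} $$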

Initialized the weights in each layer from a zero-mean Gaussian distribution with standard deviation 0.01

Dropout

Set to zero the output of each hidden neuron with probability 0.5

These neurons do not contribute to the forward pass and do not participate in backpropagation

Applied in the first two fully-connected layers

It roughly doubles the number of iterations required to converge.
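
A NumPy sketch of this dropout scheme (the paper uses all neurons at test time but multiplies their outputs by 0.5; names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    """Zero each hidden activation with probability p during training;
    at test time use all neurons but scale their outputs by (1 - p)."""
    if training:
        mask = rng.random(activations.shape) >= p  # keep each unit with probability 1 - p
        return activations * mask                  # dropped units contribute nothing forward or backward
    return activations * (1.0 - p)

h = rng.standard_normal(8)
print(dropout(h, training=True))
print(dropout(h, training=False))
```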