Transcript

Architecture of shallow vs deep learning

weights

bias

objective/loss function

optimizer (backpropagation)

Cost function on empirical data

Further, we now have automatic feature extraction

Deep learning was not popular early on due to:

(i) Hardware Infrastructure

(ii) Datasets

Empirical risk minimization

Surrogate loss functions

Cost function on the data-generating distribution

Impact of deep learning

  • Near human-level image classification
  • Near human-level speech recognition
  • Near human-level handwriting transcription
  • Improved self-driving cars
  • Improved ad targeting
  • Improved search
  • Natural language queries
  • Superhuman game playing

Learning differs from pure optimization

Gradient descent variants:

- Batch gradient descent
- Mini-batch gradient descent
- Stochastic gradient descent
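A minimal NumPy sketch contrasting the three variants on a simple linear-regression (MSE) objective; the data, learning rate, and epoch count are illustrative assumptions, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 samples, 3 features (hypothetical data)
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

def gradient(w, X_batch, y_batch):
    # Gradient of the MSE loss (1/n) * ||X w - y||^2 with respect to w
    n = len(y_batch)
    return 2.0 / n * X_batch.T @ (X_batch @ w - y_batch)

def gradient_descent(batch_size, lr=0.05, epochs=100):
    w = np.zeros(3)
    n = len(y)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            w -= lr * gradient(w, X[b], y[b])
    return w

w_batch = gradient_descent(batch_size=100)  # batch GD: whole dataset per step
w_mini  = gradient_descent(batch_size=16)   # mini-batch GD
w_sgd   = gradient_descent(batch_size=1)    # stochastic GD: one sample per step
print(w_batch, w_mini, w_sgd)               # all approach the true weights
```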

Diving deeper

Deep learning is inspired by the human brain, but it is not an exact replica

Deep Learning and Human Brain

Ill-conditioning in neural networks occurs when small changes in the input lead to large changes in the output.

Causes:

(i) Vanishing gradients

(ii) Poorly scaled input

(iii) Overly large or small learning rate

(iv) Activation function issues
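A small illustrative sketch of the ill-conditioning just described (the weight matrix and perturbation are hypothetical): a badly scaled linear map turns a tiny input change into a large output change.

```python
import numpy as np

# Hypothetical ill-conditioned linear layer: one direction is amplified ~1e6x
W = np.array([[1.0, 0.0],
              [0.0, 1e6]])

x = np.array([1.0, 1.0])
x_perturbed = x + np.array([0.0, 1e-3])   # tiny change in the input

print(np.linalg.cond(W))        # condition number ~1e6
print(W @ x_perturbed - W @ x)  # output changes by [0, 1000] -- a large jump
```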

Local minima are points where the loss function reaches a minimum in a small neighborhood, but they may not be the global minimum

Challenges in Neural Network optimization

Plateau problem of neural networks

Saddle points

Flat Regions

Cliffs - regions of sudden, abrupt increase or decrease in the loss function

Model identifiability refers to whether a given model's parameters can be uniquely determined from data. In deep learning, models often lack identifiability, meaning different parameter sets can produce the same outputs.
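As a minimal illustration of this non-identifiability (hypothetical layer sizes, NumPy only): permuting the hidden units of a small network, together with the matching rows and columns of its weight matrices, gives a different parameter set with exactly the same outputs.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # input(3) -> hidden(4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)   # hidden(4) -> output(1)

def forward(x, W1, b1, W2, b2):
    h = np.tanh(W1 @ x + b1)
    return W2 @ h + b2

# Permute the hidden units: a different parameter set...
perm = np.array([2, 0, 3, 1])
W1_p, b1_p = W1[perm], b1[perm]
W2_p = W2[:, perm]

x = rng.normal(size=3)
print(forward(x, W1, b1, W2, b2))          # ...produces exactly the same output
print(forward(x, W1_p, b1_p, W2_p, b2))
```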

Neural Networks vs Machine Learning

Machine Learning:
  • manual features
  • tabular data

Three RNN design patterns:

- Recurrent networks that produce an output at each time step and have recurrent connections between hidden units (see the sketch after this list).
- Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step.
- Recurrent networks with recurrent connections between hidden units that read an entire sequence and then produce a single output.
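A minimal NumPy sketch of the first pattern (hidden-to-hidden recurrence with an output at every time step); the sizes and weight initialization are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 5, 8, 3, 6                 # hypothetical dimensions

W_xh = rng.normal(scale=0.1, size=(n_hid, n_in))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))  # hidden -> hidden (recurrence)
W_hy = rng.normal(scale=0.1, size=(n_out, n_hid))  # hidden -> output
b_h, b_y = np.zeros(n_hid), np.zeros(n_out)

x_seq = rng.normal(size=(T, n_in))                 # an input sequence of length T
h = np.zeros(n_hid)                                # initial hidden state
outputs = []
for x_t in x_seq:
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)       # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)
    outputs.append(W_hy @ h + b_y)                 # o_t = W_hy h_t + c

print(np.array(outputs).shape)   # (T, n_out): one output per time step
```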

Unfolding computational graphs

RNN Computational graph

Teacher forcing

Gradients in RNN

Fully connected graphical model

Introduction to Deep Learning

Recurrent Networks as Directed Graphical Models

Efficient parameterization of models with hidden states

Modeling Sequences Conditioned on Context with RNNs

Conditional RNN with variable-length input sequences

Single fixed-length input vector

Deep RNN - ways of adding depth:

- more hidden states
- deeper input-to-hidden (Inp-Hi) computation
- deeper hidden-to-output (Hi-Op) computation
- with skip connections

Applications

- handwriting recognition
- speech recognition
- bioinformatics

Recursive Neural Networks

Gated Recurrent Units

  • Predicting stock prices
  • Translating languages
  • Chatbots & speech assistants
  • Handwriting and gesture recognition
  • Time-series forecasting
  • Sentiment analysis
  • Music or video generation

LeNet-5 layer shapes (reading from input to output):

- C1: convolution, 32 × 32 input → 28 × 28 × 6
- P2: pooling, 28 × 28 × 6 → 14 × 14 × 6
- C3: convolution, 14 × 14 × 6 → 10 × 10 × 16
- P4: pooling, 10 × 10 × 16 → 5 × 5 × 16
- C5: 120-feature vector
- Output: 10 output neurons
- Activation: ReLU
- Loss layer: MSE (Mean Square Error)
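A PyTorch-style sketch of this layout; the layer shapes follow the notes above, while the pooling type is not specified there, so max pooling is assumed here.

```python
import torch
import torch.nn as nn

# Sketch of LeNet-5 following the shapes in the notes:
# 32x32 input -> C1 28x28x6 -> P2 14x14x6 -> C3 10x10x16 -> P4 5x5x16 -> C5 120 -> 10 outputs
class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # C1: 32x32x1 -> 28x28x6
            nn.ReLU(),
            nn.MaxPool2d(2),                  # P2: 28x28x6 -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5),  # C3: 14x14x6 -> 10x10x16
            nn.ReLU(),
            nn.MaxPool2d(2),                  # P4: 10x10x16 -> 5x5x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),       # C5: 120-feature vector
            nn.ReLU(),
            nn.Linear(120, num_classes),      # 10 output neurons
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
print(model(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```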

CNN

Improving Gradient Descent

AlexNet

- ImageNet dataset

- 15 million images

- 22,000 object categories

- GTX 580 GPU, 3 GB memory

Pretrained Models

- LeNet-5
- AlexNet

Problem: Overfitting

Solution: Dropout
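A minimal NumPy sketch of how dropout combats overfitting by randomly zeroing units during training; the inverted-dropout formulation and the activation values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5, training=True):
    # Inverted dropout: randomly zero units during training and rescale so the
    # expected activation stays the same; identity at inference time.
    if not training:
        return h
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

h = np.ones((2, 8))                 # hypothetical hidden activations
print(dropout(h))                   # roughly half the units zeroed, the rest scaled by 2
print(dropout(h, training=False))   # unchanged at test time
```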

Fully Connected (FC)

Supervised Deep Learning

Activation Functions

Linear functions do not capture complex relationships, so non-linear activation functions are used (though some non-linear activations have the problem of vanishing gradients).

- tanh - vanishing gradients
- Rectified Linear Unit (ReLU) - dying ReLU problem
- Leaky ReLU - fixes dying ReLU
- Exponential Linear Unit (ELU)
- Swish - self-gated activation function
- Softmax - for multi-class problems
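A NumPy sketch of these activations; the definitions and the default slope/scale parameters are standard choices assumed here rather than taken from the notes.

```python
import numpy as np

def tanh(x):               return np.tanh(x)                        # saturates -> vanishing gradients
def relu(x):               return np.maximum(0.0, x)                # can "die" if inputs stay negative
def leaky_relu(x, a=0.01): return np.where(x > 0, x, a * x)         # small negative slope fixes dying ReLU
def elu(x, a=1.0):         return np.where(x > 0, x, a * (np.exp(x) - 1))
def swish(x):              return x / (1.0 + np.exp(-x))            # self-gated: x * sigmoid(x)

def softmax(z):
    # For multi-class outputs: exponentiate (shifted for numerical stability) and normalize.
    e = np.exp(z - z.max())
    return e / e.sum()

x = np.linspace(-3, 3, 7)
print(relu(x))
print(softmax(np.array([1.0, 2.0, 3.0])))   # probabilities summing to 1
```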

Pooling Layer

These layers downsample the feature map to introduce translation invariance, which reduces overfitting in the CNN model.
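A naive NumPy sketch (illustrative, single-channel 2×2 max pooling) of the downsampling a pooling layer performs:

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    # Naive max pooling over a single-channel feature map (illustrative only).
    h_out = (x.shape[0] - size) // stride + 1
    w_out = (x.shape[1] - size) // stride + 1
    out = np.empty((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

fmap = np.arange(16).reshape(4, 4)   # hypothetical 4x4 feature map
print(max_pool2d(fmap))              # 2x2 map of local maxima
```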

Advantages

- Automatic feature extraction

- Translation invariance

- Efficient parameter sharing

- Image classification, analysis, object detection

Hyperparameters

Padding - Padding refers to the addition of extra pixels around the edge of the input image.

Stride - Stride refers to the number of pixels by which a kernel moves across the input image.

Depth of the layer - The ‘depth’ of a layer refers to the number of kernels it contains. Each filter produces a separate feature map, and the collection of these feature maps forms the complete output of the layer.
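The notes do not state the output-size formula, so as an assumed but standard relation, here is how kernel size, stride, and padding determine the output feature-map size:

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    # Standard formula: floor((W - K + 2P) / S) + 1
    return (input_size - kernel_size + 2 * padding) // stride + 1

# Examples matching the LeNet-5 shapes above:
print(conv_output_size(32, 5))            # 28 (C1: 5x5 kernel, stride 1, no padding)
print(conv_output_size(28, 2, stride=2))  # 14 (P2: 2x2 pooling, stride 2)
```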

Convolution Layer

The convolution operation involves multiplying the kernel values by the original pixel values of the image and then summing up the results.

Channels
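A naive NumPy sketch of the convolution operation described above (single channel, cross-correlation form as commonly used in CNNs; the image and kernel values are hypothetical, and the stride and padding hyperparameters from the previous section are included):

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    # Slide the kernel over the (optionally padded) image, multiply element-wise, and sum.
    if padding:
        image = np.pad(image, padding)
    kh, kw = kernel.shape
    h_out = (image.shape[0] - kh) // stride + 1
    w_out = (image.shape[1] - kw) // stride + 1
    out = np.empty((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)
    return out

img = np.arange(25, dtype=float).reshape(5, 5)         # hypothetical 5x5 single-channel image
kernel = np.ones((3, 3)) / 9.0                         # 3x3 averaging kernel
print(conv2d(img, kernel).shape)                       # (3, 3): (5 - 3)//1 + 1 = 3
print(conv2d(img, kernel, stride=2, padding=1).shape)  # (3, 3): (5 + 2*1 - 3)//2 + 1 = 3
```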