weights
bias
objective/loss function
optimizer (backpropagation)
Cost function on empirical data
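A minimal sketch tying the components above together (weights, bias, loss, optimizer); a hypothetical single-neuron example in plain NumPy, with illustrative names not taken from the source:

```python
import numpy as np

# Toy empirical data: y = 2x + 1 with noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y = 2 * x + 1 + 0.1 * rng.normal(size=(100, 1))

w = np.zeros((1, 1))   # weights
b = np.zeros(1)        # bias
lr = 0.1               # learning rate used by the optimizer

for step in range(200):
    y_hat = x @ w + b                      # forward pass
    loss = np.mean((y_hat - y) ** 2)       # objective/loss function (MSE on empirical data)
    grad_y = 2 * (y_hat - y) / len(x)      # backpropagation: dL/dy_hat
    grad_w = x.T @ grad_y                  # dL/dw
    grad_b = grad_y.sum(axis=0)            # dL/db
    w -= lr * grad_w                       # optimizer: gradient descent update
    b -= lr * grad_b
```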
Further, deep learning provides automatic feature extraction (no manual feature engineering)
Deep Learning was not popular early on due to
(i) Hardware Infrastructure
(ii) Datasets
Empirical risk minimization
Surrogate loss functions
Cost function on the data-generating distribution (see the formulas below)
Learning Differs from pure optimization
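Written out in the usual notation (standard formulation, not copied from these notes): learning minimizes the empirical risk as a proxy for the true risk, which is defined on the data-generating distribution we never observe directly.

```latex
% True (expected) risk: cost under the data-generating distribution p_data
J^*(\theta) = \mathbb{E}_{(\mathbf{x}, y) \sim p_{\text{data}}}\, L\big(f(\mathbf{x};\theta),\, y\big)

% Empirical risk: average cost over the m training examples (empirical risk minimization)
\hat{J}(\theta) = \frac{1}{m} \sum_{i=1}^{m} L\big(f(\mathbf{x}^{(i)};\theta),\, y^{(i)}\big)
```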
Mini-batch Gradient
Batch Gradient
Stochastic Gradient
Gradient descent
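The three variants differ only in how many examples feed each parameter update; a hedged NumPy sketch assuming a linear-regression loss (names are illustrative):

```python
import numpy as np

def gradient(w, X, y):
    """Gradient of mean squared error for a linear model y ~ X @ w."""
    return 2 * X.T @ (X @ w - y) / len(X)

def train(X, y, batch_size, lr=0.01, epochs=50):
    """batch_size = len(X) -> batch GD; 1 -> stochastic GD; in between -> mini-batch GD."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(X))          # shuffle each epoch
        for start in range(0, len(X), batch_size):
            sel = idx[start:start + batch_size]
            w -= lr * gradient(w, X[sel], y[sel])   # update on the selected examples
    return w
```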
Deep learning is inspired by the human brain, but it is not an exact replica
Deep Learning and Human Brain
Causes:
(i) Vanishing gradients
(ii) Poorly scaled input
(iii) Overly large or small learning rate
(iv) Activation function issues
Ill-conditioning in neural networks occurs when small changes in the input lead to large changes in the output
Local minima are points where the loss function reaches a minimum in a small neighborhood, but it may not be the global minimum
Challenges in Neural Network optimization
Plateau problem of neural network
Saddle points
Flat Regions
Cliffs - regions where the loss function increases or decreases abruptly
Model identifiability refers to whether a given model's parameters can be uniquely determined from data. In deep learning, models often lack identifiability, meaning different parameter sets can produce the same outputs.
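One concrete source of non-identifiability is weight-space symmetry: permuting the hidden units of a layer, together with their incoming and outgoing weights, leaves the network function unchanged. A small NumPy check (illustrative, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                              # batch of 4 inputs, 3 features
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)     # input -> 5 hidden units
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)     # hidden -> 2 outputs

def forward(W1, b1, W2, b2):
    h = np.tanh(x @ W1 + b1)
    return h @ W2 + b2

perm = rng.permutation(5)                                # reorder the hidden units
out_original = forward(W1, b1, W2, b2)
out_permuted = forward(W1[:, perm], b1[perm], W2[perm, :], b2)

print(np.allclose(out_original, out_permuted))           # True: different parameters, same outputs
```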
Neural Networks
Machine Learning
Recurrent networks that produce an output at each time step and have recurrent connections between hidden units
Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step
Recurrent networks with recurrent connections between hidden units that read an entire sequence and then produce a single output
Unfolding computational graphs
RNN Computational graph
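Unfolding the recurrence gives a chain-structured computational graph with one copy of the cell per time step and the same parameters shared across steps; a minimal forward-pass sketch (plain NumPy, names illustrative):

```python
import numpy as np

def rnn_forward(x_seq, h0, U, W, b, V, c):
    """Unrolled vanilla RNN: h_t = tanh(U @ x_t + W @ h_{t-1} + b), o_t = V @ h_t + c."""
    h, outputs = h0, []
    for x_t in x_seq:                       # one graph "copy" per time step
        h = np.tanh(U @ x_t + W @ h + b)    # shared parameters U, W, b at every step
        outputs.append(V @ h + c)           # output at each time step
    return outputs, h
```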
Teacher forcing
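Teacher forcing trains a model with output-to-input recurrence by feeding the ground-truth previous output rather than the model's own prediction; a schematic sketch (the predict_step function is hypothetical, for illustration only):

```python
def run_sequence(x_seq, y_seq, predict_step, y_start, teacher_forcing=True):
    """predict_step(x_t, y_prev) -> y_hat_t (hypothetical single-step model)."""
    y_prev, predictions = y_start, []
    for t, x_t in enumerate(x_seq):
        y_hat = predict_step(x_t, y_prev)
        predictions.append(y_hat)
        # Training with teacher forcing: condition the next step on the true output.
        # At test time (no teacher forcing): feed back the model's own prediction.
        y_prev = y_seq[t] if teacher_forcing else y_hat
    return predictions
```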
Gradients in RNN
Fully connected graphical model
Recurrent Networks as Directed Graphical Models
Efficient parameterization using hidden states
Modeling Sequences Conditioned on Context with RNNs
Conditional RNN with variable-length input sequences
Single fixed-length input vector
Deep RNN
With skip connections
Input-to-hidden (Inp-Hi)
Hidden-to-output (Hi-Op)
More hidden states
Applications
- Handwriting recognition
- Speech recognition
- Bioinformatics
Recursive Neural Networks
Gated Recurrent Units
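A gated recurrent unit uses a reset gate and an update gate to control how much of the previous hidden state is kept; a minimal NumPy cell following one common formulation (gate conventions vary between references, names illustrative, biases omitted):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step."""
    z = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))  # candidate state
    return (1 - z) * h_prev + z * h_tilde            # interpolate old and new state
```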
LeNet-5 architecture (layer: input shape → output shape)
- C1 (conv): 32 × 32 → 28 × 28 × 6
- P2 (pool): 28 × 28 × 6 → 14 × 14 × 6
- C3 (conv): 14 × 14 × 6 → 10 × 10 × 16
- P4 (pool): 10 × 10 × 16 → 5 × 5 × 16
- C5: 120-dimensional feature vector
- Output: 10 output neurons
- Activation: ReLU
- Loss layer: MSE (Mean Squared Error)
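A sketch of this architecture in PyTorch, following the shapes listed above (ReLU and a direct 120 → 10 classifier as in these notes; the classic LeNet-5 additionally has an F6 layer of 84 units):

```python
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # C1: 32x32   -> 28x28x6
            nn.ReLU(),
            nn.MaxPool2d(2),                  # P2: 28x28x6 -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5),  # C3: 14x14x6 -> 10x10x16
            nn.ReLU(),
            nn.MaxPool2d(2),                  # P4: 10x10x16 -> 5x5x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(5 * 5 * 16, 120),       # C5: 120-dimensional feature vector
            nn.ReLU(),
            nn.Linear(120, num_classes),      # 10 output neurons
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# MSE as in these notes would need one-hot targets; nn.CrossEntropyLoss is the usual modern choice.
```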
CNN
Improving Gradient Descent
AlexNet
- ImageNet dataset
- 15 million images
- 22,000 object categories
- GTX 580 GPU, 3 GB memory
Pretrained Models
Lenet-5
Pretrained models
AlexNet
Problem: Overfitting
Solution: Dropout
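A minimal sketch of (inverted) dropout at training time, assuming a keep probability p_keep; illustrative NumPy, not AlexNet's original implementation:

```python
import numpy as np

def dropout(activations, p_keep=0.5, training=True, rng=np.random.default_rng()):
    """Randomly zero units during training and rescale so the expected activation is unchanged."""
    if not training:
        return activations                       # no change at test time with inverted scaling
    mask = rng.random(activations.shape) < p_keep
    return activations * mask / p_keep
```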
Fully Connected (FC)
Supervised Deep Learning
Swish - self-gated activation function
Leaky ReLU
Fixes dying ReLU
Exponential Linear Unit (ELU)
Dying ReLU
Rectified Linear Unit (ReLU)
Vanishing gradients
tanh
Softmax - for multi-class problems
Activation Functions
Has the problem of vanishing gradients
Non-Linear Functions
Linear functions do not capture complex relationships (see the sketch below)
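The nonlinearities named above as a quick NumPy reference (standard definitions; the Leaky ReLU slope and ELU alpha are common defaults, not values from the source):

```python
import numpy as np

def sigmoid(x):                  # saturates -> vanishing gradients
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                     # zero-centred, but also saturates
    return np.tanh(x)

def relu(x):                     # non-saturating for x > 0, but units can "die"
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):   # small negative slope fixes dying ReLU
    return np.where(x > 0, x, slope * x)

def elu(x, alpha=1.0):           # smooth negative saturation at -alpha
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def swish(x):                    # self-gated: x scaled by its own sigmoid
    return x * sigmoid(x)

def softmax(z):                  # multi-class output; subtract max for numerical stability
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```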
These layers downsample the feature map to introduce translation invariance, which reduces overfitting of the CNN model.
Pooling Layer
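A minimal 2-D max-pooling sketch of that downsampling step (2 × 2 window, stride 2; illustrative NumPy):

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Downsample a 2-D feature map by taking the max over each window."""
    H, W = feature_map.shape
    out_h, out_w = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()
    return out
```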
Advantages
- Automatic feature extraction
- Translation invariance
- Efficient parameter sharing
- Image classification, analysis, object detection
Padding refers to the addition of extra pixels around the edge of the input image.
Hyperparameters
Stride
Padding
Stride refers to the number of pixels by which a kernel moves across the input image.
Depth of the layer
The ‘depth’ of a layer refers to the number of kernels it contains. Each filter produces a separate feature map, and the collection of these feature maps forms the complete output of the layer.
The convolution operation involves multiplying the kernel values by the original pixel values of the image and then summing up the results; see the sketch at the end of this section.
Channels
Convolution Layer
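Putting padding, stride, depth, and the multiply-and-sum operation above together in one sketch (single-channel 2-D convolution in illustrative NumPy; strictly this is cross-correlation, which is what deep-learning libraries implement):

```python
import numpy as np

def conv2d(image, kernels, stride=1, padding=0):
    """image: (H, W); kernels: (depth, kH, kW). Returns (depth, out_H, out_W) feature maps."""
    if padding:                                    # extra zero pixels around the edge of the image
        image = np.pad(image, padding)
    depth, kH, kW = kernels.shape
    H, W = image.shape
    out_h = (H - kH) // stride + 1                 # stride = pixels the kernel moves per step
    out_w = (W - kW) // stride + 1
    out = np.empty((depth, out_h, out_w))          # one feature map per kernel -> layer "depth"
    for d in range(depth):
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i * stride:i * stride + kH, j * stride:j * stride + kW]
                out[d, i, j] = np.sum(patch * kernels[d])   # multiply kernel by pixels, then sum
    return out

# Example: a 32x32 image with six 5x5 kernels gives 6 feature maps of 28x28 (as in LeNet-5's C1).
feature_maps = conv2d(np.random.rand(32, 32), np.random.rand(6, 5, 5))
print(feature_maps.shape)   # (6, 28, 28)
```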