LSTM details




Reference: 1511.07889.pdf

What is RNN

RNN: A recurrent neural network (RNN) is a kind of artificial neural network whose connections form a cycle. The internal state of the network can exhibit dynamic temporal behavior. Unlike feedforward neural networks, an RNN can use its internal memory to process input sequences of arbitrary length, which makes it well suited to tasks such as unsegmented handwriting recognition and speech recognition. – Baidu Encyclopedia

Here is the abstracted RNN formula:
Notice that each RNN step has to use the middle (hidden) layer output of the previous time step, $h_{t-1}$.
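As a rough illustration of that dependence on the previous hidden output, here is a minimal sketch of one step of a scalar "vanilla" RNN. The tanh nonlinearity, the function name `rnn_step` and the weight names `w_x`, `w_h`, `b` are assumptions made for this example; they are not taken from the original formula.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One step of a scalar vanilla RNN: h_t = tanh(w_x*x_t + w_h*h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

# Feed a short sequence; only the first input is non-zero.
h = 0.0
for x in (1.0, 0.0, 0.0):
    h = rnn_step(x, h, w_x=1.0, w_h=0.5, b=0.0)
    print(h)
```

Even though the later inputs are zero, the hidden value stays non-zero, because each step reuses the hidden output of the step before it.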

The shortcoming of the traditional RNN: the vanishing gradient problem

We define the loss function as $E$. Then the gradient formula is as follows:
Because the gradient is repeatedly multiplied by numbers less than 1, it becomes smaller and smaller. LSTM was designed to solve this problem.
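A small numeric sketch of this shrinking effect, assuming purely for illustration that every time step contributes the same factor of 0.9 to the product (the function name and the numbers are hypothetical, not from the original derivation):

```python
def gradient_scale(factor, steps):
    """Product of `steps` identical per-step factors in the chain rule."""
    scale = 1.0
    for _ in range(steps):
        scale *= factor
    return scale

for steps in (1, 10, 50, 100):
    print(steps, gradient_scale(0.9, steps))
```

After 100 steps the product is already several orders of magnitude smaller than after one step, while a factor of exactly 1 would leave it unchanged.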

LSTM introduction

Definition: Long Short-Term Memory (LSTM) is a kind of recurrent neural network; the paper was first published in 1997. Thanks to its unique structure, LSTM is suitable for processing and predicting important events in time series that are separated by very long intervals and delays. – Baidu Encyclopedia

Any mention of LSTM is usually accompanied by a picture like the one below:

As the figure shows, in addition to the input there are three parts: 1) the input gate; 2) the forget gate; 3) the output gate.
As with the RNN above, our inputs are $x_t$ and $h_{t-1}$, while the outputs are $h_t$ and $c_t$ (the cell state). The cell state is the key to LSTM: it is what gives LSTM its memory. Here are the formulas for LSTM:
1) Input gate: $i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$
where $\sigma$ refers to the sigmoid function.
2) Forget gate: $f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$. It decides whether to delete or retain the memory.
3) Output gate: $o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$
4) Cell update: $g_t = \tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g)$
5) Cell state update: $c_t = f_t \odot c_{t-1} + i_t \odot g_t$
6) Final LSTM output: $h_t = o_t \odot \tanh(c_t)$
Above are the formulas for the LSTM cell. Below we explain why LSTM can solve the vanishing gradient problem of the RNN.

Because each factor is very close to 1, the gradient hardly decays, which solves the vanishing gradient problem.
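The gate, cell-state and output steps listed in this section can be sketched as a single LSTM step on scalars. This is only an illustrative sketch: the weight layout (one `(w_x, w_h, b)` triple per gate) and all names are assumptions made for the example, not the original notation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w):
    """One LSTM step for scalar input, hidden state and cell state.

    `w` maps a gate name to a (w_x, w_h, b) triple; the names are illustrative.
    """
    def pre(name):
        w_x, w_h, b = w[name]
        return w_x * x_t + w_h * h_prev + b

    i_t = sigmoid(pre("input"))     # input gate
    f_t = sigmoid(pre("forget"))    # forget gate
    o_t = sigmoid(pre("output"))    # output gate
    g_t = math.tanh(pre("cell"))    # candidate cell update
    c_t = f_t * c_prev + i_t * g_t  # new cell state
    h_t = o_t * math.tanh(c_t)      # new hidden state (the LSTM output)
    return h_t, c_t

weights = {name: (0.5, 0.5, 0.0) for name in ("input", "forget", "output", "cell")}
h, c = 0.0, 0.0
for x in (1.0, 0.5, -0.5):
    h, c = lstm_step(x, h, c, weights)
    print(h, c)
```

In this sketch the only direct path from `c_prev` to `c_t` is the multiplication by the forget gate `f_t`, so when that gate is close to 1 the cell state is carried forward almost unchanged.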


Torch nngraph

Before writing an LSTM in Torch, we need to learn a Torch tool called nngraph. nngraph is installed with the following command:


Detailed introduction to nngraph: https://
nngraph makes it convenient to design a neural network module. We first use nngraph to create a simple network module:
This module has three inputs, x1, x2 and x3, and one output, z. The following is the Torch code that implements this module:

require 'nngraph'
x1, x2, x3 = nn.Identity()(), nn.Identity()(), nn.Identity()()
L = nn.CAddTable()({x1, nn.CMulTable()({x2, nn.Linear(20, 10)(x3)})})
mlp = nn.gModule({x1, x2, x3}, {L})

First we define x1, x2 and x3 using nn.Identity()(). For linear(x3) we use x4 = nn.Linear(20, 10)(x3), which defines a linear layer with 20 input neurons and 10 output neurons. For x2 * linear(x3) we use x5 = nn.CMulTable()({x2, x4}); for x1 + x2 * linear(x3) we use nn.CAddTable()({x1, x5}); finally we use nn.gModule({input}, {output}) to define the neural network module.
We use the forward method to test whether our module is correct:

h1 = torch.Tensor(10):fill(1)  -- h1 is not shown in the original; any 10-element tensor works
h2 = torch.Tensor(10):fill(1)
h3 = torch.Tensor(20):fill(2)
b = mlp:forward({h1, h2, h3})
parameters = mlp:parameters()[1]
bias = mlp:parameters()[2]
result = torch.cmul(h2, (parameters * h3 + bias)) + h1

First we define three inputs h1, h2 and h3, then call the module's forward method (mlp:forward) to get the output b. Next we fetch the network's weights and bias into the variables parameters and bias, and compute z = h1 + h2 * linear(h3) by hand as result = torch.cmul(h2, (parameters * h3 + bias)) + h1. Finally we compare b with result. The two are the same, so our module is correct.
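The same check can be sketched outside Torch in plain Python. This is a hypothetical re-implementation of z = x1 + x2 * linear(x3) for illustration only; the helper names, the random weights and the values of h1 are assumptions and do not reproduce Torch's own initialisation.

```python
import random

def linear(weight, bias, x):
    """y = W x + b, with the weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def module_forward(x1, x2, x3, weight, bias):
    """z = x1 + x2 * linear(x3), element-wise over the outputs."""
    lin = linear(weight, bias, x3)
    return [a + b * c for a, b, c in zip(x1, x2, lin)]

random.seed(0)
weight = [[random.uniform(-0.1, 0.1) for _ in range(20)] for _ in range(10)]
bias = [random.uniform(-0.1, 0.1) for _ in range(10)]
h1 = [float(i) for i in range(1, 11)]  # hypothetical input values
h2 = [1.0] * 10
h3 = [2.0] * 20
z = module_forward(h1, h2, h3, weight, bias)
print(z)
```

The 20-element input x3 is mapped to 10 values by the linear layer, which is why x1, x2 and z all have 10 elements.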

Writing the LSTM module with nngraph

Now we use nngraph to write the LSTM module described above. The code is as follows:

require 'nngraph'

function lstm(xt, prev_c, prev_h)
    -- each call builds a fresh pair of linear layers: W_x * xt + W_h * prev_h
    local function new_input_sum()
        local i2h = nn.Linear(400, 400)
        local h2h = nn.Linear(400, 400)
        return nn.CAddTable()({i2h(xt), h2h(prev_h)})
    end
    local input_gate = nn.Sigmoid()(new_input_sum())
    local forget_gate = nn.Sigmoid()(new_input_sum())
    local output_gate = nn.Sigmoid()(new_input_sum())
    local gt = nn.Tanh()(new_input_sum())
    local ct = nn.CAddTable()({nn.CMulTable()({forget_gate, prev_c}), nn.CMulTable()({input_gate, gt})})
    local ht = nn.CMulTable()({output_gate, nn.Tanh()(ct)})
    return ct, ht
end

local xt, prev_c, prev_h = nn.Identity()(), nn.Identity()(), nn.Identity()()
LSTM = nn.gModule({xt, prev_c, prev_h}, {lstm(xt, prev_c, prev_h)})

Here xt and prev_h are the inputs and prev_c is the cell state. We then follow the earlier formulas one by one, and the final outputs are ct (the new cell state) and ht (the output). The calculation order in the code is exactly the same as above, so it is not explained line by line here.

nn.graph and nn.module


Reference: practical5.pdf


nn.Sequential {
  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> output]
  (1): nn.SpatialConvolution(1 -> 6, 5x5)
  (2): nn.ReLU
  (3): nn.SpatialMaxPooling(2x2, 2,2)
  (4): nn.SpatialConvolution(6 -> 16, 5x5)
  (5): nn.ReLU
  (6): nn.SpatialMaxPooling(2x2, 2,2)
  (7): nn.View(400)
  (8): nn.Linear(400 -> 120)
  (9): nn.ReLU
  (10): nn.Linear(120 -> 84)
  (11): nn.ReLU
  (12): nn.Linear(84 -> 10)
  (13): nn.LogSoftMax
}

To access the output of a single layer (here layer 6) after a forward pass, use:

net:get(6).output (see get and output).





require 'torch'
require 'nn'
require 'nngraph'

function CreateModule(input_size)
    local input = nn.Identity()()   -- network input

    local nn_module_1 = nn.Linear(input_size, 100)(input)
    local nn_module_2 = nn.Linear(100, input_size)(nn_module_1)

    local output = nn.CMulTable()({input, nn_module_2})

    -- pack a graph into a convenient module with standard API (:forward(), :backward())
    return nn.gModule({input}, {output})
end

input = torch.rand(30)

my_module = CreateModule(input:size(1))

output = my_module:forward(input)
criterion_err = torch.rand(output:size())

gradInput = my_module:backward(input, criterion_err)


require 'nngraph'

h1 = nn.Linear(20, 20)()
h2 = nn.Linear(10, 10)()
hh1 = nn.Linear(20, 1)(nn.Tanh()(h1))
hh2 = nn.Linear(10, 1)(nn.Tanh()(h2))
madd = nn.CAddTable()({hh1, hh2})
oA = nn.Sigmoid()(madd)
oB = nn.Tanh()(madd)
gmod = nn.gModule({h1, h2}, {oA, oB})

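-- iterate over the forward nodes and print those that wrap a module
for indexNode, node in ipairs(gmod.forwardnodes) do
  if node.data.module then
    print(node.data.module)
  end
end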

how to share the parameters in nngraph #114



local net = nn.gModule({input}, {output})

for _,node in ipairs(net.forwardnodes) do
for _,node in ipairs(net.backwardnodes) do

m = nn.gModule()

m = nn.Sequential()

Multiplying two vectors in Torch

  1. torch.cmul

    z = torch.cmul(x, y) returns a new tensor.

    torch.cmul(z,x,y) puts the result in z.

    y:cmul(x) multiplies all elements of y with corresponding elements of x.

    z:cmul(x,y) puts the result in z.

  2. nn.CMul: learns a scale factor for each element of the input
    mlp = nn.Sequential()
    mlp:add(nn.CMul(5, 1))
    y = torch.Tensor(5, 4)
    sc = torch.Tensor(5, 4)
    for i = 1, 5 do sc[i] = i; end -- scale input with this
    function gradUpdate(mlp, x, y, criterion, learningRate)
       local pred = mlp:forward(x)
       local err = criterion:forward(pred, y)
       local gradCriterion = criterion:backward(pred, y)
       mlp:zeroGradParameters()
       mlp:backward(x, gradCriterion)
       mlp:updateParameters(learningRate)
       return err
    end
    for i = 1, 10000 do
       x = torch.rand(5, 4)
       y:copy(x)
       y:cmul(sc)  -- the target is the input scaled by sc
       err = gradUpdate(mlp, x, y, nn.MSECriterion(), 0.01)
    end
    print(mlp:get(1).weight)  -- the learned scale factors
  3. nn.CMulTable
    ii = {torch.ones(5)*2, torch.ones(5)*3, torch.ones(5)*4}
    m = nn.CMulTable()
    print(m:forward(ii))  -- every element is 2*3*4 = 24
    [torch.DoubleTensor of dimension 5]


Visualizing CNN weights, loading and saving '.t7' files, and logging as an '.html' file from a Torch tensor



I have a Torch CUDA tensor of size 64x64x3x3 and I want to visualise its weights for a given layer as follows:

local layer = model:get(3)

local weights = layer.weight

local imgDisplay = image.toDisplayTensor{input=weights, padding=2, scaleeach=80}




To load a .t7 file:

trainData = torch.load('train.t7')

testData = torch.load('test.t7')

To save a .t7 file, use torch.save(filename, trainData) and torch.save(filename, testData).


To view the log as an .html file, and other logs as well

Ref code:

function test()
-- disable flips, dropouts and batch normalization
print( '==>'.." testing")
local bs = 125
for i=1,,bs do
local outputs = model:forward(,i,bs))
confusion:batchAdd(outputs, provider.testData.labels:narrow(1,i,bs))

print('Test accuracy:', confusion.totalValid * 100)

if testLogger then
testLogger:add{train_acc, confusion.totalValid * 100}

if paths.filep('/test.log.eps') then
local base64im
os.execute(('convert -density 200 %s/test.log.eps %s/test.png'):format(,
os.execute(('openssl base64 -in %s/test.png -out %s/test.base64'):format(,
local f ='/test.base64')
if f then base64im = f:read'*all' end

local file ='/report.html','w')
<!DOCTYPE html>
<title>%s - %s</title>
<img src="data:image/png;base64,%s">
for k,v in pairs(optimState) do
if torch.type(v) == 'number' then

-- save model every 50 epochs
if epoch % 50 == 0 then
local filename = paths.concat(, '')
print('==> saving model to '..filename), model:get(3):clearState())



Deep Learning for Computer Vision – Introduction to Convolution Neural Networks

For the complete article, refer to this blog:


The power of artificial intelligence is beyond our imagination. We all know robots have already reached a testing phase in some of the most powerful countries of the world. Governments and large companies are spending billions on developing this ultra-intelligent creature. The recent emergence of robots has gained the attention of many research houses across the world.

Does it excite you as well? Personally, my learning about robots and developments in AI started with deep curiosity and excitement! Let's learn about computer vision today.

The earliest research in computer vision started way back in the 1950s. Since then, we have come a long way but still find ourselves far from the ultimate objective. But with neural networks and deep learning, we have become empowered like never before.

Applications of deep learning in vision have taken this technology to a different level and made sophisticated things like self-driving cars possible in the near future. In this article, I will also introduce you to Convolution Neural Networks, which form the crux of deep learning applications in computer vision.

Note: This article is inspired by Stanford’s Class on Visual Recognition. Understanding this article requires prior knowledge of Neural Networks. If you are new to neural networks, you can start here. Another useful resource on basics of deep learning can be found here.

Table of Contents

  1. Challenges in Computer Vision
  2. Overview of Traditional Approaches
  3. Review of Neural Networks Fundamentals
  4. Introduction to Convolution Neural Networks
  5. Case Study: Increasing power of CNNs in IMAGENET competition
  6. Implementing CNNs using GraphLab (Practical in Python)


1. Challenges in Computer Vision (CV)

As the name suggests, the aim of computer vision (CV) is to imitate the functionality of the human eye and the brain components responsible for your sense of sight.

Doing actions such as recognizing an animal, describing a view, differentiating among visible objects are really a cake-walk for humans. You’d be surprised to know that it took decades of research to discover and impart the ability of detecting an object to a computer with reasonable accuracy.

The field of computer vision has witnessed continual advancements in the past 5 years. One of the most cited advancements is Convolution Neural Networks (CNNs). Today, deep CNNs form the crux of the most sophisticated computer vision applications, such as self-driving cars, auto-tagging of friends in our Facebook pictures, facial security features, gesture recognition, automatic number plate recognition, etc.

Let’s get familiar with it a bit more:

Object detection is considered to be the most basic application of computer vision. Most other developments in computer vision are achieved by making small enhancements on top of it. In real life, every time we (humans) open our eyes, we unconsciously detect objects.

Since it is super-intuitive for us, we fail to appreciate the key challenges involved when we try to design systems similar to our eye. Let's start by looking at some of the key roadblocks:

  1. Variations in Viewpoint
    • The same object can have different positions and angles in an image depending on the relative position of the object and the observer.
    • There can also be different poses. For instance, look at the following images:
    • Though it's obvious to us that these are the same object, it is not very easy to teach this aspect to a computer (robots or machines).
  2. Difference in Illumination
    • Different images can have different light conditions. For instance:
    • Though this image is so dark, we can still recognize that it is a cat. Teaching this to a computer is another challenge.
  3. Hidden parts of images
    • Images need not necessarily be complete. Small or large proportions of the images might be hidden which makes the detection task difficult. For instance:
    • Here, only the face of the puppy is visible and that too partially, posing another challenge for the computer to recognize.
  4. Background Clutter
    • Some images might blend into the background. For instance:
    • If you observe carefully, you can find a man in this image. As simple as it looks, it’s an uphill task for a computer to learn.

These are just some of the challenges which I brought up so that you can appreciate the complexity of the tasks which your eye and brain duo does with such utter ease. Breaking up all these challenges and solving them individually is still possible today in computer vision. But we're still decades away from a system which can get anywhere close to our human eye (which can do everything!).

This brilliance of our human body is the reason why researchers have been trying to break the enigma of computer vision by analyzing the visual mechanics of humans or other animals. Some of the earliest work in this direction was done by Hubel and Wiesel with their famous cat experiment in 1959. Read more about it here.

This was the first study which emphasized the importance of edge detection for solving the computer vision problem. They were awarded the Nobel Prize for their work.

Before diving into convolutional neural networks, let's take a quick overview of the traditional or rather elementary techniques used in computer vision before deep learning became popular.


2. Overview of Traditional Approaches

Various techniques other than deep learning are available for computer vision. They work well for simpler problems, but as the data becomes huge and the task becomes complex, they are no substitute for deep CNNs. Let's briefly discuss two simple approaches.

  1. KNN (K-Nearest Neighbours)
    • Each image is matched with all images in training data. The top K with minimum distances are selected. The majority class of those top K is predicted as output class of the image.
    • Various distance metrics can be used, like L1 distance (sum of absolute differences), L2 distance (square root of the sum of squared differences), etc.
    • Drawbacks:
      • Even if we take the image of same object with same illumination and orientation, the object might lie in different locations of image, i.e. left, right or center of image. For instance:
      • Here the same dog is on the right side in the first image and on the left side in the second. Though it's the same dog, KNN would give a large distance between the 2 images.
      • Similar to above, other challenges mentioned in section 1 will be faced by KNN.
  2. Linear Classifiers
    • They use a parametric approach in which a weight (parameter) is learned for each pixel value.
    • It's like a weighted sum of the pixel values, with the dimension of the weights matrix depending on the number of outcomes.
    • Intuitively, we can understand this in terms of a template. The weights form a template image which is matched with every image. This will also face difficulty in overcoming the challenges discussed in section 1, as a single template is difficult to design for all the different cases.
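To make the KNN drawback concrete, here is a minimal sketch of nearest-neighbour classification on flattened pixel lists. The toy 1x4 "images", the labels and all function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

def l1_distance(a, b):
    """Sum of absolute pixel differences."""
    return sum(abs(p - q) for p, q in zip(a, b))

def l2_distance(a, b):
    """Square root of the sum of squared pixel differences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

def knn_predict(train, query, k=3, distance=l1_distance):
    """train is a list of (pixels, label) pairs; returns the majority label of the k nearest."""
    nearest = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 1x4 "images": a bright blob on the left or on the right.
train = [
    ([9, 9, 0, 0], "left"),
    ([8, 9, 1, 0], "left"),
    ([0, 0, 9, 9], "right"),
    ([0, 1, 9, 8], "right"),
]
print(knn_predict(train, [9, 8, 0, 1], k=3))
# The same blob shifted sideways is far away in pixel space:
print(l1_distance([9, 9, 0, 0], [0, 0, 9, 9]))
```

The second print shows the problem described above: moving the same object to another location produces a large pixel-wise distance, even though nothing about the object itself changed.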

I hope this gives some intuition into the challenges faced by approaches other than deep learning. Please note that more sophisticated techniques can be used than the ones discussed above but they would rarely beat a deep learning model.


3. Review of Neural Networks Fundamentals

Let’s discuss some properties of neural networks. I will skip the basics of neural networks here as I have already covered that in my previous article – Fundamentals of Deep Learning – Starting with Neural Networks.

Once your fundamentals are sorted, let’s learn in detail some important concepts such as activation functions, data preprocessing, initializing weights and dropouts.


Activation Functions

There are various activation functions which can be used and this is an active area of research. Let’s discuss some of the popular options:

  1. Sigmoid Function
    • Equation: σ(x) = 1/(1+e^(-x))
    • Sigmoid activation, also used in logistic regression, squashes the input space from (-inf,inf) to (0,1)
    • But it has various problems and it is almost never used in CNNs:
      1. Saturated neurons kill the gradient
        • If you observe the above graph carefully, if the input is beyond -5 or 5, the output will be very close to 0 and 1 respectively. Also, in this region the gradients are almost zero. Notice that the tangents in this region will be almost parallel to the x-axis, thus ~0 slope.
        • Since gradients get multiplied in back-propagation, this small gradient will virtually stop back-propagation into further layers, thus killing the gradient.
      2. Outputs are not zero-centered
        • As you can see, all the outputs are between 0 and 1. As these become inputs to the next layer, all the gradients of the next layer will be either all positive or all negative. So the path to the optimum will be zig-zag. I will skip the mathematics here. Please refer to the Stanford class mentioned above for details.
      3. Taking the exp() is computationally expensive
        • Though not a big drawback, it has a slight negative impact
  2. tanh activation
    • It is simply the hyperbolic tangent function with form: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
    • It is generally preferred over sigmoid because it solves problem #2, i.e. the outputs are zero-centered, in the range (-1,1).
    • But it will still result in killing the gradient and is thus not a recommended choice.
  3. ReLU (Rectified Linear Unit)
    • Equation: f(x) = max( 0 , x )
    • It is the most commonly used activation function for CNNs. It has following advantages:
      • Gradient won’t saturate in the positive region
      • Computationally very efficient as simple thresholding is required
      • Empirically found to converge faster than sigmoid or tanh.
    • But still it has the following disadvantages:
      • Output is not zero-centered and always positive
      • Gradient is killed for x<0. A few techniques like leaky ReLU and parametric ReLU are used to overcome this, and I encourage you to read about them
      • Gradient is not defined at x=0. But this can be easily handled using sub-gradients and poses few practical challenges, as x=0 is generally a rare case

To summarize, ReLU is mostly the activation function of choice. If the caveats are kept in mind, these can be used very efficiently.
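The saturation behaviour discussed in this section can be checked numerically. The following is a small sketch using the standard derivative formulas; the function names are my own, and using 0 as the ReLU gradient at x = 0 is just one common sub-gradient convention.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Sub-gradient convention (an assumption): use 0 at x = 0.
    return 1.0 if x > 0 else 0.0

for x in (-5.0, 0.0, 5.0):
    print(x, sigmoid_grad(x), tanh_grad(x), relu_grad(x))
```

At x = 5 both the sigmoid and tanh gradients are close to zero, while the ReLU gradient is still 1; at x = -5 all three are zero or nearly zero, which is the "killed gradient" case for ReLU.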


Data Preprocessing

For images, generally the following preprocessing steps are done:

  1. Same Size Images: All images are converted to the same size and generally in square shape.
  2. Mean Centering: For each pixel, its mean value among all images can be subtracted from each pixel. Sometimes (but rarely) mean centering along red, green and blue channels can also be done

Note that normalization is generally not done in images.


Weight Initialization

There can be various techniques for initializing weights. Let's consider a few of them:

  1. All zeros
    • This is generally a bad idea because in this case all the neurons will generate the same output initially and identical gradients would flow back in back-propagation
    • The results are generally undesirable as the network won’t train properly.
  2. Gaussian Random Variables
    • The weights can be initialized with random gaussian distribution of 0 mean and small standard deviation (0.1 to 1e-5)
    • This works for shallow networks, i.e. ~5 hidden layers but not for deep networks
    • In case of deep networks, the small weights make the outputs small and as you move towards the end, the values become even smaller. Thus the gradients will also become small resulting in gradient killing at the end.
    • Note that you need to experiment with the standard deviation of the gaussian distribution to find one which works well for your network.
  3. Xavier Initialization
    • It suggests that the variance of the gaussian distribution of weights for each neuron should depend on the number of inputs to the layer.
    • The recommended variance is the inverse of the number of inputs, i.e. a standard deviation of sqrt(1/n_in). So the numpy code for initializing the weights of a layer with n_in inputs and n_out outputs is: np.random.randn(n_in, n_out) * np.sqrt(1.0 / n_in)
    • More recent research suggests that for ReLU neurons, the recommended scaling is: np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in). Read this blog post for more details.
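The two scaling rules can be compared with a short standard-library sketch (no numpy needed). The helper names, the layer sizes and the fixed seed are assumptions made for this illustration.

```python
import math
import random

def init_weights(n_in, n_out, relu=False, seed=0):
    """Return an n_in x n_out weight matrix (list of rows) of scaled Gaussian samples."""
    rng = random.Random(seed)
    std = math.sqrt((2.0 if relu else 1.0) / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]

def variance(matrix):
    values = [v for row in matrix for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

w_xavier = init_weights(400, 100)
w_relu = init_weights(400, 100, relu=True)
print(variance(w_xavier), 1.0 / 400)  # empirical vs. target variance
print(variance(w_relu), 2.0 / 400)
```

The empirical variances land close to 1/n_in and 2/n_in respectively, so wider layers automatically get smaller initial weights.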

One more thing must be remembered while using ReLU as the activation function: the weight initialization might be such that some of the neurons never get activated because of negative input. This is something that should be checked. You might be surprised to know that 10-20% of the ReLUs might be dead at a particular time while training and even at the end.

These were just some of the concepts I discussed here. Some more concepts can be of importance like batch normalization, stochastic gradient descent, dropouts which I encourage you to read on your own.


4. Introduction to Convolution Neural Networks

Before going into the details, let's first try to get some intuition into why deep networks work better.

As we learned from the drawbacks of earlier approaches, they are unable to cater to the vast amount of variations in images. Deep CNNs work by consecutively modeling small pieces of information and combining them deeper in the network.

One way to understand them is that the first layer will try to detect edges and form templates for edge detection. Then subsequent layers will try to combine them into simpler shapes and eventually into templates of different object positions, illumination, scales, etc. The final layers will match an input image with all the templates and the final prediction is like a weighted sum of all of them. So, deep CNNs are able to model complex variations and behaviour giving highly accurate predictions.

There is an interesting paper on visualization of deep features in CNNs which you can go through to get more intuition – Understanding Neural Networks Through Deep Visualization.

For the purpose of explaining CNNs and finally showing an example, I will be using the CIFAR-10 dataset for explanation here and you can download the data set from here. This dataset has 60,000 images with 10 labels and 6,000 images of each type. Each image is colored and 32×32 in size.

A CNN typically consists of 3 types of layers:

  1. Convolution Layer
  2. Pooling Layer
  3. Fully Connected Layer

You might find some local response normalization layers in older CNNs but they are rarely used these days. We’ll consider the three layer types one by one.


Convolution Layer

Since convolution layers form the crux of the network, I’ll consider them first. Each layer can be visualized in the form of a block or a cuboid. For instance in the case of CIFAR-10 data, the input layer would have the following form:


Here you can see, this is the original image which is 32×32 in height and width. The depth here is 3 which corresponds to the Red, Green and Blue colors, which form the basis of colored images. Now a convolution layer is formed by running a filter over it. A filter is another block or cuboid of smaller height and width but same depth which is swept over this base block. Let’s consider a filter of size 5x5x3.


We start this filter from the top left corner and sweep it till the bottom right corner. This filter is nothing but a set of weights, i.e. 5x5x3 = 75 weights + 1 bias = 76 parameters. At each position, the weighted sum of the pixels is calculated as W^T X + b and a new value is obtained. A single filter will result in a volume of size 28x28x1 as shown above.

Note that multiple filters are generally run at each step. Therefore, if 10 filters are used, the output would look like:


Here the filter weights are parameters which are learned during the back-propagation step. You might have noticed that we got a 28×28 block as output when the input was 32×32. Why so? Let’s look at a simpler case.

Suppose the initial image had size 6x6xd and the filter has size 3x3xd. Here I’ve kept the depth as d because it can be anything and it’s immaterial as it remains the same in both. Since depth is same, we can have a look at the front view of how filter would work:


Here we can see that the result would be a 4x4x1 volume block. Notice there is a single output for the entire depth at each location of the filter. But you need not do this visualization all the time. Let’s define a generic case where the image has dimension NxNxd and the filter has FxFxd. Also, let's define another term, stride (S), which is the number of cells (in the above matrix) to move in each step. In the above case, we had a stride of 1 but it can be a higher value as well. So the size of the output will be:

output size = (N - F)/S + 1

You can validate the first case where N=32, F=5, S=1. The output had 28 pixels which is what we get from this formula as well. Please note that some S values might result in non-integer result and we generally don’t use such values.
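The formula is easy to check in code. Below is a minimal sketch; the function name is my own, and raising an error for non-integer results is a design choice made for this example rather than something the formula itself prescribes.

```python
def output_size(n, f, s=1):
    """Output width for an n-wide input, f-wide filter and stride s, without padding."""
    if (n - f) % s != 0:
        raise ValueError("stride %d does not fit: (N - F)/S is not an integer" % s)
    return (n - f) // s + 1

print(output_size(32, 5, 1))  # the 32x32 input with a 5x5 filter
print(output_size(6, 3, 1))   # the 6x6 input with a 3x3 filter
```

These reproduce the 28 and 4 obtained in the two cases worked out in this section.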

Let’s consider an example to consolidate our understanding. Starting with the same image as before of size 32×32, we need to apply 2 convolution layers consecutively: first 10 filters of size 7, stride 1, and next 6 filters of size 5, stride 2. Before looking at the solution below, just think about 2 things:

  1. What should be the depth of each filter?
  2. What will be the resulting size of the images in each step?

Here is the answer:



Notice here that the size of the images is getting shrunk consecutively. This will be undesirable in case of deep networks where the size would become very small too early. Also, it would restrict the use of large size filters as they would result in faster size reduction.

To prevent this, we generally use a stride of 1 along with zero-padding of size (F-1)/2. Zero-padding is nothing but adding additional zero-value pixels towards the border of the image.

Consider the example we saw above with 6×6 image and 3×3 filter. The required padding is (3-1)/2=1. We can visualize the padding as:


Here you can see that the image now becomes 8×8 because of padding of 1 on each side. So now the output will be of size 6×6 same as the original image.

Now let’s summarize a convolution layer as following:

  • Input size: W1 x H1 x D1
  • Hyper-parameters:
    • K: #filters
    • F: filter size (FxF)
    • S: stride
    • P: amount of padding
  • Output size: W2 x H2 x D2
    • W2 = (W1 - F + 2P)/S + 1
    • H2 = (H1 - F + 2P)/S + 1
    • D2 = K
  • #parameters = (F.F.D1).K + K
    • F.F.D1 : Number of parameters for each filter (analogous to volume of the cuboid)
    • (F.F.D1).K : Volume of each filter multiplied by the number of filters
    • +K: adding K parameters for the bias term
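The summary above can be turned into a small helper that returns both the output shape and the parameter count. This is a sketch under the assumption that the sizes divide evenly by the stride; the function name and the CIFAR-10 style example values are illustrative.

```python
def conv_layer_summary(w1, h1, d1, k, f, s=1, p=0):
    """Return ((W2, H2, D2), number of parameters) for a convolution layer.

    Assumes (W1 - F + 2P) and (H1 - F + 2P) are divisible by the stride S.
    """
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    d2 = k                          # one output slice per filter
    params = (f * f * d1) * k + k   # filter weights plus one bias per filter
    return (w2, h2, d2), params

# 32x32x3 input, 10 filters of size 5x5, stride 1
print(conv_layer_summary(32, 32, 3, k=10, f=5))        # no padding
print(conv_layer_summary(32, 32, 3, k=10, f=5, p=2))   # padding (F-1)/2 = 2
```

With padding of (F-1)/2 the spatial size stays at 32x32, while the parameter count is unaffected by padding or stride.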

Some additional points to be taken into consideration:

  • K is usually set to a power of 2 for computational efficiency
  • F is generally taken as odd number
  • F=1 might sometimes be used and it makes sense because there is a depth component involved
  • Filters might be called kernels sometimes

Having understood the convolution layer, let's move on to the pooling layer.


Pooling Layer

When we use padding in convolution layer, the image size remains same. So, pooling layers are used to reduce the size of image. They work by sampling in each layer using filters. Consider the following 4×4 layer. So if we use a 2×2 filter with stride 2 and max-pooling, we get the following response:


Here you can see that each of the four 2×2 blocks is reduced to a single value, its maximum. Generally, max-pooling is used but other options like average pooling can be considered.
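A minimal sketch of max-pooling with a 2×2 window and stride 2 on a 4×4 layer. The numbers in the example matrix are hypothetical, since the original figure is not reproduced here.

```python
def max_pool(matrix, size=2, stride=2):
    """Max-pool a 2-D list of numbers with a square window."""
    rows = (len(matrix) - size) // stride + 1
    cols = (len(matrix[0]) - size) // stride + 1
    return [
        [
            max(
                matrix[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]

layer = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
print(max_pool(layer))
```

The 4×4 input shrinks to 2×2, following the same (N - F)/S + 1 size rule as convolution, but pooling has no weights to learn.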


Fully Connected Layer

At the end of convolution and pooling layers, networks generally use fully-connected layers in which each pixel is considered as a separate neuron just like a regular neural network. The last fully-connected layer will contain as many neurons as the number of classes to be predicted. For instance, in CIFAR-10 case, the last fully-connected layer will have 10 neurons.


5. Case Study: AlexNet

I recommend reading the prior section multiple times and getting a hang of the concepts before moving forward.

In this section, I will discuss the AlexNet architecture in detail. To give you some background, AlexNet is the winning solution of the IMAGENET Challenge 2012. This is one of the most reputed computer vision challenges and 2012 was the first time that a deep learning network was used for solving this problem.

Also, this resulted in a significantly better result as compared to previous solutions. I will share the network architecture here and review all the concepts learned above.

The detailed solution has been explained in this paper. I will explain the overall architecture of the network here. AlexNet consists of an 11-layer CNN with the following architecture:


Here you can see 11 layers between input and output. Let's discuss each one of them individually. Note that the output of each layer will be the input of the next layer, so you should keep that in mind.

  • Layer 0: Input image
    • Size: 227 x 227 x 3
    • Note that in the paper referenced above, the network diagram has 224x224x3 printed which appears to be a typo.
  • Layer 1: Convolution with 96 filters, size 11×11, stride 4, padding 0
    • Size: 55 x 55 x 96
    • (227-11)/4 + 1 = 55 is the size of the outcome
    • 96 depth because 1 set denotes 1 filter and there are 96 filters
  • Layer 2: Max-Pooling with 3×3 filter, stride 2
    • Size: 27 x 27 x 96
    • (55 - 3)/2 + 1 = 27 is size of outcome
    • depth is same as before, i.e. 96 because pooling is done independently on each depth slice
  • Layer 3: Convolution with 256 filters, size 5×5, stride 1, padding 2
    • Size: 27 x 27 x 256
    • Because of padding of (5-1)/2=2, the original size is restored
    • 256 depth because of 256 filters
  • Layer 4: Max-Pooling with 3×3 filter, stride 2
    • Size: 13 x 13 x 256
    • (27 - 3)/2 + 1 = 13 is size of outcome
    • Depth is same as before, i.e. 256 because pooling is done independently on each depth slice
  • Layer 5: Convolution with 384 filters, size 3×3, stride 1, padding 1
    • Size: 13 x 13 x 384
    • Because of padding of (3-1)/2=1, the original size is restored
    • 384 depth because of 384 filters
  • Layer 6: Convolution with 384 filters, size 3×3, stride 1, padding 1
    • Size: 13 x 13 x 384
    • Because of padding of (3-1)/2=1, the original size is restored
    • 384 depth because of 384 filters
  • Layer 7: Convolution with 256 filters, size 3×3, stride 1, padding 1
    • Size: 13 x 13 x 256
    • Because of padding of (3-1)/2=1, the original size is restored
    • 256 depth because of 256 filters
  • Layer 8: Max-Pooling with 3×3 filter, stride 2
    • Size: 6 x 6 x 256
    • (13 - 3)/2 + 1 = 6 is size of outcome
    • Depth is same as before, i.e. 256 because pooling is done independently on each depth slice
  • Layer 9: Fully Connected with 4096 neurons
    • In this layer, each of the 6x6x256=9216 pixels is fed into each of the 4096 neurons and the weights are determined by back-propagation.
  • Layer 10: Fully Connected with 4096 neurons
    • Similar to layer #9
  • Layer 11: Fully Connected with 1000 neurons
    • This is the last layer and has 1000 neurons because IMAGENET data has 1000 classes to be predicted.
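
The size arithmetic in the list above can be checked with a few lines of Python. This is only a sketch: the helper name `conv_output_size` and the layer table are made up for illustration, and the formula assumed is the usual (W - F + 2P)/S + 1 for both convolution and pooling.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer: (W - F + 2P)/S + 1."""
    return (input_size - filter_size + 2 * padding) // stride + 1

# (name, filter size, stride, padding, output depth).
# An output depth of None means the layer keeps the input depth (pooling).
ALEXNET_LAYERS = [
    ("conv1", 11, 4, 0, 96),
    ("pool1", 3, 2, 0, None),
    ("conv2", 5, 1, 2, 256),
    ("pool2", 3, 2, 0, None),
    ("conv3", 3, 1, 1, 384),
    ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256),
    ("pool3", 3, 2, 0, None),
]

def trace_shapes(size=227, depth=3):
    """Return the (name, height, width, depth) of every layer output."""
    shapes = []
    for name, filter_size, stride, padding, out_depth in ALEXNET_LAYERS:
        size = conv_output_size(size, filter_size, stride, padding)
        depth = depth if out_depth is None else out_depth
        shapes.append((name, size, size, depth))
    return shapes

shapes = trace_shapes()
for name, height, width, depth in shapes:
    print("%-5s -> %d x %d x %d" % (name, height, width, depth))

# Number of values fed into the first fully connected layer
fc_inputs = shapes[-1][1] * shapes[-1][2] * shapes[-1][3]
print("fc inputs: %d" % fc_inputs)
```

Starting from 224 instead of 227 gives (224 - 11)/4 + 1 = 54.25, which is not an integer; this is why the 224 printed in the paper's diagram is treated as a typo above.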

I understand this is a complicated structure, but once you understand the layers, you’ll have a much better understanding of the architecture. Note that you will find a different representation of the structure if you look at the AlexNet paper. This is because at that time GPUs were not very powerful, so the authors used 2 GPUs to train the network and the processing was divided between the two.

I highly encourage you to go through the other advanced solutions of the ImageNet challenges after 2012 to get more ideas on how people design these networks. Some interesting solutions are:

  • ZFNet: winner of the 2013 challenge
  • GoogLeNet: winner of the 2014 challenge
  • VGGNet: a good solution from the 2014 challenge
  • ResNet: winner of the 2015 challenge, designed by the Microsoft Research team

This video gives a brief overview and comparison of these solutions towards the end.


6. Implementing CNNs using GraphLab

Having understood the theoretical concepts, let’s move on to the fun part (the practical) and make a basic CNN on the CIFAR-10 dataset which we’ve downloaded before.

I’ll be using GraphLab to run the algorithms. Instead of GraphLab, you are free to use alternative tools such as Torch, Theano, Keras, Caffe, TensorFlow, etc. But GraphLab allows a quick and dirty implementation as it takes care of the weight initialization and network architecture on its own.

We’ll work on the CIFAR-10 dataset which you can download from here. The first step is to load the data. This data is packed in a specific format which can be loaded using the following code:

import pandas as pd
import numpy as np
import cPickle

#Define a function to load each batch as dictionary:
def unpickle(path):
    #Open in binary mode; the with-block closes the file after loading
    with open(path, 'rb') as fo:
        batch = cPickle.load(fo)
    return batch

#Make dictionaries by calling the above function:
batch1 = unpickle('data/data_batch_1')
batch2 = unpickle('data/data_batch_2')
batch3 = unpickle('data/data_batch_3')
batch4 = unpickle('data/data_batch_4')
batch5 = unpickle('data/data_batch_5')
batch_test = unpickle('data/test_batch')

#Define a function to convert this dictionary into dataframe with image pixel array and labels:
def get_dataframe(batch):
    df = pd.DataFrame(batch['data'])
    #Store each row of pixel values as a list (.values replaces the deprecated as_matrix())
    df['image'] = df.values.tolist()
    df['label'] = batch['labels']
    return df

#Define train and test files:
train = pd.concat([get_dataframe(batch1),get_dataframe(batch2),get_dataframe(batch3),get_dataframe(batch4),get_dataframe(batch5)],ignore_index=True)
test = get_dataframe(batch_test)
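
Each row loaded above holds 3072 values for one 32x32 colour image. In the CIFAR-10 Python batches the values are planar: the first 1024 are the red channel, the next 1024 green and the last 1024 blue, each in row-major order. The sketch below shows how such a row maps to per-pixel [r, g, b] triples. The helper `row_to_image` is made up for illustration and is not part of the tutorial's pipeline. Whether a given image library expects planar or interleaved pixels should be checked against its documentation.

```python
def row_to_image(row, height=32, width=32, channels=3):
    """Convert a planar row (all R, then all G, then all B) to image[y][x] = [r, g, b]."""
    plane = height * width
    if len(row) != plane * channels:
        raise ValueError("expected %d values, got %d" % (plane * channels, len(row)))
    return [
        [[row[c * plane + y * width + x] for c in range(channels)] for x in range(width)]
        for y in range(height)
    ]

# Tiny 2x2 example: red plane 1..4, green plane 11..14, blue plane 21..24
toy_row = [1, 2, 3, 4, 11, 12, 13, 14, 21, 22, 23, 24]
toy_image = row_to_image(toy_row, height=2, width=2)
print(toy_image[0][1])  # pixel at row 0, column 1
```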

We can verify this data by looking at the head and shape of the data as follows:

print train.head()

1. train head

print train.shape, test.shape

2. train test shape

Since we’ll be using graphlab, the next step is to convert this into a graphlab SFrame and run a neural network. Let’s convert the data first:

import graphlab as gl
gltrain = gl.SFrame(train)
gltest = gl.SFrame(test)

GraphLab can automatically create a neural network based on the data. Let’s run that as a baseline model before going into an advanced model.

model = gl.neuralnet_classifier.create(gltrain, target='label', validation_set=None)

3. model1

Here it used a simple fully connected network with 2 hidden layers and 10 neurons each. Let’s evaluate this model on test data.


4. model1 test evaluate

As you can see, we have a pretty low accuracy of ~15%. This is because it is a very basic network. Let’s try to make a CNN now. But if we go about training a deep CNN from scratch, we will face the following challenges:

  1. The available data is too little to capture all the required features
  2. Training deep CNNs generally requires a GPU, as a CPU is not powerful enough to perform the required calculations. Thus we won’t be able to run it on our system. We could probably rent an Amazon AWS instance.

To overcome these challenges, we can use pre-trained networks. These are networks like AlexNet which have been pre-trained on many images, so the weights of the deep layers have already been determined. The only challenge is to find a pre-trained network which has been trained on images similar to the ones we want to classify. If the pre-trained network was not trained on images from a similar domain, the features will not make much sense for our data and the classifier will not reach a high accuracy.

Before proceeding further, we need to convert these images to the size used by the ImageNet model which we’re using for classification. The GraphLab model is based on 256×256 images, so we need to convert our images to that size. Let’s do it using the following code:

#Convert pixels to graphlab image format
gltrain['glimage'] = gl.SArray(gltrain['image']).pixel_array_to_image(32, 32, 3, allow_rounding = True)
gltest['glimage'] = gl.SArray(gltest['image']).pixel_array_to_image(32, 32, 3, allow_rounding = True)
#Remove the original column

5. train image orig

Here we can see that a new column of type graphlab image has been created, but the images are 32×32 in size. So we convert them to 256×256 using the following code:

#Convert into 256x256 size
gltrain['image'] = gl.image_analysis.resize(gltrain['glimage'], 256, 256, 3)
gltest['image'] = gl.image_analysis.resize(gltest['glimage'], 256, 256, 3)
#Remove old column:

6. train image conv

Now we can see that the images have been converted into the desired size. Next, we will load the ImageNet pre-trained model in graphlab, feed the features created in its last layer into a simple classifier and make predictions.

Let’s start by loading the pre-trained model.

#Load the pre-trained model:
pretrained_model = gl.load_model('')

Now we have to use this model to extract features which will be passed into a classifier. Note that the following operations may take a lot of computing time. I use a MacBook Pro 15″ and I had to leave it running for the whole night!

gltrain['features'] = pretrained_model.extract_features(gltrain)
gltest['features'] = pretrained_model.extract_features(gltest)

Let’s have a look at the data to make sure we have the features:


7. dtrain head

Though we have the features with us, notice that a lot of them are zeros. You can understand this as a result of the smaller data set. The ImageNet model was trained on 1.2 million images, so there are many features from those images that don’t make sense for this data, which results in zero values.

Now let’s create a classifier using graphlab. The advantage of the “classifier” function is that it automatically creates various classifiers and chooses the best model.

simple_classifier = gl.classifier.create(gltrain, features = ['features'], target = 'label')

The various outputs are:

  1. Boosted Trees Classifier
    8. boosted o:p
  2. Random Forest Classifier
    9. rf o:p
  3. Decision Tree Classifier
    10. dec tree op
  4. Logistic Regression Classifier
    11. log ref op

The final model selection is based on a validation set with 5% of the data. The results are:

12. final selection

So we can see that Boosted Trees Classifier has been chosen as the final model. Let’s look at the results on test data:


13. test result

So we can see that the test accuracy is now ~50%. It’s a decent jump from 15% to 50% but there is still huge potential to do better. The idea here was to get you started and I will skip the next steps. Here are some things which you can try:

  1. Remove the redundant features in the data
  2. Perform hyper-parameter tuning in models
  3. Search for pre-trained models which are trained on images similar to this dataset
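
As a starting point for the first suggestion, here is one possible reading of "redundant features", sketched in plain Python: drop the feature columns that are zero for every training image. The helper names and the toy data are made up for illustration, and other notions of redundancy (for example highly correlated features) would need a different approach.

```python
def nonzero_feature_indices(feature_rows):
    """Indices of feature columns that are non-zero for at least one training row."""
    n_features = len(feature_rows[0])
    return [j for j in range(n_features) if any(row[j] != 0 for row in feature_rows)]

def select_columns(feature_rows, keep):
    """Keep only the columns listed in `keep`, in that order."""
    return [[row[j] for j in keep] for row in feature_rows]

# Toy feature vectors: columns 0 and 2 are zero everywhere
train_features = [
    [0.0, 1.5, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.7],
    [0.0, 2.1, 0.0, 0.0],
]
test_features = [[0.3, 0.9, 0.0, 0.1]]

# Decide on the training data only, then apply the same selection to the test data
keep = nonzero_feature_indices(train_features)
train_reduced = select_columns(train_features, keep)
test_reduced = select_columns(test_features, keep)
print(keep)
```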

You can find many open-source solutions for this dataset which give >95% accuracy. You should check those out. Please feel free to try them and post your solutions in comments below.


End Notes

In this article, we covered the basics of computer vision using deep Convolution Neural Networks (CNNs). We started by appreciating the challenges involved in designing artificial systems which mimic the eye. Then, we looked at some of the traditional techniques, prior to deep learning, and got some intuition into their drawbacks.

We moved on to understanding some aspects of tuning neural networks, such as activation functions, weight initialization and data preprocessing. Next, we got some intuition into why deep CNNs should work better than traditional approaches and we understood the different elements present in a general deep CNN.

Subsequently, we consolidated our understanding by analyzing the architecture of AlexNet, the winning solution of the ImageNet 2012 challenge. Finally, we took the CIFAR-10 data and implemented a CNN on it using a pre-trained AlexNet deep network.

I hope you liked this article. Did you find it useful? Please feel free to share your feedback through the comments below. And to gain expertise in working with neural networks, try out the deep learning practice problem: Identify the Digits.

Triplet, audio and video, and tracking



After that somewhat self-promotional introduction, let’s get to the essence. In this post, I would like to talk about what I’ve done in this fun project from a research point of view. It involves a novel method that is also applicable to similar fine-grained image recognition problems beyond this particular one.

I call the problem fine-grained because what differentiates the score of one selfie from another lies in very fine details. These details are harder to capture than the differences in traditional object categorization problems, even with simple deep learning models.

We would like to model ‘human eye evaluation of a selfie image’ with a computer. Here, we do not define what beauty is, which is a very vague term by itself, but let the model internalize the notion from the data. The data is labeled by human annotators on an internally developed crowd-sourced website.

In terms of research, this is a peculiar problem where traditional CNN approaches fail for the following reasons:

  • Fine-grained attributes are the factors that make one image better or worse than another.
  • Selfie images exhibit a vast amount of variation due to different applied filters, edits, poses and lighting.
  • Scoring is a different task from categorization, and it is not as well studied.
  • Scarcity of annotated data forces learning in a small-data regime.

Previous Works

This problem has already been targeted by different works. One well-known example uses a deep learning back end empowered with a large amount of data from a dating application, using the application statistics as the annotation. Our solution differs strongly since we only use in-house data, which is very small compared to what they have. Thus, simply feeding data into a well-known CNN architecture does not work in our setting.

There is also a relevant blog post by A. Karpathy, where he crawled Instagram for millions of images and used “likes” as the annotation. He uses a simple CNN. He states that the model is not that good, but it still gives an intuition about what makes a good selfie. Again, we agree with A. Karpathy that ad-hoc CNN solutions are not enough for decent results.

There are other research efforts suggesting different CNN architectures or ratio-based beauty justifications; however, they are limited by pose constraints or smooth backgrounds. In our setting, an image can be uploaded from any scene with an applied filter or mask.

Proposed Method

We solve this problem in 3 steps. First, pre-train the network with a Siamese layer [1][2] while incrementally enlarging the model with Net2Net [3]. Then, just before fine-tuning, use the Net2Net operator once more to double the model size. Finally, fine-tune the model with Huber-Loss based regression for scoring.

Method overview. 1. Train the model with Siamese layer, 2. Double the model size with Net2Net, 3. Fine-tune the model with Huber-Loss for scoring.

Siamese Network

A Siamese network architecture is a way of learning that embeds images into a lower-dimensional space based on similarity computed with features learned by a feature network. The feature network is the architecture we intend to fine-tune in this setting. Given two images, we feed them into the feature network and compute the corresponding feature vectors. The final layer computes the pair-wise distance between the computed features, and the final loss layer considers whether these two images are from the same class (label 1) or not (label -1).

Siamese network. From [2]. Both convolutional networks share parameters and learn the representation in parallel. In our setting, these parameters belong to the network to be fine-tuned.

Suppose $G_w(\cdot)$ is the function implemented by the feature network and $X$ denotes the raw image pixels. Subscripts on $X$ denote different images. Based on this parametrization, the final layer computes the distance below (L1 norm).

$$E_w = \lVert G_w(X_1) - G_w(X_2) \rVert_1$$

On top of this, any suitable loss function might be used. Many different alternatives have been proposed lately. We choose to use Hinge Embedding Loss, which is defined as

$$L(X, Y) = \begin{cases} x_i, & \text{if } y_i = 1 \\ \max(0, \text{margin} - x_i), & \text{if } y_i = -1 \end{cases}$$
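
The distance and loss above can be written out in a few lines of plain Python. This is a minimal sketch, not the released Torch code: the function names, the toy feature vectors and the margin value of 1.0 are assumptions for illustration, and `x_i` in the formula is taken to be the L1 distance of pair i.

```python
def l1_distance(features_a, features_b):
    """L1 norm of the difference between two feature vectors, E_w in the text."""
    return sum(abs(a - b) for a, b in zip(features_a, features_b))

def hinge_embedding_loss(distance, label, margin=1.0):
    """Hinge embedding loss for one pair: label 1 = same class, -1 = different class."""
    if label == 1:
        # Same class: any remaining distance is penalised
        return distance
    if label == -1:
        # Different class: penalised only while the pair is closer than the margin
        return max(0.0, margin - distance)
    raise ValueError("label must be 1 or -1")

# Toy feature vectors standing in for G_w(X_1) and G_w(X_2)
feat_1 = [0.2, 0.9, 0.1]
feat_2 = [0.1, 0.7, 0.4]

distance = l1_distance(feat_1, feat_2)
loss_same = hinge_embedding_loss(distance, 1)
loss_diff = hinge_embedding_loss(distance, -1)
print("distance=%.2f same=%.2f different=%.2f" % (distance, loss_same, loss_diff))
```

For a pair from different classes the loss drops to zero once the distance exceeds the margin, so only pairs that are still too close contribute a gradient.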

In this framework, the Siamese layer pushes the network to learn features that are common within the same class and discriminative between different classes. That said, thanks to the pair-wise consideration of examples, we expect to learn powerful features that capture finer details than simple supervised learning would. These features provide a better initialization for the later fine-tuning stage than simple random or ImageNet initialization.

Siamese network tries to contract instances belonging to the same classes and disperse instances from different classes in the feature space.

Architecture update by Net2Net

Net2Net [3] proposes two different operators to make networks deeper and wider while keeping the model activations the same. Hence, it enables training a network incrementally, from smaller and shallower to wider and deeper architectures. This accelerates training, lowers computational requirements and possibly results in better representations.

Figure from Net2Net slide

We use Net2Net to reduce the training time on our modest computing facility and to benefit from Siamese training without any architectural deficit. We apply the Net2Net operators each time training stalls during Siamese training. At the end of the Siamese training, we apply the Net2Net wider operation once more to double the model size and increase the model’s capacity to learn more representations.

The wider operation adds more units to a layer by copying weights from the old units and normalizes the next layer’s weights by the cloning factor of each unit, in order to keep the propagated activation the same. The deeper operation adds an identity layer between successive layers so that, again, the propagated activation stays the same.

One subtle difference in our use of Net2Net is that we apply zeroing noise to the cloned weights in the wider operation. It basically breaks the symmetry and forces each unit to learn similar but different representations.
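
To make the wider operation concrete, here is a minimal sketch in plain Python for a toy two-layer network with a ReLU hidden layer. It only illustrates the function-preserving idea described above and is not the Torch code used in the project; the names `net2wider` and `forward` and the toy weights are assumptions.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def forward(W1, W2, x):
    """Two-layer network: y = W2 * relu(W1 * x)."""
    return matvec(W2, relu(matvec(W1, x)))

def net2wider(W1, W2, new_width, rng):
    """Widen the hidden layer from len(W1) to new_width units, preserving the output."""
    old_width = len(W1)
    # mapping[j] is the old unit that new unit j copies; old units map to themselves
    mapping = list(range(old_width))
    mapping += [rng.randrange(old_width) for _ in range(new_width - old_width)]
    # counts[i] is the cloning factor of old unit i
    counts = [mapping.count(i) for i in range(old_width)]
    # Copy the incoming weights of each cloned unit
    W1_wide = [list(W1[i]) for i in mapping]
    # Divide the outgoing weights by the cloning factor so the summed activation is unchanged
    W2_wide = [[row[i] / counts[i] for i in mapping] for row in W2]
    return W1_wide, W2_wide

W1 = [[0.5, -1.0, 2.0], [1.5, 0.3, 0.7]]  # 3 inputs -> 2 hidden units
W2 = [[1.0, -2.0]]                        # 2 hidden units -> 1 output
x = [1.0, 2.0, 3.0]

W1_wide, W2_wide = net2wider(W1, W2, new_width=4, rng=random.Random(0))
before = forward(W1, W2, x)
after = forward(W1_wide, W2_wide, x)
print("before: %s, after: %s" % (before, after))
```

The widened network has twice as many hidden units but computes the same output, so training can continue from where the smaller network left off.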

Sidenote: I studied this exact method in parallel to this paper at Qualcomm Research while I was participating in the ImageNet challenge. However, I could not find time to publish before Net2Net. Sad 🙁


Fine-tuning is performed with Huber-Loss on top of the network that was used as the feature network at the Siamese stage. Huber-Loss is the choice due to its resilience to outlier instances. Outliers (mislabeled or corrupted instances) are extremely harmful in fine-grained problems, especially for small-scale data sets. Hence, it is important for us to reduce the effect of wrongly scored instances.

As discussed above, before fine-tuning we double the width (number of units in each layer) of the network. This increases the representational power of the network, which seems important for fine-grained problems.
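
For reference, the standard Huber loss for a single prediction can be sketched as below. The threshold `delta=1.0` is an assumption, since the post does not state the value used, and the function name is made up for illustration.

```python
def huber_loss(prediction, target, delta=1.0):
    """Huber loss: quadratic for small errors, linear for errors larger than delta."""
    error = abs(prediction - target)
    if error <= delta:
        return 0.5 * error ** 2
    return delta * (error - 0.5 * delta)

small_error = huber_loss(5.0, 4.5)   # within delta: behaves like squared error
large_error = huber_loss(9.0, 4.0)   # outlier: grows linearly instead of quadratically
squared_large = 0.5 * (9.0 - 4.0) ** 2
print("small=%.3f large=%.3f squared=%.3f" % (small_error, large_error, squared_large))
```

A wrongly scored image with an error of 5 contributes 4.5 to the Huber loss instead of 12.5 under the squared error, which is the resilience to outliers mentioned above.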

Data Collection and Annotation

For this mission, we collect ~100,000 images from the web, prune the irrelevant or low-quality images, then annotate the remaining ones on a crowd-sourced website. Each image is scored from 0 to 9. Eventually, we have 30,000 annotated images, each scored at least twice by different annotators.

The understanding of beauty varies among cultures, and we assume that the variety of annotators minimizes any cultural bias.

Annotated images are processed by a face detection and alignment procedure so that faces are centered and aligned by the eyes.

Implementation Details

For all model training, we use the Torch7 framework, and almost all of the training code is released on GitHub. In this repository, you can find different architectures on different code branches.

Fine-tuning leverages a data sampling strategy that alleviates the effect of data imbalance. Our data set has a Gaussian-like distribution over the classes, in which the mid-classes have more instances than the fringes. To alleviate this, we first pick a random class and then select a random image belonging to that class. That gives each class an equal chance of being selected.
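
The sampling strategy described above (pick a random class, then a random image of that class) can be sketched in plain Python. The helper names and the toy label distribution are assumptions for illustration; the released training code is written in Torch.

```python
import random
from collections import Counter, defaultdict

def build_class_index(labels):
    """Map each class label to the list of example indices that carry it."""
    index = defaultdict(list)
    for i, label in enumerate(labels):
        index[label].append(i)
    return index

def sample_balanced(class_index, rng):
    """Pick a class uniformly at random, then an example uniformly within that class."""
    label = rng.choice(sorted(class_index))
    return rng.choice(class_index[label])

# Toy imbalanced data: the middle score is far more frequent than the fringes
labels = [4] * 80 + [5] * 15 + [9] * 5
class_index = build_class_index(labels)

rng = random.Random(0)
drawn = Counter(labels[sample_balanced(class_index, rng)] for _ in range(3000))
print(sorted(drawn.items()))
```

Each of the three classes ends up with roughly a third of the draws even though class 4 makes up 80% of the data. The trade-off is that images from rare classes are repeated much more often, which is one reason augmentation matters here.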

We applied rotation, random scaling, color noise and random horizontal flip for data augmentation.

We do not use Batch Normalization (BN) layers since they add computational cost and, in our experiments, yield far worse performance. We believe this is due to the fine-detailed nature of the problem: BN layers reduce the representational power of the network through the implicit noise they apply.

ELU activation is used for all our network architectures since, confirming the claim of [8], it accelerates the training of a network without BN layers.

We tried many different architectures, but a simple and memory-efficient model (Tiny Darknet) was enough to obtain comparable performance in a shorter training time. Below, I share the Torch code for the model definition:
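
For readers unfamiliar with ELU, its usual definition is the identity for positive inputs and `alpha * (exp(x) - 1)` otherwise. A minimal sketch follows; `alpha=1.0` is a common default and an assumption here, since the post does not state the value used.

```python
import math

def elu(x, alpha=1.0):
    """Exponential linear unit: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

for value in (-3.0, -1.0, 0.0, 2.0):
    print("elu(%.1f) = %.4f" % (value, elu(value)))
```

Unlike ReLU, negative inputs produce small negative outputs that saturate at -alpha, which keeps mean activations closer to zero.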

-- The Tiny Model






model:add(nn.Linear(1024, 1))


In this section, we discuss the contributions of the individual pieces of the proposed method. For any numerical comparison, I show the correlation between the model predictions and the annotators’ scores on a validation set.
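
The post does not say which correlation measure is used. Assuming it is the Pearson correlation coefficient, the computation can be sketched as follows; the function name and the toy numbers are made up for illustration and are not results from the project.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)

# Toy validation set: model predictions vs. annotator scores on the 0-9 scale
predictions = [2.1, 4.8, 6.9, 3.2]
annotator_scores = [2, 5, 7, 3]
r = pearson_correlation(predictions, annotator_scores)
print("correlation: %.3f" % r)
```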

Effect of Pre-Training

Pre-training with the Siamese loss has a crucial effect. The initial representation learned by Siamese training provides a very effective initialization for the final model. Without pre-training, many of our training runs stall quickly or do not reduce the loss at all.

Correlation values with different settings, higher is better;

  • with pre-training : 0.82
  • without pre-training : 0.68
  • with ImageNet: 0.73

Effect of Net2Net

The most important aspect of Net2Net is that it allows training incrementally, in a faster manner. It also reduces the engineering effort for your model architecture, since you can validate a smaller version of your model rapidly before training the real one.

In our experiments, Net2Net provides a good speed-up. It also increases the final model performance slightly.

Correlation values with different settings;

  • pre-training + net2net : 0.84
  • with pre-training : 0.82
  • without pre-training : 0.68
  • with ImageNet (VGG): 0.73

Training times;

  • pre-training + net2net : 5 hours
  • with pre-training : 8 hours
  • without pre-training : 13 hours
  • with ImageNet (VGG): 3 hours

We can see the performance and time improvements above. A difference of 3 hours may not seem crucial, but think about repeating the same training again and again to find the best possible setting. In that case, it saves a lot.


Although the proposed method yields a considerable performance gain, more data would, confirming the common notion, increase the performance much further. The learning curve below shows that our model learns the training data very well, but the validation loss stalls quickly. Thus, we need much more coverage in the training data in order to generalize better on the validation set.

Sample training curve from the fine-tuning stage. Early saturation of the validation loss is a sign that more training data is required.
Learning Multi-Domain Convolutional Neural Networks for Visual Tracking

Multimodal: Semantic Learning


Cheng Wang(王城)
Prof.-Dr.-Helmert-Str. 2-3, 14482 Potsdam, Germany


Multimodal Representation Learning (ongoing)

Multimodal representation learning has gained increasing importance in various real-world multimedia applications. Inspired by the success of deep networks in multimedia computing, we propose a novel unified deep neural framework for multimodal representation learning. To capture the high-level semantic correlations across modalities, we adopt deep learning features as the image representation and topic features as the text representation. For joint model learning, a 5-layer neural network is designed, with supervised pre-training enforced in the first 3 layers for intra-modal regularization.


Action Recognition with Deep Learning (ongoing)


Multimodal Video Representation Learning for Action Recognition

Video contains rich information such as appearance, motion and audio to help us understand its content. Recent works have shown that the combination of appearance (spatial) and motion (temporal) cues can significantly improve human action recognition performance in videos. In order to further explore the multimodal representation of video in action recognition, this work proposes a framework for learning multimodal representations of video appearance, motion and audio data. Our proposed fusion approach achieves 85.1% accuracy when fusing spatial and temporal streams on UCF101 (split 1), which is very competitive with state-of-the-art works.

Deep Siamese Network for Action Recognition

This project aims to present a novel approach for video feature embedding via a deep Siamese Neural Network (SNN). Different from existing feature descriptor-based methods, we propose a metric learning-based approach to train a deep SNN that builds on a two-stream Convolutional Neural Network (CNN), using generated similar and dissimilar video pairs. SNN features are learned by minimizing the distance between similar videos and maximizing the distance between dissimilar videos. Our experimental results show that training an SNN is beneficial in discriminative tasks like human action recognition. Our approach achieves very competitive performance on the open benchmark UCF101 compared to state-of-the-art work.

Recent Publications

  • C. Wang, H. Yang, C. Bartz and C. Meinel, “Image Captioning with Deep Bidirectional LSTMs”, ACM Multimedia (ACMMM 2016) (accepted as oral presentation).
  • C. Wang, H. Yang and C. Meinel, “Exploring Multimodal Video Representation for Action Recognition”, The annual International Joint Conference on Neural Networks (IJCNN 2016) (to appear).
  • C. Wang, H. Yang and C. Meinel, “A Deep Semantic Framework for Multimodal Representation Learning”, International Journal of Multimedia Tools and Applications (MTAP, IF: 1.346), DOI: 10.1007/s11042-016-3380-8, online ISSN: 1573-7721, print ISSN: 1380-7501, Special Issue: “Representation Learning for Multimedia Data Understanding”, March 2016.
  • C. Wang, H. Yang and C. Meinel, “Deep Semantic Mapping for Cross-Modal Retrieval”, the 27th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2015), pp. 234-241, Vietri sul Mare, Italy, November 9-11, 2015.
  • C. Wang, H. Yang and C. Meinel, “Visual-Textual Late Semantic Fusion Using Deep Neural Network for Document Categorization”, the 22nd International Conference on Neural Information Processing (ICONIP 2015), pp. 662-670, Istanbul, Turkey, November 9-12, 2015.
  • C. Wang, H. Yang and C. Meinel, “Does Multilevel Semantic Representation Improve Text Categorization?”, the 26th International Conference on Database and Expert Systems Applications (DEXA 2015), LNCS, Volume 9261, pp. 319-333, Valencia, Spain, September 1-4, 2015.
  • H. Yang, C. Wang, X. Che and C. Meinel, “An Improved System For Real-Time Scene Text Recognition”, ACM International Conference on Multimedia Retrieval (ICMR 2015), Shanghai, June 23-26, 2015.
  • C. Wang, H. Yang, X. Che and C. Meinel, “Concept-Based Multimodal Learning for Topic Generation”, the 21st MultiMedia Modelling Conference (MMM 2015), LNCS, Volume 8935, pp. 385-395, Sydney, Australia, January 5-7, 2015.

Video, Audio and Speech features



Paper: Joint Audio-Visual Bi-Modal Codewords for Video Event Detection. Guangnan Ye, I-Hong Jhuo, Dong Liu, Yu-Gang Jiang, D.T. Lee, Shih-Fu Chang. In ACM International Conference on Multimedia Retrieval (ICMR), Hong Kong, June 2012.


A Hybrid Content- and Concept-Based Approach to Large-Scale Video Analytics

Figure 1. Architecture based on a hierarchical deep neural network (H-DNN) and hidden Markov models (HMMs) for audio-only video event detection. MFCCs (mel-frequency cepstral coefficients) are commonly used in speech recognition systems, and an MLP (multilayer perceptron) is an artificial neural network that maps sets of input data onto a set of appropriate outputs.
Figure 2. Processing chain for deep neural-network audio-only video event detection (top). The deep neural-network architecture comparison is shown lower left. Deep neural-network sampling and training efficiency comparison is shown lower right.

Best links





Paper: Learning Salient Features for Speech Emotion Recognition Using Convolutional Neural Networks



Wake-Up-Word Speech Recognition

Speech Processing Laboratory – CTU

Video editing based on behaviors-for-attention – a


Deep learning for computational biology



Video Applications

Ref:

Real-time Action Recognition with Enhanced Motion Vector CNNs


Video Understanding