After a somewhat self-promotional introduction, let’s get to the essence. In this post, I would like to talk about what I have done in this fun project from a research point of view. It led to a novel method that is also applicable to similar fine-grained image recognition problems beyond this particular one.

I call the problem fine-grained because what differentiates the score of one selfie from another lies in very small details. These details are hard to capture compared to traditional object categorization problems, which even simple deep learning models handle well.

We would like to model ‘human eye evaluation of a selfie image’ with a computer. Here, we do not define what beauty is, which is a very vague term by itself, but let the model internalize the notion from the data. The data is labeled by human annotators on an internally developed crowd-sourced website.

In terms of research, this is a peculiar problem where traditional CNN approaches fail for the following reasons:

  • Fine-grained attributes are the factors that make one image better or worse than another.
  • Selfie images exhibit a vast amount of variation due to applied filters, edits, pose and lighting.
  • Scoring is a different practice from categorization, and it is not as well studied.
  • Scarcity of annotated data forces learning in a small-data regime.

Previous Works

This problem has already been targeted by other works. One well-known example uses a deep learning back-end powered by a large amount of data from a dating application, with the application statistics serving as the annotation. Our solution differs strongly, since we only use in-house data, which is very small compared to what they have. Thus, simply feeding data into a well-known CNN architecture does not work in our setting.

There is also a relevant blog post by A. Karpathy, where he crawled Instagram for millions of images and used “likes” as annotation. He uses a simple CNN. He states that the model is not that good, but it still gives an intuition about what makes a good selfie. Again, we take A. Karpathy’s word that ad-hoc CNN solutions are not enough for decent results.

There are other research efforts suggesting different CNN architectures or ratio-based beauty measures; however, they are limited to constrained poses or smooth backgrounds. In our setting, an image can be uploaded from any scene with an applied filter or mask.

Proposed Method

We solve this problem in 3 steps. First, we pre-train the network with a Siamese layer [1][2] while incrementally enlarging the model with Net2Net [3]. Second, just before fine-tuning, we apply the Net2Net operator once more to double the model size. Third, we fine-tune the model with Huber-Loss based regression for scoring.

Method overview. 1. Train the model with Siamese layer, 2. Double the model size with Net2Net, 3. Fine-tune the model with Huber-Loss for scoring.
Siamese Network

A Siamese network architecture is a way of learning an embedding of images into a lower-dimensional space, based on similarity computed with features learned by a feature network. The feature network is the architecture we intend to fine-tune in this setting. Given two images, we feed both into the feature network and compute the corresponding feature vectors. The final layer computes the pair-wise distance between the computed features, and the final loss layer considers whether these two images are from the same class (label 1) or not (label -1).

Siamese network. From [2]. Both convolutional networks share parameters and learn the representation in parallel. In our setting, these parameters belong to the network to be fine-tuned.

Suppose $G_w(\cdot)$ is the function implemented by the feature network and $X$ is the raw image pixels. Subscripts of $X$ denote different images. Based on this parametrization, the final layer computes the distance below (L1 norm).

$$E_w = \lVert G_w(X_1) - G_w(X_2) \rVert_1$$

On top of this, any suitable loss function might be used. Many different alternatives have been proposed lately. We choose the Hinge Embedding Loss, where $x_i$ is the distance computed for the $i$-th pair and $y_i$ is its label, defined as:

$$L(X, Y) = \begin{cases} x_i, & \text{if } y_i = 1 \\ \max(0, \text{margin} - x_i), & \text{if } y_i = -1 \end{cases}$$
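To make the distance and the loss concrete, here is a minimal sketch in plain Python rather than Torch. The function names, the toy feature vectors and the margin value of 1.0 are assumptions made for illustration; they are not taken from the released code.

```python
def l1_distance(a, b):
    """L1 distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def hinge_embedding_loss(distance, label, margin=1.0):
    """Hinge embedding loss for a single pair.

    label == 1  (same class):        the loss is the distance itself.
    label == -1 (different classes): the loss is max(0, margin - distance).
    """
    if label == 1:
        return distance
    if label == -1:
        return max(0.0, margin - distance)
    raise ValueError("label must be 1 or -1")


# Toy vectors standing in for G_w(X_1) and G_w(X_2) (hypothetical values).
f1 = [0.5, 1.0, -0.5]
f2 = [0.25, 1.0, 0.0]

d = l1_distance(f1, f2)
same_loss = hinge_embedding_loss(d, 1)               # pair from the same class
diff_loss = hinge_embedding_loss(d, -1, margin=1.0)  # pair from different classes
```

For a same-class pair the loss shrinks only when the distance shrinks, while a different-class pair stops contributing once the distance exceeds the margin.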

In this framework, the Siamese layer pushes the network to learn features that are common within a class and discriminative between classes. That being said, thanks to the pair-wise consideration of examples, we expect to learn powerful features that capture finer details than simple supervised learning does. These features provide a better initialization for the later fine-tuning stage than simple random or ImageNet initialization.

Siamese network tries to contract instances belonging to the same classes and disperse instances from different classes in the feature space.
Architecture update by Net2Net

Net2Net [3] proposes two different operators to make networks deeper and wider while keeping the model activations the same. Hence, it enables training a network incrementally, from smaller and shallower to wider and deeper architectures. This accelerates training, lowers computational requirements and possibly results in better representations.

Figure from Net2Net slide

We use Net2Net to reduce the training time on our modest computing facility and to benefit from Siamese training without any architectural deficit. We apply Net2Net operators each time training stalls during Siamese training. At the end of Siamese training, we apply the Net2Net wider operation once more to double the size and increase the capacity of the model to learn richer representations.

The wider operation adds more units to a layer by copying weights from the old units, and normalizes the next layer’s weights by the cloning factor of each unit in order to keep the propagated activation the same. The deeper operation adds an identity layer between successive layers so that, again, the propagated activation stays the same.

One subtle difference in our use of Net2Net is that we apply zeroing noise to the cloned weights in the wider operation. It basically breaks the symmetry and forces each unit to learn similar but different representations.
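The wider operation can be illustrated on a tiny two-layer fully connected network. This is only a sketch in plain Python under stated assumptions: the network shape, the weights, the helper names and the random choice of which units to clone are all made up for illustration, and the zeroing noise mentioned above is deliberately left out so that the output stays unchanged.

```python
import random


def relu(values):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in values]


def linear(weights, bias, x):
    """Compute weights @ x + bias; weights is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def forward(w1, b1, w2, b2, x):
    """Two-layer network: linear -> ReLU -> linear."""
    return linear(w2, b2, relu(linear(w1, b1, x)))


def widen_hidden_layer(w1, b1, w2, new_width, rng):
    """Widen the hidden layer to new_width units without changing the output."""
    old_width = len(w1)
    if new_width < old_width:
        raise ValueError("new_width must be at least the current width")
    # Wide unit j is a copy of old unit mapping[j]; the first old_width
    # units map to themselves, the extra ones clone randomly chosen units.
    mapping = list(range(old_width)) + [
        rng.randrange(old_width) for _ in range(new_width - old_width)
    ]
    copies = [mapping.count(j) for j in range(old_width)]
    new_w1 = [list(w1[j]) for j in mapping]
    new_b1 = [b1[j] for j in mapping]
    # Divide each outgoing weight by the number of copies of its source unit,
    # so the copies together contribute what the single old unit did.
    new_w2 = [[row[j] / copies[j] for j in mapping] for row in w2]
    return new_w1, new_b1, new_w2


# Hypothetical toy network with 2 inputs, 2 hidden units and 1 output.
w1 = [[1.0, -2.0], [0.5, 0.5]]
b1 = [0.0, 0.1]
w2 = [[2.0, -1.0]]
b2 = [0.3]
x = [1.0, 0.25]

wide_w1, wide_b1, wide_w2 = widen_hidden_layer(w1, b1, w2, 4, random.Random(0))
y_before = forward(w1, b1, w2, b2, x)
y_after = forward(wide_w1, wide_b1, wide_w2, b2, x)
```

Up to floating-point rounding, `y_before` and `y_after` agree, which is the function-preserving property that lets training continue from the widened model.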

Sidenote: I studied this exact method in parallel to this paper at Qualcomm Research when I was participating in the ImageNet challenge. However, I could not find time to publish before Net2Net. Sad 🙁


Fine-tuning is performed with Huber-Loss on top of the network that was used as the feature network in the Siamese stage. Huber-Loss is the choice due to its resilience to outlier instances. Outliers (mislabeled or corrupted instances) are extremely harmful in fine-grained problems, especially for small-scale data sets. Hence, it is important for us to dampen the effect of wrongly scored instances.
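To show why Huber-Loss is more forgiving of outliers than a squared error, here is a small sketch in plain Python. The threshold `delta=1.0` and the example scores are assumptions for illustration; the post does not state which threshold was used.

```python
def huber_loss(prediction, target, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones."""
    error = abs(prediction - target)
    if error <= delta:
        return 0.5 * error * error
    return delta * (error - 0.5 * delta)


def squared_loss(prediction, target):
    """Plain squared error, scaled by 0.5 for comparison with Huber."""
    error = prediction - target
    return 0.5 * error * error


# A well-labeled image: small error, both losses agree.
small_huber = huber_loss(4.5, 5.0)
small_squared = squared_loss(4.5, 5.0)

# A possibly mislabeled image: prediction 3, annotated score 9.
outlier_huber = huber_loss(3.0, 9.0)
outlier_squared = squared_loss(3.0, 9.0)
```

For the small error both losses are identical, while for the outlier the Huber value grows only linearly with the error, so a single wrongly scored instance pulls on the model far less than it would under a squared loss.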

As discussed above, before fine-tuning we double the width (number of units in each layer) of the network. This increases the representation power of the network, which seems important for fine-grained problems.

Data Collection and Annotation

For this mission, we collected ~100,000 images from the web, pruned the irrelevant or low-quality images, and then annotated the remaining ones on a crowd-sourced website. Each image is scored between 0 and 9. Eventually, we have 30,000 annotated images, each scored at least twice by different annotators.

The understanding of beauty varies among cultures, and we assume that the variety of annotators minimizes any cultural bias.

Annotated images are processed by a face detection and alignment procedure so that faces are centered and aligned by the eyes.

Implementation Details

For all model training, we use the Torch7 framework, and almost all of the training code is released on GitHub. In this repository, you can find different architectures on different code branches.

Fine-tuning leverages a data sampling strategy that alleviates the effect of data imbalance. Our data set has a Gaussian-like distribution over the classes, in which the mid-classes have more instances than the fringes. To alleviate this, we first pick a random class and then select a random image belonging to that class. This gives each class an equal chance of being selected.
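The two-stage sampling described above can be sketched in a few lines of plain Python. The helper names and the toy label distribution are assumptions for illustration and do not come from the released training code.

```python
import random
from collections import Counter, defaultdict


def build_class_index(labels):
    """Map each class label to the list of sample indices carrying it."""
    index = defaultdict(list)
    for i, label in enumerate(labels):
        index[label].append(i)
    return dict(index)


def sample_balanced(class_index, rng):
    """Pick a class uniformly at random, then a sample within that class."""
    label = rng.choice(sorted(class_index))
    return rng.choice(class_index[label])


# Toy imbalanced data set: the middle score has 8x more images than the fringes.
labels = [0] * 2 + [5] * 16 + [9] * 2
class_index = build_class_index(labels)

rng = random.Random(0)
draws = Counter(labels[sample_balanced(class_index, rng)] for _ in range(3000))
```

Although score 5 owns 80% of the images in this toy set, each of the three scores ends up with roughly a third of the draws. The price is that images from rare classes are repeated far more often than the others, which is one reason data augmentation matters in this setting.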

We applied rotation, random scaling, color noise and random horizontal flip for data augmentation.

We do not use Batch Normalization (BN) layers since they add computational cost and, in our experiments, gave far worse performance. We believe this is due to the fine-detailed nature of the problem: BN layers lose some of the representational power of the network because of the implicit noise they apply.

ELU activation is used for all our network architectures since, confirming the claim of [8], it accelerates the training of a network without BN layers.

We tried many different architectures, but a simple and memory-efficient model (Tiny Darknet) was enough to obtain comparable performance in a shorter training time. Below, I share the Torch code for the model definition:

-- The Tiny Model






model:add(nn.Linear(1024, 1))


In this section, we discuss the contributions of the individual pieces of the proposed method. For all numerical comparisons, I report the correlation between the model predictions and the annotators’ scores on a validation set.
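As a reference for how such a number can be computed, here is a minimal sketch in plain Python. It assumes the Pearson correlation coefficient, which the post does not state explicitly, and the scores below are made-up values rather than real validation data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical annotator scores and model predictions for five images.
annotator_scores = [2.0, 4.5, 6.0, 7.5, 3.0]
model_scores = [2.5, 4.0, 6.5, 7.0, 3.5]

r = pearson(annotator_scores, model_scores)
```

A value close to 1 means the model ranks and spaces the images much like the annotators do, even if its absolute scores are shifted or rescaled.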

Effect of Pre-Training

Pre-training with the Siamese loss has a crucial effect. The initial representation learned by Siamese training provides a very effective initialization for the final model. Without pre-training, many of our training runs stall very quickly or do not reduce the loss at all.

Correlation values with different settings (higher is better):

  • with pre-training: 0.82
  • without pre-training: 0.68
  • with ImageNet (VGG): 0.73
Effect of Net2Net

The most important aspect of Net2Net is that it allows incremental, faster training. It also reduces the engineering effort spent on the model architecture, since you can rapidly validate a smaller version of your model before training the real one.

In our experiments, we observe that Net2Net provides a good speed-up. It also increases the final model performance slightly.

Correlation values with different settings:

  • pre-training + Net2Net: 0.84
  • with pre-training: 0.82
  • without pre-training: 0.68
  • with ImageNet (VGG): 0.73

Training times:

  • pre-training + Net2Net: 5 hours
  • with pre-training: 8 hours
  • without pre-training: 13 hours
  • with ImageNet (VGG): 3 hours

We can see the performance and time improvements above. Maybe 3 hours does not seem crucial, but think about repeating the same training again and again to find the best possible setting. In that case, it saves a lot.


Although the proposed method yields a considerable performance gain, in line with the common notion, more data would increase the performance much further. The learning curve below shows that our model learns the training data very well, but the validation loss stalls quickly. Thus, we need much more coverage from the training data in order to generalize better on the validation set.

Sample training curve from the fine-tuning stage. Early saturation of the validation loss is a sign that more training data is required.