Data Analysis and Visualization - 094295

HW 2 - Proper Mask Detection & Bounding Box Prediction

Nitzan Shamir 206348187 & Omer Shubi 312236219

Link to GitHub repository - 094295_hw2

In this project we have two main tasks.

The first is to perform exploratory data analysis (EDA).

The second is to predict, in different experiments, the bounding box and its label for each image.

Note that we hide the code cells so that the notebook stays clean. The full code is included in the repository.

EDA

In this phase we explore the data: we perform a basic analysis, visualize some images, and provide insights from the data.

Number of Records in each Directory

Number of Records in each Label, Directory

Basic statistics about the bounding box area

From the statistics and graphs above we can see that -

Sample of train images from the data

Sample of test images from the data

From the images above, we can see that the images vary considerably. The differences include -

Train images plotted with bounding boxes - green for proper mask, red for not proper mask

Test images plotted with bounding boxes - green for proper mask, red for not proper mask

From the images above, it seems that the bounding box labels are annotated according to the definition of a proper mask. However, the bounding box annotations themselves are not perfect.

Experiments

The main model & process we used is described below. Afterwards, we explain the changes we made for the second configuration.

Throughout our work we make use of the pytorch-lightning framework, which abstracts away a lot of the 'boilerplate' code usually involved in building & training neural networks.

It wraps around the regular pytorch framework and makes it easy to use advanced features and to avoid mistakes.

Additionally, we make use of torchvision for handling the images.

Data loading, pre-processing and cleaning

The cleaning and preprocessing steps we perform are rather simple -

First, we parse the image information (label & bounding box location) from the image file name.

The bounding boxes are defined as [$x_1$, $y_1$, $w$, $h$]. However, the model we use expects the bounding boxes to be in [$x_1$, $y_1$, $x_2$, $y_2$] format. Therefore we correct for this mismatch.

Before loading the images into the model, they are converted to a scale of [0,1].

We note that as the images are of different sizes, we create a custom collate function that handles this mismatch properly.

For loading the data into the model we make use of pytorch's datasets and dataloaders and of pytorch-lightning's datamodule. All of these help load the images into memory efficiently, only when needed.

Architecture

For the model architecture we chose to use the widely popular Faster R-CNN.

It is a fast, end-to-end framework for object detection that uses deep convolutional networks.

The architecture was proposed in the game-changing article - Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

The architecture consists of three parts.

The first part extracts features from the raw image using a CNN module.

In our case we use Resnet50 for this module.

The second part is a Region Proposal Network (RPN). This is a small neural network that slides over the last feature map of the previous module and predicts whether or not there is an object in each area, and if so, also proposes bounding boxes.

The third part is a fully connected neural network head (the ROI head) that takes as input the regions proposed by the RPN and predicts the object class (classification) and the bounding boxes (regression).

Implementation-wise, we use torchvision's implementation of Faster R-CNN, with slight modifications as described below. Torchvision's implementation of Faster R-CNN has two operating modes - training and evaluation. When training, the model returns only the losses, and when evaluating - only the bounding boxes and predictions. As we also need the outputs during training and the losses during evaluation, we modified the torchvision code. The modified code is hosted in a public GitHub fork of the original implementation.

Loss Functions

As the architecture has multiple stages, all of which need to be trained, we make use of losses from both the RPN module and the ROI module. From the RPN module we get 2 losses - an objectness classification loss and a bounding box regression loss.

These losses take the region proposals into account.

Similarly, from the ROI module we get 2 losses - a classification loss and a bounding box regression loss.

These losses take into account the final predicted bounding boxes.

In the RPN module the classification loss is an objectness loss (object vs. background), while in the ROI module it is between the predicted 'proper mask' class and the ground truth label; in both modules, the bounding box regression loss is between the predicted bounding boxes and the ground truth bounding boxes.

Finally, we sum all 4 losses.

Optimizers & Regularization

We make use of the Adam optimizer, with an initial learning rate of 1e-3 and otherwise default parameters. In this initial configuration there is no explicit regularization.

However, as we ran into issues where the model was not learning, probably due to exploding gradients, we added gradient clipping, which caps the gradients at an upper threshold.

Hyper parameter tuning

For selecting and evaluating the model with different hyper parameters we train the model on the train set, and verify the results on the validation set (called test images).

There are hyperparameters related to the optimization process, and ones related to the actual model.

Note

We note that we do not make use of any pretrained models or external data.

First Configuration

As described above, for the first configuration we made use of the Faster R-CNN architecture, with a resnet50 CNN module.

The model has 41.3 million trainable weights.

For the final Faster R-CNN w/ Resnet50 model we used the following parameters:

Parameter Value
Batch size 32
Epochs 12 (out of 27)
Min image size 224
Max image size 224
Optimizer Adam
Initial learning rate 1e-3
Gradient clip value 10

This model achieved an accuracy of 0.7535 and IOU of 0.5776 on the validation (test) set.

Training takes around 16 minutes per epoch on the VM, and the saved model takes up around 500 MB.

Loss, Accuracy & IOU Graphs

Second configuration

Next we describe the second configuration that we tested out.

The main difference from the first configuration is that we replaced the resnet50 CNN module with mobilenet v3 'large'.

This architecture was first proposed in the article Searching for MobileNetV3.

Replacing the CNN module also required us to adapt the anchor generator and the box roi pooler which are part of the other two modules.

The model has 18.9 million trainable weights.

For the final Faster R-CNN w/ mobilenet v3 large model we used the following parameters:

Parameter Value
Batch size 32
Epochs 9 (out of 15)
Min image size 224
Max image size 224
Optimizer Adam
Initial learning rate 1e-3
Gradient clip value 10

This model achieved an accuracy of 0.5817 and IOU of 0.4004 on the validation (test) set.

Training takes around 8 minutes per epoch on the VM, and the saved model takes up around 220 MB.

Loss, Accuracy & IOU Graphs

Result analysis of the best configuration (first configuration)

Conclusions

Looking at the train graphs of the first configuration, we can see that the model keeps improving the train metrics even after 25 epochs, but the validation metrics saturate already after 12 epochs. After 27 epochs the loss exploded and the results dropped drastically, but since the best epoch is epoch 12, this is of little consequence.

The second configuration exhibits a nice training curve, but when looking at the results of the two configurations, we can see that the first configuration is noticeably better than the second, both in the accuracy and in the IOU achieved on the validation set. Therefore, the final model we submit is the Faster R-CNN w/ Resnet50 model (first configuration).

In the Result analysis of the best configuration section we display 6 random images from the validation set, to get a feel for the model's performance. We can see that the model does a pretty good job of classifying and detecting the masks; however, on 'harder' images, such as images with peculiar obstructions or multiple people, it still makes mistakes.