Link to GitHub repository - 094295_hw2
In this project we have two main tasks.
The first is to perform exploratory data analysis (EDA).
The second is to predict, in different experiments, the bounding box and its label for each image.
Note that we hide the code cells so that the notebook stays clean. The full code is included in the repository.
In this phase we explore the data. We perform a basic analysis of the data, visualize some images, and provide insights from the data.
| | fileName | id | bbox | label | x | y | w | h | box_area | type | imageWidth | imageHeight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 000193__[6, 70, 87, 85]__True.jpg | 000193 | [6, 70, 87, 85] | True | 6 | 70 | 87 | 85 | 7395 | train | 224 | 168 |
1 | 013731__[31, 94, 140, 162]__False.jpg | 013731 | [31, 94, 140, 162] | False | 31 | 94 | 140 | 162 | 22680 | train | 224 | 224 |
2 | 008110__[63, 70, 105, 102]__False.jpg | 008110 | [63, 70, 105, 102] | False | 63 | 70 | 105 | 102 | 10710 | train | 224 | 224 |
3 | 009097__[59, 18, 13, 11]__False.jpg | 009097 | [59, 18, 13, 11] | False | 59 | 18 | 13 | 11 | 143 | train | 148 | 224 |
4 | 000225__[72, 70, 80, 69]__True.jpg | 000225 | [72, 70, 80, 69] | True | 72 | 70 | 80 | 69 | 5520 | train | 224 | 169 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3995 | 016921__[51, 112, 128, 138]__True.jpg | 016921 | [51, 112, 128, 138] | True | 51 | 112 | 128 | 138 | 17664 | test | 224 | 224 |
3996 | 017740__[16, 96, 77, 73]__False.jpg | 017740 | [16, 96, 77, 73] | False | 16 | 96 | 77 | 73 | 5621 | test | 224 | 134 |
3997 | 016435__[50, 96, 147, 147]__False.jpg | 016435 | [50, 96, 147, 147] | False | 50 | 96 | 147 | 147 | 21609 | test | 213 | 224 |
3998 | 017186__[70, 71, 53, 64]__False.jpg | 017186 | [70, 71, 53, 64] | False | 70 | 71 | 53 | 64 | 3392 | test | 224 | 218 |
3999 | 019067__[177, 58, 27, 30]__False.jpg | 019067 | [177, 58, 27, 30] | False | 177 | 58 | 27 | 30 | 810 | test | 148 | 224 |
20000 rows × 12 columns
Counts of the `type` column: train 16000, test 4000.
[Figure: bar plot of image counts by type (train/test)]
| | box_area | imageWidth | imageHeight |
|---|---|---|---|
count | 20000.0 | 20000.0 | 20000.0 |
mean | 5250.0 | 199.0 | 203.0 |
std | 5512.0 | 36.0 | 30.0 |
min | -100.0 | 70.0 | 58.0 |
25% | 1155.0 | 168.0 | 179.0 |
50% | 3480.0 | 224.0 | 224.0 |
75% | 7480.0 | 224.0 | 224.0 |
max | 49275.0 | 224.0 | 224.0 |
[Figure: histogram of box width]
[Figure: histogram of box height]
From the statistics and graphs above we can see that -
From the images above, we can see that there is a lot of variety between the images. The differences include -
From the images above, it seems that the bounding box labels are annotated according to the definition of a proper mask, although the bounding box annotations themselves are not perfect.
The main model & process we used is described below. Afterwards, we explain the changes we made for the second configuration.
Throughout our work we make use of the pytorch-lightning framework, which abstracts away a lot of the 'boilerplate' code that is usually involved in building & training neural networks.
It wraps around the regular pytorch framework and makes it easy to use advanced features and to avoid mistakes.
Additionally, we make use of torchvision for handling the images.
The cleaning and preprocessing steps we perform are rather simple -
First, we parse the image information (label & bounding box location) from the image file name.
The bounding boxes are defined as [$x_1$, $y_1$, $w$, $h$]. However, the model we use expects the bounding boxes to be in [$x_1$, $y_1$, $x_2$, $y_2$] format. Therefore we correct for this mismatch.
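As a rough illustration of these two steps, a filename such as `000193__[6, 70, 87, 85]__True.jpg` could be parsed and converted along the following lines (the helper name and exact parsing are ours, not necessarily the code used in the repository):

```python
import ast

def parse_filename(filename):
    """Illustrative sketch: parse '<id>__[x, y, w, h]__<label>.jpg' into its components."""
    image_id, bbox_str, label_str = filename.rsplit(".", 1)[0].split("__")
    x, y, w, h = ast.literal_eval(bbox_str)   # bbox as stored in the name: [x1, y1, w, h]
    bbox_xyxy = [x, y, x + w, y + h]          # convert to the [x1, y1, x2, y2] format the model expects
    label = label_str == "True"               # the 'proper mask' label
    return image_id, bbox_xyxy, label

print(parse_filename("000193__[6, 70, 87, 85]__True.jpg"))
# ('000193', [6, 70, 93, 155], True)
```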
Before loading the images into the model, they are converted to a scale of [0,1].
We note that as the images are of different sizes, we create a custom collate function that handles this mismatch properly.
For loading the data into the model we make use of pytorch's datasets and dataloaders and of pytorch-lightning's datamodule. All of these help to load the images efficiently into memory, when needed.
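A minimal sketch of this pipeline, under our own naming (the actual classes in the repository and the directory path are placeholders): the dataset scales each image to [0, 1] via `to_tensor`, and the custom collate function keeps images and targets as lists instead of stacking tensors of mismatched sizes.

```python
import os
import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision.transforms import functional as F

class MaskDataset(Dataset):
    """Illustrative dataset that reads images and parses their targets from the file names."""

    def __init__(self, image_dir):
        self.image_dir = image_dir
        self.filenames = sorted(os.listdir(image_dir))

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        filename = self.filenames[idx]
        image = Image.open(os.path.join(self.image_dir, filename)).convert("RGB")
        image = F.to_tensor(image)  # float tensor scaled to [0, 1]
        _, bbox_xyxy, label = parse_filename(filename)  # helper sketched above
        target = {
            "boxes": torch.tensor([bbox_xyxy], dtype=torch.float32),
            "labels": torch.tensor([int(label) + 1]),  # assumed convention: 0 is background
        }
        return image, target

def collate_fn(batch):
    # Images are of different sizes, so return lists rather than stacked tensors
    return tuple(zip(*batch))

train_loader = DataLoader(MaskDataset("data/train"), batch_size=32,
                          shuffle=True, collate_fn=collate_fn)
```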
For the model architecture we chose to use the widely popular Faster R-CNN.
It is a fast, end-to-end framework for object detection that uses deep convolutional networks.
The architecture was proposed in the game-changing article - Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
The architecture consists of three parts.
The first part extracts features from the raw image using a CNN module.
In our case we use Resnet50 for this module.
The second part is a Region Proposal Network (RPN). This is a small neural network that goes over the last feature map of the previous module and predicts whether there is an object in that area and, if so, also proposes bounding boxes.
The third part uses a fully connected neural network (the ROI head), which takes as input the regions proposed by the RPN and predicts the object class (classification) and the bounding boxes (regression).
Implementation-wise, we use torchvision's implementation of Faster R-CNN, with slight modifications as described below. Torchvision's implementation of Faster R-CNN has two operating modes - training and evaluation. When training, the model returns only the losses, and when evaluating - only the bounding boxes and predictions. As we also need the outputs during training and the losses during evaluation, we modified the torchvision code. The modified code is hosted in a public GitHub fork of the original implementation.
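For reference, with the stock (unmodified) torchvision builder the model can be constructed roughly as follows; `num_classes=3` reflects our assumption of two foreground classes (proper / not proper mask) plus background:

```python
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(
    pretrained=False,             # no pretrained detector weights
    pretrained_backbone=False,    # no pretrained backbone weights or external data
    num_classes=3,                # background + not-proper mask + proper mask (assumed mapping)
    min_size=224,                 # resizing bounds applied to the input images
    max_size=224,
    box_detections_per_img=1,     # each image contains exactly one box
)
```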
As the architecture has multiple modules, all of which need to be trained, we make use of losses from both the RPN module and the ROI module. From the RPN module we get 2 losses -
These losses take into account the proposals.
Similarly, from the ROI module we get 2 losses -
These losses take into account the final predicted bounding box.
From both modules, the classification loss is between predicted 'proper mask' and the ground truth label, and the bounding box regression loss is between the predicted bounding boxes and the ground truth bounding boxes.
Finally, we sum all 4 losses.
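In torchvision's Faster R-CNN these four losses come back as a dictionary in training mode, so the summation is a one-liner (a sketch, using the model and data loader from the snippets above):

```python
model.train()
images, targets = next(iter(train_loader))
# The returned dict holds 'loss_objectness' and 'loss_rpn_box_reg' from the RPN,
# and 'loss_classifier' and 'loss_box_reg' from the ROI head
loss_dict = model(list(images), list(targets))
loss = sum(loss_dict.values())  # total loss used for backpropagation
loss.backward()
```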
We make use of the Adam optimizer, with initial learning rate of 1e-3 and default parameters. In this initial configuration there is no explicit regularization.
However, as we ran into issues where the model was not learning, probably due to exploding gradients, we added gradient clipping, which puts an upper threshold on the gradient values.
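A sketch of how this looks with pytorch-lightning, assuming a LightningModule wrapper of our own naming around the detector:

```python
import pytorch_lightning as pl
import torch

class MaskDetector(pl.LightningModule):
    """Illustrative LightningModule wrapping the Faster R-CNN model."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def training_step(self, batch, batch_idx):
        images, targets = batch
        loss_dict = self.model(list(images), list(targets))
        return sum(loss_dict.values())

    def configure_optimizers(self):
        # Adam with default parameters and an initial learning rate of 1e-3
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# gradient_clip_val caps the gradients to mitigate the exploding-gradient issue
trainer = pl.Trainer(max_epochs=27, gradient_clip_val=10)
trainer.fit(MaskDetector(model), train_loader)
```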
For selecting and evaluating the model with different hyperparameters, we train the model on the train set and verify the results on the validation set (called the test images).
There are hyperparameters related to the optimization process, and ones related to the actual model. Notably, we set the `box_detections_per_img` flag to a value of 1, as each image contains a single bounding box. We note that we do not make use of any pretrained models or external data.
As described above, for the first configuration we made use of the Faster R-CNN architecture, with a resnet50 CNN module.
The model has 41.3 million trainable weights.
For the final Faster R-CNN w/ Resnet50 model we used the following parameters:
Params | Values |
---|---|
Batch size | 32 |
Epochs | 12 (out of 27) |
Min size image | 224 |
Max size image | 224 |
Optimizer | Adam |
Initial learning rate | 1e-3 |
Gradient clip val | 10 |
This model achieved an accuracy of 0.7535 and an IOU of 0.5776 on the validation (test) set.
The model training time is around 16 minutes per epoch on the VM, and the model takes up around 500 MB.
Next we describe the second configuration that we tested out.
The main difference from the first configuration is that we replaced the CNN module, swapping ResNet50 for MobileNetV3-Large.
This architecture was first proposed in the article Searching for MobileNetV3.
Replacing the CNN module also required us to adapt the anchor generator and the box ROI pooler, which are part of the other two modules.
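A sketch of what this adaptation could look like with torchvision's building blocks; the anchor sizes, aspect ratios, and pooler settings below are assumptions rather than the exact values we used:

```python
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

# MobileNetV3-Large feature extractor; its final feature map has 960 channels
backbone = torchvision.models.mobilenet_v3_large(pretrained=False).features
backbone.out_channels = 960

# The RPN and ROI head need to be told how to work with this single feature map
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
roi_pooler = MultiScaleRoIAlign(featmap_names=["0"], output_size=7, sampling_ratio=2)

model = FasterRCNN(backbone, num_classes=3,
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_pooler,
                   min_size=224, max_size=224,
                   box_detections_per_img=1)
```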
The model has 18.9 million trainable weights.
For the final Faster R-CNN w/ mobilenet v3 large model we used the following parameters:
Params | Values |
---|---|
Batch size | 32 |
Epochs | 9 (out of 15) |
Min size image | 224 |
Max size image | 224 |
Optimizer | Adam |
Initial learning rate | 1e-3 |
Gradient clip val | 10 |
This model achieved an accuracy of 0.5817 and an IOU of 0.4004 on the validation (test) set.
The model training time is around 8 minutes per epoch on the VM, and the model takes up around 220 MB.
Looking at the training graphs of the first configuration, we can see that the model keeps improving the train metrics even after 25 epochs, but the validation metrics saturate already after 12 epochs. After 27 epochs the loss exploded and the results dropped drastically, but as the best epoch is epoch 12 this is less important.
The second configuration exhibits a nice training curve, but when looking at the results of the two configurations, we can see that the first configuration is noticeably better than the second, both in the achieved accuracy and in the IOU on the validation set. Therefore, the final model we submit is the Faster R-CNN w/ Resnet50 model (first configuration).
In the Result analysis of the best configuration section we display 6 random images from the validation set, to get a feel for the model's performance. We can see that the model does a pretty good job in classifying and detecting the masks; however, on 'harder' images, such as images with peculiar obstructions or multiple people, it still makes mistakes.