Figure 3 illustrates a popular form of a CNN known as a VGG net 300. The initial convolution layer 302 stores the raw image pixels and the final pooling layer 320 determines the class scores. Each of the intermediate convolution layers (convolution layer 306, convolution layer 312, and convolution layer 316), rectifier activations (RELU layer 304, RELU layer 308, RELU layer 314, and RELU layer 318), and intermediate pooling layers (pooling layer 310, pooling layer 320) along the processing path is shown as a column.
The VGG net 300 replaces the large single-layer filters of basic CNNs with multiple 3x3 filters in series. For a given receptive field (the effective area of the input image on which an output depends), multiple stacked smaller filters may perform better at image feature classification than a single layer with a larger filter, because the multiple non-linear layers increase the depth of the network, which enables it to learn more complex features. In a VGG net 300, each pooling layer may be only 2x2.
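The relationship between stacked 3x3 filters and a single larger filter can be illustrated with a short calculation. The following is a minimal sketch, not part of the described embodiment. It assumes stride-1 convolutions, an equal number of input and output channels in every layer, and no bias terms. The function names and the channel width of 64 are hypothetical and chosen only for illustration.

```python
def stacked_receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Side length, in input pixels, of the receptive field of
    `num_layers` stacked stride-1 convolutions (assumed stride of 1)."""
    return 1 + num_layers * (kernel_size - 1)


def conv_weight_count(num_layers: int, kernel_size: int, channels: int) -> int:
    """Weights in a stack of convolution layers, each with `channels`
    input and output channels (assumption); bias terms are ignored."""
    return num_layers * kernel_size * kernel_size * channels * channels


if __name__ == "__main__":
    channels = 64  # hypothetical channel width, for illustration only
    for n in (1, 2, 3):
        field = stacked_receptive_field(n)
        stacked = conv_weight_count(n, 3, channels)
        single = conv_weight_count(1, field, channels)
        print(f"{n} stacked 3x3 layer(s): {field}x{field} receptive field, "
              f"{stacked} weights vs {single} for a single {field}x{field} layer")
```

Under these assumptions, three stacked 3x3 layers cover the same 7x7 receptive field as a single 7x7 layer while using fewer weights (27 versus 49 per channel pair) and applying three non-linearities instead of one.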