Transfer learning to generalize with DenseNet

[Figure: The DenseNet121 architecture]

Abstract:

Transfer learning lets us reuse the convolutional layers of a DenseNet121 model pre-trained on ImageNet and train only a small set of new dense layers on CIFAR-10. With the base frozen, the images resized to DenseNet's native input size, and dropout plus training callbacks to limit overfitting, the model reaches roughly 88.5% validation accuracy in only 3 to 5 epochs, while alternatives such as data augmentation and full fine-tuning proved not worth their cost on this dataset.

Introduction

We can reuse the weights of the DenseNet model's convolutional layers, which were pre-trained on ImageNet, then build new densely connected layers on top and train only those new layers on the CIFAR-10 dataset. Because ImageNet is large and varied, the convolutional layers should be able to identify useful features in CIFAR-10 images even though they were never trained on that dataset.

Materials and Methods

DenseNet121 has a TensorFlow Keras implementation in Python, which makes the model easy to use. Additionally, the CIFAR-10 dataset can be loaded with methods native to Python's tensorflow.keras module. However, to make use of the CIFAR-10 data we have to preprocess it into the format DenseNet expects. To do that we transform the CIFAR-10 labels into one-hot matrices using the to_categorical function, and when we load the DenseNet121 model we use the DenseNet preprocess_input function to process the images.
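A rough sketch of that preprocessing step (assuming TensorFlow 2.x; the variable names are just illustrative):

```python
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.applications.densenet import preprocess_input

# Load CIFAR-10 directly from tensorflow.keras
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

# Scale and normalize the images the way DenseNet expects
x_train = preprocess_input(x_train.astype('float32'))
x_test = preprocess_input(x_test.astype('float32'))

# Turn the integer labels into one-hot matrices (10 classes)
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)
```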

When loading the DenseNet121 model, the densely connected layers are referred to as the "top", so we must set include_top to False in order to use just the convolutional layers and then build our own dense layers. Because there are already many convolutional layers, we need very few dense layers: we can get good results from just one dense layer with 256 nodes, but there is a worthwhile performance increase from using two layers, the first with 512 nodes and the second with only 128 nodes. Each of those layers uses the rectified linear activation function ("relu"). After those two dense layers comes the output layer, which uses softmax activation and has only 10 nodes because the CIFAR-10 dataset has 10 classes. At that point we have our basic model, although its performance will still be lacking.
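A minimal sketch of that setup, keeping the convolutional base frozen and stacking the dense head on top. The layer sizes mirror the description above; pooling='avg' is an assumption used here to flatten the convolutional features, and the images are resized up to the base's (224, 224, 3) input in the next step:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import DenseNet121

# Pre-trained convolutional base only; the "top" (dense) layers are excluded
base = DenseNet121(include_top=False, weights='imagenet',
                   input_shape=(224, 224, 3), pooling='avg')
base.trainable = False  # train only the new dense layers

# New densely connected layers for CIFAR-10
head = models.Sequential([
    layers.Dense(512, activation='relu'),
    layers.Dense(128, activation='relu'),
    layers.Dense(10, activation='softmax'),  # 10 CIFAR-10 classes
])
```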

DenseNet natively uses images of size (224, 224, 3), whereas the CIFAR-10 dataset uses much smaller images of only (32, 32, 3). Many of the pictures are of the same kinds of subjects, such as birds and vehicles, but because of this size difference the convolutional layers will have more trouble identifying features. There are two ways to handle this. First, when setting up our DenseNet121 base, we can specify that the input should have shape (32, 32, 3); but because the model was trained on larger images, this greatly hurts performance. Instead, we add a layer before the DenseNet convolutions that resizes the images to the native DenseNet size (conveniently, 224 is exactly 7 times 32).
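One way to sketch that resizing step, wiring it in front of the frozen base and dense head from the previous sketch. UpSampling2D simply repeats each pixel 7 times in each direction (32 × 7 = 224); a Resizing layer would work similarly, and the optimizer choice here is an assumption:

```python
# Resize CIFAR-10 images up to DenseNet121's native input size
inputs = layers.Input(shape=(32, 32, 3))
x = layers.UpSampling2D(size=(7, 7))(inputs)   # (32, 32, 3) -> (224, 224, 3)
x = base(x, training=False)                    # frozen convolutional features
outputs = head(x)                              # the new dense layers
model = models.Model(inputs, outputs)

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```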

With just that we could expect pretty good results, but we also need to watch out for overfitting. To avoid it we implement several strategies that improve accuracy with minimal impact on computation time or complexity. The first is adding two dropout layers, one after each of our dense layers; dropout improves results with very limited performance cost, if any at all. It is also useful to add some training callbacks, such as early stopping and learning rate reduction, both of which can help mitigate overfitting.
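A sketch of those additions; the dropout rates and the callback patience and factor values below are illustrative assumptions, not the article's exact settings:

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Rebuild the dense head with a dropout layer after each dense layer,
# then re-wire and compile the model as in the previous sketch
head = models.Sequential([
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(10, activation='softmax'),
])

# Callbacks that stop training early or shrink the learning rate
# when the validation metrics stop improving
callbacks = [
    EarlyStopping(monitor='val_accuracy', patience=3, restore_best_weights=True),
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2),
]

model.fit(x_train, y_train,
          validation_data=(x_test, y_test),
          epochs=10, batch_size=64,
          callbacks=callbacks)
```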

Other measures were also taken, such as fine-tuning and data augmentation, but they showed very little performance increase relative to their cost; we discuss those in the Results and Discussion sections.

Results

The final model reaches roughly 88.5% validation accuracy on CIFAR-10. Perhaps more impressive is that it achieves these results in only 3–5 epochs: thanks to the pre-trained weights, transfer learning needs very few epochs to reach comparable results. If the base DenseNet121 were not frozen, it would likely take many more epochs to get similar results, and each epoch would also take roughly 40 times as long (timings were measured on GPU-accelerated systems, so the difference would be much larger on systems without a GPU).

Additionally, the training accuracy was very similar to the validation accuracy, which suggests there was little overfitting and that our mitigation methods were effective. However, when we look at the specifics, that mitigation may not have been entirely necessary. When training runs for a larger number of epochs, early stopping does not trigger until very late, after good results have already been achieved. Learning rate reduction happened more often but led to very little real performance gain; the 88.5% validation accuracy was already reached before any reduction in the learning rate occurred. The most effective measure proved to be the dropout layers. Because they are always active, unlike learning rate reduction or early stopping, dropout layers help in every epoch. That said, the difference was meaningful but not huge: without dropout, validation accuracy was 0.5% to 1.5% lower.

This similarity in accuracies could also mean there are further routes for increasing accuracy by raising training accuracy, such as adding more dense layers or adding more nodes to the existing ones to make "wider" layers.

Alternative Methods:

As stated previously, it was necessary to resize the images to DenseNet121's native input size. When the native size was not used (loading DenseNet with the alternate input shape instead), the results took a significant hit: instead of the high of 88.5%, the model was only able to reach about 83% even after tuning for the smaller image size, and with all else equal (no tuning for the different size) the best result was only 80%. Notably, the smaller images made training much faster, up to ten times faster per epoch in some cases. But even with the speed increase, accuracy never reached the higher levels, even when trained for many more epochs. The original model was trained to identify features at the larger image size, and that did not translate well to the smaller size.
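For reference, the non-resizing alternative amounts to loading the base at CIFAR-10's native shape and attaching the dense head as before (a sketch, continuing the earlier assumptions):

```python
# Alternative: feed 32x32 images directly, with no resizing layer in front
small_base = DenseNet121(include_top=False, weights='imagenet',
                         input_shape=(32, 32, 3), pooling='avg')
small_base.trainable = False
```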

Another method attempted was data augmentation: increasing the effective size of the training set by manipulating the image data, for example zooming, cropping, rotating, mirroring, or shifting. Unfortunately this did not prove particularly effective; it showed no improvement in accuracy, but it did hurt speed. With data augmentation the model trained more slowly (roughly twice as long per epoch) and each epoch made significantly less progress on the metrics. Data augmentation amounts to manipulating the image data, which, given the number of images, is a large performance hit. It is possible data augmentation would be more useful over many more epochs, but it is most likely a technique better reserved for smaller datasets, whereas CIFAR-10 is large enough not to need it.
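A sketch of the kind of augmentation pipeline tried here, using Keras's ImageDataGenerator; the specific ranges are illustrative assumptions:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Randomly rotate, shift, zoom, and mirror the training images on the fly
augmenter = ImageDataGenerator(rotation_range=15,
                               width_shift_range=0.1,
                               height_shift_range=0.1,
                               zoom_range=0.1,
                               horizontal_flip=True)

model.fit(augmenter.flow(x_train, y_train, batch_size=64),
          validation_data=(x_test, y_test),
          epochs=10)
```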

The last method attempted was fine-tuning. Fine-tuning in transfer learning means temporarily unfreezing the original base model (so all 8+ million parameters become trainable) for one or two epochs to improve feature identification on the new dataset. Unfortunately the cost could not be justified. The time per epoch increased dramatically (roughly 100 times longer for full-size images, and about 10 times longer for the smaller image size). Using full-size images with fine-tuning meant many millions more trainable parameters, and the larger images exhausted system memory; the system could not handle the load, so those results could not be objectively evaluated. Using the smaller (32, 32, 3) images instead of resizing showed some promise with fine-tuning, as results improved noticeably, but not nearly enough to match full-size images without fine-tuning.
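A sketch of a fine-tuning pass as described: unfreeze the base, recompile with a much smaller learning rate (the rate and epoch count here are assumptions), and train briefly:

```python
from tensorflow.keras.optimizers import Adam

# Unfreeze the pre-trained convolutional base for a short fine-tuning pass
base.trainable = True

# Recompile with a low learning rate so the pre-trained weights
# are only nudged, not destroyed
model.compile(optimizer=Adam(learning_rate=1e-5),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train,
          validation_data=(x_test, y_test),
          epochs=2, batch_size=32)
```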

Discussion

Specific transfer learning techniques merit more research. Techniques such as dropout layers and image resizing were shown to be very effective here, but others, such as data augmentation, proved ineffective for this case. Fine-tuning showed a lot of promise but could not be explored further due to hardware limitations. A worthwhile point of investigation would be to unfreeze only the layers nearest our dense layers, rather than the whole DenseNet121 model, and compare the resulting performance.
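A minimal sketch of that partial-unfreezing idea; the number of layers left trainable is an arbitrary assumption:

```python
# Unfreeze only the last few layers of the base, nearest the dense head
base.trainable = True
for layer in base.layers[:-20]:
    layer.trainable = False   # keep the earlier convolutional layers frozen

# Then recompile with a low learning rate, as in the fine-tuning sketch,
# and train for a couple of epochs
```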

Transfer learning demonstrates improvements not only in total training time, but also in training time per epoch and in the number of epochs required. We used comparatively few epochs to get very good results. If a model pre-trained on a similar dataset is available, transfer learning should be considered the most logical starting point for training a new model. I plan to test these results further to improve upon the insights gained.

Michael George

Cornell University Engineering and Holberton software engineering schools. Studied operations research and information engineering; now learning machine learning.