Transfer learning to generalize with DenseNet

Michael George
8 min read · Feb 20, 2021
The DenseNet 121 architecture

Abstract:

Transfer learning can be a powerful tool when training a neural network for a new purpose while taking advantage of previously trained networks and their weights. It is accomplished by taking a pre-trained model, freezing most of its layers and weights in place, and then either unfreezing the last layers or adding new layers that you train on your new data set. The number of layers that must be added to get useful results is often quite small, so transfer learning can yield large computational savings. However, transfer learning is only useful if the original model’s data is sufficiently similar to the new data you wish to train on. By implementing transfer learning with DenseNet121, I demonstrate its effectiveness and the steps that must be taken to produce, from a previously trained model, a new neural network that can effectively classify data from the CIFAR-10 dataset.

Introduction

Convolutional neural networks can be thought of as two-stage networks: first the convolution layers identify and separate image features, then the densely connected layers use those features as input to predict a classification. DenseNet is a popular network built on this idea, with the variation that some convolution layers have extra “dense” connections that can greatly aid prediction accuracy. In particular, DenseNet has demonstrated very strong results on the ImageNet dataset. Because ImageNet is large compared to the CIFAR-10 dataset that we wish to classify, transfer learning may be an effective approach.

We can take the weights already trained in the convolution layers of the DenseNet model, build new densely connected layers on top, and train only those new layers with the CIFAR-10 dataset. Because ImageNet is large, the convolution layers should be able to identify features of the CIFAR-10 images even though the network was never trained on them.

Materials and Methods

DenseNet comes in several versions; here we will use DenseNet121, meaning it has a depth of 121 layers. Those layers vary in purpose: some are convolution layers, while others handle pooling, concatenation of feature maps, dense connections, and so on. In total the DenseNet121 model has roughly 8 million parameters and can achieve above 90% accuracy on the ImageNet dataset. By freezing the DenseNet model and using transfer learning, those 8 million parameters are never retrained; only the comparatively small set of parameters in the new layers we add on top needs to be trained.
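As a rough sketch (assuming the standard tf.keras.applications interface), loading DenseNet121 and freezing it takes only a couple of lines:

```python
import tensorflow as tf

# Load DenseNet121 with its ImageNet weights; the full model has
# roughly 8 million parameters.
model = tf.keras.applications.DenseNet121(weights='imagenet')
print(f"Total parameters: {model.count_params():,}")

# Freezing the model marks every pre-trained weight as non-trainable,
# so only the layers we add later will be updated during training.
model.trainable = False
```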

DenseNet121 has a TensorFlow Keras implementation in Python, which makes the model easy to use. The CIFAR-10 dataset can also be loaded with methods built into the tensorflow.keras module. To make the CIFAR-10 data usable by DenseNet, however, we have to preprocess it into the right format: the labels must be transformed into one-hot matrices using the to_categorical function, and the images must be run through DenseNet’s preprocess_input function.
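A minimal preprocessing sketch, assuming the standard tf.keras loaders, might look like this:

```python
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.applications.densenet import preprocess_input

# Load CIFAR-10: 50,000 training and 10,000 test images of shape (32, 32, 3).
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

# Scale and center pixel values the way the DenseNet weights expect.
x_train = preprocess_input(x_train.astype('float32'))
x_test = preprocess_input(x_test.astype('float32'))

# Convert integer labels (0-9) into one-hot vectors for the softmax output.
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)
```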

When loading the DenseNet121 model, the densely connected layers are referred to as the “top,” so we must set include_top to False in order to use just the convolutional layers and then build our own dense layers. Because there are already many convolution layers, we need very few dense layers. We can get good results from a single dense layer with 256 nodes, but there is a worthwhile accuracy increase if we use two layers, the first with 512 nodes and the second with 128. Each of those layers uses the rectified linear activation function, or “relu.” After those two dense layers comes the output layer, which uses softmax activation and has only 10 nodes because the CIFAR-10 dataset has 10 classes. At that point we have our basic model, although its performance will still be lacking.

DenseNet natively uses images of size (224, 224, 3), whereas the CIFAR-10 dataset uses smaller images of only (32, 32, 3). Many of the pictures are of the same subjects, such as birds and vehicles, but because of this size difference the convolutional layers will have more trouble identifying features. There are two ways to address this. The first is to specify, when setting up our DenseNet121 base, that the input should have shape (32, 32, 3); because the model was trained on larger images, this significantly hurts accuracy. The better option is to add a layer before the DenseNet convolutions that resizes the images to DenseNet’s native size (conveniently, 224 is exactly seven times 32).
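Putting the last two paragraphs together, a minimal sketch of the model might look like the following. The pooling='avg' global-average-pooling step and the Lambda resize layer are my particular choices here; an UpSampling2D layer with a factor of seven would work just as well for the resize.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import DenseNet121

# Frozen DenseNet121 convolutional base; include_top=False drops the
# original ImageNet classification head, and pooling='avg' collapses the
# final feature maps into a 1024-value vector for our dense head.
base = DenseNet121(include_top=False, weights='imagenet',
                   input_shape=(224, 224, 3), pooling='avg')
base.trainable = False

model = models.Sequential([
    layers.InputLayer(input_shape=(32, 32, 3)),
    # Resize the 32x32 CIFAR-10 images up to DenseNet's native 224x224
    # (a factor of exactly seven in each dimension).
    layers.Lambda(lambda img: tf.image.resize(img, (224, 224))),
    base,
    # New trainable classification head.
    layers.Dense(512, activation='relu'),
    layers.Dense(128, activation='relu'),
    layers.Dense(10, activation='softmax'),  # one node per CIFAR-10 class
])
```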

With just that we could expect fairly good results, but we also want to watch out for overfitting. To avoid it, we implement several strategies that improve accuracy with minimal impact on computation time or complexity. The first is adding two dropout layers, one after each of our dense layers; dropout improves results with very limited performance cost, if any. It is also useful to add some callbacks during training, such as early stopping and learning-rate reduction, both of which can help mitigate overfitting.
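A sketch of how those pieces might be wired up with Keras callbacks follows; the dropout rate, patience values, optimizer, batch size, and epoch count here are illustrative choices, not tuned settings.

```python
from tensorflow.keras import callbacks

# (The dropout layers slot directly into the head above, e.g.
#  layers.Dropout(0.3) immediately after each of the two hidden Dense
#  layers; the 0.3 rate is just a starting point.)

# Stop training once validation accuracy stops improving, and cut the
# learning rate when validation loss plateaus.
early_stop = callbacks.EarlyStopping(monitor='val_accuracy', patience=3,
                                     restore_best_weights=True)
reduce_lr = callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                                        patience=2, min_lr=1e-6)

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    validation_data=(x_test, y_test),
                    epochs=10,
                    batch_size=128,
                    callbacks=[early_stop, reduce_lr])
```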

Other measures were also taken, such as fine-tuning and data augmentation, but they showed very little accuracy improvement relative to their cost; those are discussed in the alternative methods and discussion sections below.

Results

Using transfer learning with all of the above specifications, we achieved 88.5% validation accuracy on the CIFAR-10 dataset. This figure is quite impressive given that we do no training of the DenseNet121 model itself, and our results are almost as good as the 92.3% accuracy that DenseNet121 achieves on the original ImageNet dataset. That similarity demonstrates the effectiveness of transfer learning.

Perhaps more impressive is that this model achieves these results in only 3–5 epochs. Thanks to the pre-trained weights, transfer learning requires very few epochs to get comparable results. If the base DenseNet121 were not frozen, it would likely take many more epochs to reach the same accuracy, and each epoch would take roughly 40 times as long (timed on GPU-accelerated systems, so the difference would be much larger on systems without a GPU).

Additionally, the training accuracy was very similar to the validation accuracy, meaning there was likely little overfitting and that our mitigation methods were effective. Looking at the specifics, however, they may not all have been necessary. When run for a larger number of epochs, early stopping doesn’t occur until very late, when we’ve already achieved good results. Learning-rate reduction happened more often but usually led to very little real improvement; the 88.5% validation accuracy was already reached before any reduction in the learning rate occurred. The most effective method proved to be the dropout layers. Because they are always active, unlike learning-rate reduction or early stopping, dropout layers help in every epoch. That said, the difference was significant but not huge: without dropout, validation accuracy was 0.5% to 1.5% lower.

The similarity between training and validation accuracy could also mean there are further routes for improvement by raising training accuracy, such as adding more dense layers or adding more nodes to the existing ones to make “wider” layers.

Alternative Methods:

As stated previously, it was necessary to resize the images to fit DenseNet121’s native image size. When non-native sizing was used (loading DenseNet with an alternate input size), the results took a significant hit: instead of 88.5%, the model only reached 83% even with optimization for the different image size, and with no such optimization the best result was only 80%. Notably, the smaller images made training much faster, up to ten times faster per epoch in some cases. But even with the speed increase, the accuracy never reached the higher levels, even when trained for many more epochs. The original model was trained to identify features at the larger image size, and that did not translate well to the smaller size.

Another method attempted was data augmentation: increasing the effective size of the training set by manipulating the image data, such as zooming, cropping, rotating, mirroring, or scaling in some direction. Unfortunately, this didn’t prove particularly effective; it showed no improvement in accuracy but did affect speed. With data augmentation, the model trained more slowly (roughly twice as long per epoch) and each epoch made significantly less progress on the metrics. Data augmentation amounts to manipulating the image data, which, because of the size of the images, is a large performance hit. It’s possible data augmentation could be more useful over many more epochs, but most likely it is a technique best reserved for smaller datasets, and CIFAR-10 is large enough not to need it.
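For reference, the kind of augmentation tried looked roughly like this sketch; the specific transform ranges are illustrative rather than tuned values.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative augmentation: small rotations, shifts, zooms, and
# horizontal flips applied randomly to the training images.
datagen = ImageDataGenerator(rotation_range=15,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             zoom_range=0.1,
                             horizontal_flip=True)

# Each epoch now trains on randomly transformed copies of the images,
# which is where the extra per-epoch cost comes from.
history = model.fit(datagen.flow(x_train, y_train, batch_size=128),
                    validation_data=(x_test, y_test),
                    epochs=10,
                    callbacks=[early_stop, reduce_lr])
```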

The last method attempted was fine-tuning. Fine-tuning in transfer learning means temporarily unfreezing the original base model (making all 8+ million parameters trainable) for one or two epochs to improve feature identification on the new dataset. Unfortunately, the performance cost could not be justified: the time per epoch increased dramatically (roughly 100 times longer for the full-size images, and about 10 times longer for the smaller image size). With full-size images, fine-tuning meant many millions more trainable parameters, and the larger images exhausted system memory; the systems could not handle the load, so those results could not be objectively evaluated. Using the smaller (32, 32, 3) image data instead of resizing showed some promise with fine-tuning; accuracy improved noticeably, but not nearly enough to match full-size images without fine-tuning.
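In principle, the fine-tuning pass looks like the sketch below; the learning rate, epoch count, and batch size are illustrative values.

```python
import tensorflow as tf

# Fine-tuning: unfreeze the DenseNet121 base so all of its parameters
# become trainable, then retrain briefly at a much lower learning rate
# so the pre-trained features only shift slightly.
base.trainable = True

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# One or two epochs; every one of the ~8 million base parameters is now
# being updated, which is why each epoch is so much more expensive.
model.fit(x_train, y_train,
          validation_data=(x_test, y_test),
          epochs=2,
          batch_size=32)  # smaller batches to ease the memory pressure
```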

Discussion

Transfer learning with DenseNet121’s ImageNet weights to classify CIFAR-10 data proved to be a hugely advantageous strategy. To build a CIFAR-10 classifier with DenseNet121, you could either train entirely from scratch, which would take far longer and in many cases use prohibitively more memory, or use transfer learning from the ImageNet weights, train only a few layers’ parameters at the top of the model, and get good results much faster. Transfer learning is particularly attractive because of the abundance of data: ImageNet is one especially popular dataset, but there are many others for other applications whose trained weights can be found online and from various data sources.

Specific transfer learning techniques merit more research. Techniques such as dropout layers and image resizing were shown to be very effective here, while others, such as data augmentation, proved ineffective in this case. Fine-tuning showed a lot of promise but, due to hardware limitations, could not be explored further. A natural next investigation would be to unfreeze only the layers nearest our dense layers, rather than the whole DenseNet121 model, and compare the resulting performance.

Transfer learning delivers improvements not only in total training time, but also in training time per epoch and in the number of epochs needed; we used comparatively very few epochs to get very good results. If a similar dataset is available, transfer learning should be considered the most logical starting point for training a new model. I plan to keep testing the results here to build on the insights gained.


Michael George

Cornell University Engineering and Holberton software engineering schools. Studied operations research and information engineering; now learning machine learning.