{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "C9HmC2T4ld5B" }, "source": [ "# Overfitting, underfitting and regularization" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "19rPukKZsPG6" }, "source": [ "Here, we will use the `tf.keras` API, which you can learn more about in the TensorFlow [Keras guide](https://www.tensorflow.org/guide/keras).\n", "\n", "It is important to know how to deal with overfitting. Although it's often possible to achieve high accuracy on the *training set*, what we really want is to develop models that generalize well to a *testing data* (or data they haven't seen before).\n", "\n", "The opposite of overfitting is *underfitting*. Underfitting occurs when there is still room for improvement on the test data. This can happen for a number of reasons: If the model is not powerful enough, is over-regularized, or has simply not been trained long enough. This means the network has not learned the relevant patterns in the training data. \n", "\n", "If you train for too long though, the model will start to overfit and learn patterns from the training data that don't generalize to the test data. We need to strike a balance. Understanding how to train for an appropriate number of epochs as we'll explore below is a useful skill.\n", "\n", "To prevent overfitting, the best solution is to use more training data. A model trained on more data will naturally generalize better. When that is no longer possible, the next best solution is to use techniques like regularization. These place constraints on the quantity and type of information your model can store. If a network can only afford to memorize a small number of patterns, the optimization process will force it to focus on the most prominent patterns, which have a better chance of generalizing well.\n", "\n", "In this notebook, we'll explore three common regularization techniques—weight regularization, dropout and batch normalization—and use them to improve our IMDB movie review classification notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "5pZ8A2liqvgk" }, "outputs": [], "source": [ "from __future__ import absolute_import, division, print_function\n", "\n", "import tensorflow as tf\n", "from tensorflow import keras\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "print(tf.__version__)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "1cweoTiruj8O" }, "source": [ "## Download the IMDB dataset\n", "\n", "Dataset of 25,000 movies reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer \"3\" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: \"only consider the top 10,000 most common words, but eliminate the top 20 most common words\".\n", "\n", "As a convention, \"0\" does not stand for a specific word, but instead is used to encode any unknown word.\n", "\n", "The inputs are multi-hot-encodings, this means we turn them into vectors of 0s and 1s. Concretely, this would mean for instance turning the sequence `[3, 5]` into a 10,000-dimensional vector that would be all-zeros except for indices 3 and 5, which would be ones. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "QpzE4iqZtJly" }, "outputs": [], "source": [ "NUM_WORDS = 10000\n", "\n", "(train_data, train_labels), (test_data, test_labels) = keras.datasets.imdb.load_data(num_words=NUM_WORDS)\n", "\n", "def multi_hot_sequences(sequences, dimension):\n", " # Create an all-zero matrix of shape (len(sequences), dimension)\n", " results = np.zeros((len(sequences), dimension))\n", " for i, word_indices in enumerate(sequences):\n", " results[i, word_indices] = 1.0 # set specific indices of results[i] to 1s\n", " return results\n", "\n", "\n", "train_data = multi_hot_sequences(train_data, dimension=NUM_WORDS)\n", "test_data = multi_hot_sequences(test_data, dimension=NUM_WORDS)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "MzWVeXe3NBTn" }, "source": [ "Let's look at one of the resulting multi-hot vectors. The word indices are sorted by frequency, so it is expected that there are more 1-values near index zero, as we can see in this plot:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "71kr5rG4LkGM" }, "outputs": [], "source": [ "plt.plot(train_data[0])" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "lglk41MwvU5o" }, "source": [ "## Under- and Overfitting\n", "\n", "The simplest way to prevent overfitting is to reduce the size of the model, i.e. the number of learnable parameters in the model (which is determined by the number of layers and the number of units per layer). In deep learning, the number of learnable parameters in a model is often referred to as the model's \"capacity\". Intuitively, a model with more parameters will have more \"memorization capacity\" and therefore will be able to easily learn a perfect dictionary-like mapping between training samples and their targets, a mapping without any generalization power, but this would be useless when making predictions on previously unseen data. \n", "\n", "Always keep this in mind: deep learning models tend to be good at fitting to the training data, but the real challenge is generalization, not fitting.\n", "\n", "On the other hand, if the network has limited memorization resources, it will not be able to learn the mapping as easily. To minimize its loss, it will have to learn compressed representations that have more predictive power. At the same time, if you make your model too small, it will have difficulty fitting to the training data. There is a balance between \"too much capacity\" and \"not enough capacity\".\n", "\n", "Unfortunately, there is no magical formula to determine the right size or architecture of your model (in terms of the number of layers, or the right size for each layer). You will have to experiment using a series of different architectures.\n", "\n", "To find an appropriate model size, it's best to start with relatively few layers and parameters, then begin increasing the size of the layers or adding new layers until you see diminishing returns on the validation loss. Let's try this on our movie review classification network. \n", "\n", "We'll create a simple model using only ```Dense``` layers as a baseline, then create smaller and larger versions, and compare them." 
] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "_ReKHdC2EgVu" }, "source": [ "### Create a baseline model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "QKgdXPx9usBa" }, "outputs": [], "source": [ "baseline_model = keras.Sequential([\n", " # `input_shape` is only required here so that `.summary` works. \n", " keras.layers.Dense(16, activation=tf.nn.relu, input_shape=(NUM_WORDS,)),\n", " keras.layers.Dense(16, activation=tf.nn.relu),\n", " keras.layers.Dense(1, activation=tf.nn.sigmoid)\n", "])\n", "\n", "baseline_model.compile(optimizer='adam',\n", " loss='binary_crossentropy',\n", " metrics=['accuracy', 'binary_crossentropy'])\n", "\n", "baseline_model.summary()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "LqG3MXF5xSjR" }, "outputs": [], "source": [ "baseline_history = baseline_model.fit(train_data,\n", " train_labels,\n", " epochs=20,\n", " batch_size=512,\n", " validation_data=(test_data, test_labels),\n", " verbose=2)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "L-DGRBbGxI6G" }, "source": [ "### Create a smaller model" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "SrfoVQheYSO5" }, "source": [ "Let's create a model with less hidden units to compare against the baseline model that we just created:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "jksi-XtaxDAh" }, "outputs": [], "source": [ "smaller_model = keras.Sequential([\n", " keras.layers.Dense(4, activation=tf.nn.relu, input_shape=(NUM_WORDS,)),\n", " keras.layers.Dense(4, activation=tf.nn.relu),\n", " keras.layers.Dense(1, activation=tf.nn.sigmoid)\n", "])\n", "\n", "smaller_model.compile(optimizer='adam',\n", " loss='binary_crossentropy',\n", " metrics=['accuracy', 'binary_crossentropy'])\n", "\n", "smaller_model.summary()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "jbngCZliYdma" }, "source": [ "And train the model using the same data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "Ofn1AwDhx-Fe" }, "outputs": [], "source": [ "smaller_history = smaller_model.fit(train_data,\n", " train_labels,\n", " epochs=20,\n", " batch_size=512,\n", " validation_data=(test_data, test_labels),\n", " verbose=2)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "vIPuf23FFaVn" }, "source": [ "### Create a bigger model\n", "\n", "As an exercise, you can create an even larger model, and see how quickly it begins overfitting. 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "ghQwwqwqvQM9" }, "outputs": [], "source": [ "bigger_model = keras.models.Sequential([\n", "    keras.layers.Dense(512, activation=tf.nn.relu, input_shape=(NUM_WORDS,)),\n", "    keras.layers.Dense(512, activation=tf.nn.relu),\n", "    keras.layers.Dense(1, activation=tf.nn.sigmoid)\n", "])\n", "\n", "bigger_model.compile(optimizer='adam',\n", "                     loss='binary_crossentropy',\n", "                     metrics=['accuracy', 'binary_crossentropy'])\n", "\n", "bigger_model.summary()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "D-d-i5DaYmr7" }, "source": [ "And, again, train the model using the same data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "U1A99dhqvepf" }, "outputs": [], "source": [ "bigger_history = bigger_model.fit(train_data, train_labels,\n", "                                  epochs=20,\n", "                                  batch_size=512,\n", "                                  validation_data=(test_data, test_labels),\n", "                                  verbose=2)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Fy3CMUZpzH3d" }, "source": [ "### Plot the training and validation loss" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "HSlo1F4xHuuM" }, "source": [ "The solid lines show the training loss, and the dashed lines show the validation loss (remember: a lower validation loss indicates a better model)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "0XmKDtOWzOpk" }, "outputs": [], "source": [ "def plot_history(histories, key='binary_crossentropy'):\n", "    plt.figure(figsize=(16, 10))\n", "\n", "    for name, history in histories:\n", "        val = plt.plot(history.epoch, history.history['val_' + key],\n", "                       '--', label=name.title() + ' Val')\n", "        plt.plot(history.epoch, history.history[key], color=val[0].get_color(),\n", "                 label=name.title() + ' Train')\n", "\n", "    plt.xlabel('Epochs')\n", "    plt.ylabel(key.replace('_', ' ').title())\n", "    plt.legend()\n", "\n", "    plt.xlim([0, max(history.epoch)])\n", "\n", "\n", "plot_history([('baseline', baseline_history),\n", "              ('smaller', smaller_history),\n", "              ('bigger', bigger_history)])" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Bi6hBhdnSfjA" }, "source": [ "Notice that the larger network begins overfitting almost right away, after just one epoch, and overfits much more severely. The more capacity the network has, the quicker it will be able to model the training data (resulting in a low training loss), but the more susceptible it is to overfitting (resulting in a large difference between the training and validation loss)."
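] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rather than guessing the right number of epochs, you can also stop training automatically once the validation loss stops improving. The cell below is an optional sketch, not part of the comparison above: it trains a fresh copy of the baseline architecture with Keras's `EarlyStopping` callback. The names `es_model`, `es_history`, and `early_stop` are just for this sketch, and the `patience` value of 2 epochs is an arbitrary choice." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sketch (not in the original experiments):\n", "# stop training when the validation loss has not improved for 2 epochs.\n", "early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=2)\n", "\n", "es_model = keras.Sequential([\n", "    keras.layers.Dense(16, activation=tf.nn.relu, input_shape=(NUM_WORDS,)),\n", "    keras.layers.Dense(16, activation=tf.nn.relu),\n", "    keras.layers.Dense(1, activation=tf.nn.sigmoid)\n", "])\n", "\n", "es_model.compile(optimizer='adam',\n", "                 loss='binary_crossentropy',\n", "                 metrics=['accuracy', 'binary_crossentropy'])\n", "\n", "es_history = es_model.fit(train_data, train_labels,\n", "                          epochs=20,\n", "                          batch_size=512,\n", "                          validation_data=(test_data, test_labels),\n", "                          callbacks=[early_stop],\n", "                          verbose=2)"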
] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "ASdv7nsgEFhx" }, "source": [ "## Regularization" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "4rHoVWcswFLa" }, "source": [ "### L2 weight regularization\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "HFGmcwduwVyQ" }, "outputs": [], "source": [ "l2_model = keras.models.Sequential([\n", " keras.layers.Dense(16, kernel_regularizer=keras.regularizers.l2(0.001),\n", " activation=tf.nn.relu, input_shape=(NUM_WORDS,)),\n", " keras.layers.Dense(16, kernel_regularizer=keras.regularizers.l2(0.001),\n", " activation=tf.nn.relu),\n", " keras.layers.Dense(1, activation=tf.nn.sigmoid)\n", "])\n", "\n", "l2_model.compile(optimizer='adam',\n", " loss='binary_crossentropy',\n", " metrics=['accuracy', 'binary_crossentropy'])\n", "\n", "l2_model_history = l2_model.fit(train_data, train_labels,\n", " epochs=20,\n", " batch_size=512,\n", " validation_data=(test_data, test_labels),\n", " verbose=2)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "bUUHoXb7w-_C" }, "source": [ "```l2(0.001)``` means that every coefficient in the weight matrix of the layer will add ```0.001 * weight_coefficient_value**2``` to the total loss of the network. Note that because this penalty is only added at training time, the loss for this network will be much higher at training than at test time.\n", "\n", "Here's the impact of our L2 regularization penalty:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "7wkfLyxBZdh_" }, "outputs": [], "source": [ "plot_history([('baseline', baseline_history),\n", " ('l2', l2_model_history)])" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Kx1YHMsVxWjP" }, "source": [ "As you can see, the L2 regularized model has become much more resistant to overfitting than the baseline model, even though both models have the same number of parameters." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "HmnBNOOVxiG8" }, "source": [ "### Dropout" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "OFEYvtrHxSWS" }, "outputs": [], "source": [ "dpt_model = keras.models.Sequential([\n", " keras.layers.Dense(16, activation=tf.nn.relu, input_shape=(NUM_WORDS,)),\n", " keras.layers.Dropout(0.5),\n", " keras.layers.Dense(16, activation=tf.nn.relu),\n", " keras.layers.Dropout(0.5),\n", " keras.layers.Dense(1, activation=tf.nn.sigmoid)\n", "])\n", "\n", "dpt_model.compile(optimizer='adam',\n", " loss='binary_crossentropy',\n", " metrics=['accuracy','binary_crossentropy'])\n", "\n", "dpt_model_history = dpt_model.fit(train_data, train_labels,\n", " epochs=20,\n", " batch_size=512,\n", " validation_data=(test_data, test_labels),\n", " verbose=2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "SPZqwVchx5xp" }, "outputs": [], "source": [ "plot_history([('baseline', baseline_history),\n", " ('dropout', dpt_model_history)])" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "gjfnkEeQyAFG" }, "source": [ "Adding dropout is a clear improvement over the baseline model. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Batch Normalization" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bn_model = keras.models.Sequential([\n", " keras.layers.Dense(16, activation=tf.nn.relu, input_shape=(NUM_WORDS,)),\n", " keras.layers.BatchNormalization(),\n", " keras.layers.Dense(16, activation=tf.nn.relu),\n", " keras.layers.BatchNormalization(),\n", " keras.layers.Dense(1, activation=tf.nn.sigmoid)\n", "])\n", "bn_model.compile(optimizer='adam',\n", " loss='binary_crossentropy',\n", " metrics=['accuracy','binary_crossentropy'])\n", "\n", "bn_model_history = dpt_model.fit(train_data, train_labels,\n", " epochs=20,\n", " batch_size=512,\n", " validation_data=(test_data, test_labels),\n", " verbose=2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_history([('baseline', baseline_history),\n", " ('batch normalization', bn_model_history)])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [ "fTFj8ft5dlbS" ], "name": "overfit-and-underfit.ipynb", "private_outputs": true, "provenance": [], "toc_visible": true, "version": "0.3.2" }, "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 1 }