|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": { |
| 6 | + "colab_type": "text", |
| 7 | + "id": "NqtXrZApBkM4" |
| 8 | + }, |
| 9 | + "source": [ |
| 10 | + "# Optimizing training and inference\n", |
| 11 | + "\n", |
| 12 | + "In this notebook, we will discuss different ways to reduce memory and compute usage during training and inference." |
| 13 | + ] |
| 14 | + }, |
| 15 | + { |
| 16 | + "cell_type": "markdown", |
| 17 | + "metadata": { |
| 18 | + "colab_type": "text", |
| 19 | + "id": "NEt8wg4JCQdm" |
| 20 | + }, |
| 21 | + "source": [ |
| 22 | + "## Prepare training script\n", |
| 23 | + "\n", |
| 24 | + "When training large models, it is usually a best practice not to use Jupyter notebooks, but run a **separate script** for training which could have command-line flags for various hyperparameters and training modes. This is especially useful when you need to run multiple experiments simultaneously (e.g. on a cluster with task scheduler). Another advantage of this is that after training, the process will finish and free the resources for other users of a shared GPU.\n", |
| 25 | + "\n", |
| 26 | + "In this part, you will need to put all your code to train a model on Tiny ImageNet that you wrote for the previous task in `train.py`.\n", |
| 27 | + "\n", |
| 28 | + "You can then run your script from inside of this notebook like this:" |
| 29 | + ] |
| 30 | + }, |
| 31 | + { |
| 32 | + "cell_type": "code", |
| 33 | + "execution_count": null, |
| 34 | + "metadata": { |
| 35 | + "colab": {}, |
| 36 | + "colab_type": "code", |
| 37 | + "id": "6-TWiKq8H9yT" |
| 38 | + }, |
| 39 | + "outputs": [], |
| 40 | + "source": [ |
| 41 | + "!python3 train.py --flag --some_parameter <its value>" |
| 42 | + ] |
| 43 | + }, |
| 44 | + { |
| 45 | + "cell_type": "markdown", |
| 46 | + "metadata": {}, |
| 47 | + "source": [ |
| 48 | + "**Task** \n", |
| 49 | + "\n", |
| 50 | + "Write code for training with architecture from homework_part2\n", |
| 51 | + "\n", |
| 52 | + "**Requirements**\n", |
| 53 | + "* Optional arguments from command line such as batch size and number of epochs with built-in argparse\n", |
| 54 | + "* Modular structure - separate functions for creating data generator, building model and training \n" |
| 55 | + ] |
| 56 | + }, |
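| | +  { |
| | +   "cell_type": "markdown", |
| | +   "metadata": {}, |
| | +   "source": [ |
| | +    "Below is a minimal sketch of how `train.py` could be organized. It only illustrates the argparse-plus-modular-functions idea: the function bodies, argument names and default values are placeholders, not part of the assignment." |
| | +   ] |
| | +  }, |
| | +  { |
| | +   "cell_type": "code", |
| | +   "execution_count": null, |
| | +   "metadata": {}, |
| | +   "outputs": [], |
| | +   "source": [ |
| | +    "# Sketch of train.py (illustrative names and defaults)\n", |
| | +    "import argparse\n", |
| | +    "\n", |
| | +    "\n", |
| | +    "def get_dataloaders(data_dir, batch_size):\n", |
| | +    "    ...  # create train/val data generators for Tiny ImageNet\n", |
| | +    "\n", |
| | +    "\n", |
| | +    "def build_model():\n", |
| | +    "    ...  # build the architecture from homework_part2\n", |
| | +    "\n", |
| | +    "\n", |
| | +    "def train(model, loaders, n_epochs):\n", |
| | +    "    ...  # training loop with timing and logging\n", |
| | +    "\n", |
| | +    "\n", |
| | +    "def main():\n", |
| | +    "    parser = argparse.ArgumentParser(description=\"Train a model on Tiny ImageNet\")\n", |
| | +    "    parser.add_argument(\"--data_dir\", default=\"tiny-imagenet-200\", help=\"path to the dataset\")\n", |
| | +    "    parser.add_argument(\"--batch_size\", type=int, default=256)\n", |
| | +    "    parser.add_argument(\"--epochs\", type=int, default=30)\n", |
| | +    "    args = parser.parse_args()\n", |
| | +    "\n", |
| | +    "    model = build_model()\n", |
| | +    "    loaders = get_dataloaders(args.data_dir, args.batch_size)\n", |
| | +    "    train(model, loaders, args.epochs)\n", |
| | +    "\n", |
| | +    "\n", |
| | +    "if __name__ == \"__main__\":\n", |
| | +    "    main()" |
| | +   ] |
| | +  }, |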
| 57 | + { |
| 58 | + "cell_type": "markdown", |
| 59 | + "metadata": { |
| 60 | + "colab_type": "text", |
| 61 | + "id": "tKPYZ3QLEqX8" |
| 62 | + }, |
| 63 | + "source": [ |
| 64 | + "## Profiling time\n", |
| 65 | + "\n", |
| 66 | + "For the next tasks, you need to add measurements to your training loop. You can use [`perf_counter`](https://docs.python.org/3/library/time.html#time.perf_counter) for that:" |
| 67 | + ] |
| 68 | + }, |
| 69 | + { |
| 70 | + "cell_type": "code", |
| 71 | + "execution_count": null, |
| 72 | + "metadata": { |
| 73 | + "colab": {}, |
| 74 | + "colab_type": "code", |
| 75 | + "id": "bSr-PyQNFkSC" |
| 76 | + }, |
| 77 | + "outputs": [], |
| 78 | + "source": [ |
| 79 | + "import time\n", |
| 80 | + "import numpy as np\n", |
| 81 | + "import torch" |
| 82 | + ] |
| 83 | + }, |
| 84 | + { |
| 85 | + "cell_type": "code", |
| 86 | + "execution_count": null, |
| 87 | + "metadata": { |
| 88 | + "colab": { |
| 89 | + "base_uri": "https://localhost:8080/", |
| 90 | + "height": 34 |
| 91 | + }, |
| 92 | + "colab_type": "code", |
| 93 | + "id": "HMJMCGRKFYCc", |
| 94 | + "outputId": "571046a2-443b-465f-ce62-ddaf68b105d0" |
| 95 | + }, |
| 96 | + "outputs": [], |
| 97 | + "source": [ |
| 98 | + "x = np.random.randn(1000, 1000)\n", |
| 99 | + "y = np.random.randn(1000, 1000)\n", |
| 100 | + "\n", |
| 101 | + "start_counter = time.perf_counter()\n", |
| 102 | + "z = x @ y\n", |
| 103 | + "elapsed_time = time.perf_counter() - start_counter\n", |
| 104 | + "print(\"Matrix multiplication took %.3f seconds\".format(elapsed_time))" |
| 105 | + ] |
| 106 | + }, |
| 107 | + { |
| 108 | + "cell_type": "markdown", |
| 109 | + "metadata": { |
| 110 | + "colab_type": "text", |
| 111 | + "id": "FfhLeWjTGTpB" |
| 112 | + }, |
| 113 | + "source": [ |
| 114 | + "**Task**. You need to add the following measurements to your training script:\n", |
| 115 | + "* How much time a forward-backward pass takes for a single batch;\n", |
| 116 | + "* How much time an epoch takes." |
| 117 | + ] |
| 118 | + }, |
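| | +  { |
| | +   "cell_type": "markdown", |
| | +   "metadata": {}, |
| | +   "source": [ |
| | +    "A sketch of how these measurements could look inside a training loop; the `model`, `criterion`, `optimizer` and `train_loader` names stand in for your own code, and the `time` and `torch` imports come from the cell above. Note that CUDA kernels run asynchronously, so on GPU you should call `torch.cuda.synchronize()` before reading the clock, otherwise the measured times will be misleading." |
| | +   ] |
| | +  }, |
| | +  { |
| | +   "cell_type": "code", |
| | +   "execution_count": null, |
| | +   "metadata": {}, |
| | +   "outputs": [], |
| | +   "source": [ |
| | +    "def run_timed_epoch(model, train_loader, criterion, optimizer, device=\"cuda:0\"):\n", |
| | +    "    batch_times = []\n", |
| | +    "    epoch_start = time.perf_counter()\n", |
| | +    "\n", |
| | +    "    for batch_X, batch_y in train_loader:\n", |
| | +    "        batch_X, batch_y = batch_X.to(device), batch_y.to(device)\n", |
| | +    "        optimizer.zero_grad()\n", |
| | +    "\n", |
| | +    "        if device.startswith(\"cuda\"):\n", |
| | +    "            torch.cuda.synchronize()  # wait for queued kernels before starting the timer\n", |
| | +    "        batch_start = time.perf_counter()\n", |
| | +    "\n", |
| | +    "        batch_loss = criterion(model(batch_X), batch_y)  # forward\n", |
| | +    "        batch_loss.backward()                            # backward\n", |
| | +    "\n", |
| | +    "        if device.startswith(\"cuda\"):\n", |
| | +    "            torch.cuda.synchronize()  # make sure the forward-backward pass has finished\n", |
| | +    "        batch_times.append(time.perf_counter() - batch_start)\n", |
| | +    "\n", |
| | +    "        optimizer.step()\n", |
| | +    "\n", |
| | +    "    epoch_time = time.perf_counter() - epoch_start\n", |
| | +    "    print(\"mean forward-backward time: {:.4f} s, epoch time: {:.2f} s\".format(\n", |
| | +    "        sum(batch_times) / len(batch_times), epoch_time))" |
| | +   ] |
| | +  }, |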
| 119 | + { |
| 120 | + "cell_type": "markdown", |
| 121 | + "metadata": { |
| 122 | + "colab_type": "text", |
| 123 | + "id": "khDOTn_SHaND" |
| 124 | + }, |
| 125 | + "source": [ |
| 126 | + "## Profiling memory usage\n", |
| 127 | + "\n", |
| 128 | + "**Task**. You need to measure the memory consumptions\n", |
| 129 | + "\n", |
| 130 | + "This section depends on whether you train on CPU or GPU.\n", |
| 131 | + "\n", |
| 132 | + "### If you train on CPU\n", |
| 133 | + "You can use GNU time to measure peak RAM usage of a script:" |
| 134 | + ] |
| 135 | + }, |
| 136 | + { |
| 137 | + "cell_type": "code", |
| 138 | + "execution_count": null, |
| 139 | + "metadata": { |
| 140 | + "colab": {}, |
| 141 | + "colab_type": "code", |
| 142 | + "id": "98xvXSjUIDzl" |
| 143 | + }, |
| 144 | + "outputs": [], |
| 145 | + "source": [ |
| 146 | + "!/usr/bin/time -lp python train.py" |
| 147 | + ] |
| 148 | + }, |
| 149 | + { |
| 150 | + "cell_type": "markdown", |
| 151 | + "metadata": { |
| 152 | + "colab_type": "text", |
| 153 | + "id": "v1ES2Pc9IlH5" |
| 154 | + }, |
| 155 | + "source": [ |
| 156 | + "**Maximum resident set size** will show you the peak RAM usage in bytes after the script finishes." |
| 157 | + ] |
| 158 | + }, |
| 159 | + { |
| 160 | + "cell_type": "markdown", |
| 161 | + "metadata": {}, |
| 162 | + "source": [ |
| 163 | + "**Note**. \n", |
| 164 | + "Imports also require memory, do the correction" |
| 165 | + ] |
| 166 | + }, |
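| | +  { |
| | +   "cell_type": "markdown", |
| | +   "metadata": {}, |
| | +   "source": [ |
| | +    "One way to estimate this baseline (assuming GNU time is installed at `/usr/bin/time`) is to measure a run that only performs the imports and subtract its peak from that of the full run:" |
| | +   ] |
| | +  }, |
| | +  { |
| | +   "cell_type": "code", |
| | +   "execution_count": null, |
| | +   "metadata": {}, |
| | +   "outputs": [], |
| | +   "source": [ |
| | +    "!/usr/bin/time -v python3 -c \"import numpy, torch\"" |
| | +   ] |
| | +  }, |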
| 167 | + { |
| 168 | + "cell_type": "markdown", |
| 169 | + "metadata": { |
| 170 | + "colab_type": "text", |
| 171 | + "id": "kq5lY5CKJHX1" |
| 172 | + }, |
| 173 | + "source": [ |
| 174 | + "### If you train on GPU\n", |
| 175 | + "\n", |
| 176 | + "Use [`torch.cuda.max_memory_allocated()`](https://pytorch.org/docs/stable/cuda.html#torch.cuda.max_memory_allocated) at the end of your script to show the maximum amount of memory in bytes used by all tensors." |
| 177 | + ] |
| 178 | + }, |
| 179 | + { |
| 180 | + "cell_type": "code", |
| 181 | + "execution_count": null, |
| 182 | + "metadata": { |
| 183 | + "colab": { |
| 184 | + "base_uri": "https://localhost:8080/", |
| 185 | + "height": 34 |
| 186 | + }, |
| 187 | + "colab_type": "code", |
| 188 | + "id": "fSQdauqLIkf1", |
| 189 | + "outputId": "8bcffc30-637d-461a-8f44-0e444a28caae" |
| 190 | + }, |
| 191 | + "outputs": [], |
| 192 | + "source": [ |
| 193 | + "x = torch.randn(1000, 1000, 1000, device='cuda:0')\n", |
| 194 | + "print(f\"Peak memory usage by Pytorch tensors: {(torch.cuda.max_memory_allocated() / 1024 / 1024):.2f} Mb\")" |
| 195 | + ] |
| 196 | + }, |
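| | +  { |
| | +   "cell_type": "markdown", |
| | +   "metadata": {}, |
| | +   "source": [ |
| | +    "If you want a per-epoch peak rather than a single global maximum, the counter can be reset between measurements. The sketch below assumes `n_epochs`, `model`, `train_loader` and a `train_one_epoch` function from your own code; `torch.cuda.reset_max_memory_allocated()` is available in older PyTorch releases, while newer ones also provide `torch.cuda.reset_peak_memory_stats()`." |
| | +   ] |
| | +  }, |
| | +  { |
| | +   "cell_type": "code", |
| | +   "execution_count": null, |
| | +   "metadata": {}, |
| | +   "outputs": [], |
| | +   "source": [ |
| | +    "for epoch in range(n_epochs):\n", |
| | +    "    torch.cuda.reset_max_memory_allocated()  # reset the peak counter before the epoch\n", |
| | +    "    train_one_epoch(model, train_loader)\n", |
| | +    "    peak_mb = torch.cuda.max_memory_allocated() / 1024 / 1024\n", |
| | +    "    print(f\"epoch {epoch}: peak GPU memory {peak_mb:.2f} MB\")" |
| | +   ] |
| | +  }, |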
| 197 | + { |
| 198 | + "cell_type": "markdown", |
| 199 | + "metadata": { |
| 200 | + "colab_type": "text", |
| 201 | + "id": "M3RWHxYKBUys" |
| 202 | + }, |
| 203 | + "source": [ |
| 204 | + "## Gradient based techniques\n", |
| 205 | + "\n", |
| 206 | + "Modern architectures can potentially consume lots and lots of memory even for minibatch of several objects. To handle such cases here we will discuss two simple techniques.\n", |
| 207 | + "\n", |
| 208 | + "### Gradient Checkpointing\n", |
| 209 | + "\n", |
| 210 | + "Checkpointing works by trading compute for memory. Rather than storing all intermediate activations of the entire computation graph for computing backward, the checkpointed part does not save intermediate activations, and instead recomputes them in backward pass. It can be applied on any part of a model.\n", |
| 211 | + "\n", |
| 212 | + "See [blogpost](https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9) for kind introduction and different strategies or [article](https://arxiv.org/pdf/1604.06174.pdf) for not kind introduction.\n", |
| 213 | + "\n", |
| 214 | + "**Task**. Use [built-in checkpointing](https://pytorch.org/docs/stable/checkpoint.html), measure the difference in memory/compute \n", |
| 215 | + "\n", |
| 216 | + "**Requirements**. \n", |
| 217 | + "* Try several arrangements for checkpoints\n", |
| 218 | + "* Add the chekpointing as the optional flag into your script\n", |
| 219 | + "* Measure the difference in memory/compute between the different arrangements and baseline " |
| 220 | + ] |
| 221 | + }, |
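| | +  { |
| | +   "cell_type": "markdown", |
| | +   "metadata": {}, |
| | +   "source": [ |
| | +    "As an illustration only (the toy block stack below is not the homework architecture), `torch.utils.checkpoint.checkpoint_sequential` splits a sequential model into segments and keeps activations only at segment boundaries; everything in between is recomputed during the backward pass:" |
| | +   ] |
| | +  }, |
| | +  { |
| | +   "cell_type": "code", |
| | +   "execution_count": null, |
| | +   "metadata": {}, |
| | +   "outputs": [], |
| | +   "source": [ |
| | +    "import torch.nn as nn\n", |
| | +    "from torch.utils.checkpoint import checkpoint_sequential\n", |
| | +    "\n", |
| | +    "# A toy stack of blocks standing in for your network\n", |
| | +    "blocks = nn.Sequential(*[\n", |
| | +    "    nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())\n", |
| | +    "    for _ in range(8)\n", |
| | +    "])\n", |
| | +    "\n", |
| | +    "x = torch.randn(4, 16, 64, 64, requires_grad=True)\n", |
| | +    "\n", |
| | +    "# Split into 2 segments: only the segment boundaries keep activations\n", |
| | +    "out = checkpoint_sequential(blocks, 2, x)\n", |
| | +    "out.mean().backward()" |
| | +   ] |
| | +  }, |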
| 222 | + { |
| 223 | + "cell_type": "markdown", |
| 224 | + "metadata": { |
| 225 | + "colab_type": "text", |
| 226 | + "id": "mjY8LR_GQbTV" |
| 227 | + }, |
| 228 | + "source": [ |
| 229 | + "### Accumulating gradient for large batches\n", |
| 230 | + "We can increase the effective batch size by simply accumulating gradients over multiple forward passes. Note that `loss.backward()` simply adds the computed gradient to `tensor.grad`, so we can call this method multiple times before actually taking an optimizer step. However, this approach might be a little tricky to combine with batch normalization. Do you see why?" |
| 231 | + ] |
| 232 | + }, |
| 233 | + { |
| 234 | + "cell_type": "code", |
| 235 | + "execution_count": null, |
| 236 | + "metadata": { |
| 237 | + "colab": {}, |
| 238 | + "colab_type": "code", |
| 239 | + "id": "qbbbO7V0QeGT" |
| 240 | + }, |
| 241 | + "outputs": [], |
| 242 | + "source": [ |
| 243 | + "effective_batch_size = 1024\n", |
| 244 | + "loader_batch_size = 32\n", |
| 245 | + "batches_per_update = effective_batch_size / loader_batch_size # Updating weights after 8 forward passes\n", |
| 246 | + "\n", |
| 247 | + "dataloader = DataLoader(dataset, batch_size=loader_batch_size)\n", |
| 248 | + "\n", |
| 249 | + "optimizer.zero_grad()\n", |
| 250 | + "\n", |
| 251 | + "for batch_i, (batch_X, batch_y) in enumerate(dataloader):\n", |
| 252 | + " l = loss(model(batch_X), batch_y)\n", |
| 253 | + " l.backward() # Adds gradients\n", |
| 254 | + " \n", |
| 255 | + " if (batch_i + 1) % batches_per_update == 0:\n", |
| 256 | + " optimizer.step()\n", |
| 257 | + " optimizer.zero_grad()" |
| 258 | + ] |
| 259 | + }, |
| 260 | + { |
| 261 | + "cell_type": "markdown", |
| 262 | + "metadata": { |
| 263 | + "colab_type": "text", |
| 264 | + "id": "ZqxvZWH9Uxtq" |
| 265 | + }, |
| 266 | + "source": [ |
| 267 | + "**Task**. Explore the trade-off between computation time and memory usage while maintaining the same effective batch size. By effective batch size we mean the number of objects over which the loss is computed before taking a gradient step.\n", |
| 268 | + "\n", |
| 269 | + "**Requirements**\n", |
| 270 | + "\n", |
| 271 | + "* Compare compute between accumulating gradient and gradient checkpointing with similar memory consumptions\n", |
| 272 | + "* Incorporate gradient accumulation into your script with optional argument" |
| 273 | + ] |
| 274 | + }, |
| 275 | + { |
| 276 | + "cell_type": "markdown", |
| 277 | + "metadata": { |
| 278 | + "colab_type": "text", |
| 279 | + "id": "K3iiJZuhSUR0" |
| 280 | + }, |
| 281 | + "source": [ |
| 282 | + "## Accuracy vs compute trade-off" |
| 283 | + ] |
| 284 | + }, |
| 285 | + { |
| 286 | + "cell_type": "markdown", |
| 287 | + "metadata": { |
| 288 | + "colab_type": "text", |
| 289 | + "id": "0WOWhqMJSboR" |
| 290 | + }, |
| 291 | + "source": [ |
| 292 | + "### Tensor type size\n", |
| 293 | + "\n", |
| 294 | + "One of the hyperparameter affecting memory consumption is the precision (e.g. floating point number). The most popular choice is 32 bit however with several hacks* 16 bit arithmetics can save you approximately half of the memory without considerable loss of perfomance. This is called mixed precision training.\n", |
| 295 | + "\n", |
| 296 | + "*https://arxiv.org/pdf/1710.03740.pdf" |
| 297 | + ] |
| 298 | + }, |
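| | +  { |
| | +   "cell_type": "markdown", |
| | +   "metadata": {}, |
| | +   "source": [ |
| | +    "A minimal sketch of one mixed precision training step with `torch.cuda.amp` (this API requires PyTorch >= 1.6; `model`, `loss`, `optimizer` and `dataloader` are assumed to be defined in your own code):" |
| | +   ] |
| | +  }, |
| | +  { |
| | +   "cell_type": "code", |
| | +   "execution_count": null, |
| | +   "metadata": {}, |
| | +   "outputs": [], |
| | +   "source": [ |
| | +    "from torch.cuda.amp import GradScaler, autocast\n", |
| | +    "\n", |
| | +    "scaler = GradScaler()  # rescales the loss to avoid float16 gradient underflow\n", |
| | +    "\n", |
| | +    "for batch_X, batch_y in dataloader:\n", |
| | +    "    batch_X, batch_y = batch_X.cuda(), batch_y.cuda()\n", |
| | +    "    optimizer.zero_grad()\n", |
| | +    "\n", |
| | +    "    with autocast():  # run the forward pass in float16 where it is safe\n", |
| | +    "        batch_loss = loss(model(batch_X), batch_y)\n", |
| | +    "\n", |
| | +    "    scaler.scale(batch_loss).backward()  # backward on the scaled loss\n", |
| | +    "    scaler.step(optimizer)               # unscales the gradients, then optimizer.step()\n", |
| | +    "    scaler.update()" |
| | +   ] |
| | +  }, |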
| 299 | + { |
| 300 | + "cell_type": "markdown", |
| 301 | + "metadata": { |
| 302 | + "colab_type": "text", |
| 303 | + "id": "-xAEF9aJc-43" |
| 304 | + }, |
| 305 | + "source": [ |
| 306 | + "### Quantization\n", |
| 307 | + "\n", |
| 308 | + "We can actually move further and use even lower precision like 8-bit integers:\n", |
| 309 | + "\n", |
| 310 | + "* https://heartbeat.fritz.ai/8-bit-quantization-and-tensorflow-lite-speeding-up-mobile-inference-with-low-precision-a882dfcafbbd\n", |
| 311 | + "* https://nervanasystems.github.io/distiller/quantization/\n", |
| 312 | + "* https://arxiv.org/abs/1712.05877" |
| 313 | + ] |
| 314 | + }, |
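| | +  { |
| | +   "cell_type": "markdown", |
| | +   "metadata": {}, |
| | +   "source": [ |
| | +    "As a quick illustration, recent PyTorch versions (>= 1.3) ship dynamic quantization, which converts the weights of selected layer types to int8; this sketch assumes `model` is your trained network moved to CPU:" |
| | +   ] |
| | +  }, |
| | +  { |
| | +   "cell_type": "code", |
| | +   "execution_count": null, |
| | +   "metadata": {}, |
| | +   "outputs": [], |
| | +   "source": [ |
| | +    "# Convert the weights of all Linear layers to int8; activations are quantized on the fly\n", |
| | +    "quantized_model = torch.quantization.quantize_dynamic(\n", |
| | +    "    model.cpu(), {torch.nn.Linear}, dtype=torch.qint8\n", |
| | +    ")" |
| | +   ] |
| | +  }, |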
| 315 | + { |
| 316 | + "cell_type": "markdown", |
| 317 | + "metadata": { |
| 318 | + "colab_type": "text", |
| 319 | + "id": "fXad1svpSk8f" |
| 320 | + }, |
| 321 | + "source": [ |
| 322 | + "### Knowledge distillation\n", |
| 323 | + "Suppose that we have a large network (*teacher network*) or an ensemble of networks which has a good accuracy. We can like train a much smaller network (*student network*) using the outputs of teacher networks. It turns out that the perfomance could be even better! This approach doesn't help with training speed, but can be quite beneficial when we'd like to reduce the model size for low-memory devices.\n", |
| 324 | + "\n", |
| 325 | + "* https://www.ttic.edu/dl/dark14.pdf\n", |
| 326 | + "* [Distilling the Knowledge in a Neural Network](https://arxiv.org/abs/1503.02531)\n", |
| 327 | + "* https://medium.com/neural-machines/knowledge-distillation-dc241d7c2322\n", |
| 328 | + "\n", |
| 329 | + "Even the completely different ([article](https://arxiv.org/abs/1711.10433)) architecture can be used in a student model, e.g. you can approximate an autoregressive model (WaveNet) by a non-autoregressive one.\n", |
| 330 | + "\n", |
| 331 | + "**Task**. Distill your (teacher) network with smaller one (student), compare it perfomance with the teacher network and with the same (student) trained directly from data.\n", |
| 332 | + "\n", |
| 333 | + "**Note**. Logits carry more information than the probabilities after softmax\n", |
| 334 | + "\n", |
| 335 | + "This approach doesn't help with training speed, but can be quite beneficial when we'd like to reduce the model size for low-memory devices." |
| 336 | + ] |
| 337 | + }, |
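| | +  { |
| | +   "cell_type": "markdown", |
| | +   "metadata": {}, |
| | +   "source": [ |
| | +    "A possible distillation loss following Hinton et al.: a KL-divergence term between temperature-softened teacher and student logits, mixed with the usual cross-entropy on the hard labels. The temperature `T` and weight `alpha` are hyperparameters you would tune yourself:" |
| | +   ] |
| | +  }, |
| | +  { |
| | +   "cell_type": "code", |
| | +   "execution_count": null, |
| | +   "metadata": {}, |
| | +   "outputs": [], |
| | +   "source": [ |
| | +    "import torch.nn.functional as F\n", |
| | +    "\n", |
| | +    "def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):\n", |
| | +    "    # Soft targets: match the teacher's temperature-softened distribution\n", |
| | +    "    soft = F.kl_div(\n", |
| | +    "        F.log_softmax(student_logits / T, dim=1),\n", |
| | +    "        F.softmax(teacher_logits / T, dim=1),\n", |
| | +    "        reduction='batchmean'\n", |
| | +    "    ) * (T * T)  # rescale so the soft term has the same gradient magnitude as the hard one\n", |
| | +    "\n", |
| | +    "    # Hard targets: ordinary cross-entropy with the true labels\n", |
| | +    "    hard = F.cross_entropy(student_logits, targets)\n", |
| | +    "    return alpha * soft + (1.0 - alpha) * hard" |
| | +   ] |
| | +  }, |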
| 338 | + { |
| 339 | + "cell_type": "markdown", |
| 340 | + "metadata": {}, |
| 341 | + "source": [ |
| 342 | + "### Pruning\n", |
| 343 | + "\n", |
| 344 | + "The idea of pruning is to remove unnecessary (in terms of loss) weights. It can be measured in different ways: for example, by the norm of the weights (similar to L1 feature selection), by the magnitude of the activation or via Taylor expansion*.\n", |
| 345 | + "\n", |
| 346 | + "One iteration of pruning consists of two steps:\n", |
| 347 | + "\n", |
| 348 | + "1) Rank weights with some importance measure and remove the least important\n", |
| 349 | + "\n", |
| 350 | + "2) Fine-tune the model\n", |
| 351 | + "\n", |
| 352 | + "This approach is a bit computationally heavy but can lead to drastic (up to 150x) decrease of memory to store the weights. Moreover if you make use of structure in layers you can decrease also compute. For example, the whole convolutional filters can be removed.\n", |
| 353 | + "\n", |
| 354 | + "*https://arxiv.org/pdf/1611.06440.pdf" |
| 355 | + ] |
| 356 | +  }, |
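| | +  { |
| | +   "cell_type": "markdown", |
| | +   "metadata": {}, |
| | +   "source": [ |
| | +    "A sketch of one unstructured magnitude-pruning step that zeroes out the smallest-magnitude weights of a layer in place. The pruning fraction is an illustrative choice, and recent PyTorch versions also provide `torch.nn.utils.prune` for the same purpose:" |
| | +   ] |
| | +  }, |
| | +  { |
| | +   "cell_type": "code", |
| | +   "execution_count": null, |
| | +   "metadata": {}, |
| | +   "outputs": [], |
| | +   "source": [ |
| | +    "def magnitude_prune_(module, amount=0.3):\n", |
| | +    "    # Zero out the `amount` fraction of weights with the smallest absolute value\n", |
| | +    "    with torch.no_grad():\n", |
| | +    "        weight = module.weight\n", |
| | +    "        k = int(amount * weight.numel())\n", |
| | +    "        if k == 0:\n", |
| | +    "            return\n", |
| | +    "        threshold = weight.abs().flatten().kthvalue(k).values\n", |
| | +    "        weight.mul_((weight.abs() > threshold).float())" |
| | +   ] |
| | +  } |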
| 357 | + ], |
| 358 | + "metadata": { |
| 359 | + "accelerator": "GPU", |
| 360 | + "colab": { |
| 361 | + "collapsed_sections": [], |
| 362 | + "name": "homework_optimization.ipynb", |
| 363 | + "provenance": [], |
| 364 | + "version": "0.3.2" |
| 365 | + }, |
| 366 | + "kernelspec": { |
| 367 | + "display_name": "Python 3", |
| 368 | + "language": "python", |
| 369 | + "name": "python3" |
| 370 | + }, |
| 371 | + "language_info": { |
| 372 | + "codemirror_mode": { |
| 373 | + "name": "ipython", |
| 374 | + "version": 3 |
| 375 | + }, |
| 376 | + "file_extension": ".py", |
| 377 | + "mimetype": "text/x-python", |
| 378 | + "name": "python", |
| 379 | + "nbconvert_exporter": "python", |
| 380 | + "pygments_lexer": "ipython3", |
| 381 | + "version": "3.7.2" |
| 382 | + } |
| 383 | + }, |
| 384 | + "nbformat": 4, |
| 385 | + "nbformat_minor": 2 |
| 386 | +} |