Deep Learning Tuning Playbook | Google Research

GitHub - google-research/tuning_playbook: A playbook for systematically maximizing the performance of deep learning models.

Guide for starting a new project

  • Guides on choosing the model architecture, the optimizer, and the batch size.
  • The goal of tuning batch size is to saturate the GPUs, which can be monitored by:
    • training throughput = (# examples processed per second)
    • time per step = (batch size) / (training throughput)
    • Note: when the batch size is changed, most hyper-parameters need to be re-tuned. Among them, the learning rate and the regularization strength are the most important.
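To make the saturation check concrete, here is a minimal sketch of measuring the two quantities above; `train_step` is a hypothetical callable that runs one optimization step on a batch of the given size.

```python
import time

def measure_throughput(train_step, batch_size, num_steps=50, warmup=5):
    """Estimate training throughput (examples/sec) and time per step.

    `train_step` is a hypothetical no-argument callable that runs one
    optimization step on a batch of `batch_size` examples.
    """
    for _ in range(warmup):              # exclude compilation/warmup overhead
        train_step()
    start = time.perf_counter()
    for _ in range(num_steps):
        train_step()
    elapsed = time.perf_counter() - start
    time_per_step = elapsed / num_steps
    throughput = batch_size / time_per_step   # examples processed per second
    return throughput, time_per_step
```

If throughput stops growing as the batch size is doubled, the accelerators are saturated and larger batches no longer help wall-clock time.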

A scientific approach to improving model performance

  • Design the experiment
    • Scientific hyper-parameters
      The experiment aims to explore the effect of the scientific hyper-parameters.
    • Nuisance hyper-parameters
      To compare different approaches fairly, the nuisance hyper-parameters must be tuned for each approach, and the best trials are then compared.
    • Fixed hyper-parameters
      Hyper-parameters held fixed to reduce the number of trials needed.
  • Exploring the hyper-parameter search space
    Use Bayesian optimization or quasi-random search, and choose the search-space boundaries carefully.
  • Automate plotting so that enough diagnostic plots are actually generated.
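As an illustration of quasi-random search, the sketch below generates low-discrepancy Halton points over a hypothetical two-dimensional search space (learning rate and weight decay, both sampled on a log scale, the usual choice for scale-sensitive hyper-parameters); the ranges are placeholders, not recommendations.

```python
import math

def halton(index, base):
    """Radical-inverse (van der Corput) value of `index` in `base`."""
    result, f = 0.0, 1.0 / base
    while index > 0:
        result += f * (index % base)
        index //= base
        f /= base
    return result

def quasi_random_trials(num_trials, lr_range=(1e-5, 1e-1), wd_range=(1e-6, 1e-2)):
    """Generate low-discrepancy (Halton) trial points in a 2-D search space."""
    trials = []
    for i in range(1, num_trials + 1):            # start at 1: index 0 hits the corner
        u_lr, u_wd = halton(i, 2), halton(i, 3)   # coprime bases, one per dimension
        lo, hi = math.log10(lr_range[0]), math.log10(lr_range[1])
        lr = 10 ** (lo + u_lr * (hi - lo))        # log-uniform spread of points
        lo, hi = math.log10(wd_range[0]), math.log10(wd_range[1])
        wd = 10 ** (lo + u_wd * (hi - lo))
        trials.append({"learning_rate": lr, "weight_decay": wd})
    return trials
```

Unlike independent random sampling, the Halton points cover the space evenly, and any prefix of the sequence is itself well spread, so a study can be stopped early.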

Determining the number of steps for each training run

  • Deciding how long to train when training is not compute-bound
  • Deciding how long to train when training is compute-bound
    • "Round 1: Shorter runs to find good model and optimizer hyperparameters."
    • "Round 2: Very few long runs on good hyperparameter points to get the final model."

Additional guidance for the training pipeline

  • Optimizing the input pipeline
  • Saving checkpoints and retrospectively selecting the best checkpoint
    Keep the best k checkpoints during training.
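A minimal sketch of the best-k bookkeeping, assuming a higher-is-better validation score (e.g. accuracy); real code would also write and delete the corresponding checkpoint files on disk.

```python
import heapq

class BestKCheckpoints:
    """Track the k best checkpoints seen so far by validation score."""

    def __init__(self, k):
        self.k = k
        self._heap = []            # min-heap of (score, step): worst kept on top

    def update(self, step, score):
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (score, step))
        elif score > self._heap[0][0]:
            heapq.heapreplace(self._heap, (score, step))   # evict the worst kept

    def best(self):
        return sorted(self._heap, reverse=True)            # best checkpoint first
```

Selecting the best checkpoint retrospectively this way decouples "how long to train" from "which weights to ship".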

FAQs

  • How should Adam’s hyper-parameters be tuned?
  • Why use quasi-random search instead of more sophisticated black box optimization algorithms during the exploration phase of tuning?
  • Unstable training
    • Learning rate warmup
      "Our goal is to find the shortest number of warmup_steps that allows us to access peak learning rates that are much higher than unstable_base_learning_rate." The default is 10x unstable_base_learning_rate.
    • Gradient clipping
      "Choose a gradient clipping threshold based on the 90th percentile of gradient norms."
    • Issue with Batch Normalization in residual blocks: use x + f(Norm(x)); the Norm(x + f(x)) ordering is known to cause problems.
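The two stabilization remedies above can be sketched as follows; `warmup_steps`, `peak_lr`, and the clipping `threshold` are assumed inputs that would come from the tuning procedures described in the quotes.

```python
import math

def warmup_lr(step, warmup_steps, peak_lr):
    """Linear warmup from ~0 to peak_lr over warmup_steps, then constant.

    Per the recipe above, warmup_steps is the shortest warmup that lets
    training reach a peak_lr well above unstable_base_learning_rate.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

def clip_gradients(grads, threshold):
    """Rescale gradients so their global L2 norm is at most `threshold`.

    The threshold would be chosen near the 90th percentile of observed
    gradient norms, so only outlier batches are clipped.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grads]
    return grads
```

If more than ~50% of steps end up clipped, the threshold (or the learning rate) is probably wrong rather than the gradients.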
  • Update rules of popular optimizers
    • Stochastic gradient descent (SGD)
      $\theta_{t+1} = \theta_t - \eta_t \nabla \mathcal{l}(\theta_t)$
    • Momentum
      $v_0 = 0$
      $v_{t+1} = \gamma v_t + \nabla \mathcal{l}(\theta_t)$
      $\theta_{t+1} = \theta_t - \eta_t v_{t+1}$
    • Nesterov
      $v_0 = 0$
      $v_{t+1} = \gamma v_t + \nabla \mathcal{l}(\theta_t)$
      $\theta_{t+1} = \theta_t - \eta_t (\gamma v_{t+1} + \nabla \mathcal{l}(\theta_t))$
    • RMSProp
      $v_0 = 1,\ m_0 = 0$
      $v_{t+1} = \rho v_t + (1 - \rho) \nabla \mathcal{l}(\theta_t)^2$
      $m_{t+1} = \gamma m_t + \frac{\eta_t}{\sqrt{v_{t+1} + \epsilon}} \nabla \mathcal{l}(\theta_t)$
      $\theta_{t+1} = \theta_t - m_{t+1}$
    • ADAM
      $m_0 = 0,\ v_0 = 0$
      $m_{t+1} = \beta_1 m_t + (1 - \beta_1) \nabla \mathcal{l}(\theta_t)$
      $v_{t+1} = \beta_2 v_t + (1 - \beta_2) \nabla \mathcal{l}(\theta_t)^2$
      $b_{t+1} = \frac{\sqrt{1 - \beta_2^{t+1}}}{1 - \beta_1^{t+1}}$
      $\theta_{t+1} = \theta_t - \alpha_t \frac{m_{t+1}}{\sqrt{v_{t+1}} + \epsilon} b_{t+1}$
    • NADAM
      $m_0 = 0,\ v_0 = 0$
      $m_{t+1} = \beta_1 m_t + (1 - \beta_1) \nabla \mathcal{l}(\theta_t)$
      $v_{t+1} = \beta_2 v_t + (1 - \beta_2) \nabla \mathcal{l}(\theta_t)^2$
      $b_{t+1} = \frac{\sqrt{1 - \beta_2^{t+1}}}{1 - \beta_1^{t+1}}$
      $\theta_{t+1} = \theta_t - \alpha_t \frac{\beta_1 m_{t+1} + (1 - \beta_1) \nabla \mathcal{l}(\theta_t)}{\sqrt{v_{t+1}} + \epsilon} b_{t+1}$
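As a sanity check on the ADAM update rule above, the sketch below implements one step for a scalar parameter and can be iterated on a toy quadratic loss; the hyper-parameter defaults are the common ones, not values from the playbook.

```python
import math

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update following the equations above (scalar parameter).

    `t` is the step index starting at 0, so the bias-correction factor
    b uses the exponent t + 1, matching the subscripts in the formulas.
    """
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    b = math.sqrt(1 - beta2 ** (t + 1)) / (1 - beta1 ** (t + 1))
    theta = theta - alpha * m / (math.sqrt(v) + eps) * b
    return theta, m, v

# Usage: minimize l(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(500):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, alpha=0.01)
```

Because m and v start at 0, the correction factor b matters most in the first few steps, where the raw moment estimates are biased toward zero.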