Islam Mohamed

Mixed Precision Training


A High-Level Overview

Deep Neural Networks (DNNs) have achieved breakthroughs in several areas, including Computer Vision, Natural Language Understanding, Speech Recognition tasks, and many others.

Although increasing network size typically improves accuracy, it also increases the required computational resources (GPU utilization, memory).
So new techniques have been developed to train models faster, without losing accuracy or modifying the network hyper-parameters, by lowering the required memory, which in turn lets us train larger models or use larger mini-batches.

In neural nets, all the computations are typically done in single precision floating point.
Single precision floating point arithmetic deals with 32-bit floating point numbers, which means that all the floats in all the arrays that represent inputs, activations, weights, etc. are 32-bit floats (FP32).

One idea for reducing memory usage is to work with 16-bit floats instead, the so-called half precision floating point format (FP16). But half precision has its own issues, namely its small range and low precision compared to single or double precision floats. For this reason, half precision alone is sometimes unable to reach the same accuracy.
Mixed precision, which uses both single and half precision representations, is able to speed up training while still achieving the same accuracy.

Problems In Half Precision

To understand the problems in half precision, let’s have a look at what an FP16 number looks like:

Fig. 1 : half precision floating point format.

It is divided into three segments:

  1. Bit number 15 is the sign bit.
  2. Bits 10 to 14 are the exponent.
  3. The final 10 bits (bits 0 to 9) are the fraction.

The value of this representation is calculated as shown below (a small decoding example follows the list).

Fig. 2 : Half precision floating point format with the value of each bit.

  1. If the exponent bits are all ones (11111), the value is either infinity (if the fraction bits are all zeros) or NaN (“Not a Number”).
  2. If the exponent bits are all zeros (00000), the value is a subnormal number and is calculated as:
    (−1) ^ sign_bit × 2 ^ −14 × 0.fraction_bits (in base 2)
  3. Otherwise the value is a normalized number and is calculated as:
    (−1) ^ sign_bit × 2 ^ (exponent − 15) × 1.fraction_bits (in base 2)
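
To make the formulas above concrete, here is a small NumPy sketch; decode_fp16 is just an illustrative helper (not a standard NumPy function) that reinterprets a raw 16-bit pattern as an IEEE 754 half-precision float:

    import numpy as np

    # Illustrative helper: reinterpret a raw 16-bit pattern as an IEEE 754 half float.
    def decode_fp16(bits):
        return np.frombuffer(np.uint16(bits).tobytes(), dtype=np.float16)[0]

    # 0x3555 = 0 01101 0101010101
    #   sign = 0, exponent = 13 -> 2^(13 - 15) = 2^-2, fraction = 341/1024
    #   value = (+1) * 2^-2 * (1 + 341/1024) ≈ 0.33325 (the closest FP16 to 1/3)
    print(decode_fp16(0x3555))

    # 0x0001 = 0 00000 0000000001 -> subnormal: 2^-14 * (1/1024) = 2^-24 ≈ 5.96e-08
    print(decode_fp16(0x0001))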

Because of this half precision representation, if we try to add 1 + 0.0001 the output will be 1: when the exponents of 1 and 0.0001 are aligned for the addition, the smaller operand’s bits fall outside the 10-bit fraction and are rounded away.

And that causes a number of problems when training DNNs. To experiment with conversion and addition in the binary16 floating point format yourself, an online FP16 converter is a handy tool.
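
A quick way to see this for yourself (a minimal NumPy sketch; PyTorch’s float16 behaves the same way):

    import numpy as np

    a = np.float16(1.0)
    b = np.float16(0.0001)

    print(b)        # 0.0001 on its own is still representable in FP16
    print(a + b)    # 1.0 -> the small operand is rounded away when aligned with 1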

The main issues when training with FP16

  1. Values are imprecise.
  2. Underflow risk.
  3. Overflow risk.

Values are imprecise

In neural network training, all weights, activations, and gradients are stored as FP16.
As we know, the weight update is done based on this equation:

New_weight = Weight - Learning_Rate * Weight.Gradient

Since Weight.Gradient and Learning_Rate usually have small values, the update term Learning_Rate * Weight.Gradient can be much smaller than the weight itself. As shown above for half precision, if the weight is 1 and the update term is 0.0001 or lower, the update is rounded away and the weight is effectively frozen.
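
A minimal NumPy sketch of this freezing effect (the 1000-iteration count and the values are just illustrative): even many repeated updates never accumulate, because each one is rounded away.

    import numpy as np

    weight = np.float16(1.0)
    update = np.float16(1e-4)     # Learning_Rate * Weight.Gradient from the equation above

    for _ in range(1000):
        weight = np.float16(weight - update)   # each subtraction rounds back to 1.0

    print(weight)                 # still 1.0 -> the weight is frozen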

Underflow Risk

In FP16, gradients often get flushed to zero because they are usually very small.
In FP16 arithmetic, values smaller than 2 ^ −24 = 0.000000059605 become zero, since this is the smallest positive subnormal number.
With underflow, the network never learns anything.
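
A small NumPy illustration of this cutoff (the exponents are chosen around 2 ^ −24):

    import numpy as np

    # 2^-24 is the smallest positive FP16 subnormal; values much smaller become 0.
    print(np.float16(2.0 ** -24))   # ~5.96e-08, still representable
    print(np.float16(2.0 ** -26))   # 0.0 -> a gradient this small is simply lost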

Overflow Risk

In FP16, activations and network parameters can keep growing until they hit Inf or NaN values.
With overflow (exploding gradients or activations), the network learns garbage.
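
A small NumPy illustration (the value 300 is arbitrary; the largest finite FP16 value is 65504):

    import numpy as np

    a = np.float16(300.0)

    print(a * a)           # inf -> 90000 overflows, since FP16 tops out at 65504
    print(a * a - a * a)   # nan -> inf - inf, the garbage the network then learns from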

The Proposed Techniques for Training with Mixed Precision

There are mainly three techniques for preventing the loss of critical information:

  1. Single precision FP32 Master copy of weights and updates.
  2. Loss (Gradient) Scaling.
  3. Accumulating half precision products into single precision.

Single precision FP32 Master copy of weights and updates

To overcome the first problem, we keep an FP32 master copy of all weights; in each iteration we run the forward and backward propagation in FP16 and then apply the weight update to the weights stored in the FP32 master copy, as shown below.

Fig. 3 : Mixed precision training iteration for a layer.

Although storing an additional copy of the weights increases the memory required for the weights themselves, the overall training memory consumption is still approximately half of what FP32 training needs, because the activations, which dominate training memory, are stored in FP16.
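
As a minimal NumPy sketch of the idea (the 1000 steps and the step size 1e-4 are illustrative): the updates accumulate in the FP32 master copy, and an FP16 copy is taken from it for the next forward/backward pass.

    import numpy as np

    master = np.float32(1.0)      # FP32 master weight
    update = np.float32(1e-4)     # Learning_Rate * Weight.Gradient

    for _ in range(1000):
        master -= update                     # accumulates correctly in FP32
        weight_fp16 = np.float16(master)     # FP16 copy used for forward/backward

    print(master, weight_fp16)    # ~0.9 and its FP16 rounding, instead of a frozen 1.0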

Loss (Gradient) Scaling
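
The idea is to multiply the loss by a scaling factor S before backpropagation; by the chain rule, all gradients are scaled by the same factor, which shifts small gradient values up into FP16’s representable range. The weight gradients are then multiplied by 1/S before the weight update (see steps 3.3 and 3.6 below). A minimal NumPy sketch, with an illustrative gradient value and scaling factor:

    import numpy as np

    grad = 2e-8                     # a gradient below FP16's smallest subnormal (2^-24)
    print(np.float16(grad))         # 0.0 -> without scaling, the gradient underflows

    S = 1024.0                      # scaling factor (illustrative value)
    scaled = np.float16(grad * S)   # ~2.05e-05, now representable in FP16
    print(np.float32(scaled) / S)   # ~2e-08, recovered after unscaling in FP32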

Accumulating half precision products into single precision

Investigating the last issue, neural network arithmetic operations fall into three categories: vector dot-products, reductions, and point-wise operations.
These categories benefit from different treatment when it comes to reduced precision arithmetic; in particular, some networks require FP16 vector dot-products to accumulate their partial products into an FP32 value, which is converted back to FP16 before being written to memory.
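
As a small illustration (the array of 10,000 values of 0.01 is arbitrary): summing FP16 values with an FP16 accumulator stalls once the running sum grows large relative to the addends, while accumulating the same FP16 inputs in FP32 stays accurate.

    import numpy as np

    x = np.full(10000, 0.01, dtype=np.float16)

    # FP16 accumulator: once the sum passes 32, the FP16 spacing there (2^-5)
    # is more than twice the addend, so further additions are rounded away.
    acc16 = np.float16(0.0)
    for v in x:
        acc16 = np.float16(acc16 + v)

    # Same FP16 inputs, but the partial sums are accumulated in FP32.
    acc32 = np.float32(0.0)
    for v in x:
        acc32 += np.float32(v)

    print(acc16, acc32)   # roughly 32.0 vs about 100.0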

Mixed Precision Training Steps

  1. Maintain a master copy of weights in FP32.
  2. Initialize scaling factor (S) to a large value.
  3. For each iteration:
    3.1 Make an FP16 copy of the weights.
    3.2 Forward propagation (FP16 weights and activations).
    3.3 Multiply the resulting loss with the scaling factor S.
    3.4 Backward propagation (FP16 weights, activations, and their gradients).
    3.5 If there is an Inf or NaN in the weight gradients:
      3.5.1 Reduce S.
      3.5.2 Skip the weight update and move to the next iteration.
    3.6 Multiply the weight gradient with 1/S.
    3.7 Complete the weight update (including gradient clipping, etc.).
    3.8 If there hasn’t been an Inf or NaN in the last N iterations, increase S.
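
Below is a minimal PyTorch sketch of these steps with manual dynamic loss scaling. The tiny linear model, the random data, and the constants (initial S, the factor of 2 for growing and shrinking it, N = 2000, the learning rate) are illustrative assumptions, not values prescribed by the technique, and a CUDA device is assumed for the FP16 math.

    import copy
    import torch
    import torch.nn as nn

    device = "cuda"                           # FP16 compute assumes a CUDA-capable GPU
    model = nn.Linear(128, 10).to(device)     # step 1: FP32 master copy of the weights
    model_fp16 = copy.deepcopy(model).half()  # container for the per-iteration FP16 copy
    loss_fn = nn.CrossEntropyLoss()
    lr = 1e-3

    S = 2.0 ** 16                             # step 2: start with a large scaling factor
    N = 2000                                  # grow S again after N clean iterations
    good_steps = 0

    for step in range(10000):                 # step 3: training iterations (random toy data)
        x = torch.randn(32, 128, device=device)
        y = torch.randint(0, 10, (32,), device=device)

        # 3.1: refresh the FP16 copy of the weights from the FP32 master
        with torch.no_grad():
            for p16, p32 in zip(model_fp16.parameters(), model.parameters()):
                p16.copy_(p32)
        model_fp16.zero_grad()

        # 3.2: forward pass with FP16 weights and activations
        logits = model_fp16(x.half())
        loss = loss_fn(logits.float(), y)

        # 3.3 + 3.4: scale the loss by S, then backpropagate (FP16 gradients)
        (loss * S).backward()

        grads = [p.grad for p in model_fp16.parameters()]
        if any(not torch.isfinite(g).all() for g in grads):
            # 3.5: Inf/NaN in the gradients -> reduce S and skip this update
            S /= 2.0
            good_steps = 0
            continue

        # 3.6 + 3.7: multiply the gradients by 1/S and update the FP32 master weights
        with torch.no_grad():
            for p32, g in zip(model.parameters(), grads):
                p32 -= lr * (g.float() / S)

        # 3.8: if there was no Inf/NaN for N iterations, increase S again
        good_steps += 1
        if good_steps == N:
            S *= 2.0
            good_steps = 0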



Mixed Precision APIs

Mixed Precision In Frameworks

Automatic Mixed Precision package - PyTorch
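
As a minimal usage sketch of PyTorch’s torch.cuda.amp package (available since PyTorch 1.6), with an illustrative model, optimizer, and random data; autocast runs each op in FP16 or FP32 as appropriate, and GradScaler takes care of dynamic loss scaling:

    import torch
    import torch.nn as nn
    from torch.cuda.amp import autocast, GradScaler

    device = "cuda"                             # autocast here targets CUDA
    model = nn.Linear(128, 10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    scaler = GradScaler()                       # handles dynamic loss scaling

    for step in range(1000):
        x = torch.randn(32, 128, device=device)
        y = torch.randint(0, 10, (32,), device=device)

        optimizer.zero_grad()
        with autocast():                        # forward pass in mixed precision
            loss = loss_fn(model(x), y)

        scaler.scale(loss).backward()           # backward pass on the scaled loss
        scaler.step(optimizer)                  # unscales grads, skips the step on Inf/NaN
        scaler.update()                         # adjusts the scaling factor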

References
