Quantization Algorithms

Symmetric Linear Quantization

In this method, a float value is quantized by multiplying with a numeric constant (the scale factor), hence it is Linear. We use a signed integer to represent the quantized range, with no quantization bias (or "offset") used. As a result, the floating-point range considered for quantization is symmetric with respect to zero.
In the current implementation the scale factor is chosen so that the entire range of the floating-point tensor is quantized (we do not attempt to remove outliers).
Let us denote the original floating-point tensor by \(x_f\), the quantized tensor by \(x_q\), the scale factor by \(q_x\) and the number of bits used for quantization by \(n\). Then, we get:

\[q_x = \frac{2^{n-1} - 1}{\max|x_f|}\]

\[x_q = round(q_x x_f)\]

(The \(round\) operation is round-to-nearest-integer.)
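
To make the two formulas above concrete, here is a minimal PyTorch sketch. The helper names symmetric_quantize and dequantize are assumptions chosen for illustration, not part of any library API:

```python
import torch

def symmetric_quantize(x_f: torch.Tensor, num_bits: int = 8):
    """Quantize a float tensor symmetrically around zero into num_bits signed integer levels."""
    # The scale factor covers the full range of the tensor (no outlier clipping)
    sat_val = x_f.abs().max()
    q_x = (2 ** (num_bits - 1) - 1) / sat_val
    # Round-to-nearest-integer; the result is still a float tensor, but holds integer values
    x_q = torch.round(q_x * x_f)
    return x_q, q_x

def dequantize(x_q: torch.Tensor, q_x):
    """Recover an approximation of the original floating-point tensor."""
    return x_q / q_x
```

Because the scale factor is derived from the full range of the tensor, a single large outlier stretches the representable range and reduces the resolution available to the remaining values.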

Let's see how a convolution or fully-connected (FC) layer is quantized using this method (we denote the input, output, weights and bias by \(x\), \(y\), \(w\) and \(b\) respectively):

\[y_f = \sum{x_f w_f} + b_f = \sum{\frac{x_q}{q_x}\frac{w_q}{q_w}} + \frac{b_q}{q_b} = \frac{1}{q_x q_w}\left(\sum{x_q w_q} + \frac{q_x q_w}{q_b}b_q\right)\]

\[y_q = round(q_y\,y_f) = round\left(\frac{q_y}{q_x q_w}\left(\sum{x_q w_q} + \frac{q_x q_w}{q_b}b_q\right)\right)\]

Note how the bias has to be re-scaled to match the scale of the summation.
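
The sketch below walks through these equations for an FC layer, re-using the symmetric_quantize helper from the previous sketch. The function name quantized_linear and its structure are illustrative assumptions, not the actual implementation:

```python
import torch
import torch.nn.functional as F

def quantized_linear(x_f, w_f, b_f, num_bits=8):
    """Illustrative quantized FC forward pass following the equations above."""
    x_q, q_x = symmetric_quantize(x_f, num_bits)
    w_q, q_w = symmetric_quantize(w_f, num_bits)
    b_q, q_b = symmetric_quantize(b_f, num_bits)
    # Re-scale the quantized bias so it matches the scale of the x_q * w_q summation
    b_rescaled = torch.round((q_x * q_w / q_b) * b_q)
    # Accumulate in the integer domain (all values are integer-valued floats here)
    acc = F.linear(x_q, w_q) + b_rescaled
    # De-quantize the accumulator; the y_q = round(q_y * y_f) step would be applied
    # when the output is quantized again (e.g. as the input of the next quantized layer)
    y_f = acc / (q_x * q_w)
    return y_f
```

The accumulation is still carried out on floating-point tensors whose values are restricted to integers, which mirrors the simulated-quantization approach described in the Implementation section below.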

Implementation

We've implemented convolution and FC using this method.

  • They are implemented by wrapping the existing PyTorch layers with quantization and de-quantization operations. That is, the computation is done on floating-point tensors, but the values themselves are restricted to integer values (a minimal sketch of this wrapping pattern follows this list).
  • All other layers are unaffected and are executed using their original FP32 implementation.
  • For weights and biases, the scale factors are determined once at quantization setup ("offline"); for activations, the scale factor is determined dynamically at runtime ("online").
  • Important note: Currently, this method is implemented as inference only, with no back-propagation functionality. Hence, it can only be used to quantize a pre-trained FP32 model, with no re-training. As such, using it with low bit-widths (e.g. \(n < 8\)) is likely to lead to severe accuracy degradation for any non-trivial workload.
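
The following is a minimal sketch of the wrapping pattern described in the list above, assuming the symmetric_quantize helper from the earlier sketch. The class name QuantLinearWrapper and its structure are illustrative only and do not reflect the actual wrapper API; it also assumes the wrapped layer has a bias:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantLinearWrapper(nn.Module):
    """Wraps an existing FP32 Linear layer with quantize/de-quantize operations.
    Weights and bias are quantized once at setup ("offline"); the activation
    scale factor is computed from each input at runtime ("online")."""

    def __init__(self, wrapped: nn.Linear, num_bits: int = 8):
        super().__init__()
        self.num_bits = num_bits
        with torch.no_grad():
            # Offline: quantize weights and bias once, at quantization setup
            self.w_q, self.q_w = symmetric_quantize(wrapped.weight, num_bits)
            self.b_q, self.q_b = symmetric_quantize(wrapped.bias, num_bits)

    def forward(self, x_f):
        # Online: the activation scale factor depends on the range of the current input
        x_q, q_x = symmetric_quantize(x_f, self.num_bits)
        # Re-scale the bias to the scale of the x_q * w_q summation
        b_rescaled = torch.round((q_x * self.q_w / self.q_b) * self.b_q)
        acc = F.linear(x_q, self.w_q) + b_rescaled
        # De-quantize the accumulator back to floating point for the next layer
        return acc / (q_x * self.q_w)

# Hypothetical usage:
# fc = nn.Linear(128, 10)
# qfc = QuantLinearWrapper(fc, num_bits=8)
# y = qfc(torch.randn(4, 128))
```

Since no re-training is performed, such a wrapper is intended purely for inference on a pre-trained FP32 model.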