# Improving Inference Speeds of Transformer Models

“With great models come slower inference speeds.”

Deep Learning has evolved immensely, and it has **Transforme(r)d** NLP completely in the past 5 years. Although these models achieve state-of-the-art results on various NLP tasks, they are really big and slow. Bringing them to production can be a pain because of their large memory footprint and slow inference. At GumGum, we spend a considerable amount of time building models that are not just accurate but also lean and fast. Verity, the engine that powers our contextual targeting capabilities, requires models that can scale well to our production traffic. In this blog, we will look at three techniques for building faster Deep Learning models: Mixed Precision Training, Patience Based Early Exit (PABEE) and Knowledge Distillation.

# Mixed Precision Training

Recent deep learning models such as BERT and RoBERTa have millions of parameters, each taking 32 bits (4 bytes) when stored in single precision (fp32). Memory quickly becomes an issue during training, as each step requires computing and storing many additional values such as activations and gradients.

Therefore, if we could use a lower precision such as 16-bit floating point (fp16), we could cut the memory footprint of the model roughly in half. But naively using fp16 can severely hurt the accuracy of the model compared to fp32. To work around this, NVIDIA introduced Mixed Precision Training, which combines fp32 and fp16 in such a way that accuracy remains roughly the same while the memory footprint shrinks and inference gets faster.
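
To make the "by half" claim concrete, here is a small back-of-the-envelope sketch. It only counts the memory needed to store the weights themselves (not activations, gradients or optimizer state), and it uses the roughly 125M parameter count quoted for RoBERTa-base later in this post. The function name is my own, for illustration only.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_memory_mb(num_params: int, precision: str) -> float:
    """Memory needed to store the weights alone, in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e6


# Approximate parameter count of RoBERTa-base, as quoted in this post.
roberta_base_params = 125_000_000

for precision in ("fp32", "fp16"):
    print(precision, weight_memory_mb(roberta_base_params, precision), "MB")
```

The real training footprint is considerably larger than this, but the 2x ratio between the two precisions is the part that matters here.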

NVIDIA’s researchers found that when simply using fp16, approximately 5% of the weight gradients are too small (less than 2⁻²⁴) to be represented in fp16 and therefore become zero. The remaining non-zero weight gradients, when multiplied by the learning rate (which is itself a really small number) to compute the weight updates, can also become too small to be represented in fp16. With so many weight updates becoming zero, the model effectively stops learning, which severely hurts accuracy compared to a model trained in fp32.

**Keeping a Master Copy of FP32 weights** NVIDIA introduced a way to use fp16 for the forward pass, activations and backward pass, and to use fp32 only for the weight update step. This is done by keeping a master copy of the weights in fp32. While training in mixed precision, all the weights, activations and gradients are stored in fp16. During each optimizer step, the fp32 master copy is updated with the weight gradients, and the fp16 weights used in the next iteration are derived from it.
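
The effect of the master copy can be illustrated with a toy, single-weight example. This is only a sketch under my own assumptions: it emulates fp16 and fp32 rounding with Python's `struct` module (format codes `"e"` and `"f"`) instead of using a GPU, and the weight, gradient and learning rate values are made up for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest half-precision (fp16) value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest single-precision (fp32) value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def train(steps: int, grad: float, lr: float, use_master_copy: bool) -> float:
    """Run plain gradient descent on one weight and return the final fp16 weight."""
    master = to_fp32(1.0)  # fp32 master copy of the weight
    weight = to_fp16(1.0)  # fp16 copy used by the forward and backward pass
    for _ in range(steps):
        g = to_fp16(grad)  # the gradient is computed and stored in fp16
        if use_master_copy:
            # Update in fp32, then refresh the fp16 copy from the master.
            master = to_fp32(master - lr * g)
            weight = to_fp16(master)
        else:
            # Pure fp16: the update is smaller than the spacing between
            # fp16 values near 1.0, so it is rounded away at every step.
            weight = to_fp16(weight - to_fp16(lr * g))
    return weight


print("pure fp16       :", train(1000, grad=0.1, lr=1e-3, use_master_copy=False))
print("fp32 master copy:", train(1000, grad=0.1, lr=1e-3, use_master_copy=True))
```

In the pure fp16 run the weight never moves, because each individual update is rounded away, while the run with an fp32 master copy accumulates the same small updates and ends up close to 0.9.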

**Loss Scaling** Another important ingredient is Loss Scaling, and the authors of the NVIDIA paper clearly show the motivation behind it. They looked at the activation gradients computed in fp32 across all layers and showed (Figure 2 below) that the majority of the fp16 representable range is left unused, while most of the values fall below the fp16 representable range (less than 2⁻²⁴).

Their experiments showed that gradient values below 2⁻²⁷ weren’t as relevant, while values between 2⁻²⁷ and 2⁻²⁴ were critical to preserve in order to match fp32 accuracy. This served as the motivation for scaling these values up into the representable range. An efficient way to do this is to scale the loss itself: by the chain rule, scaling the loss scales all the gradient values by the same factor. This prevents small gradient values from becoming zero when they are represented in fp16. Right before the fp32 weight update, the weight gradients must be unscaled so that the update magnitudes stay the same as in fp32 training.
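
Here is a minimal numeric sketch of why scaling helps. As before, fp16 and fp32 are emulated with Python's `struct` module rather than on a GPU. The gradient value 2⁻²⁶ is picked from the critical range mentioned above, and the loss scale of 1024 is just an illustrative choice on my part, not a recommended setting.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest half-precision (fp16) value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest single-precision (fp32) value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def gradient_seen_by_optimizer(true_grad: float, loss_scale: float) -> float:
    """Gradient that reaches the fp32 weight update for a given loss scale."""
    # Chain rule: multiplying the loss by loss_scale multiplies every
    # gradient by loss_scale before it is stored in fp16.
    scaled_grad_fp16 = to_fp16(true_grad * loss_scale)
    # Unscale in fp32 right before the weight update.
    return to_fp32(scaled_grad_fp16 / loss_scale)


small_grad = 2.0 ** -26  # inside the range that matters, below what fp16 can hold

print("no scaling     :", gradient_seen_by_optimizer(small_grad, loss_scale=1.0))
print("loss scale 1024:", gradient_seen_by_optimizer(small_grad, loss_scale=1024.0))
```

Without scaling the gradient is flushed to zero, while with scaling it survives the trip through fp16 and is recovered after unscaling.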

Once we have a model trained using Mixed Precision, we can simply run inference in fp16, which can be more than twice as fast as fp32 inference.

# Patience Based Early Exit

The Patience-based Early Exit (PABEE) mechanism allows a model to stop inference dynamically, before the input has passed through every layer. It is a pretty straightforward, plug-and-play method for improving the efficiency of pre-trained language models (PLMs such as BERT, RoBERTa and ALBERT).

The widely used Early Stopping strategy employed in model training serves as the inspiration behind PABEE. The idea is to attach a classifier head to each layer of the PLM and dynamically stop inference if the predictions of these classifier heads remain unchanged for `t` consecutive times (see Figure below), where `t` is a pre-defined patience.
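
The exit rule described above can be sketched in a few lines of plain Python. This is not the authors' implementation: the function name, the `layers` and `heads` arguments, and the toy per-layer predictions are all hypothetical stand-ins for a real PLM with one classifier head per layer.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def pabee_predict(
    layers: Sequence[Callable[[Any], Any]],
    heads: Sequence[Callable[[Any], int]],
    hidden: Any,
    patience: int,
) -> Tuple[Optional[int], int]:
    """Run layers one by one and exit once the prediction has stayed
    unchanged for `patience` consecutive layers.

    Returns the final prediction and the number of layers actually used.
    """
    previous: Optional[int] = None
    unchanged = 0
    depth = 0
    for layer, head in zip(layers, heads):
        depth += 1
        hidden = layer(hidden)
        prediction = head(hidden)
        if previous is not None and prediction == previous:
            unchanged += 1  # same answer as the previous layer's head
        else:
            unchanged = 0  # the answer changed, so start counting again
        previous = prediction
        if unchanged >= patience:
            break  # early exit: the remaining layers are skipped
    return previous, depth


# Toy stand-in for a 6-layer model: each "layer" just increments a counter,
# and each head returns a fixed, made-up class prediction.
toy_predictions = [0, 1, 1, 1, 1, 1]
toy_layers = [lambda h: h + 1 for _ in toy_predictions]
toy_heads = [lambda h, p=p: p for p in toy_predictions]

label, layers_used = pabee_predict(toy_layers, toy_heads, hidden=0, patience=2)
print(label, layers_used)
```

In this toy run the prediction settles on class 1 at the second layer and stays unchanged for the next two, so inference stops after 4 of the 6 layers.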

The authors found that:

> although the model becomes more “confident” with its prediction as more layers join, the actual error rate instead increases after 10 layers. This phenomenon was discovered and named “overthinking” by Kaya et al.

Looking at figure (a) above, you can see that the model keeps improving on the training set, but after 2.5 epochs of training its performance on the dev set deteriorates. This is the familiar overfitting problem, and we can use early stopping to avoid it. Similarly, in figure (b), the prediction entropy keeps decreasing, i.e., the model becomes more confident as more layers are involved in making a prediction, yet the error rate gets worse in the later layers. Therefore the authors argue that:

> overfitting in training and overthinking in inference are naturally alike, inspiring us to adopt an approach similar to early stopping for inference.

# Knowledge Distillation

A variety of model compression and acceleration techniques are available, and Knowledge Distillation is one of them. The idea is to train a more compact student model to reproduce the behavior of a larger and heavier teacher model (or an ensemble of models).

Knowledge Distillation was popularized by Geoffrey Hinton, Oriol Vinyals and Jeff Dean in *Distilling the Knowledge in a Neural Network*, a paper presented at the NIPS 2014 Deep Learning Workshop. It is a great paper for understanding the basic ideas and motivation behind Knowledge Distillation, and a highly recommended read. Let me try to break down some of the key ideas mentioned in the paper:

- The knowledge in a trained model is usually identified with its learned parameter values (weights). This is the most literal view. A more abstract view is that the knowledge is the learned mapping from input vectors to output vectors. It is this abstract view of knowledge that knowledge distillation works with: a smaller student model learns a less complex mapping, guided by the more complex mapping provided by a larger teacher model.
- The true objective of any model training is to generalize well to new data, but this requires information about the correct way to generalize, which is not normally available. So more often than not, a model is trained to maximize performance on the training data by maximizing the average log probability of the correct class. As a side effect, the model also assigns probabilities to the incorrect classes. Although these values are very small, some of them are much larger than others, and it is this relative information that tells us a lot about how the model tends to generalize.
- It is this generalization ability of a big teacher model that we aim to transfer to a smaller student model. This can be done by training a smaller student model using the class probabilities produced by the bigger teacher model as “soft targets”.
- Using “soft targets” is definitely the way to go, but there is a minor issue. Consider the MNIST task, where the model learns to identify a digit from 0 to 9 given an image of a handwritten digit. This is a well-known problem, and there are big, cumbersome models that predict the correct class with very high confidence. But the ratios of the small probabilities given to the incorrect classes carry important information. For example, a “2” could be given a probability of 0.000000001 of being a “3” and 0.000001 of being a “7”. This information is valuable when generalizing to new examples, as it tells the model which “2”s look like “3”s and which look like “7”s. But as you may have guessed by now, since these probabilities are so close to zero, they have very little influence on the cross-entropy cost function when transferring the generalization capability of the teacher to the student.
- Therefore, the authors introduced the concept of “Distillation”, where the temperature of the final softmax is raised until the teacher model produces a suitably soft set of targets.

$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

Here, z_i are the logits, q_i are the class probabilities, and T is a temperature that is normally set to 1; using a higher value produces a softer probability distribution over the classes.

- In the simplest case, we can train a smaller student model using the soft targets produced by the bigger teacher model using a high temperature in its softmax. The same high temperature is used in the softmax of the smaller distilled student model as well during training. Once the distilled model is trained, it uses a temperature of 1. Since we are using just the soft targets from a teacher model, we don’t really need a labeled dataset assuming we have a trained teacher model.
- If we do have a labeled dataset available, we can also train the smaller distilled student model on the hard targets. We can do this by simply training on a weighted average of two different objective functions (both cross-entropy): one with soft targets and one with hard targets.
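
The last few points can be tied together in a short sketch: a softmax with temperature, and a loss that averages a soft-target term and a hard-target term. The logits, the temperature of 2 and the weight `alpha` are made-up values for illustration, and the function names are my own. The `temperature ** 2` factor on the soft term follows the recommendation in Hinton et al. for the case where both soft and hard targets are used.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax with temperature; a higher temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(targets: Sequence[float], predicted: Sequence[float]) -> float:
    """Cross-entropy between a target distribution and a predicted one."""
    eps = 1e-12  # guards against log(0) for extremely small probabilities
    return -sum(t * math.log(max(p, eps)) for t, p in zip(targets, predicted))


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    hard_label: int,
    temperature: float = 2.0,  # illustrative value
    alpha: float = 0.5,  # illustrative weight of the soft-target term
) -> float:
    """Weighted average of the soft-target and hard-target cross-entropies."""
    soft_targets = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    soft_loss = cross_entropy(soft_targets, soft_student)
    # The hard-target term uses the student's ordinary softmax (T = 1).
    hard_loss = -math.log(max(softmax(student_logits)[hard_label], 1e-12))
    return alpha * temperature ** 2 * soft_loss + (1 - alpha) * hard_loss


# Made-up teacher logits for an image of a "2", over the classes "2", "3", "7".
teacher_logits = [12.0, 3.0, 6.0]
print("T=1:", [round(p, 4) for p in softmax(teacher_logits, 1.0)])
print("T=5:", [round(p, 4) for p in softmax(teacher_logits, 5.0)])
print("loss:", distillation_loss([8.0, 1.0, 2.0], teacher_logits, hard_label=0))
```

At T=1 almost all the probability mass sits on the correct class, while at T=5 the relative sizes of the "3" and "7" probabilities become large enough to actually influence the student's loss.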

DistilBERT was one of the first distilled models for Transformer-based architectures. The student is trained with a distillation loss (L_ce) over the soft target probabilities of the teacher, which is a cross-entropy between the two distributions:

$$L_{ce} = -\sum_i t_i \log(s_i)$$

Here t_i is the probability assigned to class i by the bigger teacher model (BERT) and s_i is the probability assigned by the distilled student model, both computed with the same temperature-based softmax.

The final training objective is a linear combination of the distillation loss (L_ce) and the supervised training loss (the masked language modeling loss). The authors also found it beneficial to add a cosine embedding loss, which tends to align the directions of the student and teacher hidden state vectors. For more details, please refer to the original paper.
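
As a rough sketch of how these three terms fit together, the snippet below computes a cosine embedding loss for one pair of hidden state vectors (one minus their cosine similarity, which is the form this loss takes for pairs that should align) and combines it linearly with the other two losses. The function names and the equal default weights are my own placeholders, not the values used to train DistilBERT.

```python
import math
from typing import Sequence


def cosine_embedding_loss(student_hidden: Sequence[float],
                          teacher_hidden: Sequence[float]) -> float:
    """1 - cosine similarity: zero when the two vectors point the same way."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)


def distillation_objective(l_ce: float, l_mlm: float, l_cos: float,
                           w_ce: float = 1.0, w_mlm: float = 1.0,
                           w_cos: float = 1.0) -> float:
    """Linear combination of the three losses (the weights are placeholders)."""
    return w_ce * l_ce + w_mlm * l_mlm + w_cos * l_cos


# Same direction but different length: the loss is (numerically) zero.
print(cosine_embedding_loss([1.0, 0.0], [2.0, 0.0]))
# Orthogonal hidden states: the loss is one.
print(cosine_embedding_loss([1.0, 0.0], [0.0, 1.0]))
```

Note that this loss only cares about direction, so a student whose hidden states are scaled versions of the teacher's is not penalized.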

Since I was working with RoBERTa models, I used DistilRoBERTa, which is a distilled version of the RoBERTa-base model. It follows the same training procedure as DistilBERT. The model has 6 layers, 768 dimensions and 12 heads, totaling 82M parameters (compared to 125M parameters for RoBERTa-base, so roughly a third fewer). On average, DistilRoBERTa is twice as fast as RoBERTa-base.

# Results

I tried all of the techniques discussed above on a 29-class multilabel, multiclass classification problem that I dealt with at GumGum. Applying mixed precision training and running inference in fp16 gave roughly a 3x increase in speed compared to fp32 inference, with the difference in F0.5 (F-beta at beta = 0.5) not being statistically significant. With RoBERTa-base, the fp32 F0.5 was 76.81% at an inference speed of 29 msg/s, while the fp16 F0.5 was 76.75% at an inference speed of 90 msg/s.
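
For readers who have not used F0.5 before, here is the standard F-beta formula as a small function. With beta = 0.5, precision is weighted more heavily than recall. The precision and recall values in the example are made up, and how the scores were aggregated across the 29 classes is not covered by this sketch.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score computed from precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when both are zero
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Made-up values: swapping precision and recall changes F0.5,
# because beta = 0.5 favors precision.
print(f_beta(precision=0.9, recall=0.6))
print(f_beta(precision=0.6, recall=0.9))
```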

For this reason, I also used Mixed Precision training when training RoBERTa with PABEE and DistilRoBERTa. The authors of *Patience Based Early Exit (PABEE)* experimented with BERT and ALBERT models to showcase PABEE’s efficiency. Since I was working with RoBERTa models, I extended PABEE to RoBERTa.

Below are the results of fp16 inference for RoBERTa, RoBERTa with PABEE, and DistilRoBERTa. The results shown are on a single NVIDIA Tesla V100 GPU with 16 GB of memory:

On bootstrapping these results, the differences in the evaluation metrics were not statistically significant. Given the huge gains in inference speed, DistilRoBERTa was chosen as the final model.
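
For anyone who wants to run a similar check, below is a minimal sketch of one common way to do it: a paired bootstrap over per-example scores. It is an assumption on my part that something along these lines applies to your setup. It also simplifies things by averaging per-example scores rather than recomputing F0.5 on each resample, and the function name, resample count and toy scores are purely illustrative.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(scores_a: Sequence[float], scores_b: Sequence[float],
                        n_resamples: int = 2000, seed: int = 0) -> Tuple[float, float]:
    """95% bootstrap confidence interval for mean(scores_a) - mean(scores_b).

    Both models are scored on the same examples, and each resample draws
    the same example indices for both (hence "paired").
    """
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        diffs.append(mean_a - mean_b)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return lower, upper


# Toy per-example scores (1.0 = correct, 0.0 = wrong) for two models.
model_a = [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
model_b = [1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0]

lower, upper = paired_bootstrap_ci(model_a, model_b)
# If the interval contains 0, the difference is not significant at this level.
print(lower, upper, "significant" if not (lower <= 0.0 <= upper) else "not significant")
```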

# Conclusion

In this blog we discussed Mixed Precision Training, Patience Based Early Exit and Knowledge Distillation as ways to achieve greater inference speeds. There are a few other techniques, like Pruning, which were not discussed here. Pruning can be slightly tough to implement, but it can also be used to get lighter and faster models. Patience Based Early Exit is also a great technique, but it relies heavily on the amount of training data in order to learn to exit early. In my experiments, the model most often exited around the 9th or 10th layer, which is good but not great. Maybe with more data, it could learn to exit earlier than that, giving us more speed gains. PyTorch provides built-in support for Mixed Precision Training, and Hugging Face provides distilled versions of various Transformer-based models like BERT and RoBERTa. It would be interesting to combine all three techniques discussed here by performing mixed precision training for DistilRoBERTa with PABEE.

**About Me**: I graduated with a Master's in Computer Science from ASU, and I am an NLP Scientist at GumGum. I am interested in applying Machine Learning/Deep Learning to provide some structure to the unstructured data that surrounds us.