Large language model inference on CPU

Transformer-based large language models (LLMs) have the potential to understand different types of data and provide insights like never before. They are being used across business areas and can solve dozens of problems.

Well, here’s the issue: these models are large, and running them at a reasonable speed usually requires a GPU.

What if we could use large language models for inference without a GPU?

First, let’s see what goes into fine-tuning a large language model:

Domain adaptation is performed when a language model was pre-trained on data entirely different from the data we want our model to work with. Causal Language Modeling (CLM), Masked Language Modeling (MLM) and similar objectives can be used for domain adaptation.
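As an illustration, here is a minimal sketch of domain adaptation with MLM using 🤗 transformers. The checkpoint name and the file domain_corpus.txt are placeholders for your own base model and in-domain text, not something prescribed by this post.

```python
# Minimal MLM domain-adaptation sketch; checkpoint and data file are placeholders.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "bert-base-uncased"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Plain-text corpus from the target domain (placeholder file name)
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

# Randomly mask 15% of the tokens so the model re-learns domain vocabulary
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="mlm-domain-adapted",
                         num_train_epochs=1, per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```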

We will be considering inference for text classification in this post. The main steps for model compression are:

1. Shrink the architecture with a custom config (fewer hidden layers and attention heads).
2. Convert the fine-tuned model to the ONNX format.
3. Quantize the ONNX model.

The encoder of these language models typically consists of 12 hidden layers and 12 attention heads, so pushing data through the full stack takes a lot of time. I won’t go into the details of how transformers work, as plenty of good material is already out there. But if a tiny model with a much smaller stack of layers could capture the same relevant details as the full LLM, wouldn’t that be amazing!

Studies suggest that different layers of these models capture different information: the initial layers gather surface features, while the middle and final layers are responsible for syntactic and semantic features respectively. Downstream tasks such as classification (binary, multi-class and multi-label) can often be handled with comparable accuracy without knowledge of all of these features. For most LLMs we don’t have smaller versions available on the 🤗 Hub, but transformers lets us define a custom config for any model, where we can tweak the architecture and pick parameters of our own choice. We can vary the number of hidden layers and attention heads between 1 and 12, and adjust other parameters depending on the requirement. Selecting 4 hidden layers and 4 attention heads, as in the code below, can be a good starting point to see how the model performs on a given dataset.
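Here is a minimal sketch of such a downsized classifier built from a custom config. The checkpoint name and num_labels are placeholders for your own task.

```python
# Downsized encoder from a custom config; checkpoint and num_labels are placeholders.
from transformers import AutoConfig, AutoModelForSequenceClassification

config = AutoConfig.from_pretrained(
    "bert-base-uncased",   # example checkpoint, any encoder works
    num_hidden_layers=4,   # down from the usual 12
    num_attention_heads=4, # down from the usual 12
    num_labels=2,          # e.g. binary classification
)

# Built with the smaller architecture and random weights; it still needs to be
# fine-tuned (and optionally domain-adapted) on the classification dataset.
model = AutoModelForSequenceClassification.from_config(config)
print(f"Parameters: {model.num_parameters() / 1e6:.1f}M")
```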

All of these parameters can be adjusted for the use case until the desired accuracy is achieved. This alone can decrease the model size by up to 4x, which in turn results in faster training and inference.

The next step is converting the fine-tuned PyTorch model to the ONNX format, a static graph that inference engines such as ONNX Runtime can execute efficiently on CPU. With this conversion, the model size can be reduced by up to ~3x.
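A minimal export sketch with torch.onnx is shown below, assuming a BERT-style model and tokenizer saved under the placeholder directory tiny-classifier.

```python
# Export the fine-tuned classifier to ONNX; "tiny-classifier" is a placeholder path.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_dir = "tiny-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
# torchscript=True makes the model return plain tuples, which trace cleanly
model = AutoModelForSequenceClassification.from_pretrained(model_dir, torchscript=True)
model.eval()

dummy = tokenizer("example input", return_tensors="pt")
dynamic = {0: "batch", 1: "sequence"}
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"], dummy["token_type_ids"]),
    "tiny-classifier.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": dynamic, "attention_mask": dynamic,
                  "token_type_ids": dynamic, "logits": {0: "batch"}},
    opset_version=14,
)
```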

Quantization is a mechanism that lowers the precision of the ONNX model, for example converting floating-point numbers (fp16, fp32) to int8. This significantly improves inference speed and reduces the model size by up to 6x. We are using post-training quantization here.
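Assuming the exported file from the previous sketch, post-training dynamic quantization with ONNX Runtime can be as short as:

```python
# Post-training dynamic quantization of the exported ONNX model.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="tiny-classifier.onnx",
    model_output="tiny-classifier-int8.onnx",
    weight_type=QuantType.QInt8,  # store weights as 8-bit integers
)
```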

Combining the above three techniques, it is possible to run predictions for thousands or even millions of records without a GPU. Together they can reduce the model size by up to 25x, which can bring it down to as little as ~80 MB. But we need to be cautious and evaluate model performance at each stage so that there is no significant drop in overall accuracy. There is also a lot of ongoing work on efficiently compressing transformer models while preserving accuracy, such as pruning, distillation and quantization, which will make these intelligent models even easier to adopt. Putting the pieces together, running the compressed model on CPU looks like the sketch below.
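This is a CPU-only inference sketch that reuses the placeholder file names and tokenizer from the earlier steps.

```python
# CPU-only batch inference with the quantized model; names are placeholders.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiny-classifier")
session = ort.InferenceSession("tiny-classifier-int8.onnx",
                               providers=["CPUExecutionProvider"])

texts = ["the product works great", "support never replied"]
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="np")
logits = session.run(
    ["logits"],
    {name: enc[name].astype(np.int64)
     for name in ("input_ids", "attention_mask", "token_type_ids")},
)[0]
print(logits.argmax(axis=-1))  # predicted class index for each record
```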
