Bits-and-bytes is a very fast and straightforward approach to quantization: the model is quantized on the fly while it is being loaded. Speed and quality are not optimal, but it is useful for quickly quantizing and loading models without a separate quantization step.
Quantizing with transformers
Let's do a short demo and quantize Mistral 7B!
First, we install transformers and all required dependencies.
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
Once we're done, we can download the model we want to quantize. First, let's log in with a read access token so we have access to the models.
Note: You need to first accept the terms in the repo.
from huggingface_hub import login
login("read_token")  # replace with your Hugging Face read access token
Now everything is ready, so we can load the model and quantize it! Here, we will quantize the model to 8-bit.
import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer
pretrained_model_dir = "mistralai/Mistral-7B-Instruct-v0.3"
quantized_model_dir = "mistral_bnb_quant"
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True
)
Contrary to other methods, BnB is pretty fast and efficient. We do not need to quantize the model beforehand; we can do it on the fly while loading!
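As an aside, BitsAndBytesConfig also supports 4-bit quantization. Below is a minimal sketch of an NF4 setup (the quantization_config_4bit name is just for illustration and is not used in the rest of this demo; we keep working with the 8-bit config):
quantization_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)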
model = AutoModelForCausalLM.from_pretrained(pretrained_model_dir, quantization_config=quantization_config)
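# Optional sanity check (not part of the original demo): report how much memory
# the quantized model occupies, using transformers' get_memory_footprint() helper.
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")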
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir)
Once ready, you can use the model as follows:
# Build a chat-formatted prompt from the conversation using the model's chat template
conversation = [{"role": "user", "content": "Tell me a joke."}]
prompt = tokenizer.apply_chat_template(
    conversation,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)

# Tokenize the prompt, move it to the GPU, and generate up to 64 new tokens
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
inputs = inputs.to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=64)
response = tokenizer.decode(outputs[0], skip_special_tokens=False)
print(response)
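Finally, if you want to reuse the quantized weights later, you can persist them with save_pretrained. This is a sketch using the quantized_model_dir defined earlier; serializing bitsandbytes 8-bit checkpoints requires reasonably recent transformers and bitsandbytes versions.
# Save the quantized model and tokenizer so they can be reloaded later
model.save_pretrained(quantized_model_dir)
tokenizer.save_pretrained(quantized_model_dir)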