AWQ


Similar to GPTQ, AWQ is optimized for GPU inference. It is based on the observation that only about 1% of weights contribute significantly to the model's accuracy, so these salient weights must be treated delicately: a calibration dataset is used to analyze activation distributions during inference, identify the important weight channels, and protect them by scaling them up before quantization (and scaling back afterwards), which reduces their quantization error.
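The intuition can be shown with a toy NumPy sketch (this is an illustration of the scaling idea, not AutoAWQ's actual algorithm): one input channel has unusually large activations but small weights, so scaling its weight row up before round-to-nearest quantization shrinks its rounding error without changing the quantization step much.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_rtn(w, bits=4):
    # Symmetric round-to-nearest quantization with one step per tensor.
    qmax = 2 ** (bits - 1) - 1
    delta = np.abs(w).max() / qmax
    return np.round(w / delta) * delta

d = 64
w = rng.normal(size=(d, d))
x = rng.normal(size=(256, d))
x[:, 0] *= 50   # input channel 0 has very large activations ("salient")...
w[0] *= 0.1     # ...while its weights are small in magnitude

# Plain RTN: the salient channel's small weights round away badly.
err_rtn = np.abs(x @ w - x @ quantize_rtn(w)).mean()

# AWQ-style protection: scale the salient weight row up by s before
# quantizing, divide it out afterwards. Row 0 stays below the tensor
# max, so the quantization step barely changes, but the rounding error
# on the salient channel shrinks by roughly a factor of s.
s = np.ones((d, 1))
s[0] = 4.0
w_q = quantize_rtn(w * s) / s
err_awq = np.abs(x @ w - x @ w_q).mean()

print(f"output error: RTN {err_rtn:.3f} vs AWQ-style {err_awq:.3f}")
```

The scaled and unscaled forms are mathematically equivalent at full precision; the gain comes entirely from where the rounding error lands.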

Quantizing with AutoAWQ

Let's do a short demo and quantize Mistral 7B!

First, we install autoawq, which lets us easily quantize AWQ models and run inference on them. By default, AutoAWQ also provides a pile-val calibration dataset that will be used during the quantization process.

!pip install autoawq

Once that's done, we can download the model we want to quantize. First, let's log in with a read-access token so we have access to the model.

Note: You need to first accept the terms in the repo.

from huggingface_hub import login

login("read_token")  # replace with your Hugging Face read-access token

Now everything is ready, so we can load the model and quantize it! Here, we will quantize the model to 4-bit!

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

pretrained_model_dir = "mistralai/Mistral-7B-Instruct-v0.3"
quantized_model_dir = "mistral_awq_quant"

model = AutoAWQForCausalLM.from_pretrained(
    pretrained_model_dir, low_cpu_mem_usage=True, use_cache=False
)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, trust_remote_code=True)

# quantize the model
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)

Now that the model is quantized, we can save it so we can share it or load it later! Since AWQ quantization takes a while and a fair amount of resources, it's advisable to always save the quantized model.

model.save_quantized(quantized_model_dir)

tokenizer.save_pretrained(quantized_model_dir)

The model is now quantized to 4-bit AWQ precision and saved!
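As a rough sanity check on the resource savings, we can estimate the weight footprint before and after (back-of-the-envelope arithmetic only; real checkpoints also store group scales and zero points, and some layers may stay in higher precision, so the actual files are somewhat larger):

```python
params = 7.25e9                  # approx. parameter count of Mistral-7B-v0.3
fp16_gb = params * 2 / 1024**3   # 2 bytes per weight in fp16
int4_gb = params * 0.5 / 1024**3 # 4 bits = 0.5 bytes per weight
print(f"fp16: ~{fp16_gb:.1f} GiB, int4: ~{int4_gb:.1f} GiB")
```

That roughly 4x reduction is what lets a 7B model fit comfortably on a single consumer GPU.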

You can also load it for inference using autoawq as follows:

model = AutoAWQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0") # loads quantized model to the first GPU
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir)

conversation = [{"role": "user", "content": "How are you today?"}]

prompt = tokenizer.apply_chat_template(
            conversation,
            tokenize=False,
            add_generation_prompt=True,
)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")  # move input tensors to the first GPU

outputs = model.generate(**inputs, max_new_tokens=32)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(response)