Like GPTQ, AWQ is optimized for GPU inference. It is based on the observation that only ~1% of weights contribute significantly to the model's accuracy. These salient weights must be handled carefully, so AWQ runs a calibration dataset through the model, analyzes the activation distributions during inference, and uses them to identify the important weights.
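To build intuition for the activation-aware idea, here is a toy NumPy sketch (not AutoAWQ's actual internals): rank weight channels by the average magnitude of the activations feeding them, then keep the top ~1% as the salient channels that deserve extra care.

```python
import numpy as np

# Toy illustration of activation-aware salient-channel selection.
rng = np.random.default_rng(0)
acts = rng.normal(size=(512, 4096))             # [calibration tokens, hidden dim]
channel_importance = np.abs(acts).mean(axis=0)  # mean |activation| per channel

k = int(0.01 * channel_importance.size)         # top ~1% of channels
salient = np.argsort(channel_importance)[-k:]   # indices of salient channels
print(f"{k} salient channels out of {channel_importance.size}")
```

In the real algorithm these statistics are computed from a calibration set (pile-val by default in AutoAWQ) and used to rescale channels before quantization rather than to skip them.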
Quantizing with AutoAWQ
Let's do a short demo and quantize Mistral 7B!
First, we install autoawq, which lets us easily quantize AWQ models and run inference on them. By default, AutoAWQ uses the pile-val dataset for the quantization process.
!pip install autoawq

Once we're done, we can download the model we want to quantize. First, let's log in with a read access token so we have access to the model.
Note: You need to first accept the terms in the repo.
from huggingface_hub import login
login("read_token")

Now everything is ready, so we can load the model and quantize it! Here, we will quantize the model to 4-bit!
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
pretrained_model_dir = "mistralai/Mistral-7B-Instruct-v0.3"
quantized_model_dir = "mistral_awq_quant"
model = AutoAWQForCausalLM.from_pretrained(
    pretrained_model_dir, low_cpu_mem_usage=True, use_cache=False
)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, trust_remote_code=True)
# quantize the model
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
model.quantize(tokenizer, quant_config=quant_config)

Now that the model is quantized, we can save it so we can share it or load it later! Since quantizing with AWQ takes a while and some resources, it's advised to always save the result.
model.save_quantized(quantized_model_dir)
tokenizer.save_pretrained(quantized_model_dir)

Model quantized and saved in AWQ 4-bit precision!
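To see what the quant_config above means numerically, here is a toy sketch (simplified, not AutoAWQ's implementation) of group-wise asymmetric quantization: with "zero_point": True, "q_group_size": 128, and "w_bit": 4, each group of 128 weights shares one scale and one zero point, and values are rounded to 4-bit integers in [0, 15].

```python
import numpy as np

def quantize_group(w, n_bits=4):
    """Asymmetric (zero-point) quantization of one group of weights."""
    qmax = 2**n_bits - 1                    # 15 for 4-bit
    scale = (w.max() - w.min()) / qmax      # one scale per group
    zero = np.round(-w.min() / scale)       # one zero point per group
    q = np.clip(np.round(w / scale) + zero, 0, qmax)
    return q, scale, zero

rng = np.random.default_rng(0)
row = rng.normal(size=1024)                 # one toy weight row
groups = row.reshape(-1, 128)               # q_group_size = 128
quantized = [quantize_group(g) for g in groups]

# Dequantize and check that the reconstruction error stays small
recon = np.concatenate([(q - z) * s for q, s, z in quantized])
print("max abs error:", np.abs(recon - row).max())
```

Smaller group sizes give lower reconstruction error but store more scales and zero points; 128 is a common trade-off.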
You can also load it for inference using autoawq as follows:
model = AutoAWQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0") # loads quantized model to the first GPU
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir)
conversation = [{"role": "user", "content": "How are you today?"}]
prompt = tokenizer.apply_chat_template(
    conversation,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt")
inputs = inputs.to("cuda:0")  # moves tensors to the first GPU
outputs = model.generate(**inputs, max_new_tokens=32)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
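As a rough back-of-the-envelope check on why this is worth the effort (weights only, ignoring group scales/zero points and any layers kept in higher precision):

```python
# Approximate weight memory for a 7B-parameter model at different precisions.
params = 7_000_000_000
fp16_gb = params * 16 / 8 / 1024**3   # 16 bits per weight
int4_gb = params * 4 / 8 / 1024**3    # 4 bits per weight
print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```

That is roughly a 4x reduction in weight memory, which is what makes a 7B model comfortably fit on a single consumer GPU.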