Like GPTQ, AWQ is optimized for GPU inference. It is based on the observation that only ~1% of weights contribute significantly to the model's accuracy. These salient weights must be handled carefully, so AWQ runs a calibration dataset through the model, analyzes the activation distributions during inference, and uses them to identify the important weights.
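To build intuition for the activation-aware idea, here is a toy NumPy sketch (not AutoAWQ's actual internals): rank weight channels by the average magnitude of the activations feeding them, then keep the top ~1% as the salient channels that deserve extra care.

```python
import numpy as np

# Toy illustration of activation-aware salient-channel selection.
rng = np.random.default_rng(0)
acts = rng.normal(size=(512, 4096))             # [calibration tokens, hidden dim]
channel_importance = np.abs(acts).mean(axis=0)  # mean |activation| per channel

k = int(0.01 * channel_importance.size)         # top ~1% of channels
salient = np.argsort(channel_importance)[-k:]   # indices of salient channels
print(f"{k} salient channels out of {channel_importance.size}")
```

In the real algorithm these statistics are computed from a calibration set (pile-val by default in AutoAWQ) and used to rescale channels before quantization rather than to skip them.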
Quantizing with AutoAWQ
Let's do a short demo and quantize Mistral 7B!
First, we install autoawq, which lets us easily quantize AWQ models and run inference on them. By default, AutoAWQ uses the pile-val dataset for the quantization process.
!pip install autoawq

Once we're done, we can download the model we want to quantize. First, let's log in with a read access token so we have access to the model.
Note: You need to first accept the terms in the repo.
from huggingface_hub import login
login("read_token")

Now everything is ready, so we can load the model and quantize it! Here, we will quantize the model to 4-bit!
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
pretrained_model_dir = "mistralai/Mistral-7B-Instruct-v0.3"
quantized_model_dir = "mistral_awq_quant"
model = AutoAWQForCausalLM.from_pretrained(
    pretrained_model_dir, low_cpu_mem_usage=True, use_cache=False
)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, trust_remote_code=True)
# quantize the model
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
model.quantize(tokenizer, quant_config=quant_config)

Now that the model is quantized, we can save it so we can share it or load it later! Since quantizing with AWQ takes a while and some resources, it's advised to always save the result.
model.save_quantized(quantized_model_dir)
tokenizer.save_pretrained(quantized_model_dir)

Model quantized and saved in AWQ 4-bit precision!
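To see what the quant_config above means numerically, here is a toy sketch (simplified, not AutoAWQ's implementation) of group-wise asymmetric quantization: with "zero_point": True, "q_group_size": 128, and "w_bit": 4, each group of 128 weights shares one scale and one zero point, and values are rounded to 4-bit integers in [0, 15].

```python
import numpy as np

def quantize_group(w, n_bits=4):
    """Asymmetric (zero-point) quantization of one group of weights."""
    qmax = 2**n_bits - 1                    # 15 for 4-bit
    scale = (w.max() - w.min()) / qmax      # one scale per group
    zero = np.round(-w.min() / scale)       # one zero point per group
    q = np.clip(np.round(w / scale) + zero, 0, qmax)
    return q, scale, zero

rng = np.random.default_rng(0)
row = rng.normal(size=1024)                 # one toy weight row
groups = row.reshape(-1, 128)               # q_group_size = 128
quantized = [quantize_group(g) for g in groups]

# Dequantize and check that the reconstruction error stays small
recon = np.concatenate([(q - z) * s for q, s, z in quantized])
print("max abs error:", np.abs(recon - row).max())
```

Smaller group sizes give lower reconstruction error but store more scales and zero points; 128 is a common trade-off.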
You can also load it for inference using autoawq as follows:
model = AutoAWQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0") # loads quantized model to the first GPU
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir)
conversation = [{"role": "user", "content": "How are you today?"}]
prompt = tokenizer.apply_chat_template(
    conversation,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt")
inputs = inputs.to("cuda:0")  # moves tensors to the first GPU
outputs = model.generate(**inputs, max_new_tokens=32)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
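As a rough back-of-the-envelope check on why this is worth the effort (weights only, ignoring group scales/zero points and any layers kept in higher precision):

```python
# Approximate weight memory for a 7B-parameter model at different precisions.
params = 7_000_000_000
fp16_gb = params * 16 / 8 / 1024**3   # 16 bits per weight
int4_gb = params * 4 / 8 / 1024**3    # 4 bits per weight
print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```

That is roughly a 4x reduction in weight memory, which is what makes a 7B model comfortably fit on a single consumer GPU.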