EXL2 is a more recent format based on the GPTQ optimization method, but with mixed quantization levels: different layers can be quantized at different bit widths, chosen so that the weighted average hits a desired target bitrate. This achieves lower quantization error than GPTQ at the same or a similar bitrate. It can have slightly higher VRAM usage, but offers better inference speed and quality.
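To make the "mixed levels, average bitrate" idea concrete, here is a minimal sketch with made-up layer names, parameter counts, and bit-width assignments (none of these numbers come from exllamav2 itself): sensitive layers keep more bits, robust ones are compressed harder, and the parameter-weighted average lands on the target.

```python
# Hypothetical illustration of mixed quantization levels:
# per-layer bit widths whose weighted average hits a target bitrate.
layer_params = {          # parameter counts per layer (made-up numbers)
    "attn":  4_000_000,
    "mlp_1": 8_000_000,
    "mlp_2": 8_000_000,
}
layer_bpw = {             # bits per weight assigned to each layer
    "attn":  6.0,         # sensitive layer kept at higher precision
    "mlp_1": 4.0,
    "mlp_2": 3.0,         # robust layer compressed harder
}

total_bits   = sum(layer_params[n] * layer_bpw[n] for n in layer_params)
total_params = sum(layer_params.values())
average_bpw  = total_bits / total_params
print(f"average bitrate: {average_bpw:.2f} bpw")  # → average bitrate: 4.00 bpw
```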
Quantizing with exllamav2
Let's do a short demo and quantize Mistral 7B!
First, let's install exllamav2 and all the required dependencies.
!git clone https://github.com/turboderp/exllamav2
!(cd exllamav2 && pip install -r requirements.txt && pip install .)
Once everything is installed, we can download the model.
model_id = "mistralai/Mistral-7B-Instruct-v0.3"
user_name = "huggingface_username"
hf_token = "read_token"
!git lfs install
!git clone https://{user_name}:{hf_token}@huggingface.co/{model_id}
Time to quantize! Let's go with a bitrate of 4.0 bits per weight (bpw).
model_name = model_id.split('/')[-1]
quant_bpw = 4.0
!mkdir temp
!python exllamav2/convert.py \
-i {model_name} \
-o temp/ \
-cf {model_name}-exl2/{quant_bpw}bpw/ \
-b {quant_bpw}
Model quantized and saved! You can test it with the following:
!python exllamav2/test_inference.py -m {model_name}-exl2/{quant_bpw}bpw -p "Once upon a time,"
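As a quick sanity check on disk usage, you can estimate the size of the quantized weights from the bitrate. The sketch below uses an approximate parameter count for Mistral 7B (the ~7.25e9 figure is an assumption, not an exact value from the model card):

```python
# Back-of-the-envelope size estimate for the quantized weights.
n_params = 7.25e9               # approximate parameter count for Mistral 7B
bpw = 4.0                       # target bits per weight from the demo
size_gb = n_params * bpw / 8 / 1e9   # bits -> bytes -> GB
print(f"~{size_gb:.1f} GB of quantized weights")  # → ~3.6 GB of quantized weights
```

Compare that with the ~14.5 GB the same weights occupy at 16-bit precision to see why a 4.0 bpw quant fits on much smaller GPUs.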