GGUF


Previously named GGML, GGUF is favored by much of the community for its ability to run efficiently on CPUs and Apple devices, offloading to a GPU when one is available. This makes it a good choice for local testing and deployment, since it can make good use of both RAM and VRAM (if available).
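
The RAM/VRAM split is controlled by how many transformer layers you offload to the GPU (the `-ngl` flag used later in this guide). As a rough illustrative sketch (not llama.cpp's actual memory accounting), assuming the weight file is spread evenly across layers, you can estimate how many whole layers fit in a given VRAM budget:

```python
def layers_that_fit(model_file_gb: float, n_layers: int, vram_gb: float) -> int:
    """Rough estimate: assume the weight file is split evenly across layers
    and offload as many whole layers as fit in the VRAM budget."""
    per_layer_gb = model_file_gb / n_layers
    return min(n_layers, int(vram_gb / per_layer_gb))

# e.g. a ~4.1 GB 4-bit file for a 32-layer 7B model with 8 GB of VRAM
print(layers_that_fit(4.1, 32, 8.0))  # 32 -> the whole model fits
```

In practice the KV cache and compute buffers also consume VRAM, so leave some headroom below this estimate.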

Quantizing with llama.cpp

Here is a list of possible quantizations with llama.cpp:

  • q2_k
  • q3_k_l
  • q3_k_m
  • q3_k_s
  • q4_0 <- 4-bit
  • q4_1
  • q4_k_s
  • q4_k_m <- Recommended
  • q5_0 <- 5-bit
  • q5_1
  • q5_k_s
  • q5_k_m <- Recommended
  • q6_k <- Recommended
  • q8_0 <- 8-bit, very close to lossless compared to the original weights
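
The main trade-off between these methods is file size versus quality. A quick back-of-the-envelope sketch: the quantized file is roughly parameters × bits-per-weight / 8. The bits-per-weight values below are illustrative assumptions (q4_0 and q8_0 follow exactly from llama.cpp's block layouts; the k-quant figures are approximate):

```python
# Illustrative bits-per-weight: q4_0 and q8_0 are exact for llama.cpp's
# block formats; the k-quant values are rough approximations.
BITS_PER_WEIGHT = {"q2_k": 2.6, "q4_0": 4.5, "q4_k_m": 4.8, "q8_0": 8.5}

def estimated_size_gb(n_params: float, method: str) -> float:
    """Approximate GGUF file size in GB: parameters x bits-per-weight / 8."""
    return n_params * BITS_PER_WEIGHT[method] / 8 / 1e9

print(round(estimated_size_gb(7e9, "q8_0"), 2))    # 7.44
print(round(estimated_size_gb(7e9, "q4_k_m"), 2))  # 4.2
```

So for a 7B model, q4_k_m lands around 4 GB while q8_0 is closer to 7.5 GB, which is why q4_k_m is the usual sweet spot for consumer hardware.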

Let's do a short demo and quantize Mistral 7B!

First, let's install llama.cpp and the required dependencies, and download the model.

model_id = "mistralai/Mistral-7B-Instruct-v0.3"
model_name = model_id.split('/')[-1]
user_name = "huggingface_username"
hf_token = "read_token"

# install llama.cpp
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && GGML_CUDA=1 make
!pip install -r llama.cpp/requirements.txt

# download model
!git lfs install
!git clone https://{user_name}:{hf_token}@huggingface.co/{model_id}

Once everything is installed and downloaded, we can convert our model to fp16, which is required before quantizing to GGUF.

fp16 = f"{model_name}/{model_name.lower()}.fp16.gguf"
!python llama.cpp/convert_hf_to_gguf.py {model_name} --outtype f16 --outfile {fp16}
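
Conversion can fail silently if the download was incomplete, so it is worth sanity-checking the output. A GGUF file always begins with the 4-byte magic `b"GGUF"`; a minimal check (the demo writes a synthetic header as a stand-in for the converted file):

```python
import os
import tempfile

def looks_like_gguf(path: str) -> bool:
    """GGUF files begin with the 4-byte magic b'GGUF'."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# demo on a synthetic header (stands in for the converted fp16 file)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"GGUF" + b"\x00" * 12)
print(looks_like_gguf(tmp.name))  # True
os.unlink(tmp.name)
```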

Now that our model is ready, we can quantize it. Feel free to change the method; in this example we will quantize to q4_k_m.

method = "q4_k_m"
qtype = f"{model_name}/{model_name.lower()}.{method.upper()}.gguf"
!./llama.cpp/llama-quantize {fp16} {qtype} {method}
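
If you want to produce several quantizations from the same fp16 file (a common workflow when publishing a repo with multiple variants), you can generate the commands in a loop. This is a small sketch using the same naming convention as above; the paths are illustrative:

```python
def quantize_cmd(model_name: str, fp16_path: str, method: str) -> str:
    """Build one llama-quantize invocation, mirroring the naming scheme above."""
    out = f"{model_name}/{model_name.lower()}.{method.upper()}.gguf"
    return f"./llama.cpp/llama-quantize {fp16_path} {out} {method}"

# e.g. emit a command per method you want to publish
for m in ["q4_k_m", "q5_k_m", "q8_0"]:
    print(quantize_cmd("Mistral-7B-Instruct-v0.3",
                       "Mistral-7B-Instruct-v0.3/mistral-7b-instruct-v0.3.fp16.gguf",
                       m))
```

Each printed line can then be run with `!` in the notebook, reusing the single fp16 conversion.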

Perfect, we can now test it with llama.cpp using the following:

!./llama.cpp/llama-cli -m {qtype} -n 128 --color -ngl 35 -cnv --chat-template mistral