Quick Start Guide To Converting Your Own GGUFs (including fp16)

IMPORTANT: This is by no means an exhaustive guide. Consider this a quick start tutorial to get you going. For a lot of models, what I have below is all you need, but for some models there's some wrestling involved with settings, vocab thingies, etc. This guide exists purely for people like me who want help taking the first step, and can then work out the rest on their own.

-----------------------------------------------------------

Once in a while you run across something that is really simple, and yet really useful. And sometimes that thing is so simple, in fact, that everyone assumes everyone else knows how to do it. I feel like this is something that happens here with GGUF conversions. I constantly see people say "it's just a script", but I rarely saw much more info than that.

So... why do it yourself? Several great members of this community are already producing quantized GGUFs, so what's the point? Well, for Mac users like me this is a big deal. I've been trying to get transformers to work nicely with Metal for a while so I can run unquantized models; I mean, what's the point of stupid amounts of VRAM if you can't do that? Never could get it to work though. Well, it turns out that you can make fp16 ggufs. And they run just fine on Mac (I'm running one now).

The Guide

NOTE: This process does NOT rely on the GPU the way inference does. This was a big hangup for me in even trying this, imagining that if I wanted to quantize a 70b model I'd need 140GB of VRAM. Nope!

I can't guarantee your personal computer is strong enough to do it since everyone's hardware is different, but I'd imagine a lot of folks reading this can.

What's below looks really complex; it's not. It's very simple, but I'm spelling out every step in case anyone gets confused by something.

  • Step 1) You'll need Python! I recommend grabbing the latest 3.11 version (I ran into problems with 3.12), and on Windows make sure to check the box during the install that adds Python to your PATH.
  • Step 2) You'll need llama.cpp. I just grab the latest version. If you have git, you know how already. If you don't have git, you can just click the green "Code" button and click the down arrow to download a zip of the code.
  • Step 3) You'll need a model! You can grab one over at huggingface. For my example, I'll use https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B, which is currently a popular favorite in the 7B range.
    • There are several ways to download these. The easiest and fastest way that I use is via Oobabooga. If you pop over to the model tab, on the right side is an area to download.
      • At the top of the huggingface page, you'll see the name of the model, to the left of the "like" button. Copy it, or click the little icon to copy it. For the above model, it should be "teknium/OpenHermes-2.5-Mistral-7B"
      • Paste that into the upper of the two text boxes on the right side, above "Download". Click "Download" and it should start.
    • IMPORTANT: Ensure that the model you are downloading has a file in it called "tokenizer.model". You can confirm this on huggingface by clicking the "Files and versions" tab. If it doesn't, there's a note below about that, but the rest of this won't work as of this writing (2023-12-04).
  • Step 4) Navigate to the llama.cpp folder you downloaded. You should see a file in there called requirements.txt; if so, you're in the right place. Open a command prompt/terminal window to this folder. The following command should install some stuff:
    • python -m pip install -r requirements.txt
      • You should get no major errors. Maybe some warnings. If you do get errors, then something went wrong.
  • Step 5) Now to convert! You will be specifying the path to the folder containing the model (we downloaded this in step 3), a path and filename for the output, and an output type.
    • This process is run with the command "python convert.py" followed by arguments. Below is an example of our model being converted to q8_0, assuming we downloaded it with Oobabooga. (Mac/Linux folks: there's a recap of the whole sequence right after this list.)
    • q8 Example: python convert.py C:\text-generation-webui-main\models\teknium_OpenHermes-2.5-Mistral-7B --outfile C:\Folder_For_GGUFs\OpenHermes-2.5-Mistral-7b.q8_0.gguf --outtype q8_0
      • Note that the output file name is entirely up to you. I named it according to the general standard I see others use, because I'm sure some front ends key off of that naming.
    • fp16 Example: python convert.py C:\text-generation-webui-main\models\teknium_OpenHermes-2.5-Mistral-7B --outfile C:\Folder_For_GGUFs\OpenHermes-2.5-Mistral-7b-fp16.gguf --outtype f16
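
For Mac/Linux folks (the examples above use Windows paths), here's a rough end-to-end recap of the same steps in a terminal. Treat it as a sketch rather than gospel: the ~/models and ~/GGUFs folders are just placeholders I made up, and pulling the model with git assumes you have git-lfs installed (use whatever download method you prefer otherwise).

  # grab llama.cpp and install the conversion dependencies
  git clone https://github.com/ggerganov/llama.cpp
  cd llama.cpp
  python -m pip install -r requirements.txt

  # grab the model (git-lfs handles the big weight files)
  git lfs install
  git clone https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B ~/models/OpenHermes-2.5-Mistral-7B

  # convert to q8_0, or swap in --outtype f16 for an fp16 gguf
  python convert.py ~/models/OpenHermes-2.5-Mistral-7B --outfile ~/GGUFs/OpenHermes-2.5-Mistral-7b.q8_0.gguf --outtype q8_0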

Tada! You have a gguf now.
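
If you want to quickly sanity-check the new gguf from the command line, llama.cpp's own "main" example can load it once you've built the project (building is covered in the quantize notes below); any gguf-aware front end works too. Something along these lines, with your own path swapped in, should print a short completion:

  ./main -m ~/GGUFs/OpenHermes-2.5-Mistral-7b.q8_0.gguf -p "Hello, my name is" -n 64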

NOTES

  • If you have a model that lacks the tokenizer.model file, this is a special huggingface tokenizer thingy. Currently, as of this writing, the convert.py in the main branch of llama.cpp does not handle these.
    • There is, however, a PR with a change to handle them. Deepseek 67b is such a model. https://github.com/ggerganov/llama.cpp/pull/3633
      • I won't give instructions on how to make use of this unmerged change, because it has not gone through the proper review process. I will say, however, that I see TheBloke has made use of it, and I also tried it with DeepSeek 67b for fp16 and it worked well for me. If you know how to make use of it, then you probably know enough to feel comfortable vetting the changes yourself before running it. If you don't know, then I'd wait for the PR to go into the main branch.
      • If you do this for DeepSeek, you'll need the argument --padvocab as well. I got an error otherwise, and the conversation on that PR told me to do that lol (there's an example command at the end of these notes).
  • Again- you don't need a fancy GPU for this.
  • An fp16 gguf is going to be about the size of the model folder you got from huggingface, since fp16 is roughly 2 bytes per parameter. So a 70b model will be around 130-140GB. That fits neatly on a 192GB Mac Studio!
  • Some models seem to be a little more of a pain to convert than others. From what I can tell, most will run quickly without any special settings, but I've seen a few conversations where folks had to work out some pretty challenging-looking stuff to make it work right, so just keep that in mind.
  • EDIT: This can generate an fp16 or a q8_0 file. To quantize lower than that, you need to use the "quantize" executable that comes with llama.cpp. NOTE: I haven't done this step myself; I just looked up the directions. It looks pretty straightforward. The below may not be 100% correct, but should get you close enough to figure the rest out.
    • On Windows, you have two options:
      • Use CMake to build the project (there's a rough example at the end of these notes)
      • Or, more easily, on the llama.cpp page there is a list of releases. You can download a zip file with all the executables built and ready to go. Quantize is in there. If you're using an Nvidia card, then you probably want the cuBLAS ones. For example, I'd personally get this zip for the current release: llama-b1610-bin-win-cublas-cu12.2.0-x64.zip
      • Then just navigate to the exe and you should be able to run something like quantize.exe C:\Folder_For_GGUFs\OpenHermes-2.5-Mistral-7b-fp16.gguf C:\Folder_For_GGUFs\OpenHermes-2.5-Mistral-7b.q5_K.gguf Q5_K
    • On Mac (and probably Linux), in the terminal window you already have open in the llama.cpp folder, you should just be able to type "make" and it will build all the stuff you need, including quantize.
      • After that, it looks like you run something like ./quantize somepath/Folder_For_GGUFs/OpenHermes-2.5-Mistral-7b-fp16.gguf somepath/Folder_For_GGUFs/OpenHermes-2.5-Mistral-7b.q5_K.gguf Q5_K
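
If you do go the CMake route on Windows instead of grabbing a release zip, the build is roughly along these lines. Again, I haven't run this myself, so treat it as a sketch: the cuBLAS flag is only needed for Nvidia GPU acceleration, and I believe the executables (quantize.exe included) end up under build\bin\Release.

  cmake -B build -DLLAMA_CUBLAS=ON
  cmake --build build --config Release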
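
And for reference on the DeepSeek note above, here's roughly what the fp16 conversion command looks like with that extra flag. The model folder and output paths are just placeholders, and this assumes you're running the convert.py from that PR rather than the main branch.

  python convert.py C:\models\deepseek-llm-67b-base --outfile C:\Folder_For_GGUFs\deepseek-llm-67b-base-fp16.gguf --outtype f16 --padvocab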