There's no doubt that the Llama 3 series models are the hottest models this week. The 70B version yields performance close to the top proprietary models, while the 8B version is roughly a ChatGPT-3.5-level model. The greatest thing is that the weights of these models are open, meaning you can run them locally! I decided to give it a try and see how feasible that is.
Results
I did the tests using Ollama, which allows you to pull a variety of LLMs and run them on your own computers. In the next section, I will share some tricks in case you want to run the models yourself. Here, I will focus on the results.
First, I tested the Llama 3 8B model on a virtual Linux machine with 8 CPUs, 30 GB of RAM, and no GPU. The model itself is about 4 GB. It only took a few commands to install Ollama and download the model (see below), and then it just worked! It generated text at ~20 tokens/second.
Then I tried it again on a bigger cluster with 70 CPUs, 180 GB of RAM, and no GPUs. The model output text at ~28 tokens/second. More CPUs didn't seem to help much, but this is already fast enough for many tasks.
On the same cluster, I then tested the 70B model, which is around 40 GB. The speed dropped to about 2.5 tokens/second. It still works, but that is too slow for most interactive use.
I'll try to run more experiments with GPUs later. However, if you only have CPU machines, the 8B model is already a viable solution. Considering that this is a ChatGPT-3.5-level model, I think it changes a lot of things, for better or worse.
For bigger models, it might be more practical to use the API providers. For example, Groq provides these models with crazy fast inference speed and reasonable prices.
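For reference, Groq exposes an OpenAI-compatible HTTP endpoint, so a request can look roughly like the sketch below (you need your own GROQ_API_KEY, and the exact model name may differ from what I show here):
curl https://api.groq.com/openai/v1/chat/completions \
  -H "Authorization: Bearer $GROQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3-70b-8192", "messages": [{"role": "user", "content": "Hello!"}]}'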
Tricks for using Ollama
If you want to run this yourself, here are some tricks that might be helpful for you.
With sudo
If you have sudo privileges on your Linux machine or cluster, you can simply follow the official instructions. The following command will install Ollama as a system service for you (that’s why it needs sudo privileges):
curl -fsSL https://ollama.com/install.sh | sh
Then, you can run a model by:
ollama run llama3
And yes, it’s that simple. Installing Ollama on Mac is similar. I don’t have a Windows machine, so I can’t comment on that.
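By the way, instead of the interactive prompt, you can also pass a one-shot prompt directly on the command line (the prompt text is just an example):
ollama run llama3 "Summarize the plot of Hamlet in one sentence."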
Without sudo
Now, let’s consider a more common situation where (1) you don’t have sudo privileges on the cluster and (2) you don’t have enough space in your home directory to store the models. Here is what you can do.
In your .bashrc file, add:
export OLLAMA_MODELS=/path/to/target/directory
This specifies the directory where the downloaded models will be stored.
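Remember to create that directory if it does not exist yet, and reload your shell configuration so the variable takes effect in your current session (the path is the same placeholder as above):
mkdir -p /path/to/target/directory
source ~/.bashrc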
Then download the binary to a directory that is on your PATH and that you can write to. Since we assume no sudo privileges, I use ~/.local/bin below; any other user-writable directory on your PATH works too:
curl -L https://ollama.com/download/ollama-linux-amd64 -o ~/.local/bin/ollama
Then, add execution permission to the binary:
chmod +x ~/.local/bin/ollama
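If that directory is not already on your PATH, you can add it in your .bashrc as well (a minimal sketch, assuming the ~/.local/bin location from above):
export PATH="$HOME/.local/bin:$PATH"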
Then, you need to run the Ollama server in the background:
ollama serve &
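If you are on a shared cluster and want the server to survive your logout and keep its logs, you could start it like this instead (just a sketch; the log file name is arbitrary):
nohup ollama serve > ollama.log 2>&1 &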
Now, you are ready to run the models:
ollama run llama3
Note that running the model directly gives you an interactive terminal to talk to the model. However, you can also access the models through HTTP requests. Once the Ollama server is running in the background, the HTTP endpoints are ready; you just need to follow the API documentation to interact with them.
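For example, a minimal request to the generate endpoint looks like this, assuming the server is running on the default port 11434 (the prompt is arbitrary):
curl http://localhost:11434/api/generate -d '{"model": "llama3", "prompt": "Why is the sky blue?", "stream": false}'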