Simon Willison mentioned this a few days ago. I didn’t take it seriously—until today.
A project called llamafile makes it insanely easy to run LLMs on your local machine. You just need to download a file and then run it. That’s it!
They provide a few different models. I tried the LLaVA 1.5 model, a multimodal model, on my M2 Mac mini with 8 GB of RAM, and the machine rebooted immediately. OK, fair enough.
But then I tried the Mistral-7B-Instruct model, which works (screenshot below)!
The statistics in the terminal suggest that the model can produce around 13 tokens per second, which is not bad at all.
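Beyond the chat UI it opens in the browser, the llamafile also serves a small local HTTP API, so you can script against it. Here is a minimal sketch, assuming the server is running on the default http://localhost:8080 and exposes the /completion endpoint from llama.cpp's example server (which llamafile embeds); the prompt and parameters are just illustrative:

```python
import json
import urllib.request

# Assumes a llamafile (e.g. Mistral-7B-Instruct) is already running and serving
# its built-in HTTP server on the default http://localhost:8080.
URL = "http://localhost:8080/completion"

payload = {
    "prompt": "[INST] Write a haiku about running LLMs on a Mac mini. [/INST]",
    "n_predict": 128,   # max tokens to generate
    "temperature": 0.7,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())

print(result["content"])  # the generated text
```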
Based on my understanding, this was made possible by the llama.cpp project, which reimplements LLaMA-family inference in plain C/C++ and supports Apple Silicon. Because of this, some people even argue that a Mac Studio with M2 Ultra becomes a cost-effective option if you want to deploy LLMs locally.
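If you want to drive llama.cpp from code rather than through llamafile, the llama-cpp-python bindings wrap the same engine. A rough sketch, where the GGUF file name is a placeholder and the prompt assumes a Mistral-style instruct template:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Load a local GGUF model; the path below is a placeholder.
# n_gpu_layers=-1 offloads all layers to the GPU (Metal on Apple Silicon).
llm = Llama(
    model_path="./mistral-7b-instruct.Q4_K_M.gguf",
    n_ctx=2048,
    n_gpu_layers=-1,
)

# Mistral-style instruct prompt; other models expect different templates.
out = llm(
    "[INST] Summarize what llama.cpp does in one sentence. [/INST]",
    max_tokens=128,
    temperature=0.7,
)
print(out["choices"][0]["text"])
```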
And just today, someone successfully ran the Mixtral 8x7B Instruct model (roughly a ChatGPT-3.5-level model) at 27 tokens per second on an M3 Max laptop with 64 GB of RAM.
This really changes a lot of things.
Some other options
There are some other solutions for deploying LLMs locally:
https://github.com/oobabooga/text-generation-webui
https://github.com/bentoml/OpenLLM
https://lmstudio.ai/beta-releases.html
https://github.com/ninehills/llm-inference-benchmark (This is a benchmark of different frameworks)
Based on their documentation, running these frameworks locally shouldn’t be too hard, either. I’ve also heard of people using them in production already.