Simon Willison mentioned this a few days ago. I didn’t take it seriously—until today.
A project called llamafile makes it insanely easy to run LLMs on your local machine. You just need to download a file and then run it. That’s it!
They provide a few different models. I tried the LLaVA 1.5 model, a multimodal model, on my M2 Mac mini with 8 GB of RAM, and the machine rebooted immediately. OK, fair enough.
But then I tried the Mistral-7B-Instruct model, which works (screenshot below)!
The statistics in the terminal suggest that the model can produce around 13 tokens per second, which is not bad at all.
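Beyond the chat UI it opens in the browser, the llamafile also serves a small local HTTP API, so you can script against it. Here is a minimal sketch, assuming the server is running on the default http://localhost:8080 and exposes the /completion endpoint from llama.cpp's example server (which llamafile embeds); the prompt and parameters are just illustrative:

```python
import json
import urllib.request

# Assumes a llamafile (e.g. Mistral-7B-Instruct) is already running and serving
# its built-in HTTP server on the default http://localhost:8080.
URL = "http://localhost:8080/completion"

payload = {
    "prompt": "[INST] Write a haiku about running LLMs on a Mac mini. [/INST]",
    "n_predict": 128,   # max tokens to generate
    "temperature": 0.7,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())

print(result["content"])  # the generated text
```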
Based on my understanding, this was made possible by the llama.cpp project, which reimplements LLaMA-family inference in plain C/C++ and supports Apple Silicon. Because of this, some people even argue that a Mac Studio with M2 Ultra becomes a cost-effective option if you want to deploy LLMs locally.
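If you want to drive llama.cpp from code rather than through llamafile, the llama-cpp-python bindings wrap the same engine. A rough sketch, where the GGUF file name is a placeholder and the prompt assumes a Mistral-style instruct template:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Load a local GGUF model; the path below is a placeholder.
# n_gpu_layers=-1 offloads all layers to the GPU (Metal on Apple Silicon).
llm = Llama(
    model_path="./mistral-7b-instruct.Q4_K_M.gguf",
    n_ctx=2048,
    n_gpu_layers=-1,
)

# Mistral-style instruct prompt; other models expect different templates.
out = llm(
    "[INST] Summarize what llama.cpp does in one sentence. [/INST]",
    max_tokens=128,
    temperature=0.7,
)
print(out["choices"][0]["text"])
```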
And just today, someone successfully ran the Mixtral 8x7B Instruct model (roughly a ChatGPT-3.5-level model) at 27 tokens per second on an M3 Max laptop with 64 GB of RAM.
This really changes a lot of things.
Some other options
There are some other solutions for deploying LLMs locally:
https://github.com/oobabooga/text-generation-webui
https://github.com/bentoml/OpenLLM
https://lmstudio.ai/beta-releases.html
https://github.com/ninehills/llm-inference-benchmark (This is a benchmark of different frameworks)
Based on their documentation, running these frameworks locally shouldn’t be too hard, either. I’ve also heard of people using them in production already.