The local accessibility problem
Open source language models are freely available, but running them on local hardware remains complex: downloading weights in the right format, configuring CUDA or Metal dependencies, managing quantization, and exposing an API to applications. Each step demands specific expertise, and the configuration differs by model, operating system and available hardware. Ollama was created to eliminate this complexity, offering a user experience comparable to a package manager's.
One command to run a model
Installing Ollama amounts to placing a single binary, and running a model is reduced to one command: ollama run llama2 downloads the model weights, quantizes them if necessary and starts an interactive session. Supported models include Llama 2, CodeLlama and dozens of other open source models from the community. The model registry works like a container image registry: each model is identified by a name and a tag that specifies the variant (size, quantization).
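A short session illustrates the workflow; the tag shown (llama2:7b) is one example of how a registry tag can pin a specific variant:

```shell
# Pull a specific variant by name:tag, then start an interactive session.
ollama pull llama2:7b    # tag selects the size/quantization variant
ollama run llama2        # downloads the default tag if not already cached
ollama list              # show models available locally
```

As with container images, omitting the tag pulls the registry's default variant.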
Modelfile and customisation
The Modelfile is Ollama's configuration mechanism, inspired by the Dockerfile. It defines the base model, generation parameters (temperature, top_p, context window), the system prompt and LoRA adapters for customisation. A typical Modelfile specifies the base model, sets inference parameters and defines a default behaviour — all in a declarative, versionable format.
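A minimal sketch of such a Modelfile might look as follows; the parameter values and system prompt are illustrative, not recommendations:

```
FROM llama2

# Inference parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096

# Default behaviour for every session
SYSTEM You are a concise assistant that answers in plain language.
```

Building it with ollama create produces a named local model that can be run, shared or versioned alongside application code.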
Quantization and REST API
Ollama uses the GGUF (GPT-Generated Unified Format) from llama.cpp for model quantization. The 4-bit and 5-bit variants reduce memory requirements from tens of gigabytes to sizes manageable on consumer hardware: a 7-billion-parameter model quantized to 4 bits requires roughly 4 GB of RAM, runnable on a laptop with an integrated GPU.
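The memory figure above follows from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. The sketch below makes that calculation explicit; the overhead factor is an assumption to account for activations and runtime buffers, not a measured value.

```python
# Back-of-the-envelope RAM estimate for a quantized model.
# Assumption: weights dominate memory; the overhead factor is illustrative.
def estimated_ram_gb(params_billions: float, bits_per_weight: int,
                     overhead: float = 1.2) -> float:
    """Approximate RAM needed for quantized weights, in GB."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# A 7B model at 4 bits: 3.5 GB of weights, ~4.2 GB with overhead.
print(round(estimated_ram_gb(7, 4), 1))
```

The same formula shows why a 5-bit variant trades a modest memory increase for better output quality: the weight footprint grows linearly with bits per weight.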
The REST API exposed by Ollama allows applications to interact with local models via standard HTTP endpoints. Generation happens entirely locally: data never leaves the machine, a fundamental requirement for enterprise contexts with confidentiality constraints or regulatory compliance.
Link: ollama.com
