Ollama is one of the simplest ways to run large language models locally. It wraps llama.cpp under the hood and provides a simple CLI and REST API: no manual CUDA setup, no Python environments, no manual model downloads.
| Feature | What It Gives You |
|---|---|
| One-command run | ollama run llama3.2 — downloads, loads, and starts chatting |
| REST API | POST http://localhost:11434/api/chat — integrate with your own tools |
| Modelfiles | Customise prompts, temperature, and system messages without code |
| GPU acceleration | Auto-detects NVIDIA CUDA, AMD ROCm, and Apple Metal |
| Cross-platform | Linux, macOS, and Windows (native, no WSL needed) |
Linux:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

macOS:

```bash
# Download from ollama.com or use Homebrew
brew install ollama
```

Windows: download the installer from ollama.com (native Windows build, no WSL required).

Docker:

```bash
docker run -d \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama
```
Post-install: The Ollama service starts automatically. On Linux it runs as a systemd service; on macOS/Windows it runs as a background app.
```bash
# Download and run a model in one command
ollama run llama3.2

# Download only (no chat)
ollama pull llama3.2

# List downloaded models
ollama list

# Remove a model
ollama rm llama3.2

# Stop a running model
ollama stop llama3.2
```
The first time you run a model, Ollama downloads it. After that it loads from cache. Type /help in the chat interface to see slash commands.
The single most important thing to understand before using Ollama is how much memory a model needs at each quantization level:

| Model Size | Q4_K_M (~25% of FP16) | Q8_0 (~50% of FP16) | FP16 (full) | Minimum VRAM |
|---|---|---|---|---|
| 1–3B | 0.7–2 GB | 1.5–3 GB | 2–6 GB | 2–4 GB — any GPU, even integrated |
| 7–8B | 4–5 GB | 7–9 GB | 14–16 GB | 6–8 GB — GTX 1060, RTX 3050, M1/M2 |
| 13–14B | 7–9 GB | 13–15 GB | 26–28 GB | 10–16 GB — RTX 3060 12GB, RTX 4060 Ti |
| 30–34B | 16–19 GB | 30–34 GB | 60–68 GB | 20–24 GB — RTX 3090, RTX 4090 |
| 70–72B | 35–40 GB | 65–72 GB | 130–144 GB | 40–48 GB — dual RTX 3090, A6000 |
| 120B+ | 60+ GB | 110+ GB | 220+ GB | 80+ GB — enterprise GPUs |
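
The ratios in the table can be turned into a quick back-of-the-envelope calculation. The sketch below is only an estimate under stated assumptions, not how Ollama itself sizes models: it treats FP16 as 2 bytes per parameter and applies the table's rough ~25% and ~50% ratios. The helper names `estimate_weights_gb` and `fits_in_vram` and the 1.5 GB headroom default are hypothetical, and the result covers weights only; the context window needs additional memory on top.

```python
# Rough size ratios relative to FP16, taken from the table above (approximate).
QUANT_RATIO = {"Q4_K_M": 0.25, "Q8_0": 0.50, "FP16": 1.00}

BYTES_PER_PARAM_FP16 = 2  # FP16 stores each weight in 2 bytes


def estimate_weights_gb(params_billion: float, quant: str = "Q4_K_M") -> float:
    """Estimate memory for the model weights alone, in GB (1 GB = 1e9 bytes).

    Hypothetical helper: ignores the KV cache and runtime overhead.
    """
    if quant not in QUANT_RATIO:
        raise ValueError(f"unknown quantization: {quant!r}")
    # Billions of parameters times bytes per parameter gives GB directly.
    fp16_gb = params_billion * BYTES_PER_PARAM_FP16
    return fp16_gb * QUANT_RATIO[quant]


def fits_in_vram(params_billion: float, quant: str, vram_gb: float,
                 headroom_gb: float = 1.5) -> bool:
    """Compare the estimate with available VRAM, leaving an assumed headroom."""
    return estimate_weights_gb(params_billion, quant) + headroom_gb <= vram_gb
```

For example, `estimate_weights_gb(8, "Q4_K_M")` gives 4.0, in line with the 4–5 GB shown for 7–8B models, and `fits_in_vram(70, "Q4_K_M", 24)` is false because roughly 35 GB of weights will not fit on a 24 GB card.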
Rule of thumb:
- To check your available VRAM, run `nvidia-smi`. On macOS: Activity Monitor.

| Command | Description |
|---|---|
| `ollama serve` | Start the Ollama server |
| `ollama create name -f ./Modelfile` | Create a model from a Modelfile |
| `ollama run model` | Run a model (interactive chat) |
| `ollama pull model` | Download a model |
| `ollama push model` | Push a model to a registry |
| `ollama list` | Show all downloaded models |
| `ollama ps` | Show currently loaded/running models |
| `ollama cp source target` | Copy a model |
| `ollama rm model` | Remove a model |
| `ollama show model` | Show model details |
| `ollama stop model` | Stop a running model |
| Variable | Default | Purpose |
|---|---|---|
| `OLLAMA_HOST` | `127.0.0.1:11434` | Bind address. Set to `0.0.0.0` for network access. |
| `OLLAMA_MODELS` | `~/.ollama/models` | Where models are stored on disk. |
| `OLLAMA_NUM_PARALLEL` | 1 | Number of parallel request slots. |
| `OLLAMA_MAX_LOADED_MODELS` | 1 | Max models kept in memory at once. |
| `OLLAMA_KEEP_ALIVE` | `5m` | How long a model stays loaded after last request. |
| `OLLAMA_DEBUG` | (off) | Enable debug logging. |
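
If you change `OLLAMA_HOST`, your own scripts need to point at the same address. The following is a minimal sketch of one way a client could derive its base URL from that variable. The helper `resolve_base_url` is hypothetical, it handles only `host`, `host:port`, and `scheme://host:port` values (no IPv6), and falling back to port 11434 in every case is an assumption of this sketch rather than documented Ollama behaviour.

```python
import os

DEFAULT_HOST = "127.0.0.1"
DEFAULT_PORT = 11434


def resolve_base_url(env=None) -> str:
    """Build a client base URL from OLLAMA_HOST (hypothetical helper)."""
    env = os.environ if env is None else env
    raw = env.get("OLLAMA_HOST", "").strip()
    if not raw:
        return f"http://{DEFAULT_HOST}:{DEFAULT_PORT}"

    scheme = "http"
    if "://" in raw:
        scheme, raw = raw.split("://", 1)
    raw = raw.rstrip("/")

    host, sep, port = raw.partition(":")
    host = host or DEFAULT_HOST
    if host == "0.0.0.0":
        # 0.0.0.0 is a bind address; a local client should connect via loopback.
        host = DEFAULT_HOST
    if not (sep and port):
        port = str(DEFAULT_PORT)
    return f"{scheme}://{host}:{port}"
```

Passing a plain dict instead of the real environment, as in `resolve_base_url({"OLLAMA_HOST": "0.0.0.0"})`, makes the helper easy to try out without touching your shell configuration.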
A Modelfile lets you customise how a model behaves. Create a file called Modelfile (no extension):
```
FROM llama3.2

# System prompt sets the model's persona
SYSTEM "You are a senior Python developer. Be concise and use type hints."

# Override default generation parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 4096
PARAMETER repeat_penalty 1.1

# Load custom adapters (LoRA); optional, remove this line if you have no adapter file
ADAPTER ./my-lora.safetensors

# Set the licence (for custom models)
LICENSE MIT
```
Then build it:
```bash
ollama create my-custom-model -f ./Modelfile
ollama run my-custom-model
```
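
If you maintain several variants of the same base model, it can help to generate Modelfiles from a script instead of editing them by hand. The sketch below is illustrative only: `render_modelfile` is a hypothetical helper, it covers just the `FROM`, `SYSTEM`, and `PARAMETER` instructions shown above, and it assumes the system prompt does not itself contain triple quotes.

```python
from typing import Mapping, Optional, Union

ParamValue = Union[int, float, str]


def render_modelfile(base: str,
                     system: Optional[str] = None,
                     params: Optional[Mapping[str, ParamValue]] = None) -> str:
    """Render a minimal Modelfile as text (hypothetical helper)."""
    lines = [f"FROM {base}"]
    if system:
        # Triple quotes let the system prompt span multiple lines.
        lines.append(f'SYSTEM """{system}"""')
    for name, value in (params or {}).items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"
```

Write the returned text to a file named `Modelfile` and build it with `ollama create` as above.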
| Parameter | Default | What It Does |
|---|---|---|
| `temperature` | 0.8 | Lower = more deterministic, higher = more creative. |
| `top_p` | 0.9 | Nucleus sampling. Lower = more focused. |
| `top_k` | 40 | Limit next-token choices to top K options. |
| `num_ctx` | 2048 | Context window size in tokens. Higher = more VRAM. |
| `repeat_penalty` | 1.1 | Penalise repetition. |
| `num_predict` | 128 | Max tokens to generate per response. |
| `stop` | (none) | Stop sequences as a JSON array. |
| `seed` | 0 | Random seed for reproducible outputs. |
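
These parameters do not have to live in a Modelfile: the REST API also accepts them per request in an `options` object. The sketch below builds such a request body for `/api/chat`. The helper `build_chat_payload` is hypothetical, and restricting the accepted names to the ones in the table is a choice made for this example, not a limit imposed by Ollama.

```python
# Option names from the table above; this allow-list is specific to this sketch.
ALLOWED_OPTIONS = {
    "temperature", "top_p", "top_k", "num_ctx",
    "repeat_penalty", "num_predict", "stop", "seed",
}


def build_chat_payload(model: str, user_prompt: str, **options) -> dict:
    """Build a non-streaming /api/chat body with per-request options."""
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"unsupported options: {sorted(unknown)}")
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "stream": False,
    }
    if options:
        # Per-request overrides take effect for this call only.
        payload["options"] = dict(options)
    return payload
```

A call such as `build_chat_payload("llama3.2", "Hi", temperature=0.3, num_ctx=4096)` returns a dict you can send as the JSON body of a POST to `/api/chat`.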
Ollama exposes a full REST API on http://localhost:11434. No auth by default (keep it local or use a reverse proxy).
```bash
# Generate a completion from a single prompt
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

# Chat with a message history
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "stream": false
}'

# List downloaded models
curl http://localhost:11434/api/tags

# Pull a model
curl http://localhost:11434/api/pull -d '{
  "model": "llama3.2"
}'
```
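
The `/api/tags` response is JSON, so it is easy to post-process. As a rough sketch, the code below fetches the list and summarises each model's name and size. It uses only the standard library; the helper names `fetch_tags` and `summarize_models` are hypothetical, and the code relies only on the `models`, `name`, and `size` fields of the response.

```python
import json
import urllib.request


def fetch_tags(base_url: str = "http://localhost:11434", timeout: float = 10.0) -> dict:
    """GET /api/tags from a running Ollama server and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def summarize_models(tags: dict) -> list:
    """Return (name, size in GB) pairs, largest model first."""
    rows = []
    for model in tags.get("models", []):
        size_gb = round(model.get("size", 0) / 1e9, 2)  # size is reported in bytes
        rows.append((model["name"], size_gb))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

With the server running, `summarize_models(fetch_tags())` gives a quick view of what is taking up disk space.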
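```python
import requests

# Send one non-streaming chat request to the local Ollama server
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "Say hello in 5 languages."}],
        "stream": False,
    },
    timeout=120,  # large models can take a while to load
)
response.raise_for_status()  # fail loudly on HTTP errors
print(response.json()["message"]["content"])
```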
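
With `"stream": true`, `/api/chat` returns newline-delimited JSON instead of one object: each line carries a fragment of the reply in `message.content`, and the last line has `"done": true`. The sketch below shows one way to consume that stream. It uses `urllib` from the standard library rather than `requests` to stay dependency-free, and the function names `iter_chat_chunks` and `stream_chat` are hypothetical.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_chat_chunks(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text fragment from each NDJSON line of a streamed chat reply."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        event = json.loads(raw)
        if "error" in event:
            raise RuntimeError(event["error"])
        yield event.get("message", {}).get("content", "")
        if event.get("done"):
            break


def stream_chat(prompt: str, model: str = "llama3.2",
                base_url: str = "http://localhost:11434") -> Iterator[str]:
    """POST a streaming chat request and yield reply fragments as they arrive."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        # The HTTP response object iterates line by line.
        yield from iter_chat_chunks(resp)
```

To print tokens as they arrive, loop over `stream_chat("Say hello in 5 languages.")` and print each fragment with `end=""` and `flush=True`.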
| Model | Sizes | Best For | Notes |
|---|---|---|---|
| Llama 3.2 | 1B, 3B | Fast chat, mobile/edge | Tiny but capable. 3B runs on a phone. |
| Llama 3.1 | 8B, 70B, 405B | General purpose, coding | Best all-rounder. 8B fits most GPUs. |
| Mistral | 7B, 12B | Fast inference | Efficient. Runs well on 8GB VRAM. |
| Mixtral | 8x7B (MoE) | Heavy reasoning | Mixture-of-Experts — ~13B used per token. |
| Qwen 2.5 | 0.5B–72B | Multilingual, coding | Strong Chinese + English. 32B excellent for coding. |
| Phi-4 | 14B | Small device, reasoning | Microsoft. Punches above its weight. |
| DeepSeek R1 | 7B–70B (distilled) | Math, reasoning | "Thinking" model. Great at step-by-step logic. |
| Gemma 2 | 2B, 9B, 27B | General, research | Google. 9B outperforms many 13B models. |
| Command R / R+ | 35B, 104B | RAG, enterprise | Cohere. Command R is 35B, Command R+ is 104B. Optimised for retrieval. |
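```bash
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```

Then open http://localhost:3000 and add Ollama as the backend. From inside the container, Ollama running on the host is reachable at http://host.docker.internal:11434 rather than localhost.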
```python
# Requires: pip install langchain-ollama
from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="llama3.2", temperature=0.3)
result = llm.invoke("Write a haiku about servers")
print(result)
```
| Problem | Likely Cause | Fix |
|---|---|---|
| Model requires more VRAM than available | Model too big for your GPU | Use a smaller model or Q3_K_S quantization. |
| Very slow generation (1–2 tok/s) | Model offloaded to CPU/RAM | Check ollama ps. Use a smaller model. |
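| `ollama: command not found` | Ollama is not installed or not on your `PATH` | Re-run the installer and open a new shell. If the binary is present but the service is down on Linux: `sudo systemctl start ollama`. |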
| CUDA errors on startup | Missing/incompatible NVIDIA drivers | Update drivers (nvidia-driver-550). |
| Model rambles / poor responses | Bad parameters or wrong model | Lower temperature to 0.3–0.5. Use instruct-tuned variant. |
| `Error: pull access denied` | Model name is wrong | Verify name on ollama.com/library. |
| Out of disk space | Too many models cached | ollama list → ollama rm unused ones. |
| API requests hang / timeout | Model still loading | Large models take 10–60s to load. Use stream: true. |
| Port 11434 already in use | Another Ollama instance running | killall ollama or set OLLAMA_HOST to a different port. |
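
Several of the rows above come down to whether the server is reachable at all. As a minimal sketch, the check below polls `/api/tags` until the server answers. The function names `is_server_up` and `wait_for_server`, the retry count, and the delay are all hypothetical choices for this example.

```python
import http.client
import time
import urllib.request


def is_server_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if the server answers GET /api/tags with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers connection refused, timeouts, and HTTP-level errors.
        return False


def wait_for_server(base_url: str = "http://localhost:11434",
                    attempts: int = 10, delay: float = 1.0,
                    timeout: float = 2.0) -> bool:
    """Poll until the server is reachable or the attempts run out."""
    for _ in range(attempts):
        if is_server_up(base_url, timeout=timeout):
            return True
        time.sleep(delay)
    return False
```

A reachable server only tells you the API is up. A large model may still need time to load on the first request, so keep generous timeouts on the chat calls themselves.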
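- Adjust `num_ctx` to balance context vs VRAM. 4096 tokens is the sweet spot.
- Set `OLLAMA_KEEP_ALIVE=0` when experimenting with different models.
- To update on Linux, re-run the install script: `curl -fsSL https://ollama.com/install.sh | sh`
- Run an embedding model (e.g. `nomic-embed-text`) alongside your main LLM.
- Check which models are loaded with `ollama ps`.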