Tags: ollama, llm, local-ai, inference, quantization, gguf, llama.cpp, beginner
Last updated: 2026-07-02

Ollama Cheatsheet

What Is Ollama?

Ollama is the easiest way to run large language models locally. It wraps llama.cpp under the hood and provides a simple CLI and REST API — no CUDA setup, no Python environments, no manual model downloads.

Feature	What It Gives You
One-command run	`ollama run llama3.2` — downloads, loads, and starts chatting
REST API	`POST http://localhost:11434/api/chat` — integrate with your own tools
Modelfiles	Customise prompts, temperature, and system messages without code
GPU acceleration	Auto-detects NVIDIA CUDA, AMD ROCm, and Apple Metal
Cross-platform	Linux, macOS, and Windows (native, no WSL needed)

Installation

Linux

curl -fsSL https://ollama.com/install.sh | sh

macOS

# Download from ollama.com or use Homebrew
brew install ollama

Windows

Download the installer from ollama.com — native Windows build, no WSL required.

Docker

docker run -d \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

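Post-install: With the official installers, the Ollama service starts automatically. On Linux it runs as a systemd service; on macOS/Windows it runs as a background app. If you installed with Homebrew, start it yourself with `brew services start ollama` or `ollama serve`. With Docker, the server runs inside the container.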

Quick Start

# Download and run a model in one command
ollama run llama3.2

# Download only (no chat)
ollama pull llama3.2

# List downloaded models
ollama list

# Remove a model
ollama rm llama3.2

# Stop a running model
ollama stop llama3.2

The first time you run a model, Ollama downloads it. After that it loads from cache. Type /help in the chat interface to see slash commands.

🚨 Pro Tip: Model Sizes & VRAM Requirements

This is the single most important thing to understand before using Ollama:

Model Size	Q4_K_M (~25% of FP16)	Q8_0 (~50% of FP16)	FP16 (full)	Minimum VRAM (Q4_K_M)
1–3B	0.7–2 GB	1.5–3 GB	2–6 GB	2–4 GB — any GPU, even integrated
7–8B	4–5 GB	7–9 GB	14–16 GB	6–8 GB — GTX 1060, RTX 3050, M1/M2
13–14B	7–9 GB	13–15 GB	26–28 GB	10–16 GB — RTX 3060 12GB, RTX 4060 Ti
30–34B	16–19 GB	30–34 GB	60–68 GB	20–24 GB — RTX 3090, RTX 4090
70–72B	35–40 GB	65–72 GB	130–144 GB	40–48 GB — dual RTX 3090, A6000
120B+	60+ GB	110+ GB	220+ GB	80+ GB — enterprise GPUs

Rule of thumb:

A Q4_K_M quantized model needs roughly 0.5 GB of VRAM per billion parameters, plus headroom for the context (KV cache).
A 70B model at Q4_K_M (~35–40 GB) will NOT fit on a single RTX 3090 (24 GB). You need dual GPUs or a 48 GB A6000.
Ollama can offload layers to system RAM and the CPU when VRAM runs out, but inference can become 10–50x slower. Avoid this if possible.
Check your VRAM before pulling a large model. On Linux: `nvidia-smi`. On macOS: Activity Monitor (Apple Silicon shares one memory pool between CPU and GPU).
When in doubt, start small. A 7B or 8B model (like Llama 3.1 8B or Mistral 7B) at Q4_K_M runs comfortably on 8GB VRAM and is remarkably capable.
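
The rule of thumb above can be turned into a quick back-of-the-envelope check. This is a minimal sketch, not a measurement: the per-parameter sizes come from the table and rule of thumb above, while the 20% `overhead` allowance for the KV cache and runtime buffers is an assumption chosen for illustration, and `estimate_vram_gb` and `fits` are hypothetical helper names.

```python
# Approximate gigabytes per billion parameters, taken from the rule of thumb
# above (Q4_K_M ~0.5) and the table's ~25% / ~50% / full-size columns.
GB_PER_BILLION = {"Q4_K_M": 0.5, "Q8_0": 1.0, "FP16": 2.0}


def estimate_vram_gb(params_billion, quant="Q4_K_M", overhead=0.2):
    """Rough VRAM estimate: weight size plus a fractional overhead.

    `overhead` is an assumed allowance for the KV cache and runtime
    buffers; it is not a figure published by Ollama.
    """
    weights_gb = params_billion * GB_PER_BILLION[quant]
    return round(weights_gb * (1 + overhead), 1)


def fits(params_billion, vram_gb, quant="Q4_K_M"):
    """True if the estimate is within the given VRAM budget."""
    return estimate_vram_gb(params_billion, quant) <= vram_gb


for size in (8, 14, 70):
    print(f"{size}B at Q4_K_M: ~{estimate_vram_gb(size)} GB")
```

Treat the result as a lower bound: a larger `num_ctx` increases the real footprint.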

The Ollama CLI

Command	Description
`ollama serve`	Start the Ollama server
`ollama create name -f ./Modelfile`	Create a model from a Modelfile
`ollama run model`	Run a model (interactive chat)
`ollama pull model`	Download a model
`ollama push model`	Push a model to a registry
`ollama list`	Show all downloaded models
`ollama ps`	Show currently loaded/running models
`ollama cp source target`	Copy a model
`ollama rm model`	Remove a model
`ollama show model`	Show model details
`ollama stop model`	Stop a running model

Environment Variables

Variable	Default	Purpose
`OLLAMA_HOST`	`127.0.0.1:11434`	Bind address. Set to `0.0.0.0` for network access.
`OLLAMA_MODELS`	`~/.ollama/models`	Where models are stored on disk.
`OLLAMA_NUM_PARALLEL`	`1`	Number of parallel request slots.
`OLLAMA_MAX_LOADED_MODELS`	`3` per GPU (`3` on CPU)	Max models kept in memory at once, provided they fit.
`OLLAMA_KEEP_ALIVE`	`5m`	How long a model stays loaded after last request.
`OLLAMA_DEBUG`	(off)	Enable debug logging.
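
Client scripts can honour the same `OLLAMA_HOST` variable so they follow the server when it moves. A minimal sketch, assuming IPv4 addresses or hostnames only; `ollama_base_url` is a hypothetical helper, and its normalisation rules are a simplification rather than the exact logic Ollama uses.

```python
import os

DEFAULT_HOST = "127.0.0.1:11434"  # default bind address from the table above


def ollama_base_url(env=None):
    """Build a client base URL from OLLAMA_HOST (simplified sketch)."""
    env = os.environ if env is None else env
    host = env.get("OLLAMA_HOST", "").strip() or DEFAULT_HOST
    if "://" not in host:
        host = "http://" + host
    scheme, rest = host.split("://", 1)
    rest = rest.rstrip("/")
    if ":" not in rest:
        rest += ":11434"  # assume the default port when none is given
    if rest.startswith("0.0.0.0:"):
        # 0.0.0.0 is a bind address; connect through loopback instead.
        rest = "127.0.0.1:" + rest.split(":", 1)[1]
    return f"{scheme}://{rest}"


print(ollama_base_url())
```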

Modelfile Reference

A Modelfile lets you customise how a model behaves. Create a file called Modelfile (no extension):

FROM llama3.2

# System prompt sets the model's persona
SYSTEM "You are a senior Python developer. Be concise and use type hints."

# Override default generation parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 4096
PARAMETER repeat_penalty 1.1

# Load custom adapters (LoRA)
ADAPTER ./my-lora.safetensors

# Set the licence (for custom models)
LICENSE MIT

Then build it:

ollama create my-custom-model -f ./Modelfile
ollama run my-custom-model
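
If you maintain several variants of the same base model, the Modelfile text can be generated from a script instead of edited by hand. A minimal sketch: `render_modelfile` is a hypothetical helper that writes values verbatim (no quoting or validation), and it covers only the `FROM`, `SYSTEM`, and `PARAMETER` instructions shown above.

```python
def render_modelfile(base, system=None, parameters=None):
    """Return Modelfile text for a base model, system prompt, and parameters."""
    lines = [f"FROM {base}"]
    if system:
        # Triple quotes let the system prompt span multiple lines.
        lines.append(f'SYSTEM """{system}"""')
    for name, value in (parameters or {}).items():
        # A list becomes one PARAMETER line per item (useful for `stop`).
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            lines.append(f"PARAMETER {name} {item}")
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.2",
    system="You are a senior Python developer. Be concise and use type hints.",
    parameters={"temperature": 0.3, "num_ctx": 4096},
)
print(text)
```

Write the result to a file named Modelfile and build it with `ollama create` as shown above.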

Key Parameters

Parameter	Default	What It Does
`temperature`	`0.8`	Lower = more deterministic, higher = more creative.
`top_p`	`0.9`	Nucleus sampling. Lower = more focused.
`top_k`	`40`	Limit next-token choices to top K options.
`num_ctx`	`4096` (`2048` in older releases)	Context window size in tokens. Higher = more VRAM.
`repeat_penalty`	`1.1`	Penalise repetition.
`num_predict`	`-1`	Max tokens to generate per response. `-1` means no limit.
`stop`	(none)	Stop sequences. In the API, a JSON array; in a Modelfile, one `PARAMETER stop` line per sequence.
`seed`	`0`	Random seed for reproducible outputs.
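
These parameters do not have to live in a Modelfile: the REST API accepts them per request under an `options` object. A minimal sketch of building such a request body; `chat_payload` is a hypothetical helper, and the prompt and option values are examples only.

```python
import json


def chat_payload(model, prompt, **options):
    """Build a non-streaming /api/chat body with optional per-request options."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    if options:
        # Keys here are the same names as in the parameter table above.
        payload["options"] = options
    return payload


payload = chat_payload(
    "llama3.2",
    "Summarise the RAID levels.",
    temperature=0.3,
    num_ctx=4096,
    stop=["\n\n"],
)
print(json.dumps(payload, indent=2))
```

Options passed this way apply to that request only, which makes them handy for experimenting before locking values into a Modelfile.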

REST API

Ollama exposes a full REST API on http://localhost:11434. No auth by default (keep it local or use a reverse proxy).

Generate (one-shot)

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Chat (conversation)

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "stream": false
}'
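
The chat endpoint keeps no conversation state between calls, so a multi-turn chat means resending the full message list each time. A minimal sketch using only the standard library; `ChatSession` and `post_chat` are hypothetical names, and the `send` argument exists so the transport can be swapped out (for example, in tests).

```python
import json
import urllib.request


def post_chat(payload, base_url="http://localhost:11434"):
    """POST a payload to /api/chat and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)


class ChatSession:
    """Keeps the running message history and resends it on every turn."""

    def __init__(self, model, system=None, send=post_chat):
        self.model = model
        self.send = send
        self.messages = []
        if system:
            self.messages.append({"role": "system", "content": system})

    def ask(self, text):
        self.messages.append({"role": "user", "content": text})
        reply = self.send(
            {"model": self.model, "messages": self.messages, "stream": False}
        )
        message = reply["message"]
        # Store the assistant turn so the next request includes it.
        self.messages.append(
            {"role": message["role"], "content": message["content"]}
        )
        return message["content"]
```

With a server running, `ChatSession("llama3.2").ask("What is the capital of France?")` returns the reply text, and a follow-up `ask` call sees the earlier turns. The history grows with every turn, so long sessions eventually need trimming to stay within `num_ctx`.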

List models

curl http://localhost:11434/api/tags
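
The response is a JSON object with a `models` array; each entry includes a `name` and a `size` in bytes. A minimal sketch that turns it into a size-sorted summary; `summarise_models` is a hypothetical helper, and the sample data below is made up for illustration.

```python
def summarise_models(tags_response):
    """Return (name, size in GB) pairs, largest first."""
    rows = [
        (model["name"], round(model["size"] / 1e9, 1))
        for model in tags_response.get("models", [])
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Illustrative data only; real responses carry more fields per model.
sample = {
    "models": [
        {"name": "llama3.2:latest", "size": 2_000_000_000},
        {"name": "mistral:latest", "size": 4_100_000_000},
    ]
}
for name, size_gb in summarise_models(sample):
    print(f"{name:20} {size_gb} GB")
```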

Pull a model

curl http://localhost:11434/api/pull -d '{
  "model": "llama3.2"
}'
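
By default the pull endpoint streams progress as one JSON object per line; download events carry `total` and `completed` byte counts alongside a `status` string. A minimal sketch of turning that stream into percentages; `pull_progress` is a hypothetical helper, and the sample lines are illustrative rather than captured output.

```python
import json


def pull_progress(lines):
    """Yield (status, percent or None) for each JSON line of a pull stream."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        event = json.loads(raw)
        total = event.get("total")
        completed = event.get("completed")
        if total and completed is not None:
            percent = round(100 * completed / total, 1)
        else:
            percent = None  # status-only events have no byte counts
        yield event.get("status", ""), percent


# Illustrative lines; a real stream comes from the HTTP response body.
sample_lines = [
    '{"status": "pulling manifest"}',
    '{"status": "pulling layer", "total": 200, "completed": 50}',
    '{"status": "success"}',
]
for status, percent in pull_progress(sample_lines):
    print(status, "" if percent is None else f"{percent}%")
```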

Python example

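import requests

# Non-streaming chat request: the whole reply arrives as one JSON object.
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "Say hello in 5 languages."}],
        "stream": False,
    },
    timeout=120,  # the first request may wait for the model to load
)
response.raise_for_status()  # surface HTTP errors such as an unknown model
print(response.json()["message"]["content"])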
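
With `"stream": true` the chat endpoint returns one JSON object per line, each carrying a fragment of the reply in `message.content`, and the last one has `"done": true`. A minimal sketch of reassembling the text; `iter_stream_text` is a hypothetical helper that accepts any iterable of lines, so it can be exercised without a server.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from a streamed /api/chat response."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        if "error" in chunk:
            raise RuntimeError(chunk["error"])
        text = chunk.get("message", {}).get("content", "")
        if text:
            yield text
        if chunk.get("done"):
            break


# Illustrative chunks; a real stream comes from the HTTP response body.
sample_lines = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print("".join(iter_stream_text(sample_lines)))
```

With `requests`, pass `stream=True` to `requests.post` (and `"stream": True` in the JSON body), then feed `response.iter_lines()` to this function and print each fragment as it arrives.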

Popular Models

Model	Sizes	Best For	Notes
Llama 3.2	1B, 3B	Fast chat, mobile/edge	Tiny but capable. 3B runs on a phone.
Llama 3.1	8B, 70B, 405B	General purpose, coding	Best all-rounder. 8B fits most GPUs.
Mistral	7B (NeMo: 12B)	Fast inference	Efficient. 7B runs well on 8GB VRAM. The 12B is the separate Mistral NeMo model.
Mixtral	8x7B (MoE)	Heavy reasoning	Mixture-of-Experts — ~13B used per token.
Qwen 2.5	0.5B–72B	Multilingual, coding	Strong Chinese + English. 32B excellent for coding.
Phi-4	14B	Small device, reasoning	Microsoft. Punches above its weight.
DeepSeek R1	1.5B–70B (distilled)	Math, reasoning	"Thinking" model. Great at step-by-step logic.
Gemma 2	2B, 9B, 27B	General, research	Google. 9B outperforms many 13B models.
Command R / R+	35B / 104B	RAG, enterprise	Cohere. Optimised for retrieval. Command R is 35B; Command R+ is 104B.

Using Ollama with Tools

Open WebUI (ChatGPT-like frontend)

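docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000 and add Ollama as the backend. The `--add-host` flag lets the container reach an Ollama server on the host at http://host.docker.internal:11434 (needed on Linux).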

Continue.dev (VS Code / Cursor)

Install the Continue extension in VS Code.
Set the model provider to Ollama.
Select a downloaded model.
Get autocomplete, chat, and inline editing powered by your local model.

AnythingLLM (RAG with your documents)

Download AnythingLLM from anythingllm.com.
Select Ollama as the LLM provider.
Point it at your documents for local RAG.

LangChain / LlamaIndex

# Requires the langchain-ollama package: pip install langchain-ollama
from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="llama3.2", temperature=0.3)
result = llm.invoke("Write a haiku about servers")
print(result)

Troubleshooting

Problem	Likely Cause	Fix
Model requires more VRAM than available	Model too big for your GPU	Use a smaller model or Q3_K_S quantization.
Very slow generation (1–2 tok/s)	Model offloaded to CPU/RAM	Check `ollama ps`. Use a smaller model.
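`ollama: command not found`	Ollama not installed or not on your PATH	Reinstall, then open a new shell and check `which ollama`.
CLI or API cannot connect to the server	Service not running	On Linux: `sudo systemctl start ollama`. Elsewhere: launch the app or run `ollama serve`.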
CUDA errors on startup	Missing/incompatible NVIDIA drivers	Update drivers (for example `nvidia-driver-550` on Ubuntu).
Model rambles / poor responses	Bad parameters or wrong model	Lower temperature to 0.3–0.5. Use instruct-tuned variant.
`Error: pull model manifest: file does not exist`	Model name or tag is wrong	Verify name on ollama.com/library.
Out of disk space	Too many models cached	`ollama list` → `ollama rm` unused ones.
API requests hang / timeout	Model still loading	Large models take 10–60s to load. Use `stream: true`.
Port 11434 already in use	Another Ollama instance running	Stop it (`sudo systemctl stop ollama` on Linux, or `killall ollama`) or set `OLLAMA_HOST` to a different port.
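
Several of the rows above come down to whether the server is reachable at all. A running Ollama server answers a plain GET on its root URL, so a script can check that before sending real requests. A minimal sketch; `ollama_is_up` is a hypothetical helper and the 2-second timeout is an arbitrary choice.

```python
import urllib.error
import urllib.request


def ollama_is_up(base_url="http://localhost:11434", timeout=2.0):
    """Return True if something answers HTTP 200 at the Ollama root URL."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

If `ollama_is_up()` returns False, start the service first; if it returns True but requests still hang, the model is probably still loading.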

Tips & Advice

Start with a 7–8B model (Llama 3.1 8B or Mistral 7B) at Q4_K_M. Runs on any GPU with 6–8GB VRAM.
Use num_ctx to balance context vs VRAM. 4096 tokens is a good starting point; raise it only when your prompts need more.
Set OLLAMA_KEEP_ALIVE=0 when experimenting with different models, so each one unloads as soon as it finishes responding and frees its VRAM.
Don't expose Ollama to the internet without a reverse proxy with auth.
Keep Ollama updated. On Linux, re-run the install script: curl -fsSL https://ollama.com/install.sh | sh
Use Modelfiles to lock in your preferred parameters.
Ollama can run multiple models concurrently if your VRAM allows.
For RAG workflows, use a smaller embedder model (like nomic-embed-text) alongside your main LLM.
Monitor your VRAM with ollama ps.
If you have an AMD GPU, ensure ROCm is installed. Windows AMD support has limitations.
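
For the RAG tip above, the retrieval step boils down to embedding text and comparing vectors. A minimal sketch: `embed_request` builds a body for the `/api/embed` endpoint (which returns an `embeddings` list, one vector per input), and `cosine` and `top_match` are hypothetical helpers that rank documents against a query. The short vectors in the demo are made up; real embeddings have hundreds of dimensions.

```python
import math


def embed_request(texts, model="nomic-embed-text"):
    """Request body for POST /api/embed with one or more input texts."""
    return {"model": model, "input": list(texts)}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_match(query_vec, doc_vecs):
    """Return (index, score) of the document vector closest to the query."""
    scores = [cosine(query_vec, vec) for vec in doc_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy vectors standing in for real embeddings.
index, score = top_match([1.0, 0.0], [[0.0, 1.0], [0.9, 0.1]])
print(index, round(score, 3))
```

In a real pipeline, the matching document's text is then pasted into the prompt sent to your main chat model.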