Tags: ollama, llm, local-ai, inference, quantization, gguf, llama.cpp, beginner
Last updated: 2026-07-02

Ollama Cheatsheet

What Is Ollama?

Ollama is the easiest way to run large language models locally. It wraps llama.cpp under the hood and provides a simple CLI and REST API — no CUDA setup, no Python environments, no manual model downloads.

| Feature | What It Gives You |
|---|---|
| One-command run | ollama run llama3.2: downloads, loads, and starts chatting |
| REST API | POST http://localhost:11434/api/chat: integrate with your own tools |
| Modelfiles | Customise prompts, temperature, and system messages without code |
| GPU acceleration | Auto-detects NVIDIA CUDA, AMD ROCm, and Apple Metal |
| Cross-platform | Linux, macOS, and Windows (native, no WSL needed) |

Installation

Linux

curl -fsSL https://ollama.com/install.sh | sh

macOS

# Download from ollama.com or use Homebrew
brew install ollama

Windows

Download the installer from ollama.com — native Windows build, no WSL required.

Docker

docker run -d \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

Post-install: The Ollama service starts automatically. On Linux it runs as a systemd service; on macOS/Windows it runs as a background app.
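
One way to confirm the service is actually up is to query the server's version endpoint. The sketch below uses only the Python standard library; server_version is a hypothetical helper name, and the default address assumes Ollama is listening on its usual port.

```python
import json
import urllib.request


def server_version(base_url="http://localhost:11434", timeout=3.0):
    """Return the Ollama server's version string, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            data = json.load(resp)
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts (URLError subclasses it);
        # ValueError covers a response body that is not valid JSON.
        return None
    return data.get("version") if isinstance(data, dict) else None
```

Calling server_version() returns a version string when the server answers and None otherwise, which is enough for a start-up check in a script.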

Quick Start

# Download and run a model in one command
ollama run llama3.2

# Download only (no chat)
ollama pull llama3.2

# List downloaded models
ollama list

# Remove a model
ollama rm llama3.2

# Stop a running model
ollama stop llama3.2

The first time you run a model, Ollama downloads it. After that it loads from cache. Type /help in the chat interface to see slash commands.

🚨 Pro Tip: Model Sizes & VRAM Requirements

This is the single most important thing to understand before using Ollama:

| Model Size | Q4_K_M (~25% of FP16) | Q8_0 (~50% of FP16) | FP16 (full) | Minimum VRAM |
|---|---|---|---|---|
| 1–3B | 0.7–2 GB | 1.5–3 GB | 2–6 GB | 2–4 GB (any GPU, even integrated) |
| 7–8B | 4–5 GB | 7–9 GB | 14–16 GB | 6–8 GB (GTX 1060, RTX 3050, M1/M2) |
| 13–14B | 7–9 GB | 13–15 GB | 26–28 GB | 10–16 GB (RTX 3060 12GB, RTX 4060 Ti) |
| 30–34B | 16–19 GB | 30–34 GB | 60–68 GB | 20–24 GB (RTX 3090, RTX 4090) |
| 70–72B | 35–40 GB | 65–72 GB | 130–144 GB | 40–48 GB (dual RTX 3090, A6000) |
| 120B+ | 60+ GB | 110+ GB | 220+ GB | 80+ GB (enterprise GPUs) |

Rule of thumb:
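
As a rough estimate consistent with the column headers above, FP16 weights take about 2 GB per billion parameters, Q8_0 about half of that, and Q4_K_M about a quarter. The sketch below turns that into a quick check. The function names and the 1.5 GB headroom for context and runtime buffers are assumptions for illustration, not values taken from Ollama.

```python
# Approximate size of each quantization relative to FP16, from the table header.
QUANT_FRACTION = {"fp16": 1.0, "q8_0": 0.5, "q4_k_m": 0.25}


def estimate_weights_gb(params_billion, quant="q4_k_m"):
    """Approximate size of the model weights alone, in GB."""
    fp16_gb = params_billion * 2.0  # FP16 stores 2 bytes per parameter
    return fp16_gb * QUANT_FRACTION[quant.lower()]


def fits_in_vram(params_billion, quant, vram_gb, headroom_gb=1.5):
    """Rough check: weights plus an assumed headroom must fit in VRAM."""
    return estimate_weights_gb(params_billion, quant) + headroom_gb <= vram_gb
```

For example, fits_in_vram(8, "q4_k_m", 8) is true under these assumptions, while a 70B model at Q4_K_M does not fit on a single 24 GB card. Longer context windows need more headroom than the default used here.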

The Ollama CLI

| Command | Description |
|---|---|
| ollama serve | Start the Ollama server |
| ollama create name -f ./Modelfile | Create a model from a Modelfile |
| ollama run model | Run a model (interactive chat) |
| ollama pull model | Download a model |
| ollama push model | Push a model to a registry |
| ollama list | Show all downloaded models |
| ollama ps | Show currently loaded/running models |
| ollama cp source target | Copy a model |
| ollama rm model | Remove a model |
| ollama show model | Show model details |
| ollama stop model | Stop a running model |

Environment Variables

| Variable | Default | Purpose |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | Bind address. Set to 0.0.0.0 for network access. |
| OLLAMA_MODELS | ~/.ollama/models | Where models are stored on disk. |
| OLLAMA_NUM_PARALLEL | 1 | Number of parallel request slots. |
| OLLAMA_MAX_LOADED_MODELS | 1 | Max models kept in memory at once. |
| OLLAMA_KEEP_ALIVE | 5m | How long a model stays loaded after last request. |
| OLLAMA_DEBUG | (off) | Enable debug logging. |
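
Client scripts often need the same address the server binds to. A minimal sketch of resolving OLLAMA_HOST into a base URL follows. ollama_base_url is a hypothetical helper, and its handling (default port 11434, mapping 0.0.0.0 to loopback, no IPv6 support) is a simplifying assumption rather than Ollama's own parsing logic.

```python
import os
from urllib.parse import urlsplit

DEFAULT_HOST = "127.0.0.1:11434"


def ollama_base_url(env=None):
    """Build an http(s) base URL from OLLAMA_HOST, falling back to the default."""
    env = os.environ if env is None else env
    raw = (env.get("OLLAMA_HOST") or DEFAULT_HOST).strip()
    if "://" not in raw:
        raw = "http://" + raw  # the variable is usually given as host:port
    parts = urlsplit(raw)
    host = parts.hostname or "127.0.0.1"
    if host == "0.0.0.0":
        host = "127.0.0.1"  # 0.0.0.0 is a bind address, not something to connect to
    port = parts.port or 11434  # assumed default when no port is given
    return f"{parts.scheme}://{host}:{port}"
```

Passing a plain dict instead of os.environ makes the helper easy to test, for example ollama_base_url({"OLLAMA_HOST": "0.0.0.0:8080"}).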

Modelfile Reference

A Modelfile lets you customise how a model behaves. Create a file called Modelfile (no extension):

FROM llama3.2

# System prompt sets the model's persona
SYSTEM "You are a senior Python developer. Be concise and use type hints."

# Override default generation parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 4096
PARAMETER repeat_penalty 1.1

# Load a custom LoRA adapter (optional: delete the next line if you have no adapter file)
ADAPTER ./my-lora.safetensors

# Set the licence (for custom models)
LICENSE MIT

Then build it:

ollama create my-custom-model -f ./Modelfile
ollama run my-custom-model
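
If you maintain several variants of the same model, generating the Modelfile text from a script can be less error-prone than editing it by hand. The sketch below covers only the FROM, SYSTEM, and PARAMETER instructions shown above. render_modelfile is a hypothetical helper, and it assumes the system prompt does not itself contain triple quotes.

```python
def render_modelfile(base, system, params):
    """Return Modelfile text for a base model, a system prompt, and scalar parameters."""
    lines = [f"FROM {base}", f'SYSTEM """{system}"""']
    for name, value in params.items():
        # One PARAMETER instruction per setting, in insertion order.
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.2",
    "You are a senior Python developer. Be concise and use type hints.",
    {"temperature": 0.3, "num_ctx": 4096},
)
```

Writing text to a file named Modelfile and running ollama create against it follows the same build step as above.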

Key Parameters

| Parameter | Default | What It Does |
|---|---|---|
| temperature | 0.8 | Lower = more deterministic, higher = more creative. |
| top_p | 0.9 | Nucleus sampling. Lower = more focused. |
| top_k | 40 | Limit next-token choices to top K options. |
| num_ctx | 2048 | Context window size in tokens. Higher = more VRAM. |
| repeat_penalty | 1.1 | Penalise repetition. |
| num_predict | 128 | Max tokens to generate per response. |
| stop | (none) | Stop sequences as a JSON array. |
| seed | 0 | Random seed for reproducible outputs. |
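
These parameters can also be set per request through the options field of the REST API instead of being baked into a Modelfile. The sketch below only builds the JSON body for /api/chat; chat_payload is a hypothetical helper, and sending the request is left to whichever HTTP client you use.

```python
def chat_payload(model, prompt, *, system=None, stream=False, **options):
    """Build a request body for /api/chat with per-request generation options."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    payload = {"model": model, "messages": messages, "stream": stream}
    if options:
        # Keys mirror the Modelfile parameter names, e.g. temperature, num_ctx, stop.
        payload["options"] = options
    return payload


body = chat_payload("llama3.2", "Why is the sky blue?", temperature=0.3, stop=["\n\n"])
```

Options sent this way apply to that request only, so the same downloaded model can serve both a low-temperature coding tool and a more creative chat client.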

REST API

Ollama exposes a full REST API on http://localhost:11434. No auth by default (keep it local or use a reverse proxy).

Generate (one-shot)

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Chat (conversation)

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "stream": false
}'
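
When "stream" is true (the API default), the chat endpoint returns one JSON object per line, each carrying a fragment of the reply, and the last one has "done" set to true. A minimal sketch of reassembling the text from such lines follows; iter_chat_chunks is a hypothetical helper that accepts any iterable of lines, so it works with whatever HTTP client produces them.

```python
import json
from typing import Iterable, Iterator, Union


def iter_chat_chunks(lines: Iterable[Union[bytes, str]]) -> Iterator[str]:
    """Yield text fragments from a streamed /api/chat response, line by line."""
    for raw in lines:
        text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not text.strip():
            continue  # skip keep-alive blank lines
        event = json.loads(text)
        if "error" in event:
            raise RuntimeError(event["error"])
        piece = event.get("message", {}).get("content", "")
        if piece:
            yield piece
        if event.get("done"):
            break  # final object of the stream


# Usage with the requests library (requires a running server):
#   resp = requests.post(url, json=payload, stream=True)
#   for piece in iter_chat_chunks(resp.iter_lines()):
#       print(piece, end="", flush=True)
```

Printing fragments as they arrive makes long answers feel responsive even when total generation time is unchanged.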

List models

curl http://localhost:11434/api/tags
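
The response is a JSON object with a models list. The sketch below turns it into a size-sorted summary, assuming each entry has a name and a size in bytes as in the Ollama API documentation; summarize_models is a hypothetical helper.

```python
def summarize_models(tags):
    """Return (name, size in GB) pairs from an /api/tags response, largest first."""
    rows = [
        (m.get("name", "?"), round(m.get("size", 0) / 1e9, 1))
        for m in tags.get("models", [])
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


sample = {
    "models": [
        {"name": "llama3.2:latest", "size": 2019393189},
        {"name": "big:70b", "size": 42520413916},
    ]
}
summary = summarize_models(sample)
```

The sample dictionary is made-up data in the documented shape; in practice you would pass the parsed JSON from the curl call above.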

Pull a model

curl http://localhost:11434/api/pull -d '{
  "model": "llama3.2"
}'
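
Unless "stream" is set to false, the pull endpoint reports progress as one JSON object per line. The sketch below converts a single line into a status and a percentage. It assumes progress objects carry status and, while a layer is downloading, total and completed byte counts, as described in the Ollama API documentation; pull_progress is a hypothetical helper.

```python
import json


def pull_progress(line):
    """Parse one streamed /api/pull line into (status, percent or None)."""
    event = json.loads(line)
    if "error" in event:
        raise RuntimeError(event["error"])
    total = event.get("total")
    completed = event.get("completed", 0)
    # Only layer-download events include byte counts; others report status alone.
    percent = round(100 * completed / total, 1) if total else None
    return event.get("status", ""), percent
```

Feeding each streamed line through this function is enough to drive a simple progress bar in a script.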

Python example

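import requests

# Send a single non-streaming chat request to the local Ollama server.
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "Say hello in 5 languages."}],
        "stream": False,
    },
    timeout=120,  # large models can take a while to load on first use
)
response.raise_for_status()  # surface HTTP errors such as an unknown model name
print(response.json()["message"]["content"])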
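
The chat endpoint is stateless, so a multi-turn conversation means resending the full message history with every request. A minimal sketch of that bookkeeping follows, using only the standard library. Conversation and post_json are hypothetical names, and the send parameter exists so the transport can be swapped out, for example in tests.

```python
import json
import urllib.request


def post_json(url, payload, timeout=120):
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


class Conversation:
    """Keeps the message history and resends it on every turn."""

    def __init__(self, model, system=None, send=post_json,
                 base_url="http://localhost:11434"):
        self.model = model
        self.send = send
        self.url = f"{base_url}/api/chat"
        self.messages = []
        if system:
            self.messages.append({"role": "system", "content": system})

    def ask(self, text):
        self.messages.append({"role": "user", "content": text})
        payload = {"model": self.model, "messages": self.messages, "stream": False}
        reply = self.send(self.url, payload)["message"]["content"]
        # Store the assistant's answer so the next turn has full context.
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

With a server running, Conversation("llama3.2").ask("Hello") sends the first turn. Because the history grows with each turn, long conversations eventually exceed num_ctx and need trimming.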

Popular Models

| Model | Sizes | Best For | Notes |
|---|---|---|---|
| Llama 3.2 | 1B, 3B | Fast chat, mobile/edge | Tiny but capable. 3B runs on a phone. |
| Llama 3.1 | 8B, 70B, 405B | General purpose, coding | Best all-rounder. 8B fits most GPUs. |
| Mistral | 7B, 12B | Fast inference | Efficient. Runs well on 8GB VRAM. |
| Mixtral | 8x7B (MoE) | Heavy reasoning | Mixture-of-Experts: ~13B used per token. |
| Qwen 2.5 | 0.5B–72B | Multilingual, coding | Strong Chinese + English. 32B excellent for coding. |
| Phi-4 | 14B | Small device, reasoning | Microsoft. Punches above its weight. |
| DeepSeek R1 | 7B–70B (distilled) | Math, reasoning | "Thinking" model. Great at step-by-step logic. |
| Gemma 2 | 2B, 9B, 27B | General, research | Google. 9B outperforms many 13B models. |
| Command R / R+ | 35B (R), 104B (R+) | RAG, enterprise | Cohere. Optimised for retrieval. |

Using Ollama with Tools

Open WebUI (ChatGPT-like frontend)

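# --add-host lets the container reach an Ollama server running on the host (needed on Linux)
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000. If Ollama is not detected automatically, set the Ollama connection URL to http://host.docker.internal:11434 in the settings.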

Continue.dev (VS Code / Cursor)

  1. Install the Continue extension in VS Code.
  2. Set the model provider to Ollama.
  3. Select a downloaded model.
  4. Get autocomplete, chat, and inline editing powered by your local model.

AnythingLLM (RAG with your documents)

  1. Download AnythingLLM from anythingllm.com.
  2. Select Ollama as the LLM provider.
  3. Point it at your documents for local RAG.

LangChain / LlamaIndex

from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="llama3.2", temperature=0.3)
result = llm.invoke("Write a haiku about servers")
print(result)

Troubleshooting

| Problem | Likely Cause | Fix |
|---|---|---|
| Model requires more VRAM than available | Model too big for your GPU | Use a smaller model or Q3_K_S quantization. |
| Very slow generation (1–2 tok/s) | Model offloaded to CPU/RAM | Check ollama ps. Use a smaller model. |
| ollama: command not found | Ollama not installed, or its binary is not on PATH | Reinstall, or add the install directory to PATH. If the command exists but cannot connect to the server, start the service (on Linux: sudo systemctl start ollama). |
| CUDA errors on startup | Missing/incompatible NVIDIA drivers | Update drivers (e.g. nvidia-driver-550). |
| Model rambles / poor responses | Bad parameters or wrong model | Lower temperature to 0.3–0.5. Use instruct-tuned variant. |
| Error: pull access denied | Model name is wrong | Verify name on ollama.com/library. |
| Out of disk space | Too many models cached | Run ollama list, then ollama rm the unused ones. |
| API requests hang / timeout | Model still loading | Large models take 10–60s to load. Use stream: true. |
| Port 11434 already in use | Another Ollama instance running | killall ollama or set OLLAMA_HOST to a different port. |
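
For the slow-generation case, the same information that ollama ps prints is available from the /api/ps endpoint. The sketch below computes how much of each loaded model sits in GPU memory. It assumes each entry reports size and size_vram in bytes, as in the Ollama API documentation; gpu_fraction is a hypothetical helper and the sample data is made up.

```python
def gpu_fraction(ps_response):
    """Map each loaded model name to the share of it held in VRAM (0.0 to 1.0)."""
    result = {}
    for model in ps_response.get("models", []):
        size = model.get("size", 0)
        in_vram = model.get("size_vram", 0)
        result[model.get("name", "?")] = round(in_vram / size, 2) if size else 0.0
    return result


sample_ps = {
    "models": [
        {"name": "llama3.2:latest", "size": 4000, "size_vram": 4000},
        {"name": "big:70b", "size": 1000, "size_vram": 250},
    ]
}
fractions = gpu_fraction(sample_ps)
```

A value well below 1.0 means part of the model is running from system RAM, which usually explains generation speeds of only a few tokens per second.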

Tips & Advice