← Cheatsheets
Tags: llm, ai, machine-learning, transformer, quantization, inference, beginnerLast updated: 2026-07-01
LLMs Cheatsheet
What Is an LLM?
LLM stands for Large Language Model — a type of artificial intelligence trained on vast amounts of text (books, articles, code, the web) so it can understand and generate human-like language.
Think of it as autocomplete on steroids. You give it a prompt (the start of a sentence, a question, or an entire conversation), and it predicts what comes next, one word-like chunk (called a token) at a time.
| Core Idea | Explanation |
| It's a pattern matcher | LLMs learn statistical patterns from text — they don't “understand” meaning the way humans do. |
| It's a next-token predictor | Given a sequence, it predicts the most likely next token. Repeat. That's how text is generated. |
| It's not a search engine | LLMs don't browse the internet (unless connected to a tool). They generate text based on what they learned during training. |
| Size matters (mostly) | Bigger models generally perform better, but smaller quantized models can run on a laptop. |
How LLMs Work (Simplified)
1. Training
The model is fed terabytes of text and learns to predict the next word. Over time, it builds an internal representation of grammar, facts, reasoning patterns, and even style. This takes weeks on supercomputer clusters costing millions.
2. The Transformer Architecture
Almost all modern LLMs use the Transformer architecture (introduced by Google in 2017). Key concepts:
| Concept | What It Means |
| Attention | The model decides which words matter most to each other. “The cat sat on the mat — it was fluffy.” The model learns that “it” refers to “cat”. |
| Layers | The model passes text through dozens/hundreds of attention layers, each refining the representation. |
| Parameters | The weights and biases the model learns. More parameters = more capacity to store patterns. GPT-3 had 175B; modern models have hundreds of billions to over a trillion. |
3. Inference
After training, the model can infer (generate) text. You give it a prompt, it runs through its layers once, and outputs a probability distribution over all possible tokens. It picks one (usually the most likely, sometimes randomly for creativity), appends it to the prompt, and repeats.
Prompt: "The capital of France is"
↓
Model predicts: "Paris" (98%), "Lyon" (1%), ...
↓
Output: "The capital of France is Paris"
4. Tokens
Tokens are the atomic unit LLMs read and write. ~100 tokens ≈ 75 English words. Different tokenizers split text differently:
- “hello world” →
["hello", " world"] (2 tokens)
- “unbelievably” →
["un", "believe", "ably"] (3 tokens)
- Code, special characters, and non-English text often take more tokens.
Core Terminology
| Term | Definition |
| Token | The atomic unit an LLM reads/writes. ~0.75 words per token. Input + output tokens = cost. |
| Parameter | A learnable weight inside the model. More parameters → more capacity (and more compute needed). |
| Context Window | The maximum number of tokens the model can “see” at once. Small = forgets early conversation. Large (1M+) = can ingest entire codebases or books. |
| Temperature | Controls randomness. 0.0 = deterministic, always picks the most likely token. 1.0+ = more creative, sometimes picks unlikely tokens. |
| Top-p (Nucleus Sampling) | Alternative to temperature. Picks from the smallest set of tokens whose cumulative probability exceeds p. 0.9 means “consider the top 90% of probability mass.” |
| Inference | The act of running a trained model to generate text (as opposed to training). |
| Training Data Cutoff | The date beyond which the model has no knowledge unless given tools (search, RAG). |
| Embedding | A numerical vector representation of text. Used for semantic search, clustering, and RAG. |
| RAG (Retrieval-Augmented Generation) | The model searches a database for relevant documents and includes them in the prompt before answering. Combats hallucinations and stale knowledge. |
| Fine-tuning | Taking a pre-trained model and training it further on a specific dataset (e.g., your company's docs) to specialise it. |
| RLHF (Reinforcement Learning from Human Feedback) | Training technique where humans rate model outputs to align the model with human preferences. |
| Hallucination | The model confidently outputs false information. LLMs don't “know” facts — they generate plausible-sounding text. |
| LoRA / QLoRA | Lightweight fine-tuning methods that adapt a model by training a small set of extra parameters instead of all of them. Cheap and fast. |
| System Prompt | The hidden instruction that sets the model's behaviour before the conversation starts. |
What Are Q Models? (Quantization)
Quantization (often called “Q models”) is the process of reducing the precision of a model's parameters to make it smaller and faster.
Why Quantize?
| Benefit | What It Means |
| Smaller file size | A 70B-parameter model at 16-bit = ~140 GB. At 4-bit = ~35 GB. Fits on a single consumer GPU. |
| Faster inference | Less precision = less memory bandwidth needed = more tokens per second. |
| Lower hardware requirements | Run 70B models on a single RTX 3090/4090 instead of needing server GPUs. |
| Enables local LLMs | Without quantization, running large models on consumer hardware would be impossible. |
How It Works
LLM parameters are normally stored as FP16 (16-bit floating point, ~2 bytes per parameter) or FP32 (32-bit). Quantization maps these to lower-precision formats:
| Format | Bits per Parameter | Size vs FP16 | Quality |
| FP16 / BF16 | 16 | 1x (baseline) | Full quality |
| INT8 | 8 | 50% | Nearly indistinguishable |
| INT4 | 4 | 25% | Slight quality loss on complex reasoning |
| INT3 / INT2 | 3 / 2 | ~18% / ~12% | Noticeable degradation |
| NF4 (Normal Float 4) | 4 | 25% | Better than INT4 for LLMs (designed for bell-shaped weight distributions) |
GGUF and the Q4/Q5/Q8 Naming Convention
GGUF (GPT-Generated Unified Format) is the most common file format for quantized local LLMs, popularised by llama.cpp. File names look like:
llama-3-8b-instruct.Q4_K_M.gguf
Breaking this down:
| Part | Meaning |
llama-3-8b-instruct | The base model name and variant |
Q4_K_M | Quantization type |
.gguf | File format |
Quantization Types Decoded
| Code | Bits | K-quant Variant | Quality / Use Case |
Q2_K | 2 | Smallest K-quant | Heavily degraded, only for extreme memory constraints |
Q3_K_S / Q3_K_M / Q3_K_L | 3 | Small/Medium/Large | Low quality, usable for small models on very limited hardware |
Q4_K_S | 4 | Small | Good balance for most users. Minimal quality loss. |
Q4_K_M | 4 | Medium (recommended) | The most popular choice. Best quality/size trade-off. |
Q5_K_S | 5 | Small | Slightly better than Q4, noticeably larger files |
Q5_K_M | 5 | Medium | Great quality, good when you have enough VRAM |
Q6_K | 6 | Full K-quant | Excellent quality, large files |
Q8_0 | 8 | Full | Near-lossless, very large files |
F16 | 16 | Unquantized | Full precision, huge files, barely better than Q8 |
Tip: Q4_K_M is the sweet spot for most users — ~25% of the original size with minimal quality degradation.
Important Caveats
- Quantization degrades quality — A 70B model at Q4 is still better than an 8B model at F16, but it won't match the unquantized 70B.
- Perplexity is the metric — Quantization quality is measured by perplexity (how surprised the model is by test data). Higher perplexity = worse quality. Most quantizations show only 0.1–2% perplexity increase.
- Task-dependent — Creative writing and simple Q&A degrade very little. Complex math and coding can degrade measurably at Q4 and below.
- Imatrix (Importance Matrix) — Some quantizations use an importance matrix computed from calibration data to minimise quality loss on specific tasks. Files with
IQ prefix use this technique.
Major LLMs Compared
| Model | Maker | Open Weights | Best For | Notes |
| Claude 4 / 3.5 Sonnet | Anthropic | ✘ | Coding, reasoning, long context | Excellent for agentic vibe coding. Large context window (~200K). Strong safety training. |
| GPT-4o / o1 / o3 | OpenAI | ✘ | General purpose, reasoning | The original. o1/o3 models “think” before answering (chain-of-thought). GPT-4o is fast and multimodal (text + images). |
| Gemini 2.5 Pro | Google | ✘ | Long context, multimodal | Massive 1M+ context window. Strong on video and audio understanding. |
| Llama 3 / 3.1 / 4 | Meta | ✔ | Local deployment, fine-tuning | Best open-weight family. 8B (laptop), 70B (desktop GPU), 405B (server). Llama 4 is multimodal. |
| DeepSeek V3 / R1 | DeepSeek | ✔ | Coding, math, reasoning | Chinese lab. R1 is a “reasoning” model (like o1). Very strong on math/code benchmarks. Cheap API pricing. |
| Mistral / Mixtral | Mistral AI | ✔ | Fast inference, European | Efficient models. Mixtral uses MoE (Mixture of Experts) — only part of the model activates per token, saving compute. |
| Qwen 2.5 | Alibaba | ✔ | Multilingual, coding | Strong Chinese + English. 0.5B to 72B+ sizes. Good open-weight alternative. |
| Phi-3 / Phi-4 | Microsoft | ✔ | Small device deployment | Tiny but capable. Phi-3-mini is 3.8B — runs on a phone. Surprising quality for size. |
| Command R+ | Cohere | ✔ | RAG, enterprise | Optimised for retrieval-augmented generation and tool use. Strong at following complex instructions. |
Local vs Cloud LLMs
| Factor | Cloud (API) | Local (Self-Hosted) |
| Hardware needed | None — works in any browser | GPU with 6GB+ VRAM for 7B, 24GB+ for 70B |
| Cost | Pay-per-token (usage scales) | Free after hardware (electricity only) |
| Privacy | You send data to a third party | Everything stays on your machine |
| Quality | Best models available (Claude, GPT-4o) | Good, but lags behind frontier models |
| Speed | Fast (depends on provider) | Limited by your GPU (2–40 tok/s) |
| Internet needed | Yes | No |
| Offline capability | None | Full offline |
| Model selection | Provider's models only | Any open-weight model you download |
| Customisation | Fine-tuning limited or unavailable | Full control — fine-tune, quantize, modify |
| Censorship | Provider decides content policies | You decide |
Tools for Running Local LLMs
| Tool | Platform | Notes |
| lm-studio | macOS, Windows, Linux | Friendly GUI. Download models and chat. Best for beginners. |
| llama.cpp | Cross-platform (CLI) | The gold standard. Runs on CPU + GPU. Powers most other tools. |
| Ollama | macOS, Linux, Windows | ollama run llama3.2 — simple CLI. Great for developers. |
| KoboldCPP | Windows, Linux | Built for roleplay/story writing. Has UI. |
| Text Generation WebUI (oobabooga) | Cross-platform | Power-user UI. Supports many loaders, LoRA, training. |
| Jan | Cross-platform (Electron) | Beautiful UI. Simple download-and-chat. |
How to Choose an LLM
| You Want | Try |
| The best coding assistant | Claude 4 Sonnet (via API) |
| Free, uncensored, privacy-first | Llama 3.1 8B (Q4_K_M) via Ollama |
| Mathematical reasoning | DeepSeek R1 or o3 (OpenAI) |
| Running on a laptop with no GPU | Phi-4 14B (Q4_K_M) or Mistral 7B (Q4_K_M) |
| Processing a whole book/codebase | Gemini 2.5 Pro (1M context) |
| Chatting with your documents (RAG) | Ollama + AnythingLLM |
| Cheapest API (good enough quality) | DeepSeek V3 or Gemini 2.5 Flash |
| Enterprise deployment with custom data | Llama 3.1 70B (fine-tuned) |
Prompting Basics
Even the best LLM is useless without a good prompt. Here's what matters:
| Technique | What It Does | Example |
| System Prompt | Sets the model's role before the conversation. | “You are a senior Python developer. Write clean, type-hinted code.” |
| Few-Shot Prompting | Give examples of what you want. | “Q: What's 2+2? A: 4. Q: What's 3*5? A:” → 15 |
| Chain-of-Thought (CoT) | Ask the model to think step-by-step. | “Let's think through this step by step.” |
| Role Prompting | Assign a persona. | “You are a world-class database administrator…” |
| Structured Output | Specify the exact format. | “Respond in JSON: {"name": string, "age": number}” |
| Negative Prompting | Tell the model what NOT to do. | “Do NOT use markdown. Do NOT explain your reasoning.” |
Common Misconceptions
| Misconception | Reality |
| “LLMs understand language” | They don't understand — they predict tokens. No consciousness, intent, or comprehension. |
| “LLMs are search engines” | They generate text, not retrieve it. They can confidently produce false information (hallucination). |
| “Bigger is always better” | A well-tuned 8B can outperform a poorly prompted 70B. Quantization and fine-tuning matter too. |
| “LLMs are always right” | They're often wrong on niche topics, recent events, and complex math. Always verify. |
| “LLMs have internet access” | They only know what's in their training data. RAG or web search must be explicitly connected. |
| “You need a supercomputer” | Quantized models run on consumer GPUs. Cloud APIs work from any browser. |
| “LLMs will replace developers” | They change how developers work (less typing, more reviewing) but still need human judgment. |
| “All LLMs are censored the same” | Open-weight models can be run uncensored. Cloud APIs enforce provider policies. |
History & Milestones
| When | What Happened |
| 2017 | Google publishes “Attention Is All You Need” — the Transformer paper that makes modern LLMs possible. |
| 2018 | OpenAI releases GPT-1 (117M parameters). Shows that large-scale language modelling works. |
| 2019 | GPT-2 (1.5B). Initially withheld due to “too dangerous” concerns. |
| 2020 | GPT-3 (175B). Demonstrates few-shot learning — models perform tasks from just a few examples. |
| 2021 | Codex (GPT-3 fine-tuned on code) launches as GitHub Copilot. First mainstream AI coding assistant. |
| Late 2022 | ChatGPT launches. Reaches 100M users in 2 months. LLMs enter public consciousness. |
| 2023 | GPT-4, Claude 2, Llama 2, Gemini. Quantization (GGUF/llama.cpp) makes local LLMs practical. |
| Early 2024 | Llama 3, Mistral, Mixtral. GPT-4o adds multimodal. Open-weight models rival closed ones. |
| Late 2024 | o1-preview introduces “reasoning” models that think before answering. DeepSeek R1 follows. |
| 2025 | Claude 4, Gemini 2.5 Pro (1M+ context). Llama 4 (multimodal). Small local models become shockingly capable. |
| Now | LLMs are infrastructure. They power IDEs, chatbots, search engines, game NPCs, and millions of API calls per day. |
Philosophy
LLMs are not magic. They are pattern-matching engines trained on the collective text of humanity. They don't think, feel, or understand — but they are extraordinarily useful as tools.
- They amplify human capability — Not replace it. A developer using Claude + tools ships 10x faster, but still needs to review, test, and direct.
- They are probabilistic, not deterministic — Same prompt can produce different outputs. Temperature matters.
- Garbage in, garbage out — Vague prompts produce vague answers. Specific, well-structured prompts produce useful output.
- Know their limits — Hallucinations, stale knowledge, token limits, and lack of true reasoning are real constraints.
- The best model is the one you can actually use — A quantized 8B model you run locally today is more useful than an unquantized 500B model in a paper you can't deploy.