Tags: llm, ai, machine-learning, transformer, quantization, inference, beginner
Last updated: 2026-07-01

LLMs Cheatsheet

What Is an LLM?

LLM stands for Large Language Model — a type of artificial intelligence trained on vast amounts of text (books, articles, code, the web) so it can understand and generate human-like language.

Think of it as autocomplete on steroids. You give it a prompt (the start of a sentence, a question, or an entire conversation), and it predicts what comes next, one word-like chunk (called a token) at a time.

Core Idea	Explanation
It's a pattern matcher	LLMs learn statistical patterns from text — they don't “understand” meaning the way humans do.
It's a next-token predictor	Given a sequence, it predicts the most likely next token. Repeat. That's how text is generated.
It's not a search engine	LLMs don't browse the internet (unless connected to a tool). They generate text based on what they learned during training.
Size matters (mostly)	Bigger models generally perform better, but smaller quantized models can run on a laptop.

How LLMs Work (Simplified)

1. Training

The model is fed terabytes of text and learns to predict the next token. Over time, it builds an internal representation of grammar, facts, reasoning patterns, and even style. Training a large model takes weeks to months on GPU clusters and can cost millions of dollars.

2. The Transformer Architecture

Almost all modern LLMs use the Transformer architecture (introduced by Google in 2017). Key concepts:

Concept	What It Means
Attention	The model decides which words matter most to each other. “The cat sat on the mat — it was fluffy.” The model learns that “it” refers to “cat”.
Layers	The model passes text through dozens/hundreds of attention layers, each refining the representation.
Parameters	The weights and biases the model learns. More parameters = more capacity to store patterns. GPT-3 had 175B; modern models have hundreds of billions to over a trillion.
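As a rough illustration of the Attention row above, here is a minimal sketch of scaled dot-product attention for a single query. The 2-d vectors are made up for this example (a trained model learns them), and the names `attend`, `keys`, `values`, and `query_for_it` are hypothetical. Real Transformers also use learned projections and many attention heads, which this sketch leaves out.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    dim = len(query)
    # Similarity of the query to every key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, output

# Made-up vectors for three tokens (assumption: purely illustrative numbers).
tokens = ["cat", "mat", "it"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_for_it = [2.0, 0.0]  # chosen so that "it" lines up with "cat"

weights, _ = attend(query_for_it, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.2f}")
```

With these numbers, the largest weight lands on “cat”, which is the behaviour the table describes for “it”.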

3. Inference

After training, the model can infer (generate) text. You give it a prompt, it runs through its layers once, and outputs a probability distribution over all possible tokens. It picks one (usually the most likely, sometimes randomly for creativity), appends it to the prompt, and repeats.

 Prompt: "The capital of France is"
   ↓
 Model predicts: "Paris" (98%), "Lyon" (1%), ...
   ↓
 Output: "The capital of France is Paris"
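The predict-append-repeat loop described above can be sketched in a few lines. This is a toy, not a real model: `TOY_MODEL` is a hypothetical lookup table with made-up probabilities, and `predict_next` stands in for a full forward pass through the network.

```python
# Hypothetical next-token table standing in for a real model.
# Keys are the last token seen; values are made-up probabilities.
TOY_MODEL = {
    " is": {" Paris": 0.98, " Lyon": 0.01, " Nice": 0.01},
    " Paris": {".": 0.9, ",": 0.1},
    ".": {"<eos>": 1.0},
}

def predict_next(tokens):
    """Return a probability distribution over the next token (toy lookup)."""
    return TOY_MODEL.get(tokens[-1], {"<eos>": 1.0})

def generate(prompt_tokens, max_new_tokens=10):
    """Greedy decoding: always append the most likely token, then repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = predict_next(tokens)
        next_token = max(probs, key=probs.get)
        if next_token == "<eos>":  # end-of-sequence marker stops generation
            break
        tokens.append(next_token)
    return "".join(tokens)

print(generate(["The", " capital", " of", " France", " is"]))
```

A real model conditions on the whole sequence rather than only the last token, and usually samples from the distribution instead of always taking the maximum.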

4. Tokens

Tokens are the atomic unit LLMs read and write. ~100 tokens ≈ 75 English words. Different tokenizers split text differently:

“hello world” → ["hello", " world"] (2 tokens)
“unbelievably” → ["un", "believ", "ably"] (3 tokens)
Code, special characters, and non-English text often take more tokens.
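The rule of thumb above (~100 tokens ≈ 75 English words) can be turned into a quick estimator. This is only a sketch: `estimate_tokens` is a hypothetical helper, it counts whitespace-separated words, and a real tokenizer will give different numbers, especially for code and non-English text.

```python
def estimate_tokens(text, words_per_token=0.75):
    """Rough token estimate from a word count (rule of thumb, not a tokenizer)."""
    word_count = len(text.split())
    return round(word_count / words_per_token)

sample = "The quick brown fox jumps over the lazy dog"
print(len(sample.split()), "words is roughly", estimate_tokens(sample), "tokens")
```

For exact counts, use the tokenizer that ships with the model you are targeting.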

Core Terminology

Term	Definition
Token	The atomic unit an LLM reads/writes. ~0.75 words per token. Input + output tokens = cost.
Parameter	A learnable weight inside the model. More parameters → more capacity (and more compute needed).
Context Window	The maximum number of tokens the model can “see” at once. Small = forgets early conversation. Large (1M+) = can ingest entire codebases or books.
Temperature	Controls randomness. 0.0 = deterministic, always picks the most likely token. 1.0+ = more creative, sometimes picks unlikely tokens.
Top-p (Nucleus Sampling)	Often used alongside temperature. Samples only from the smallest set of tokens whose cumulative probability reaches p. 0.9 means “consider the top 90% of probability mass.”
Inference	The act of running a trained model to generate text (as opposed to training).
Training Data Cutoff	The date beyond which the model has no knowledge unless given tools (search, RAG).
Embedding	A numerical vector representation of text. Used for semantic search, clustering, and RAG.
RAG (Retrieval-Augmented Generation)	The system searches a database for relevant documents and adds them to the prompt before the model answers. Reduces hallucinations and stale knowledge.
Fine-tuning	Taking a pre-trained model and training it further on a specific dataset (e.g., your company's docs) to specialise it.
RLHF (Reinforcement Learning from Human Feedback)	Training technique where humans rate model outputs to align the model with human preferences.
Hallucination	The model confidently outputs false information. LLMs don't “know” facts — they generate plausible-sounding text.
LoRA / QLoRA	Lightweight fine-tuning methods that adapt a model by training a small set of extra parameters instead of all of them. Cheap and fast.
System Prompt	The hidden instruction that sets the model's behaviour before the conversation starts.
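To make the Temperature and Top-p rows concrete, here is a minimal sketch of both applied to a made-up set of scores. The function names and the example logits are assumptions for illustration; real inference libraries implement the same ideas on tensors over the whole vocabulary.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Convert raw scores to probabilities; temperature 0 is treated as greedy."""
    if temperature <= 0:
        best = max(logits, key=logits.get)
        return {tok: (1.0 if tok == best else 0.0) for tok in logits}
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    kept, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {tok: prob / total for tok, prob in kept.items()}  # renormalise

def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick one token after applying temperature and then top-p."""
    probs = top_p_filter(apply_temperature(logits, temperature), top_p)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Made-up scores for three candidate tokens.
logits = {"Paris": 5.0, "Lyon": 1.0, "Nice": 0.5}
print(sample(logits, temperature=0))            # greedy pick
print(sample(logits, temperature=1.5, top_p=0.9))  # may vary between runs
```

Lower temperatures sharpen the distribution toward the top token; higher ones flatten it, which is why outputs become more varied.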

What Are Q Models? (Quantization)

Quantization (often called “Q models”) is the process of reducing the precision of a model's parameters to make it smaller and faster.

Why Quantize?

Benefit	What It Means
Smaller file size	A 70B-parameter model at 16-bit = ~140 GB. At 4-bit = ~35 GB, small enough for two 24 GB consumer GPUs or one GPU plus CPU offload.
Faster inference	Less precision = less memory bandwidth needed = more tokens per second.
Lower hardware requirements	Run 70B models on a pair of RTX 3090/4090 cards (or one card with partial CPU offload) instead of needing server GPUs.
Enables local LLMs	Without quantization, running large models on consumer hardware would be impractical.
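The file sizes in the table follow from simple arithmetic: parameters times bits per parameter, divided by 8 to get bytes. A minimal sketch, with `model_size_gb` as a hypothetical helper name; it counts weights only and ignores the extra memory needed at run time (for example the KV cache).

```python
def model_size_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: ~{model_size_gb(70e9, bits):.0f} GB")
```

The same function works for any model size, e.g. `model_size_gb(8e9, 4)` for an 8B model at 4-bit.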

How It Works

LLM parameters are normally stored as FP16 (16-bit floating point, ~2 bytes per parameter) or FP32 (32-bit). Quantization maps these to lower-precision formats:

Format	Bits per Parameter	Size vs FP16	Quality
FP16 / BF16	16	1x (baseline)	Full quality
INT8	8	50%	Nearly indistinguishable
INT4	4	25%	Slight quality loss on complex reasoning
INT3 / INT2	3 / 2	~19% / ~12.5%	Noticeable degradation
NF4 (Normal Float 4)	4	25%	Better than INT4 for LLMs (designed for bell-shaped weight distributions)
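One simple way to map floats to low-precision integers is absmax (symmetric) quantization: scale every weight so the largest one hits the top of the integer range, then round. The sketch below is an illustration of that idea only; the helper names are hypothetical, and practical schemes such as the GGUF K-quants work on small blocks of weights with per-block scales rather than one scale for everything.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    largest = max(abs(w) for w in weights)
    if largest == 0:
        return [0] * len(weights), 1.0
    scale = largest / qmax
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.0, 0.25, 0.0]  # made-up example weights
for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: ints={q}, worst rounding error={worst:.4f}")
```

Fewer bits means a coarser grid, so the rounding error grows, which is the quality loss the table summarises.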

GGUF and the Q4/Q5/Q8 Naming Convention

GGUF (commonly expanded as “GPT-Generated Unified Format”) is the most common file format for quantized local LLMs, popularised by llama.cpp. File names look like:

llama-3-8b-instruct.Q4_K_M.gguf

Breaking this down:

Part	Meaning
`llama-3-8b-instruct`	The base model name and variant
`Q4_K_M`	Quantization type
`.gguf`	File format
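The breakdown above can be done mechanically. This is a sketch that assumes the `name.QUANT.gguf` pattern shown here; it is a naming convention rather than a guarantee, so real files may not match. `parse_gguf_name` and the regular expression are hypothetical, written for this example.

```python
import re

# Assumed pattern: <model name>.<quant code>.gguf
GGUF_NAME = re.compile(
    r"^(?P<model>.+)\.(?P<quant>I?Q\d\w*|BF16|F16|F32)\.gguf$",
    re.IGNORECASE,
)

def parse_gguf_name(filename):
    """Split a GGUF file name into model name, quant code, and nominal bits."""
    match = GGUF_NAME.match(filename)
    if match is None:
        return None  # name does not follow the assumed convention
    quant = match.group("quant").upper()
    bits = int(re.match(r"(?:I?Q|B?F)(\d+)", quant).group(1))
    return {"model": match.group("model"), "quant": quant, "bits": bits}

print(parse_gguf_name("llama-3-8b-instruct.Q4_K_M.gguf"))
```

The `bits` value is the nominal number in the code; the average bits per weight of a K-quant is usually a little higher.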

Quantization Types Decoded

Code	Bits	K-quant Variant	Quality / Use Case
`Q2_K`	2	Smallest K-quant	Heavily degraded, only for extreme memory constraints
`Q3_K_S` / `Q3_K_M` / `Q3_K_L`	3	Small/Medium/Large	Low quality, usable for small models on very limited hardware
`Q4_K_S`	4	Small	Good balance for most users. Minimal quality loss.
`Q4_K_M`	4	Medium (recommended)	The most popular choice. Best quality/size trade-off.
`Q5_K_S`	5	Small	Slightly better than Q4, noticeably larger files
`Q5_K_M`	5	Medium	Great quality, good when you have enough VRAM
`Q6_K`	6	Single size (no S/M/L variants)	Excellent quality, large files
`Q8_0`	8	Not a K-quant (legacy 8-bit type)	Near-lossless, very large files
`F16`	16	Unquantized	Full precision, huge files, barely better than Q8

Tip: Q4_K_M is the sweet spot for most users: roughly 30% of the FP16 size (it averages a little under 5 bits per weight) with minimal quality degradation.

Important Caveats

Quantization degrades quality — A 70B model at Q4 is still better than an 8B model at F16, but it won't match the unquantized 70B.
Perplexity is the metric: Quantization quality is usually measured by perplexity (how surprised the model is by test data). Higher perplexity = worse quality. Quantizations at Q4 and above typically show only a small perplexity increase (on the order of 0.1–2%); lower-bit ones can be much worse.
Task-dependent — Creative writing and simple Q&A degrade very little. Complex math and coding can degrade measurably at Q4 and below.
Imatrix (Importance Matrix): Some quantizations use an importance matrix computed from calibration data to minimise quality loss. Quant types with an IQ prefix (e.g. IQ4_XS) are a separate family that typically relies on one, and an imatrix can also be applied to K-quants.
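Perplexity, mentioned in the caveats above, is the exponential of the average negative log-probability the model assigned to the correct tokens. A minimal sketch follows; the probabilities are made up for illustration and do not come from any real model.

```python
import math

def perplexity(token_probs):
    """exp(mean negative log-probability) over the correct tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

def relative_increase_pct(baseline, quantized):
    """Percentage by which the quantized perplexity exceeds the baseline."""
    return (quantized - baseline) / baseline * 100

# Made-up probabilities assigned to the true next token at four positions.
full_precision = [0.60, 0.40, 0.70, 0.50]
quantized = [0.58, 0.39, 0.69, 0.49]

base, quant = perplexity(full_precision), perplexity(quantized)
print(f"baseline={base:.3f} quantized={quant:.3f} "
      f"increase={relative_increase_pct(base, quant):.1f}%")
```

A model that always gives the right token probability 1.0 has perplexity 1; assigning 0.5 every time gives perplexity 2.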

Major LLMs Compared

Model	Maker	Open Weights	Best For	Notes
Claude 4 / 3.5 Sonnet	Anthropic	✘	Coding, reasoning, long context	Excellent for agentic vibe coding. Large context window (~200K). Strong safety training.
GPT-4o / o1 / o3	OpenAI	✘	General purpose, reasoning	The original. o1/o3 models “think” before answering (chain-of-thought). GPT-4o is fast and multimodal (text + images).
Gemini 2.5 Pro	Google	✘	Long context, multimodal	Massive 1M+ context window. Strong on video and audio understanding.
Llama 3 / 3.1 / 4	Meta	✔	Local deployment, fine-tuning	Best open-weight family. 8B (laptop), 70B (desktop GPU), 405B (server). Llama 4 is multimodal.
DeepSeek V3 / R1	DeepSeek	✔	Coding, math, reasoning	Chinese lab. R1 is a “reasoning” model (like o1). Very strong on math/code benchmarks. Cheap API pricing.
Mistral / Mixtral	Mistral AI	✔	Fast inference, European	Efficient models. Mixtral uses MoE (Mixture of Experts) — only part of the model activates per token, saving compute.
Qwen 2.5	Alibaba	✔	Multilingual, coding	Strong Chinese + English. 0.5B to 72B+ sizes. Good open-weight alternative.
Phi-3 / Phi-4	Microsoft	✔	Small device deployment	Tiny but capable. Phi-3-mini is 3.8B — runs on a phone. Surprising quality for size.
Command R+	Cohere	✔	RAG, enterprise	Optimised for retrieval-augmented generation and tool use. Strong at following complex instructions.

Local vs Cloud LLMs

Factor	Cloud (API)	Local (Self-Hosted)
Hardware needed	None (works in any browser)	GPU with 6GB+ VRAM for 7B, roughly 40GB+ for 70B at 4-bit (less with partial CPU offload)
Cost	Pay-per-token (usage scales)	Free after hardware (electricity only)
Privacy	You send data to a third party	Everything stays on your machine
Quality	Best models available (Claude, GPT-4o)	Good, but lags behind frontier models
Speed	Fast (depends on provider)	Limited by your GPU (2–40 tok/s)
Internet needed	Yes	No
Offline capability	None	Full offline
Model selection	Provider's models only	Any open-weight model you download
Customisation	Fine-tuning limited or unavailable	Full control — fine-tune, quantize, modify
Censorship	Provider decides content policies	You decide

Tools for Running Local LLMs

Tool	Platform	Notes
LM Studio	macOS, Windows, Linux	Friendly GUI. Download models and chat. Best for beginners.
llama.cpp	Cross-platform (CLI)	The gold standard. Runs on CPU + GPU. Powers most other tools.
Ollama	macOS, Linux, Windows	`ollama run llama3.2` — simple CLI. Great for developers.
KoboldCPP	Windows, Linux	Built for roleplay/story writing. Has UI.
Text Generation WebUI (oobabooga)	Cross-platform	Power-user UI. Supports many loaders, LoRA, training.
Jan	Cross-platform (Electron)	Beautiful UI. Simple download-and-chat.

How to Choose an LLM

You Want	Try
The best coding assistant	Claude 4 Sonnet (via API)
Free, uncensored, privacy-first	Llama 3.1 8B (Q4_K_M) via Ollama
Mathematical reasoning	DeepSeek R1 or o3 (OpenAI)
Running on a laptop with no GPU	Phi-3-mini 3.8B (Q4_K_M) or Mistral 7B (Q4_K_M); Phi-4 14B (Q4_K_M) if you have plenty of RAM and can accept slower output
Processing a whole book/codebase	Gemini 2.5 Pro (1M context)
Chatting with your documents (RAG)	Ollama + AnythingLLM
Cheapest API (good enough quality)	DeepSeek V3 or Gemini 2.5 Flash
Enterprise deployment with custom data	Llama 3.1 70B (fine-tuned)

Prompting Basics

Even the best LLM is useless without a good prompt. Here's what matters:

Technique	What It Does	Example
System Prompt	Sets the model's role before the conversation.	`“You are a senior Python developer. Write clean, type-hinted code.”`
Few-Shot Prompting	Give examples of what you want.	`“Q: What's 2+2? A: 4. Q: What's 3*5? A:”` → `15`
Chain-of-Thought (CoT)	Ask the model to think step-by-step.	`“Let's think through this step by step.”`
Role Prompting	Assign a persona.	`“You are a world-class database administrator…”`
Structured Output	Specify the exact format.	`“Respond in JSON: {"name": string, "age": number}”`
Negative Prompting	Tell the model what NOT to do.	`“Do NOT use markdown. Do NOT explain your reasoning.”`

Common Misconceptions

Misconception	Reality
“LLMs understand language”	They don't understand — they predict tokens. No consciousness, intent, or comprehension.
“LLMs are search engines”	They generate text, not retrieve it. They can confidently produce false information (hallucination).
“Bigger is always better”	A well-tuned 8B can outperform a poorly prompted 70B. Quantization and fine-tuning matter too.
“LLMs are always right”	They're often wrong on niche topics, recent events, and complex math. Always verify.
“LLMs have internet access”	They only know what's in their training data. RAG or web search must be explicitly connected.
“You need a supercomputer”	Quantized models run on consumer GPUs. Cloud APIs work from any browser.
“LLMs will replace developers”	They change how developers work (less typing, more reviewing) but still need human judgment.
“All LLMs are censored the same”	Open-weight models can be run uncensored. Cloud APIs enforce provider policies.

History & Milestones

When	What Happened
2017	Google publishes “Attention Is All You Need” — the Transformer paper that makes modern LLMs possible.
2018	OpenAI releases GPT-1 (117M parameters). Shows that large-scale language modelling works.
2019	GPT-2 (1.5B). Initially withheld due to “too dangerous” concerns.
2020	GPT-3 (175B). Demonstrates few-shot learning — models perform tasks from just a few examples.
2021	Codex (GPT-3 fine-tuned on code) powers the GitHub Copilot preview, the first mainstream AI coding assistant.
Late 2022	ChatGPT launches. Reaches 100M users in 2 months. LLMs enter public consciousness.
2023	GPT-4, Claude 2, Llama 2, Mistral 7B, Mixtral, Gemini. Quantization (GGUF/llama.cpp) makes local LLMs practical.
Early 2024	Llama 3 arrives. GPT-4o adds multimodal. Open-weight models rival closed ones.
Late 2024	o1-preview introduces “reasoning” models that think before answering. DeepSeek R1 follows in January 2025.
2025	Claude 4, Gemini 2.5 Pro (1M+ context). Llama 4 (multimodal). Small local models become shockingly capable.
Now	LLMs are infrastructure. They power IDEs, chatbots, search engines, game NPCs, and millions of API calls per day.

Philosophy

LLMs are not magic. They are pattern-matching engines trained on the collective text of humanity. They don't think, feel, or understand — but they are extraordinarily useful as tools.

They amplify human capability, not replace it. A developer using Claude + tools can ship considerably faster, but still needs to review, test, and direct.
They are probabilistic, not deterministic — Same prompt can produce different outputs. Temperature matters.
Garbage in, garbage out — Vague prompts produce vague answers. Specific, well-structured prompts produce useful output.
Know their limits — Hallucinations, stale knowledge, token limits, and lack of true reasoning are real constraints.
The best model is the one you can actually use — A quantized 8B model you run locally today is more useful than an unquantized 500B model in a paper you can't deploy.