Tags: llm, ai, machine-learning, transformer, quantization, inference, beginnerLast updated: 2026-07-01

LLMs Cheatsheet

What Is an LLM?

LLM stands for Large Language Model — a type of artificial intelligence trained on vast amounts of text (books, articles, code, the web) so it can understand and generate human-like language.

Think of it as autocomplete on steroids. You give it a prompt (the start of a sentence, a question, or an entire conversation), and it predicts what comes next, one word-like chunk (called a token) at a time.

Core IdeaExplanation
It's a pattern matcherLLMs learn statistical patterns from text — they don't “understand” meaning the way humans do.
It's a next-token predictorGiven a sequence, it predicts the most likely next token. Repeat. That's how text is generated.
It's not a search engineLLMs don't browse the internet (unless connected to a tool). They generate text based on what they learned during training.
Size matters (mostly)Bigger models generally perform better, but smaller quantized models can run on a laptop.

How LLMs Work (Simplified)

1. Training

The model is fed terabytes of text and learns to predict the next word. Over time, it builds an internal representation of grammar, facts, reasoning patterns, and even style. This takes weeks on supercomputer clusters costing millions.

2. The Transformer Architecture

Almost all modern LLMs use the Transformer architecture (introduced by Google in 2017). Key concepts:

ConceptWhat It Means
AttentionThe model decides which words matter most to each other. “The cat sat on the mat — it was fluffy.” The model learns that “it” refers to “cat”.
LayersThe model passes text through dozens/hundreds of attention layers, each refining the representation.
ParametersThe weights and biases the model learns. More parameters = more capacity to store patterns. GPT-3 had 175B; modern models have hundreds of billions to over a trillion.

3. Inference

After training, the model can infer (generate) text. You give it a prompt, it runs through its layers once, and outputs a probability distribution over all possible tokens. It picks one (usually the most likely, sometimes randomly for creativity), appends it to the prompt, and repeats.

 Prompt: "The capital of France is"
   ↓
 Model predicts: "Paris" (98%), "Lyon" (1%), ...
   ↓
 Output: "The capital of France is Paris"

4. Tokens

Tokens are the atomic unit LLMs read and write. ~100 tokens ≈ 75 English words. Different tokenizers split text differently:

Core Terminology

TermDefinition
TokenThe atomic unit an LLM reads/writes. ~0.75 words per token. Input + output tokens = cost.
ParameterA learnable weight inside the model. More parameters → more capacity (and more compute needed).
Context WindowThe maximum number of tokens the model can “see” at once. Small = forgets early conversation. Large (1M+) = can ingest entire codebases or books.
TemperatureControls randomness. 0.0 = deterministic, always picks the most likely token. 1.0+ = more creative, sometimes picks unlikely tokens.
Top-p (Nucleus Sampling)Alternative to temperature. Picks from the smallest set of tokens whose cumulative probability exceeds p. 0.9 means “consider the top 90% of probability mass.”
InferenceThe act of running a trained model to generate text (as opposed to training).
Training Data CutoffThe date beyond which the model has no knowledge unless given tools (search, RAG).
EmbeddingA numerical vector representation of text. Used for semantic search, clustering, and RAG.
RAG (Retrieval-Augmented Generation)The model searches a database for relevant documents and includes them in the prompt before answering. Combats hallucinations and stale knowledge.
Fine-tuningTaking a pre-trained model and training it further on a specific dataset (e.g., your company's docs) to specialise it.
RLHF (Reinforcement Learning from Human Feedback)Training technique where humans rate model outputs to align the model with human preferences.
HallucinationThe model confidently outputs false information. LLMs don't “know” facts — they generate plausible-sounding text.
LoRA / QLoRALightweight fine-tuning methods that adapt a model by training a small set of extra parameters instead of all of them. Cheap and fast.
System PromptThe hidden instruction that sets the model's behaviour before the conversation starts.

What Are Q Models? (Quantization)

Quantization (often called “Q models”) is the process of reducing the precision of a model's parameters to make it smaller and faster.

Why Quantize?

BenefitWhat It Means
Smaller file sizeA 70B-parameter model at 16-bit = ~140 GB. At 4-bit = ~35 GB. Fits on a single consumer GPU.
Faster inferenceLess precision = less memory bandwidth needed = more tokens per second.
Lower hardware requirementsRun 70B models on a single RTX 3090/4090 instead of needing server GPUs.
Enables local LLMsWithout quantization, running large models on consumer hardware would be impossible.

How It Works

LLM parameters are normally stored as FP16 (16-bit floating point, ~2 bytes per parameter) or FP32 (32-bit). Quantization maps these to lower-precision formats:

FormatBits per ParameterSize vs FP16Quality
FP16 / BF16161x (baseline)Full quality
INT8850%Nearly indistinguishable
INT4425%Slight quality loss on complex reasoning
INT3 / INT23 / 2~18% / ~12%Noticeable degradation
NF4 (Normal Float 4)425%Better than INT4 for LLMs (designed for bell-shaped weight distributions)

GGUF and the Q4/Q5/Q8 Naming Convention

GGUF (GPT-Generated Unified Format) is the most common file format for quantized local LLMs, popularised by llama.cpp. File names look like:

llama-3-8b-instruct.Q4_K_M.gguf

Breaking this down:

PartMeaning
llama-3-8b-instructThe base model name and variant
Q4_K_MQuantization type
.ggufFile format

Quantization Types Decoded

CodeBitsK-quant VariantQuality / Use Case
Q2_K2Smallest K-quantHeavily degraded, only for extreme memory constraints
Q3_K_S / Q3_K_M / Q3_K_L3Small/Medium/LargeLow quality, usable for small models on very limited hardware
Q4_K_S4SmallGood balance for most users. Minimal quality loss.
Q4_K_M4Medium (recommended)The most popular choice. Best quality/size trade-off.
Q5_K_S5SmallSlightly better than Q4, noticeably larger files
Q5_K_M5MediumGreat quality, good when you have enough VRAM
Q6_K6Full K-quantExcellent quality, large files
Q8_08FullNear-lossless, very large files
F1616UnquantizedFull precision, huge files, barely better than Q8

Tip: Q4_K_M is the sweet spot for most users — ~25% of the original size with minimal quality degradation.

Important Caveats

Major LLMs Compared

ModelMakerOpen WeightsBest ForNotes
Claude 4 / 3.5 SonnetAnthropicCoding, reasoning, long contextExcellent for agentic vibe coding. Large context window (~200K). Strong safety training.
GPT-4o / o1 / o3OpenAIGeneral purpose, reasoningThe original. o1/o3 models “think” before answering (chain-of-thought). GPT-4o is fast and multimodal (text + images).
Gemini 2.5 ProGoogleLong context, multimodalMassive 1M+ context window. Strong on video and audio understanding.
Llama 3 / 3.1 / 4MetaLocal deployment, fine-tuningBest open-weight family. 8B (laptop), 70B (desktop GPU), 405B (server). Llama 4 is multimodal.
DeepSeek V3 / R1DeepSeekCoding, math, reasoningChinese lab. R1 is a “reasoning” model (like o1). Very strong on math/code benchmarks. Cheap API pricing.
Mistral / MixtralMistral AIFast inference, EuropeanEfficient models. Mixtral uses MoE (Mixture of Experts) — only part of the model activates per token, saving compute.
Qwen 2.5AlibabaMultilingual, codingStrong Chinese + English. 0.5B to 72B+ sizes. Good open-weight alternative.
Phi-3 / Phi-4MicrosoftSmall device deploymentTiny but capable. Phi-3-mini is 3.8B — runs on a phone. Surprising quality for size.
Command R+CohereRAG, enterpriseOptimised for retrieval-augmented generation and tool use. Strong at following complex instructions.

Local vs Cloud LLMs

FactorCloud (API)Local (Self-Hosted)
Hardware neededNone — works in any browserGPU with 6GB+ VRAM for 7B, 24GB+ for 70B
CostPay-per-token (usage scales)Free after hardware (electricity only)
PrivacyYou send data to a third partyEverything stays on your machine
QualityBest models available (Claude, GPT-4o)Good, but lags behind frontier models
SpeedFast (depends on provider)Limited by your GPU (2–40 tok/s)
Internet neededYesNo
Offline capabilityNoneFull offline
Model selectionProvider's models onlyAny open-weight model you download
CustomisationFine-tuning limited or unavailableFull control — fine-tune, quantize, modify
CensorshipProvider decides content policiesYou decide

Tools for Running Local LLMs

ToolPlatformNotes
lm-studiomacOS, Windows, LinuxFriendly GUI. Download models and chat. Best for beginners.
llama.cppCross-platform (CLI)The gold standard. Runs on CPU + GPU. Powers most other tools.
OllamamacOS, Linux, Windowsollama run llama3.2 — simple CLI. Great for developers.
KoboldCPPWindows, LinuxBuilt for roleplay/story writing. Has UI.
Text Generation WebUI (oobabooga)Cross-platformPower-user UI. Supports many loaders, LoRA, training.
JanCross-platform (Electron)Beautiful UI. Simple download-and-chat.

How to Choose an LLM

You WantTry
The best coding assistantClaude 4 Sonnet (via API)
Free, uncensored, privacy-firstLlama 3.1 8B (Q4_K_M) via Ollama
Mathematical reasoningDeepSeek R1 or o3 (OpenAI)
Running on a laptop with no GPUPhi-4 14B (Q4_K_M) or Mistral 7B (Q4_K_M)
Processing a whole book/codebaseGemini 2.5 Pro (1M context)
Chatting with your documents (RAG)Ollama + AnythingLLM
Cheapest API (good enough quality)DeepSeek V3 or Gemini 2.5 Flash
Enterprise deployment with custom dataLlama 3.1 70B (fine-tuned)

Prompting Basics

Even the best LLM is useless without a good prompt. Here's what matters:

TechniqueWhat It DoesExample
System PromptSets the model's role before the conversation.“You are a senior Python developer. Write clean, type-hinted code.”
Few-Shot PromptingGive examples of what you want.“Q: What's 2+2? A: 4. Q: What's 3*5? A:”15
Chain-of-Thought (CoT)Ask the model to think step-by-step.“Let's think through this step by step.”
Role PromptingAssign a persona.“You are a world-class database administrator…”
Structured OutputSpecify the exact format.“Respond in JSON: {"name": string, "age": number}”
Negative PromptingTell the model what NOT to do.“Do NOT use markdown. Do NOT explain your reasoning.”

Common Misconceptions

MisconceptionReality
“LLMs understand language”They don't understand — they predict tokens. No consciousness, intent, or comprehension.
“LLMs are search engines”They generate text, not retrieve it. They can confidently produce false information (hallucination).
“Bigger is always better”A well-tuned 8B can outperform a poorly prompted 70B. Quantization and fine-tuning matter too.
“LLMs are always right”They're often wrong on niche topics, recent events, and complex math. Always verify.
“LLMs have internet access”They only know what's in their training data. RAG or web search must be explicitly connected.
“You need a supercomputer”Quantized models run on consumer GPUs. Cloud APIs work from any browser.
“LLMs will replace developers”They change how developers work (less typing, more reviewing) but still need human judgment.
“All LLMs are censored the same”Open-weight models can be run uncensored. Cloud APIs enforce provider policies.

History & Milestones

WhenWhat Happened
2017Google publishes “Attention Is All You Need” — the Transformer paper that makes modern LLMs possible.
2018OpenAI releases GPT-1 (117M parameters). Shows that large-scale language modelling works.
2019GPT-2 (1.5B). Initially withheld due to “too dangerous” concerns.
2020GPT-3 (175B). Demonstrates few-shot learning — models perform tasks from just a few examples.
2021Codex (GPT-3 fine-tuned on code) launches as GitHub Copilot. First mainstream AI coding assistant.
Late 2022ChatGPT launches. Reaches 100M users in 2 months. LLMs enter public consciousness.
2023GPT-4, Claude 2, Llama 2, Gemini. Quantization (GGUF/llama.cpp) makes local LLMs practical.
Early 2024Llama 3, Mistral, Mixtral. GPT-4o adds multimodal. Open-weight models rival closed ones.
Late 2024o1-preview introduces “reasoning” models that think before answering. DeepSeek R1 follows.
2025Claude 4, Gemini 2.5 Pro (1M+ context). Llama 4 (multimodal). Small local models become shockingly capable.
NowLLMs are infrastructure. They power IDEs, chatbots, search engines, game NPCs, and millions of API calls per day.

Philosophy

LLMs are not magic. They are pattern-matching engines trained on the collective text of humanity. They don't think, feel, or understand — but they are extraordinarily useful as tools.