Tags: machine-learning, ml, ai, supervised, unsupervised, scikit-learn, algorithms, beginner
Last updated: 2026-07-01
Machine Learning Cheatsheet
What Is Machine Learning?
Machine Learning (ML) is an approach to programming where, instead of writing explicit rules, you show a computer examples and let it work out the patterns itself.
| Approach | Traditional Programming | Machine Learning |
| You write | Rules + data | Data + expected answers |
| Computer produces | Answers | Rules (the model) |
| Best when | You know all the rules | You have lots of examples |
| Example | "If temp > 30, return 'hot'" | Show 10,000 labelled photos |
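The contrast in the table can be sketched in a few lines. This is a toy illustration, not a real learning algorithm: `learn_threshold` and the example temperatures are invented here, and the "model" is just a single cut-off value derived from labelled examples.

```python
# Traditional programming: a human writes the rule.
def is_hot_rule(temp_c: float) -> bool:
    return temp_c > 30


# Machine learning (toy version): the rule is derived from labelled examples.
def learn_threshold(examples: list[tuple[float, bool]]) -> float:
    """Return the cut-off that misclassifies the fewest labelled examples."""
    temps = sorted(t for t, _ in examples)
    # Candidate cut-offs: midpoints between neighbouring temperatures.
    candidates = [(a + b) / 2 for a, b in zip(temps, temps[1:])]

    def errors(threshold: float) -> int:
        return sum((t > threshold) != label for t, label in examples)

    return min(candidates, key=errors)


# Hypothetical labelled data: (temperature in °C, is it "hot"?)
examples = [(18, False), (22, False), (27, False), (29, False),
            (31, True), (33, True), (36, True)]

threshold = learn_threshold(examples)
print(threshold)  # 30.0
```

The learned cut-off matches the hand-written rule here only because the examples happen to agree with it; with different data the "rule" would move.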
Types of ML
| Type | What It Does | Example | Algorithms |
| Supervised | Learn from labelled examples | Predict house price | Linear regression, Random Forest, SVM |
| Unsupervised | Find hidden patterns | Group customers | k-Means, DBSCAN, PCA |
| Reinforcement | Learn through trial and error | Train a game agent | Q-Learning, PPO |
| Semi-Supervised | Mix of labelled + unlabelled | Classify with few labels | Self-training |
When ML vs Traditional Programming
Do you know all the rules?
└── No → Do you have examples?
    └── Yes → Use supervised learning
Rule of thumb: ML is a good candidate when a person can do the task almost instantly (recognising a face, reading handwriting) but cannot write down the rules they use.
Core Terminology
| Term | Definition |
| Feature | Input variable the model uses to predict (e.g., square footage). |
| Label | Output you're trying to predict (e.g., house price). |
| Training Set | Data the model learns from. Typically 60-80% of your data. |
| Validation Set | Data used to tune hyperparameters during development. |
| Test Set | Held-out data for final performance. Never peek until you're done. |
| Overfitting | Model memorises training data but fails on new data. |
| Underfitting | Model too simple to capture the pattern. |
| Loss Function | Measures how wrong the model is. |
| Gradient Descent | Optimisation that tweaks parameters to reduce loss. |
| Epoch | One complete pass through the training dataset. |
| Batch | Subset of data processed before updating parameters. |
| Hyperparameters | Choices made before training: learning rate, tree depth, etc. |
| Feature Engineering | Creating better input features from raw data. |
| Baseline | Simple model to beat. If your model can't beat "predict the average," it's useless. |
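Several of the terms above (loss function, gradient descent, epoch, batch, hyperparameter) fit into one small loop. The sketch below fits a line with plain Python; the toy data, learning rate, and epoch count are assumptions chosen so the example converges, not recommended defaults.

```python
# Toy data generated from y = 2x + 1 (an assumption for this sketch).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]


def mse(w: float, b: float) -> float:
    """Loss function: mean squared error of the line y = w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


w, b = 0.0, 0.0        # parameters the model learns
learning_rate = 0.05   # hyperparameter, chosen before training
n_epochs = 2000        # hyperparameter: passes over the training data

for epoch in range(n_epochs):
    # Full-batch update: the batch is the whole dataset,
    # so there is exactly one parameter update per epoch.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w={w:.3f}, b={b:.3f}, loss={mse(w, b):.6f}")
```

With mini-batches, the inner update would run once per subset of the data instead of once per epoch.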
Algorithm Selection
For Regression (predicting a number)
| Start Here | If That Fails | Best For |
| Linear Regression | Ridge / Lasso | Simple, interpretable. Fastest baseline. |
| Decision Tree Regressor | Random Forest Regressor | Non-linear relationships, feature interactions. |
| Gradient Boosting (XGBoost / LightGBM) | Neural Network | Structured/tabular data. Usually the winner. |
For Classification (predicting a category)
| Start Here | If That Fails | Best For |
| Logistic Regression | Regularised LR | Binary classification, baseline, interpretable. |
| k-Nearest Neighbours | SVM with RBF kernel | Non-linear boundaries, small datasets. |
| Random Forest | Gradient Boosting | Most tabular data. Robust with little tuning; missing-value support varies by implementation and version. |
| Neural Network | — | Images, audio, text, very large datasets. |
For Clustering (grouping without labels)
| Algorithm | Best For |
| k-Means | Spherical clusters, known number of groups. Fast and simple. |
| DBSCAN | Arbitrary shapes, outlier detection, unknown k. |
| Hierarchical | Small datasets, dendrogram visualisation. |
| Gaussian Mixture | Overlapping clusters, soft assignments (probabilities). |
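To see why cluster shape matters, the sketch below runs k-Means and DBSCAN on scikit-learn's two-moons toy dataset and scores each against the known grouping. The `eps` and `min_samples` values are assumptions tuned by hand for this data; real data usually needs its own tuning.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: clearly two groups, but not spherical ones.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means the clustering matches the true grouping.
kmeans_ari = adjusted_rand_score(y_true, kmeans_labels)
dbscan_ari = adjusted_rand_score(y_true, dbscan_labels)

# DBSCAN marks points it considers outliers with the label -1.
n_noise = int((dbscan_labels == -1).sum())

print(f"k-Means ARI: {kmeans_ari:.2f}")
print(f"DBSCAN  ARI: {dbscan_ari:.2f} ({n_noise} points flagged as noise)")
```

k-Means splits the plane with a straight boundary, so it cuts across both moons, while DBSCAN follows the dense curved regions.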
For Dimensionality Reduction
| Algorithm | Best For |
| PCA | Linear relationships, visualising high-dimensional data. |
| t-SNE | Visualising complex high-dimensional data (2D/3D plots). |
| UMAP | Faster than t-SNE, preserves more global structure. |
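As a minimal sketch of dimensionality reduction, the snippet below projects the four Iris features onto two principal components. The features are left unscaled because they share a unit (cm); with mixed units, standardising first is usually the safer assumption.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data          # 150 samples, 4 features
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)   # 150 samples, 2 components

print(X_2d.shape)
# Share of the total variance captured by each component.
print(pca.explained_variance_ratio_)
```

`X_2d` can be passed straight to a scatter plot, coloured by species, to inspect how well the classes separate.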
Key Algorithms Explained
| Algorithm | Type | One-Sentence Summary |
| Linear Regression | Regression | Fits a line through the data. Simple, interpretable, often wrong. |
| Logistic Regression | Classification | Outputs a probability via sigmoid. Best baseline for binary classification. |
| Decision Tree | Both | Asks yes/no questions. Easy to understand, prone to overfitting. |
| Random Forest | Both | Hundreds of averaged trees. Reduces overfitting, excellent default. |
| k-NN | Both | Looks at k closest examples and returns their average/vote. |
| SVM | Classification | Finds best hyperplane separating classes. Great for small/medium data. |
| Naive Bayes | Classification | Uses probability with "naive" assumption features are independent. Fast for text. |
| k-Means | Clustering | Picks k centroids and groups by proximity. Simple and fast. |
| DBSCAN | Clustering | Groups dense regions, finds outliers. No need to specify k. |
| Neural Network | Both | Layers of "neurons" learning hierarchies. Needs lots of data. |
| XGBoost / LightGBM | Both | Gradient-boosted trees. State-of-the-art for tabular data. |
| PCA | Dim Reduction | Projects data onto directions of maximum variance. |
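The k-NN summary is short enough to implement directly. This is a from-scratch sketch for classification only (majority vote, Euclidean distance); the helper name `knn_predict` and the toy points are made up for illustration, and in practice `sklearn.neighbors.KNeighborsClassifier` would be the usual choice.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    train: list of (features, label) pairs, where features is a tuple of floats.
    """
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: two well-separated groups.
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((4.0, 4.2), "b"), ((4.1, 3.9), "b"), ((3.8, 4.0), "b"),
]

print(knn_predict(train, (1.1, 1.0)))  # a
print(knn_predict(train, (4.0, 4.0)))  # b
```

There is no training step: the "model" is the stored data, which is why prediction gets slower as the dataset grows.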
The ML Pipeline
Raw Data → Clean → Explore → Engineer → Split → Train → Evaluate → Deploy → Monitor
Step-by-Step with scikit-learn
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# 1. Load
df = pd.read_csv("data.csv")
# 2. Clean
df = df.dropna()
df["feature"] = df["feature"].astype(float)
# 3. Feature engineer
df["new_feature"] = df["col_a"] / (df["col_b"] + 1e-6)
# 4. Split
X = df.drop("target", axis=1)
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# 5. Train (fixed seed so results are reproducible)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# 6. Evaluate
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred))
Evaluation Metrics
Classification
| Metric | What It Measures | When to Use |
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Balanced classes. Misleading when data is imbalanced. |
| Precision | TP / (TP + FP) | False positives are costly (spam filter). |
| Recall | TP / (TP + FN) | False negatives are costly (cancer screening). |
| F1 Score | 2 × (P × R) / (P + R) | Imbalanced data where precision and recall both matter. |
| ROC/AUC | Area under TPR vs FPR curve | How well model ranks positives vs negatives. |
| Confusion Matrix | TP / FP / FN / TN in a grid | Always check before trusting accuracy. |
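The formulas in the table can be checked by hand on a small confusion matrix. The counts below are invented to mimic a 1%-positive dataset; the function name is a placeholder for this sketch.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the table's metrics from raw confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: 10 real positives out of 1,000 examples.
m = classification_metrics(tp=5, fp=5, fn=5, tn=985)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
# accuracy: 0.99
# precision: 0.50
# recall: 0.50
# f1: 0.50
```

Accuracy looks excellent while the model misses half of the positives, which is the reason to check the confusion matrix first.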
Regression
| Metric | What It Means | Scale |
| MSE | Average squared error. Penalises large errors heavily. | Same as target² |
| MAE | Average absolute error. More interpretable. | Same as target |
| R² | Proportion of variance explained. 1.0 = perfect, 0 = no better than predicting the mean. | At most 1; negative when worse than predicting the mean |
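A plain-Python sketch of the three regression metrics, using made-up targets and predictions. The second set of predictions is deliberately worse than always predicting the mean, which produces a negative R².

```python
def regression_metrics(y_true, y_pred):
    """Return MSE, MAE and R² for two equal-length lists of numbers."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual error
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # error of "predict the mean"
    return {
        "mse": ss_res / n,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "r2": 1 - ss_res / ss_tot,
    }


y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
good = regression_metrics(y_true, [1.1, 1.9, 3.2, 3.8, 5.0])
bad = regression_metrics(y_true, [5.0, 4.0, 3.0, 2.0, 1.0])  # reversed order

print({k: round(v, 2) for k, v in good.items()})
print({k: round(v, 2) for k, v in bad.items()})
```

In scikit-learn the same values come from `mean_squared_error`, `mean_absolute_error`, and `r2_score` in `sklearn.metrics`.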
Overfitting & How to Prevent It
Overfitting is when the model memorises training noise instead of learning the pattern. Great on training data, poor on new data.
Signs of Overfitting
- Training accuracy >> validation accuracy
- The gap grows as training progresses
- Model weights are very large
- Decision boundary is overly complex
How to Fix It
| Technique | What It Does |
| More data | More examples = harder to memorise noise. |
| Simplify the model | Reduce tree depth, fewer features. |
| Regularisation (L1/L2) | Penalty for large weights. L1 drives some to zero. |
| Dropout | Randomly turn off neurons during training. |
| Early stopping | Stop when validation loss starts increasing. |
| Cross-validation | More reliable performance estimate. |
| Data augmentation | Create synthetic examples by modifying existing ones. |
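The first sign of overfitting (training accuracy far above validation accuracy) and the "simplify the model" fix can be seen together in a quick experiment. This sketch uses a synthetic dataset with deliberately noisy labels; the dataset parameters and the depth of 3 are assumptions picked for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 randomly reassigns 20% of labels, so there is noise to memorise.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

results = {}
for depth in (3, None):  # None lets the tree grow until it fits every training point
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_val, y_val))
    train_acc, val_acc = results[depth]
    print(f"max_depth={depth}: train={train_acc:.2f}, validation={val_acc:.2f}")
```

The unrestricted tree reaches perfect training accuracy by memorising the noisy labels, and the gap to validation accuracy is the warning sign.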
Python ML Stack
| Library | Purpose | Import |
| NumPy | Numerical arrays, math | import numpy as np |
| pandas | Data loading, cleaning | import pandas as pd |
| scikit-learn | Algorithms, preprocessing | from sklearn import ... |
| matplotlib | Basic plotting | import matplotlib.pyplot as plt |
| seaborn | Statistical visualisations | import seaborn as sns |
| XGBoost | Gradient boosting | import xgboost as xgb |
| LightGBM | Faster gradient boosting | import lightgbm as lgb |
Your First ML Project (Iris)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
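With only 30 test samples, a single split gives a noisy accuracy figure. A hedged follow-up is to score the same model with 5-fold cross-validation; the choice of `cv=5` is a common convention rather than anything the dataset requires.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Each of the 5 folds is used once as held-out data; the model is refit each time.
scores = cross_val_score(model, X, y, cv=5)
print(f"Accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

The spread across folds gives a rough sense of how much the single-split number could move by chance.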
Common Pitfalls
| Pitfall | Effect | How to Avoid |
| Data leakage | Test info leaks into training. Unrealistic performance. | Split before preprocessing. Never use full-dataset statistics. |
| Imbalanced data | 99% negative, 1% positive. 99% accuracy is useless. | Use F1, class weighting, oversampling (SMOTE). |
| Correlation ≠ causation | Ice cream and shark attacks are correlated (summer). | Ask: is there a hidden third variable? |
| Normalising before splitting | Test data influences training preprocessing. | Fit the scaler on train only, then use it to transform both train and test. |
| No baseline | Fancy model gets 80%. Predict majority class gets 82%. | Always run a simple baseline first. |
| Tuning on test set | Model is overfit to test data. | Use a validation set. Touch test set once at the end. |
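Three of the pitfalls above (leakage through preprocessing, imbalanced data, no baseline) can be addressed in one short script. This is a sketch on synthetic data: the 90/10 class split, the choice of logistic regression, and `class_weight="balanced"` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data: roughly 90% negative, 10% positive.
X, y = make_classification(
    n_samples=2000, weights=[0.9, 0.1], flip_y=0, random_state=0
)
# Split first; stratify keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline: always predict the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# The pipeline fits the scaler on the training data only,
# then reuses those statistics when predicting on the test set.
model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
]).fit(X_train, y_train)

scores = {}
for name, estimator in [("baseline", baseline), ("pipeline", model)]:
    pred = estimator.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred, zero_division=0),
    }
    print(name, {k: round(v, 2) for k, v in scores[name].items()})
```

The baseline scores high on accuracy but zero on F1, so accuracy alone would make a useless model look acceptable.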
Philosophy
- Start simple. Linear regression + baseline tells you if the problem is solvable.
- More data often beats a better algorithm. Consider collecting more (and cleaner) data before heavy tuning.
- A deployed model beats a perfect one. 80% in production > 99% stuck in a notebook.
- Garbage in, garbage out. Bias in the data becomes bias in the model.
- ML is a tool, not a solution. Sometimes a simple if/else rule is all you need.
History & Milestones
| When | What Happened |
| 1957 | Perceptron invented: an early trainable neural network. |
| 1986 | Backpropagation popularised — multi-layer learning. |
| 1997 | Deep Blue beats Kasparov at chess. |
| 2000s | SVM and Random Forest become dominant. |
| 2012 | AlexNet wins ImageNet — deep learning revolution. |
| 2014 | GANs invented — AI generates realistic images. |
| 2017 | Transformer paper ("Attention Is All You Need"): the architecture behind most modern LLMs. |
| 2022 | ChatGPT launches — ML enters public consciousness. |
| Now | ML powers search, recommendations, self-driving cars, medical imaging. |
The best model is the one you can actually use.