Tags: machine-learning, ml, ai, supervised, unsupervised, scikit-learn, algorithms, beginner
Last updated: 2026-07-01

Machine Learning Cheatsheet

What Is Machine Learning?

Machine Learning (ML) is an approach to programming in which, instead of writing explicit rules, you show a computer examples and let it work out the patterns itself.

Approach	Traditional Programming	Machine Learning
You write	Rules + data	Data + expected answers
Computer produces	Answers	Rules (the model)
Best when	You know all the rules	You have lots of examples
Example	"If temp > 30, return 'hot'"	Show 10,000 labelled photos
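To make the contrast concrete, here is a minimal sketch that puts a hand-written rule next to a rule learned from examples. The toy temperatures and labels are made up for illustration, and a depth-1 decision tree is just one convenient way to learn a single threshold.

```python
from sklearn.tree import DecisionTreeClassifier

# Traditional programming: a person writes the rule.
def is_hot_rule(temp_c: float) -> str:
    return "hot" if temp_c > 30 else "not hot"

# Machine learning: the rule is learned from labelled examples.
# Hypothetical toy data: temperatures in °C and the label a person gave each one.
temps = [[10], [18], [24], [29], [31], [33], [36], [40]]
labels = ["not hot", "not hot", "not hot", "not hot", "hot", "hot", "hot", "hot"]

# A depth-1 tree can only learn one threshold, which keeps the "rule" readable.
model = DecisionTreeClassifier(max_depth=1, random_state=0)
model.fit(temps, labels)

learned_threshold = model.tree_.threshold[0]  # split point chosen by the tree
predictions = list(model.predict([[25], [35]]))

print(f"Hand-written rule: hot if temp > 30")
print(f"Learned rule:      hot if temp > {learned_threshold:.1f}")
print(predictions)
```

The tree places its split midway between the closest "not hot" and "hot" examples, so the learned rule depends entirely on the data it was shown.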

Types of ML

Type	What It Does	Example	Algorithms
Supervised	Learn from labelled examples	Predict house price	Linear regression, Random Forest, SVM
Unsupervised	Find hidden patterns	Group customers	k-Means, DBSCAN, PCA
Reinforcement	Learn through trial and error	Train a game agent	Q-Learning, PPO
Semi-Supervised	Mix of labelled + unlabelled	Classify with few labels	Self-training

When ML vs Traditional Programming

Do you know all the rules?
  ├── Yes → Use traditional programming
  └── No  → Do you have examples?
       └── Yes → Use supervised learning

Rule of thumb: if a person can do the task with about a second of thought (e.g., recognising a cat in a photo) but cannot write down the rules for it, ML is a good candidate.

Core Terminology

Term	Definition
Feature	Input variable the model uses to predict (e.g., square footage).
Label	Output you're trying to predict (e.g., house price).
Training Set	Data the model learns from. Typically 60-80% of your data.
Validation Set	Data used to tune hyperparameters during development.
Test Set	Held-out data for final performance. Never peek until you're done.
Overfitting	Model memorises training data but fails on new data.
Underfitting	Model too simple to capture the pattern.
Loss Function	Measures how wrong the model is.
Gradient Descent	Optimisation that tweaks parameters to reduce loss.
Epoch	One complete pass through the training dataset.
Batch	Subset of data processed before updating parameters.
Hyperparameters	Choices made before training: learning rate, tree depth, etc.
Feature Engineering	Creating better input features from raw data.
Baseline	Simple model to beat. If your model can't beat "predict the average," it's useless.
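Several of these terms (loss function, gradient descent, epoch, batch, hyperparameters) fit into one short loop. The sketch below is a plain NumPy illustration on synthetic data; the values chosen for the learning rate, epoch count, and batch size are assumptions, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: one feature X and a label y that is roughly 3x + 2 plus noise.
X = rng.uniform(0, 10, size=200)
y = 3.0 * X + 2.0 + rng.normal(0, 0.5, size=200)

# Parameters (learned during training).
w, b = 0.0, 0.0

# Hyperparameters (chosen before training; illustrative values).
learning_rate = 0.01
n_epochs = 200
batch_size = 20

def mse(y_true, y_pred):
    """Loss function: mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

initial_loss = mse(y, w * X + b)

for epoch in range(n_epochs):              # one epoch = one full pass over the data
    order = rng.permutation(len(X))        # shuffle so batches differ each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]      # one batch
        error = (w * X[idx] + b) - y[idx]
        grad_w = 2 * np.mean(error * X[idx])       # d(loss)/dw on this batch
        grad_b = 2 * np.mean(error)                # d(loss)/db on this batch
        w -= learning_rate * grad_w                # gradient descent step
        b -= learning_rate * grad_b

final_loss = mse(y, w * X + b)
print(f"w = {w:.2f}, b = {b:.2f}")
print(f"loss: {initial_loss:.2f} -> {final_loss:.2f}")
```

The learned w and b should end up close to the 3 and 2 used to generate the data, and the loss should fall to roughly the variance of the added noise.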

Algorithm Selection

For Regression (predicting a number)

Start Here	If That Fails	Best For
Linear Regression	Ridge / Lasso	Simple, interpretable. Fastest baseline.
Decision Tree Regressor	Random Forest Regressor	Non-linear relationships, feature interactions.
Gradient Boosting (XGBoost / LightGBM)	Neural Network	Structured/tabular data. Usually the winner.
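One way to apply the "start here, then escalate" idea is to score a few candidates with the same cross-validation split and compare them against a mean-predicting baseline. This is a sketch on scikit-learn's bundled diabetes dataset; the dataset and the hyperparameter values are illustrative choices.

```python
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Candidates ordered from simplest to most complex.
candidates = {
    "Baseline (mean)": DummyRegressor(strategy="mean"),
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

scores = {}
for name, model in candidates.items():
    # 5-fold cross-validated R²; higher is better.
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2")
    scores[name] = r2.mean()
    print(f"{name:<20} mean R² = {scores[name]:.3f}")
```

If a more complex model does not clearly beat the simpler one, the simpler one is usually the better choice to keep.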

For Classification (predicting a category)

Start Here	If That Fails	Best For
Logistic Regression	Regularised LR	Binary classification, baseline, interpretable.
k-Nearest Neighbours	SVM with RBF kernel	Non-linear boundaries, small datasets.
Random Forest	Gradient Boosting	Most tabular data. Needs little preprocessing (no feature scaling); missing-value support depends on the library and version.
Neural Network	—	Images, audio, text, very large datasets.
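The same comparison works for classification. In this sketch the scale-sensitive models (logistic regression, k-NN) are wrapped in a pipeline with a scaler, while the random forest is used as-is. The breast cancer dataset bundled with scikit-learn and the settings shown are assumptions made for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    # Scaling happens inside the pipeline, so it is refit on each training fold.
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "k-NN (k=5)": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
    # Tree ensembles do not need feature scaling.
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

accuracies = {}
for name, model in candidates.items():
    accuracies[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:<20} mean accuracy = {accuracies[name]:.3f}")
```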

For Clustering (grouping without labels)

Algorithm	Best For
k-Means	Spherical clusters, known number of groups. Fast and simple.
DBSCAN	Arbitrary shapes, outlier detection, unknown k.
Hierarchical	Small datasets, dendrogram visualisation.
Gaussian Mixture	Overlapping clusters, soft assignments (probabilities).
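The difference between "spherical clusters" and "arbitrary shapes" is easiest to see on two interleaved half-moons. The sketch below uses synthetic data, and the `eps` and `min_samples` values are assumptions tuned to this toy example; the true groups are used only to score the result.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two crescent-shaped groups; true_groups is used only for scoring.
X, true_groups = make_moons(n_samples=300, noise=0.05, random_state=42)

# k-Means needs k up front and assumes roughly round clusters.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# DBSCAN groups dense regions; points it cannot assign are labelled -1.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means the clustering matches the true groups exactly.
kmeans_ari = adjusted_rand_score(true_groups, kmeans_labels)
dbscan_ari = adjusted_rand_score(true_groups, dbscan_labels)

print(f"k-Means ARI: {kmeans_ari:.2f}")
print(f"DBSCAN  ARI: {dbscan_ari:.2f}")
```

On this shape DBSCAN should follow the crescents while k-Means cuts straight across them. On round, well-separated blobs the ranking can easily flip.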

For Dimensionality Reduction

Algorithm	Best For
PCA	Linear relationships, visualising high-dimensional data.
t-SNE	Visualising complex high-dimensional data (2D/3D plots).
UMAP	Faster than t-SNE, preserves more global structure.
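As a quick illustration of dimensionality reduction, this sketch projects the four Iris features onto two principal components. Standardising first is a common choice rather than a requirement, and Iris is used only because it ships with scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)          # 150 samples, 4 features

# PCA is sensitive to feature scale, so standardise first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)          # 150 samples, 2 components

explained = pca.explained_variance_ratio_
print(f"Shape after PCA: {X_2d.shape}")
print(f"Variance explained per component: {explained.round(3)}")
print(f"Total variance kept: {explained.sum():.1%}")
```

`X_2d` can be passed straight to a scatter plot, coloured by `y`, to see how well the classes separate in two dimensions.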

Key Algorithms Explained

Algorithm	Type	One-Sentence Summary
Linear Regression	Regression	Fits a line through the data. Simple and interpretable, but often too simple for non-linear relationships.
Logistic Regression	Classification	Outputs a probability via sigmoid. Best baseline for binary classification.
Decision Tree	Both	Asks yes/no questions. Easy to understand, prone to overfitting.
Random Forest	Both	Hundreds of averaged trees. Reduces overfitting, excellent default.
k-NN	Both	Looks at k closest examples and returns their average/vote.
SVM	Classification	Finds best hyperplane separating classes. Great for small/medium data.
Naive Bayes	Classification	Uses probability with "naive" assumption features are independent. Fast for text.
k-Means	Clustering	Picks k centroids and groups by proximity. Simple and fast.
DBSCAN	Clustering	Groups dense regions, finds outliers. No need to specify k.
Neural Network	Both	Layers of "neurons" learning hierarchies. Needs lots of data.
XGBoost / LightGBM	Both	Gradient-boosted trees. State-of-the-art for tabular data.
PCA	Dim Reduction	Projects data onto directions of maximum variance.

The ML Pipeline

Raw Data → Clean → Explore → Engineer → Split → Train → Evaluate → Deploy → Monitor

Step-by-Step with scikit-learn

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# 1. Load
df = pd.read_csv("data.csv")

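# 2. Clean: drop rows with missing values and make sure the column is numeric
# ("feature", "col_a", "col_b" and "target" are placeholder column names)
df = df.dropna()
df["feature"] = df["feature"].astype(float)

# 3. Feature engineer: ratio of two columns; the small constant avoids division by zero
df["new_feature"] = df["col_a"] / (df["col_b"] + 1e-6)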

# 4. Split
X = df.drop("target", axis=1)
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5. Train (random_state makes the result reproducible)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 6. Evaluate
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred))

Evaluation Metrics

Classification

Metric	What It Measures	When to Use
Accuracy	(TP + TN) / (TP + TN + FP + FN)	Balanced classes. Misleading when data is imbalanced.
Precision	TP / (TP + FP)	False positives are costly (spam filter).
Recall	TP / (TP + FN)	False negatives are costly (cancer screening).
F1 Score	2 × (P × R) / (P + R)	Harmonic mean of precision and recall. A common single metric for imbalanced data.
ROC/AUC	Area under TPR vs FPR curve	How well model ranks positives vs negatives.
Confusion Matrix	TP / FP / FN / TN in a grid	Always check before trusting accuracy.
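The formulas in the table can be checked by hand against scikit-learn. This sketch uses ten made-up labels and predictions, computes each metric from the confusion-matrix counts, and compares the result with the library functions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels [0, 1], ravel() returns the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")

# The library functions give the same numbers.
sk_metrics = (
    accuracy_score(y_true, y_pred),
    precision_score(y_true, y_pred),
    recall_score(y_true, y_pred),
    f1_score(y_true, y_pred),
)
```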

Regression

Metric	What It Means	Scale
MSE	Average squared error. Penalises large errors heavily.	Same as target²
MAE	Average absolute error. More interpretable.	Same as target
R²	Proportion of variance explained. 1.0 = perfect, 0 = no better than predicting the mean.	Up to 1 (negative if worse than predicting the mean)
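A small worked example makes the scales tangible. The four target values and the two sets of predictions below are invented for illustration; the second set is deliberately bad, to show that R² can drop below zero.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
good_pred = np.array([2.5, 5.0, 7.5, 9.0])   # close to the truth
bad_pred = np.array([9.0, 7.0, 5.0, 3.0])    # trend is reversed

results = {}
for name, y_pred in [("good", good_pred), ("bad", bad_pred)]:
    results[name] = {
        "mse": mean_squared_error(y_true, y_pred),   # squared target units
        "mae": mean_absolute_error(y_true, y_pred),  # same units as the target
        "r2": r2_score(y_true, y_pred),              # unitless
    }
    print(name, {k: round(float(v), 3) for k, v in results[name].items()})

# R² by hand for the good predictions: 1 - SS_res / SS_tot
ss_res = np.sum((y_true - good_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_by_hand = 1 - ss_res / ss_tot
```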

Overfitting & How to Prevent It

Overfitting is when the model memorises training noise instead of learning the pattern. Great on training data, poor on new data.

Signs of Overfitting

Training accuracy >> validation accuracy
The gap grows as training progresses
Model weights are very large
Decision boundary is overly complex

How to Fix It

Technique	What It Does
More data	More examples = harder to memorise noise.
Simplify the model	Reduce tree depth, fewer features.
Regularisation (L1/L2)	Penalty for large weights. L1 drives some to zero.
Dropout	Randomly turn off neurons during training.
Early stopping	Stop when validation loss starts increasing.
Cross-validation	Gives a more reliable performance estimate, so overfitting is easier to detect.
Data augmentation	Create synthetic examples by modifying existing ones.
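The claim that L1 drives some weights to zero while L2 only shrinks them can be seen by fitting Lasso (L1) and Ridge (L2) on the same data. This is a sketch on synthetic data in which only 5 of 20 features matter; the `alpha` values are illustrative, not tuned.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only 5 of which actually influence the target.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=42
)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))

print(f"Lasso: {lasso_zeros} of {len(lasso.coef_)} coefficients are exactly zero")
print(f"Ridge: {ridge_zeros} of {len(ridge.coef_)} coefficients are exactly zero")
```

Because Lasso removes features outright, it doubles as a rough feature-selection step. Ridge keeps every feature but with smaller weights.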

Python ML Stack

Library	Purpose	Import
NumPy	Numerical arrays, math	`import numpy as np`
pandas	Data loading, cleaning	`import pandas as pd`
scikit-learn	Algorithms, preprocessing	`from sklearn import ...`
matplotlib	Basic plotting	`import matplotlib.pyplot as plt`
seaborn	Statistical visualisations	`import seaborn as sns`
XGBoost	Gradient boosting	`import xgboost as xgb`
LightGBM	Faster gradient boosting	`import lightgbm as lgb`

Your First ML Project (Iris)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")

Common Pitfalls

Pitfall	Effect	How to Avoid
Data leakage	Test info leaks into training. Unrealistic performance.	Split before preprocessing. Never use full-dataset statistics.
Imbalanced data	99% negative, 1% positive. 99% accuracy is useless.	Use F1, class weighting, oversampling (SMOTE).
Correlation ≠ causation	Ice cream and shark attacks are correlated (summer).	Ask: is there a hidden third variable?
Normalising before splitting	Test data influences training preprocessing.	Split first. Fit the scaler on the training set only, then use it to transform both sets.
No baseline	Fancy model gets 80%. Predict majority class gets 82%.	Always run a simple baseline first.
Tuning on test set	Model is overfit to test data.	Use a validation set. Touch test set once at the end.
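Several of these pitfalls can be avoided with the same few habits: split first, keep preprocessing inside a pipeline, and compare against a majority-class baseline using a metric that suits imbalanced data. The sketch below assumes synthetic data with roughly 5% positives; the model choice and settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data: about 95% negative, 5% positive.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# Split BEFORE any preprocessing; stratify keeps the class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Baseline: always predict the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# The scaler sits inside the pipeline, so it is fitted on training data only.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
).fit(X_train, y_train)

baseline_pred = baseline.predict(X_test)
model_pred = model.predict(X_test)

baseline_acc = accuracy_score(y_test, baseline_pred)
baseline_f1 = f1_score(y_test, baseline_pred, zero_division=0)
model_acc = accuracy_score(y_test, model_pred)
model_f1 = f1_score(y_test, model_pred, zero_division=0)

print(f"Baseline: accuracy={baseline_acc:.2f}  F1={baseline_f1:.2f}")
print(f"Model:    accuracy={model_acc:.2f}  F1={model_f1:.2f}")
```

The baseline scores high on accuracy while finding no positives at all, which is why its F1 is zero. Judging the model on F1 rather than accuracy makes that difference visible.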

Philosophy

Start simple. Linear regression + baseline tells you if the problem is solvable.
More data often beats a better algorithm. Consider collecting more (and cleaner) data before heavy tuning.
A deployed model beats a perfect one. 80% in production > 99% stuck in a notebook.
Garbage in, garbage out. Bias in the data becomes bias in the model.
ML is a tool, not a solution. Sometimes a simple if/else rule is all you need.

History & Milestones

When	What Happened
1957	Perceptron invented, an early trainable neural network.
1986	Backpropagation popularised — multi-layer learning.
1997	Deep Blue beats Kasparov at chess.
2000s	SVM and Random Forest become dominant.
2012	AlexNet wins ImageNet — deep learning revolution.
2014	GANs invented — AI generates realistic images.
2017	Transformer paper ("Attention Is All You Need") introduces the architecture behind most modern LLMs.
2022	ChatGPT launches — ML enters public consciousness.
Now	ML powers search, recommendations, self-driving cars, medical imaging.

The best model is the one you can actually use.