Tags: machine-learning, ml, ai, supervised, unsupervised, scikit-learn, algorithms, beginner
Last updated: 2026-07-01

Machine Learning Cheatsheet

What Is Machine Learning?

Machine Learning (ML) is an approach to programming where, instead of writing explicit rules, you show a computer examples and let it work out the patterns itself.

Approach | Traditional Programming | Machine Learning
You write | Rules + data | Data + expected answers
Computer produces | Answers | Rules (the model)
Best when | You know all the rules | You have lots of examples
Example | "If temp > 30, return 'hot'" | Show 10,000 labelled photos
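
The contrast in the table can be sketched in a few lines. This is a minimal illustration, not a recommended workflow: the temperature values are made up, and a depth-1 decision tree is just one convenient way to let a model infer the threshold from labelled examples.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Traditional programming: a human writes the rule.
def label_by_rule(temp_c: float) -> str:
    return "hot" if temp_c > 30 else "not hot"

# Machine learning: we supply examples plus expected answers,
# and the model infers a threshold on its own.
# The temperatures below are made-up illustration data.
temps = np.array([[12], [18], [22], [27], [29], [31], [33], [36], [40]])
answers = np.array([label_by_rule(t) for t in temps.ravel()])

model = DecisionTreeClassifier(max_depth=1, random_state=0)
model.fit(temps, answers)

# The learned split sits midway between the closest examples
# on either side of the boundary (29 and 31).
learned_threshold = model.tree_.threshold[0]
print(f"Learned threshold: {learned_threshold:.1f}")
print(model.predict([[25], [35]]))
```

The model never sees the number 30; it recovers a boundary close to it from the examples alone.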

Types of ML

Type | What It Does | Example | Algorithms
Supervised | Learn from labelled examples | Predict house price | Linear regression, Random Forest, SVM
Unsupervised | Find hidden patterns | Group customers | k-Means, DBSCAN, PCA
Reinforcement | Learn through trial and error | Train a game agent | Q-Learning, PPO
Semi-Supervised | Mix of labelled + unlabelled | Classify with few labels | Self-training

When ML vs Traditional Programming

Do you know all the rules?
  ├── Yes → Use traditional programming
  └── No  → Do you have examples?
       └── Yes → Use supervised learning

Rule of thumb: If a human can do the task in about a second of thought (e.g., recognising a face) but can't write down the rules for it, ML is probably a good fit.

Core Terminology

Term | Definition
Feature | Input variable the model uses to predict (e.g., square footage).
Label | Output you're trying to predict (e.g., house price).
Training Set | Data the model learns from. Typically 60-80% of your data.
Validation Set | Data used to tune hyperparameters during development.
Test Set | Held-out data for final performance. Never peek until you're done.
Overfitting | Model memorises training data but fails on new data.
Underfitting | Model too simple to capture the pattern.
Loss Function | Measures how wrong the model is.
Gradient Descent | Optimisation that tweaks parameters to reduce loss.
Epoch | One complete pass through the training dataset.
Batch | Subset of data processed before updating parameters.
Hyperparameters | Choices made before training: learning rate, tree depth, etc.
Feature Engineering | Creating better input features from raw data.
Baseline | Simple model to beat. If your model can't beat "predict the average," it's useless.
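
Several of the terms above (loss function, gradient descent, epoch, batch, hyperparameters) fit together in one small training loop. The sketch below is a hand-rolled illustration in NumPy on synthetic data; the data, the learning rate, and the batch size are all assumptions chosen for the example, not recommended defaults.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 3x + 2 plus a little noise.
X = rng.uniform(-1, 1, size=200)
y = 3.0 * X + 2.0 + rng.normal(0, 0.1, size=200)

w, b = 0.0, 0.0          # parameters the model learns
learning_rate = 0.1      # hyperparameter: chosen before training
n_epochs = 200           # hyperparameter: number of full passes over the data
batch_size = 20          # hyperparameter: examples per parameter update

def mse(w, b):
    """Loss function: mean squared error over the whole dataset."""
    return np.mean((w * X + b - y) ** 2)

initial_loss = mse(w, b)
for epoch in range(n_epochs):
    order = rng.permutation(len(X))              # reshuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]    # one batch
        error = w * X[idx] + b - y[idx]
        # Gradients of the batch MSE with respect to w and b.
        grad_w = 2 * np.mean(error * X[idx])
        grad_b = 2 * np.mean(error)
        # Gradient descent step: move against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

final_loss = mse(w, b)
print(f"w={w:.2f}, b={b:.2f}, loss {initial_loss:.3f} -> {final_loss:.3f}")
```

The learned w and b should land near the 3 and 2 used to generate the data, with the loss dropping to roughly the noise level.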

Algorithm Selection

For Regression (predicting a number)

Start Here | If That Fails | Best For
Linear Regression | Ridge / Lasso | Simple, interpretable. Fastest baseline.
Decision Tree Regressor | Random Forest Regressor | Non-linear relationships, feature interactions.
Gradient Boosting (XGBoost / LightGBM) | Neural Network | Structured/tabular data. Usually the winner.
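
One way to act on the table is to score a few candidates side by side against a "predict the mean" baseline. The sketch below assumes synthetic data from `make_regression` and 5-fold cross-validated R²; swap in your own `X` and `y`. The model choices and settings are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic tabular data (assumption); replace with your own features and target.
X, y = make_regression(
    n_samples=300, n_features=10, n_informative=5, noise=10.0, random_state=0
)

candidates = {
    "Predict the mean (baseline)": DummyRegressor(strategy="mean"),
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in candidates.items():
    # Mean 5-fold cross-validated R²; higher is better.
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name:30s} R2 = {scores[name]:.3f}")
```

Because this synthetic target is linear by construction, the linear models should do very well here; on real data the ranking can easily differ.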

For Classification (predicting a category)

Start Here | If That Fails | Best For
Logistic Regression | Regularised LR | Binary classification, baseline, interpretable.
k-Nearest Neighbours | SVM with RBF kernel | Non-linear boundaries, small datasets.
Random Forest | Gradient Boosting | Most tabular data. Handles missing values well.
Neural Network | - | Images, audio, text, very large datasets.
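
The same side-by-side approach works for classification. This sketch uses scikit-learn's built-in breast cancer dataset as a stand-in for your data and compares three of the "Start Here" options with 5-fold cross-validated accuracy. Scaling is wrapped in a pipeline for the distance- and gradient-based models; the hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    # Scaling lives inside the pipeline, so it is refit on each training fold.
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "k-Nearest Neighbours": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

cv_accuracy = {}
for name, model in candidates.items():
    cv_accuracy[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name:22s} accuracy = {cv_accuracy[name]:.3f}")
```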

For Clustering (grouping without labels)

Algorithm | Best For
k-Means | Spherical clusters, known number of groups. Fast and simple.
DBSCAN | Arbitrary shapes, outlier detection, unknown k.
Hierarchical | Small datasets, dendrogram visualisation.
Gaussian Mixture | Overlapping clusters, soft assignments (probabilities).
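
The "spherical vs arbitrary shapes" distinction is easiest to see on data that is deliberately non-spherical. The sketch below uses the two-moons toy dataset; the `eps` and `min_samples` values are assumptions tuned to this toy data and would need adjusting elsewhere. The adjusted Rand index compares each clustering with the known groups (1.0 means a perfect match).

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: clearly separated, but not spherical.
X, true_groups = make_moons(n_samples=400, noise=0.05, random_state=0)

# k-Means needs k up front and assumes roughly round clusters.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN grows clusters through dense regions; a label of -1 marks an outlier.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

kmeans_ari = adjusted_rand_score(true_groups, kmeans_labels)
dbscan_ari = adjusted_rand_score(true_groups, dbscan_labels)
print(f"k-Means ARI: {kmeans_ari:.2f}")
print(f"DBSCAN  ARI: {dbscan_ari:.2f}")
```

On this dataset DBSCAN should follow each moon, while k-Means tends to cut straight across both.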

For Dimensionality Reduction

Algorithm | Best For
PCA | Linear relationships, visualising high-dimensional data.
t-SNE | Visualising complex high-dimensional data (2D/3D plots).
UMAP | Faster than t-SNE, preserves more global structure.
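
As a quick sketch of the PCA row, the snippet below reduces the four Iris features to two components and reports how much of the variance those two directions keep. Standardising first is an assumption that suits features on different scales; it is not required by PCA itself.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)          # 150 samples, 4 features

# Put every feature on the same scale so none dominates the variance.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)         # 150 samples, 2 components

explained = pca.explained_variance_ratio_
print(f"Shape after PCA: {X_2d.shape}")
print(f"Variance kept by each component: {explained.round(3)}")
print(f"Total variance kept: {explained.sum():.3f}")
```

`X_2d` can be passed straight to a scatter plot to visualise the data in two dimensions.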

Key Algorithms Explained

Algorithm | Type | One-Sentence Summary
Linear Regression | Regression | Fits a straight line (or hyperplane) through the data. Simple and interpretable, but often too simple.
Logistic Regression | Classification | Outputs a probability via sigmoid. Best baseline for binary classification.
Decision Tree | Both | Asks yes/no questions. Easy to understand, prone to overfitting.
Random Forest | Both | Hundreds of averaged trees. Reduces overfitting, excellent default.
k-NN | Both | Looks at the k closest examples and returns their average/vote.
SVM | Classification | Finds the best hyperplane separating classes. Great for small/medium data.
Naive Bayes | Classification | Uses probability with the "naive" assumption that features are independent. Fast for text.
k-Means | Clustering | Picks k centroids and groups by proximity. Simple and fast.
DBSCAN | Clustering | Groups dense regions, finds outliers. No need to specify k.
Neural Network | Both | Layers of "neurons" learning hierarchies of features. Needs lots of data.
XGBoost / LightGBM | Both | Gradient-boosted trees. State-of-the-art for tabular data.
PCA | Dim Reduction | Projects data onto directions of maximum variance.

The ML Pipeline

Raw Data → Clean → Explore → Engineer → Split → Train → Evaluate → Deploy → Monitor

Step-by-Step with scikit-learn

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# 1. Load
df = pd.read_csv("data.csv")

# 2. Clean
df = df.dropna()
df["feature"] = df["feature"].astype(float)

# 3. Feature engineer
df["new_feature"] = df["col_a"] / (df["col_b"] + 1e-6)

# 4. Split
X = df.drop("target", axis=1)
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5. Train
model = RandomForestClassifier(n_estimators=100, random_state=42)  # fixed seed for reproducible results
model.fit(X_train, y_train)

# 6. Evaluate
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred))
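
The walkthrough above splits into train and test only. When hyperparameters need tuning, a three-way split keeps the test set untouched until the end. The sketch below is one possible arrangement (60/20/20 on scikit-learn's built-in wine dataset); the candidate depths and the dataset are assumptions for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# First carve off 20% as the test set, then split the rest 75/25,
# giving roughly 60% train, 20% validation, 20% test.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42
)

# Tune a hyperparameter using the validation set only.
best_depth, best_val_score = None, -1.0
for depth in (1, 2, 4, 8):
    candidate = RandomForestClassifier(
        n_estimators=100, max_depth=depth, random_state=42
    )
    candidate.fit(X_train, y_train)
    val_score = candidate.score(X_val, y_val)
    if val_score > best_val_score:
        best_depth, best_val_score = depth, val_score

# Refit on train + validation with the chosen depth, then touch the test set once.
final_model = RandomForestClassifier(
    n_estimators=100, max_depth=best_depth, random_state=42
)
final_model.fit(X_trainval, y_trainval)
test_score = final_model.score(X_test, y_test)
print(f"Best depth: {best_depth}, validation accuracy: {best_val_score:.2f}")
print(f"Final test accuracy: {test_score:.2f}")
```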

Evaluation Metrics

Classification

Metric | What It Measures | When to Use
Accuracy | (TP + TN) / (TP + TN + FP + FN) | Balanced classes. Misleading when data is imbalanced.
Precision | TP / (TP + FP) | False positives are costly (spam filter).
Recall | TP / (TP + FN) | False negatives are costly (cancer screening).
F1 Score | 2 × (P × R) / (P + R) | Best single metric for imbalanced data.
ROC/AUC | Area under TPR vs FPR curve | How well model ranks positives vs negatives.
Confusion Matrix | TP / FP / FN / TN in a grid | Always check before trusting accuracy.
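
The formulas in the table can be checked by hand against scikit-learn. The labels and predictions below are a tiny made-up example (1 = positive, 0 = negative), chosen so the counts are easy to follow.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Made-up labels and predictions for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")   # TP=3 FP=1 FN=1 TN=5

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 8 / 10 = 0.80
precision = tp / (tp + fp)                          # 3 / 4  = 0.75
recall = tp / (tp + fn)                             # 3 / 4  = 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75

# The library functions give the same numbers.
print(accuracy, accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```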

Regression

Metric | What It Means | Scale
MSE | Average squared error. Penalises large errors heavily. | Target units squared
MAE | Average absolute error. More interpretable. | Same units as target
R² | Proportion of variance explained. 1.0 = perfect. | Usually 0 to 1 (negative if worse than predicting the mean)
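
A small worked example makes the three regression metrics concrete. The four values below are made up so the arithmetic is easy to follow by hand; the manual results are compared with the scikit-learn functions of the same name.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 12.0])

errors = y_pred - y_true                  # [-1, 0, 1, 3]

mse = np.mean(errors ** 2)                # (1 + 0 + 1 + 9) / 4 = 2.75
mae = np.mean(np.abs(errors))             # (1 + 0 + 1 + 3) / 4 = 1.25

# R² = 1 - (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(errors ** 2)                      # 11
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 9 + 1 + 1 + 9 = 20
r2 = 1 - ss_res / ss_tot                          # 0.45

print(mse, mean_squared_error(y_true, y_pred))
print(mae, mean_absolute_error(y_true, y_pred))
print(r2, r2_score(y_true, y_pred))
```

The single large error (3) contributes 9 of the 11 squared-error units, which is why MSE reacts to outliers far more than MAE does.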

Overfitting & How to Prevent It

Overfitting is when the model memorises training noise instead of learning the pattern. Great on training data, poor on new data.
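
The "great on training data, poor on new data" pattern shows up as a gap between training and test scores. The sketch below is an illustration on synthetic, deliberately noisy data (an assumption): an unrestricted decision tree is compared with a depth-limited one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of the labels randomly reassigned (flip_y),
# so there is real noise for a flexible model to memorise.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# No depth limit: the tree keeps splitting until every training point is correct.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Depth limit: the tree can only learn a few broad rules.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

deep_train, deep_test = deep.score(X_train, y_train), deep.score(X_test, y_test)
shallow_train, shallow_test = (
    shallow.score(X_train, y_train), shallow.score(X_test, y_test)
)

print(f"Unrestricted tree: train {deep_train:.2f}, test {deep_test:.2f}")
print(f"Depth-3 tree:      train {shallow_train:.2f}, test {shallow_test:.2f}")
```

A perfect training score next to a much lower test score is the classic warning sign; the depth-limited tree gives up training accuracy and typically shows a smaller gap.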

Signs of Overfitting

How to Fix It

Technique | What It Does
More data | More examples = harder to memorise noise.
Simplify the model | Reduce tree depth, fewer features.
Regularisation (L1/L2) | Penalty for large weights. L1 drives some to zero.
Dropout | Randomly turn off neurons during training.
Early stopping | Stop when validation loss starts increasing.
Cross-validation | More reliable performance estimate.
Data augmentation | Create synthetic examples by modifying existing ones.
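
The regularisation row can be illustrated with linear models. In this sketch the data is synthetic (an assumption): only the first two of ten features actually influence the target. Lasso applies an L1 penalty and Ridge an L2 penalty; the `alpha` values are picked for the demonstration, not tuned.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only features 0 and 1 matter; the other eight are pure noise.
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

plain = LinearRegression().fit(X, y)   # no penalty
l2 = Ridge(alpha=10.0).fit(X, y)       # L2: shrinks all weights towards zero
l1 = Lasso(alpha=0.2).fit(X, y)        # L1: can set weights to exactly zero

n_zero_plain = int(np.sum(plain.coef_ == 0))
n_zero_l1 = int(np.sum(l1.coef_ == 0))

print("Plain weights:", plain.coef_.round(2))
print("Ridge weights:", l2.coef_.round(2))
print("Lasso weights:", l1.coef_.round(2))
print(f"Exact zeros: plain={n_zero_plain}, lasso={n_zero_l1}")
```

Ridge keeps every feature but with smaller weights, while Lasso should zero out most of the noise features, which is why L1 doubles as a rough feature selector.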

Python ML Stack

Library | Purpose | Import
NumPy | Numerical arrays, math | import numpy as np
pandas | Data loading, cleaning | import pandas as pd
scikit-learn | Algorithms, preprocessing | from sklearn import ...
matplotlib | Basic plotting | import matplotlib.pyplot as plt
seaborn | Statistical visualisations | import seaborn as sns
XGBoost | Gradient boosting | import xgboost as xgb
LightGBM | Faster gradient boosting | import lightgbm as lgb

Your First ML Project (Iris)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")

Common Pitfalls

Pitfall | Effect | How to Avoid
Data leakage | Test info leaks into training. Unrealistic performance. | Split before preprocessing. Never use full-dataset statistics.
Imbalanced data | 99% negative, 1% positive. 99% accuracy is useless. | Use F1, class weighting, oversampling (SMOTE).
Correlation ≠ causation | Ice cream and shark attacks are correlated (summer). | Ask: is there a hidden third variable?
Normalising before splitting | Test data influences training preprocessing. | Fit the scaler on the training set only, then transform both sets.
No baseline | Fancy model gets 80%. Predict majority class gets 82%. | Always run a simple baseline first.
Tuning on test set | Model is overfit to test data. | Use a validation set. Touch test set once at the end.
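
Several of these pitfalls can be guarded against in a few lines. The sketch below assumes synthetic imbalanced data (about 95% negative). It splits first, keeps the scaler inside a pipeline so it is fitted on training rows only, and compares the model against a majority-class baseline using both accuracy and F1.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data: roughly 95% negative, 5% positive.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95], random_state=0
)

# Split first, so nothing about the test rows reaches preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline: always predict the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# The pipeline fits the scaler on the training data only, then reuses
# those statistics when transforming the test data at predict time.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)

results = {}
for name, clf in [("Majority baseline", baseline), ("Logistic Regression", model)]:
    pred = clf.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred, zero_division=0),
    }
    print(f"{name:20s} accuracy={results[name]['accuracy']:.2f} "
          f"f1={results[name]['f1']:.2f}")
```

The baseline scores high on accuracy while its F1 is zero, because it never predicts the positive class. That is the reason accuracy alone is not trusted on imbalanced data.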

Philosophy

History & Milestones

When | What Happened
1957 | Perceptron invented: the first trainable neural network.
1986 | Backpropagation popularised, making multi-layer learning practical.
1997 | Deep Blue beats Kasparov at chess.
2000s | SVM and Random Forest become dominant.
2012 | AlexNet wins ImageNet, starting the deep learning revolution.
2014 | GANs invented: AI generates realistic images.
2017 | Transformer paper published; the architecture behind modern LLMs.
2022 | ChatGPT launches and ML enters public consciousness.
Now | ML powers search, recommendations, self-driving cars, medical imaging.

The best model is the one you can actually use.