Machine Learning

Recipes and Formulae

Probability

Basics

\[P(A_1\cap...\cap A_n) = P(A_1)P(A_2|A_1)P(A_3|A_2\cap A_1)...P(A_n|\cap_{i=1}^{n-1} A_i)\]

Bayes’ Rule

\[P(A|B) = \frac{P(A)P(B|A)}{P(B)}\]

Marginalisation

\[P(A) = \sum_{b\in\mathcal{B}}P(A,B=b) = \sum_{b\in\mathcal{B}}P(A|B=b)P(B=b)\]

\(P(A|C) = \sum_{b\in\mathcal{B}}P(A|C,B=b)P(B=b|C)\)

Expectation Value

\[E[f(x)] = \sum_{x\in X} f(x)P(x)\]

Optimisation

Exact optimisation
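
When the objective is convex and differentiable, set the gradient to zero and solve for \(\theta\) analytically; a standard example (for reference, not from the notes) is least-squares linear regression, which has the closed-form normal-equations solution.

\[\nabla f(\theta^*) = 0\] \[\theta^* = (X^TX)^{-1}X^Ty\]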

Gradient Descent

\[\theta \leftarrow \theta - \eta\nabla f(\theta)\]
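
A minimal sketch of the update loop in Python, assuming a caller-supplied gradient function grad_f (names are illustrative):

import numpy as np

def gradient_descent(grad_f, theta0, eta=0.1, n_steps=100):
    # repeatedly step against the gradient: theta <- theta - eta * grad_f(theta)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - eta * grad_f(theta)
    return theta

# example: minimise f(theta) = ||theta||^2, whose gradient is 2 * theta
print(gradient_descent(lambda t: 2 * t, np.array([3.0, -2.0])))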

Distance - Categorical Features

Simple matching coefficient

\[d = \frac{m - k}{m}\]

where \(m\) is the number of attributes and \(k\) is the number of attributes on which the two instances agree.

Hamming
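
The Hamming distance counts the positions at which the two instances differ:

\[d(x,y) = \left|\{i : x_i \neq y_i\}\right|\]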

Jaccard

\[J = \frac{|A\cap B|}{|A\cup B|} = \frac{|A\cap B|}{|A|+|B|-|A\cap B|}\] \[d = 1-J\]

Distance - Numerical features

Manhattan - L1

\[d(x,y) = \sum_i |x_i - y_i|\]

Euclidean - L2

\[d(x,y) = \sqrt{\sum_i (x_i - y_i)^2}\]

Cosine

\[\cos(a,b) = \frac{a \cdot b}{|a||b|}\] \[d(a,b) = 1-\cos(a,b)\]

Distance - Ordinal Features

Normalised Ranks
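
One common convention (an assumption here, not spelled out above) replaces an ordinal value of rank \(r\) among \(M\) ordered values with

\[x = \frac{r-1}{M-1}\]

so that ordinal features can then use the numerical distances above.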

Evaluation

Accuracy

\[Acc = \frac{TP+TN}{TP+TN+FP+FN}\]

Error

\[E = 1-Acc\]

Precision

\[P = \frac{TP}{TP+FP}\]

Recall

\[R = \frac{TP}{TP+FN}\]

F-Score

\[F_\beta = \frac{(1+\beta^2) PR}{\beta^2 P +R}\] \[F_1 = \frac{2PR}{P +R}\]

Macro-average

\[P_M = \frac{1}{c}\sum_i P_i\]

Micro-average

\[P_\mu = \frac{\sum_i TP_i}{\sum_i (TP_i+FP_i)}\]

Weighted Average

\[P_w = \sum_i \frac{P_i\, n_i}{N}\]

Feature Selection

Pointwise Mutual Information (PMI)

\[PMI(A,C) = \log_2\frac{P(A,C)}{P(A)P(C)}\]

Mutual Information

\[MI(A,C) = \sum_{i\in\{a,\bar a\}}\sum_{j\in\{c,\bar c\}}P(i,j)PMI(i,j) = \sum_{i\in\{a,\bar a\}}\sum_{j\in\{c,\bar c\}}P(i,j)\log_2\frac{P(i,j)}{P(i)P(j)}\]

Uniform Weighting
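
Each of the \(k\) nearest neighbours votes with equal weight:

\[w_i = 1\]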

Inverse Distance Weighting

\[w_i = \frac{1}{d_i+\epsilon}\]

Inverse Linear Distance Weighting

\[w_i = \frac{d_k-d_i}{d_k-d_1}\]
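
A minimal sketch of a distance-weighted k-NN vote in Python using the inverse-distance rule above (function and variable names are illustrative):

from collections import defaultdict

def weighted_knn_vote(neighbour_labels, neighbour_distances, eps=1e-9):
    # accumulate one weighted vote per neighbour: w_i = 1 / (d_i + eps)
    votes = defaultdict(float)
    for label, d in zip(neighbour_labels, neighbour_distances):
        votes[label] += 1.0 / (d + eps)
    # predict the class with the largest total weight
    return max(votes, key=votes.get)

# example: the two closer 'a' neighbours outweigh the more distant 'b'
print(weighted_knn_vote(['a', 'a', 'b'], [0.5, 0.6, 1.5]))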

Naive Bayes

\[P(x_1,...,x_m|y) \approx \prod_iP(x_i|y)\]

Epsilon Smoothing
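
Replace any zero factor \(P(x_m = j \mid y = k) = 0\) with a small constant \(\epsilon\), so that a single unseen attribute value does not zero out the whole product \(\prod_i P(x_i|y)\).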

Laplace Smoothing

\[P(x_m = j | y = k) = \frac{\alpha + \text{count}(y=k,x_m=j)}{M\alpha + \text{count}(y=k)}\]

Classifier

\[\hat y = \argmax_{y\in Y}P(y)P(x_1,...,x_m|y) = \argmax_{y \in Y} P(y)\prod_{i=1}^m P(x_i|y)\]

To construct the model:

for each attribute x_i 
  for each class value c
    for each attribute value v
      count instances with x_i=v and label c
    compute proportions
\[P(y)\prod_{i=1}^m P(x_i|y)\]
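
A rough Python sketch of the same counting procedure, with Laplace smoothing and the argmax classifier above (names are illustrative, not from the notes):

from collections import Counter, defaultdict

def train_nb(X, y, alpha=1.0):
    # priors: P(y = k) estimated from class counts
    class_counts = Counter(y)
    priors = {k: c / len(y) for k, c in class_counts.items()}
    # conditionals: count(y = k, x_m = v) for each attribute m
    value_counts = defaultdict(Counter)            # key (m, k) -> Counter over values v
    for xs, label in zip(X, y):
        for m, v in enumerate(xs):
            value_counts[(m, label)][v] += 1
    n_values = [len({xs[m] for xs in X}) for m in range(len(X[0]))]
    def cond(m, v, k):
        # Laplace smoothing: (alpha + count) / (M * alpha + count(y = k))
        return (alpha + value_counts[(m, k)][v]) / (n_values[m] * alpha + class_counts[k])
    return priors, cond

def predict_nb(priors, cond, xs):
    # argmax_y P(y) * prod_m P(x_m | y)
    def score(k):
        p = priors[k]
        for m, v in enumerate(xs):
            p *= cond(m, v, k)
        return p
    return max(priors, key=score)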

Underlying probabilistic model

\[P(x,y) = \prod_{i=1}^n P(y^i)\prod_{m=1}^M P(x_m^i|y^i)\]

Logistic Regression

\[\hat y = \begin{cases} 1: \sigma(\theta^Tx)>0.5 \\ 0 : \text{otherwise}\end{cases}\]
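
Here \(\sigma(\theta^Tx)\) is interpreted as \(P(y=1\mid x;\theta)\), so the rule predicts the more probable class; \(\theta\) is fit by minimising the binary cross-entropy loss given under Classification Loss Function below.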

Decision Tree

Entropy
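
The standard definition, used by the gain formulas below:

\[H(R) = -\sum_{c\in C} P(c)\log_2 P(c)\]

where \(P(c)\) is the proportion of instances at node \(R\) with class \(c\).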

Mean Information
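
The weighted average entropy of the child nodes produced by splitting on attribute \(A\) (standard definition):

\[\text{mean-info}(R_A) = \sum_{i=1}^m P(x_i)H(R_i)\]

where \(P(x_i)\) is the fraction of instances taking attribute value \(x_i\) and \(R_i\) the corresponding child node.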

Information gain

\[IG(R_A|R) = H(R) - \text{mean-info}(R_A)\]

Split Information

\[SI(R_A|R) = H(R_A) = -\sum_{i=1}^m P(x_i)\log_2{P(x_i)}\]

Gain Ratio

\[GR(R_A|R) = \frac{IG(R_A|R)}{SI(R_A|R)}\]

ID3

Recursive function:

function id3(root):
  if all instances of same class, or other stopping criterion met:
      stop
  else:
      select a new attribute to use to partition node instances (IG or GR)
      create a branch for each distinct attribute value
      partition root instances by value
      for each leaf node leaf_i:
          id3(leaf_i)
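
A small Python sketch of the attribute-selection quantities, computing entropy and information gain from lists of class labels (illustrative names):

import math
from collections import Counter

def entropy(labels):
    # H(R) = -sum_c P(c) log2 P(c)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, split):
    # split: list of label lists, one per attribute value (the child nodes R_i)
    n = len(labels)
    mean_info = sum(len(part) / n * entropy(part) for part in split)
    return entropy(labels) - mean_info

# example: a perfectly separating split gives IG = H(R) = 1 bit
print(information_gain(['y', 'y', 'n', 'n'], [['y', 'y'], ['n', 'n']]))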

Perceptron

Forward pass

\(\hat y = f(\theta^T x)\)

\[f(z) = \begin{cases} 1 : z > 0 \\ -1 : \text{otherwise} \end{cases}\]

Perceptron update rule

\[\theta \leftarrow \theta + \eta(y^i-\hat y^i)x^i\]
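
A minimal Python sketch of the training loop, assuming labels in \(\{-1,+1\}\) and a bias column already appended to the features (illustrative, not the notes' code):

import numpy as np

def train_perceptron(X, y, eta=1.0, n_epochs=10):
    # X: (n, d) array with a bias column appended; y: labels in {-1, +1}
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if theta @ x_i > 0 else -1
            # update only on mistakes: theta <- theta + eta * (y - y_hat) * x
            theta += eta * (y_i - y_hat) * x_i
    return theta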

Neural Network update

Sigmoid

\(\sigma(z) = \frac{1}{1+e^{-z}}\) \(\sigma'(z) = \sigma(z)(1-\sigma(z))\)

Backpropagation Algorithm

\[\delta_i^l = \sigma'(z_i^l)(y-\hat y) : \text{final layer}\]

- here \(\hat y\) is the output activation, i.e. the delta combines the local gradient \(\sigma'(z_i^l)\) with the deviation of the activation from the target

\[\delta_i^l = \sum_j \sigma'(z_i^l)\theta_{ij}^{l+1}\delta_j^{l+1}: \text{hidden layer}\]

- take the weighted sum of the \(\delta\)s from the nodes of the next layer, weighted by the \(\theta\)s connecting those nodes

Classification Loss Function

\[\mathcal{L}^i = -\log P(y^i|x^i;\theta)\] \[\mathcal{L} = -\sum_i \log P(y^i|x^i;\theta)\]

Binary classification

\[\hat y_1^i = P(y^i=1|x^i; \theta)\] \[\mathcal{L} = -\sum_i [y^i \ln(\hat y_1^i)+(1-y^i)\ln(1-\hat y_1^i)]\]

Multiclass classification

\[\mathcal{L} = -\sum_i\sum_{j\in Y} y_j^i \ln(\hat y_j^i)\]

Regression Loss Function

Mean Squared Error (MSE)

\[\mathcal{L} = \frac{1}{N}\sum_i^N (y^i - \hat y^i)^2\]

Output Function

Binary Classification

\[\hat y = \begin{cases} 1: \sigma(\theta^Tx)>0.5 \\ 0 : \text{otherwise}\end{cases}\]

Multiclass classification

\[p(y^i = j|x^i;\theta) = \frac{\exp{z_j}}{\sum_{k=1}^K \exp z_k}\]

Regression
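
For regression the output unit is typically linear (identity activation):

\[\hat y = \theta^Tx\]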

Unsupervised Learning

k-means

initialise k seeds as cluster centroids
repeat 
  compute distance of all points to cluster centroids
  assign points to the closest cluster
  recompute cluster centroids
until clusters don't change
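
A compact NumPy sketch of the same loop; seeds are drawn at random from the data, which is one common choice (illustrative names):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # assign each point to the closest centroid (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assign, centroids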

Agglomerative clustering

initialise all instances as individual clusters
while (# clusters) > 1
  compute the (triangular) proximity matrix between all pairs of cluster centroids
  locate the smallest proximity and merge those two clusters
  compute the updated cluster centroid (typically the group average)

Bisecting k-means

initialise cluster list with a single cluster containing all points
repeat
  Pick a cluster to remove from list (e.g. largest cluster)
  // Perform trial bisections of the chosen cluster
  for i = 1..num trials
    Bisect selected cluster via k-means
  endfor
  Select the 2 clusters from bisection with minimum total SSE
  Add these 2 clusters to the cluster list
until cluster list contains k clusters

Unsupervised Evaluation

Cluster Cohesion

\[\text{cohesion}(C_i) = \frac{1}{\sum_{x,y\in C_i} d(x,y)}\]

Cluster separation

\[\text{separation}(C_i, C_j) = \sum_{x\in C_i, y\in C_j} d(x,y)\]

Scatter/Sum of Squared Error (SSE)

\[SSE = \sum_{i=1}^k\sum_{x\in C_i} d(m_i, x)^2\]

Supervised Evaluation

Purity

\[\text{purity} = \sum_{i=1}^{k} \frac{|C_i|}{N} \max_j P_i(j)\]

Entropy

\[\text{entropy} = \sum_{i=1}^k \frac{|C_i|}{N}H(x_i)\]

Semi-Supervised Learning

Self-training/Bootstrapping

Algorithm
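
A minimal sketch of the usual self-training loop, assuming a classifier object with fit(X, y) and a hypothetical predict_with_confidence(x) method (illustrative names):

def self_train(model, labelled, unlabelled, threshold=0.95, max_rounds=10):
    # model is assumed (hypothetically) to expose fit(X, y) and
    # predict_with_confidence(x) -> (label, confidence in [0, 1])
    for _ in range(max_rounds):
        X, y = zip(*labelled)
        model.fit(X, y)
        still_unlabelled = []
        for x in unlabelled:
            y_hat, confidence = model.predict_with_confidence(x)
            if confidence >= threshold:
                labelled.append((x, y_hat))      # add confident pseudo-labels
            else:
                still_unlabelled.append(x)
        if len(still_unlabelled) == len(unlabelled):
            break                                # no new pseudo-labels were added
        unlabelled = still_unlabelled
    return model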

Query Strategies for Active Learning

Least confident

\[x = \argmax_x(1-P(\hat y|x))\]

Margin sampling

\[x = \argmin_x(P(\hat y_1|x;\theta)-P(\hat y_2|x;\theta))\]

Entropy

\[x = \argmax_{x} -\sum_i P(\hat y_i|x;\theta)\log_2 P(\hat y_i|x;\theta)\]

Query by committee (QBC)
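
Train a committee of \(C\) models and query the instance on which they disagree most, e.g. by vote entropy (one standard disagreement measure):

\[x = \argmax_x -\sum_{y} \frac{V(y)}{C}\log_2\frac{V(y)}{C}\]

where \(V(y)\) is the number of committee members that predict label \(y\).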

