Machine Learning Algorithms

Machine Learning Algorithms

Textbook Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville

Scott Mueller (lvl 17)
Unsectioned

Preview this deck

Logistic regression gradient descent for \(y_i \in \{-1, 1\}\)

Front

Star 0%
Star 0%
Star 0%
Star 0%
Star 0%

0.0

0 reviews

5
0
4
0
3
0
2
0
1
0

Active users

19

All-time users

20

Favorites

0

Last updated

7 months ago

Date created

Jan 11, 2021

Cards (69)

Unsectioned

(49 cards)

Logistic regression gradient descent for \(y_i \in \{-1, 1\}\)

Front
  1. Initialize weights \(\mathbf{w}_0\)
  2. For \(t = 1, 2, \ldots\)
    1. Compute \(\nabla f(\mathbf{w}) = -\frac1{N}\sum_{n=1}^N \frac{y_n\mathbf{x}_n}{1 + e^{y_n\mathbf{w}^\top\mathbf{x}_n}}\)
    2. Update weights: \(\mathbf{w} \leftarrow \mathbf{w} - \eta\nabla f(\mathbf{w})\)
    3. Stop when \(\lVert \nabla f(\mathbf{w}) \rVert < \epsilon\)
Back

Entropy \(H(S)\) for decision trees

Front

$$H(S) = -\sum_{c=1}^C p(c)\log p(c)$$

where \(S\) is a set of data points in a node, \(c = 1, \ldots, C\) are labels, and \(p(c)\) is the proportion of data belonging to class \(c\).

 

Entropy is \(0\) if all samples are in the same class, large (worst case) when \(p(1) = \ldots = p(C)\).

Back

Gradient Boosted Decision Tree (GBDT) loss function

Front

Use 2nd order Taylor series expansion:

$$\begin{aligned}\ell_i(\hat{y}_i + f_m(\mathbf{x}_i)) &\approx \ell_i(\hat{y}_i)\ +\\&\ \ \ \ \ g_if_m(\mathbf{x}_i) + \frac12h_if_m(\mathbf{x}_i)^2\\&= \frac{h_i}2 \left\lVert f_m(\mathbf{x}_i) + \frac{g_i}{h_i} \right\rVert^2 + \text{const}\end{aligned}$$

where \(g_i = \partial_{\hat{y}_i}\ell_i(\hat{y}_i)\) is the gradient and \(h_i = \partial_{\hat{y}_i}^2\ell_i(\hat{y}_i)\) is the 2nd order derivative. Apply \(\sum_{i=1}^n\) for total tree ensemble loss, where \(n\) is the number of trees in ensemble.

Back

SGD for neural networks

Front
  1. Initialize all weights \(w_{ij}^{(\ell)}\) at random
  2. For \(\text{iter} = 0, 1, 2, \ldots\)
    1. Forward: compute all \(x_j^{(\ell)}\) from input to output
    2. Backward: compute all \(\delta_j^{(\ell)}\) from output to input
    3. Update all \(w_{ij}^{(\ell)} \leftarrow w_{ij}^{(\ell)} - \eta x_i^{(\ell - 1)}\delta_j^{(\ell)}\)
Back

Logistic regression objective function for \(y_i \in \{-1, 1\}\)

Front

$$\sum_{n=1}^N \log(1 + e^{-y_n\mathbf{w}^\top\mathbf{x}_n})$$

Back

Back-tracking line search

Front
  1. Start from large \(\alpha_0\)
  2. Try \(\alpha = \alpha_0, \frac{\alpha_0}2, \frac{\alpha_0}4, \ldots\)
  3. Stop when \(\alpha\) satisfies some sufficient decrease condition
Back

Kernel ridge regression

Front

$$\min_\mathbf{w} \frac12\Vert\mathbf{w}\Vert^2 + \frac12\sum_{i=1}^n (\mathbf{w}^\top \phi(\mathbf{x}_i) - y_i)^2$$

L2 regularization + square error. Dual problem:

$$\min_\alpha \alpha^\top Q\alpha + \Vert\alpha\Vert^2 - 2\alpha^\top\mathbf{y}$$

Back

Neuron computation with dropout

Front

$$x_i^{(\ell)} = B\sigma\left(\sum_j W_{ij}^{(\ell)}x_j^{(\ell - 1)} + b_i^{(\ell)}\right)$$

where \(B\) is a Bernoulli variable equal to \(1\) with probability \(\alpha\)

Back

$$M^\text{PPMI}$$

Front

A \(n \times d\) word feature matrix where each row is a word and each column is a context

$$M^\text{PPMI} \approx U_k\Sigma_kV_k^\top$$

\(W^\text{SVD} = U_k\Sigma_k\) is the context representation of each word (each row is a \(k\)-dimensional feature for a word)

Back

Lipschitz constant \(\mathcal{L}\) as upper bound for gradient descent convergence

Front

Gradient descent converges if \(\alpha < \frac1{\mathcal{L}}\),

$$\forall \mathbf{x}: \nabla^2f(\mathbf{x}) \preceq \mathcal{L}I$$

 

Back

Vapnik-Chervonenkis (VC) bound

Front

$$P(\exists h \in \mathcal{H} \text{ s.t. } |E_\text{tr}(h) - E(h)| > \epsilon)\\\leqslant 2 \cdot 2m_\mathcal{H}(2N) \cdot e^{-\frac18\epsilon^2N}$$

Back

Backpropagation for final layer, \(\delta_1^{(L)}\), with square loss

Front

$$\begin{aligned}\delta_1^{(L)} &= \frac{\partial e(W)}{\partial s_1^{(L)}}\\&= \frac{\partial e(W)}{\partial x_1^{(L)}} \times \frac{\partial x_1^{(L)}}{\partial s_1^{(L)}}\\&= 2(x_1^{(L)} - y_n) \times \theta'(s_1^{(L)})\end{aligned}$$

Back

Hidden state of RNN

Front

$$\mathbf{s}_t = f(U\mathbf{x}_t + W\mathbf{s}_{t-1})$$

where \(W\) is the transition matrix and \(U\) is the word embedding matrix

Back

SVM dual prediction for \(y_i \in \{-1, 1\}\) classification

Front

$$\begin{aligned}\mathbf{w}^\top \phi(\mathbf{x}) &= \sum_{i=1}^n y_i\alpha_i\phi(\mathbf{x}_i)^\top\phi(\mathbf{x})\\&= \sum_{i=1}^n y_i\alpha_i K(\mathbf{x}_i, \mathbf{x})\end{aligned}$$

Back

Growth function for hypotheses \(\mathcal{H}\)

Front

Counts the most dichotomies on any \(N\) points:

$$m_\mathcal{H}(N) = \max_{\mathbf{x}_1, \ldots, \mathbf{x}_N \in \mathcal{X}} |\mathcal{H}(\mathbf{x}_1, \ldots, \mathbf{x}_N)|$$

satisfying \(m_\mathcal{H}(N) \leqslant 2^N\)

Back

Dichotomy of hypothesis \(h: \mathcal{X} \rightarrow \{-1, +1\}\)

Front

Result for a particular dataset:

$$h: \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\} \rightarrow \{-1, +1\}^N$$

Back

Momentum gradient descent

Front
  1. Initialize \(\mathbf{w}_0, \mathbf{v}_0 = 0\)
  2. For \(t = 1, 2, \ldots\)
    1. \(\mathbf{v}_t \leftarrow \beta\mathbf{v}_{t-1} + (1 - \beta)\nabla f(\mathbf{w}_t)\)
    2. \(\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \alpha\mathbf{v}_t\)
Back

General value for VC dimension of linear classifiers

Front

$$d_\text{VC} = d + 1$$

where \(d\) is the number of dimensions of each sample

Back

Pointwise mutual information (PMI)

Front

$$\begin{aligned}\text{PMI}(w, c) &= \log\frac{\hat{P}(w, c)}{\hat{P}(w)\hat{P}(c)}\\&=\log\frac{\#(w, c)|D|}{\#(w)\#(c)}\end{aligned}$$

where \(\#(w) = \sum_c \#(w,c)\) is the number of times word \(w\) occurred in \(D\), \(\#(c) = \sum_w \#(w,c)\) is the number of times context \(c\) occurred in \(D\), and \(|D|\) is the number of pairs in \(D\)

Back

Adam

Front

Momentum + adaptive updates

  1. Initialize \(\mathbf{w}_0, \mathbf{m}_0 = 0, \mathbf{v}_0 = 0\)
  2. For \(t = 1, 2, \ldots\)
    1. Sample \(i \in \{1, \ldots, N\}\) (batch size 1)
    2. \(\mathbf{g}_t \leftarrow \nabla f_i(\mathbf{w}_t)\)
    3. \(\mathbf{m}_t \leftarrow \beta_1\mathbf{m}_{t-1} + (1 - \beta_1)\mathbf{g}_t\)
    4. \(\mathbf{v}_t \leftarrow \beta_2\mathbf{v}_{t-1} + (1 - \beta_2)\mathbf{g}_t^2\)
    5. \(\mathbf{\hat{m}}_t \leftarrow \frac{\mathbf{m}_t}{1 - \beta_1^t}\)
    6. \(\mathbf{\hat{v}}_t \leftarrow \frac{\mathbf{v}_t}{1 - \beta_2^t}\)
    7. \(\mathbf{w}_t \leftarrow \mathbf{w}_{t-1} - \alpha\frac{\mathbf{\hat{m}}_t}{\sqrt{\mathbf{\hat{v}}_t} + \epsilon}\)
Back

GAN minmax training objective

Front

$$\max_G\min_D L_D(G, D)$$

Back

Information gain from a decision tree node split

Front

$$H(S) - \text{average entropy of split}$$

Back

Backpropagation gradient \(\delta_j^{(\ell)}\)

Front

$$\delta_j^{(\ell)} = \frac{\partial e(W)}{\partial s_j^{(\ell)}}$$

where \(\ell\) is layer and \(j\) is neuron

Back

Predicted output at time \(t\) of an RNN

Front

$$o_t = \arg\max_i(V\mathbf{s}_t)_i$$

where \(V\) is the layer for prediction

Back

SVM stochastic subgradient descent algorithm for \(y_i \in \{-1, 1\}\) classification

Front

For \(t = 1, 2, \ldots\)

  1. Sample \(i\)
  2. If \(y_i\mathbf{w}^\top\mathbf{x}_i < 1\)
    \(\mathbf{w} \leftarrow (1 - \eta_t)\mathbf{w} + \eta_tnCy_i\mathbf{x}_i\)
  3. Else (\(y_i\mathbf{w}^\top\mathbf{x}_i \geqslant 1\))
    \(\mathbf{w} \leftarrow (1 - \eta_t)\mathbf{w}\)
Back

Logistic regression hinge loss for \(y_i \in \{-1, 1\}\) classification

Front

$$\max(0, 1 - y_n\mathbf{w}^\top\mathbf{x}_n)$$

Back

Unconstrained version of

$$\min_\mathbf{w} E_\text{tr}(\mathbf{w}) = \frac1{N} (Z\mathbf{w} - \mathbf{y})^\top (Z\mathbf{w} - \mathbf{y})\\\text{s.t. } \mathbf{w}^\top\mathbf{w} \leqslant C$$

Front

$$\min_\mathbf{w} E_\text{tr}(\mathbf{w}) + \lambda\mathbf{w}^\top\mathbf{w}$$

where \(\mathbf{w}_\text{reg}\) is the solution to both optimization problems:

$$\nabla E_\text{tr}(\mathbf{w}_\text{reg}) + 2\lambda\mathbf{w}_\text{reg} = 0$$

Back

Softmax loss function

Front

$$-\log\frac{e^{\mathbf{w}_{y_i}^\top\mathbf{x}_i}}{\sum_j e^{\mathbf{w}_j^\top\mathbf{x}_i}}$$

Back

Sufficient decrease condition for back-tracking line search

Front
  • Simple condition: \(f(\mathbf{w} + \alpha\mathbf{d}) < f(\mathbf{w})\)
    • Often works in practice, not in theory
  • Provable sufficient decrease condition:
    $$f(\mathbf{w} + \alpha\mathbf{d}) < f(\mathbf{w}) + \sigma\alpha\nabla f(\mathbf{w})^\top\mathbf{d}$$
    for constant \(\sigma \in (0, 1)\)
Back

Objective function for decision tree partition \(S = S_1 \cup S_2\)

Front

$$\sum_{i \in S_1} (y_i - y^{(1)})^2 + \sum_{i \in S_2} (y_i - y^{(2)})^2$$

where \(y^{(1)} = \frac1{|S_1|} \sum_{i \in S_1} y_i\) and \(y^{(2)} = \frac1{|S_2|} \sum_{i \in S_2} y_i\)

Back

Likelihood of \(\mathcal{D} = (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\) for logistic regression with \(y_i \in \{-1, 1\}\) classification

Front

$$\prod_{n=1}^N P(y_n | \mathbf{x}_n) = \prod_{n=1}^N \theta(y_n\mathbf{w}^\top\mathbf{x}_n)$$

where \(\theta\) is the sigmoid function

Back

GAN generator loss function

Front

$$L_G = E_z[-\log D(G(z))]$$

Back

Hoeffding's inequality

Front

$$P(|v - \mu| > \epsilon) \leqslant 2e^{-2\epsilon^2N}$$

where \(N\) is sample size and \(v\) is sample mean

Back

SVM dual problem for \(y_i \in \{-1, 1\}\) classification

Front

$$\max_{0 \leqslant \alpha \leqslant C} \left\{-\frac12\mathbf{\alpha}^\top Q\mathbf{\alpha} + \mathbf{e}^\top\mathbf{\alpha}\right\} = D(\mathbf{\alpha})$$

where \(Q \in \mathbb{R}^{n \times n}\) with \(Q_{ij} = y_iy_j\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)\), \(\phi(\mathbf{x})\) is a higher dimensional mapping function, and \(\mathbf{e}\) is the ones vector.

\(\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)\) can be replaced with a kernel \(K(\mathbf{x}_i, \mathbf{x}_j)\).

Back

Logistic loss for each training example for \(y_i \in \{-1, 1\}\) classification

Front

$$\log(1 + e^{-y_n\mathbf{w}^\top\mathbf{x}_n})$$

Back

Empirical risk minimization

Front

$$\min_W \frac1{N} \sum_{n=1}^N \text{loss}(f_W(\mathbf{x}_n), y_n)$$

where \(f_W(\mathbf{x})\) is the decision function to be learned and \(W\) is the parameters of the function

Back

Backpropagation \(\delta_i^{(\ell - 1)}\)

Front

$$\begin{aligned}\delta_i^{(\ell - 1)} &= \frac{\partial e(W)}{\partial s_i^{(\ell - 1)}}\\&= \sum_{j=1}^d \frac{\partial e(W)}{\partial s_j^{(\ell)}} \times \frac{\partial s_j^{(\ell)}}{\partial x_i^{(\ell - 1)}} \times \frac{\partial x_i^{(\ell - 1)}}{\partial s_i^{(\ell - 1)}}\\&= \sum_{j=1}^d \delta_j^{(\ell)} \times w_{ij}^{(\ell)} \times \theta'(s_i^{(\ell - 1)})\end{aligned}$$

Back

Newton's method

Front

$$\begin{aligned}f(\mathbf{x}) &\approx f(\mathbf{a}) + (\mathbf{x} - \mathbf{a})^\top \nabla_\mathbf{x} f(\mathbf{a})\\&+ \frac12(\mathbf{x} - \mathbf{a})^\top \mathbf{H}(f)(\mathbf{a})(\mathbf{x} - \mathbf{a})\end{aligned}$$

which is simply a second-order Tayler series expansion. Applied to gradient descent:

$$\begin{aligned}f(\mathbf{w}^t + \mathbf{d}) &\approx f(\mathbf{w}^t) + \mathbf{d}^\top \nabla_\mathbf{w} f(\mathbf{w}^t)\\&+ \frac12\mathbf{d}^\top \mathbf{H}(f)(\mathbf{w}^t)(\mathbf{d})\end{aligned}$$

Back

Finite VC dimension implies what?

Front

\(m_\mathcal{H}(N)\) is polynomial in \(N\) and

$$m_\mathcal{H}(N) \leqslant \sum_{i=0}^{d_\text{VC}} \binom{N}{i}$$

and

$$P(|E_\text{tr} - E| > \epsilon) \leqslant O(N^{d_\text{VC}} \cdot e^{-N})$$

Back

Upper bound on \(P(|E_{\text{tr}}(g) - E(g)| > \epsilon)\)

Front

$$P(|E_{\text{tr}}(g) - E(g)| > \epsilon) \leqslant 2m_\mathcal{H}(N)e^{-2\epsilon^2N}$$

where \(m_\mathcal{H}(N)\) measures model complexity (# hypotheses in hypothesis space)

Back

SVM primal problem for \(y_i \in \{-1, 1\}\) classification

Front

$$\min_{\mathbf{w}, \xi} \frac12\mathbf{w}^\top\mathbf{w} + C\sum_{i=1}^n\xi_i$$

s.t. \(y_i(\mathbf{w}^\top\mathbf{x}_i) \geqslant 1 - \xi_i, i = 1, \ldots, n\)

and \(\xi_i \geqslant 0\).

Equivalent to

$$\min_\mathbf{w} \frac12\mathbf{w}^\top\mathbf{w} + C\sum_{i=1}^n \max(0, 1 - y_i\mathbf{w}^\top\mathbf{x}_i)$$

(L2 regularization and hinge loss)

Back

VC dimension of a hypothesis set \(\mathcal{H}\), denoted by \(d_{VC}(\mathcal{H})\)

Front

Largest value of \(N\) for which \(m_\mathcal{H}(N) = 2^N\) (the most points \(\mathcal{H}\) can shatter)

Back

Expected output of neuron with dropout

Front

$$E\left[x_i^{(\ell)}\right] = \alpha\sigma\left(\sum_j W_{ij}^{(\ell)}x_j^{(\ell - 1)} + b_i^{(\ell)}\right)$$

Necessary to multiply all weights by \(\alpha^{-1}\) because of the \(\alpha\) probability of not dropping out

Back

GAN discriminator loss function

Front

$$\begin{aligned}L_D &= E_{x\sim\text{real data}}[-\log D(x)]\\&+ E_z[-\log(1 - D(G(z)))]\end{aligned}$$

Back

Gradient descent basic process

Front

$$\mathbf{w}^{t+1} \leftarrow \mathbf{w}^t + \mathbf{d}^*$$

where \(\mathbf{d}^* = -\alpha\nabla f(\mathbf{w}^t)\) and \(\alpha > 0\) is the step size

Back

Gradient descent with back-tracking line search

Front
  1. Initialize weights \(\mathbf{w}_0\)
  2. For \(t = 1, 2, \ldots\)
    1. Compute \(\mathbf{d} = -\nabla f(\mathbf{w})\)
    2. For \(\alpha = \alpha_0, \frac{\alpha_0}2, \frac{\alpha_0}4, \ldots\)
      1. Break if \(f(\mathbf{w} + \alpha\mathbf{d}) \leqslant f(\mathbf{w}) + \sigma\alpha\nabla f(\mathbf{w})^\top \mathbf{d}\)
    3. Update weights: \(\mathbf{w} \leftarrow \mathbf{w} + \alpha\mathbf{d}\)
    4. Stop when \(\lVert \nabla f(\mathbf{w}) \rVert < \epsilon\)
Back

Multi-class loss function

Front

Minimize regularized training error:

$$\min_{\mathbf{w}_1,\ldots,\mathbf{w}_L} \sum_{i=1}^n \text{loss}(f(\mathbf{x}_i), \mathbf{y}_i) + \lambda\sum_{j=1,\ldots,L} \mathbf{w}_j^\top \mathbf{w}_j$$

Back

Average entropy of decision tree node split \(S \rightarrow S_1, S_2\)

Front

$$\frac{|S_1|}{|S|}H(S_1) + \frac{|S_2|}{|S|}H(S_2)$$

Back

Adagrad

Front
  1. Initialize \(\mathbf{w}_0\)
  2. For \(t = 1, 2, \ldots\)
    1. Sample \(i \in \{1, \ldots, N\}\) (batch size 1)
    2. \(\mathbf{g}^t \leftarrow \nabla f_i(\mathbf{w}_t)\)
    3. \(G_i^t \leftarrow G_i^{t-1} + (\mathbf{g}_i^t)^2\) (\(i\) is dimension index)
    4. \(\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \frac{\eta}{\sqrt{G_i^t + \epsilon}}\mathbf{g}_i^t\)
Back

Chapter 3

(15 cards)

$$\log{\sigma(x)}$$

Front

$$-\zeta(-x)$$

Back

$$\forall x \in (0, 1), \sigma^{-1}(x)$$

Front

$$\log\left(\frac{x}{1-x}\right)$$

Back

Positive part function

Front

$$x^+ = \max(0,x)$$

Back

Softplus function

Front

$$\zeta(x) = \log(1 + e^x)$$

Back

Self-information of event \(X = x\)

Front

$$I(x) = -\log{P(x)}$$

Back

Negative part function

Front

$$x^- = \max(0,-x)$$

Back

Shannon entropy

Front

$$\begin{aligned}H(x) &= \mathbb{E}_{x\sim P}[I(x)]\\&= -\mathbb{E}_{x\sim P}[\log{P(x)}]\end{aligned}$$

Called differential entropy when \(x\) is continuous

Back

$$\forall x > 0, \zeta^{-1}(x)$$

Front

$$\log\left(e^x - 1\right)$$

Back

Kullback-Leibler (KL) divergence

Front

$$\begin{aligned}D_{\text{KL}}(P \Vert Q) &= \mathbb{E}_{x \sim P} \left[\log{\frac{P(x)}{Q(x)}}\right]\\&= \mathbb{E}_{x \sim P} [\log P(x) - \log Q(x)]\end{aligned}$$

measures how different \(P(x)\) and \(Q(x)\) are

Back

Integral for \(\zeta(x)\)

Front

$$\int_{-\infty}^x \sigma(y) \,dy$$

Back

$$1 - \sigma(x)$$

Front

$$\sigma(-x)$$

Back

$$\frac{d}{dx}\sigma(x)$$

Front

$$\sigma(x)(1 - \sigma(x))$$

Back

$$\zeta(x) - \zeta(-x)$$

Front

$$x$$

Back

Cross-entropy

Front

$$\begin{aligned}H(P, Q) &= H(P) + D_{\text{KL}}(P \Vert Q)\\&= -\mathbb{E}_{x \sim P} \log Q(x)\end{aligned}$$

Back

$$\frac{d}{dx}\zeta(x)$$

Front

$$\sigma(x)$$

Back

Chapter 4

(2 cards)

$$\text{softmax}(\mathbf{x})_i$$

Front

$$\frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}$$

often used to predict probabilities associated with a multinoulli distribution

Back

Condition number

Front

Ratio of the magnitude of the largest and smallest eigenvalues:

$$\max_{i,j} \left\lvert \frac{\lambda_i}{\lambda_j} \right\rvert$$

Back

Chapter 5

(3 cards)

Bias of an estimator

Front

$$\text{bias}(\hat{\mathbf{\theta}}_m) = \mathbb{E}(\hat{\mathbf{\theta}}_m) - \mathbf{\theta}$$

Back

Normal equation

Front

$$\mathbf{w}^* = (X^\top X)^{-1}X^\top \mathbf{y}$$

optimal solution for linear regression by minimizing MSE

Back

Point estimator or statistic

Front

Function of data:

$$\hat{\mathbf{\theta}}_m = g(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(m)})$$

Back