# Machine Learning Algorithms

Textbook: *Deep Learning* by Ian Goodfellow, Yoshua Bengio, and Aaron Courville

Deck by Scott Mueller


## Cards (69)

### Unsectioned (49 cards)

Logistic regression gradient descent for $$y_i \in \{-1, 1\}$$

Front
1. Initialize weights $$\mathbf{w}_0$$
2. For $$t = 1, 2, \ldots$$
    1. Compute $$\nabla f(\mathbf{w}) = -\frac1{N}\sum_{n=1}^N \frac{y_n\mathbf{x}_n}{1 + e^{y_n\mathbf{w}^\top\mathbf{x}_n}}$$
    2. Update weights: $$\mathbf{w} \leftarrow \mathbf{w} - \eta\nabla f(\mathbf{w})$$
3. Stop when $$\lVert \nabla f(\mathbf{w}) \rVert < \epsilon$$
Back
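
A minimal NumPy sketch of these steps (the function name and the default step size, tolerance, and iteration cap are illustrative, not from the card):

```python
import numpy as np

def logistic_regression_gd(X, y, eta=0.1, eps=1e-6, max_iter=1000):
    """Gradient descent on the average logistic loss; labels y in {-1, +1}."""
    N, d = X.shape
    w = np.zeros(d)                                  # 1. initialize weights
    for _ in range(max_iter):                        # 2. for t = 1, 2, ...
        margins = y * (X @ w)                        # y_n * w^T x_n
        grad = -(y / (1 + np.exp(margins))) @ X / N  # gradient from the card
        if np.linalg.norm(grad) < eps:               # 3. stop on small gradient
            break
        w -= eta * grad                              # w <- w - eta * grad
    return w
```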

Entropy $$H(S)$$ for decision trees

Front

$$H(S) = -\sum_{c=1}^C p(c)\log p(c)$$

where $$S$$ is a set of data points in a node, $$c = 1, \ldots, C$$ are labels, and $$p(c)$$ is the proportion of data belonging to class $$c$$.

Entropy is $$0$$ when all samples belong to the same class, and maximal ($$\log C$$, the worst case) when $$p(1) = \ldots = p(C) = \frac1{C}$$.

Back
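
A short NumPy sketch of this formula (log base 2 is an assumed choice; the card leaves the base unspecified):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H(S) of the class labels in one node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()          # class proportions p(c); all nonzero
    return -np.sum(p * np.log2(p))

entropy([1, 1, 1, 1])  # 0.0: all samples in the same class
entropy([0, 0, 1, 1])  # 1.0: uniform p(c), the worst case for C = 2
```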

Gradient Boosted Decision Tree (GBDT) loss function

Front

Use 2nd order Taylor series expansion:

\begin{aligned}\ell_i(\hat{y}_i + f_m(\mathbf{x}_i)) &\approx \ell_i(\hat{y}_i) + g_if_m(\mathbf{x}_i) + \frac12h_if_m(\mathbf{x}_i)^2\\&= \frac{h_i}2 \left( f_m(\mathbf{x}_i) + \frac{g_i}{h_i} \right)^2 + \text{const}\end{aligned}

where $$g_i = \partial_{\hat{y}_i}\ell_i(\hat{y}_i)$$ is the gradient and $$h_i = \partial_{\hat{y}_i}^2\ell_i(\hat{y}_i)$$ is the 2nd-order derivative. Apply $$\sum_{i=1}^n$$ for the total ensemble loss, where $$n$$ is the number of training examples.

Back

SGD for neural networks

Front
1. Initialize all weights $$w_{ij}^{(\ell)}$$ at random
2. For $$\text{iter} = 0, 1, 2, \ldots$$
    1. Forward: compute all $$x_j^{(\ell)}$$ from input to output
    2. Backward: compute all $$\delta_j^{(\ell)}$$ from output to input
    3. Update all $$w_{ij}^{(\ell)} \leftarrow w_{ij}^{(\ell)} - \eta x_i^{(\ell - 1)}\delta_j^{(\ell)}$$
Back
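
A runnable sketch of this loop for one hidden layer, square loss, and $$\tanh$$ activations, using the $$\delta$$ formulas from the backpropagation cards below (the layer sizes, learning rate, per-epoch shuffling, and absence of bias terms are simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_mlp(X, y, hidden=16, eta=0.1, epochs=100):
    """One-hidden-layer MLP trained by SGD; X is (N, d), y is (N,)."""
    d = X.shape[1]
    W1 = rng.normal(0, 0.1, (d, hidden))        # initialize weights at random
    W2 = rng.normal(0, 0.1, (hidden, 1))
    for _ in range(epochs):
        for n in rng.permutation(len(X)):       # one example per update
            x0 = X[n:n+1]
            s1 = x0 @ W1; x1 = np.tanh(s1)      # forward: input -> hidden
            s2 = x1 @ W2; x2 = np.tanh(s2)      # forward: hidden -> output
            d2 = 2 * (x2 - y[n]) * (1 - x2**2)  # delta^(L) = 2(x - y) theta'(s)
            d1 = (d2 @ W2.T) * (1 - x1**2)      # sum_j delta_j w_ij theta'(s_i)
            W2 -= eta * x1.T @ d2               # w_ij <- w_ij - eta x_i delta_j
            W1 -= eta * x0.T @ d1
    return W1, W2
```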

Logistic regression objective function for $$y_i \in \{-1, 1\}$$

Front

$$\sum_{n=1}^N \log(1 + e^{-y_n\mathbf{w}^\top\mathbf{x}_n})$$

Back

Back-tracking line search

Front
1. Start from large $$\alpha_0$$
2. Try $$\alpha = \alpha_0, \frac{\alpha_0}2, \frac{\alpha_0}4, \ldots$$
3. Stop when $$\alpha$$ satisfies some sufficient decrease condition
Back

Kernel ridge regression

Front

$$\min_\mathbf{w} \frac12\Vert\mathbf{w}\Vert^2 + \frac12\sum_{i=1}^n (\mathbf{w}^\top \phi(\mathbf{x}_i) - y_i)^2$$

L2 regularization + square error. Dual problem:

$$\min_\alpha \alpha^\top Q\alpha + \Vert\alpha\Vert^2 - 2\alpha^\top\mathbf{y}$$

Back
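
Setting the gradient of the dual to zero gives $$(Q + I)\boldsymbol{\alpha} = \mathbf{y}$$, which a sketch can solve directly (the RBF kernel and its parameter are assumed choices; the regularization weight is fixed at 1 as on the card):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel matrix; one assumed choice of K."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_ridge_fit(X, y, gamma=1.0):
    K = rbf_kernel(X, X, gamma)                       # Q_ij = K(x_i, x_j)
    return np.linalg.solve(K + np.eye(len(X)), y)     # (Q + I) alpha = y

def kernel_ridge_predict(X_train, alpha, X_new, gamma=1.0):
    return rbf_kernel(X_new, X_train, gamma) @ alpha  # sum_i alpha_i K(x_i, x)
```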

Neuron computation with dropout

Front

$$x_i^{(\ell)} = B\sigma\left(\sum_j W_{ij}^{(\ell)}x_j^{(\ell - 1)} + b_i^{(\ell)}\right)$$

where $$B$$ is a Bernoulli variable equal to $$1$$ with probability $$\alpha$$

Back
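
A sketch of one such layer in NumPy (the sigmoid activation and default keep probability are assumptions; at test time the expected output from the card further below is matched instead):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_neurons(x_prev, W, b, alpha=0.8, train=True):
    """Layer of neurons with dropout; alpha is the keep probability."""
    s = W @ x_prev + b
    x = 1 / (1 + np.exp(-s))                 # sigma(s): assumed sigmoid
    if train:
        mask = rng.random(x.shape) < alpha   # Bernoulli B = 1 w.p. alpha
        return x * mask
    return alpha * x                         # test time: E[x] = alpha * sigma(s)
```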

$$M^\text{PPMI}$$

Front

An $$n \times d$$ word feature matrix where each row is a word and each column is a context

$$M^\text{PPMI} \approx U_k\Sigma_kV_k^\top$$

$$W^\text{SVD} = U_k\Sigma_k$$ is the word representation (each row is a $$k$$-dimensional feature for a word)

Back
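
A sketch of the truncated SVD step with NumPy (the function name and rank parameter are illustrative):

```python
import numpy as np

def svd_word_vectors(M_ppmi, k=100):
    """Rank-k SVD of M^PPMI; rows of W are k-dimensional word features."""
    U, S, Vt = np.linalg.svd(M_ppmi, full_matrices=False)
    W = U[:, :k] * S[:k]    # W^SVD = U_k Sigma_k
    C = Vt[:k].T            # rows of V_k: context features
    return W, C
```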

Lipschitz constant $$\mathcal{L}$$ as upper bound for gradient descent convergence

Front

Gradient descent converges if $$\alpha < \frac1{\mathcal{L}}$$, where $$\mathcal{L}$$ satisfies

$$\forall \mathbf{x}: \nabla^2f(\mathbf{x}) \preceq \mathcal{L}I$$

Back

Vapnik-Chervonenkis (VC) bound

Front

$$P(\exists h \in \mathcal{H} \text{ s.t. } |E_\text{tr}(h) - E(h)| > \epsilon)\\\leqslant 2 \cdot 2m_\mathcal{H}(2N) \cdot e^{-\frac18\epsilon^2N}$$

Back

Backpropagation for final layer, $$\delta_1^{(L)}$$, with square loss

Front

\begin{aligned}\delta_1^{(L)} &= \frac{\partial e(W)}{\partial s_1^{(L)}}\\&= \frac{\partial e(W)}{\partial x_1^{(L)}} \times \frac{\partial x_1^{(L)}}{\partial s_1^{(L)}}\\&= 2(x_1^{(L)} - y_n) \times \theta'(s_1^{(L)})\end{aligned}

Back

Hidden state of RNN

Front

$$\mathbf{s}_t = f(U\mathbf{x}_t + W\mathbf{s}_{t-1})$$

where $$W$$ is the transition matrix and $$U$$ is the word embedding matrix

Back

SVM dual prediction for $$y_i \in \{-1, 1\}$$ classification

Front

\begin{aligned}\mathbf{w}^\top \phi(\mathbf{x}) &= \sum_{i=1}^n y_i\alpha_i\phi(\mathbf{x}_i)^\top\phi(\mathbf{x})\\&= \sum_{i=1}^n y_i\alpha_i K(\mathbf{x}_i, \mathbf{x})\end{aligned}

Back

Growth function for hypotheses $$\mathcal{H}$$

Front

Counts the maximum number of dichotomies achievable on any $$N$$ points:

$$m_\mathcal{H}(N) = \max_{\mathbf{x}_1, \ldots, \mathbf{x}_N \in \mathcal{X}} |\mathcal{H}(\mathbf{x}_1, \ldots, \mathbf{x}_N)|$$

satisfying $$m_\mathcal{H}(N) \leqslant 2^N$$

Back

Dichotomy of hypothesis $$h: \mathcal{X} \rightarrow \{-1, +1\}$$

Front

Restriction of $$h$$ to a particular dataset:

$$h: \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\} \rightarrow \{-1, +1\}^N$$

Back

Gradient descent with momentum

Front
1. Initialize $$\mathbf{w}_0, \mathbf{v}_0 = 0$$
2. For $$t = 1, 2, \ldots$$
    1. $$\mathbf{v}_t \leftarrow \beta\mathbf{v}_{t-1} + (1 - \beta)\nabla f(\mathbf{w}_t)$$
    2. $$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \alpha\mathbf{v}_t$$
Back
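
A minimal sketch of these two steps (the defaults and the quadratic test function are illustrative):

```python
import numpy as np

def momentum_gd(grad_f, w0, alpha=0.1, beta=0.9, iters=100):
    """Gradient descent with momentum, following the steps above."""
    w, v = w0.copy(), np.zeros_like(w0)        # initialize w_0, v_0 = 0
    for _ in range(iters):
        v = beta * v + (1 - beta) * grad_f(w)  # v_t
        w = w - alpha * v                      # w_{t+1}
    return w

# Example: f(w) = ||w||^2 has gradient 2w; the iterates approach 0.
w_star = momentum_gd(lambda w: 2 * w, np.array([3.0, -2.0]))
```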

General value for VC dimension of linear classifiers

Front

$$d_\text{VC} = d + 1$$

where $$d$$ is the number of dimensions of each sample

Back

Pointwise mutual information (PMI)

Front

\begin{aligned}\text{PMI}(w, c) &= \log\frac{\hat{P}(w, c)}{\hat{P}(w)\hat{P}(c)}\\&=\log\frac{\#(w, c)|D|}{\#(w)\#(c)}\end{aligned}

where $$\#(w) = \sum_c \#(w,c)$$ is the number of times word $$w$$ occurred in $$D$$, $$\#(c) = \sum_w \#(w,c)$$ is the number of times context $$c$$ occurred in $$D$$, and $$|D|$$ is the number of pairs in $$D$$

Back
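
A sketch computing PMI, and the positive PMI used by $$M^\text{PPMI}$$ above, from a word-by-context count matrix; unseen pairs get $$-\infty$$ before clipping:

```python
import numpy as np

def pmi_matrix(counts):
    """PMI from a (word x context) co-occurrence count matrix #(w, c)."""
    total = counts.sum()                     # |D|
    nw = counts.sum(axis=1, keepdims=True)   # #(w)
    nc = counts.sum(axis=0, keepdims=True)   # #(c)
    with np.errstate(divide="ignore"):       # log 0 -> -inf for unseen pairs
        return np.log(counts * total) - np.log(nw * nc)

def ppmi_matrix(counts):
    """Positive PMI: clip negative associations to zero."""
    return np.maximum(pmi_matrix(counts), 0.0)
```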

Adam optimizer

Front

1. Initialize $$\mathbf{w}_0, \mathbf{m}_0 = 0, \mathbf{v}_0 = 0$$
2. For $$t = 1, 2, \ldots$$
    1. Sample $$i \in \{1, \ldots, N\}$$ (batch size 1)
    2. $$\mathbf{g}_t \leftarrow \nabla f_i(\mathbf{w}_t)$$
    3. $$\mathbf{m}_t \leftarrow \beta_1\mathbf{m}_{t-1} + (1 - \beta_1)\mathbf{g}_t$$
    4. $$\mathbf{v}_t \leftarrow \beta_2\mathbf{v}_{t-1} + (1 - \beta_2)\mathbf{g}_t^2$$
    5. $$\mathbf{\hat{m}}_t \leftarrow \frac{\mathbf{m}_t}{1 - \beta_1^t}$$ (bias correction)
    6. $$\mathbf{\hat{v}}_t \leftarrow \frac{\mathbf{v}_t}{1 - \beta_2^t}$$
    7. $$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \alpha\frac{\mathbf{\hat{m}}_t}{\sqrt{\mathbf{\hat{v}}_t} + \epsilon}$$
Back
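
The same steps in NumPy (the defaults follow common Adam settings; the `grad_fi(w, i)` callback returning $$\nabla f_i(\mathbf{w})$$ is an assumed interface):

```python
import numpy as np

def adam(grad_fi, w0, N, alpha=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, iters=1000, seed=0):
    """Adam with batch size 1; grad_fi(w, i) gives the gradient of f_i."""
    rng = np.random.default_rng(seed)
    w, m, v = w0.copy(), np.zeros_like(w0), np.zeros_like(w0)
    for t in range(1, iters + 1):
        i = rng.integers(N)                  # sample one example
        g = grad_fi(w, i)
        m = beta1 * m + (1 - beta1) * g      # first moment
        v = beta2 * v + (1 - beta2) * g**2   # second moment
        m_hat = m / (1 - beta1**t)           # bias corrections
        v_hat = v / (1 - beta2**t)
        w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w
```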

GAN minmax training objective

Front

$$\max_G\min_D L_D(G, D)$$

Back

Information gain from a decision tree node split

Front

$$H(S) - \text{average entropy of split}$$

Back

Backpropagation gradient $$\delta_j^{(\ell)}$$

Front

$$\delta_j^{(\ell)} = \frac{\partial e(W)}{\partial s_j^{(\ell)}}$$

where $$\ell$$ is layer and $$j$$ is neuron

Back

Predicted output at time $$t$$ of an RNN

Front

$$o_t = \arg\max_i(V\mathbf{s}_t)_i$$

where $$V$$ is the layer for prediction

Back

SVM stochastic subgradient descent algorithm for $$y_i \in \{-1, 1\}$$ classification

Front

For $$t = 1, 2, \ldots$$

1. Sample $$i$$
2. If $$y_i\mathbf{w}^\top\mathbf{x}_i < 1$$:
    $$\mathbf{w} \leftarrow (1 - \eta_t)\mathbf{w} + \eta_t nCy_i\mathbf{x}_i$$
3. Else ($$y_i\mathbf{w}^\top\mathbf{x}_i \geqslant 1$$):
    $$\mathbf{w} \leftarrow (1 - \eta_t)\mathbf{w}$$
Back
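
A sketch of this loop (the $$\eta_t = 1/t$$ schedule, iteration count, and default $$C$$ are assumptions):

```python
import numpy as np

def svm_sgd(X, y, C=1.0, iters=1000, seed=0):
    """Stochastic subgradient descent for the primal SVM; y_i in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, iters + 1):
        eta = 1.0 / t                    # assumed step-size schedule
        i = rng.integers(n)              # 1. sample i
        if y[i] * (w @ X[i]) < 1:        # 2. margin violated
            w = (1 - eta) * w + eta * n * C * y[i] * X[i]
        else:                            # 3. margin satisfied
            w = (1 - eta) * w
    return w
```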

Hinge loss for $$y_i \in \{-1, 1\}$$ classification (the SVM loss, not part of logistic regression)

Front

$$\max(0, 1 - y_n\mathbf{w}^\top\mathbf{x}_n)$$

Back

Unconstrained version of

$$\min_\mathbf{w} E_\text{tr}(\mathbf{w}) = \frac1{N} (Z\mathbf{w} - \mathbf{y})^\top (Z\mathbf{w} - \mathbf{y})\\\text{s.t. } \mathbf{w}^\top\mathbf{w} \leqslant C$$

Front

$$\min_\mathbf{w} E_\text{tr}(\mathbf{w}) + \lambda\mathbf{w}^\top\mathbf{w}$$

where $$\mathbf{w}_\text{reg}$$, the solution to both optimization problems, satisfies

$$\nabla E_\text{tr}(\mathbf{w}_\text{reg}) + 2\lambda\mathbf{w}_\text{reg} = 0$$

Back

Softmax loss function

Front

$$-\log\frac{e^{\mathbf{w}_{y_i}^\top\mathbf{x}_i}}{\sum_j e^{\mathbf{w}_j^\top\mathbf{x}_i}}$$

Back

Sufficient decrease condition for back-tracking line search

Front
• Simple condition: $$f(\mathbf{w} + \alpha\mathbf{d}) < f(\mathbf{w})$$
• Often works in practice but lacks convergence guarantees
• Provable sufficient decrease (Armijo) condition:
    $$f(\mathbf{w} + \alpha\mathbf{d}) < f(\mathbf{w}) + \sigma\alpha\nabla f(\mathbf{w})^\top\mathbf{d}$$
    for constant $$\sigma \in (0, 1)$$
Back

Objective function for decision tree partition $$S = S_1 \cup S_2$$

Front

$$\sum_{i \in S_1} (y_i - y^{(1)})^2 + \sum_{i \in S_2} (y_i - y^{(2)})^2$$

where $$y^{(1)} = \frac1{|S_1|} \sum_{i \in S_1} y_i$$ and $$y^{(2)} = \frac1{|S_2|} \sum_{i \in S_2} y_i$$

Back
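
A small sketch that evaluates this objective for threshold splits on one feature (the data and threshold set are illustrative):

```python
import numpy as np

def split_sse(y, mask):
    """Sum of squared errors of y around each side's mean."""
    sse = 0.0
    for part in (y[mask], y[~mask]):                   # S1 and S2
        if len(part):
            sse += np.sum((part - part.mean()) ** 2)   # (y_i - y_bar)^2
    return sse

x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([1.0, 1.1, 0.9, 5.0])
# Pick the threshold minimizing the objective; here it isolates x = 10.
best_sse, best_t = min((split_sse(y, x <= t), t) for t in x[:-1])
```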

Likelihood of $$\mathcal{D} = (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$$ for logistic regression with $$y_i \in \{-1, 1\}$$ classification

Front

$$\prod_{n=1}^N P(y_n | \mathbf{x}_n) = \prod_{n=1}^N \theta(y_n\mathbf{w}^\top\mathbf{x}_n)$$

where $$\theta$$ is the sigmoid function

Back

GAN generator loss function

Front

$$L_G = E_z[-\log D(G(z))]$$

Back

Hoeffding's inequality

Front

$$P(|v - \mu| > \epsilon) \leqslant 2e^{-2\epsilon^2N}$$

where $$N$$ is the sample size, $$v$$ the sample mean, and $$\mu$$ the true mean

Back

SVM dual problem for $$y_i \in \{-1, 1\}$$ classification

Front

$$\max_{0 \leqslant \alpha \leqslant C} \left\{-\frac12\mathbf{\alpha}^\top Q\mathbf{\alpha} + \mathbf{e}^\top\mathbf{\alpha}\right\} = D(\mathbf{\alpha})$$

where $$Q \in \mathbb{R}^{n \times n}$$ with $$Q_{ij} = y_iy_j\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)$$, $$\phi(\mathbf{x})$$ is a higher dimensional mapping function, and $$\mathbf{e}$$ is the ones vector.

$$\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)$$ can be replaced with a kernel $$K(\mathbf{x}_i, \mathbf{x}_j)$$.

Back

Logistic loss for each training example for $$y_i \in \{-1, 1\}$$ classification

Front

$$\log(1 + e^{-y_n\mathbf{w}^\top\mathbf{x}_n})$$

Back

Empirical risk minimization

Front

$$\min_W \frac1{N} \sum_{n=1}^N \text{loss}(f_W(\mathbf{x}_n), y_n)$$

where $$f_W(\mathbf{x})$$ is the decision function to be learned and $$W$$ its parameters

Back

Backpropagation $$\delta_i^{(\ell - 1)}$$

Front

\begin{aligned}\delta_i^{(\ell - 1)} &= \frac{\partial e(W)}{\partial s_i^{(\ell - 1)}}\\&= \sum_{j=1}^d \frac{\partial e(W)}{\partial s_j^{(\ell)}} \times \frac{\partial s_j^{(\ell)}}{\partial x_i^{(\ell - 1)}} \times \frac{\partial x_i^{(\ell - 1)}}{\partial s_i^{(\ell - 1)}}\\&= \sum_{j=1}^d \delta_j^{(\ell)} \times w_{ij}^{(\ell)} \times \theta'(s_i^{(\ell - 1)})\end{aligned}

Back

Newton's method

Front

\begin{aligned}f(\mathbf{x}) &\approx f(\mathbf{a}) + (\mathbf{x} - \mathbf{a})^\top \nabla_\mathbf{x} f(\mathbf{a})\\&+ \frac12(\mathbf{x} - \mathbf{a})^\top \mathbf{H}(f)(\mathbf{a})(\mathbf{x} - \mathbf{a})\end{aligned}

which is simply a second-order Taylor series expansion. Applied to gradient descent:

\begin{aligned}f(\mathbf{w}^t + \mathbf{d}) &\approx f(\mathbf{w}^t) + \mathbf{d}^\top \nabla_\mathbf{w} f(\mathbf{w}^t)\\&+ \frac12\mathbf{d}^\top \mathbf{H}(f)(\mathbf{w}^t)\mathbf{d}\end{aligned}

Minimizing the right-hand side over $$\mathbf{d}$$ gives the Newton step $$\mathbf{d}^* = -\mathbf{H}(f)(\mathbf{w}^t)^{-1}\nabla_\mathbf{w} f(\mathbf{w}^t)$$.

Back
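
A sketch of that step (on a quadratic, a single Newton step reaches the minimizer):

```python
import numpy as np

def newton_step(grad_f, hess_f, w):
    """One Newton update: d* = -H(f)(w)^{-1} grad f(w)."""
    d = np.linalg.solve(hess_f(w), -grad_f(w))
    return w + d

# Example: f(w) = 1/2 w^T A w - b^T w has gradient Aw - b and Hessian A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
w1 = newton_step(lambda w: A @ w - b, lambda w: A, np.zeros(2))  # = A^{-1} b
```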

Finite VC dimension implies what?

Front

$$m_\mathcal{H}(N)$$ is polynomial in $$N$$ and

$$m_\mathcal{H}(N) \leqslant \sum_{i=0}^{d_\text{VC}} \binom{N}{i}$$

and

$$P(|E_\text{tr} - E| > \epsilon) \leqslant O\left(N^{d_\text{VC}} \cdot e^{-\frac18\epsilon^2N}\right)$$

Back

Upper bound on $$P(|E_{\text{tr}}(g) - E(g)| > \epsilon)$$

Front

$$P(|E_{\text{tr}}(g) - E(g)| > \epsilon) \leqslant 2m_\mathcal{H}(N)e^{-2\epsilon^2N}$$

where the growth function $$m_\mathcal{H}(N)$$ measures model complexity (the number of dichotomies $$\mathcal{H}$$ can realize on $$N$$ points; at most $$|\mathcal{H}|$$ for a finite hypothesis set)

Back

SVM primal problem for $$y_i \in \{-1, 1\}$$ classification

Front

$$\min_{\mathbf{w}, \xi} \frac12\mathbf{w}^\top\mathbf{w} + C\sum_{i=1}^n\xi_i$$

s.t. $$y_i(\mathbf{w}^\top\mathbf{x}_i) \geqslant 1 - \xi_i, i = 1, \ldots, n$$

and $$\xi_i \geqslant 0$$.

Equivalent to

$$\min_\mathbf{w} \frac12\mathbf{w}^\top\mathbf{w} + C\sum_{i=1}^n \max(0, 1 - y_i\mathbf{w}^\top\mathbf{x}_i)$$

(L2 regularization and hinge loss)

Back

VC dimension of a hypothesis set $$\mathcal{H}$$, denoted by $$d_\text{VC}(\mathcal{H})$$

Front

Largest value of $$N$$ for which $$m_\mathcal{H}(N) = 2^N$$ (the most points $$\mathcal{H}$$ can shatter)

Back

Expected output of neuron with dropout

Front

$$E\left[x_i^{(\ell)}\right] = \alpha\sigma\left(\sum_j W_{ij}^{(\ell)}x_j^{(\ell - 1)} + b_i^{(\ell)}\right)$$

At test time (no dropout), multiply all weights by $$\alpha$$ so activations match this expectation; equivalently, inverted dropout scales the surviving activations by $$\alpha^{-1}$$ during training.

Back

GAN discriminator loss function

Front

\begin{aligned}L_D &= E_{x\sim\text{real data}}[-\log D(x)]\\&+ E_z[-\log(1 - D(G(z)))]\end{aligned}

Back

Gradient descent update rule

Front

$$\mathbf{w}^{t+1} \leftarrow \mathbf{w}^t + \mathbf{d}^*$$

where $$\mathbf{d}^* = -\alpha\nabla f(\mathbf{w}^t)$$ and $$\alpha > 0$$ is the step size

Back

Gradient descent with back-tracking line search

Front
1. Initialize weights $$\mathbf{w}_0$$
2. For $$t = 1, 2, \ldots$$
    1. Compute $$\mathbf{d} = -\nabla f(\mathbf{w})$$
    2. For $$\alpha = \alpha_0, \frac{\alpha_0}2, \frac{\alpha_0}4, \ldots$$
        1. Break if $$f(\mathbf{w} + \alpha\mathbf{d}) \leqslant f(\mathbf{w}) + \sigma\alpha\nabla f(\mathbf{w})^\top \mathbf{d}$$
    3. Update weights: $$\mathbf{w} \leftarrow \mathbf{w} + \alpha\mathbf{d}$$
    4. Stop when $$\lVert \nabla f(\mathbf{w}) \rVert < \epsilon$$
Back
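
The full procedure as a NumPy sketch (the defaults for $$\alpha_0$$, $$\sigma$$, and the tolerance are illustrative):

```python
import numpy as np

def gd_backtracking(f, grad_f, w0, alpha0=1.0, sigma=0.01,
                    eps=1e-6, iters=1000):
    """Gradient descent with back-tracking line search."""
    w = w0.copy()
    for _ in range(iters):
        g = grad_f(w)
        if np.linalg.norm(g) < eps:    # stop when the gradient is small
            break
        d = -g
        alpha = alpha0
        # Halve alpha until the sufficient decrease condition holds.
        while f(w + alpha * d) > f(w) + sigma * alpha * (g @ d):
            alpha /= 2
        w = w + alpha * d
    return w

# Example on f(w) = ||w||^2:
w_star = gd_backtracking(lambda w: w @ w, lambda w: 2 * w,
                         np.array([3.0, -4.0]))
```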

Multi-class loss function

Front

Minimize regularized training error:

$$\min_{\mathbf{w}_1,\ldots,\mathbf{w}_L} \sum_{i=1}^n \text{loss}(f(\mathbf{x}_i), \mathbf{y}_i) + \lambda\sum_{j=1,\ldots,L} \mathbf{w}_j^\top \mathbf{w}_j$$

Back

Average entropy of decision tree node split $$S \rightarrow S_1, S_2$$

Front

$$\frac{|S_1|}{|S|}H(S_1) + \frac{|S_2|}{|S|}H(S_2)$$

Back

AdaGrad

Front
1. Initialize $$\mathbf{w}_0$$ and $$G^0 = 0$$
2. For $$t = 1, 2, \ldots$$
    1. Sample $$j \in \{1, \ldots, N\}$$ (batch size 1)
    2. $$\mathbf{g}^t \leftarrow \nabla f_j(\mathbf{w}_t)$$
    3. $$G_i^t \leftarrow G_i^{t-1} + (g_i^t)^2$$ for each dimension $$i$$
    4. $$w_{t+1,i} \leftarrow w_{t,i} - \frac{\eta}{\sqrt{G_i^t + \epsilon}}g_i^t$$
Back
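
The same loop in NumPy (the callback interface and defaults are assumptions, as in the Adam sketch above):

```python
import numpy as np

def adagrad(grad_fi, w0, N, eta=0.1, eps=1e-8, iters=1000, seed=0):
    """AdaGrad with batch size 1 and per-dimension step sizes."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    G = np.zeros_like(w0)        # accumulated squared gradients
    for _ in range(iters):
        j = rng.integers(N)      # sample one example
        g = grad_fi(w, j)
        G += g**2                # larger past gradients -> smaller steps
        w = w - eta / np.sqrt(G + eps) * g
    return w
```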

### Chapter 3 (15 cards)

$$\log{\sigma(x)}$$

Front

$$-\zeta(-x)$$

Back

$$\forall x \in (0, 1), \sigma^{-1}(x)$$

Front

$$\log\left(\frac{x}{1-x}\right)$$

Back

Positive part function

Front

$$x^+ = \max(0,x)$$

Back

Softplus function

Front

$$\zeta(x) = \log(1 + e^x)$$

Back

Self-information of event $$X = x$$

Front

$$I(x) = -\log{P(x)}$$

Back

Negative part function

Front

$$x^- = \max(0,-x)$$

Back

Shannon entropy

Front

\begin{aligned}H(x) &= \mathbb{E}_{x\sim P}[I(x)]\\&= -\mathbb{E}_{x\sim P}[\log{P(x)}]\end{aligned}

Called differential entropy when $$x$$ is continuous

Back

$$\forall x > 0, \zeta^{-1}(x)$$

Front

$$\log\left(e^x - 1\right)$$

Back

Kullback-Leibler (KL) divergence

Front

\begin{aligned}D_{\text{KL}}(P \Vert Q) &= \mathbb{E}_{x \sim P} \left[\log{\frac{P(x)}{Q(x)}}\right]\\&= \mathbb{E}_{x \sim P} [\log P(x) - \log Q(x)]\end{aligned}

measures how different $$P(x)$$ and $$Q(x)$$ are

Back

Integral for $$\zeta(x)$$

Front

$$\int_{-\infty}^x \sigma(y) \,dy$$

Back

$$1 - \sigma(x)$$

Front

$$\sigma(-x)$$

Back

$$\frac{d}{dx}\sigma(x)$$

Front

$$\sigma(x)(1 - \sigma(x))$$

Back

$$\zeta(x) - \zeta(-x)$$

Front

$$x$$

Back

Cross-entropy

Front

\begin{aligned}H(P, Q) &= H(P) + D_{\text{KL}}(P \Vert Q)\\&= -\mathbb{E}_{x \sim P} \log Q(x)\end{aligned}

Back
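
A sketch of both quantities for discrete distributions (assumes strictly positive probability vectors so all logs are finite):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = E_{x~P}[log P(x) - log Q(x)]."""
    return np.sum(p * (np.log(p) - np.log(q)))

def cross_entropy(p, q):
    """H(P, Q) = -E_{x~P} log Q(x) = H(P) + D_KL(P || Q)."""
    return -np.sum(p * np.log(q))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
cross_entropy(p, q) - (-np.sum(p * np.log(p)))  # equals kl_divergence(p, q)
```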

$$\frac{d}{dx}\zeta(x)$$

Front

$$\sigma(x)$$

Back

### Chapter 4 (2 cards)

$$\text{softmax}(\mathbf{x})_i$$

Front

$$\frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}$$

often used to predict probabilities associated with a multinoulli distribution

Back
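
A sketch of the formula; subtracting $$\max_i x_i$$ first avoids overflow and leaves the result unchanged:

```python
import numpy as np

def softmax(x):
    """softmax(x)_i = exp(x_i) / sum_j exp(x_j), computed stably."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

softmax(np.array([1.0, 2.0, 3.0]))  # sums to 1; largest entry gets most mass
```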

Condition number

Front

Ratio of the magnitude of the largest and smallest eigenvalues:

$$\max_{i,j} \left\lvert \frac{\lambda_i}{\lambda_j} \right\rvert$$

Back

### Chapter 5 (3 cards)

Bias of an estimator

Front

$$\text{bias}(\hat{\mathbf{\theta}}_m) = \mathbb{E}(\hat{\mathbf{\theta}}_m) - \mathbf{\theta}$$

Back

Normal equation

Front

$$\mathbf{w}^* = (X^\top X)^{-1}X^\top \mathbf{y}$$

optimal solution for linear regression by minimizing MSE

Back
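
A sketch of the closed form (solving the linear system is preferred to forming the explicit inverse):

```python
import numpy as np

def linear_regression_fit(X, y):
    """w* = (X^T X)^{-1} X^T y, the MSE-minimizing weights."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Example: recover w = (2, -1) from exact data.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = X @ np.array([2.0, -1.0])
w = linear_regression_fit(X, y)
```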

Point estimator or statistic

Front

Function of data:

$$\hat{\mathbf{\theta}}_m = g(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(m)})$$

Back