Textbook: Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
Unsectioned
(49 cards)
Logistic regression gradient descent for \(y_i \in \{-1, 1\}\)
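A worked update derived from the objective \(\sum_{n=1}^N \log(1 + e^{-y_n\mathbf{w}^\top\mathbf{x}_n})\) given below (a sketch, not necessarily the course's exact presentation):
$$\mathbf{w}^{t+1} \leftarrow \mathbf{w}^t + \alpha\sum_{n=1}^N \theta\!\left(-y_n(\mathbf{w}^t)^\top\mathbf{x}_n\right)y_n\mathbf{x}_n$$
since \(\nabla_\mathbf{w}\log(1 + e^{-y_n\mathbf{w}^\top\mathbf{x}_n}) = -\theta(-y_n\mathbf{w}^\top\mathbf{x}_n)\,y_n\mathbf{x}_n\), where \(\theta\) is the sigmoid function and \(\alpha > 0\) is the step size.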
Entropy \(H(S)\) for decision trees
$$H(S) = -\sum_{c=1}^C p(c)\log p(c)$$
where \(S\) is a set of data points in a node, \(c = 1, \ldots, C\) are labels, and \(p(c)\) is the proportion of data belonging to class \(c\).
Entropy is \(0\) if all samples are in the same class and maximal (worst case) when \(p(1) = \ldots = p(C)\).
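A small numpy sketch of the formula (base-2 logarithm; the label lists are made up):

```python
import numpy as np

def entropy(labels):
    """Entropy H(S) of a node, computed from its list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()           # class proportions p(c)
    return -np.sum(p * np.log2(p))      # 0 if pure, log2(C) if uniform

print(entropy([1, 1, 1, 1]))            # ≈ 0 (all samples in one class)
print(entropy([1, 1, 2, 2, 3, 3]))      # log2(3) ≈ 1.585 (uniform, worst case)
```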
Gradient Boosted Decision Tree (GBDT) loss function
Use 2nd order Taylor series expansion:
$$\begin{aligned}\ell_i(\hat{y}_i + f_m(\mathbf{x}_i)) &\approx \ell_i(\hat{y}_i)\ +\\&\ \ \ \ \ g_if_m(\mathbf{x}_i) + \frac12h_if_m(\mathbf{x}_i)^2\\&= \frac{h_i}2 \left\lVert f_m(\mathbf{x}_i) + \frac{g_i}{h_i} \right\rVert^2 + \text{const}\end{aligned}$$
where \(g_i = \partial_{\hat{y}_i}\ell_i(\hat{y}_i)\) is the gradient and \(h_i = \partial_{\hat{y}_i}^2\ell_i(\hat{y}_i)\) is the 2nd order derivative. Apply \(\sum_{i=1}^n\) for the total loss of adding tree \(f_m\) to the ensemble, where \(n\) is the number of training examples.
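Reading off the quadratic form: minimizing this approximation means the new tree \(f_m\) is fit to the targets \(-g_i/h_i\) with sample weights \(h_i\),
$$\min_{f_m}\ \sum_{i=1}^n \frac{h_i}{2}\left\lVert f_m(\mathbf{x}_i) + \frac{g_i}{h_i} \right\rVert^2$$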
SGD for neural networks
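A minimal sketch, assuming the empirical risk minimization form given later in this deck: pick \(n\) uniformly (or a minibatch) and take a gradient step on that example's loss,
$$W^{t+1} \leftarrow W^t - \alpha\,\nabla_W\,\text{loss}(f_{W^t}(\mathbf{x}_n), y_n)$$
where \(\alpha > 0\) is the step size.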
Logistic regression objective function for \(y_i \in \{-1, 1\}\)
$$\sum_{n=1}^N \log(1 + e^{-y_n\mathbf{w}^\top\mathbf{x}_n})$$
Back-tracking line search
Kernel ridge regression
$$\min_\mathbf{w} \frac12\Vert\mathbf{w}\Vert^2 + \frac12\sum_{i=1}^n (\mathbf{w}^\top \phi(\mathbf{x}_i) - y_i)^2$$
L2 regularization + square error. Dual problem:
$$\min_\alpha \alpha^\top Q\alpha + \Vert\alpha\Vert^2 - 2\alpha^\top\mathbf{y}$$
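A minimal numpy sketch, assuming \(Q\) here is the kernel (Gram) matrix \(Q_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)\) and using an RBF kernel as a stand-in; setting the gradient of the dual to zero gives \((Q + I)\alpha = \mathbf{y}\):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Toy 1-D regression data
X = np.linspace(0, 3, 20).reshape(-1, 1)
y = np.sin(2 * X[:, 0])

Q = rbf_kernel(X, X)                              # Gram matrix
alpha = np.linalg.solve(Q + np.eye(len(X)), y)    # dual minimizer

X_test = np.array([[1.5]])
y_hat = rbf_kernel(X_test, X) @ alpha             # prediction sum_i alpha_i K(x_i, x)
```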
Neuron computation with dropout
$$x_i^{(\ell)} = B\sigma\left(\sum_j W_{ij}^{(\ell)}x_j^{(\ell - 1)} + b_i^{(\ell)}\right)$$
where \(B\) is a Bernoulli variable equal to \(1\) with probability \(\alpha\)
$$M^\text{PPMI}$$
An \(n \times d\) word feature matrix where each row corresponds to a word and each column to a context; its entries are \(\max(\text{PMI}(w, c), 0)\) (positive PMI).
$$M^\text{PPMI} \approx U_k\Sigma_kV_k^\top$$
\(W^\text{SVD} = U_k\Sigma_k\) is the low-dimensional representation of each word (each row is a \(k\)-dimensional feature vector for a word)
Lipschitz constant \(\mathcal{L}\) as upper bound for gradient descent convergence
Gradient descent with step size \(\alpha\) converges if \(\alpha < \frac1{\mathcal{L}}\), where \(\mathcal{L}\) satisfies
$$\forall \mathbf{x}: \nabla^2f(\mathbf{x}) \preceq \mathcal{L}I$$
Vapnik-Chervonenkis (VC) bound
$$P(\exists h \in \mathcal{H} \text{ s.t. } |E_\text{tr}(h) - E(h)| > \epsilon)\\\leqslant 2 \cdot 2m_\mathcal{H}(2N) \cdot e^{-\frac18\epsilon^2N}$$
Backpropagation for final layer, \(\delta_1^{(L)}\), with square loss
$$\begin{aligned}\delta_1^{(L)} &= \frac{\partial e(W)}{\partial s_1^{(L)}}\\&= \frac{\partial e(W)}{\partial x_1^{(L)}} \times \frac{\partial x_1^{(L)}}{\partial s_1^{(L)}}\\&= 2(x_1^{(L)} - y_n) \times \theta'(s_1^{(L)})\end{aligned}$$
Hidden state of RNN
$$\mathbf{s}_t = f(U\mathbf{x}_t + W\mathbf{s}_{t-1})$$
where \(W\) is the transition matrix and \(U\) is the word embedding matrix
SVM dual prediction for \(y_i \in \{-1, 1\}\) classification
$$\begin{aligned}\mathbf{w}^\top \phi(\mathbf{x}) &= \sum_{i=1}^n y_i\alpha_i\phi(\mathbf{x}_i)^\top\phi(\mathbf{x})\\&= \sum_{i=1}^n y_i\alpha_i K(\mathbf{x}_i, \mathbf{x})\end{aligned}$$
Growth function for hypotheses \(\mathcal{H}\)
Counts the maximum number of dichotomies on any \(N\) points:
$$m_\mathcal{H}(N) = \max_{\mathbf{x}_1, \ldots, \mathbf{x}_N \in \mathcal{X}} |\mathcal{H}(\mathbf{x}_1, \ldots, \mathbf{x}_N)|$$
satisfying \(m_\mathcal{H}(N) \leqslant 2^N\)
Dichotomy of hypothesis \(h: \mathcal{X} \rightarrow \{-1, +1\}\)
Result for a particular dataset:
$$h: \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\} \rightarrow \{-1, +1\}^N$$
Momentum gradient descent
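One common formulation (a sketch; the momentum coefficient \(\beta\) is an assumption, not from the deck):
$$\mathbf{v}^{t+1} = \beta\mathbf{v}^t - \alpha\nabla f(\mathbf{w}^t), \qquad \mathbf{w}^{t+1} = \mathbf{w}^t + \mathbf{v}^{t+1}$$
where \(\beta \in [0, 1)\) (e.g. \(0.9\)) controls how much past gradients are accumulated and \(\alpha > 0\) is the step size.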
General value for VC dimension of linear classifiers
$$d_\text{VC} = d + 1$$
where \(d\) is the number of dimensions of each sample
Pointwise mutual information (PMI)
$$\begin{aligned}\text{PMI}(w, c) &= \log\frac{\hat{P}(w, c)}{\hat{P}(w)\hat{P}(c)}\\&=\log\frac{\#(w, c)|D|}{\#(w)\#(c)}\end{aligned}$$
where \(\#(w) = \sum_c \#(w,c)\) is the number of times word \(w\) occurred in \(D\), \(\#(c) = \sum_w \#(w,c)\) is the number of times context \(c\) occurred in \(D\), and \(|D|\) is the number of pairs in \(D\)
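A small numpy sketch of the count-based formula (the co-occurrence counts are made up; the PPMI matrix \(M^\text{PPMI}\) used earlier in the deck is taken to be \(\max(\text{PMI}, 0)\)):

```python
import numpy as np

# counts[w, c] = #(w, c): toy word-context co-occurrence counts
counts = np.array([[10.0, 2.0, 0.0],
                   [ 3.0, 5.0, 1.0]])

D = counts.sum()                  # |D|, total number of (w, c) pairs
w_counts = counts.sum(axis=1)     # #(w)
c_counts = counts.sum(axis=0)     # #(c)

with np.errstate(divide="ignore"):        # log(0) -> -inf for unseen pairs
    pmi = np.log(counts * D / np.outer(w_counts, c_counts))

ppmi = np.maximum(pmi, 0.0)       # positive PMI matrix M^PPMI
```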
Adam
Momentum + adaptive updates
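Spelled out (the standard Adam update; \(\beta_1\), \(\beta_2\), and \(\epsilon\) are the usual hyperparameters, introduced here as an assumption):
$$\begin{aligned}m_t &= \beta_1 m_{t-1} + (1 - \beta_1)g_t\\v_t &= \beta_2 v_{t-1} + (1 - \beta_2)g_t^2\\\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}\\w_{t+1} &= w_t - \frac{\alpha\,\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\end{aligned}$$
where \(g_t\) is the gradient at step \(t\) and all operations are elementwise.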
GAN minmax training objective
$$\max_G\min_D L_D(G, D)$$
Information gain from a decision tree node split
$$H(S) - \text{average entropy of split}$$
Backpropagation gradient \(\delta_j^{(\ell)}\)
$$\delta_j^{(\ell)} = \frac{\partial e(W)}{\partial s_j^{(\ell)}}$$
where \(\ell\) is layer and \(j\) is neuron
Predicted output at time \(t\) of an RNN
$$o_t = \arg\max_i(V\mathbf{s}_t)_i$$
where \(V\) is the layer for prediction
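A minimal numpy sketch of one RNN step and its prediction, assuming \(f = \tanh\) (the deck does not specify \(f\)) and random placeholder weights:

```python
import numpy as np

d_in, d_h, vocab = 5, 8, 12                 # embedding, hidden, and vocabulary sizes
rng = np.random.default_rng(0)
U = rng.normal(size=(d_h, d_in))            # word embedding / input matrix
W = rng.normal(size=(d_h, d_h))             # transition matrix
V = rng.normal(size=(vocab, d_h))           # prediction layer

s_prev = np.zeros(d_h)                      # previous hidden state s_{t-1}
x_t = rng.normal(size=d_in)                 # input at time t

s_t = np.tanh(U @ x_t + W @ s_prev)         # hidden state s_t = f(U x_t + W s_{t-1})
o_t = np.argmax(V @ s_t)                    # predicted output o_t = argmax_i (V s_t)_i
```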
SVM stochastic subgradient descent algorithm for \(y_i \in \{-1, 1\}\) classification
For \(t = 1, 2, \ldots\)
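One standard way to fill in the loop body (a Pegasos-style sketch based on the hinge-loss primal given later in this deck, not necessarily the course's exact steps): pick \(i_t\) uniformly from \(\{1, \ldots, n\}\) and update with a stochastic subgradient,
$$\mathbf{w}^{t+1} \leftarrow \mathbf{w}^t - \eta_t\left(\mathbf{w}^t - nC\,\mathbb{1}[y_{i_t}(\mathbf{w}^t)^\top\mathbf{x}_{i_t} < 1]\,y_{i_t}\mathbf{x}_{i_t}\right)$$
where \(\eta_t > 0\) is the step size and \(\mathbb{1}[\cdot]\) is the indicator function.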
Hinge loss for \(y_i \in \{-1, 1\}\) classification (the SVM counterpart of the logistic loss)
$$\max(0, 1 - y_n\mathbf{w}^\top\mathbf{x}_n)$$
Unconstrained version of
$$\min_\mathbf{w} E_\text{tr}(\mathbf{w}) = \frac1{N} (Z\mathbf{w} - \mathbf{y})^\top (Z\mathbf{w} - \mathbf{y})\\\text{s.t. } \mathbf{w}^\top\mathbf{w} \leqslant C$$
$$\min_\mathbf{w} E_\text{tr}(\mathbf{w}) + \lambda\mathbf{w}^\top\mathbf{w}$$
where \(\mathbf{w}_\text{reg}\) is the solution to both optimization problems:
$$\nabla E_\text{tr}(\mathbf{w}_\text{reg}) + 2\lambda\mathbf{w}_\text{reg} = 0$$
Softmax loss function
$$-\log\frac{e^{\mathbf{w}_{y_i}^\top\mathbf{x}_i}}{\sum_j e^{\mathbf{w}_j^\top\mathbf{x}_i}}$$
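A small numpy sketch for one example (the class scores are made up; subtracting the max is only for numerical stability):

```python
import numpy as np

def softmax_loss(scores, y):
    """-log of the softmax probability of the true class y, given scores w_j^T x."""
    scores = scores - scores.max()                    # stability; does not change the loss
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[y]

scores = np.array([2.0, 0.5, -1.0])                   # w_j^T x_i for each class j
print(softmax_loss(scores, y=0))                      # small loss: class 0 has the largest score
```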
Sufficient decrease condition for back-tracking line search
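The usual Armijo condition (a sketch of what this card refers to): accept step size \(\alpha\) along direction \(\mathbf{d}\) only if
$$f(\mathbf{w}^t + \alpha\mathbf{d}) \leqslant f(\mathbf{w}^t) + c\,\alpha\,\nabla f(\mathbf{w}^t)^\top\mathbf{d}$$
for a fixed \(c \in (0, 1)\); otherwise shrink \(\alpha\) (e.g. halve it) and test again.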
Objective function for decision tree partition \(S = S_1 \cup S_2\)
$$\sum_{i \in S_1} (y_i - y^{(1)})^2 + \sum_{i \in S_2} (y_i - y^{(2)})^2$$
where \(y^{(1)} = \frac1{|S_1|} \sum_{i \in S_1} y_i\) and \(y^{(2)} = \frac1{|S_2|} \sum_{i \in S_2} y_i\)
Likelihood of \(\mathcal{D} = (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\) for logistic regression with \(y_i \in \{-1, 1\}\) classification
$$\prod_{n=1}^N P(y_n | \mathbf{x}_n) = \prod_{n=1}^N \theta(y_n\mathbf{w}^\top\mathbf{x}_n)$$
where \(\theta\) is the sigmoid function
GAN generator loss function
$$L_G = E_z[-\log D(G(z))]$$
Hoeffding's inequality
$$P(|v - \mu| > \epsilon) \leqslant 2e^{-2\epsilon^2N}$$
where \(N\) is the sample size, \(v\) is the sample mean, and \(\mu\) is the true mean
SVM dual problem for \(y_i \in \{-1, 1\}\) classification
$$\max_{0 \leqslant \alpha \leqslant C} \left\{-\frac12\mathbf{\alpha}^\top Q\mathbf{\alpha} + \mathbf{e}^\top\mathbf{\alpha}\right\} = D(\mathbf{\alpha})$$
where \(Q \in \mathbb{R}^{n \times n}\) with \(Q_{ij} = y_iy_j\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)\), \(\phi(\mathbf{x})\) is a higher dimensional mapping function, and \(\mathbf{e}\) is the ones vector.
\(\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)\) can be replaced with a kernel \(K(\mathbf{x}_i, \mathbf{x}_j)\).
Logistic loss for each training example for \(y_i \in \{-1, 1\}\) classification
$$\log(1 + e^{-y_n\mathbf{w}^\top\mathbf{x}_n})$$
Empirical risk minimization
$$\min_W \frac1{N} \sum_{n=1}^N \text{loss}(f_W(\mathbf{x}_n), y_n)$$
where \(f_W(\mathbf{x})\) is the decision function to be learned and \(W\) denotes its parameters
Backpropagation \(\delta_i^{(\ell - 1)}\)
$$\begin{aligned}\delta_i^{(\ell - 1)} &= \frac{\partial e(W)}{\partial s_i^{(\ell - 1)}}\\&= \sum_{j=1}^d \frac{\partial e(W)}{\partial s_j^{(\ell)}} \times \frac{\partial s_j^{(\ell)}}{\partial x_i^{(\ell - 1)}} \times \frac{\partial x_i^{(\ell - 1)}}{\partial s_i^{(\ell - 1)}}\\&= \sum_{j=1}^d \delta_j^{(\ell)} \times w_{ij}^{(\ell)} \times \theta'(s_i^{(\ell - 1)})\end{aligned}$$
Newton's method
$$\begin{aligned}f(\mathbf{x}) &\approx f(\mathbf{a}) + (\mathbf{x} - \mathbf{a})^\top \nabla_\mathbf{x} f(\mathbf{a})\\&+ \frac12(\mathbf{x} - \mathbf{a})^\top \mathbf{H}(f)(\mathbf{a})(\mathbf{x} - \mathbf{a})\end{aligned}$$
which is simply a second-order Taylor series expansion. Applied to gradient descent at iterate \(\mathbf{w}^t\):
$$\begin{aligned}f(\mathbf{w}^t + \mathbf{d}) &\approx f(\mathbf{w}^t) + \mathbf{d}^\top \nabla_\mathbf{w} f(\mathbf{w}^t)\\&+ \frac12\mathbf{d}^\top \mathbf{H}(f)(\mathbf{w}^t)(\mathbf{d})\end{aligned}$$
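Setting the gradient of this quadratic approximation with respect to \(\mathbf{d}\) to zero gives the Newton direction:
$$\mathbf{d}^* = -\mathbf{H}(f)(\mathbf{w}^t)^{-1}\,\nabla_\mathbf{w} f(\mathbf{w}^t)$$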
Finite VC dimension implies what?
\(m_\mathcal{H}(N)\) is polynomial in \(N\) and
$$m_\mathcal{H}(N) \leqslant \sum_{i=0}^{d_\text{VC}} \binom{N}{i}$$
and
$$P(|E_\text{tr} - E| > \epsilon) \leqslant O(N^{d_\text{VC}} \cdot e^{-N})$$
Upper bound on \(P(|E_{\text{tr}}(g) - E(g)| > \epsilon)\)
$$P(|E_{\text{tr}}(g) - E(g)| > \epsilon) \leqslant 2m_\mathcal{H}(N)e^{-2\epsilon^2N}$$
where \(m_\mathcal{H}(N)\) measures model complexity (the effective number of hypotheses, i.e., the maximum number of dichotomies \(\mathcal{H}\) can produce on \(N\) points)
SVM primal problem for \(y_i \in \{-1, 1\}\) classification
$$\min_{\mathbf{w}, \xi} \frac12\mathbf{w}^\top\mathbf{w} + C\sum_{i=1}^n\xi_i$$
s.t. \(y_i(\mathbf{w}^\top\mathbf{x}_i) \geqslant 1 - \xi_i, i = 1, \ldots, n\)
and \(\xi_i \geqslant 0\).
Equivalent to
$$\min_\mathbf{w} \frac12\mathbf{w}^\top\mathbf{w} + C\sum_{i=1}^n \max(0, 1 - y_i\mathbf{w}^\top\mathbf{x}_i)$$
(L2 regularization and hinge loss)
VC dimension of a hypothesis set \(\mathcal{H}\), denoted by \(d_{VC}(\mathcal{H})\)
Largest value of \(N\) for which \(m_\mathcal{H}(N) = 2^N\) (the most points \(\mathcal{H}\) can shatter)
Expected output of neuron with dropout
$$E\left[x_i^{(\ell)}\right] = \alpha\sigma\left(\sum_j W_{ij}^{(\ell)}x_j^{(\ell - 1)} + b_i^{(\ell)}\right)$$
Because each unit is kept only with probability \(\alpha\), the activations must be rescaled: either multiply them by \(\alpha^{-1}\) during training (inverted dropout) or multiply the weights by \(\alpha\) at test time, so that expected outputs match.
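A minimal numpy sketch of these dropout cards, using the inverted-dropout convention (\(\sigma\) taken to be the sigmoid; the layer sizes are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dropout_layer(x_prev, W, b, alpha=0.8, train=True):
    """x^(l) = B * sigma(W x^(l-1) + b), keeping each unit with probability alpha."""
    a = sigmoid(W @ x_prev + b)
    if not train:
        return a                        # test time: no mask, no rescaling needed
    B = rng.random(a.shape) < alpha     # Bernoulli mask: 1 with probability alpha
    return (B * a) / alpha              # inverted dropout: divide by alpha so E[x] is unchanged

x_prev = np.ones(4)
W, b = 0.1 * np.ones((3, 4)), np.zeros(3)
print(dropout_layer(x_prev, W, b))
```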
GAN discriminator loss function
$$\begin{aligned}L_D &= E_{x\sim\text{real data}}[-\log D(x)]\\&+ E_z[-\log(1 - D(G(z)))]\end{aligned}$$
Gradient descent basic process
$$\mathbf{w}^{t+1} \leftarrow \mathbf{w}^t + \mathbf{d}^*$$
where \(\mathbf{d}^* = -\alpha\nabla f(\mathbf{w}^t)\) and \(\alpha > 0\) is the step size
Gradient descent with back-tracking line search
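A compact Python sketch of gradient descent with back-tracking line search (the constant \(c\) and the shrink factor are conventional choices, not from the deck):

```python
import numpy as np

def gradient_descent_backtracking(f, grad, w, alpha0=1.0, c=1e-4, shrink=0.5, iters=100):
    """Gradient descent; each iteration back-tracks alpha until sufficient decrease holds."""
    for _ in range(iters):
        g = grad(w)
        d = -g                                   # descent direction d* = -grad f(w)
        alpha = alpha0
        # sufficient decrease: f(w + a d) <= f(w) + c * a * grad(w)^T d
        while f(w + alpha * d) > f(w) + c * alpha * g @ d:
            alpha *= shrink
        w = w + alpha * d
    return w

# Minimize f(w) = ||w - 1||^2, whose minimizer is w = (1, 1)
f = lambda w: np.sum((w - 1.0) ** 2)
grad = lambda w: 2.0 * (w - 1.0)
print(gradient_descent_backtracking(f, grad, np.zeros(2)))
```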
Multi-class loss function
Minimize regularized training error:
$$\min_{\mathbf{w}_1,\ldots,\mathbf{w}_L} \sum_{i=1}^n \text{loss}(f(\mathbf{x}_i), \mathbf{y}_i) + \lambda\sum_{j=1,\ldots,L} \mathbf{w}_j^\top \mathbf{w}_j$$
Average entropy of decision tree node split \(S \rightarrow S_1, S_2\)
$$\frac{|S_1|}{|S|}H(S_1) + \frac{|S_2|}{|S|}H(S_2)$$
Adagrad
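The standard Adagrad update (a sketch; \(\epsilon\) is a small stability constant, an assumption here), applied elementwise:
$$G_t = G_{t-1} + g_t \odot g_t, \qquad w_{t+1} = w_t - \frac{\alpha}{\sqrt{G_t} + \epsilon} \odot g_t$$
where \(g_t\) is the gradient at step \(t\) and \(\odot\) is the elementwise product.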
Chapter 3
(15 cards)
$$\log{\sigma(x)}$$
$$-\zeta(-x)$$
$$\forall x \in (0, 1), \sigma^{-1}(x)$$
$$\log\left(\frac{x}{1-x}\right)$$
Positive part function
$$x^+ = \max(0,x)$$
Softplus function
$$\zeta(x) = \log(1 + e^x)$$
Self-information of event \(X = x\)
$$I(x) = -\log{P(x)}$$
Negative part function
$$x^- = \max(0,-x)$$
Shannon entropy
$$\begin{aligned}H(x) &= \mathbb{E}_{x\sim P}[I(x)]\\&= -\mathbb{E}_{x\sim P}[\log{P(x)}]\end{aligned}$$
Called differential entropy when \(x\) is continuous
$$\forall x > 0, \zeta^{-1}(x)$$
$$\log\left(e^x - 1\right)$$
Kullback-Leibler (KL) divergence
$$\begin{aligned}D_{\text{KL}}(P \Vert Q) &= \mathbb{E}_{x \sim P} \left[\log{\frac{P(x)}{Q(x)}}\right]\\&= \mathbb{E}_{x \sim P} [\log P(x) - \log Q(x)]\end{aligned}$$
measures how different \(P(x)\) and \(Q(x)\) are
Integral for \(\zeta(x)\)
$$\int_{-\infty}^x \sigma(y) \,dy$$
$$1 - \sigma(x)$$
$$\sigma(-x)$$
$$\frac{d}{dx}\sigma(x)$$
$$\sigma(x)(1 - \sigma(x))$$
$$\zeta(x) - \zeta(-x)$$
$$x$$
Cross-entropy
$$\begin{aligned}H(P, Q) &= H(P) + D_{\text{KL}}(P \Vert Q)\\&= -\mathbb{E}_{x \sim P} \log Q(x)\end{aligned}$$
$$\frac{d}{dx}\zeta(x)$$
$$\sigma(x)$$
Chapter 4
(2 cards)
$$\text{softmax}(\mathbf{x})_i$$
$$\frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}$$
often used to predict probabilities associated with a multinoulli distribution
Condition number
Ratio of the magnitude of the largest and smallest eigenvalues:
$$\max_{i,j} \left\lvert \frac{\lambda_i}{\lambda_j} \right\rvert$$
Chapter 5
(3 cards)
Bias of an estimator
$$\text{bias}(\hat{\mathbf{\theta}}_m) = \mathbb{E}(\hat{\mathbf{\theta}}_m) - \mathbf{\theta}$$
Normal equation
$$\mathbf{w}^* = (X^\top X)^{-1}X^\top \mathbf{y}$$
optimal solution for linear regression by minimizing MSE
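A quick numpy check of the formula on toy data (in practice `numpy.linalg.lstsq` is the more numerically stable choice):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # design matrix with bias column
w_true = np.array([1.0, 2.0, -3.0])
y = X @ w_true + 0.1 * rng.normal(size=50)

w_star = np.linalg.solve(X.T @ X, X.T @ y)   # normal equation (X^T X)^{-1} X^T y
print(w_star)                                 # close to w_true
```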
Point estimator or statistic
Function of data:
$$\hat{\mathbf{\theta}}_m = g(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(m)})$$