Deep Learning

Notations#

Let $i=1,2,\dots,N$ (the sample index)

  • Samples $\boldsymbol{x}_i \in \mathcal{X}$ ($\mathbb{R}^d$)

  • Dataset $\boldsymbol{X} = \begin{bmatrix} \boldsymbol{x}_1 \\ \boldsymbol{x}_2 \\ \vdots \\ \boldsymbol{x}_N \end{bmatrix}$

  • Labels $y_i \in \mathcal{Y}$

  • Model params $\boldsymbol{\theta} \in \Theta$ (weights, intercept)

  • Prediction $\hat{y} = f(\boldsymbol{x}, \boldsymbol{\theta})$

  • Loss $L(\hat{y}, y) = (\hat{y} - y)^2$ ($y$ is the ground true)

  • Empirical risk minimization (ERM) $min_{\boldsymbol{\theta} \in \Theta} J(\boldsymbol{\theta}) := \frac{1}{N} \sum_{i=1}^N L(\hat{y_i}, y_i)$

  • Population risk minimization (PRM) $min_{\boldsymbol{\theta} \in \Theta} J(\boldsymbol{\theta}) := \mathbb{E}_{(\boldsymbol{x}, y) \sim \mathcal{D}}[L(\hat{y}, y)]$

  • $\text{softmax}(\boldsymbol{x})$

    let $\boldsymbol{x} = [x_1, x_2, \dots, x_n]^T$

    \[ \text{softmax}(\boldsymbol{x})_i = \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}} \]\[ \text{softmax}(\boldsymbol{x}) = \begin{bmatrix} \text{softmax}(\boldsymbol{x})_1 \\ \text{softmax}(\boldsymbol{x})_2 \\ \cdots \\ \text{softmax}(\boldsymbol{x})_n \end{bmatrix} \]

余弦相似度#

设有两个向量 $\boldsymbol{x}$ 和 $\boldsymbol{y}$, 其中 $\theta$ 为 $\boldsymbol{x}$ 和 $\boldsymbol{y}$ 的夹角。

根据余弦定理,

\[ \| \boldsymbol{x} - \boldsymbol{y} \|^2 = \|\boldsymbol{x}\|^2 + \|\boldsymbol{y}\|^2 - 2\|\boldsymbol{x}\|\|\boldsymbol{y}\| \cos \theta \\ (\boldsymbol{x} - \boldsymbol{y}) \cdot (\boldsymbol{x} - \boldsymbol{y}) = \|\boldsymbol{x}\|^2 + \|\boldsymbol{y}\|^2 - 2\|\boldsymbol{x}\|\|\boldsymbol{y}\| \cos \theta \\ \|\boldsymbol{x}\|^2 + \|\boldsymbol{y}\|^2 - 2\boldsymbol{x} \cdot \boldsymbol{y} = \|\boldsymbol{x}\|^2 + \|\boldsymbol{y}\|^2 - 2\|\boldsymbol{x}\|\|\boldsymbol{y}\| \cos \theta \\ \color{red} \cos \theta = \frac{\boldsymbol{x} \cdot \boldsymbol{y}}{\|\boldsymbol{x}\| \|\boldsymbol{y}\|} \]

Gradients#

  • Given $f: \mathbb{R}^d \to \mathbb{R}$ differentiable, let $\boldsymbol{x}=[x_1,\dots,x_d]^T$, the gradient is
\[ \color{red} \nabla_{\boldsymbol{x}} f(\boldsymbol{x})=[\frac{\partial f(\boldsymbol{x})}{\partial x_1},\dots,\frac{\partial f(\boldsymbol{x})}{\partial x_d}]^T \]

(partial derivate as a column vector)

  • Locally direction of most decrease is $-\nabla_{\boldsymbol{x}} f(\boldsymbol{x})$
  • Necessary condition for local minimality: If $ \boldsymbol{x}^* $ is a local minimum of $f$, then $\nabla_{\boldsymbol{x}} f(\boldsymbol{x}^*) = 0$

Linear Regression#

Least Squares Method (LSM)#

Notation:

  • Parameter: $\boldsymbol{w}=\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix}$
  • Parameter: $\boldsymbol{\theta}=\begin{bmatrix} b \\ w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix}$
  • Data: $\boldsymbol{x}_i = [x_1, x_2, \dots, x_d]$
  • Data: $\tilde{\boldsymbol{x}}_i = [1, x_1, x_2, \dots, x_d]$
  • Dataset: $\boldsymbol{X} = \begin{bmatrix} \boldsymbol{x}_1 \\ \boldsymbol{x}_2 \\ \vdots \\ \boldsymbol{x}_N \end{bmatrix}$
  • Dataset: $\tilde{\boldsymbol{X}} = \begin{bmatrix} \tilde{\boldsymbol{x}}_1 \\ \tilde{\boldsymbol{x}}_2 \\ \vdots \\ \tilde{\boldsymbol{x}}_N \end{bmatrix}$
  • Prediction: $\hat{y}_i = f(\boldsymbol{x}_i, \boldsymbol{\theta}) = b + \boldsymbol{x}_i \boldsymbol{w} = \tilde{\boldsymbol{x}}_i \boldsymbol{\theta}$
  • Ground truth: $\boldsymbol{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$

Calculate EMR $min_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$:

\[ \begin{aligned} J(\boldsymbol{\theta}) &= \frac{1}{N} \sum_{i=1}^N (y_i - f(\tilde{\boldsymbol{x}}_i;\boldsymbol{\theta}))^2 \\ &= \frac{1}{N} \left\| \begin{bmatrix} y_1 - f(\tilde{\boldsymbol{x}}_1;\boldsymbol{\theta}) \\ y_2 - f(\tilde{\boldsymbol{x}}_2;\boldsymbol{\theta}) \\ \vdots \\ y_N - f(\tilde{\boldsymbol{x}}_N;\boldsymbol{\theta}) \end{bmatrix} \right\|_2^2 \\ &= \frac{1}{N} \left\| \boldsymbol{y} - \tilde{\boldsymbol{X}} \boldsymbol{\theta} \right\|_2^2 \\ &= \frac{1}{N} \left( \boldsymbol{y}^T \boldsymbol{y} - 2(\tilde{\boldsymbol{X}} \boldsymbol{\theta})^T \boldsymbol{y} + (\tilde{\boldsymbol{X}} \boldsymbol{\theta})^T \tilde{\boldsymbol{X}} \boldsymbol{\theta} \right) \\ &= \frac{1}{N} \left( \boldsymbol{y}^T \boldsymbol{y} - 2\boldsymbol{\theta}^T \tilde{\boldsymbol{X}}^T \boldsymbol{y} + \boldsymbol{\theta}^T \tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}} \boldsymbol{\theta} \right) \end{aligned} \]

To find the least squares solution, taking the derivative and set it to zero:

\[ \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \frac{1}{N} \left( -2\tilde{\boldsymbol{X}}^T \boldsymbol{y} + 2\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}} \boldsymbol{\theta} \right) = 0 \]\[ \tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}} \boldsymbol{\theta} = \tilde{\boldsymbol{X}}^T \boldsymbol{y} \]\[ \fbox{$ \boldsymbol{\theta} = (\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}})^{-1} \tilde{\boldsymbol{X}}^T \boldsymbol{y} $} \]

Gradient Descent#

Let $\epsilon_k > 0$ be the learning rates, $k$ is the iteration counter, $k = 1,2,\dots$

  • Initialize $\boldsymbol{\theta}$
  • While not converged:
    • Select $m$ samples $\left \{ \boldsymbol{x}^{(1)}, \cdots , \boldsymbol{x}^{(m)} \right \}$ and matching labels $\left \{ y^{(1)}, \cdots , y^{(m)} \right \}$
    • Computer gradient $\boldsymbol{g} \leftarrow \nabla_{\boldsymbol{\theta}} \frac{1}{m} \sum_{i=1}^m L(f(\boldsymbol{x}^{(i)},\boldsymbol{\theta}), y^{(i)})$
    • Compute update $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \epsilon_k \boldsymbol{g}$