ML Note #1: Linear Models (1)

This ML note series is based on Professor Andrew Ng's Stanford CS229 (2018) course 📒

This post introduces the first model, linear regression, to illustrate the basic steps of machine learning; it also reviews (previews) the relevant linear algebra and revisits the parameter estimation of linear regression from a geometric perspective.

Overview

  • Supervised learning: regression, classification, etc.
  • Unsupervised learning: clustering, etc.
  • Reinforcement learning
  • etc.

Notation

  • \(\theta\) = parameters
  • \(m\) = # training samples
  • \(n\) = # features
  • \(x\) = inputs
  • \(y\) = outputs
  • \((x, y)\) = training examples
  • \((x^{(i)}, y^{(i)})\) = \(\mathrm{i^{th}}\) training example

Linear Regression

Linear regression is a supervised model for regression problems.

Learn a hypothesis \(h\): \[ h(x)=\sum^n_{j=0}\theta_jx_j \] where \(x_0=1\), so that \(\theta_0\) is the bias.

Optimization target

Basic intuition: find \[ \theta\ \text{ s.t. }\ h_\theta(x)\approx y \] Minimize the cost function: \[ J(\theta)=\frac{1}{2}\sum^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})^2 \]
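As a concrete sketch, here are the hypothesis and the cost function in NumPy; the toy data `X`, `y` and the all-zero `theta` are made-up values for illustration, not from the course:

```python
import numpy as np

# toy data: m = 4 samples, n = 1 feature, plus the x_0 = 1 intercept column
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])          # each row is (x_0, x_1) with x_0 = 1
y = np.array([1.0, 3.0, 5.0, 7.0])  # generated from y = 1 + 2 * x_1

def h(theta, X):
    """Hypothesis h(x) = sum_j theta_j * x_j, applied to every row of X."""
    return X @ theta

def J(theta, X, y):
    """Cost J(theta) = 1/2 * sum_i (h(x^(i)) - y^(i))^2."""
    residual = h(theta, X) - y
    return 0.5 * residual @ residual

print(J(np.zeros(2), X, y))  # cost of the all-zero parameter vector
```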

Parameter optimization

Gradient descent

  • start with some \(\theta\)

  • keep changing \(\theta\) to reduce \(J(\theta)\)

  • how to downhill? \[ \begin{aligned} \theta_j &:=\theta_j-\alpha\frac{\partial}{\partial \theta_j}J(\theta)\\\\ &=\theta_j-\alpha\sum^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})x^{(i)}_j \end{aligned} \]

    • \(\alpha\) : learning rate

Batch gradient descent: considers the entire training set in each step.

Stochastic gradient descent: considers a single sample (or a small batch) instead of all of them in each step; noisier but faster.
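A minimal sketch of both update rules on the same toy data as above; the learning rate, step counts, and random seed are arbitrary illustrative choices:

```python
import numpy as np

X = np.column_stack([np.ones(4), np.arange(4.0)])  # same toy design matrix as above
y = 1.0 + 2.0 * np.arange(4.0)

def batch_gradient_descent(X, y, alpha=0.01, steps=1000):
    """Each step uses the full gradient sum_i (h(x^(i)) - y^(i)) x^(i)."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ theta - y)   # sums over all m samples
        theta -= alpha * grad
    return theta

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=200, seed=0):
    """Each step uses a single randomly chosen sample: noisier, but cheaper per step."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):
            grad_i = (X[i] @ theta - y[i]) * X[i]
            theta -= alpha * grad_i
    return theta

print(batch_gradient_descent(X, y))       # approaches [1, 2] on the toy data
print(stochastic_gradient_descent(X, y))  # similar, with some noise
```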

Another method: solving for the parameters directly

Take the extremum of the likelihood function and compute the optimal parameters directly.

Matrix derivative

\[ \begin{gathered} \nabla_Af(A)= \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \frac{\partial f}{\partial A_{12}} \\\\ \frac{\partial f}{\partial A_{21}} & \frac{\partial f}{\partial A_{22}} \end{bmatrix} \end{gathered} \] where \(f:R^{2\times2}\rightarrow R\)

Parameter selection

Rewrite the cost function \[ \begin{aligned} J(\theta) & =\frac{1}{2}\sum^m_{i=1}(h(x^{(i)})-y^{(i)})^2 \\\\ & = \frac{1}{2}(X\theta-y)^T(X\theta-y) \end{aligned} \] compute the derivative \[ \begin{aligned} \nabla_\theta J(\theta) & = \nabla_\theta\frac{1}{2}(X\theta-y)^T(X\theta-y)\\\\ & = \frac{1}{2}\nabla_\theta[\theta^TX^TX\theta-\theta^TX^Ty-y^TX\theta+y^Ty]\\\\ & = \frac{1}{2}[X^TX\theta+X^TX\theta-X^Ty-X^Ty]\\\\ & = X^TX\theta-X^Ty \end{aligned} \] and set it to 0: \[ X^TX\theta=X^Ty\\ \theta=(X^TX)^{-1}X^Ty \]
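The same toy problem solved in closed form; this sketch solves the normal equations \(X^TX\theta = X^Ty\) with `np.linalg.solve` rather than forming the inverse explicitly:

```python
import numpy as np

X = np.column_stack([np.ones(4), np.arange(4.0)])  # toy design matrix from earlier
y = 1.0 + 2.0 * np.arange(4.0)

# theta = (X^T X)^{-1} X^T y, computed by solving (X^T X) theta = X^T y
theta_closed_form = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_closed_form)   # [1., 2.], matching gradient descent on the toy data
```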

Why not use it?

See [1] for a detailed explanation.

  • expensive computation

  • needs the inverse of the matrix \(X^TX\)

Linear algebra review

A review of some basic linear algebra.

  • Inner product \[ v^Tu=\langle u, v\rangle\in R \] where \(v, u\in R^n\).

  • Outer product \[ vu^T \] results in a rank-1 matrix.

  • Matrix product

    \(A^TA\) yields a symmetric matrix (S).

  • Matrix trace \[ \mathrm{tr}(A)=\sum^n_{i=1}A_{ii} \] Notice: if \(f(A)=\mathrm{tr}(AB)\), then \(\nabla_Af(A)=B^T\) (checked numerically in the sketch after this list).

    Properties (the trace is invariant under cyclic permutations):

    • \(\mathrm{tr}(AB)=\mathrm{tr}(BA)\)

    • \(\mathrm{tr}(ABC)=\mathrm{tr}(CAB)\)

    Other: \[ \nabla_A\mathrm{tr}(AA^TC)=CA+C^TA \]

  • Derivative pattern

    • \(f: R\rightarrow R\): value \(R\), 1st derivative \(R\), 2nd derivative \(R\)
    • \(f: R^n\rightarrow R\): value \(R\), 1st derivative \(R^n\) (gradient), 2nd derivative \(S^n\) (Hessian)
    • \(f: R^m\rightarrow R^n\): value \(R^n\), 1st derivative \(R^{m\times n}\) (Jacobian), 2nd derivative \(R^{m\times n\times n}\)
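A small numerical sanity check of the trace facts above; the random 3×3 matrices are arbitrary, and the gradient identity \(\nabla_A\mathrm{tr}(AB)=B^T\) is checked with finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
A, B, C = (rng.standard_normal((3, 3)) for _ in range(3))

# the trace is invariant under cyclic permutations
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))

# finite-difference check of grad_A tr(AB) = B^T
eps, grad = 1e-6, np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        E = np.zeros((3, 3))
        E[i, j] = eps
        grad[i, j] = (np.trace((A + E) @ B) - np.trace(A @ B)) / eps
assert np.allclose(grad, B.T, atol=1e-4)
```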

Further intuition

Understanding matrices from a different perspective.

Regarding matrix \(A\) as a function (mapping): \[ Aa=b\longrightarrow A(a)=b \]

\[ (A:R^n\rightarrow R^m) \]

where \(A\in R^{m\times n},\ a\in R^n,\ b\in R^m\).

Something Interesting

Assume, for example: \[ A:R^3\rightarrow R^3 \]

  • when \(A\) is full rank (rank 3)

    \(A\) is a bijection between two \(R^3\) spaces, which means \(A^{-1}\) exists.

  • when \(A\) has rank 2

    \(A\) is a bijection between two 2-dim subspaces: the input one is called the row space (spanned by the rows of \(A\)) and the output one the column space (spanned by the columns); both pass through the origin.

    Moreover, any input in \(R^3\) can be projected onto the row space, and \(A\) maps that projection to an output in the column space. However, this mapping cannot be inverted from the output space (no \(A^{-1}\)); a small NumPy illustration follows this list.
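A quick illustration of the rank-2 case with NumPy; the matrix below is just one arbitrary rank-deficient example:

```python
import numpy as np

# rank-2 matrix: the third row is the sum of the first two
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])

print(np.linalg.matrix_rank(A))    # 2: the column space is a 2-dim subspace of R^3
try:
    np.linalg.inv(A)
except np.linalg.LinAlgError as err:
    print("no inverse:", err)      # singular matrix, so A^{-1} does not exist
```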

How to project a vector onto a subspace?
  • Projection of one vector onto another (\(b\) onto \(v\)) \[ \mathrm{proj}(b, v)= \left[ \frac{vv^T}{v^Tv} \right] b=\hat{b} \]

  • Projection of a vector onto a subspace (e.g. a 2-dim subspace of a 3-dim space)

    the subspace can be defined by two linearly independent vectors \[ V = [v^{(1)}\ v^{(2)}] \] (its column space is spanned by these column vectors) \[ \mathrm{proj}(b, V)=[V(V^TV)^{-1}V^T]b \] (both formulas are sketched in NumPy after this list)
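Both projection formulas in NumPy; the vectors `b`, `v`, and the columns of `V` are arbitrary examples:

```python
import numpy as np

b = np.array([1.0, 2.0, 3.0])

# projection of b onto a single vector v: [v v^T / (v^T v)] b
v = np.array([1.0, 1.0, 0.0])
print((np.outer(v, v) / (v @ v)) @ b)          # [1.5, 1.5, 0.0]

# projection of b onto the column space of V, a 2-dim subspace of R^3:
# [V (V^T V)^{-1} V^T] b
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
print(V @ np.linalg.solve(V.T @ V, V.T @ b))   # [1.0, 2.0, 0.0]
```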

Reconsidering parameter selection

Understanding parameter selection in linear regression from a geometric perspective.

Regard \(X\) as a mapping from \(\theta\) to \(y\) (n features, m samples): \[ X(\theta)=y \]

\[ X:R^n\rightarrow R^m \]

\(X\) defines an n-dim column space (a subspace of \(R^m\)); any vector in the output space (\(R^m\)) consists of m values, the predicted values for each sample.

Then selecting the parameter \(\theta\) can be seen as finding the point on this n-dim column space that is nearest to the real \(y\).

The nearest point is also the projection of the real \(y\) onto the subspace: \[ \mathrm{proj}(y, X)=[X(X^TX)^{-1}X^T]y \] Then \(\theta\) can be derived from: \[ X\theta=[X(X^TX)^{-1}X^T]y \] so: \[ \theta=(X^TX)^{-1}X^Ty \] the same as the equation derived above.
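A short numerical check of this geometric picture, using the toy design matrix from the regression snippets with a slightly noisy `y` (the noise values are arbitrary): the fitted values \(X\theta\) coincide with the projection of \(y\) onto the column space of \(X\), and the residual is orthogonal to that subspace.

```python
import numpy as np

X = np.column_stack([np.ones(4), np.arange(4.0)])                   # toy design matrix
y = 1.0 + 2.0 * np.arange(4.0) + np.array([0.1, -0.2, 0.05, 0.0])   # plus arbitrary noise

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)            # normal-equation solution
P = X @ np.linalg.inv(X.T @ X) @ X.T                     # projection onto col(X)

assert np.allclose(P @ y, X @ theta_hat)                 # fitted values = projection of y
assert np.allclose(X.T @ (y - X @ theta_hat), 0)         # residual orthogonal to col(X)
```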

Matrix decomposition

Any matrix can be viewed as a transformation between two spaces, which can be decomposed into 3 steps: \[ \text{Rotation}_1,\ \text{Scaling},\ \text{Rotation}_2 \] which are orthonormal, diagonal, and orthonormal respectively.

  1. Two subtypes

    • eigen decomposition: \(\text{Rotation}_1\), Scaling (values can be complex), \(\text{Rotation}_1^{-1}\).
    • singular value decomposition (SVD): \(\text{Rotation}_1\), Scaling (non-negative real values), \(\text{Rotation}_2\).

    Notice: symmetric (positive semidefinite) matrices have the same ED and SVD.

  2. Definition

    • eigenvectors: the principal axes; a vector along an eigenvector keeps its direction (up to scaling) between input and output.

    • eigenvalues: the scaling factors along those axes.

    • spectrum: the set of eigenvalues.

  3. Other

    • sum of eigenvalues = trace of \(A\) = trace of the diagonal eigenvalue matrix (checked numerically in the sketch after this list).
    • product of eigenvalues = determinant of \(A\).
  4. An intuition about the determinant \[ |\det(A)| = \frac{\mathrm{Vol}(A(\text{shape}))}{\mathrm{Vol}(\text{shape})} \] It gives an overall sense of how much the matrix expands or shrinks the space.

    We will encounter this when applying linear transformations to probability distributions, because we want the total volume (probability mass) of the space to stay equal to 1 (by dividing the original one by its determinant (?)).
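A numerical sketch of these facts; the symmetric matrix below is an arbitrary example, and `np.linalg.eig`, `np.linalg.svd`, and `np.linalg.det` do the work:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])   # symmetric positive definite, so ED and SVD coincide

eigvals, eigvecs = np.linalg.eig(A)
assert np.isclose(eigvals.sum(), np.trace(A))          # sum of eigenvalues = trace
assert np.isclose(eigvals.prod(), np.linalg.det(A))    # product of eigenvalues = det

U, S, Vt = np.linalg.svd(A)                            # Rotation1, Scaling, Rotation2
assert np.allclose(U @ np.diag(S) @ Vt, A)
assert np.allclose(U.T @ U, np.eye(2))                 # rotation factors are orthonormal

# |det(A)| is the factor by which A scales volume (here, area)
print(abs(np.linalg.det(A)))                           # 5.0
```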

Questions

  • How does the rank of \(X\) influence parameter selection?
  • How can the determinant be applied when doing a linear transformation of a probability distribution?

Vocab

  • eigen vector 特征向量
  • expected value of a random variable 期望
  • variance of a random variable 方差
  • quadratic function 二次方程
  • oscillation 摆动
  • fiddle with 摆弄
  • n dimensional real space n维实数空间
  • diagonal matrix 对角矩阵
  • identity matrix 单位矩阵
  • symmetric matrix 对称矩阵
  • orthogonal 正交的
  • scalar 标量
  • bijection 双射
  • determinant 行列式

Reference