Matrix Calculus

Introduction

In this post, we solve a whole family of matrix derivative problems in a unified way via matrix differentials plus the trace trick. We denote a scalar, a vector, and a matrix by a lowercase letter $x$, a bold lowercase letter $\boldsymbol{x}$, and a bold uppercase letter $\boldsymbol{X}$, respectively. Writing $x_i$ for the entries of a vector, $x_{i,j}$ for the entries of a matrix, and $f(\boldsymbol{X})$ for a scalar function of the matrix $\boldsymbol{X}$, we can intuitively define the derivative of $f$ with respect to $\boldsymbol{X}$ as:

$$\nabla_{\boldsymbol{X}}f = \begin{bmatrix} \frac{\partial f}{\partial x_{1,1}} & \cdots & \frac{\partial f}{\partial x_{1,n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial x_{m,1}} & \cdots & \frac{\partial f}{\partial x_{m,n}} \end{bmatrix}$$

Furthermore, we can connect this derivative with the matrix differential as follows:

$$\text{d}f=\sum_{i=1}^m\sum_{j=1}^n\frac{\partial f}{\partial x_{i,j}}\text{d}x_{i,j}=\text{tr}[(\nabla_{\boldsymbol{X}}f)^\top\text{d}\boldsymbol{X}].$$

Operational Rules for Matrix Differentials

  • $\color{green}\text{d}(\boldsymbol{X}\pm\boldsymbol{Y})=\text{d}\boldsymbol{X}\pm\text{d}\boldsymbol{Y}$
  • $\color{green}\text{d}(\boldsymbol{X}\boldsymbol{Y})=(\text{d}\boldsymbol{X})\boldsymbol{Y}+\boldsymbol{X}\text{d}\boldsymbol{Y}$
  • $\color{green}\text{d}(\boldsymbol{X}^\top) = (\text{d}\boldsymbol{X})^\top$
  • $\color{green}\text{d}(\text{tr}(\boldsymbol{X})) = \text{tr}(\text{d}\boldsymbol{X})$
  • $\color{green}\text{d}\boldsymbol{X}^{-1} = -\boldsymbol{X}^{-1}(\text{d}\boldsymbol{X})\boldsymbol{X}^{-1}$
  • $\color{green}\text{d}|\boldsymbol{X}| = |\boldsymbol{X}|\text{tr}(\boldsymbol{X}^{-1}\text{d}\boldsymbol{X})$
  • $\color{green}\text{d}(\boldsymbol{X}\odot\boldsymbol{Y}) = \text{d}\boldsymbol{X}\odot\boldsymbol{Y} + \boldsymbol{X}\odot\text{d}\boldsymbol{Y}$
  • $\color{green}\text{d}\sigma(\boldsymbol{X}) = \sigma^{\prime}(\boldsymbol{X})\odot\text{d}\boldsymbol{X}$, where $\sigma(\cdot)$ is applied elementwise
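These rules lend themselves to quick numerical sanity checks. The following NumPy sketch (our illustration, not part of the original posts) verifies the determinant and inverse rules to first order under a small random perturbation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
X = rng.standard_normal((n, n)) + n * np.eye(n)  # diagonal shift keeps X well-conditioned
dX = 1e-6 * rng.standard_normal((n, n))          # stand-in for an infinitesimal perturbation

# d|X| = |X| tr(X^{-1} dX): both sides agree to first order
lhs = np.linalg.det(X + dX) - np.linalg.det(X)
rhs = np.linalg.det(X) * np.trace(np.linalg.inv(X) @ dX)
print(abs(lhs - rhs))                            # second-order small

# dX^{-1} = -X^{-1} (dX) X^{-1}
lhs = np.linalg.inv(X + dX) - np.linalg.inv(X)
rhs = -np.linalg.inv(X) @ dX @ np.linalg.inv(X)
print(np.max(np.abs(lhs - rhs)))                 # second-order small
```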

Derivative of a Scalar with Respect to a Matrix

Trace Trick

  • $\color{green}a=\text{tr}(a)$

  • $\color{green}\text{tr}(\boldsymbol{A})=\text{tr}(\boldsymbol{A}^\top)$

  • $\color{green}\text{tr}(\boldsymbol{A}\pm\boldsymbol{B}) = \text{tr}(\boldsymbol{A}) \pm \text{tr}(\boldsymbol{B})$

  • $\color{green}\text{tr}(\boldsymbol{A}^\top\boldsymbol{B}) = \sum_i\sum_j a_{i,j}b_{i,j}$

  • $\color{green}\text{tr}(\boldsymbol{A}^\top\boldsymbol{B}) = \text{tr}(\boldsymbol{B}\boldsymbol{A}^\top) = \text{tr}(\boldsymbol{B}^\top\boldsymbol{A}) = \text{tr}(\boldsymbol{A}\boldsymbol{B}^\top)$

  • $\color{green}\text{tr}[\boldsymbol{A}^\top(\boldsymbol{B}\odot\boldsymbol{C})] = \text{tr}[(\boldsymbol{A}\odot\boldsymbol{B})^\top\boldsymbol{C}] = \sum_i\sum_j a_{i,j}b_{i,j}c_{i,j}$
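As with the differential rules, these identities are easy to spot-check on random matrices (again a sketch of ours, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A, B, C = (rng.standard_normal((3, 4)) for _ in range(3))

print(np.trace(A.T @ B), np.sum(A * B))                  # tr(A^T B) = Σ_ij a_ij b_ij
print(np.trace(A.T @ B), np.trace(B @ A.T),
      np.trace(B.T @ A), np.trace(A @ B.T))              # all four forms coincide
print(np.trace(A.T @ (B * C)), np.trace((A * B).T @ C))  # Hadamard product moves across tr
```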

Example 1. Constructed Function No. 1

$f = \boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b}$, $\boldsymbol{a}\in\mathbb{R}^{m\times 1}$, $\boldsymbol{X}\in\mathbb{R}^{m\times n}$, $\boldsymbol{b}\in\mathbb{R}^{n\times 1}$.

Process:

$\text{d}f$

$= \text{d}(\boldsymbol{a}^\top\boldsymbol{X}\boldsymbol{b})$

$= \text{tr}[\text{d}(\boldsymbol{a}^\top\boldsymbol{X}\boldsymbol{b})]$

$= \text{tr}[\boldsymbol{a}^\top(\text{d}\boldsymbol{X})\boldsymbol{b}]$

$= \text{tr}(\boldsymbol{b}\boldsymbol{a}^\top\text{d}\boldsymbol{X})$

$= \text{tr}[(\boldsymbol{a}\boldsymbol{b}^\top)^\top\text{d}\boldsymbol{X}]$

Answer:

$\frac{\partial f}{\partial \boldsymbol{X}} = \boldsymbol{a}\boldsymbol{b}^\top$
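A central-difference gradient check confirms the result; `num_grad` below is a small helper of our own (not from the original posts), reused in the later checks:

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    """Central-difference gradient of a scalar function f at matrix X."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(2)
m, n = 3, 4
a, b = rng.standard_normal(m), rng.standard_normal(n)
X = rng.standard_normal((m, n))

f = lambda X: a @ X @ b
print(np.max(np.abs(num_grad(f, X) - np.outer(a, b))))  # ≈ 0: ∂f/∂X = a b^T
```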

Example 2. Constructed Function No. 2

$f = \boldsymbol{a}^\top e^{\boldsymbol{X}\boldsymbol{b}}$, $\boldsymbol{a}\in\mathbb{R}^{m\times 1}$, $\boldsymbol{X}\in\mathbb{R}^{m\times n}$, $\boldsymbol{b}\in\mathbb{R}^{n\times 1}$.

Process:

$\text{d}f$

$= \text{d}(\boldsymbol{a}^\top e^{\boldsymbol{X}\boldsymbol{b}}) = \text{tr}[\boldsymbol{a}^\top\text{d}(e^{\boldsymbol{X}\boldsymbol{b}})]$

$= \text{tr}\{\boldsymbol{a}^\top [e^{\boldsymbol{X}\boldsymbol{b}}\odot\text{d}(\boldsymbol{X}\boldsymbol{b})]\}$

$= \text{tr}[(\boldsymbol{a}\odot e^{\boldsymbol{X}\boldsymbol{b}})^\top \text{d}(\boldsymbol{X}\boldsymbol{b})]$

$= \text{tr}[(\boldsymbol{a}\odot e^{\boldsymbol{X}\boldsymbol{b}})^\top (\text{d}\boldsymbol{X})\boldsymbol{b}]$

$= \text{tr}[\boldsymbol{b}(\boldsymbol{a}\odot e^{\boldsymbol{X}\boldsymbol{b}})^\top \text{d}\boldsymbol{X}]$

$= \text{tr}\{[(\boldsymbol{a}\odot e^{\boldsymbol{X}\boldsymbol{b}})\boldsymbol{b}^\top]^\top \text{d}\boldsymbol{X}\}$

Answer:

$\frac{\partial f}{\partial \boldsymbol{X}} = (\boldsymbol{a}\odot e^{\boldsymbol{X}\boldsymbol{b}})\boldsymbol{b}^\top$
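The same kind of check works here (again using our `num_grad` helper from Example 1):

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X); E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(3)
m, n = 3, 4
a, b = rng.standard_normal(m), rng.standard_normal(n)
X = rng.standard_normal((m, n))

f = lambda X: a @ np.exp(X @ b)
grad = np.outer(a * np.exp(X @ b), b)         # (a ⊙ e^{Xb}) b^T
print(np.max(np.abs(num_grad(f, X) - grad)))  # ≈ 0
```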

Example 3. Constructed Function No. 3

$f = \text{tr}(\boldsymbol{Y}^\top \boldsymbol{M}\boldsymbol{Y})$, $\boldsymbol{Y} = \sigma(\boldsymbol{W}\boldsymbol{X})$, $\boldsymbol{W}\in\mathbb{R}^{l\times m}$, $\boldsymbol{X}\in\mathbb{R}^{m\times n}$, $\boldsymbol{M}\in\mathbb{R}^{l\times l}$.

Process:

$\text{d}f$

$= \text{tr}[\text{d}(\boldsymbol{Y}^\top\boldsymbol{M}\boldsymbol{Y})]$

$= \text{tr}[\text{d}(\boldsymbol{Y}^\top)\boldsymbol{M}\boldsymbol{Y} + \boldsymbol{Y}^\top\boldsymbol{M}\text{d}\boldsymbol{Y}]$

$= \text{tr}[(\text{d}\boldsymbol{Y})^\top\boldsymbol{M}\boldsymbol{Y}] + \text{tr}(\boldsymbol{Y}^\top\boldsymbol{M}\text{d}\boldsymbol{Y})$

$= \text{tr}(\boldsymbol{Y}^\top\boldsymbol{M}^\top\text{d}\boldsymbol{Y}) + \text{tr}(\boldsymbol{Y}^\top\boldsymbol{M}\text{d}\boldsymbol{Y})$

$= \text{tr}[\boldsymbol{Y}^\top(\boldsymbol{M}^\top + \boldsymbol{M})\text{d}\boldsymbol{Y}]$

$= \text{tr}[\boldsymbol{Y}^\top(\boldsymbol{M}^\top + \boldsymbol{M})\text{d}\sigma(\boldsymbol{W}\boldsymbol{X})]$

$= \text{tr}\{\boldsymbol{Y}^\top(\boldsymbol{M}^\top + \boldsymbol{M})[\sigma^{\prime}(\boldsymbol{W}\boldsymbol{X}) \odot \text{d}(\boldsymbol{W}\boldsymbol{X})]\}$

$= \text{tr}\left\{\{[(\boldsymbol{M}^\top + \boldsymbol{M})\boldsymbol{Y}]\odot\sigma^{\prime}(\boldsymbol{W}\boldsymbol{X})\}^\top\text{d}(\boldsymbol{W}\boldsymbol{X})\right\}$ (using $(\boldsymbol{M}^\top + \boldsymbol{M})^\top = \boldsymbol{M}^\top + \boldsymbol{M}$, since this matrix is symmetric)

$= \text{tr}\left\{\{[(\boldsymbol{M}^\top + \boldsymbol{M})\boldsymbol{Y}]\odot\sigma^{\prime}(\boldsymbol{W}\boldsymbol{X})\}^\top\boldsymbol{W}\text{d}\boldsymbol{X}\right\}$

$= \text{tr}\left\{\left\{\boldsymbol{W}^\top\{[(\boldsymbol{M}^\top + \boldsymbol{M})\boldsymbol{Y}]\odot\sigma^{\prime}(\boldsymbol{W}\boldsymbol{X})\}\right\}^\top\text{d}\boldsymbol{X}\right\}$

Answer:

$\frac{\partial f}{\partial \boldsymbol{X}} = \boldsymbol{W}^\top\{[(\boldsymbol{M} + \boldsymbol{M}^\top)\sigma(\boldsymbol{W}\boldsymbol{X})] \odot \sigma^{\prime}(\boldsymbol{W}\boldsymbol{X})\}$
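To check this numerically we must pick a concrete elementwise function; the sketch below assumes $\sigma = \tanh$ (so $\sigma^{\prime}(z) = 1 - \tanh^2 z$), which is our choice for illustration only:

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X); E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(4)
l, m, n = 3, 4, 5
W = rng.standard_normal((l, m))
M = rng.standard_normal((l, l))
X = rng.standard_normal((m, n))
sigma, dsigma = np.tanh, lambda Z: 1 - np.tanh(Z) ** 2   # elementwise σ and σ'

f = lambda X: np.trace(sigma(W @ X).T @ M @ sigma(W @ X))
grad = W.T @ (((M + M.T) @ sigma(W @ X)) * dsigma(W @ X))
print(np.max(np.abs(num_grad(f, X) - grad)))             # ≈ 0
```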

Example 4. Linear Regression

$\ell = \|\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}\|_2^2$, $\boldsymbol{X}\in\mathbb{R}^{m\times n}$, $\boldsymbol{w}\in\mathbb{R}^{n\times 1}$, $\boldsymbol{y}\in\mathbb{R}^{m\times 1}$.

Process:

$\text{d}\ell$

$= \text{d}[(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y})^\top(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y})]$

$= \text{tr}\{[\text{d}(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y})]^\top(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}) + (\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y})^\top\text{d}(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y})\}$

$= \text{tr}[(\boldsymbol{X}\text{d}\boldsymbol{w})^\top(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}) + (\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y})^\top\boldsymbol{X}\text{d}\boldsymbol{w}]$

$= \text{tr}[(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y})^\top\boldsymbol{X}\text{d}\boldsymbol{w} + (\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y})^\top\boldsymbol{X}\text{d}\boldsymbol{w}]$

$= \text{tr}[2(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y})^\top\boldsymbol{X}\text{d}\boldsymbol{w}]$

$= \text{tr}\{2[\boldsymbol{X}^\top(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y})]^\top\text{d}\boldsymbol{w}\}$

Answer:

$\frac{\partial \ell}{\partial \boldsymbol{w}} = 2\boldsymbol{X}^\top\boldsymbol{X}\boldsymbol{w} - 2\boldsymbol{X}^\top\boldsymbol{y}$; setting the gradient to zero yields $\boldsymbol{w} = (\boldsymbol{X}^\top\boldsymbol{X})^{-1}\boldsymbol{X}^\top\boldsymbol{y}$
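A quick numerical confirmation of both the gradient and the closed-form solution (the latter checked against `numpy.linalg.lstsq`); a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 50, 4
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)
w = rng.standard_normal(n)

grad = 2 * X.T @ X @ w - 2 * X.T @ y            # ∂ℓ/∂w from above
eps = 1e-6
num = np.array([(np.sum((X @ (w + eps * e) - y) ** 2)
               - np.sum((X @ (w - eps * e) - y) ** 2)) / (2 * eps)
                for e in np.eye(n)])
print(np.max(np.abs(num - grad)))               # ≈ 0

w_star = np.linalg.solve(X.T @ X, X.T @ y)      # normal equations
print(np.allclose(w_star, np.linalg.lstsq(X, y, rcond=None)[0]))  # True
```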

Example 5. Maximum Likelihood Estimation (MLE) for the Multivariate Normal Distribution

Given samples $\boldsymbol{x}_1, \cdots, \boldsymbol{x}_N \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the negative log-likelihood of $\boldsymbol{\Sigma}$ (up to additive constants and a positive scale factor) is $\ell(\boldsymbol{\Sigma}) = \log|\boldsymbol{\Sigma}| + \frac{1}{N}\sum_{i=1}^N(\boldsymbol{x}_i - \overline{\boldsymbol{x}})^\top\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}_i - \overline{\boldsymbol{x}})$, where $\overline{\boldsymbol{x}} = \frac{1}{N}\sum_{i=1}^N\boldsymbol{x}_i$ and $\boldsymbol{\Sigma}\in\mathbb{R}^{m\times m}$.

Process:

$\text{d}\ell$

$= \text{tr}[\text{d}(\log|\boldsymbol{\Sigma}|) + \frac{1}{N}\sum_{i=1}^N(\boldsymbol{x}_i - \overline{\boldsymbol{x}})^\top(\text{d}\boldsymbol{\Sigma}^{-1})(\boldsymbol{x}_i - \overline{\boldsymbol{x}})]$

$= \text{tr}(\frac{1}{|\boldsymbol{\Sigma}|}\text{d}|\boldsymbol{\Sigma}|) + \frac{1}{N}\sum_{i=1}^N\text{tr}[(\boldsymbol{x}_i - \overline{\boldsymbol{x}})^\top(\text{d}\boldsymbol{\Sigma}^{-1})(\boldsymbol{x}_i - \overline{\boldsymbol{x}})]$

$= \frac{1}{|\boldsymbol{\Sigma}|}|\boldsymbol{\Sigma}|\text{tr}(\boldsymbol{\Sigma}^{-1}\text{d}\boldsymbol{\Sigma}) + \frac{1}{N}\sum_{i=1}^N\text{tr}[(\boldsymbol{x}_i - \overline{\boldsymbol{x}})(\boldsymbol{x}_i - \overline{\boldsymbol{x}})^\top\text{d}\boldsymbol{\Sigma}^{-1}]$

$= \text{tr}(\boldsymbol{\Sigma}^{-1}\text{d}\boldsymbol{\Sigma}) - \frac{1}{N}\sum_{i=1}^N\text{tr}\{(\boldsymbol{x}_i - \overline{\boldsymbol{x}})(\boldsymbol{x}_i - \overline{\boldsymbol{x}})^\top[\boldsymbol{\Sigma}^{-1}(\text{d}\boldsymbol{\Sigma})\boldsymbol{\Sigma}^{-1}]\}$

$= \text{tr}(\boldsymbol{\Sigma}^{-1}\text{d}\boldsymbol{\Sigma}) - \frac{1}{N}\sum_{i=1}^N\text{tr}[\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}_i - \overline{\boldsymbol{x}})(\boldsymbol{x}_i - \overline{\boldsymbol{x}})^\top\boldsymbol{\Sigma}^{-1}\text{d}\boldsymbol{\Sigma}]$

$= \text{tr}\left\{[\boldsymbol{\Sigma}^{-1} - \frac{1}{N}\sum_{i=1}^N \boldsymbol{\Sigma}^{-1}(\boldsymbol{x}_i - \overline{\boldsymbol{x}})(\boldsymbol{x}_i - \overline{\boldsymbol{x}})^\top\boldsymbol{\Sigma}^{-1}]\text{d}\boldsymbol{\Sigma}\right\}$

Answer:

$\frac{\partial \ell}{\partial \boldsymbol{\Sigma}} = \{\boldsymbol{\Sigma}^{-1} - \frac{1}{N}\sum_{i=1}^N\boldsymbol{\Sigma}^{-1}[(\boldsymbol{x}_i - \overline{\boldsymbol{x}})(\boldsymbol{x}_i - \overline{\boldsymbol{x}})^\top]\boldsymbol{\Sigma}^{-1}\}^\top$
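The sketch below (ours) checks the gradient formula by finite differences, and also confirms that the gradient vanishes at the sample covariance $\boldsymbol{S} = \frac{1}{N}\sum_i(\boldsymbol{x}_i-\overline{\boldsymbol{x}})(\boldsymbol{x}_i-\overline{\boldsymbol{x}})^\top$, which is exactly the MLE of $\boldsymbol{\Sigma}$:

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X); E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(6)
m, N = 3, 200
xs = rng.standard_normal((N, m))
xbar = xs.mean(axis=0)
S = (xs - xbar).T @ (xs - xbar) / N             # sample covariance

def ell(Sig):
    Sinv = np.linalg.inv(Sig)
    return (np.log(np.linalg.det(Sig))
            + np.mean([(x - xbar) @ Sinv @ (x - xbar) for x in xs]))

Sig = np.cov(xs.T) + np.eye(m)                  # an arbitrary SPD test point
Sinv = np.linalg.inv(Sig)
grad = (Sinv - Sinv @ S @ Sinv).T               # formula from the answer
print(np.max(np.abs(num_grad(ell, Sig) - grad)))  # ≈ 0

Sinv_S = np.linalg.inv(S)
print(np.max(np.abs(Sinv_S - Sinv_S @ S @ Sinv_S)))  # ≈ 0: gradient vanishes at Σ = S
```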

Example 6. Multinomial Logistic Regression

$\ell = -\boldsymbol{y}^\top \log\text{softmax}(\boldsymbol{W}\boldsymbol{x})$, where $\boldsymbol{y}\in\mathbb{R}^{m\times 1}$ is a one-hot encoding vector of size $m$, $\boldsymbol{W}\in\mathbb{R}^{m\times n}$, $\boldsymbol{x}\in\mathbb{R}^{n\times 1}$, and $\text{softmax}(\boldsymbol{a}) = \frac{e^{\boldsymbol{a}}}{\boldsymbol{1}^\top e^{\boldsymbol{a}}}$.

Process:

$\ell$

$= -\boldsymbol{y}^\top\log\text{softmax}(\boldsymbol{W}\boldsymbol{x})$

$= -\boldsymbol{y}^\top\log\frac{e^{\boldsymbol{W}\boldsymbol{x}}}{\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}}}$

$= -\boldsymbol{y}^\top(\boldsymbol{W}\boldsymbol{x}) + \boldsymbol{y}^\top\log[\boldsymbol{1}(\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}})]$

$= -\boldsymbol{y}^\top(\boldsymbol{W}\boldsymbol{x}) + (\boldsymbol{y}^\top\boldsymbol{1})\log(\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}})$

$= -\boldsymbol{y}^\top(\boldsymbol{W}\boldsymbol{x}) + \log(\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}})$ (since $\boldsymbol{y}^\top\boldsymbol{1} = 1$ for a one-hot $\boldsymbol{y}$)

$\text{d}\ell$

$= \text{tr}[-\boldsymbol{y}^\top(\text{d}\boldsymbol{W})\boldsymbol{x} + \frac{1}{\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}}}\text{d}(\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}})]$

$= \text{tr}[-\boldsymbol{y}^\top(\text{d}\boldsymbol{W})\boldsymbol{x} + \frac{1}{\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}}}\boldsymbol{1}^\top\text{d}e^{\boldsymbol{W}\boldsymbol{x}}]$

$= \text{tr}\{-\boldsymbol{y}^\top(\text{d}\boldsymbol{W})\boldsymbol{x} + \frac{1}{\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}}}\boldsymbol{1}^\top[e^{\boldsymbol{W}\boldsymbol{x}}\odot\text{d}(\boldsymbol{W}\boldsymbol{x})]\}$

$= \text{tr}[-\boldsymbol{y}^\top(\text{d}\boldsymbol{W})\boldsymbol{x}] + \frac{1}{\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}}}\text{tr}\{\boldsymbol{1}^\top[e^{\boldsymbol{W}\boldsymbol{x}}\odot\text{d}(\boldsymbol{W}\boldsymbol{x})]\}$

$= \text{tr}[-\boldsymbol{y}^\top(\text{d}\boldsymbol{W})\boldsymbol{x}] + \frac{1}{\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}}}\text{tr}[(\boldsymbol{1}\odot e^{\boldsymbol{W}\boldsymbol{x}})^\top\text{d}(\boldsymbol{W}\boldsymbol{x})]$

$= \text{tr}(-\boldsymbol{x}\boldsymbol{y}^\top\text{d}\boldsymbol{W}) + \frac{1}{\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}}}\text{tr}[(\boldsymbol{1}\odot e^{\boldsymbol{W}\boldsymbol{x}})^\top(\text{d}\boldsymbol{W})\boldsymbol{x}]$

$= \text{tr}\{[-\boldsymbol{x}\boldsymbol{y}^\top + \boldsymbol{x}\frac{(e^{\boldsymbol{W}\boldsymbol{x}})^\top}{\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}}}]\text{d}\boldsymbol{W}\}$

$= \text{tr}\left\{[(\frac{e^{\boldsymbol{W}\boldsymbol{x}}}{\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}}} - \boldsymbol{y})\boldsymbol{x}^\top]^\top\text{d}\boldsymbol{W}\right\}$

Answer:

$\frac{\partial \ell}{\partial \boldsymbol{W}} = [\text{softmax}(\boldsymbol{W}\boldsymbol{x}) - \boldsymbol{y}]\boldsymbol{x}^\top$
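A numerical check of the softmax-regression gradient (a sketch using the `num_grad` helper from Example 1):

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X); E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(7)
m, n = 5, 4
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)
y = np.eye(m)[2]                                  # one-hot label

softmax = lambda a: np.exp(a) / np.sum(np.exp(a))
ell = lambda W: -y @ np.log(softmax(W @ x))
grad = np.outer(softmax(W @ x) - y, x)            # [softmax(Wx) - y] x^T
print(np.max(np.abs(num_grad(ell, W) - grad)))    # ≈ 0
```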

Example 7. Back Propagation (BP) for a Multi-Layer Perceptron (MLP)

$\ell = -\sum_{i=1}^N \boldsymbol{y}_i^\top\log\text{softmax}[\boldsymbol{W}_2 \sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2]$, where $\boldsymbol{x}_i \in \mathbb{R}^{n\times 1}$, $\boldsymbol{y}_i$ is a one-hot encoding vector of size $m$, $\boldsymbol{W}_1 \in \mathbb{R}^{p\times n}$, $\boldsymbol{b}_1 \in \mathbb{R}^{p\times 1}$, $\boldsymbol{W}_2 \in \mathbb{R}^{m\times p}$, and $\boldsymbol{b}_2 \in \mathbb{R}^{m\times 1}$.

Process:

$\ell$

$= -\sum_{i=1}^N\boldsymbol{y}_i^\top\log\text{softmax}[\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2]$

$= -\sum_{i=1}^N \boldsymbol{y}_i^\top \log\frac{e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}{\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}$

$= -\sum_{i=1}^N \boldsymbol{y}_i^\top[\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2 - \boldsymbol{1}\log(\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2})]$

$= -\sum_{i=1}^N\left\{\boldsymbol{y}_i^\top[\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2] - (\boldsymbol{y}_i^\top\boldsymbol{1})\log(\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2})\right\}$

$= -\sum_{i=1}^N\left\{\boldsymbol{y}_i^\top[\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2] - \log(\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2})\right\}$

$\text{d}\ell$

$= -\sum_{i=1}^N \text{tr}\left\{\boldsymbol{y}_i^\top\text{d}[\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2] - \frac{1}{\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}\text{d}[\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}]\right\}$

$\overset{\text{w.r.t.}~\boldsymbol{W}_2}{=====} -\sum_{i=1}^N \text{tr}\left\{\boldsymbol{y}_i^\top(\text{d}\boldsymbol{W}_2)\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) - \frac{1}{\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}[\boldsymbol{1}^\top \text{d}e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}]\right\}$

$\overset{\text{w.r.t.}~\boldsymbol{W}_2}{=====} -\sum_{i=1}^N \text{tr}\left\{\boldsymbol{y}_i^\top(\text{d}\boldsymbol{W}_2)\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) - \frac{\boldsymbol{1}^\top\{e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2} \odot \text{d}[\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1)]\}}{\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}\right\}$

$\overset{\text{w.r.t.}~\boldsymbol{W}_2}{=====} -\sum_{i=1}^N \text{tr}\left\{\boldsymbol{y}_i^\top(\text{d}\boldsymbol{W}_2)\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) - \frac{(e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2})^\top\text{d}[\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1)]}{\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}\right\}$

$\overset{\text{w.r.t.}~\boldsymbol{W}_2}{=====} -\sum_{i=1}^N \text{tr}\left\{\boldsymbol{y}_i^\top(\text{d}\boldsymbol{W}_2)\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) - \frac{(e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2})^\top(\text{d}\boldsymbol{W}_2)\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1)}{\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}\right\}$

$\overset{\text{w.r.t.}~\boldsymbol{W}_2}{=====} -\sum_{i=1}^N \text{tr}\left\{\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1)\boldsymbol{y}_i^\top\text{d}\boldsymbol{W}_2 - \frac{\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1)(e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2})^\top\text{d}\boldsymbol{W}_2}{\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}\right\}$

$\overset{\text{w.r.t.}~\boldsymbol{W}_2}{=====} \sum_{i=1}^N \text{tr}\left\{\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1)[\frac{(e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2})^\top}{\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}} - \boldsymbol{y}_i^\top]\text{d}\boldsymbol{W}_2\right\}$

$\overset{\text{w.r.t.}~\boldsymbol{W}_2}{=====} \sum_{i=1}^N \text{tr}\left\{\{[\text{softmax}(\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2) - \boldsymbol{y}_i]\sigma^\top(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1)\}^\top\text{d}\boldsymbol{W}_2\right\}$

$\text{d}\ell = -\sum_{i=1}^N \text{tr}\left\{\boldsymbol{y}_i^\top\text{d}[\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2] - \frac{1}{\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}\text{d}[\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}]\right\}$

$\overset{\text{w.r.t.}~\boldsymbol{b}_2}{=====} -\sum_{i=1}^N \text{tr}\left\{\boldsymbol{y}_i^\top\text{d}\boldsymbol{b}_2 - \frac{1}{\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}[\boldsymbol{1}^\top \text{d}e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}]\right\}$

$\overset{\text{w.r.t.}~\boldsymbol{b}_2}{=====} -\sum_{i=1}^N \text{tr}\left\{\boldsymbol{y}_i^\top\text{d}\boldsymbol{b}_2 - \frac{\boldsymbol{1}^\top\{e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2} \odot \text{d}[\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2]\}}{\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}\right\}$

$\overset{\text{w.r.t.}~\boldsymbol{b}_2}{=====} -\sum_{i=1}^N \text{tr}\left\{\boldsymbol{y}_i^\top\text{d}\boldsymbol{b}_2 - \frac{[e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}]^\top\text{d}[\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2]}{\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}\right\}$

$\overset{\text{w.r.t.}~\boldsymbol{b}_2}{=====} -\sum_{i=1}^N \text{tr}\left\{\boldsymbol{y}_i^\top\text{d}\boldsymbol{b}_2 - \frac{[e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}]^\top\text{d}\boldsymbol{b}_2}{\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}\right\}$

$\overset{\text{w.r.t.}~\boldsymbol{b}_2}{=====} \sum_{i=1}^N \text{tr}\left\{(\frac{e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}{\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}} - \boldsymbol{y}_i)^\top\text{d}\boldsymbol{b}_2\right\}$

$\overset{\text{w.r.t.}~\boldsymbol{b}_2}{=====} \sum_{i=1}^N \text{tr}\left\{[\text{softmax}(\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2) - \boldsymbol{y}_i]^\top\text{d}\boldsymbol{b}_2\right\}$

$\text{d}\ell = -\sum_{i=1}^N \text{tr}\left\{\boldsymbol{y}_i^\top\text{d}[\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2] - \frac{1}{\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}\text{d}[\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}]\right\}$

$\overset{\text{w.r.t.}~\boldsymbol{W}_1}{=====} -\sum_{i=1}^N \text{tr}\left\{\boldsymbol{y}_i^\top\boldsymbol{W}_2\text{d}\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) - \frac{1}{\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}[\boldsymbol{1}^\top \text{d}e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}]\right\}$

$\overset{\text{w.r.t.}~\boldsymbol{W}_1}{=====} -\sum_{i=1}^N \text{tr}\left\{\{\boldsymbol{y}_i^\top - \frac{[e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}]^\top}{\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}\}\boldsymbol{W}_2[\sigma^{\prime}(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) \odot\text{d}(\boldsymbol{W}_1\boldsymbol{x}_i)]\right\}$

$\overset{\text{w.r.t.}~\boldsymbol{W}_1}{=====} -\sum_{i=1}^N \text{tr}\left\{\{\boldsymbol{W}_2^\top[\boldsymbol{y}_i - \frac{e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}{\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}]\}^\top[\sigma^{\prime}(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) \odot\text{d}(\boldsymbol{W}_1\boldsymbol{x}_i)]\right\}$

$\overset{\text{w.r.t.}~\boldsymbol{W}_1}{=====} -\sum_{i=1}^N \text{tr}\left\{\left\{\{\boldsymbol{W}_2^\top[\boldsymbol{y}_i - \text{softmax}(\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2)]\}\odot\sigma^{\prime}(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1)\right\}^\top\text{d}(\boldsymbol{W}_1\boldsymbol{x}_i)\right\}$

$\overset{\text{w.r.t.}~\boldsymbol{W}_1}{=====} \sum_{i=1}^N \text{tr}\left\{\left\{\{\boldsymbol{W}_2^\top[\text{softmax}(\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2) - \boldsymbol{y}_i]\}\odot\sigma^{\prime}(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1)\cdot\boldsymbol{x}_i^\top\right\}^\top\text{d}\boldsymbol{W}_1\right\}$

$\text{d}\ell = -\sum_{i=1}^N \text{tr}\left\{\boldsymbol{y}_i^\top\text{d}[\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2] - \frac{1}{\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}\text{d}[\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}]\right\}$

$\overset{\text{w.r.t.}~\boldsymbol{b}_1}{=====} -\sum_{i=1}^N \text{tr}\left\{\boldsymbol{y}_i^\top\boldsymbol{W}_2\text{d}\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) - \frac{1}{\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}[\boldsymbol{1}^\top \text{d}e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}]\right\}$

$\overset{\text{w.r.t.}~\boldsymbol{b}_1}{=====} -\sum_{i=1}^N \text{tr}\left\{\{\boldsymbol{y}_i^\top - \frac{[e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}]^\top}{\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}\}\boldsymbol{W}_2[\sigma^{\prime}(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) \odot\text{d}\boldsymbol{b}_1]\right\}$

$\overset{\text{w.r.t.}~\boldsymbol{b}_1}{=====} -\sum_{i=1}^N \text{tr}\left\{\{\boldsymbol{W}_2^\top[\boldsymbol{y}_i - \frac{e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}{\boldsymbol{1}^\top e^{\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2}}]\}^\top[\sigma^{\prime}(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) \odot\text{d}\boldsymbol{b}_1]\right\}$

$\overset{\text{w.r.t.}~\boldsymbol{b}_1}{=====} \sum_{i=1}^N \text{tr}\left\{\left\{\{\boldsymbol{W}_2^\top[\text{softmax}(\boldsymbol{W}_2\sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2) - \boldsymbol{y}_i]\}\odot\sigma^{\prime}(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1)\right\}^\top\text{d}\boldsymbol{b}_1\right\}$

Answer:

$\frac{\partial \ell}{\partial \boldsymbol{W}_2} = \sum_{i=1}^N \{\text{softmax}[\boldsymbol{W}_2 \sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2] - \boldsymbol{y}_i\} \sigma^\top(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1)$

$\frac{\partial \ell}{\partial \boldsymbol{b}_2} = \sum_{i=1}^N \{\text{softmax}[\boldsymbol{W}_2 \sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2] - \boldsymbol{y}_i\}$

$\frac{\partial \ell}{\partial \boldsymbol{W}_1} = \sum_{i=1}^N \{\{\boldsymbol{W}_2^\top\{\text{softmax}[\boldsymbol{W}_2 \sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2] - \boldsymbol{y}_i\}\} \odot \sigma^{\prime}(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1)\} \boldsymbol{x}_i^\top$

$\frac{\partial \ell}{\partial \boldsymbol{b}_1} = \sum_{i=1}^N \{\boldsymbol{W}_2^\top\{\text{softmax}[\boldsymbol{W}_2 \sigma(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1) + \boldsymbol{b}_2] - \boldsymbol{y}_i\}\} \odot \sigma^{\prime}(\boldsymbol{W}_1\boldsymbol{x}_i + \boldsymbol{b}_1)$
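All four gradients can be verified at once; as in Example 3, a concrete activation must be fixed, and the sketch below assumes $\sigma = \tanh$ (our choice for illustration):

```python
import numpy as np

def num_grad(f, P, eps=1e-6):
    G = np.zeros_like(P)
    for idx in np.ndindex(*P.shape):
        E = np.zeros_like(P); E[idx] = eps
        G[idx] = (f(P + E) - f(P - E)) / (2 * eps)
    return G

rng = np.random.default_rng(8)
n, p, m, N = 4, 5, 3, 6
W1, b1 = rng.standard_normal((p, n)), rng.standard_normal(p)
W2, b2 = rng.standard_normal((m, p)), rng.standard_normal(m)
xs = rng.standard_normal((N, n))
ys = np.eye(m)[rng.integers(0, m, N)]            # one-hot labels

softmax = lambda a: np.exp(a) / np.sum(np.exp(a))

def loss(W1, b1, W2, b2):
    return -sum(y @ np.log(softmax(W2 @ np.tanh(W1 @ x + b1) + b2))
                for x, y in zip(xs, ys))

# closed-form gradients from the answers above, with σ = tanh
gW1, gb1 = np.zeros_like(W1), np.zeros_like(b1)
gW2, gb2 = np.zeros_like(W2), np.zeros_like(b2)
for x, y in zip(xs, ys):
    h = np.tanh(W1 @ x + b1)
    delta2 = softmax(W2 @ h + b2) - y
    delta1 = (W2.T @ delta2) * (1 - h ** 2)      # σ'(z) = 1 - tanh²(z)
    gW2 += np.outer(delta2, h); gb2 += delta2
    gW1 += np.outer(delta1, x); gb1 += delta1

print(np.max(np.abs(num_grad(lambda P: loss(P, b1, W2, b2), W1) - gW1)))  # ≈ 0
print(np.max(np.abs(num_grad(lambda P: loss(W1, P, W2, b2), b1) - gb1)))  # ≈ 0
print(np.max(np.abs(num_grad(lambda P: loss(W1, b1, P, b2), W2) - gW2)))  # ≈ 0
print(np.max(np.abs(num_grad(lambda P: loss(W1, b1, W2, P), b2) - gb2)))  # ≈ 0
```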

Derivative of a Matrix with Respect to a Matrix

Suppose $\boldsymbol{f}(\boldsymbol{x})\in\mathbb{R}^{p\times 1}$ is a function of the vector $\boldsymbol{x}\in\mathbb{R}^{n\times 1}$. Setting aside layout conventions for the moment, we work backwards from the connection between the vector differential and the partial derivatives, $\text{d}\boldsymbol{f}=(\frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}})^\top \text{d}\boldsymbol{x}$, and define the derivative of a column vector with respect to a column vector as:

$$\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_{1}} & \cdots & \frac{\partial f_p}{\partial x_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_1}{\partial x_{n}} & \cdots & \frac{\partial f_p}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{n\times p}$$

This is the denominator layout, which we take as the default. Under the numerator layout, instead,

$$\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_{1}} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_p}{\partial x_{1}} & \cdots & \frac{\partial f_p}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{p\times n}$$

Before defining the derivative of a matrix with respect to a matrix, we first define the (column-major) vectorization of a matrix $\boldsymbol{X}\in\mathbb{R}^{m\times n}$ as $\text{Vec}(\boldsymbol{X})=[x_{1,1},x_{2,1},\cdots,x_{m,1},x_{1,2},\cdots,x_{m,n}]^\top$. With this, we define the derivative of a matrix $\boldsymbol{F}\in\mathbb{R}^{p\times q}$ with respect to a matrix $\boldsymbol{X}\in\mathbb{R}^{m\times n}$ as:

$$\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{X}} = \frac{\partial \text{Vec}(\boldsymbol{F})}{\partial \text{Vec}(\boldsymbol{X})}\in\mathbb{R}^{mn\times pq}$$

At the same time, the connection between the derivative and the differential carries over without loss of generality:

$$\text{Vec}(\text{d}\boldsymbol{F}) = \left(\frac{\partial \text{Vec}(\boldsymbol{F})}{\partial \text{Vec}(\boldsymbol{X})}\right)^\top\text{Vec}(\text{d}\boldsymbol{X}) = \left(\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{X}}\right)^\top\text{Vec}(\text{d}\boldsymbol{X}).$$

FF降级为标量ff时,为了统一表示,我们记fX=Vec(Xf)\frac{\partial f}{\partial \boldsymbol{X}} = \text{Vec}(\nabla_{\boldsymbol{X}}f)。在此基础上求二阶导,得到机器学习中常见的 Hessian matrix:

X2f=2fX2=(Xf)XRmn×mn\nabla^2_{\boldsymbol{X}}f = \frac{\partial^2 f}{\partial\boldsymbol{X}^2} = \frac{\partial (\nabla_{\boldsymbol{X}}f)}{\partial \boldsymbol{X}}\in\mathbb{R}^{mn\times mn}.

Notice 1: Of the numerator and denominator layouts, the one that connects to the differential without loss of generality is the denominator layout; it is the common choice in machine learning and optimization.

Notice 2: Besides the numerator and denominator layouts, matrix-by-matrix derivatives admit other definitions, for example:

$$\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{X}} = \begin{bmatrix} \frac{\partial f_{1,1}}{\partial \boldsymbol{X}} & \cdots & \frac{\partial f_{1,q}}{\partial \boldsymbol{X}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_{p,1}}{\partial \boldsymbol{X}} & \cdots & \frac{\partial f_{p,q}}{\partial \boldsymbol{X}} \end{bmatrix} \quad\text{or}\quad \frac{\partial \boldsymbol{F}}{\partial \boldsymbol{X}} = \begin{bmatrix} \frac{\partial \boldsymbol{F}}{\partial x_{1,1}} & \cdots & \frac{\partial \boldsymbol{F}}{\partial x_{1,n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial \boldsymbol{F}}{\partial x_{m,1}} & \cdots & \frac{\partial \boldsymbol{F}}{\partial x_{m,n}} \end{bmatrix}$$

However, such definitions are only compatible with scalar-by-matrix derivatives and cannot be combined with the differential machinery, so they are not "good" definitions.

Suppose there exists a composition $\boldsymbol{F}(\boldsymbol{Y}(\boldsymbol{X}))$. Then $\text{Vec}(\text{d}\boldsymbol{F}) = {(\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{Y}})}^\top \text{Vec}(\text{d}\boldsymbol{Y}) = {(\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{Y}})}^\top \left( {(\frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X}})}^\top \text{Vec}(\text{d}\boldsymbol{X})\right) = {\left(\frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X}} \frac{\partial \boldsymbol{F}}{\partial \boldsymbol{Y}}\right)}^\top \text{Vec}(\text{d}\boldsymbol{X})$. Hence the chain rule $\frac{\text{d}\boldsymbol{F}}{\text{d}\boldsymbol{X}} = \frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X}} \frac{\partial \boldsymbol{F}}{\partial \boldsymbol{Y}}$.

Vectorization Trick

  • $\color{green}\text{Vec}(\boldsymbol{A} + \boldsymbol{B}) = \text{Vec}(\boldsymbol{A}) + \text{Vec}(\boldsymbol{B})$

  • $\color{green}\text{Vec}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}) = (\boldsymbol{B}^\top\otimes \boldsymbol{A})\text{Vec}(\boldsymbol{X})$, where $\otimes$ denotes the Kronecker product:

    $\color{green}\boldsymbol{A}\otimes\boldsymbol{B} = \begin{bmatrix} a_{1,1}\boldsymbol{B} & \cdots & a_{1,n}\boldsymbol{B}\\ \vdots & \ddots & \vdots\\ a_{m,1}\boldsymbol{B} & \cdots & a_{m,n}\boldsymbol{B} \end{bmatrix}, \quad \boldsymbol{A}\in\mathbb{R}^{m\times n}, \boldsymbol{B}\in\mathbb{R}^{p\times q}$

    • $\color{green}\text{Vec}(\boldsymbol{a}\boldsymbol{b}^\top) = \boldsymbol{b}\otimes\boldsymbol{a}$

  • $\color{green}\text{Vec}(\boldsymbol{A}^\top)=\boldsymbol{K}_{mn}\text{Vec}(\boldsymbol{A})$, where $\boldsymbol{K}_{mn}$ is the commutation matrix ($\sum_i k_{i,j} = \sum_j k_{i,j} = 1$, $\boldsymbol{K}_{mn} = \boldsymbol{K}_{nm}^\top$, $\boldsymbol{K}_{mn}\boldsymbol{K}_{nm} = \boldsymbol{I}$)

  • $\color{green}\boldsymbol{K}_{p,m}(\boldsymbol{A}\otimes\boldsymbol{B})\boldsymbol{K}_{n,q}=\boldsymbol{B}\otimes\boldsymbol{A}$, $\boldsymbol{A}\in\mathbb{R}^{m\times n}, \boldsymbol{B}\in\mathbb{R}^{p\times q}$

    Proof:

    $\text{Vec}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}^\top)=(\boldsymbol{B}\otimes\boldsymbol{A})\text{Vec}(\boldsymbol{X}), \quad \boldsymbol{A}\in\mathbb{R}^{m\times n}, \boldsymbol{X}\in\mathbb{R}^{n\times q}, \boldsymbol{B}\in\mathbb{R}^{p\times q}$

    $\text{Vec}(\boldsymbol{B}\boldsymbol{X}^\top\boldsymbol{A}^\top)=(\boldsymbol{A}\otimes\boldsymbol{B})\text{Vec}(\boldsymbol{X}^\top)=(\boldsymbol{A}\otimes\boldsymbol{B})\boldsymbol{K}_{n,q}\text{Vec}(\boldsymbol{X})$

    $\text{Vec}\left[(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}^\top)^\top\right]=\boldsymbol{K}_{m,p}\text{Vec}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}^\top)=\boldsymbol{K}_{m,p}(\boldsymbol{B}\otimes\boldsymbol{A})\text{Vec}(\boldsymbol{X})$

    Since $(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}^\top)^\top=\boldsymbol{B}\boldsymbol{X}^\top\boldsymbol{A}^\top$, the last two right-hand sides are equal. Thus, $\boldsymbol{K}_{p,m}(\boldsymbol{A}\otimes\boldsymbol{B})\boldsymbol{K}_{n,q}\text{Vec}(\boldsymbol{X})=\boldsymbol{K}_{p,m}\boldsymbol{K}_{m,p}(\boldsymbol{B}\otimes\boldsymbol{A})\text{Vec}(\boldsymbol{X})=(\boldsymbol{B}\otimes\boldsymbol{A})\text{Vec}(\boldsymbol{X})$

  • $\color{green}\text{Vec}(\boldsymbol{A}\odot\boldsymbol{B}) = \text{Diag}(\boldsymbol{A})\text{Vec}(\boldsymbol{B})$, where $\text{Diag}(\boldsymbol{A})$ is the diagonalization operator that places the elements of $\text{Vec}(\boldsymbol{A})$ on the diagonal of an otherwise-zero matrix

    • $\color{green}\boldsymbol{a}\odot\boldsymbol{b}=\text{Vec}(\boldsymbol{a}\odot\boldsymbol{b})=\text{Diag}(\boldsymbol{a})\boldsymbol{b}$

  • $\color{green}(\boldsymbol{A}\otimes\boldsymbol{B})^\top = \boldsymbol{A}^\top\otimes\boldsymbol{B}^\top$

  • $\color{green}(\boldsymbol{A}\otimes\boldsymbol{B})(\boldsymbol{C}\otimes\boldsymbol{D}) = \boldsymbol{AC}\otimes\boldsymbol{BD}$, proved below; a numerical check of all these identities follows the proof.

Proof:

Construct the function $\boldsymbol{F}=\boldsymbol{D}^\top\boldsymbol{B}^\top\boldsymbol{X}\boldsymbol{A}\boldsymbol{C}=\boldsymbol{D}^\top(\boldsymbol{B}^\top\boldsymbol{X}\boldsymbol{A})\boldsymbol{C}$.

$\text{For~}\boldsymbol{F}=\boldsymbol{D}^\top\boldsymbol{B}^\top\boldsymbol{X}\boldsymbol{A}\boldsymbol{C}$: $\text{Vec}(\text{d}\boldsymbol{F})=\text{Vec}[\boldsymbol{D}^\top\boldsymbol{B}^\top(\text{d}\boldsymbol{X})\boldsymbol{A}\boldsymbol{C}]=\left[(\boldsymbol{AC})^\top\otimes(\boldsymbol{D}^\top\boldsymbol{B}^\top)\right]\text{Vec}(\text{d}\boldsymbol{X})=\left[(\boldsymbol{A}\boldsymbol{C})\otimes(\boldsymbol{B}\boldsymbol{D})\right]^\top\text{Vec}(\text{d}\boldsymbol{X})$

$\text{For~}\boldsymbol{F}=\boldsymbol{D}^\top(\boldsymbol{B}^\top\boldsymbol{X}\boldsymbol{A})\boldsymbol{C}$: $\text{Vec}(\text{d}\boldsymbol{F})=\text{Vec}\left[\boldsymbol{D}^\top\text{d}(\boldsymbol{B}^\top\boldsymbol{X}\boldsymbol{A})\boldsymbol{C}\right]=(\boldsymbol{C}^\top\otimes\boldsymbol{D}^\top)\text{Vec}\left[\boldsymbol{B}^\top(\text{d}\boldsymbol{X})\boldsymbol{A}\right]$

$=(\boldsymbol{C}^\top\otimes\boldsymbol{D}^\top)(\boldsymbol{A}^\top\otimes\boldsymbol{B}^\top)\text{Vec}(\text{d}\boldsymbol{X})=\left[(\boldsymbol{A}\otimes\boldsymbol{B})(\boldsymbol{C}\otimes\boldsymbol{D})\right]^\top\text{Vec}(\text{d}\boldsymbol{X})$

$\text{Thus,~}\frac{\text{d}\boldsymbol{F}}{\text{d}\boldsymbol{X}}=(\boldsymbol{A}\boldsymbol{C})\otimes(\boldsymbol{B}\boldsymbol{D})=(\boldsymbol{A}\otimes\boldsymbol{B})(\boldsymbol{C}\otimes\boldsymbol{D})$
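All of the identities above can be checked numerically, keeping in mind that $\text{Vec}$ is column-major (NumPy's `order='F'`); a sketch of ours:

```python
import numpy as np

rng = np.random.default_rng(9)
m, n, q, p = 2, 3, 4, 5
A = rng.standard_normal((m, n))
X = rng.standard_normal((n, q))
B = rng.standard_normal((q, p))
vec = lambda M: M.flatten(order='F')             # column-major, matching Vec above

# Vec(AXB) = (B^T ⊗ A) Vec(X)
print(np.allclose(vec(A @ X @ B), np.kron(B.T, A) @ vec(X)))

# Vec(A^T) = K_{mn} Vec(A), with the commutation matrix built entrywise
K = np.zeros((m * n, m * n))
for i in range(m):
    for j in range(n):
        K[i * n + j, j * m + i] = 1              # a_{i,j} moves position under Vec
print(np.allclose(vec(A.T), K @ vec(A)))

# Vec(A ⊙ B) = Diag(A) Vec(B)
A2, B2 = rng.standard_normal((m, n)), rng.standard_normal((m, n))
print(np.allclose(vec(A2 * B2), np.diag(vec(A2)) @ vec(B2)))

# (A ⊗ B)(C ⊗ D) = AC ⊗ BD
C = rng.standard_normal((n, 2))
D = rng.standard_normal((p, 3))
B3 = rng.standard_normal((q, p))
print(np.allclose(np.kron(A, B3) @ np.kron(C, D), np.kron(A @ C, B3 @ D)))
```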

Example 8. A Constructed Function

$\boldsymbol{F} = \boldsymbol{A}\boldsymbol{X}$, $\boldsymbol{X}\in\mathbb{R}^{m\times n}$

Process:

$\text{Vec}(\text{d}\boldsymbol{F})$

$=\text{Vec}(\boldsymbol{A}\text{d}\boldsymbol{X})$

$=(\boldsymbol{I}_n\otimes\boldsymbol{A})\text{Vec}(\text{d}\boldsymbol{X})$

Answer: $\frac{\text{d}\boldsymbol{F}}{\text{d}\boldsymbol{X}}=\boldsymbol{I}_n\otimes\boldsymbol{A}^\top$
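A numerical Jacobian check (the shape chosen for $\boldsymbol{A}$ is our own, for illustration). Note that the numerator-layout Jacobian of $\text{Vec}(\boldsymbol{F})$ is the transpose of this post's $\frac{\text{d}\boldsymbol{F}}{\text{d}\boldsymbol{X}}$:

```python
import numpy as np

rng = np.random.default_rng(10)
p, m, n = 3, 4, 2
A = rng.standard_normal((p, m))
X = rng.standard_normal((m, n))
vec = lambda M: M.flatten(order='F')

# numerical Jacobian J[α, β] = ∂ Vec(AX)_α / ∂ Vec(X)_β
eps = 1e-6
J = np.zeros((p * n, m * n))
for b in range(m * n):
    E = np.zeros(m * n); E[b] = eps
    dX = E.reshape(m, n, order='F')
    J[:, b] = vec(A @ (X + dX) - A @ (X - dX)) / (2 * eps)

print(np.allclose(J, np.kron(np.eye(n), A)))      # Vec(dF) = (I_n ⊗ A) Vec(dX)
print(np.allclose(J.T, np.kron(np.eye(n), A.T)))  # dF/dX = I_n ⊗ A^T (this post's convention)
```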

Example 9. A Constructed Function

$f = \log|\boldsymbol{X}|$, $\boldsymbol{X}\in\mathbb{R}^{n\times n}$; find the second derivative (the Hessian matrix).

Process:

$\text{d}f$

$= \text{d}\log|\boldsymbol{X}|$

$=\frac{1}{|\boldsymbol{X}|}\text{d}|\boldsymbol{X}|$

$=\frac{1}{|\boldsymbol{X}|}|\boldsymbol{X}|\text{tr}(\boldsymbol{X}^{-1}\text{d}\boldsymbol{X})$

$=\text{tr}(\boldsymbol{X}^{-1}\text{d}\boldsymbol{X})$

Hence, $\frac{\text{d}f}{\text{d}\boldsymbol{X}}=\text{Vec}\left[(\boldsymbol{X}^{-1})^\top\right]$, i.e., $\nabla_{\boldsymbol{X}} f = (\boldsymbol{X}^{-1})^\top=(\boldsymbol{X}^\top)^{-1}$.

$\text{Vec}(\text{d}\nabla_{\boldsymbol{X}}f)$

$=-\text{Vec}\left[(\boldsymbol{X}^\top)^{-1}(\text{d}\boldsymbol{X}^\top)(\boldsymbol{X}^\top)^{-1}\right]$

$= -[\boldsymbol{X}^{-1}\otimes(\boldsymbol{X}^{-1})^\top]\text{Vec}(\text{d}\boldsymbol{X}^\top)$

$= -[\boldsymbol{X}^{-1}\otimes(\boldsymbol{X}^{-1})^\top]\boldsymbol{K}_{n,n}\text{Vec}(\text{d}\boldsymbol{X})$

Answer:

$\nabla_{\boldsymbol{X}} f = (\boldsymbol{X}^{-1})^\top$

$\nabla^2_{\boldsymbol{X}} f = -\boldsymbol{K}_{n,n}[(\boldsymbol{X}^{-1})^\top\otimes\boldsymbol{X}^{-1}]$
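The Hessian can be verified by differencing the analytic gradient, with the commutation matrix built entrywise as before (our sketch):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 3
X = rng.standard_normal((n, n)) + n * np.eye(n)  # keep det(X) > 0
vec = lambda M: M.flatten(order='F')

K = np.zeros((n * n, n * n))                     # commutation matrix K_{nn}
for i in range(n):
    for j in range(n):
        K[i * n + j, j * n + i] = 1

grad = lambda X: np.linalg.inv(X).T              # ∇_X log|X| = (X^{-1})^T

# numerical Jacobian of Vec(∇f) w.r.t. Vec(X); this post's ∇²f is its transpose
eps = 1e-6
J = np.zeros((n * n, n * n))
for b in range(n * n):
    E = np.zeros(n * n); E[b] = eps
    dX = E.reshape(n, n, order='F')
    J[:, b] = vec(grad(X + dX) - grad(X - dX)) / (2 * eps)

Xinv = np.linalg.inv(X)
H = -K @ np.kron(Xinv.T, Xinv)                   # claimed ∇²_X f
print(np.max(np.abs(J.T - H)))                   # ≈ 0
```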

Example 10. A Constructed Function

$\boldsymbol{F} = \boldsymbol{A}e^{\boldsymbol{X}\boldsymbol{B}}$, $\boldsymbol{A}\in\mathbb{R}^{l\times m}$, $\boldsymbol{X}\in\mathbb{R}^{m\times n}$, $\boldsymbol{B}\in\mathbb{R}^{n\times p}$.

Process:

$\text{Vec}(\text{d}\boldsymbol{F})$

$=\text{Vec}\left\{\boldsymbol{A}\left[e^{\boldsymbol{X}\boldsymbol{B}}\odot((\text{d}\boldsymbol{X})\boldsymbol{B})\right]\right\}$

$=(\boldsymbol{I}_p\otimes\boldsymbol{A})\text{Vec}\left[e^{\boldsymbol{X}\boldsymbol{B}}\odot((\text{d}\boldsymbol{X})\boldsymbol{B})\right]$

$=(\boldsymbol{I}_p\otimes\boldsymbol{A})\text{Diag}(e^{\boldsymbol{X}\boldsymbol{B}})\text{Vec}((\text{d}\boldsymbol{X})\boldsymbol{B})$

$=(\boldsymbol{I}_p\otimes\boldsymbol{A})\text{Diag}(e^{\boldsymbol{X}\boldsymbol{B}})(\boldsymbol{B}^\top\otimes\boldsymbol{I}_m)\text{Vec}(\text{d}\boldsymbol{X})$

Answer:

$\frac{\text{d}\boldsymbol{F}}{\text{d}\boldsymbol{X}}=(\boldsymbol{B}\otimes\boldsymbol{I}_m)\text{Diag}(e^{\boldsymbol{X}\boldsymbol{B}})(\boldsymbol{I}_p\otimes\boldsymbol{A}^\top)$
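Note the transpose on $\boldsymbol{A}$: transposing the last line of the process gives $(\boldsymbol{I}_p\otimes\boldsymbol{A})^\top = \boldsymbol{I}_p\otimes\boldsymbol{A}^\top$. A numerical check of this answer (our sketch):

```python
import numpy as np

rng = np.random.default_rng(12)
l, m, n, p = 2, 3, 4, 2
A = rng.standard_normal((l, m))
X = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))
vec = lambda M: M.flatten(order='F')

F = lambda X: A @ np.exp(X @ B)

eps = 1e-6
J = np.zeros((l * p, m * n))                     # numerator-layout Jacobian of Vec(F)
for b in range(m * n):
    E = np.zeros(m * n); E[b] = eps
    dX = E.reshape(m, n, order='F')
    J[:, b] = vec(F(X + dX) - F(X - dX)) / (2 * eps)

D = np.diag(vec(np.exp(X @ B)))                  # Diag(e^{XB})
dFdX = np.kron(B, np.eye(m)) @ D @ np.kron(np.eye(p), A.T)  # (B⊗I_m) Diag(e^{XB}) (I_p⊗A^T)
print(np.max(np.abs(J.T - dFdX)))                # ≈ 0 (up to finite-difference error)
```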

Example 11. Binary Logistic Regression

$\ell = -y\boldsymbol{x}^\top\boldsymbol{w} + \log(1 + e^{\boldsymbol{x}^\top\boldsymbol{w}})$; find $\nabla_{\boldsymbol{w}}\ell$ and $\nabla^2_{\boldsymbol{w}}\ell$, where $y\in\{0,1\}$ and $\boldsymbol{x},\boldsymbol{w}\in\mathbb{R}^{n\times 1}$.

Process:

$\text{d}\ell$

$= -y\boldsymbol{x}^\top\text{d}\boldsymbol{w}+\frac{1}{1+e^{\boldsymbol{x}^\top\boldsymbol{w}}}\left[e^{\boldsymbol{x}^\top\boldsymbol{w}}(\boldsymbol{x}^\top\text{d}\boldsymbol{w})\right]$

$= -y\boldsymbol{x}^\top\text{d}\boldsymbol{w}+\frac{e^{\boldsymbol{x}^\top\boldsymbol{w}}}{1+e^{\boldsymbol{x}^\top\boldsymbol{w}}}(\boldsymbol{x}^\top\text{d}\boldsymbol{w})$

$= (\frac{e^{\boldsymbol{x}^\top\boldsymbol{w}}}{1+e^{\boldsymbol{x}^\top\boldsymbol{w}}}-y)\boldsymbol{x}^\top\text{d}\boldsymbol{w}$

$\text{d}\nabla_{\boldsymbol{w}}\ell$

$={\color{red}\boldsymbol{x}}\,\text{d}\frac{e^{\boldsymbol{x}^\top\boldsymbol{w}}}{1+e^{\boldsymbol{x}^\top\boldsymbol{w}}}$

$=\boldsymbol{x}\frac{e^{\boldsymbol{x}^\top\boldsymbol{w}}(1+e^{\boldsymbol{x}^\top\boldsymbol{w}})-e^{2\boldsymbol{x}^\top\boldsymbol{w}}}{(1+e^{\boldsymbol{x}^\top\boldsymbol{w}})^2}\boldsymbol{x}^\top\text{d}\boldsymbol{w}$

$=\boldsymbol{x}\frac{e^{\boldsymbol{x}^\top\boldsymbol{w}}}{(1+e^{\boldsymbol{x}^\top\boldsymbol{w}})^2}\boldsymbol{x}^\top\text{d}\boldsymbol{w}$

Answer:

$\frac{\text{d}\ell}{\text{d}\boldsymbol{w}} = \nabla_{\boldsymbol{w}}\ell = (\frac{e^{\boldsymbol{x}^\top\boldsymbol{w}}}{1+e^{\boldsymbol{x}^\top\boldsymbol{w}}}-y)\boldsymbol{x}$

$\nabla^2_{\boldsymbol{w}}\ell=\boldsymbol{x}\frac{e^{\boldsymbol{x}^\top\boldsymbol{w}}}{(1+e^{\boldsymbol{x}^\top\boldsymbol{w}})^2}\boldsymbol{x}^\top$
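A numerical check of both gradient and Hessian, using $\sigma(t) = e^t/(1+e^t)$ so that the Hessian coefficient is $\sigma(1-\sigma)$ (a sketch of ours):

```python
import numpy as np

rng = np.random.default_rng(13)
n = 4
x = rng.standard_normal(n)
w = rng.standard_normal(n)
y = 1.0

s = lambda w: np.exp(x @ w) / (1 + np.exp(x @ w))     # logistic σ(x^T w)
ell = lambda w: -y * (x @ w) + np.log(1 + np.exp(x @ w))

grad = (s(w) - y) * x
hess = s(w) * (1 - s(w)) * np.outer(x, x)             # e^t/(1+e^t)^2 = σ(1-σ)

eps = 1e-6
num_g = np.array([(ell(w + eps * e) - ell(w - eps * e)) / (2 * eps) for e in np.eye(n)])
num_H = np.array([((s(w + eps * e) - y) * x - (s(w - eps * e) - y) * x) / (2 * eps)
                  for e in np.eye(n)]).T
print(np.max(np.abs(num_g - grad)), np.max(np.abs(num_H - hess)))  # ≈ 0, ≈ 0
```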

Example 12. Binary Logistic Regression with Multiple Samples

For a dataset with multiple samples $\{(\boldsymbol{x}_1, y_1),\cdots,(\boldsymbol{x}_N, y_N)\}$, $\ell = \sum_{i=1}^N\left[-y_i\boldsymbol{x}_i^\top\boldsymbol{w}+\log(1+e^{\boldsymbol{x}_i^\top\boldsymbol{w}})\right]$.

Process No.1:

Apply the derivation of Example 11 to each sample and sum the results.

Answer No.1:

$\frac{\text{d}\ell}{\text{d}\boldsymbol{w}} = \nabla_{\boldsymbol{w}}\ell = \sum_{i=1}^N(\frac{e^{\boldsymbol{x}_i^\top\boldsymbol{w}}}{1+e^{\boldsymbol{x}_i^\top\boldsymbol{w}}}-y_i)\boldsymbol{x}_i$

$\nabla^2_{\boldsymbol{w}}\ell=\sum_{i=1}^N\boldsymbol{x}_i\frac{e^{\boldsymbol{x}_i^\top\boldsymbol{w}}}{(1+e^{\boldsymbol{x}_i^\top\boldsymbol{w}})^2}\boldsymbol{x}_i^\top$

Process No. 2:

$\boldsymbol{X}=\begin{bmatrix}\boldsymbol{x}_1^\top\\\vdots\\\boldsymbol{x}_N^\top\end{bmatrix}$, $\boldsymbol{y}=\begin{bmatrix}y_1\\\vdots\\y_N\end{bmatrix}$, $\ell=-\boldsymbol{y}^\top\boldsymbol{X}\boldsymbol{w}+\boldsymbol{1}^\top\log(\boldsymbol{1}+e^{\boldsymbol{X}\boldsymbol{w}})$

$\text{d}\ell$

$=-\boldsymbol{y}^\top\boldsymbol{X}\text{d}\boldsymbol{w}+\boldsymbol{1}^\top\left[\frac{e^{\boldsymbol{X}\boldsymbol{w}}}{\boldsymbol{1}+e^{\boldsymbol{X}\boldsymbol{w}}}\odot(\boldsymbol{X}\text{d}\boldsymbol{w})\right]$

$=-\boldsymbol{y}^\top\boldsymbol{X}\text{d}\boldsymbol{w}+(\boldsymbol{1}\odot\frac{e^{\boldsymbol{X}\boldsymbol{w}}}{\boldsymbol{1}+e^{\boldsymbol{X}\boldsymbol{w}}})^\top(\boldsymbol{X}\text{d}\boldsymbol{w})$

$=\left(\frac{e^{\boldsymbol{X}\boldsymbol{w}}}{\boldsymbol{1}+e^{\boldsymbol{X}\boldsymbol{w}}}-\boldsymbol{y}\right)^\top\boldsymbol{X}\text{d}\boldsymbol{w}$

$\text{d}\nabla_{\boldsymbol{w}}\ell$

$=\boldsymbol{X}^\top\left[\frac{e^{\boldsymbol{X}\boldsymbol{w}}}{(\boldsymbol{1}+e^{\boldsymbol{X}\boldsymbol{w}})^2}\odot(\boldsymbol{X}\text{d}\boldsymbol{w})\right]$

$=\boldsymbol{X}^\top\text{Diag}\left[\frac{e^{\boldsymbol{X}\boldsymbol{w}}}{(\boldsymbol{1}+e^{\boldsymbol{X}\boldsymbol{w}})^2}\right]\boldsymbol{X}\text{d}\boldsymbol{w}$

Answer No.2:

$\nabla_{\boldsymbol{w}}\ell = \boldsymbol{X}^\top(\frac{e^{\boldsymbol{X}\boldsymbol{w}}}{\boldsymbol{1}+e^{\boldsymbol{X}\boldsymbol{w}}}-\boldsymbol{y})$

$\nabla_{\boldsymbol{w}}^2\ell=\boldsymbol{X}^\top\text{Diag}\left[\frac{e^{\boldsymbol{X}\boldsymbol{w}}}{(\boldsymbol{1}+e^{\boldsymbol{X}\boldsymbol{w}})^2}\right]\boldsymbol{X}$
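The matrix form (Answer No. 2) should agree exactly with the per-sample sums (Answer No. 1); a quick sketch:

```python
import numpy as np

rng = np.random.default_rng(14)
N, n = 20, 4
X = rng.standard_normal((N, n))                  # rows are x_i^T
y = rng.integers(0, 2, N).astype(float)
w = rng.standard_normal(n)

p = np.exp(X @ w) / (1 + np.exp(X @ w))          # elementwise logistic

grad_mat = X.T @ (p - y)                         # matrix form (Answer No. 2)
grad_sum = sum((p[i] - y[i]) * X[i] for i in range(N))  # per-sample form (Answer No. 1)
print(np.max(np.abs(grad_mat - grad_sum)))       # ≈ 0

H_mat = X.T @ np.diag(p * (1 - p)) @ X           # e^t/(1+e^t)^2 = p(1-p)
H_sum = sum(p[i] * (1 - p[i]) * np.outer(X[i], X[i]) for i in range(N))
print(np.max(np.abs(H_mat - H_sum)))             # ≈ 0
```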

Example 13. Multinomial Logistic Regression

$\ell = -\boldsymbol{y}^\top\log\text{softmax}(\boldsymbol{W}\boldsymbol{x}) = -\boldsymbol{y}^\top\boldsymbol{W}\boldsymbol{x} + \log(\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}})$; find $\nabla_{\boldsymbol{W}}\ell$ and $\nabla^2_{\boldsymbol{W}}\ell$, where $\boldsymbol{y}\in\mathbb{R}^{m}$ is a one-hot encoding vector, $\boldsymbol{x}\in\mathbb{R}^{n}$, and $\boldsymbol{W}\in\mathbb{R}^{m\times n}$.

Process:

$\text{d}\ell$

$=-\text{tr}(\boldsymbol{y}^\top\text{d}\boldsymbol{W}\boldsymbol{x})+\frac{1}{\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}}}\text{d}(\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}})$

$=-\text{tr}(\boldsymbol{y}^\top\text{d}\boldsymbol{W}\boldsymbol{x})+\frac{1}{\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}}}\boldsymbol{1}^\top\left[e^{\boldsymbol{W}\boldsymbol{x}}\odot(\text{d}\boldsymbol{W}\boldsymbol{x})\right]$

$=-\text{tr}(\boldsymbol{y}^\top\text{d}\boldsymbol{W}\boldsymbol{x})+\text{tr}\left[(\frac{e^{\boldsymbol{W}\boldsymbol{x}}}{\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}}})^\top(\text{d}\boldsymbol{W}\boldsymbol{x})\right]$

$=\text{tr}\left[[\boldsymbol{x}(\frac{e^{\boldsymbol{W}\boldsymbol{x}}}{\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}}})^\top - \boldsymbol{x}\boldsymbol{y}^\top]\text{d}\boldsymbol{W}\right]$

$\text{d}\nabla_{\boldsymbol{W}}\ell$

$=\text{d}\left[\frac{e^{\boldsymbol{W}\boldsymbol{x}}}{\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}}}\right]\boldsymbol{x}^\top$

$=\left[\frac{\text{d}(e^{\boldsymbol{W}\boldsymbol{x}})}{\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}}} - \frac{ {\color{red}e^{\boldsymbol{W}\boldsymbol{x}}}\boldsymbol{1}^\top\text{d}(e^{\boldsymbol{W}\boldsymbol{x}})}{(\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}})^2}\right]\boldsymbol{x}^\top$

$=\left[\frac{e^{\boldsymbol{W}\boldsymbol{x}}\odot(\text{d}\boldsymbol{W}\boldsymbol{x})}{\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}}} - \frac{e^{\boldsymbol{W}\boldsymbol{x}}\boldsymbol{1}^\top[e^{\boldsymbol{W}\boldsymbol{x}}\odot(\text{d}\boldsymbol{W}\boldsymbol{x})]}{(\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}})^2}\right]\boldsymbol{x}^\top$

$=\left[\frac{\text{Diag}(e^{\boldsymbol{W}\boldsymbol{x}})(\text{d}\boldsymbol{W})\boldsymbol{x}}{\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}}}-\frac{e^{\boldsymbol{W}\boldsymbol{x}}\boldsymbol{1}^\top\text{Diag}(e^{\boldsymbol{W}\boldsymbol{x}})(\text{d}\boldsymbol{W})\boldsymbol{x}}{(\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}})^2}\right]\boldsymbol{x}^\top$

$=\left[\frac{\text{Diag}(e^{\boldsymbol{W}\boldsymbol{x}})(\text{d}\boldsymbol{W})\boldsymbol{x}}{\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}}} - \frac{e^{\boldsymbol{W}\boldsymbol{x}}(e^{\boldsymbol{W}\boldsymbol{x}})^\top(\text{d}\boldsymbol{W})\boldsymbol{x}}{(\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}})^2}\right]\boldsymbol{x}^\top$

$=\frac{\text{Diag}(e^{\boldsymbol{W}\boldsymbol{x}})(\text{d}\boldsymbol{W})\boldsymbol{x}\boldsymbol{x}^\top}{\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}}} - \frac{e^{\boldsymbol{W}\boldsymbol{x}}(e^{\boldsymbol{W}\boldsymbol{x}})^\top(\text{d}\boldsymbol{W})\boldsymbol{x}\boldsymbol{x}^\top}{(\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}})^2}$

$\text{Vec}(\text{d}\nabla_{\boldsymbol{W}}\ell)$

$=\left\{\boldsymbol{x}\boldsymbol{x}^\top\otimes\left[\text{Diag}(\text{softmax}(\boldsymbol{W}\boldsymbol{x}))-\text{softmax}(\boldsymbol{W}\boldsymbol{x})\,\text{softmax}(\boldsymbol{W}\boldsymbol{x})^\top\right]\right\}\text{Vec}(\text{d}\boldsymbol{W})$

Answer:

$\nabla_{\boldsymbol{W}}\ell=\left[\frac{e^{\boldsymbol{W}\boldsymbol{x}}}{\boldsymbol{1}^\top e^{\boldsymbol{W}\boldsymbol{x}}} - \boldsymbol{y}\right]\boldsymbol{x}^\top$

$\nabla_{\boldsymbol{W}}^2\ell=\boldsymbol{x}\boldsymbol{x}^\top\otimes\left[\text{Diag}(\text{softmax}(\boldsymbol{W}\boldsymbol{x}))-\text{softmax}(\boldsymbol{W}\boldsymbol{x})\,\text{softmax}(\boldsymbol{W}\boldsymbol{x})^\top\right]$
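Finally, a numerical check of the Kronecker-structured Hessian, again with column-major vectorization (our sketch):

```python
import numpy as np

rng = np.random.default_rng(15)
m, n = 3, 4
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)
y = np.eye(m)[0]
vec = lambda M: M.flatten(order='F')

softmax = lambda a: np.exp(a) / np.sum(np.exp(a))

s = softmax(W @ x)
H = np.kron(np.outer(x, x), np.diag(s) - np.outer(s, s))  # claimed ∇²_W ℓ

# numerical Hessian over Vec(W): difference the analytic gradient [softmax(Wx) - y] x^T
eps = 1e-5
num_H = np.zeros((m * n, m * n))
for b in range(m * n):
    E = np.zeros(m * n); E[b] = eps
    dW = E.reshape(m, n, order='F')
    g_plus = np.outer(softmax((W + dW) @ x) - y, x)
    g_minus = np.outer(softmax((W - dW) @ x) - y, x)
    num_H[:, b] = vec(g_plus - g_minus) / (2 * eps)
print(np.max(np.abs(num_H - H)))                 # ≈ 0
```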

Closing remarks:
  • Note the parts marked in red in Examples 11 and 13: moving factors that do not involve the variable being differentiated out to the left removes unnecessary friction from the derivation.

  • This post is a digested and distilled reading of the two blog posts listed in the references, and is intended for study and reference only.

References

[1] 矩阵求导术(上) - 长躯鬼侠的文章 - 知乎 https://zhuanlan.zhihu.com/p/24709748

[2] 矩阵求导术(下) - 长躯鬼侠的文章 - 知乎 https://zhuanlan.zhihu.com/p/24863977