Gradient of the 2-Norm

Prove $||\mathbf{x}||_2^2 = \mathbf{x}^T\mathbf{x}$ when $\mathbf{x}$ is a vector:

$$
\begin{align}
\mathbf{x} =
\begin{bmatrix}
x_1 \\
x_2 \\
x_3
\end{bmatrix}
\Rightarrow \mathbf{x}^T\mathbf{x} =
\begin{bmatrix}
x_1 & x_2 & x_3
\end{bmatrix}
\begin{bmatrix}
x_1 \\
x_2 \\
x_3
\end{bmatrix}
= x_1^2 + x_2^2 + x_3^2 = ||\mathbf{x}||_2^2
\end{align}
$$

And the properties of the transpose, we obtain:

$$
\begin{align}
f(\mathbf{x}) &= ||\mathbf{b} – A\mathbf{x}||_2^2 \\
&= (\mathbf{b} – A\mathbf{x})^T (\mathbf{b} – A\mathbf{x}) \\
&= (\mathbf{b}^T – (A\mathbf{x})^T)(\mathbf{b} – A\mathbf{x}) \\
&= \mathbf{b}^T\mathbf{b} – (A\mathbf{x})^T\mathbf{b} – \mathbf{b}^TA\mathbf{x} + (A\mathbf{x})^TA\mathbf{x} \\
&= \mathbf{b}^T\mathbf{b} – (A\mathbf{x})^T\mathbf{b} – \mathbf{b}^TA\mathbf{x} + \mathbf{x}^T A^T A\mathbf{x} ~~~~~ (1)
\end{align}
$$

Since $A^TB$ is a scalar it equals its transpose: $A^TB = (A^TB)^T = B^T(A^T)^T = B^TA$ So (1) can be simplified to

$$
\begin{align}
f(\mathbf{x}) &= \mathbf{b}^T\mathbf{b} – 2\mathbf{b}^TA\mathbf{x} + \mathbf{x}^T A^T A\mathbf{x} \\
&= \mathbf{b}^T\mathbf{b} – 2(A^T\mathbf{b})^T\mathbf{x} + \mathbf{x}^T A^T A\mathbf{x} ~~~~~ (2)
\end{align}
$$

The gradient vector is a column vector containing the first-order partial derivatives

$$
y = f(\mathbf{x})
\Rightarrow \nabla f(\mathbf{x}) = \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} =
\begin{bmatrix}
\frac{\partial y}{\partial x_1} \\
\frac{\partial y}{\partial x_2} \\
\vdots \\
\frac{\partial y}{\partial x_n}
\end{bmatrix}
$$

1. $f(\mathbf{x}) = \mathbf{c}^T\mathbf{x}$ where $\mathbf{c}$ is a constant vector
$$
\begin{align}
\mathbf{c}^T\mathbf{x} =
\begin{bmatrix}
c_1 & \dots & c_n
\end{bmatrix}
\begin{bmatrix}
x_1 \\
\vdots \\
x_n
\end{bmatrix}
\end{align}
= c_1x_1 + … + c_nx_n = \sum_{i = 1}^n c_ix_i
$$

$$
\begin{align}
\Rightarrow \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} &= \frac{\partial \mathbf{c}^T\mathbf{x}}{\partial \mathbf{x}} \\
&=
\begin{bmatrix}
\frac{\partial \mathbf{c}^T\mathbf{x}}{\partial x_1} \\
\vdots \\
\frac{\partial \mathbf{c}^T\mathbf{x}}{\partial x_n} \\
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{\partial (c_1x_1 + … + c_nx_n)}{\partial x_1} \\
\vdots \\
\frac{\partial (c_1x_1 + … + c_nx_n)}{\partial x_n} \\
\end{bmatrix} \\
&=
\begin{bmatrix}
c_1 \\
\vdots \\
c_n \\
\end{bmatrix}
= \mathbf{c}
\end{align}
$$

2. $f(\mathbf{x}) = \mathbf{x}^T\mathbf{B}\mathbf{x}$ where $\mathbf{B}$ is a symmetric matrix

$$
\begin{align}
\mathbf{x}^T\mathbf{B}\mathbf{x} &=
\begin{bmatrix}
x_1 & \dots & x_n
\end{bmatrix}
\begin{bmatrix}
b_{11} & \dots & b_{1n} \\
\vdots & \ddots & \vdots \\
b_{n1} & \dots & b_{nn} \\
\end{bmatrix}
\begin{bmatrix}
x_1 \\
\vdots \\
x_n \\
\end{bmatrix} \\
&=
\begin{bmatrix}
(b_{11}x_1 + … + b_{n1}x_n) & \dots & (b_{1n}x_1 + … + b_{nn}x_n)
\end{bmatrix}
\begin{bmatrix}
x_1 \\
\vdots \\
x_n \\
\end{bmatrix} \\
&=
\begin{bmatrix}
\sum_{i = 1}^n b_{i1}x_i & \dots & \sum_{i = 1}^n b_{in}x_i
\end{bmatrix}
\begin{bmatrix}
x_1 \\
\vdots \\
x_n \\
\end{bmatrix} \\
&= x_1\sum_{i = 1}^n b_{i1}x_i + … + x_n\sum_{i = 1}^n b_{in}x_i \\
&= \sum_{j = 1}^n x_j \sum_{i = 1}^n b_{ij}x_i \\
&= \sum_{j = 1}^n \sum_{i = 1}^n b_{ij}x_ix_j ~~~~~ (3)
\end{align}
$$

$$
\begin{align}
\Rightarrow \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} &= \frac{\partial \mathbf{x}^T\mathbf{B}\mathbf{x}}{\partial \mathbf{x}} \\
&=
\begin{bmatrix}
\frac{\partial \mathbf{x}^T\mathbf{B}\mathbf{x}}{\partial \mathbf{x_1}} \\
\vdots \\
\frac{\partial \mathbf{x}^T\mathbf{B}\mathbf{x}}{\partial \mathbf{x_n}}
\end{bmatrix}
\end{align}
$$

From (3), consider the $k^{th}$ row in the above vector

$$
\begin{align}
\Rightarrow \frac{\partial \mathbf{x}^T\mathbf{B}\mathbf{x}}{\partial x_k} &= \frac{\partial}{\partial x_k}(\sum_{j = 1}^n \sum_{i = 1}^n b_{ij}x_ix_j) \\
&= \frac{\partial}{\partial x_k}(x_1\sum_{i = 1}^n b_{i1}x_i + … + x_k\sum_{i = 1}^n b_{ik}x_i + … + x_n\sum_{i = 1}^n b_{in}x_i) \\
&= x_1b_{k1} + … + (\sum_{i = 1}^n b_{ik}x_i + x_kb_{kk}) + … + x_nb_{kn} \\
&= \sum_{j = 1}^n b_{kj}x_j + \sum_{i = 1}^n b_{ik}x_i \\
&=
\begin{bmatrix}
b_{k1} & \dots & b_{kn}
\end{bmatrix}
\begin{bmatrix}
x_1 \\
\vdots \\
x_n
\end{bmatrix} +
\begin{bmatrix}
b_{1k} & \dots & b_{nk}
\end{bmatrix}
\begin{bmatrix}
x_1 \\
\vdots \\
x_n
\end{bmatrix} \\
&=
(\begin{bmatrix}
b_{k1} & \dots & b_{kn}
\end{bmatrix} +
\begin{bmatrix}
b_{1k} & \dots & b_{nk}
\end{bmatrix}
)
\begin{bmatrix}
x_1 \\
\vdots \\
x_n
\end{bmatrix} \\
&=
[(k^{th} \text{ row of } \mathbf{B}) + (\text{transpose of } k^{th} \text{ column of } \mathbf{B})]\mathbf{x}
\end{align}
$$

Therefore,

\begin{align}
\frac{\partial \mathbf{x}^T\mathbf{B}\mathbf{x}}{\partial \mathbf{x}} &=
\begin{bmatrix}
[(1^{st} \text{ row of } \mathbf{B}) + (\text{transpose of } 1^{st} \text{ column of } \mathbf{B})]\mathbf{x} \\
\vdots \\
[(n^{th} \text{ row of } \mathbf{B}) + (\text{transpose of } n^{th} \text{ column of } \mathbf{B})]\mathbf{x}
\end{bmatrix} \\
&=
\begin{bmatrix}
[(1^{st} \text{ row of } \mathbf{B}) + (\text{transpose of } 1^{st} \text{ column of } \mathbf{B})] \\
\vdots \\
[(n^{th} \text{ row of } \mathbf{B}) + (\text{transpose of } n^{th} \text{ column of } \mathbf{B})]
\end{bmatrix}\mathbf{x} \\
&=
(
\begin{bmatrix}
(1^{st} \text{ row of } \mathbf{B}) \\
\vdots \\
(n^{th} \text{ row of } \mathbf{B})
\end{bmatrix} +
\begin{bmatrix}
(\text{transpose of } 1^{st} \text{ column of } \mathbf{B}) \\
\vdots \\
(\text{transpose of } n^{th} \text{ column of } \mathbf{B})
\end{bmatrix}
)\mathbf{x} \\
&= (\mathbf{B} + \mathbf{B^T})\mathbf{x}
\end{align}

From (2), we have

\begin{align}
f(\mathbf{x}) &= \mathbf{b}^T\mathbf{b} – 2(A^T\mathbf{b})^T\mathbf{x} + \mathbf{x}^T A^T A\mathbf{x}
\end{align}

Using the formulas from the sections 1 & 2, with $\mathbf{c} = A^T\mathbf{b}$ & $\mathbf{B} = A^T A$, we have
\begin{align}
\nabla f(\mathbf{x}) &= -2A^T\mathbf{b} + (A^TA + (A^TA)^T)\mathbf{x} \\
&= -2A^T\mathbf{b} + (A^TA + A^T(A^T)^T)\mathbf{x} \\
&= -2A^T\mathbf{b} + (A^TA + A^TA)\mathbf{x} \\
&= -2A^T\mathbf{b} + 2A^TA\mathbf{x} \\
&= 2(A^TA\mathbf{x} – A^T\mathbf{b})
\end{align}

Kien Duong

April 29, 2023

Gradient of the 2-Norm

Recent Blogs

LSTM with Pytorch

June 15, 2025

Retrieval Augmented Generation

June 8, 2025

Long Short-Term Memory (LSTM)

June 2, 2025

Variance

May 29, 2025