This page explains the Marquardt method, also known as the Levenberg–Marquardt (LM) method. This technique is a type of least-squares method that combines fast convergence speed with stability, and in many cases can find optimal values with sufficient accuracy, making it widely used in the field of computational science. Note that the solution obtained by this algorithm is merely a local minimum near the initial values, and there is no guarantee that the global minimum will be found.


Given \( n\) observation points$$
(X_1, Y_1),(X_2, Y_2),\cdots,(X_n, Y_n)
$$ let us consider fitting a function (observation equation) containing \(m\) parameters$$
F=F(a_1, a_2,\cdots, a_{m-1},a_m,X)
$$to these points. In other words, the goal is to find the function \(F\) that best explains the observed values \(Y\) (= to find the parameters \(a\)). Here, “best explains” means computing the sum of squared residuals between the observed values \(Y\) and the function \(F\) at each observation point, and making this sum as small as possible. When considering the error at each observation point, we multiply the squared residuals by the weight \(w_i\) of each observation point (the reciprocal of the squared error) (if the errors are unknown, you may set \(w_i=1\) for now). That is, the “quantity we want to minimize” is the weighted sum of squared residuals \(R\), defined as follows:$$
R=\sum_{i=1}^n{w_i[Y_i-F(a_1, a_2,\cdots, a_{m-1},a_m,X_i)]^2}
$$We want to find the parameters \(a\) that minimize this \(R\), and one powerful method for doing so is the Marquardt method. Note that the function \(F\) is assumed to be differentiable with respect to the parameters \(a\)¹. The procedure of the Marquardt method is explained below.

STEP 1

First, give an appropriate initial parameter set \(a^0 = (a^0_1,a^0_2,\cdots,a^0_{m-1},a^0_m) \) and compute the weighted sum of squared residuals \(R\) at that point.

$$R=\sum_{i=1}^n{w_i[Y_i-F(a^0,X_i)]^2}$$
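As a concrete illustration, the computation in STEP 1 can be sketched in Python. The model \(F\), the data points, and the initial parameters below are hypothetical choices for demonstration, not part of the method itself:

```python
import numpy as np

# Hypothetical model F(a, X) = a1 * exp(-a2 * X); any function that is
# differentiable in the parameters would work just as well.
def F(a, X):
    return a[0] * np.exp(-a[1] * X)

# Synthetic observation points (n = 5) with unit weights w_i = 1
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = np.array([2.0, 1.2, 0.7, 0.45, 0.25])
w = np.ones_like(X)

a0 = np.array([2.0, 0.5])             # initial parameter set a^0
R = np.sum(w * (Y - F(a0, X)) ** 2)   # weighted sum of squared residuals
```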

STEP 2

Next, compute the partial derivatives of \(F\) with respect to each parameter, and using the components evaluated at the current parameter values, construct the following \(m \times m\) symmetric matrix \(\alpha\):$$
\alpha=
\begin{pmatrix}
(1+\lambda)\displaystyle\sum_{i=1}^n w_i \left(\frac{\partial F(a^0,X_i)}{\partial a_1}\right)^2 &
\displaystyle\sum_{i=1}^n w_i\frac{\partial F(a^0,X_i)}{\partial a_1}\frac{\partial F(a^0,X_i)}{\partial a_2} &
\ldots &
\displaystyle\sum_{i=1}^n w_i\frac{\partial F(a^0,X_i)}{\partial a_1}\frac{\partial F(a^0,X_i)}{\partial a_m}\\
\displaystyle\sum_{i=1}^n w_i\frac{\partial F(a^0,X_i)}{\partial a_2}\frac{\partial F(a^0,X_i)}{\partial a_1} &
(1+\lambda)\displaystyle\sum_{i=1}^n w_i \left(\frac{\partial F(a^0,X_i)}{\partial a_2}\right)^2 &
\ldots &
\displaystyle\sum_{i=1}^n w_i\frac{\partial F(a^0,X_i)}{\partial a_2}\frac{\partial F(a^0,X_i)}{\partial a_m}\\
\vdots & \vdots & \ddots & \vdots \\
\displaystyle\sum_{i=1}^n w_i\frac{\partial F(a^0,X_i)}{\partial a_m}\frac{\partial F(a^0,X_i)}{\partial a_1} &
\displaystyle\sum_{i=1}^n w_i\frac{\partial F(a^0,X_i)}{\partial a_m}\frac{\partial F(a^0,X_i)}{\partial a_2} &
\ldots &
(1+\lambda)\displaystyle\sum_{i=1}^n w_i \left( \frac{\partial F(a^0,X_i)}{\partial a_m} \right)^2
\end{pmatrix}
$$Here, \(\lambda\) is a coefficient for emphasizing the diagonal elements. A good starting value is around \(\lambda=5\). This matrix is called the “error matrix.” This is because when the function \(F\) is linear, the diagonal elements coincide with the variance of each parameter, though we omit the detailed explanation here.
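Using the same illustrative model \(F(a, X) = a_1 e^{-a_2 X}\) (an assumption for demonstration, with analytic partial derivatives), the matrix \(\alpha\) can be assembled from the \(n \times m\) matrix of partial derivatives evaluated at the observation points:

```python
import numpy as np

# Partial derivatives of the hypothetical model F(a, X) = a1 * exp(-a2 * X):
# dF/da1 = exp(-a2*X),  dF/da2 = -a1 * X * exp(-a2*X)
def jacobian(a, X):
    e = np.exp(-a[1] * X)
    return np.column_stack([e, -a[0] * X * e])   # shape (n, m)

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
w = np.ones_like(X)
a0 = np.array([2.0, 0.5])
lam = 5.0                                        # starting value suggested in the text

J = jacobian(a0, X)
alpha = (J * w[:, None]).T @ J                   # sum_i w_i (dF/da_j)(dF/da_k)
alpha[np.diag_indices_from(alpha)] *= (1 + lam)  # emphasize the diagonal elements
```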

STEP 3

Compute the \(m \times 1\) column vector \(\beta\), obtained by multiplying the residuals by the partial derivatives:$$
\beta=
\begin{pmatrix}
\displaystyle\sum_{i=1}^n w_i[Y_i-F(a^0,X_i)]\frac{\partial F(a^0,X_i)}{\partial a_1} \\
\displaystyle\sum_{i=1}^n w_i[Y_i-F(a^0,X_i)]\frac{\partial F(a^0,X_i)}{\partial a_2} \\
\vdots \\
\displaystyle\sum_{i=1}^n w_i[Y_i-F(a^0,X_i)]\frac{\partial F(a^0,X_i)}{\partial a_m}
\end{pmatrix}$$
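With the same hypothetical model and data as above (illustrative assumptions), \(\beta\) is a single matrix-vector product of the Jacobian with the weighted residuals:

```python
import numpy as np

# Hypothetical model F(a, X) = a1 * exp(-a2 * X) and its partial derivatives
def F(a, X):
    return a[0] * np.exp(-a[1] * X)

def jacobian(a, X):
    e = np.exp(-a[1] * X)
    return np.column_stack([e, -a[0] * X * e])

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = np.array([2.0, 1.2, 0.7, 0.45, 0.25])
w = np.ones_like(X)
a0 = np.array([2.0, 0.5])

r = Y - F(a0, X)                    # residuals Y_i - F(a^0, X_i)
beta = jacobian(a0, X).T @ (w * r)  # beta_j = sum_i w_i r_i dF/da_j
```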

STEP 4

Using \(\alpha\) and \(\beta\), obtain the new parameter set \((a'_1,a'_2,\cdots,a'_m)\) as follows:$$
\begin{pmatrix} a'_1 \\ a'_2 \\ \vdots \\ a'_m \end{pmatrix}
=\begin{pmatrix} a^0_1 \\ a^0_2 \\ \vdots \\ a^0_m \end{pmatrix} + \alpha^{-1}\beta
$$
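In code, the update amounts to solving one linear system; solving \(\alpha \, \delta = \beta\) directly is numerically preferable to forming \(\alpha^{-1}\) explicitly. The model and data remain the illustrative assumptions used above:

```python
import numpy as np

# Hypothetical model F(a, X) = a1 * exp(-a2 * X) (for demonstration only)
def F(a, X):
    return a[0] * np.exp(-a[1] * X)

def jacobian(a, X):
    e = np.exp(-a[1] * X)
    return np.column_stack([e, -a[0] * X * e])

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = np.array([2.0, 1.2, 0.7, 0.45, 0.25])
w = np.ones_like(X)
a0 = np.array([2.0, 0.5])
lam = 5.0

J = jacobian(a0, X)
alpha = (J * w[:, None]).T @ J
alpha[np.diag_indices_from(alpha)] *= (1 + lam)
beta = J.T @ (w * (Y - F(a0, X)))

delta = np.linalg.solve(alpha, beta)  # alpha^{-1} beta without explicit inversion
a_new = a0 + delta                    # the new parameter set a'
```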

STEP 5

Compute the weighted sum of squared residuals \(R'\) using the new parameter set \(a'\). If \(R'<R\) (i.e., the residual decreases), adopt the new parameters as the initial values, multiply \(\lambda\) by 1/2 to 1/10, and return to STEP 1. On the other hand, if \(R' \geq R\), discard the new parameters, use the old parameters again as the initial values, multiply \(\lambda\) by 2 to 10, and return to STEP 1.
Repeat this procedure many times, and terminate the computation when the change in \(R\) becomes sufficiently small.
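STEPs 1 through 5 combine into a single loop. The sketch below follows the procedure above (halving \(\lambda\) on success, doubling it on failure); the model, data, and stopping constants are illustrative assumptions:

```python
import numpy as np

# Hypothetical model F(a, X) = a1 * exp(-a2 * X) and its partial derivatives
def F(a, X):
    return a[0] * np.exp(-a[1] * X)

def jacobian(a, X):
    e = np.exp(-a[1] * X)
    return np.column_stack([e, -a[0] * X * e])

def marquardt(F, jacobian, X, Y, w, a0, lam=5.0, tol=1e-10, max_iter=200):
    a = np.asarray(a0, dtype=float)
    R = np.sum(w * (Y - F(a, X)) ** 2)
    for _ in range(max_iter):
        # STEP 2-3: build alpha (with lambda on the diagonal) and beta
        J = jacobian(a, X)
        alpha = (J * w[:, None]).T @ J
        alpha[np.diag_indices_from(alpha)] *= (1 + lam)
        beta = J.T @ (w * (Y - F(a, X)))
        # STEP 4: candidate parameter set a' = a + alpha^{-1} beta
        a_new = a + np.linalg.solve(alpha, beta)
        # STEP 5: accept or reject, adjusting lambda
        R_new = np.sum(w * (Y - F(a_new, X)) ** 2)
        if R_new < R:
            converged = (R - R_new) < tol
            a, R = a_new, R_new
            lam *= 0.5        # success: behave more like Gauss-Newton
            if converged:
                break
        else:
            lam *= 2.0        # failure: behave more like steepest descent
    return a, R

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = np.array([2.0, 1.2, 0.7, 0.45, 0.25])
w = np.ones_like(X)
a_fit, R_fit = marquardt(F, jacobian, X, Y, w, a0=[1.0, 1.0])
```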


The key point of this method is that it progressively optimizes by varying the weighting of the diagonal elements of the error matrix. When \(R\) is decreasing smoothly, the value of \(\lambda\) automatically approaches 0, so the diagonal and off-diagonal terms are increasingly treated equally. In other words, the behavior becomes similar to the Gauss–Newton method.
On the other hand, when \(R\) does not decrease easily, a linear treatment is not appropriate, and it is wiser to move the parameters along \(\textrm{grad} R\) in the sense of vector analysis. This corresponds to increasing the value of \(\lambda\) to emphasize the diagonal terms. In other words, the behavior becomes similar to the steepest descent method.
In summary, the Marquardt method moves parameters more boldly (= linearly) when successive computations succeed, and more cautiously along the gradient (\(\textrm{grad}\)) when they fail. The algorithm’s philosophy of “if you fail, go back and carefully choose a new path” applies to life as well.


  1. To apply the Marquardt method, the function \(F\) may be nonlinear with respect to the parameters \(a\), but it must be differentiable. ↩︎