Optimization

Introduction to optimization

Optimization methods are at the core of many practical applications in science and engineering. Traditionally, optimization is concerned with minimizing an objective or cost function \(f(x)\), where \(x\) is a real number (Nocedal and Wright, 1999). Given the cost, we must find the cheapest choice \(x^*\). In other words,

\begin{equation} x^* = \arg\min_{x \in \mathbb{R}} f(x) \end{equation}

Let's start with a simple example and use \(f(x) = x^2\). The figure below plots this parabola on the interval \([-1, 1]\).

Parabola

Any human can immediately spot the minimum in this case. It's at zero! Of course. But let's spend a little extra time on this example: we want an automatic way to find the minimum that will also work for more challenging examples, and the best way to get there is to start simple.

The derivative

We require the derivative to achieve our goal of an automatic way of finding minima. Since \(x \in [-1, 1]\), we can move along a one-dimensional line between minus one and one. If we started at some point \(x_0\) somewhere on this line, we could move slightly to the left or to the right. We can make this choice with information about the slope of our function \(f(x)\). To learn something about this slope, let's consider the derivative next,

\begin{equation} \frac{df(x)}{dx} = \lim_{h \rightarrow 0} \frac{f(x+h) - f(x)}{h}. \end{equation}

To find the derivative of the parabola, we plug \(x^2\) into the above equation:

\begin{align} \lim_{h \rightarrow 0} \dfrac{(x + h)^2 - x^2}{h} &= \lim_{h \rightarrow 0} \dfrac{x^2 + 2xh + h^2 - x^2}{h} \\ &= \lim_{h \rightarrow 0} \dfrac{2xh + h^2}{h} \\ &= \lim_{h \rightarrow 0} \dfrac{h (2x + h)}{h} \\ &= \lim_{h \rightarrow 0} (2x + h) \\ &= 2x. \end{align}

We now know that the derivative \(\frac{df(x)}{dx}\), or \(f'(x)\), of the parabola is \(2x\).

To find a minimum, we follow the derivative downhill, that is, in the negative direction. With \(n\) denoting the step number, \(\epsilon \in \mathbb{R}\) the step size and \(\dfrac{d f}{dx}\) the derivative of \(f(x)\) along \(x \in \mathbb{R}\), the update rule reads

\begin{equation} x_n = x_{n-1} - \epsilon \cdot \dfrac{d f(x_{n - 1})}{dx}. \end{equation}

Setting the initial position \(x_0 = 1\) and choosing a step size of \(\epsilon = 0.1\) for 25 steps yields the plots below. The top plot shows the parabola \(f(x)\) and the positions visited by the optimization process starting at \(x_0\). The bottom plot shows the magnitude of the derivative; conveniently, the derivative vanishes at the minimum.

Descent_Parabola
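
For readers who prefer code, a minimal Python sketch of this procedure (function and variable names are my own) reproduces the run above with \(x_0 = 1\), \(\epsilon = 0.1\), and 25 steps:

```python
def f(x):
    # The parabola f(x) = x^2.
    return x ** 2

def df(x):
    # Its derivative f'(x) = 2x, as derived above.
    return 2.0 * x

x = 1.0        # initial position x_0
epsilon = 0.1  # step size
for n in range(25):
    x = x - epsilon * df(x)  # x_n = x_{n-1} - eps * f'(x_{n-1})

print(x, f(x))  # x is roughly 0.004, f(x) is close to zero, the minimum
```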

The plots above illustrate how we can follow the derivative until we arrive at the minimum of the parabola. In this case, the function has only a single minimum. Let's consider a more elaborate example next.

Local minima

Functions with local minima are interesting. Consider, for example,

\begin{equation} f_p(x) = x - 19x^2 + 70x^3 - 96x^4 + 44x^5. \end{equation}

To simplify our discussion, let's limit ourselves to the interval \([0, 1]\). This function seems arbitrary at first. The plot below reveals why it is an interesting choice.

Polinomial_Fun

Take note of the two dips when following the curve from left to right. Points where the function takes smaller values than all neighbouring points are called minima. In this case, we can see that the second minimum is lower than the first. This text focuses on ways to identify these minima numerically.

Since \(f_p(x)\) is a polynomial, we will use the fact that the derivative of a monomial \(x^n\) is given by \(nx^{n-1}\) (Deisenroth et al., 2020). Following this rule, we reduce every exponent by one and scale every coefficient by the original exponent. The process yields the derivative

\begin{equation}\label{eq:fpprime} f_p'(x) = 1 - 38 x + 210 x^2 - 384 x^3 + 220 x^4 . \end{equation}

If we repeat the descent process as described in the derivative section with \(x_0 = 0.05\), we end up in the local minimum. The plot below illustrates the situation.

Polinomial_Fun_opt
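
A sketch of this experiment in Python; the article only states the starting point \(x_0 = 0.05\), so the step size and number of steps below are assumptions (the step size is reused from the parabola example):

```python
def df_p(x):
    # Derivative of the polynomial f_p, see the equation above.
    return 1 - 38 * x + 210 * x**2 - 384 * x**3 + 220 * x**4

x = 0.05       # starting point from the text
epsilon = 0.1  # assumed step size
for n in range(100):
    x = x - epsilon * df_p(x)

print(x)  # settles near the first dip at roughly x ~ 0.31, the local minimum
```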

Here, we know the global topology of the function on the interval \([0, 1]\): if we chose to start at \(x_0 = 1.0\) instead, we would find the global minimum. In neural networks with thousands, millions or billions of parameters, understanding the global topology of the cost function is an unsolved problem.

Consequently, we are looking for a way to overcome the local minimum. Momentum is a good way to power through the bump and reach the global minimum (Goodfellow et al., 2016). The idea is to work with a moving average of derivative values as follows,

\begin{equation}\label{eq:descent_1d} v_n = \alpha v_{n-1} - \epsilon \cdot \dfrac{d f(x_{n - 1})}{dx}, \\ x_n = x_{n - 1} + v_n, \end{equation}

In the equations above, \(v_n\) contains the averaged derivative values at step \(n\), and \(\epsilon\) continues to denote the step size as before. A new momentum parameter appears as \(\alpha\).

The video above illustrates the effect of momentum, which allows the process to converge to the global minimum.
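
The same loop with the momentum update might look as follows. The step size and momentum parameter below are illustrative guesses, since the exact values behind the animation are not stated:

```python
def df_p(x):
    # Derivative of the polynomial f_p, as before.
    return 1 - 38 * x + 210 * x**2 - 384 * x**3 + 220 * x**4

x = 0.05        # same starting point as before
v = 0.0         # moving average of derivative values
epsilon = 0.01  # assumed step size
alpha = 0.95    # assumed momentum parameter
for n in range(300):
    v = alpha * v - epsilon * df_p(x)  # v_n = alpha * v_{n-1} - eps * f_p'(x_{n-1})
    x = x + v                          # x_n = x_{n-1} + v_n

print(x)  # the accumulated velocity carries the iterate over the bump;
          # it settles near the global minimum at roughly x ~ 0.87
```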

We want to understand the optimization process for more complex neural networks, which we can think of as functions as well. To get there, we require the ability to handle composite functions with high-dimensional inputs. After all, our networks should be able to handle high-dimensional inputs like images, audio, or text.

So far, this text has dealt only with monomials, i.e. \(x^n\), and polynomials, i.e. sums of monomials. Finding the derivative of a polynomial, for example, requires the sum rule, which we have not yet introduced formally. Let's do so now, along with the other standard rules (Deisenroth et al., 2020).

\begin{align} \text{Product Rule: } (f(x)g(x))' = f'(x)g(x) + f(x)g'(x) \\ \text{Quotient Rule: } \big (\frac{f(x)}{g(x)} \big)' = \frac{f'(x)g(x) - f(x)g'(x)}{g(x)^2} \label{eq:quotientrule} \\ \text{Sum Rule: } (f(x) + g(x))' = f'(x) + g'(x) \\ \text{Chain Rule: } (f(g(x)))' = f'(g(x))\cdot g'(x) \label{eq:chainrule} \end{align}

The sum rule tells us that we can differentiate the individual summands and add the results to obtain the overall derivative. We already used it to find the derivative of the polynomial \(f_p\) (equation \(\ref{eq:fpprime}\)).
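
If sympy is available, we can also let it confirm the hand-derived result (a quick sanity check; not part of the original derivation):

```python
import sympy

x = sympy.symbols("x")
f_p = x - 19 * x**2 + 70 * x**3 - 96 * x**4 + 44 * x**5

# Differentiating term by term is the sum rule in action.
print(sympy.diff(f_p, x))
# 220*x**4 - 384*x**3 + 210*x**2 - 38*x + 1  (term order may differ)
```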

Let us consider another example, the logistic sigmoid function. The logistic sigmoid frequently appears as an activation function in the neural network literature. Its outputs are confined to the interval between zero and one, which makes it an attractive choice. The function is defined as (Goodfellow et al., 2016)

\begin{equation} \sigma(x) = \frac{1}{1+e^{-x}} \end{equation}

for real-valued inputs \(x \in \mathbb{R}\), with the exponential function \(e^x\). Let's consider a plot of this function.

Sigmoid

The plot above reveals the appeal of the logistic sigmoid for neural network applications. The function produces values bounded by zero and one. When the input is very negative, the output is close to zero, and we can consider the corresponding neuron inactive. Similarly, when the input is large, the output approaches one, which corresponds to an active neuron.

To optimize a network that uses a sigmoid activation function, we require its derivative. Let's find it. The sigmoid function is a fraction; therefore, we are going to apply the quotient rule. To make our lives easier, let's restructure the function first.

\begin{align} \dfrac{\text{d} \sigma(x)}{\text{dx}} &= \dfrac{\text{d} }{\text{dx}} \dfrac{1}{1+ e^{-x}} \\ &= \dfrac{\text{d} }{\text{dx}} \dfrac{1}{1+ e^{-x}} \cdot 1 \\ &= \dfrac{\text{d} }{\text{dx}} \dfrac{1}{1+ e^{-x}} \cdot \dfrac{e^x}{e^x} \\ &= \dfrac{\text{d} }{\text{dx}} \dfrac{e^x}{e^x + 1} \end{align}

Now we are able to identify the numerator \(g(x) = e^x\) and the denominator \(h(x) = e^x + 1\). Let's plug \(g(x)\) and \(h(x)\) into the quotient rule (equation \(\ref{eq:quotientrule}\))

\begin{align} \dfrac{\text{d} \sigma(x)}{\text{dx}} &= \dfrac{e^x \cdot (e^x + 1) - e^x \cdot e^x}{(e^x + 1)^2} \\ &= \dfrac{e^x \cdot e^x + e^x - e^x \cdot e^x}{(e^x + 1)^2} \\ &= \dfrac{e^x}{(e^x + 1)^2} \\ &= \dfrac{e^x}{e^x + 1} \cdot \dfrac{1}{e^x + 1} \\ &= \dfrac{e^x}{e^x + 1} \left( \dfrac{1 + e^x - e^x}{e^x + 1} \right) \\ &= \dfrac{e^x}{e^x + 1} \left( 1 - \dfrac{e^x}{e^x + 1} \right) \\ &= \sigma(x)(1 - \sigma(x)) \end{align}

We found

\begin{equation} \sigma'(x) = \frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x)), \end{equation}

which means we can use it to descend towards a minimum. Furthermore, it turns out we can conveniently express \(\sigma'(x)\) in terms of \(\sigma(x)\) itself. The plot below looks at the shape of the derivative.

sigmoid_prime

The plot reveals the bell-shaped form of \(\sigma'(x)\), with its maximum value of \(0.25\) at \(x = 0\), well below one.
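
A small numpy sketch (the helper names are my own) confirms that the compact form \(\sigma(x)(1-\sigma(x))\) matches the quotient-rule form \(e^x/(e^x+1)^2\):

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid: sigma(x) = 1 / (1 + exp(-x)).
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # Derivative expressed through sigma itself.
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-6.0, 6.0, 101)
quotient_rule_form = np.exp(x) / (np.exp(x) + 1.0) ** 2
print(np.allclose(sigmoid_prime(x), quotient_rule_form))  # True
print(sigmoid_prime(0.0))  # 0.25, the maximum of sigma'
```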

Now that we have seen a more involved application of the quotient rule, let's turn our attention to multivariate functions. After all, we would like to be able to process vectors of pixels or audio samples. Moving from one to two dimensions will get us started. Let us take an in-depth look at the Rosenbrock function, a standard two-dimensional test function in the optimization literature,

\begin{equation} f(x_1, x_2) = (a - x_1)^2 + b(x_2 - x_1^2)^2, \end{equation}

In the equation above, the input to \(f\) is the two-dimensional vector \(\mathbf{x} = (x_1, x_2)^T\), where \(x_1\) and \(x_2\) denote the vector elements. Overall, the function maps this two-dimensional vector \(\mathbf{x} \in \mathbb{R}^2\) to a real number. In other terms, we are dealing with a function \(f: \mathbb{R}^2 \rightarrow \mathbb{R}\).

The plot below illustrates why this function is interesting. To create the plot, we set \(a=1\) and \(b=100\). Both values will remain the same going forward.

Rosenbrock_function

The function creates a long, curved valley whose floor descends only very gradually towards the minimum. It is therefore an ideal example to explore optimization algorithms!

Before we can get started, we need a way to process the vector \(\mathbf{x} = (x_1, x_2)^T\). Or in other words, we are looking for a way to deal with derivatives of multivariate functions.

Luckily, for artificial neural network training, the cost function usually produces a single number or scalar value. In other words, we can limit ourselves to functions from many inputs to a single output, \(f: \mathbb{R}^n \rightarrow \mathbb{R}\). The input vector is now high-dimensional, \(\mathbf{x} = (x_1, x_2 , \dots, x_n)^T\). A function with many inputs has a partial derivative for every input variable. The partial derivative with respect to input \(x_i\) is written as \(\frac{\partial f}{\partial x_i}\).

To compute the partial derivative for a specific input, we reuse the difference quotient and keep all other inputs constant (Deisenroth et al., 2020),

\begin{align} \frac{\partial f (\mathbf{x})}{\partial x_1} = &\lim_{h \rightarrow 0} \frac{f(x_1 + h, x_2, \dots, x_n) - f(\mathbf{x})}{h} \\ \vdots & \\ \frac{\partial f (\mathbf{x})}{\partial x_n} = & \lim_{h \rightarrow 0} \frac{f(x_1, x_2, \dots, x_n + h) - f(\mathbf{x})}{h}. \end{align}

Collecting these partial derivatives for all \(n\) inputs yields the gradient \(\nabla_x f(\mathbf{x}) = \operatorname{grad}(f(\mathbf{x})) = \frac{d f(\mathbf{x})}{d\mathbf{x}} = ( \frac{\partial f (\mathbf{x})}{\partial x_1}, \frac{\partial f (\mathbf{x})}{\partial x_2}, \dots, \frac{\partial f (\mathbf{x})}{\partial x_n} )^T \in \mathbb{R}^{n}\).
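
These limits also suggest a simple numerical recipe: perturb one input at a time by a small \(h\) while keeping the others fixed. A sketch, assuming a forward difference and a hand-picked \(h\):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    # Approximate each partial derivative with a difference quotient,
    # perturbing one input at a time while keeping the others constant.
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_shifted = x.copy()
        x_shifted[i] += h
        grad[i] = (f(x_shifted) - f(x)) / h
    return grad

def sum_of_squares(x):
    # f(x) = x_1^2 + ... + x_n^2, whose gradient is 2x.
    return np.sum(x ** 2)

x0 = np.array([1.0, -2.0, 0.5])
print(numerical_gradient(sum_of_squares, x0))  # approximately [ 2., -4.,  1.]
```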

Armed with this general expression, we can return to the Rosenbrock function. We derive:

\begin{align} f(x_1, x_2) &= (a - x_1)^2 + b(x_2 - x_1^2)^2 \\ &= a^2 - 2ax_1 + x_1^2 + b(x_2^2 - 2x_2 x_1^2 + x_1^4) \\ &= a^2 - 2ax_1 + x_1^2 + bx_2^2 - 2bx_2 x_1^2 + bx_1^4 \\ \end{align}

\(\;\)

\begin{align} \Rightarrow \dfrac{\partial f(x_1, x_2)}{\partial x_1} &= -2a + 2x_1 - 4bx_2 x_1 + 4bx_1^3 \\ \Rightarrow \dfrac{\partial f(x_1, x_2)}{\partial x_2} &= 2bx_2 - 2bx_1^2 \end{align}

Now it's possible to write down the analytic expression for the gradient:

\begin{align} \nabla f(x_1, x_2) = \begin{pmatrix} -2a + 2x_1 - 4bx_2 x_1 + 4bx_1^3 \\ 2bx_2 - 2bx_1^2 \\ \end{pmatrix} \end{align}
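
In code, the gradient becomes a vector-valued function (a numpy sketch; the function name is my own). As a quick sanity check, both partial derivatives vanish at \((1, 1)^T\), the minimum of the function for \(a = 1\):

```python
import numpy as np

def rosenbrock_grad(x, a=1.0, b=100.0):
    # Analytic gradient of the Rosenbrock function, see the expression above.
    return np.array([
        -2 * a + 2 * x[0] - 4 * b * x[1] * x[0] + 4 * b * x[0] ** 3,
        2 * b * x[1] - 2 * b * x[0] ** 2,
    ])

print(rosenbrock_grad(np.array([1.0, 1.0])))  # [0. 0.], the gradient vanishes at the minimum
```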

The plot below shows gradients evaluated on a grid.

Rosenbrock_function_with_grads

The gradient vectors point uphill, in the direction of steepest ascent. For minimization purposes, we can turn them around by multiplying the vectors by minus one. Following these negated gradients should let us descend into the Rosenbrock valley. To do so, we modify our one-dimensional descent algorithm (from equation \(\ref{eq:descent_1d}\)),

\begin{align} \mathbf{v}_n = \alpha \mathbf{v}_{n-1} - \epsilon \cdot \nabla_x f(\mathbf{x}_{n-1}), \\ \mathbf{x}_n = \mathbf{x}_{n - 1} + \mathbf{v}_n, \end{align}

Now \(\mathbf{v}\) and \(\mathbf{x}\) are vectors, and the momentum parameter \(\alpha\) and the step size \(\epsilon\) appear once more. This algorithm is widely used for neural network training (Goodfellow et al., 2016). Let's observe its performance on the Rosenbrock function. The algorithm starts from \((-0.1, 3.0)^T\) with a step size of \(\epsilon = 0.01\), a momentum parameter of \(\alpha = 0.8\), and stops after 600 steps.

The method correctly converges to the global minimum at \((1,1)^T\).
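
A general sketch of the vectorized momentum update is given below (numpy; the function name is my own). To keep the example easy to verify at a glance, it is demonstrated on a simple quadratic bowl with gradient \(2\mathbf{x}\), whose minimum at the origin is easy to check:

```python
import numpy as np

def momentum_descent(grad, x0, epsilon, alpha, steps):
    # Vectorized heavy-ball update:
    #   v_n = alpha * v_{n-1} - epsilon * grad(x_{n-1})
    #   x_n = x_{n-1} + v_n
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for n in range(steps):
        v = alpha * v - epsilon * grad(x)
        x = x + v
    return x

# Quadratic bowl f(x) = ||x||^2 with gradient 2x; its minimum is at the origin.
print(momentum_descent(lambda x: 2.0 * x, x0=[3.0, -2.0], epsilon=0.1, alpha=0.8, steps=200))
# approximately [0. 0.]
```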

This concludes this article. Hopefully, readers leave with some extra intuition regarding gradient descent with momentum.


Bibliography

Marc Peter Deisenroth, A Aldo Faisal, and Cheng Soon Ong. Mathematics for Machine Learning. Cambridge University Press, 2020.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

Jorge Nocedal and Stephen J Wright. Numerical optimization. Springer, 1999.