
Note: this vignette is not complete; additional optimization options are still being added.

Optimization Theory

Colossus offers three levels of score calculation: calculating only the score, calculating the score and its first derivatives, or calculating the score and both its first and second derivatives. The second and third options correspond to the gradient descent and Newton-Raphson optimization approaches. The goal of this vignette is to discuss how these methods differ and in what circumstances each might be most appropriate. In both cases the algorithm iteratively updates the parameter estimates to approach a set of parameter values that optimizes the score. The major difference is how much information is calculated and used. The Newton-Raphson algorithm calculates the second-derivative matrix, inverts it, and solves a linear system of equations to set the first-derivative vector to zero. This establishes both a magnitude and a direction for every step, so each step involves several time-intensive calculations, but the new parameter estimates are well informed. In this algorithm Colossus uses both a learning rate ($\eta$) and a maximum allowable parameter change ($\beta_{max}$).

$$
\begin{aligned}
\Delta \beta \times \frac{\partial^2 LL}{\partial \beta^2} &\approx - \frac{\partial LL}{\partial \beta} \\
\Delta \beta &= - \eta \frac{\partial LL}{\partial \beta_{t}} \times \left( \frac{\partial^2 LL}{\partial \beta_{t}^2} \right)^{-1} \\
\beta_{t+1} &= \beta_{t} + \operatorname{sign}(\Delta \beta) \cdot \min\left( \left| \Delta \beta \right|, \beta_{max} \right)
\end{aligned}
$$
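
As a concrete illustration, the sketch below applies this Newton-Raphson step to a toy one-parameter log-likelihood, $LL(\beta) = -(\beta - 2)^2$. The names and settings used here (`ll_grad`, `ll_hess`, `eta`, `beta_max`) are assumptions made for the toy example, not Colossus's internal routines or control arguments.

```r
# Toy log-likelihood LL(beta) = -(beta - 2)^2, maximized at beta = 2
ll_grad <- function(beta) -2 * (beta - 2) # first derivative of LL
ll_hess <- function(beta) -2              # second derivative of LL

newton_step <- function(beta, eta = 1, beta_max = 0.5) {
  # Solve the (here scalar) linear system: delta * LL'' = -LL'
  delta <- -eta * ll_grad(beta) / ll_hess(beta)
  # Clamp the step magnitude to the maximum allowable parameter change
  beta + sign(delta) * min(abs(delta), beta_max)
}

beta <- 0
for (i in 1:10) beta <- newton_step(beta)
beta # approaches the optimum at beta = 2
```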

The alternative is a gradient descent approach. In this algorithm, only the first derivatives are calculated and used to determine the direction of greatest change in the score. This establishes a direction for the change in parameters, which is multiplied by the learning rate ($\eta$). Similar to the Newton-Raphson algorithm, the step magnitude is capped at the maximum allowable parameter change ($\beta_{max}$). Colossus uses half-steps to gradually reduce the allowable step size as the solution approaches the optimum. The gradient descent algorithm avoids the time-intensive second-derivative calculations but takes less informed steps, so each iteration runs faster, but more iterations may be required.

$$
\begin{aligned}
\Delta \beta &= \eta \cdot \frac{\partial LL}{\partial \beta} \\
\beta_{t+1} &= \beta_{t} + \operatorname{sign}(\Delta \beta) \cdot \min\left( \left| \Delta \beta \right|, \beta_{max} \right)
\end{aligned}
$$
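
For comparison, a gradient step on the same toy log-likelihood only needs the first derivative, with the step again clamped to $\beta_{max}$. As before, the names and settings are illustrative, not the Colossus implementation.

```r
ll_grad <- function(beta) -2 * (beta - 2) # first derivative of the toy LL

gradient_step <- function(beta, eta = 0.1, beta_max = 0.5) {
  delta <- eta * ll_grad(beta)                   # direction scaled by the learning rate
  beta + sign(delta) * min(abs(delta), beta_max) # clamp to the maximum change
}

beta <- 0
for (i in 1:50) beta <- gradient_step(beta)
beta # approaches beta = 2, but needs more iterations than Newton-Raphson
```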

The standard half-step framework is not likely to be sufficient for the gradient descent algorithm. Because of this, several optimization options have been or will be added, such as momentum, adadelta, and adam, which use previous information about the gradient to inform the step size for future steps. The first method, momentum, applies a weighted sum of the current gradient step and the previous parameter update, with weight ($\gamma$) on the previous update. This speeds up steps moving toward the optimum and corrects for when the algorithm oversteps, which can avoid oscillation around an optimum value.

$$
\begin{aligned}
\Delta \beta_{t} &= \gamma \cdot \Delta \beta_{t-1} + \eta \cdot \frac{\partial LL}{\partial \beta} \\
\beta_{t+1} &= \beta_{t} + \operatorname{sign}(\Delta \beta_{t}) \cdot \min\left( \left| \Delta \beta_{t} \right|, \beta_{max} \right)
\end{aligned}
$$
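
A minimal sketch of the momentum update on the same toy log-likelihood, with illustrative values for $\gamma$, $\eta$, and $\beta_{max}$ rather than Colossus defaults:

```r
ll_grad <- function(beta) -2 * (beta - 2) # first derivative of the toy LL

gamma <- 0.9; eta <- 0.1; beta_max <- 0.5
beta <- 0
delta_prev <- 0 # previous step, carried forward with weight gamma

for (i in 1:200) {
  delta <- gamma * delta_prev + eta * ll_grad(beta)
  beta <- beta + sign(delta) * min(abs(delta), beta_max)
  delta_prev <- delta
}
beta # overshoots, then settles near the optimum at beta = 2
```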

The next method, adadelta, applies a parameter-specific learning rate by tracking the root mean square (RMS) of the gradient and of the parameter updates within a window. Instead of tracking a true window of iterations, the old RMS estimate is decayed by a weight ($\gamma$) before the new estimate is added. The ratio of the RMS parameter update to the RMS gradient normalizes the result back into the correct units. A small offset ($\epsilon$) is added to avoid division by zero.

$$
\begin{aligned}
g_t &= \left( \frac{\partial LL}{\partial \beta} \right)_{t} \\
E[g^2]_{t} &= \gamma \cdot E[g^2]_{t-1} + (1-\gamma) \cdot g^2_{t} \\
E[\Delta \beta^2]_{t-1} &= \gamma \cdot E[\Delta \beta^2]_{t-2} + (1-\gamma) \cdot \Delta \beta^2_{t-1} \\
RMS[g]_t &= \sqrt{E[g^2]_{t} + \epsilon} \\
RMS[\Delta \beta]_{t-1} &= \sqrt{E[\Delta \beta^2]_{t-1} + \epsilon} \\
\Delta \beta_{t} &= \frac{RMS[\Delta \beta]_{t-1}}{RMS[g]_t} \cdot g_t \\
\beta_{t+1} &= \beta_{t} + \operatorname{sign}(\Delta \beta_{t}) \cdot \min\left( \left| \Delta \beta_{t} \right|, \beta_{max} \right)
\end{aligned}
$$
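
A minimal sketch of the adadelta update under the same toy setup; the decay weight and offset values are illustrative assumptions. Note that the running average of squared parameter updates is refreshed only after the new step is taken, matching the time indices above.

```r
ll_grad <- function(beta) -2 * (beta - 2) # first derivative of the toy LL

gamma <- 0.9; eps <- 1e-6; beta_max <- 0.5
beta <- 0
E_g2 <- 0 # decayed average of squared gradients
E_d2 <- 0 # decayed average of squared parameter updates

for (i in 1:1000) {
  g <- ll_grad(beta)
  E_g2 <- gamma * E_g2 + (1 - gamma) * g^2
  delta <- sqrt(E_d2 + eps) / sqrt(E_g2 + eps) * g # RMS ratio restores the units
  beta <- beta + sign(delta) * min(abs(delta), beta_max)
  E_d2 <- gamma * E_d2 + (1 - gamma) * delta^2
}
beta # gradually adapts its step size and works toward the optimum at beta = 2
```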

The final method, adam, combines the theory behind the momentum and adadelta methods. The adam method tracks estimates of the first moment vector ($m$) and second moment vector ($v$), which are decayed by weights ($\beta_1, \beta_2$). These estimates are bias-corrected ($\hat{m}, \hat{v}$) to account for their initialization at zero, which biases them in early iterations. The learning rate ($\eta$) and second moment vector provide the decaying, parameter-specific step size from adadelta, and the first moment vector provides an effect similar to momentum. Combined, these have generally been able to stabilize gradient descent algorithms without incurring a significant computational cost.

$$
\begin{aligned}
g_t &= \left( \frac{\partial LL}{\partial \beta} \right)_{t} \\
m_0, v_0 &= 0, 0 \\
m_t &= \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t \\
v_t &= \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot g^2_t \\
\hat{m}_t &= m_t / (1-\beta_1^t) \\
\hat{v}_t &= v_t / (1-\beta_2^t) \\
\Delta \beta_{t} &= \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t \\
\beta_{t+1} &= \beta_{t} + \operatorname{sign}(\Delta \beta_{t}) \cdot \min\left( \left| \Delta \beta_{t} \right|, \beta_{max} \right)
\end{aligned}
$$
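
Finally, a minimal sketch of the adam update under the same toy setup; `b1` and `b2` stand in for the decay parameters $\beta_1$ and $\beta_2$, and the values below are common defaults rather than Colossus settings.

```r
ll_grad <- function(beta) -2 * (beta - 2) # first derivative of the toy LL

b1 <- 0.9; b2 <- 0.999; eta <- 0.1; eps <- 1e-8; beta_max <- 0.5
beta <- 0
m <- 0 # first moment estimate
v <- 0 # second moment estimate

for (t in 1:500) {
  g <- ll_grad(beta)
  m <- b1 * m + (1 - b1) * g   # decayed average of gradients
  v <- b2 * v + (1 - b2) * g^2 # decayed average of squared gradients
  m_hat <- m / (1 - b1^t)      # bias corrections for the zero initialization
  v_hat <- v / (1 - b2^t)
  delta <- eta / (sqrt(v_hat) + eps) * m_hat
  beta <- beta + sign(delta) * min(abs(delta), beta_max)
}
beta # settles near the optimum at beta = 2
```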