Exponentially Weighted Averages

Introducing Exponentially Weighted Averages

  • Let's say we have some temperature data $\theta$:
$$\theta_{1} = 40 \degree\text{F}$$
$$\theta_{2} = 49 \degree\text{F}$$
$$\vdots$$
$$\theta_{n} = 60 \degree\text{F}$$
  • Then our exponentially weighted averages would look like:
$$v_{0} = 0$$
$$v_{1} = 0.9v_{0} + 0.1 \theta_{1}$$
$$v_{2} = 0.9v_{1} + 0.1 \theta_{2}$$
$$\vdots$$
$$v_{n} = 0.9v_{n-1} + 0.1 \theta_{n}$$
  • Plugging in the temperature values gives:
$$v_{0} = 0$$
$$v_{1} = 0.9v_{0} + 0.1 \times 40 = 4$$
$$v_{2} = 0.9 \times 4 + 0.1 \times 49 = 8.5$$
$$\vdots$$
$$v_{n} = 0.9v_{n-1} + 0.1 \times 60$$
  • Here, our hyperparameter is $\beta = 0.9$
  • Here, $\theta_{t}$ represents the $t^{th}$ temperature value
  • Here, $v_{t}$ represents the exponentially weighted average of the temperatures up to and including the $t^{th}$ value
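
The recurrence above can be replayed numerically. The following is a minimal sketch, assuming $\beta = 0.9$ and $v_{0} = 0$ as in the example; the function name `ewa` is hypothetical and chosen only for illustration.

```python
def ewa(values, beta=0.9, v0=0.0):
    """Return [v_1, ..., v_n] for v_t = beta * v_{t-1} + (1 - beta) * theta_t."""
    v = v0
    averages = []
    for theta in values:
        # One update step: blend the previous average with the new value.
        v = beta * v + (1 - beta) * theta
        averages.append(v)
    return averages


temps = [40.0, 49.0]      # theta_1 and theta_2 from the example
averages = ewa(temps)
print(averages)           # approximately [4, 8.5], up to floating-point rounding
```

Only the running value `v` is carried from one step to the next; the list `averages` is kept here just so the intermediate results can be inspected.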

Defining Exponentially Weighted Averages

  • Exponentially weighted averages are sometimes referred to as exponentially weighted moving averages in statistics
  • The general formula for an exponentially weighted average is:
$$v_{t} = \beta v_{t-1} + (1-\beta) \theta_{t}$$

Interpreting the Parameters

  • We can think of $v_{t}$ as approximately an average over the previous $\frac{1}{1-\beta}$ days
  • More generally, we can use any unit of time (not just days)
  • Roughly, $\beta = 0.9$ looks at the previous $10$ units of time
  • Roughly, $\beta = 0.98$ looks at the previous $50$ units of time
  • Roughly, $\beta = 0.5$ looks at the previous $2$ units of time
  • Small values of $\beta$ give a very wiggly curve, since the average tracks each new value closely
  • Large values of $\beta$ give a smooth curve
  • Large values of $\beta$ also shift the curve rightward, meaning the average lags behind the data
  • This is because we're averaging over a larger window of values
  • In other words, a large $\beta$ means we adapt slowly to changes in our data
  • This is because large $\beta$ values give more weight to the previous average $v_{t-1}$ and less weight, $1-\beta$, to the current value $\theta_{t}$
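
To make the lag concrete, here is a small sketch that feeds a step signal (a made-up input that jumps from 0 to 1) through the recurrence for the three values of $\beta$ listed above, and reports how long the average takes to reach 0.9. The signal, the 0.9 threshold, and the helper names are assumptions chosen for illustration.

```python
def ewa(values, beta):
    """Exponentially weighted averages with v_0 = 0."""
    v, averages = 0.0, []
    for theta in values:
        v = beta * v + (1 - beta) * theta
        averages.append(v)
    return averages


def steps_to_reach(values, beta, target):
    """1-based index of the first step whose average is >= target, else None."""
    for t, v in enumerate(ewa(values, beta), start=1):
        if v >= target:
            return t
    return None


# Illustrative input: five zeros, then a long run of ones.
step_signal = [0.0] * 5 + [1.0] * 300

for beta in (0.5, 0.9, 0.98):
    window = 1 / (1 - beta)                      # rough averaging window
    t = steps_to_reach(step_signal, beta, target=0.9)
    print(f"beta={beta}: window ~ {window:.0f}, average reaches 0.9 at step {t}")
```

The larger $\beta$ is, the later the average catches up with the jump, which is the rightward shift described above.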

[Figure: weighted averages]

What is Exponential about the Weighted Averages?

  • The data points further away from our current value become exponentially less important
  • This exponential decay comes from repeated multiplication by $\beta$: the value $k$ steps in the past has weight $(1-\beta)\beta^{k}$
  • We can rewrite the equations for our example data in the following steps:
$$v_{n} = 0.1\theta_{n} + 0.9v_{n-1}$$
$$v_{n} = 0.1\theta_{n} + 0.9(0.1\theta_{n-1} + 0.9v_{n-2})$$
$$v_{n} = 0.1\theta_{n} + 0.9(0.1\theta_{n-1} + 0.9(0.1\theta_{n-2} + 0.9v_{n-3}))$$
$$v_{n} = 0.1\theta_{n} + 0.9(0.1\theta_{n-1} + 0.9(0.1\theta_{n-2} + 0.9(0.1\theta_{n-3} + 0.9v_{n-4})))$$
$$v_{n} = \underbrace{0.1\theta_{n}}_{g_{t=1}} + \underbrace{0.1 \times 0.9 \theta_{n-1}}_{g_{t=2}} + 0.1 \times 0.9^{2} \theta_{n-2} + 0.1 \times 0.9^{3} \theta_{n-3} + 0.9^{4}v_{n-4}$$

[Figure: exponential decay]

  • As previously stated, we roughly look at the previous $10$ units of time when $\beta = 0.9$
  • Here, we can see the weight $g$ has become small by $t=10$: $g_{t=10} = 0.1 \times 0.9^{9} \approx 0.039$, a bit over a third of the initial weight $g_{t=1} = 0.1$
  • In other words, temperatures that are $11$, $12$, or more days away from the current day don't have much influence on our current average
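
The unrolled sum can be checked against the recurrence directly. This is a sketch under stated assumptions: $\beta = 0.9$, $v_{0} = 0$, and a short made-up temperature list (only 40, 49, and 60 come from the example; the other values are placeholders).

```python
beta = 0.9

# Weight on the value k steps in the past: (1 - beta) * beta**k.
weights = [(1 - beta) * beta**k for k in range(50)]


def ewa_last(values, beta):
    """Final value v_n of the recurrence, starting from v_0 = 0."""
    v = 0.0
    for theta in values:
        v = beta * v + (1 - beta) * theta
    return v


# Illustrative data: 45 and 52 are invented filler values.
temps = [40.0, 49.0, 45.0, 52.0, 60.0]

# Unrolled form: the newest value gets weights[0], the one before gets weights[1], ...
unrolled = sum(w * theta for w, theta in zip(weights, reversed(temps)))
recursive = ewa_last(temps, beta)

print(abs(unrolled - recursive) < 1e-9)   # the two forms agree
print(sum(weights[:10]))                  # share of total weight on the 10 newest values
print(beta**10)                           # about 0.35, close to 1/e
```

With $\beta = 0.9$, roughly 65% of the total weight sits on the 10 most recent values, and each older value contributes progressively less.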

Advantages and Disadvantages

  • An advantage of using exponentially weighted averages is efficiency in memory and computation

    • If we were to average values using a moving window, we would need to store every value inside the window
    • By averaging values using exponential weights, we only need to keep the current $v_{t}$ in memory and apply a single update per step
  • A disadvantage of using exponentially weighted averages is the decrease in accuracy

    • If we were to average values using a moving window, we could compute the exact average of the previous days
    • By averaging values using exponential weights, we only have the previous $v_{t}$, which approximates that average
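
The memory trade-off can be sketched with two small streaming classes. Both class names are hypothetical, and the moving-window version is just one straightforward way to implement such an average, using a bounded `collections.deque`.

```python
from collections import deque


class MovingWindowAverage:
    """Exact average of the last `window` values; stores all of them."""

    def __init__(self, window):
        self.buffer = deque(maxlen=window)   # oldest value is dropped automatically

    def update(self, theta):
        self.buffer.append(theta)
        return sum(self.buffer) / len(self.buffer)


class ExponentialAverage:
    """Exponentially weighted average; stores a single number."""

    def __init__(self, beta):
        self.beta = beta
        self.v = 0.0                         # v_0 = 0

    def update(self, theta):
        self.v = self.beta * self.v + (1 - self.beta) * theta
        return self.v


window_avg = MovingWindowAverage(window=10)
exp_avg = ExponentialAverage(beta=0.9)

for theta in [40.0, 49.0, 60.0]:
    print(window_avg.update(theta), exp_avg.update(theta))
```

The moving-window object holds up to 10 values at a time, while the exponential one holds only `v`, whatever the effective window length is.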

Describing Bias Correction

  • By setting $v_{0}=0$, we add some degree of bias to our model

[Figure: bias correction]

  • Since $v_{1}, v_{2},...,v_{n}$ are all based on $v_{0} = 0$, the earlier values of $v$ will be smaller than expected (in the example above, $v_{1} = 4$ even though $\theta_{1} = 40$)
  • As $t$ increases, this bias becomes negligible
  • However, we should still correct for this bias, since earlier values of $v$ will be underestimated
  • We can do this by transforming each $v_{t}$ term:
$$v_{t}^{*} = \frac{v_{t}}{1-\beta^{t}}$$
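
As a quick check of the correction, the sketch below applies it to the first two example temperatures. It assumes the standard form of bias correction, which divides $v_{t}$ by $1-\beta^{t}$ with $t$ the time step, and again uses $\beta = 0.9$ and $v_{0} = 0$. The function name is hypothetical.

```python
def ewa_bias_corrected(values, beta=0.9):
    """Return (raw, corrected) lists, where corrected[t-1] = v_t / (1 - beta**t)."""
    v = 0.0
    raw, corrected = [], []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        raw.append(v)
        # Divide by the total weight accumulated so far, 1 - beta**t.
        corrected.append(v / (1 - beta**t))
    return raw, corrected


raw, corrected = ewa_bias_corrected([40.0, 49.0])
print(raw)         # approximately [4, 8.5]
print(corrected)   # approximately [40, 44.74]
```

The corrected values land in the range of the actual temperatures, whereas the raw values start far below them. Since $\beta^{t}$ shrinks toward 0 as $t$ grows, the divisor approaches 1 and the correction fades out on its own.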

TL;DR

  • Exponentially weighted averages are sometimes referred to as exponentially weighted moving averages in statistics
  • The general formula for an exponentially weighted average is:
$$v_{t} = \beta v_{t-1} + (1-\beta) \theta_{t}$$
  • We can think of $v_{t}$ as approximately an average over the previous $\frac{1}{1-\beta}$ days
  • More generally, we can use any unit of time (not just days)
  • Roughly, $\beta = 0.9$ looks at the previous $10$ units of time
  • Roughly, $\beta = 0.98$ looks at the previous $50$ units of time
  • Roughly, $\beta = 0.5$ looks at the previous $2$ units of time
