Understanding the Glicko Rating System

Written on 2020-11-07

In 2007, Microsoft released their TrueSkill 1 paper, which was essentially a modified implementation of the Elo system. At the time, most of their games used TrueSkill 1 as a player-ranking system. Later on, the TrueSkill 2 paper was released, and TrueSkill 2 replaced TrueSkill 1 as the ranking system in most of their Xbox games, such as Halo.

Both TrueSkill 1 and TrueSkill 2 treat a player's rating as a distribution, rather than a single summary statistic. To achieve this, both ranking systems use a principled Bayesian framework. As a result, the TrueSkill algorithm accounts for the fact that each player's observed performance varies from game to game around their underlying skill.

Comparing Elo with Glicko

The Elo system was invented around 1960 as an improved chess-rating system. It is a method for calculating the relative skill levels of players in zero-sum games, such as chess. Since then, the Elo system has become very popular across a broad range of other sports, such as basketball and football. In 1995, the Glicko system was invented to improve on the Elo system by introducing a measure of rating uncertainty, called the ratings deviation. It has since been adopted as the rating system on many chess servers and online games. For a more detailed explanation of the Glicko system, refer to its original paper.

The TrueSkill rating systems borrow many ideas from Glicko, but they also include a measure of match quality between a set of players. There are a few other mathematical differences described in the paper, as well as a few differences between the TrueSkill 1 and TrueSkill 2 rating systems. However, they share more similarities than differences. For the remainder of this post, I'll focus on illustrating the intuition behind the Glicko algorithm.

$$\text{RD} = \min\left(\sqrt{\text{RD}^{2}_{\text{old}} + c^{2}},\ 350\right)$$

As stated previously, the Glicko system extends the Elo system by computing not only a rating, which can be thought of as a best guess of one's playing strength, but also a ratings deviation $\text{RD}$. In statistical terminology, the $\text{RD}$ term represents a standard deviation, which measures the uncertainty of a rating. Thus, a high $\text{RD}$ corresponds to an unreliable rating, which typically means the player has competed in only a small number of tournament games, whereas a low $\text{RD}$ indicates that the player competes frequently.
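
To make the $\text{RD}$ term concrete: Glickman suggests reading a rating together with its $\text{RD}$ as roughly a 95% confidence interval of $r \pm 2\,\text{RD}$. Here is a minimal sketch of that interpretation (the ratings below are made-up examples):

```python
def rating_interval(r, rd):
    """Approximate 95% confidence interval for a player's strength:
    about two standard deviations (RDs) on either side of the rating."""
    return (r - 2 * rd, r + 2 * rd)

print(rating_interval(1850, 50))   # frequent player: (1750, 1950) -- fairly reliable
print(rating_interval(1850, 350))  # new or idle player: (1150, 2550) -- very unreliable
```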

Intuition behind Uncertainty of Ranking

In the Glicko system, a player's rating changes only based on their game outcomes, but their $\text{RD}$ changes based on both their game outcomes and the time they spend not playing. Thus, there are two features in the Glicko system that don't exist in the Elo system. First, $\text{RD}$ increases as time passes without a player competing in many games. Second, if one player's rating increases by $x$, the opponent's rating does not usually decrease by $x$. In the Glicko system, the amount by which the opponent's rating decreases is governed by both players' $\text{RD}$ terms.

To apply the rating algorithm, we treat a collection of games within a rating period as having occurred simultaneously. A rating period could be as long as several months or as short as one minute. As a result, an $\text{RD}$ is calculated for each rating period based on the $\text{RD}_{\text{old}}$ from the previous rating period.

Here, the constant $c$ governs the increase in uncertainty between rating periods, and it can be determined empirically by optimizing the system's predictive accuracy on future games.
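
As a concrete sketch of this step, the snippet below applies the formula above between rating periods. The value $c \approx 34.6$ is a hypothetical choice, corresponding to a player's $\text{RD}$ drifting from 50 back toward 350 over about 100 idle rating periods; it is not a universal constant:

```python
import math

def inflate_rd(rd_old, c, periods=1):
    """Increase RD for each rating period a player sits out,
    capped at 350 (the RD assigned to a brand-new player)."""
    rd = rd_old
    for _ in range(periods):
        rd = min(math.sqrt(rd ** 2 + c ** 2), 350)
    return rd

print(round(inflate_rd(50, 34.6), 1))               # one idle period: 60.8
print(round(inflate_rd(50, 34.6, periods=100), 1))  # 100 idle periods: ~349.6
```

Once the pre-period $\text{RD}$ is computed, the new rating $r'$ is updated for each player: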

$$r' = r + \frac{q}{\frac{1}{\text{RD}^{2}} + \frac{1}{d^{2}}} \sum^{m}_{j=1} g(\text{RD}_{j})\left(s_{j} - \text{E}(s \mid r, r_{j}, \text{RD}_{j})\right)$$

$$\text{RD}' = \sqrt{\frac{1}{\frac{1}{\text{RD}^{2}} + \frac{1}{d^{2}}}}$$

The $\text{RD}'$ term refers to the updated confidence in our own rating after playing a set of opponents. Notice that it's basically the same as our previous $\text{RD}$, but it adds a $d$ term. The $d$ term can be thought of as the change in uncertainty of $\text{RD}$, or how much more or less certain we are about our rating after the period's games. A really large $d$ (i.e., $d \to \infty$) leaves $\text{RD}'$ equal to our previous $\text{RD}$, whereas a really small $d$ (i.e., $d \to 0$) drives $\text{RD}'$ toward $0$.

Essentially, the $d$ term shrinks as we play more games, which supports the intuition that playing more games yields a lower $\text{RD}$ (and vice versa). That is how the uncertainty is updated.
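
A quick numerical check of those two limits, using arbitrary example values:

```python
import math

def updated_rd(rd, d):
    """Post-period RD: combines the prior uncertainty (RD) with the
    information gained from this period's games (d) in inverse-variance form."""
    return math.sqrt(1 / (1 / rd ** 2 + 1 / d ** 2))

rd = 200
print(updated_rd(rd, 1e9))   # d -> infinity: ~200.0 (games told us nothing new)
print(updated_rd(rd, 1e-6))  # d -> 0: ~0.000001 (games made us nearly certain)
print(updated_rd(rd, 200))   # d == RD: ~141.42 (uncertainty cut by a factor of sqrt(2))
```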

Intuition behind Expected Ranking

The new rating $r'$ depends on the difference between the actual outcome $s_{j}$ against an opponent and the expected outcome $\text{E}(s \mid ...)$ of playing that opponent. This expected outcome will be around $0$ if a player is expected to lose, and around $1$ if a player is expected to win. This difference is what really determines the update to a player's rating.

$$\text{E}(s \mid r, r_{j}, \text{RD}_{j}) = \frac{1}{1 + 10^{\frac{-g(\text{RD}_{j})(r - r_{j})}{400}}}$$

$$d^{2} = \frac{1}{q^{2} \sum_{j=1}^{m} g(\text{RD}_{j})^{2} \, \text{E}(s \mid r, r_{j}, \text{RD}_{j}) \left(1 - \text{E}(s \mid r, r_{j}, \text{RD}_{j})\right)}$$

If a player beats someone he/she was expected to lose to, this difference will be positive and large, meaning his/her new rating will be much higher. If a player beats someone he/she was expected to beat, this difference will be positive and small, meaning his/her new rating won't change much. However, if a player loses to someone he/she was expected to beat, this difference will be negative and large, meaning his/her new rating will be much lower. Lastly, if a player loses to someone he/she was expected to lose to, this difference will be negative and small, meaning his/her rating won't change much.

| $s_{j}$ | $\text{E}(s \mid ...)$ | Change in $r'$ |
| :-: | :-: | :-: |
| win | lose ($\approx 0$) | much higher |
| win | win ($\approx 1$) | not much higher |
| lose | win ($\approx 1$) | much lower |
| lose | lose ($\approx 0$) | not much lower |
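
As a sanity check on the $\text{E}$ formula above, the sketch below computes expected scores with $g$ fixed at $1$ (i.e., pretending the opponent's rating is perfectly certain); the ratings are made-up examples:

```python
def expected_score(r, r_j, g=1.0):
    """Expected score for a player rated r against an opponent rated r_j.
    g < 1 attenuates the rating difference when the opponent's RD is high;
    g = 1.0 treats the opponent's rating as perfectly certain."""
    return 1 / (1 + 10 ** (-g * (r - r_j) / 400))

print(round(expected_score(1500, 1500), 2))  # evenly matched: 0.5
print(round(expected_score(1700, 1500), 2))  # 200 points stronger: ~0.76
print(round(expected_score(1300, 1500), 2))  # 200 points weaker: ~0.24
```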

After playing a set of opponents, the updated rating $r'$ also depends on the $g(\text{RD})$ term, which, roughly speaking, is inversely related to each opponent's $\text{RD}$. The following formula defines this function:

$$g(\text{RD}) = \frac{1}{\sqrt{1 + \frac{3 q^{2} \text{RD}^{2}}{\pi^{2}}}}, \qquad q = \frac{\ln 10}{400}$$

Again, $g(\text{RD})$ is roughly the inverse of $\text{RD}$. Recall that a high $\text{RD}$ indicates a player's rating is fairly uncertain. Accordingly, $g(\text{RD})$ outputs a high value when a player's rating is fairly certain. Meaning, if a player has played many games, then he/she generally will have a low $\text{RD}$ and a high $g(\text{RD})$. On the other hand, if a player hasn't played many games, then he/she generally will have a high $\text{RD}$ and a low $g(\text{RD})$.

The inverse of $\text{RD}$ represents the certainty of a player's rating. Mathematically, we weight the difference mentioned earlier by the opponent's certainty. So, if an opponent's rating is highly certain, the game against him/her tells us a lot, and our own rating moves by a larger multiple of the difference. If an opponent's rating is highly uncertain, the game tells us little, and our rating won't move by very much.
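
To see this falloff numerically, here are a few $g$ values (the sample $\text{RD}$ values 30, 100, and 300 are the opponent deviations reused in the worked example at the end of this post):

```python
import math

Q = math.log(10) / 400  # q = ln(10)/400

def g(rd):
    """Attenuation factor from the formula above: near 1 for certain
    opponents (low RD), smaller for uncertain opponents (high RD)."""
    return 1 / math.sqrt(1 + 3 * Q ** 2 * rd ** 2 / math.pi ** 2)

for rd in (30, 100, 300):
    print(rd, round(g(rd), 3))  # 30 -> 0.995, 100 -> 0.953, 300 -> 0.724
```

The table below summarizes this relationship: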

| Number of games played | $\text{RD}$ | $g(\text{RD})$ |
| :-: | :-: | :-: |
| many | low | high |
| few | high | low |
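
Putting all of the pieces together, here is a minimal sketch of one full Glicko rating-period update. The inputs reproduce the worked example from Glickman's paper: a player rated 1500 with $\text{RD} = 200$ beats a 1400-rated opponent ($\text{RD} = 30$) and loses to opponents rated 1550 ($\text{RD} = 100$) and 1700 ($\text{RD} = 300$), ending the period at roughly $r' \approx 1464$ and $\text{RD}' \approx 151.4$:

```python
import math

Q = math.log(10) / 400  # q = ln(10)/400

def g(rd):
    """Attenuation factor for an opponent's rating deviation."""
    return 1 / math.sqrt(1 + 3 * Q ** 2 * rd ** 2 / math.pi ** 2)

def expected(r, r_j, rd_j):
    """Expected score against an opponent rated r_j with deviation rd_j."""
    return 1 / (1 + 10 ** (-g(rd_j) * (r - r_j) / 400))

def glicko_update(r, rd, results):
    """One rating-period update. `results` is a list of
    (opponent_rating, opponent_rd, score) tuples, score 1 = win, 0 = loss."""
    # 1/d^2: information gained from this period's games
    d_sq_inv = Q ** 2 * sum(
        g(rd_j) ** 2 * expected(r, r_j, rd_j) * (1 - expected(r, r_j, rd_j))
        for r_j, rd_j, _ in results
    )
    # Combined inverse variance: 1/RD^2 + 1/d^2
    var_inv = 1 / rd ** 2 + d_sq_inv
    # Sum of attenuated (actual - expected) outcome differences
    delta = sum(g(rd_j) * (s - expected(r, r_j, rd_j)) for r_j, rd_j, s in results)
    return r + (Q / var_inv) * delta, math.sqrt(1 / var_inv)

results = [(1400, 30, 1), (1550, 100, 0), (1700, 300, 0)]
r_new, rd_new = glicko_update(1500, 200, results)
print(round(r_new), round(rd_new, 1))  # -> 1464 151.4
```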
