A statistical model of the 2014 World Cup
3 Jul 2014, Marcel
Today we introduce our statistical model for predicting the outcome of the 2014 World Cup. At a very high level, our approach is as follows:
■ We construct a stochastic model that generates a distribution of outcomes for each of the 64 matches of the 2014 World Cup, from the opener between Brazil and Croatia on June 12 in São Paulo through the final on July 13 in Rio de Janeiro.
■ The predictions for each match are based on a regression analysis that uses the entire history of mandatory international football matches—i.e., no friendlies—since 1960. This gives us about 14,000 observations to estimate the coefficients of our model. The dependent variable in the regression analysis is the number of goals scored by each side in each match. Following the literature on modelling football matches, we assume that the number of goals scored by a particular side in a particular match follows a Poisson distribution.
■ The explanatory variables in the regression analysis are as follows:
1. The difference in the Elo ratings of the two teams. The Elo rating is a composite measure of national football team success that is based on the entire historical track record. Unlike the somewhat better known FIFA/Coca-Cola rating, the Elo rating is available for the entire history of international football matches. Statistically, we find that the difference in Elo ratings is the most powerful variable in the model.
2. The average number of goals scored by the team over the last ten mandatory international games.
3. The average number of goals conceded by the opposing team over the last five mandatory international games.
4. A country-specific dummy variable indicating whether the game in question took place at a World Cup. This variable is meant to capture whether a team has a tendency to systematically outperform or underperform at a World Cup. We only include this variable for countries that have participated in a sufficient number of post-1960 World Cup games (including Brazil, Germany, Argentina, Spain, Netherlands, England, Italy and France).
5. A dummy variable indicating whether the team played in its home country.
6. A dummy variable indicating whether the team played on its home continent, with coefficients that are allowed to vary by country.
■ We generate a probability distribution for the outcome of each game using a Monte Carlo simulation with 100,000 draws, using the parameters estimated in the regression analysis described above. We use the results of this simulation analysis to generate the probabilities of teams reaching particular stages of the tournament, up to winning the championship. We use the rounded prediction of the goals scored to determine the outcomes of each game during the group stage and the unrounded forecast to pick the winner in the knockout stage.
■ To be clear, our model does not use any information on the quality of teams or individual players that is not reflected in a team’s track record. For example, if a key player who was responsible for a team’s recent successes is injured, this will have no bearing on our predictions. There is also no role for human judgment as the approach is purely statistical.
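The simulation step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the log link, the independence of the two sides' goal counts, and every coefficient value and name (`COEFS`, `INTERCEPT`, `elo_diff_per_100`, `home_country`) are assumptions made for the example, since the estimated coefficients are not given in the text.

```python
import math
import random

# Hypothetical coefficients for illustration only; the estimated values
# are not given in the text.
INTERCEPT = 0.10
COEFS = {"elo_diff_per_100": 0.15, "home_country": 0.30}


def expected_goals(intercept, coefs, features):
    """Expected goals under an assumed log link: exp(b0 + sum(b_k * x_k))."""
    linear = intercept + sum(coefs[name] * value for name, value in features.items())
    return math.exp(linear)


def draw_poisson(lam, rng):
    """Draw one Poisson variate (Knuth's method, adequate for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def simulate_match(lam_a, lam_b, draws=100_000, seed=0):
    """Return (P(A wins), P(draw), P(B wins)) from independent Poisson draws."""
    rng = random.Random(seed)
    wins_a = ties = wins_b = 0
    for _ in range(draws):
        goals_a = draw_poisson(lam_a, rng)
        goals_b = draw_poisson(lam_b, rng)
        if goals_a > goals_b:
            wins_a += 1
        elif goals_a == goals_b:
            ties += 1
        else:
            wins_b += 1
    return wins_a / draws, ties / draws, wins_b / draws


if __name__ == "__main__":
    # Team A: 150 Elo points stronger and playing at home (made-up inputs).
    lam_a = expected_goals(INTERCEPT, COEFS, {"elo_diff_per_100": 1.5, "home_country": 1})
    lam_b = expected_goals(INTERCEPT, COEFS, {"elo_diff_per_100": -1.5, "home_country": 0})
    print(simulate_match(lam_a, lam_b))
```

Repeating this for every fixture and carrying the simulated winners through the bracket would give stage-by-stage probabilities of the kind reported in the exhibits.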
We can summarise the predictions of our model in two main ways. Exhibits 1 and 2 show the point estimates for the outcomes of the group and knockout stages, defined as the single most likely path of the tournament based on the information included in the model. The model predicts that Brazil, Germany, Argentina and Spain will reach the semifinals, and that Brazil will beat Argentina in the final. We will update these predictions after every game of the tournament on our portal.
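The point-estimate rule (rounded goal forecasts in the group stage, unrounded forecasts in the knockout stage) might look like the sketch below. The function names are hypothetical, and the text does not say how halves are rounded or how an exact knockout tie is broken, so round-half-up and an explicit error are assumptions.

```python
import math


def round_half_up(x):
    """Conventional rounding for non-negative forecasts (assumed convention)."""
    return int(math.floor(x + 0.5))


def group_stage_result(exp_goals_a, exp_goals_b):
    """Compare rounded goal forecasts; returns 'A', 'B' or 'draw'."""
    score_a, score_b = round_half_up(exp_goals_a), round_half_up(exp_goals_b)
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "draw"


def knockout_winner(exp_goals_a, exp_goals_b):
    """Compare unrounded forecasts, since a knockout match needs a winner."""
    if exp_goals_a == exp_goals_b:
        # The tie-break is not described in the text.
        raise ValueError("identical forecasts: tie-break rule unspecified")
    return "A" if exp_goals_a > exp_goals_b else "B"


# Forecasts of 1.4 and 1.2 goals both round to 1, so the group-stage call
# is a draw, while the knockout call goes to the side with the higher forecast.
print(group_stage_result(1.4, 1.2), knockout_winner(1.4, 1.2))
```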
However, the predictions shown in Exhibits 1 and 2 are very uncertain because football is a low-scoring and unpredictable game. For this reason, we believe that Exhibit 3, which shows the probabilities of each team reaching a particular stage of the World Cup as generated by our Monte Carlo analysis, may provide a better illustration of the model. We will also update these probabilities after every game on our portal.
The most striking aspect of our model is how heavily it favours Brazil to win the World Cup, with Argentina and Germany next most favoured but much lower down in probability. Of course, it is hardly surprising that the most successful team in football history is favoured to win a World Cup at home. But the extent of the Brazilian advantage in our model is nevertheless striking. Our probability for an overall Brazil win is almost 50%, versus 25% for Ladbrokes bookmakers, as shown in Exhibit 4.
There are four main reasons why the model favours Brazil by such a large margin:
- Brazil is the highest rated team in the Elo system, the single most important predictor of tournament success in our model. Since the Elo system dynamically updates its scores based on recent performance, the high rating is partly due to Brazil’s success in the 2013 Confederations Cup, including a 3:0 win against Spain and a 4:2 win against Italy. Admittedly, other measures of overall team strength do not show quite as favourable a picture for Brazil, but it is still one of the highest rated teams. For example, Exhibit 5 shows that there is a decent relationship between the Elo rating and the FIFA/Coca-Cola rating. Moreover, Exhibit 6 shows that there is also a decent relationship between the Elo rating and the aggregate transfer value of all the players in the different national squads. (Our statistical analysis does not use either the FIFA/Coca-Cola ratings, which are only available back to 1992, or aggregate transfer values, which are only available as a snapshot.)
- Brazil is a particularly strong performer at World Cup tournaments, relative to other matches. This is, of course, one reason why Brazil has won a record five World Cups. Other teams that tend to perform particularly strongly at World Cups relative to other matches include Germany and Argentina. Among the traditional football powerhouses, England is the only one that does not perform meaningfully better at World Cups than in other matches, adjusting for all the other parameters in the model. The statistical World Cup effects for each of the major teams are illustrated in Exhibit 7.
- Home advantage is an important predictor of international football matches. According to our model, it is worth an extra 0.4 goals scored per match, adjusting for all the other parameters in the model. Related to this, Exhibit 8 shows that home advantage is a very strong predictor of winning the World Cup. The home team has won 30% of all World Cups since 1930, and over 50% of all World Cups held in a traditional football powerhouse (Brazil, Italy, Germany, Argentina, Uruguay, Spain, France and England).
- Home continent advantage is also an important factor, particularly for Latin American teams. Exhibit 9 shows the home continent coefficient for all participating teams. This implies that the composite home advantage for Brazil (the home country effect plus the home continent effect) is 0.6 goals per game. Consistent with this, no European team has ever won a World Cup held in the Americas. While past performance is of course no guarantee of future results, this observation is consistent with our model’s prediction that the probability of a Latin American team winning the 2014 World Cup is 65%.
More generally, we can illustrate the importance of different variables in the model via a set of ‘waterfall charts’ for the probability of each of the four top teams winning the World Cup. In Exhibit 10, we start from a model that assumes an equal scoring propensity per game which, by construction, results in roughly even probabilities of each team winning the championship. We then successively bring in more information: the actual tournament structure and the Elo rating, goals scored and conceded in recent games, the World Cup effect, the home country effect and the home continent effect. The chart illustrates the point above that Brazil draws most of its strength from the Elo rating, the home country and home continent effect, and the World Cup effect.
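A single-match analogue of this layering can be worked out exactly. The sketch below starts from equal scoring rates and then adds the 0.4-goal home country effect and the 0.6-goal composite effect quoted above. The baseline rate of 1.3 goals (`BASE_RATE`), the treatment of each effect as a simple addition to the Poisson mean, and the independence of the two sides' goals are assumptions made for illustration.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k goals for a Poisson rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def outcome_probabilities(lam_a, lam_b, max_goals=20):
    """Exact (win, draw, loss) probabilities for side A, truncated at max_goals."""
    pa = [poisson_pmf(k, lam_a) for k in range(max_goals + 1)]
    pb = [poisson_pmf(k, lam_b) for k in range(max_goals + 1)]
    win = sum(pa[i] * pb[j] for i in range(max_goals + 1) for j in range(i))
    draw = sum(pa[i] * pb[i] for i in range(max_goals + 1))
    loss = sum(pa[i] * pb[j] for j in range(max_goals + 1) for i in range(j))
    return win, draw, loss


BASE_RATE = 1.3  # hypothetical goals per match for both sides

steps = [
    ("equal scoring rates", 0.0),
    ("plus home country effect", 0.4),
    ("plus home continent effect", 0.6),
]
for label, boost in steps:
    win, draw, loss = outcome_probabilities(BASE_RATE + boost, BASE_RATE)
    print(f"{label}: win {win:.3f}, draw {draw:.3f}, loss {loss:.3f}")
```

Because a tournament win requires a run of such matches, a modest per-match edge compounds into a much larger gap in championship probabilities.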
Having described our model, we can also ask how well the model would have done in predicting the outcome of past World Cups, using only data that were available prior to the tournament. Focusing on the 2010 World Cup in South Africa, Exhibit 11 plots the model prediction for the goal difference in each game against the actual result. Overall, there is a positive and statistically significant relationship between the actual and predicted outcomes. However, the fit of the relationship is not particularly tight with an r-squared of 0.24, because football is ultimately a pretty random game.
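For readers who want to reproduce this kind of fit statistic, the sketch below computes the r-squared of a straight-line fit of actual on predicted goal differences, which equals the squared correlation between the two series. The six data points are invented for illustration and are not the 2010 results. The text does not spell out how its 0.24 was computed, so the straight-line fit is an assumption.

```python
def r_squared(predicted, actual):
    """R-squared of a simple linear fit of actual on predicted values."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov * cov / (var_p * var_a)


# Made-up goal differences (team A minus team B) for six matches.
predicted_diff = [0.8, -0.3, 1.2, 0.1, -0.9, 0.4]
actual_diff = [2, 0, 1, -1, -2, 1]
print(f"r-squared: {r_squared(predicted_diff, actual_diff):.2f}")
```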
Another perspective on past performance is the table of predicted probabilities that our model would have generated at the start of the tournament compared with the actual outcomes of the tournament. Exhibit 12 shows that the model correctly predicted 13 of the 16 teams that advanced to the knockout stages, 5 of 8 teams that advanced to the quarter finals and 3 out of 4 teams that made it to the semi-finals. The model did not, however, correctly predict (ex ante) that Spain would win the World Cup. Spain had a 15.7% probability of winning, behind Brazil at 26.6% though narrowly ahead of eventual runners-up Netherlands.
Jan Hatzius, Sven Jari Stehn and Donnie Millar
Jan Hatzius – Goldman, Sachs & Co.
Jari Stehn – Goldman, Sachs & Co.
Source: Goldman Sachs