Is It Possible To Interpret the “Possession” More Effectively?

Sezer Unar
Jul 1, 2022


Possession is the simplest, most common and most dangerous statistic in the football world. It does not guarantee success but, with some exceptions, champions have a higher possession rate on average. Among the winners of the top five European leagues, the rates are as follows: Man City 67.9%, Bayern Munich 64.5%, PSG 62.9%, Real Madrid 60% and Milan 54%.

On the other hand, does more possession mean an easy victory? Not really…

Data Source: StatsBomb via FBRef

But is possession really such useless data?

While this question was echoing in my head, I came across Ben Griffis’s tweets. Recently, he has been sharing a lot of valuable information about using statistics correctly. I highly recommend following him.

One subject he stresses often is adjustment. Possession-adjusting is nothing new; it is already used frequently for defensive metrics.

A team in possession of the ball does not need to perform actions such as tackles or interceptions, because they already have the ball. The same logic works in reverse: a team that does not have the ball cannot pass or shoot.
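As a rough illustration of the idea, one simple way to possession-adjust a defensive count is to rescale it to what it would be if the opponent had the ball 50% of the time. This is only one of several conventions and not necessarily the one any data provider uses; the function name and the numbers below are hypothetical.

```python
def possession_adjust(defensive_actions: float, own_possession_pct: float) -> float:
    """Rescale a defensive count as if the opponent had 50% of the ball.

    Assumption: a plain linear rescaling; other adjustment formulas exist.
    """
    opponent_possession_pct = 100.0 - own_possession_pct
    if opponent_possession_pct <= 0:
        raise ValueError("the opponent must have had some possession")
    return defensive_actions * 50.0 / opponent_possession_pct


# Hypothetical example: two teams that each made 20 tackles.
high_possession_team = possession_adjust(20, 70)  # opponent had the ball 30% of the time
low_possession_team = possession_adjust(20, 30)   # opponent had the ball 70% of the time
```

With the same raw count, the team that rarely had to defend ends up with the higher adjusted figure, which is the whole point of the adjustment.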

In this blog post, I will examine this issue. First, I will compute the correlation between possession and other statistics, pick some of them, and use those metrics to look at matches that had imbalanced possession and ended with an unexpected result. Second, I will build a model using residuals rather than raw numbers. This model will estimate the probability of a team winning and show us which variables matter. Finally, we will look at data from some matches whose results sound ridiculous to the human mind but ended up just as the model expected.

I’ll do all this with the StatsBomb data via FBRef.

Let’s rock and roll!

Correlation

There are many variables that are strongly correlated with possession. However, most of them are far from answering the question of why “they” won the match. For example, since possession is itself a pass-based statistic, its relationship with the total number of short or medium passes adds little value. Likewise, touches in the penalty area are more meaningful than touches in the final third. To expand on that last example, “Touches in final 3rd” and “Touches in penalty area” are already strongly correlated with each other. As a football watcher, I don’t see the need to use both, as I think the second is more decisive in winning a match.

Thinking this way, I identified some variables that could give a message about the match result.

I did not choose variables with a correlation below 0.30. Admittedly, even 0.30 can be seen as low. My goal is not to pick one feature and explain everything with it, but to look at and interpret a few metrics together with possession. Let’s start with our first match.
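The 0.30 filter described above could be sketched roughly as follows. This is a minimal, dependency-free illustration: the per-match values are invented, the metric names are placeholders, and comparing the absolute value of the correlation to the threshold is my assumption.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-match values, not real data.
possession = [35, 42, 50, 58, 65, 77]
metrics = {
    "key_passes": [4, 6, 7, 9, 11, 12],
    "fouls": [12, 9, 14, 10, 13, 11],
}

THRESHOLD = 0.30  # assumption: applied to the absolute correlation

correlations = {name: pearson(possession, values) for name, values in metrics.items()}
selected = [name for name, r in correlations.items() if abs(r) >= THRESHOLD]
```

In this toy data only the key-pass column clears the threshold; the foul column is close to uncorrelated with possession and is dropped.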

What does the data from the match played between Everton and Chelsea on May 1, 2022 tell us?

In terms of possession, it’s one of the most extreme games: Chelsea had 77% of the ball. The xG values are almost equal, but the Blues were on the losing side. As you will notice, Chelsea were actually better than their opponent in many metrics. They made more passes that led to a shot and completed more passes into the 18-yard box. Shot-creating actions also favor Chelsea, and the away side had more touches in the opponent’s penalty area. Also, more shots, more ball recoveries, more progressive passes…

More possession comes with more responsibility. The white line in each facet represents a Generalized Additive Model (GAM) fit. Each dot is a match played in the EPL, Ligue 1, La Liga or Serie A.

As the number of actions increases, many metrics naturally increase or decrease. We already talked about that. So, comparing a metric directly sometimes doesn’t make much sense. For example, despite Chelsea making more key passes, this figure is less than expected for a team playing with 77% possession. This is true of many statistics I have just mentioned. For Tuchel’s team, this is very clear. On the other hand, Everton are above the line in many areas.
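To make “below the line” concrete, here is a small sketch of a residual: the observed value minus the value the trend line expects at that possession level. I swap in an ordinary least-squares straight line as a stand-in for the GAM in the plots, and all numbers are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residual(x, y, intercept, slope):
    """Observed value minus the value expected at possession level x."""
    return y - (intercept + slope * x)


# Hypothetical reference matches: possession (%) and key passes.
possession = [30, 40, 50, 60, 70]
key_passes = [5, 7, 9, 11, 13]
intercept, slope = fit_line(possession, key_passes)

# Hypothetical match: the dominant side has more key passes in raw terms...
high_possession_side = residual(77, 12, intercept, slope)
low_possession_side = residual(23, 5, intercept, slope)
# ...yet sits below the line, while the other side sits above it.
```

The side with more raw key passes ends up with a negative residual, and the side with fewer ends up with a positive one.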

Sometimes less is more.

This will not always be the case. Sometimes teams can take advantage of possession. The Liverpool vs Newcastle match played on 16 December 2021 is an example of this situation.

Newcastle United falling below expectation in many metrics plays a part here, as do Liverpool’s strong numbers.

Of course, sometimes it is not possible to explain everything with numbers. In other words, since we do not live in such a world yet, sometimes saying “luck” is the most human thing to do.

It must be sad for Napoli.

We’ve come this far with human intelligence. Now it’s time to ask the machine a few things. Does data win games? Is there a message that we don’t see but that exists in the big picture?

I will build a classification model. This model will tell us the probability of a team winning the match. As predictors, I will use residuals generated by the relationship between the metrics and possession, rather than raw data. Remember the Everton vs Chelsea game. Despite Chelsea’s high number of key passes, they were below the expectation, so their residual was negative.

I use the other four of the top five European leagues, all except the Bundesliga, as the training set. I then compute the Bundesliga residuals against the relationships fitted on those leagues.

I’ll try two algorithms. They are Logistic Regression and Random Forest. And then I will compare the results with the AUC.
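For readers unfamiliar with AUC, it can be read as the probability that a randomly chosen winning row receives a higher predicted probability than a randomly chosen non-winning row. The sketch below computes it by brute force for two sets of predictions. It is a dependency-free illustration with made-up numbers; `model_a` and `model_b` are placeholders, not the actual outputs of the two algorithms.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical outcomes (1 = win) and predicted win probabilities.
won = [1, 0, 1, 0, 1, 0]
model_a = [0.9, 0.2, 0.7, 0.4, 0.6, 0.3]
model_b = [0.8, 0.6, 0.5, 0.4, 0.7, 0.3]

auc_a = auc(won, model_a)
auc_b = auc(won, model_b)
better = "model_a" if auc_a > auc_b else "model_b"
```

A value of 0.5 means the ranking is no better than a coin flip, and 1.0 means every winner is ranked above every non-winner.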

I selected the variables for Logistic Regression using stepwise selection. They are as follows: SH_STANDARD, NPXG_EXPECTED, CMP_PERCENT_TOTAL, KP, PPA, PROG, CRS_PASS_TYPES, CK_PASS_TYPES, SCA_SCA_TYPES, DEF_3RD_TACKLES, PRESS_PRESSURES, DEF_3RD_PRESSURES, MID_3RD_PRESSURES, SH_BLOCKS, ATT_PEN_TOUCHES, CPA_CARRIES.

The list is largely the same as what I chose by observation. “xA”, “Recovery” and “Successful Pressure %” are no longer there. In their place are “Cross”, “Corner”, “Def 3rd Tackles” and “Mid 3rd Pressures”.

I followed the same process for Random Forest: I selected some features with Boruta and then built the model. Unlike Logistic Regression, “xA”, “PROG_RECEIVING” and “Recovery” appeared among the important variables. However, with the exception of “DEF_3RD_PRESSURES”, the other pressure- and tackle-related stats were not significant.

So, I am going with Logistic Regression.

A total of 306 matches were played in the Bundesliga in the 2021–22 season. Since each match contributes one row per team, we have a total of 612 rows. If the model predicts a win probability above 50% for a team, let’s count that as an expected win and compare it with the actual results.
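The 50% rule and the accuracy check can be sketched as follows. The probabilities and outcomes are invented for illustration, and the function name is hypothetical.

```python
def accuracy_at_threshold(win_probs, actual_wins, threshold=0.5):
    """Share of rows where 'probability above threshold' matches the actual outcome."""
    predicted = [p > threshold for p in win_probs]
    hits = sum(pred == bool(actual) for pred, actual in zip(predicted, actual_wins))
    return hits / len(win_probs)


# One row per team per match, so 306 matches give 612 rows.
rows = 306 * 2

# Hypothetical predicted win probabilities and outcomes (1 = win, 0 = draw or loss).
win_probs = [0.77, 0.23, 0.55, 0.45, 0.30, 0.70]
actual_wins = [1, 0, 0, 1, 0, 1]

accuracy = accuracy_at_threshold(win_probs, actual_wins)
```

Note that a draw counts as “no win” for both teams under this setup, so a correct prediction for a drawn match requires both rows to fall at or below 50%.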

The model’s accuracy is 78.2%.

To be honest I was expecting higher. Let’s look at some results without getting pessimistic.

We see the relationship of all 16 features with possession, arranged in order of the model’s variable importance. In the first six metrics, Dortmund’s residual is on the negative side, despite their raw numbers being higher than their opponent’s. Indeed, the model estimated Freiburg’s win probability at 77%. Of course it’s just one game. But am I wrong somewhere?

There is one point that is very important.

There is a positive situation for Köln in many indicators. They won the match, but according to the model, the score isn’t all that fair. I think it is very difficult to understand this with the human mind. Let’s take a look at the “break-down” plot that I created with the ModelStudio package and realize something.

The first result, which I found very interesting, is that Köln made many more crosses than expected, which the model treats as a negative. In addition, the number of shots, although above the expected line, was not enough. Bochum’s positive signals in some metrics are also one of the main reasons the win probability drops.

In fact, there is a 0.68 correlation between the score difference and the model’s estimate. However, when we build the model with raw data instead of residuals, we get similar results: the AUC and Brier scores are almost the same, and the model’s accuracy is 79%.
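As a reminder of what the Brier score measures, it is the mean squared difference between the predicted probability and the actual outcome, so lower is better. A minimal sketch with invented numbers:

```python
def brier_score(win_probs, actual_wins):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(win_probs, actual_wins)) / len(win_probs)


# Hypothetical predictions and outcomes (1 = win).
win_probs = [0.8, 0.3, 0.6]
actual_wins = [1, 0, 0]

score = brier_score(win_probs, actual_wins)
```

Unlike accuracy, this penalizes a confident wrong prediction more heavily than a hesitant one, which is why two models with similar accuracy are also worth comparing on this score.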

So, in short, I was thinking that maybe I would get more successful results with residuals. But it wasn’t what I expected. Failed attempts are ideal for learning. Maybe there is a situation that I haven’t seen and you readers have discovered it.

Maybe it would be more logical to make an adjustment directly via touches instead of possession.

In the end, possession remains a mysterious statistic.

Thank you very much for taking your time to read my post :)
