Understanding Football Through Data
From the first event log to a tactical report a head coach can act on — a practical, end-to-end guide to how modern football is measured, modelled and explained.
Why Football Analytics Matters
Football is a low-scoring, fluid, invasion game played by twenty-two people making thousands of decisions over ninety chaotic minutes. That combination makes it both the hardest team sport to quantify and the one where good measurement pays the largest dividend.
For most of the sport's history, the scoreboard was the only number that mattered, and it lied constantly. A team could dominate, hit the post three times, concede a deflected goal and "lose 1–0" — a result the raw scoreline records as a deserved defeat. Analytics exists to close the gap between what happened and what was likely to happen, so that clubs make decisions on the underlying process rather than on noisy, small-sample outcomes.
The reason this matters so much in football specifically is the goal. Goals are rare — roughly 2.7 per match across Europe's top leagues — which means a single fortunate or unfortunate event can swing a result, and a season is short enough (typically 34–38 league games) that luck does not fully wash out. A striker who scores 15 from 12 expected goals looks elite; the data tells you to expect regression. A pressing system that concedes few shots but loses on the day is, over time, very likely sound. Analytics is the discipline of separating signal (repeatable skill and tactics) from noise (variance you cannot bank on).
From Moneyball to the modern recruitment department
The cultural reference point is baseball's Moneyball, but football's analytics revolution arrived later and looks different. Baseball is a sequence of discrete one-on-one events that decompose cleanly into individual statistics. Football is continuous and deeply interdependent: a full-back's overlap only "works" because of where the winger and the opposition's defensive line are standing. The field had to invent new tools — expected goals, possession-value models, tracking-data physics — precisely because borrowed baseball thinking did not transfer. Today, every elite club runs a data department, recruitment is routinely informed by models, and opposition analysis is a blend of video and numbers.
Who this handbook is for
Aspiring analysts
You will get the conceptual scaffolding and the practical Python/visualization workflow used in professional environments — enough to build a portfolio that gets you hired.
Scouts & recruiters
You will learn how metrics translate into player profiles, why context (league, role, minutes) is everything, and how to combine the eye test with the spreadsheet.
Coaches
You will see how raw feeds become a tactical report you can act on — opponent pressing triggers, set-piece tendencies, where to attack and where you are exposed.
The curious fan
You will never watch a match the same way again. xG, PPDA and pitch control will stop being broadcast jargon and start being lenses.
The promise & the limit
Data does not replace the coach's eye, the scout's instinct or the manager's feel for a dressing room. It disciplines them — it tells you when your intuition is fighting the evidence, sizes the sample you are reasoning from, and turns "I think" into "here is how strongly, and how confident we should be." The best practitioners are bilingual: fluent in football and in data, and humble about both.
How to read this handbook
The four parts build on each other. Part 1 establishes what data is and how it is collected. Part 2 covers the advanced metrics and models that turn data into meaning. Part 3 is the hands-on toolkit — workflow, Python, visualization and video. Part 4 ties it together with realistic, worked case studies, ending in a full tactical report. You can read linearly or jump via the sidebar; cross-references are linked throughout.
1.1 The Role of a Football Data Analyst
"Football data analyst" is not one job. It is a family of roles that share a skill set but point at very different decisions — from the recruitment meeting to the training-ground whiteboard to the betting trading desk.
At its core the role is translation. On one side sits messy, high-volume data; on the other sit football people — coaches, sporting directors, scouts — who need a clear answer to a concrete question. The analyst turns "we have 38 matches of event data" into "their right-back steps out to press and leaves space in behind; we should target it with early diagonals." The value is never the chart. It is the decision the chart changes.
The main flavours of the role
| Role | Primary question | Typical employer |
|---|---|---|
| Recruitment / scouting analyst | Who should we sign, and what is fair value? | Clubs, agencies, data firms |
| Opposition analyst | How does the next opponent play, and how do we beat them? | First-team staff |
| Performance / post-match analyst | Did our process work, regardless of the result? | First-team staff |
| Set-piece analyst | Where are the marginal goals in dead-ball situations? | Clubs (a fast-growing specialism) |
| Data scientist / engineer | How do we build the models and pipelines everyone else uses? | Clubs, data vendors, betting |
| Trading / quant analyst | Is the market price wrong, and by how much? | Betting syndicates, bookmakers |
What the job actually demands
Three competencies sit underneath all of these, and the best analysts are strong in each: football understanding (you must know why a stat means something tactically), technical skill (querying, cleaning and modelling data — typically in Python, SQL and a BI/visualization tool), and communication (a brilliant model that nobody acts on is worthless). A common beginner mistake is to over-index on the technical axis. In a club, a clear one-slide answer beats an elegant notebook every time.
Mindset
Think of yourself as a decision-support function, not a stats provider. Before any piece of work, ask: "Whose decision am I trying to improve, and what would change their mind?" If you cannot answer that, you are not ready to open the data.
1.2 Data Literacy and a Scientific Approach to Data
Data literacy is not about memorising metrics. It is the habit of asking where a number came from, what it can and cannot say, and how confident you should be — the scientific method applied to a football pitch.
The analytical loop
Question
Start with a football question, not a dataset. "Are we creating enough high-quality chances from open play?" is researchable. "Let's look at the data" is not.
Hypothesis
State what you expect and why. "I think our low chance volume is a build-up problem, not a finishing problem." A hypothesis you can be wrong about is the engine of good analysis.
Evidence
Gather the right data at the right grain. Match the metric to the question and be honest about sample size.
Test
Compare against a baseline — league average, the opponent, your own past form. A number without a reference point is meaningless.
Communicate
Deliver the answer in the language of the decision-maker, with the uncertainty attached.
Three habits that separate professionals from dashboards
Always ask about sample size. Football's low event rate means most single-match numbers are noise. One game of xG tells you almost nothing; ten games start to. Per-90 rates from 200 minutes of play are dangerous. The professional instinct is to distrust extreme values from small samples and to favour rolling windows.
Correlation, causation and confounders. Teams that press high tend to win more — but elite teams both press high and have better players, so the press is partly a marker of quality, not only a cause of it. A scientific analyst names the confounder before drawing a conclusion.
Distributions over averages. "Average shot distance: 18m" hides whether a team takes many close-range chances and a few hopeful long-rangers, or a steady stream of 18m efforts. The shape of the data usually carries the tactical story; the mean often erases it.
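The point about averages hiding shape can be sketched in a few lines of plain Python. The two shot-distance lists below are invented for illustration (they are not real match data), and `close_range_share` with its 12 m threshold is a hypothetical helper, not a standard metric.

```python
from statistics import mean

# Hypothetical shot distances in metres for two teams (illustrative, not real data).
team_a = [8, 9, 10, 11, 28, 30, 30]    # close-range chances plus hopeful long-rangers
team_b = [17, 18, 18, 18, 18, 18, 19]  # a steady stream of mid-range efforts


def close_range_share(distances, threshold=12):
    """Fraction of shots taken from inside `threshold` metres (threshold is arbitrary)."""
    return sum(d < threshold for d in distances) / len(distances)


for name, shots in [("Team A", team_a), ("Team B", team_b)]:
    print(name, "mean distance:", mean(shots),
          "close-range share:", round(close_range_share(shots), 2))
```

Both teams report the same mean distance, yet only one of them gets shots away from close range. A histogram or shot map would show the difference at a glance.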
Common analytical traps
Survivorship bias — judging a recruitment model only on the players you signed. Outcome bias — calling a good process bad because it lost on the day. Cherry-picking the window — choosing the date range that flatters your point. Metric overfitting — inventing a stat that "explains" last season but predicts nothing about the next.
1.3 Data Types and Data Collection Systems
Everything downstream — every metric, model and report — is constrained by the data underneath it. Knowing the four major data types, and their blind spots, is the single most useful piece of literacy in the field.
The four layers of football data
1 · Box-score data
The traditional match summary: goals, shots, possession %, passes, cards. Cheap, universal, and almost free of context. Useful for a first glance, misleading on its own.
2 · Event data
Every on-the-ball action, time-stamped and located on the pitch: passes, shots, tackles, carries, with x, y coordinates and rich attributes. 3,400+ events per match. The workhorse of modern analysis.
3 · Tracking data
The position of all 22 players (and the ball) sampled ~25 times per second. Captures the 98% of the game that happens off the ball — shape, space, runs, pressing distances.
4 · Physical / GPS data
Distance covered, sprints, accelerations, high-speed running — from wearables in training and, via computer vision, from match broadcast. The fuel for load management and fitness.
Event data, in detail
Event data is what most analysts spend their day in. A provider's operators (or, increasingly, machine-vision systems with human verification) tag every on-the-ball action. A single shot event might carry: the player, the team, the minute and second, the x/y location, the body part, the play pattern (open play, corner, fast break), whether it was a "big chance," and — crucially for modelling — the position of every other player at the moment of the shot (a freeze-frame) in the richest feeds.
{ player: "A. Rossi", type: "Shot", x: 88.5, y: 41.2,
  body_part: "Right", play_pattern: "From Corner",
  outcome: "Saved", xg: 0.137, freeze_frame: [...] }
Its limitation is right there in the name: event data only records moments when the ball is touched. It is blind to the run a striker made that dragged a centre-back away, or the cover-shadow a midfielder cast to block a passing lane. For that, you need tracking.
Tracking data and the off-ball game
Tracking data turns football into geometry. With 22 player coordinates ten to twenty-five times a second you can compute distances between lines, the compactness of a block, how quickly a press collapses space, who is free, and which zones a team controls. It powers the most advanced models in the sport — pitch control, off-ball value, defensive line-height analysis. Two collection methods dominate:
- Optical / in-stadium — fixed multi-camera rigs track every player. Highest accuracy, but requires installation and is club-specific.
- Broadcast computer vision — companies such as SkillCorner extract positional and physical data straight from the TV feed, with no sensors required. This democratised tracking: you can now get running and speed numbers for opponents you will never install hardware for. The trade-off is occasional gaps (players off-camera) and slightly lower precision.
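As a rough illustration of "football as geometry", the sketch below measures block length, block width and the gap between two lines from a single tracking frame. The coordinates, the 105 x 68 m pitch and the grouping of players into lines are all assumptions made for the example. Real feeds deliver one such frame many times per second, and assigning players to lines is itself a modelling step.

```python
from statistics import mean

# One hypothetical tracking frame: outfield players as (x, y) in metres on an
# assumed 105 x 68 pitch, already grouped by line. All values are made up.
frame = {
    "defence":  [(30.0, 12.0), (29.0, 26.0), (29.5, 42.0), (31.0, 56.0)],
    "midfield": [(42.0, 20.0), (41.0, 34.0), (43.0, 48.0)],
    "attack":   [(55.0, 15.0), (57.0, 34.0), (54.0, 53.0)],
}


def line_height(players):
    """Average x-position of a line, i.e. its distance from the team's own goal line."""
    return mean(x for x, _ in players)


def block_dimensions(frame):
    """Length and width of the outfield block (bounding box of all players)."""
    xs = [x for line in frame.values() for x, _ in line]
    ys = [y for line in frame.values() for _, y in line]
    return max(xs) - min(xs), max(ys) - min(ys)


length, width = block_dimensions(frame)
gap_def_mid = line_height(frame["midfield"]) - line_height(frame["defence"])
print(f"block length {length:.1f} m, width {width:.1f} m, "
      f"defence-midfield gap {gap_def_mid:.1f} m")
```

Repeating the same calculation for every frame gives a time series of compactness, which is the raw material for the pressing and line-height analyses mentioned above.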
The provider landscape (2025–26)
The market consolidated sharply. Hudl acquired Wyscout, then InStat, and most recently StatsBomb, bundling video scouting, event data and player-location data. Opta (Stats Perform) remains the provider whose data powers much of the industry and broadcast. SkillCorner leads broadcast-derived tracking and physical data. IMPECT, known for its packing data, was acquired by Catapult in late 2025. For learning and portfolios, FBref (free, Opta/Stats Perform-powered) and StatsBomb's free Open Data are the standard entry points.
| Provider | Strength | Data type |
|---|---|---|
| Hudl StatsBomb | Deep event data + freeze-frames, analytics platform | Event + location |
| Opta / Stats Perform | Coverage, broadcast, established xG | Event |
| Hudl Wyscout | Largest video library, day-to-day scouting | Video + event |
| SkillCorner | Broadcast tracking & physical data | Tracking + physical |
| FBref (free) | Best free starting point for learners | Aggregated event |
Key takeaway
Match the data type to the question. Asking "how good was that chance?" → event data with xG. Asking "why was nobody marking him?" → tracking data. Asking "are our players cooked by minute 70?" → physical/GPS. Most mistakes in football analytics come from forcing one data type to answer a question it cannot see.
1.4 A Day in the Life of a Professional Football Data Analyst
The job is far less glamorous and far more rhythmic than outsiders imagine. It runs on the fixture calendar, and most of it is preparation for a forty-five-minute conversation that has to land.
Below is a representative match-week day for a club opposition analyst two days before a fixture (often called "MD-2," matchday minus two).
The previous night's data has landed. Verify the opponent's last match imported cleanly, spot-check obvious tagging errors, refresh the rolling-form database.
Pull the opponent's last 6–10 games. Build the picture: average formation and in-possession shape, pressing intensity (PPDA), where they progress the ball, set-piece routines, individual threats and weaknesses.
Tie the numbers to clips. A PPDA of 7 is abstract; three clips of their front three triggering the press makes it coachable. Tag 8–12 key moments the staff can show players.
Present findings to the head coach and assistants. This is the moment the week's work either lands or doesn't. Lead with the answer, hold detail in reserve.
Translate the plan into what players actually see: unit meetings, individual clips, a one-page graphic on the dressing-room wall. Less is more.
Between match-prep deadlines: a sporting-director query on a transfer target, maintaining the shortlist model, building tooling for next week.
On matchday the rhythm changes: live tagging or live data monitoring, half-time numbers to inform the coach's talk, and a rapid post-match summary. The day after is for the post-match deep-dive (see 4.3). The throughline is that data work in a club is never an end in itself; it is always scaffolding for a coaching decision, delivered on a deadline set by the next kick-off.
"The analyst's job is to make the coach's next decision a little less of a guess — and to do it before the bus leaves."
A working principle of club analysis departments
2.1 Introduction to Advanced Metrics
Traditional stats count what happened. Advanced metrics estimate what should have happened, and how much each action moved a team toward a goal. That shift — from counting to valuing — is the whole revolution.
A "shot" is a count. It treats a tap-in and a 35-yard hopeful effort as equal. xG replaces the count with a probability: this chance, given its location, angle, body part and pressure, would be scored a certain fraction of the time. Do that for every action — not just shots — and you get a family of models that value passing, carrying, pressing and positioning. Three properties make a good advanced metric:
- Predictive — it should describe future performance better than the raw outcome. Expected goals predict future goals better than past goals do.
- Stable — it should stabilise on a reasonable sample, so it reflects skill, not variance.
- Interpretable — a coach should be able to understand what it rewards. A black-box number nobody trusts gets ignored.
The mental model
Picture a "scoring probability" that exists at every moment of the game. Every action — a pass, a carry, a tackle — nudges that probability up or down. Possession-value models (xT, VAEP) try to measure that nudge for every action. xG is the special case measured at the moment of a shot. Hold this picture and the rest of Part 2 is just different ways of estimating it.
2.2 Shooting Metrics
Shooting metrics answer two separate questions that beginners constantly conflate: how good were the chances? (creation & concession) and how well were they taken? (finishing).
Expected Goals (xG)
xG assigns every shot a probability of becoming a goal, between 0 and 1, based on the characteristics of the chance. A penalty is worth about 0.76 xG; a tap-in might be 0.9; a speculative effort from distance, 0.03. Sum a team's xG across a match and you get a far better estimate of who "deserved" to win than the scoreline. The model is trained on hundreds of thousands of historical shots, learning the conversion rate for each combination of features.
xG(shot) = f( distance to goal, angle to goal, body part,
              under pressure?, play pattern, defenders in lane,
              goalkeeper position* ) → probability ∈ [0, 1]
* freeze-frame features available only in the richest feeds
Distance and angle do most of the work — a shot's value collapses quickly as you move away from goal and toward the byline. Richer models add context. StatsBomb's model uses freeze-frames (the position of the keeper and every defender at the moment of the shot) and shot-impact height; Opta's analyses up to ~20 contextual factors per shot but, like most models, has no freeze-frame. This is why the same shot can read 0.12 on one provider and 0.22 on another — there is no single "true" xG, only models with different inputs. Always know whose xG you are quoting.
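To make "distance and angle do most of the work" concrete, here is a minimal sketch that derives both features from a shot location and feeds them through a logistic function. It assumes StatsBomb-style pitch units (120 x 80, posts at y = 36 and y = 44). The function names are hypothetical and the coefficients are placeholders chosen for illustration, not fitted values from any provider's model.

```python
import math

# Assumed StatsBomb-style units: pitch 120 x 80, goal centre (120, 40), posts at y = 36 / 44.
GOAL_X, GOAL_Y = 120.0, 40.0
POST_LOW, POST_HIGH = 36.0, 44.0


def shot_features(x, y):
    """Distance to the goal centre and the angle subtended by the two posts (radians)."""
    distance = math.hypot(GOAL_X - x, GOAL_Y - y)
    angle = math.atan2(POST_HIGH - y, GOAL_X - x) - math.atan2(POST_LOW - y, GOAL_X - x)
    return distance, abs(angle)


def toy_xg(x, y, intercept=-1.0, w_distance=-0.11, w_angle=1.3):
    """Logistic function of distance and angle.

    The coefficients are hypothetical placeholders, NOT fitted to real shots.
    """
    distance, angle = shot_features(x, y)
    z = intercept + w_distance * distance + w_angle * angle
    return 1.0 / (1.0 + math.exp(-z))


examples = {"central, 6 out": (114, 40), "central, 12 out": (108, 40), "wide, 25 out": (100, 25)}
for label, (x, y) in examples.items():
    d, a = shot_features(x, y)
    print(f"{label}: distance={d:.1f}, angle={math.degrees(a):.0f} deg, toy xG={toy_xg(x, y):.2f}")
```

A production model would learn the weights from historical shots and add the contextual features listed above. The sketch only shows why value falls away as the shot moves further out and the visible goal mouth narrows.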
Reading a shot map
The shot map is the analyst's most recognisable visual. Each shot is a dot, placed where it was taken, sized by its xG and coloured by outcome. At a glance you can read a team's chance quality and profile: a cluster of big dots in the box signals a side that works the ball into high-value areas; a scatter of small dots from distance signals a team that settles for low-probability efforts.
The finishing layer: xG vs goals, and post-shot xG
Compare a player's goals to their xG and you get a finishing signal. Persistently outscoring xG by a wide margin can indicate elite finishing — but over small samples it is usually variance, and the honest default is to expect regression toward the model. To isolate finishing more cleanly, analysts use post-shot xG (PSxG), which is computed only for shots on target and adds where in the goal the ball was heading. PSxG minus xG measures shot-placement skill; for goalkeepers, PSxG minus goals conceded is the standard shot-stopping metric.
xG
Chance quality at the moment of the shot. Built from all shots (on and off target). Measures creation & concession.
Post-shot xG (PSxG)
Only on-target shots, adding shot placement. Isolates finishing and, for keepers, shot-stopping.
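The two PSxG comparisons described above are simple subtractions. The season totals in this sketch are invented for illustration, and the dictionary keys are hypothetical names rather than any provider's column names.

```python
# Hypothetical season totals (illustrative numbers only).
striker = {"goals": 14, "xg": 11.5, "psxg": 13.0}
keeper = {"psxg_faced": 38.2, "goals_conceded": 33}

# Shot placement: value added between the chance arising and the ball heading for goal.
placement_added = striker["psxg"] - striker["xg"]

# Shot-stopping: goals prevented relative to post-shot expectation (positive = better than expected).
goals_prevented = keeper["psxg_faced"] - keeper["goals_conceded"]

print(f"placement added: {placement_added:+.1f}, goals prevented: {goals_prevented:+.1f}")
```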
The cardinal sins of xG
Don't read a single match's xG as destiny — it is one noisy draw. Don't compare xG across providers as if they were the same number. Don't treat a striker beating xG over 10 games as proven elite finishing. And never forget xG values the chance, not the build-up that created it — for that, you need chance-creation and possession-value metrics.
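How often would a perfectly average finisher beat the model by luck alone? The sketch below answers that with a binomial calculation. It assumes, purely for simplicity, that 12 xG arrives as 100 independent shots worth 0.12 each; real shot values vary widely, so treat the result as an order-of-magnitude guide.

```python
from math import comb


def prob_at_least(goals, shots, xg_per_shot):
    """P(at least `goals` goals) if each of `shots` shots scores independently with prob xg_per_shot."""
    return sum(
        comb(shots, k) * xg_per_shot ** k * (1 - xg_per_shot) ** (shots - k)
        for k in range(goals, shots + 1)
    )


# Simplifying assumption: 12 xG made up of 100 identical shots of 0.12 xG each.
p = prob_at_least(goals=15, shots=100, xg_per_shot=0.12)
print(f"Chance an average finisher scores 15+ from 12 xG: {p:.0%}")
```

Under these assumptions the probability comes out at roughly one in five, which is why a single season of overperformance is weak evidence of elite finishing.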
2.3 Chance Creation Metrics
Goals are the tip; chance creation is the iceberg. These metrics value the passing, carrying and progression that manufacture shots — and, in their most advanced form, value any action that increases the threat of a possession.
Expected Assists (xA) and beyond
Expected Assists (xA) takes every completed pass that led to a shot and assigns it the xG of the shot it created. It rewards the quality of chance created, not whether the teammate finished — a perfect cut-back that the striker skies still earns its xA. It is to creators what xG is to finishers. Its limitation: it only credits the final pass before a shot. The line-breaking pass two phases earlier, the carry that drew defenders — these get nothing. That gap is what possession-value models fill.
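The definition above translates directly into an aggregation: credit each shot's xG to the player whose pass created it. This is a minimal sketch on invented data. The field names (`assisted_by`, `xg`, `goal`) are hypothetical and do not follow a particular provider's schema.

```python
from collections import defaultdict

# Hypothetical shots: each carries its xG and the player whose pass created it (None if unassisted).
shots = [
    {"shooter": "Striker A", "xg": 0.45, "assisted_by": "Winger B", "goal": False},
    {"shooter": "Striker A", "xg": 0.08, "assisted_by": "Midfielder C", "goal": True},
    {"shooter": "Winger B", "xg": 0.30, "assisted_by": None, "goal": False},
    {"shooter": "Midfielder C", "xg": 0.12, "assisted_by": "Winger B", "goal": False},
]

xa = defaultdict(float)
assists = defaultdict(int)
for shot in shots:
    passer = shot["assisted_by"]
    if passer is None:
        continue  # unassisted shots generate no xA
    xa[passer] += shot["xg"]              # credit the pass with the value of the chance
    assists[passer] += int(shot["goal"])  # actual assists depend on the finish

for player in xa:
    print(f"{player}: xA={xa[player]:.2f}, assists={assists[player]}")
```

In this toy data the winger creates the better chances but records no assists, which is exactly the gap between creation and outcome that xA is meant to expose.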
Possession value: xT and VAEP
This is the conceptual heart of modern analysis. Both models answer: how much did this action increase my team's chance of scoring?
Expected Threat (xT)
Introduced by analyst Karun Singh in 2018, xT lays a value surface over the pitch by dividing it into a grid (commonly 16×12 zones). Each zone gets a value: how likely a goal is to result from possession there. Moving the ball from a low-value zone to a high-value one earns the player the difference in zone value. A carry from the centre circle into the box is highly rewarded; a sideways pass between two low-value zones earns almost nothing.
xT(zone) = P(shoot)·P(goal | shoot)
         + P(move)·Σₖ P(move → zoneₖ)·xT(zoneₖ)
value of a zone = shoot now, or move the ball and inherit where you move it to (solved iteratively over the grid)
xT is intuitive and easy to communicate, which is why it is the most widely used possession-value model in the industry. Its weaknesses: it values only ball-progression actions (passes and carries), ignores defensive actions and shots, and — because it is usually built on event data alone — it cannot see whether the player was under pressure or whether the pass was risky.
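The "shoot now, or move and inherit" idea can be solved by simple iteration. The sketch below does so on a toy pitch of four zones in a line rather than a full grid. Every probability is invented, and treating a failed move as worth zero is a simplifying assumption of this example, not a statement about any published implementation.

```python
# Toy pitch: 4 zones from own half (0) to the penalty box (3). All probabilities are made up.
p_shoot = [0.00, 0.02, 0.10, 0.40]  # P(shoot | ball in zone)
p_goal = [0.00, 0.02, 0.06, 0.25]   # P(goal | shot from zone)
# move_to[i][k]: P(ball ends in zone k | move attempted from zone i).
# Rows sum to less than 1: the remainder is a failed move, valued at zero here.
move_to = [
    [0.50, 0.35, 0.05, 0.00],
    [0.20, 0.40, 0.25, 0.05],
    [0.05, 0.25, 0.35, 0.20],
    [0.00, 0.10, 0.30, 0.30],
]


def solve_xt(p_shoot, p_goal, move_to, iterations=200):
    """Iterate xT(zone) = shoot value + move value until the surface settles."""
    n = len(p_shoot)
    xt = [0.0] * n
    for _ in range(iterations):
        xt = [
            p_shoot[i] * p_goal[i]
            + (1 - p_shoot[i]) * sum(move_to[i][k] * xt[k] for k in range(n))
            for i in range(n)
        ]
    return xt


xt = solve_xt(p_shoot, p_goal, move_to)
pass_value = xt[3] - xt[1]  # credit for moving the ball from zone 1 into the box
print([round(v, 3) for v in xt], "pass 1 -> 3 earns", round(pass_value, 3))
```

The player's credit for an action is simply the difference between the destination and origin zone values, which is what makes the model easy to explain to coaches.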
VAEP — Valuing Actions by Estimating Probabilities
VAEP goes further. It values a broad set of actions (including defensive ones) by estimating how each changes two probabilities: the chance your team scores in the next few actions, and the chance your team concedes. An action's value is the gain in scoring probability minus the rise in conceding probability. Crucially, VAEP includes game context — score difference, time remaining, field position — and frames valuation as a machine-learning classification problem. It is implemented in the open-source socceraction library, which first converts any provider's event feed into a common format called SPADL.
| | xT | VAEP |
|---|---|---|
| Values | Ball progression (pass, carry) | Almost all actions, incl. defending |
| Risk / turnovers | Largely ignores | Models conceding probability |
| Game context | No | Yes (score, time, position) |
| Strength | Simple, interpretable, fast | Comprehensive, context-aware |
| Best for | Quick progression profiling | Holistic player valuation |
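The VAEP valuation rule (gain in scoring probability minus rise in conceding probability) reduces to a small function once the probabilities exist. In this sketch the four probabilities are invented inputs standing in for the outputs of trained classifiers, and `vaep_value` is a hypothetical helper, not the socceraction API.

```python
def vaep_value(p_score_before, p_score_after, p_concede_before, p_concede_after):
    """Action value = change in scoring probability minus change in conceding probability."""
    offensive = p_score_after - p_score_before
    defensive = -(p_concede_after - p_concede_before)
    return offensive + defensive


# Hypothetical probabilities, standing in for what trained models would estimate.
through_ball = vaep_value(0.02, 0.09, 0.010, 0.012)  # big attacking gain, small added risk
risky_backpass = vaep_value(0.02, 0.015, 0.010, 0.040)  # little gained, real danger created

print(f"through ball: {through_ball:+.3f}, risky back-pass: {risky_backpass:+.3f}")
```

The hard part in practice is estimating the probabilities, which is where the game-context features and the classification models come in.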
How to choose
Reach for xT when you want a quick, explainable read on who progresses the ball and how — ideal for a coach-facing chart. Reach for VAEP when you want a single, context-aware value for a player's total contribution, including defending — ideal for recruitment ranking. Many departments run both.
2.4 Pressing & Defensive Metrics
Defending is the hardest part of the game to measure, because the best defensive action is often the one that never has to happen. The metrics here are proxies — useful, but to be read with care.
PPDA — Passes Per Defensive Action
PPDA is the most widely used single-number proxy for pressing intensity. It counts how many passes the opponent is allowed to complete before the pressing team makes a defensive action (tackle, interception, challenge or foul), measured in the attacking ~60% of the pitch (the opponent's defensive three-fifths).
PPDA = (opponent passes in the pressing zone) / (your defensive actions in that zone)
lower = more aggressive pressing · higher = sitting deeper
A PPDA around 7–8 indicates suffocating, aggressive pressing — the team intervenes after only a handful of opponent passes. A PPDA of 15+ signals a side content to let the opponent have the ball and defend in a mid- or low-block. Neither is "better"; they are tactical choices.
Why PPDA misleads if you're not careful
A team protecting a two-goal lead will post a high PPDA — not because it can't press but because it has chosen not to. PPDA also doesn't capture press effectiveness (did the press win the ball or just get played through?). Treat any single-match PPDA with suspicion; a rolling average over 6–10 matches is a far more stable signal of pressing identity.
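A minimal sketch of the calculation and of the rolling window recommended above, in plain Python. The match counts are invented, and both `ppda` and `rolling_mean` are hypothetical helpers. The counts are assumed to be already restricted to the pressing zone.

```python
def ppda(opponent_passes, defensive_actions):
    """Opponent passes allowed per defensive action, both counted in the pressing zone."""
    if defensive_actions == 0:
        return float("inf")  # no defensive actions at all: no measurable press
    return opponent_passes / defensive_actions


def rolling_mean(values, window):
    """Mean of the trailing `window` values at each position (None until the window is full)."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window: i + 1]
            out.append(sum(chunk) / window)
    return out


# Hypothetical (opponent passes, defensive actions) per match, already filtered to the zone.
matches = [(210, 28), (180, 12), (240, 30), (195, 26), (260, 20), (200, 27), (220, 29), (230, 25)]
per_match = [ppda(p, d) for p, d in matches]
rolling_6 = rolling_mean(per_match, window=6)

for single, rolled in zip(per_match, rolling_6):
    print(f"match PPDA {single:5.1f}   6-match rolling {'-' if rolled is None else round(rolled, 1)}")
```

The single-match values swing widely with game state, while the rolling series moves slowly enough to read as a pressing identity.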
Other defensive lenses
Defensive actions by zone & height of the defensive line. Where a team makes its tackles and interceptions tells you its block height. A high cluster = aggressive front-foot defending; a deep cluster = a low block. From tracking data you can measure the literal height of the defensive line and how it moves.
Packing (IMPECT). Packing counts how many opponents a pass or dribble takes out of the game by playing past them. Read defensively, the same count shows how often a team's players are bypassed, which gives a way to value the act of not being eliminated. It reframes both passing and defending around the number of players removed from the play.
Expected goals against / prevented (PSxG–GA). The flip side of creation: how many high-quality chances a defence concedes (xGA), and how a keeper performs against them.
Tracking-era defensive value. The frontier — pitch control and off-ball models — credits defenders for the space they deny and the passing lanes they shadow, finally putting a number on the "action that never had to happen."
2.5 Player Functions and Clustering
A "midfielder" can be a destroyer, a metronome, a box-crasher or a deep-lying playmaker. Positional labels are too blunt for recruitment. Clustering lets the data define roles from how players actually play.
From positions to data-driven roles
The traditional position (CB, FB, DM, CM, AM, W, ST) tells you where a player stands, not what they do. Two "centre-backs" might be a ball-playing libero and a no-nonsense stopper — opposite profiles sharing a label. Clustering solves this by representing each player as a vector of per-90, style-based metrics (progressive passes, carries, pressures, aerials, shot volume, defensive actions, xT generated…) and letting an algorithm group players whose statistical fingerprints are similar, regardless of nominal position.
The clustering workflow
Select features
Choose metrics that describe style, not just quality — you want to group a good and an average deep-lying playmaker together, then judge quality separately.
Normalise & control for position
Scale features (e.g. z-scores) so no single metric dominates. It often helps to cluster within position groups, or to adjust the metrics for possession first.
Reduce dimensions
Use PCA or similar to compress dozens of correlated metrics into a handful of meaningful style axes and to visualise.
Cluster
Apply k-means (or hierarchical / Gaussian mixtures). Use the elbow method or silhouette score to choose the number of clusters K.
Interpret & name
The hard, football part: look at each cluster's average profile and give it a human label — "ball-progressing 6," "wide creator," "pressing forward."
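The scale-then-cluster core of the workflow above can be sketched without any third-party libraries. The six player profiles are invented and the k-means here is a bare-bones Lloyd's algorithm with naive seeding. It skips the PCA step and the elbow/silhouette choice of K, and a real project would more likely use scikit-learn's implementations.

```python
from statistics import mean, pstdev

# Hypothetical per-90 style profiles: (progressive passes, defensive actions). Made-up numbers.
players = {
    "Player A": (9.1, 3.0),
    "Player B": (8.4, 3.6),
    "Player C": (8.8, 2.7),
    "Player D": (3.2, 9.5),
    "Player E": (2.9, 10.2),
    "Player F": (3.8, 8.9),
}


def zscore_columns(rows):
    """Scale every feature column to mean 0 and standard deviation 1."""
    stats = [(mean(col), pstdev(col)) for col in zip(*rows)]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; centroids start at the first k points (a simplification)."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # assignment step: each point joins its nearest centroid
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(mean(dim) for dim in zip(*members))
    return labels


names = list(players)
scaled = zscore_columns(list(players.values()))
labels = kmeans(scaled, k=2)
for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

The algorithm only returns cluster numbers. Naming them ("ball-progressing 6" versus "destroyer", say) remains the football judgment described in the final step.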
Why scouts love this
Clustering powers player similarity — "find me cheaper players who play like our departing No. 10." It surfaces undervalued players whose role isn't reflected in their position or their goals/assists, and it lets a club recruit for fit with a system, not just for raw output. Player radars (the spider charts you've seen) are the per-player view of the same idea: a visual fingerprint of a player's statistical profile versus positional peers.
A caution: clustering groups players by style, not level. A Championship metronome and a Champions League metronome can land in the same cluster. Always pair the role label with a quality and level-of-competition adjustment before drawing a recruitment conclusion. See 4.4 for a worked player analysis, and 3.3 for building radars in Python.
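The similarity search described above ("players who play like our departing No. 10") is, at its simplest, a nearest-neighbour lookup on standardised style vectors. In this sketch the profiles are invented and assumed to be z-scored already, the four feature axes are hypothetical, and Euclidean distance is one reasonable choice among several.

```python
import math

# Hypothetical, already z-scored style vectors: (progression, creativity, pressing, aerial).
profiles = {
    "Departing No. 10": (1.2, 1.5, -0.3, -0.8),
    "Candidate A": (1.0, 1.3, -0.1, -0.9),
    "Candidate B": (-0.5, 0.2, 1.4, 0.6),
    "Candidate C": (1.4, 0.9, -0.6, -0.4),
}


def euclidean(a, b):
    """Straight-line distance between two style vectors (smaller = more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(target, profiles, top_n=2):
    """Rank every other player by stylistic distance to the target."""
    ref = profiles[target]
    others = [(name, euclidean(ref, vec)) for name, vec in profiles.items() if name != target]
    return sorted(others, key=lambda pair: pair[1])[:top_n]


ranking = most_similar("Departing No. 10", profiles)
for name, distance in ranking:
    print(f"{name}: distance {distance:.2f}")
```

As the caution above says, a short distance means a similar style, not a similar level. The shortlist still needs a quality and competition adjustment before it reaches a scout.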
2.6 How Data Shapes Technical and Scouting Decisions
Metrics are only worth the decisions they improve. Here is how the models above feed the three decisions clubs spend most money and emotion on: who to sign, how to play, and whether the process is working.
Recruitment
Modern recruitment is a funnel. Data widens the top of it — instead of watching the handful of players a scout happened to see, a club can filter every player in dozens of leagues against a statistical profile in minutes, then send scouts to watch the shortlist. Possession-value models rank contribution; clustering ensures stylistic fit; age and contract data flag value. The model never makes the final call — it decides who gets watched, which is where most of the leverage is. It also guards against bias: the eye over-weights the last spectacular game; the data remembers the whole season.
Tactics & opposition planning
Data shapes both how a team plays and how it prepares for opponents. A coach can see, with evidence, that the side concedes most of its xG from crosses and adjust the defensive scheme; that the press is bypassed by the opponent's goalkeeper's long kicking and plan a trap; that a winger's xT is highest cutting inside onto his stronger foot. The opponent report in 4.2 is this in action.
Performance review & player development
Post-match, data separates process from result (4.3): did we generate good chances and limit theirs, regardless of the 1–1? For individuals, tracking and event data flag development targets — a full-back whose final-third decision-making lags his physical output, a striker whose movement creates xG he isn't converting.
The governing principle
Data earns its place when it changes a decision a club would otherwise make worse. If a beautiful model wouldn't alter who you sign, how you set up, or what you tell a player, it is decoration. Start from the decision and work backward to the metric — never the other way around.
3.1 The Football Data Analyst Workflow
Behind every clean chart is a pipeline. Professional analysis is repeatable: the same steps, run reliably, every match week. Learn the workflow and the specific tools become interchangeable.
Acquire
Get the data — a provider API, a CSV export, FBref or StatsBomb open data for learning. Know the licence and the schema before anything else.
Clean & standardise
Real data is messy: missing values, inconsistent names, different coordinate systems per provider. Standardising pitch coordinates (e.g. to 0–100) and player IDs is unglamorous and essential.
Transform & model
Aggregate to the grain you need (per-90, per-possession), compute metrics, apply models (xG, xT, VAEP).
Visualise & explore
Pitch plots, shot maps, radars, scatters — both to find the story and to tell it.
Communicate
Package the answer for the decision-maker: a slide, a one-pager, a video with overlaid numbers.
Automate
Turn the one-off into a template. Next week's opponent report should be a parameter change, not a rebuild.
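The "clean & standardise" step is worth one concrete example, since coordinate mismatches are a common silent bug. The helper below rescales any provider's pitch to 0-100 on both axes. The function name and the `flip_y` switch are hypothetical, and whether a given feed needs its y-axis mirrored is something to confirm in that provider's documentation.

```python
def to_unit_pitch(x, y, length, width, flip_y=False):
    """Rescale provider coordinates to a 0-100 by 0-100 pitch.

    flip_y mirrors the y-axis for feeds whose origin sits on the opposite touchline
    (an assumption to check against the provider's spec before relying on it).
    """
    x_std = x * 100.0 / length
    y_std = y * 100.0 / width
    if flip_y:
        y_std = 100.0 - y_std
    return x_std, y_std


# A location on a 120 x 80 pitch (StatsBomb-style dimensions), 12 units from goal, central.
print(to_unit_pitch(108, 40, length=120, width=80))
# The same helper with the y-axis mirrored.
print(to_unit_pitch(60, 20, length=120, width=80, flip_y=True))
```

Running every feed through one function like this at load time means the rest of the pipeline only ever sees a single coordinate system.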
The standard toolkit
| Layer | Tools | Why |
|---|---|---|
| Wrangling & modelling | Python (pandas, numpy, scikit-learn) | The industry default for analysis & ML |
| Football-specific | mplsoccer, socceraction, kloppy, statsbombpy | Pitches, possession value, feed parsing |
| Storage / querying | SQL, PostgreSQL, DuckDB | Season-scale data lives in databases |
| Reporting / BI | Tableau, Power BI, Excel | Stakeholder-facing dashboards |
| Video | Hudl / Wyscout / Sportscode | Tagging & clip delivery to coaches |
Where to start, free
Install Python (via Anaconda), then pip install statsbombpy mplsoccer socceraction. Pull StatsBomb Open Data (free event data including freeze-frames for selected competitions) and you can reproduce most of this handbook on real matches without spending a penny. FBref covers aggregated stats for nearly every league.
3.2 Python Basics for Football
You don't need to be a software engineer. You need to load a table, filter it, group it and compute a metric — 80% of football analysis is exactly that, done well.
The workhorse is pandas, whose core object is the DataFrame: a spreadsheet you control with code. Below, we load a single match's event data with statsbombpy and compute each team's total xG, the single most common starting task in the field.
from statsbombpy import sb
import pandas as pd
# 1 — load events for a single match (StatsBomb free open data)
events = sb.events(match_id=3795506)
# 2 — keep only shots, with the columns we care about
shots = events[events["type"] == "Shot"][
    ["team", "player", "minute", "shot_statsbomb_xg", "shot_outcome", "location"]
].copy()
# 3 — total xG and goals by team (the core "who deserved it" table)
summary = (
    shots
    .assign(goal=lambda d: (d["shot_outcome"] == "Goal").astype(int))
    .groupby("team")
    .agg(xG=("shot_statsbomb_xg", "sum"),
         goals=("goal", "sum"),
         shots=("shot_statsbomb_xg", "size"))
    .round(2)
)
print(summary)
# Illustrative output (team names and values depend on the match you load):
#            xG  goals  shots
# team
# Arsenal  1.84      2     14
# Chelsea  0.97      1      9
Three steps — filter, derive, group — answer "who created the better chances?" in a dozen lines.
The same three-verb pattern (filter → derive → group) scales to almost everything. Want each player's shots and average shot distance? Group by player instead of team. Want per-90 rates? Join a minutes table and divide. Below, a small helper turns StatsBomb's [x, y] location list into a shot distance — the kind of feature engineering that feeds an xG model.
import numpy as np
GOAL = np.array([120, 40]) # StatsBomb pitch: 120 x 80, goal at x=120
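def shot_distance(loc):
    """Distance from a StatsBomb [x, y] location to the centre of the goal."""
    if not isinstance(loc, (list, tuple)):
        return np.nan  # missing location (None or NaN)
    x, y = loc
    return np.hypot(GOAL[0] - x, GOAL[1] - y)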
shots["distance"] = shots["location"].apply(shot_distance)
# average distance of shots that became goals vs. those that didn't
print(shots.groupby(shots["shot_outcome"] == "Goal")["distance"].mean().round(1))
# shot_outcome
# False 18.3 # missed/saved shots taken from further out
# True 11.2 # goals scored from closer in — distance matters
Feature engineering: deriving shot distance from raw coordinates. Notice the result already "rediscovers" why xG weights distance so heavily.
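The per-90 step mentioned above ("join a minutes table and divide") can be sketched in the same pandas style. This is a minimal example with made-up players and totals; the `MIN_MINUTES` cut-off is an assumption chosen for illustration, not a standard value.

```python
import pandas as pd

# Hypothetical per-player totals and minutes; in practice these come from
# grouping the shots table by player and from a lineup/minutes source.
totals = pd.DataFrame({
    "player": ["Striker A", "Striker B", "Sub C"],
    "xg": [4.5, 3.0, 0.6],
})
minutes = pd.DataFrame({
    "player": ["Striker A", "Striker B", "Sub C"],
    "minutes": [900, 450, 30],
})

MIN_MINUTES = 300  # assumed cut-off; choose one that suits your sample

per90 = (
    totals.merge(minutes, on="player", how="left")
    .assign(xg_per90=lambda d: d["xg"] / d["minutes"] * 90)
    .loc[lambda d: d["minutes"] >= MIN_MINUTES]   # drop tiny samples
    .reset_index(drop=True)
)
print(per90)
```

Without the minutes filter, "Sub C" would top the table on the strength of half an hour of football, which is exactly the kind of result a per-90 rate should not be allowed to produce.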
Beginner pitfalls
Coordinate systems differ by provider — StatsBomb is 120×80, Opta is 100×100, others vary; never mix them without converting. Per-90 needs minutes, not appearances. Mind the small sample — groupby will happily compute a per-90 from 30 minutes of football and report nonsense. Always carry a minutes/sample column and filter on it.
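As a rough sketch of what "converting" means, the helper below linearly rescales a point from a 100×100 grid to a 120×80 grid. Two things are assumptions to verify against your provider's documentation: that the vertical axis runs in opposite directions in the two systems (hence the flip), and that a plain linear rescale is accurate enough for your purpose. The function name `opta_to_statsbomb` is hypothetical; mplsoccer also provides a `Standardizer` class intended for this kind of conversion, which is probably the better choice in practice.

```python
def opta_to_statsbomb(x, y):
    """Linearly rescale a 100x100 point to a 120x80 pitch.

    Assumption: the y-axis is flipped between the two systems (origin at the
    bottom-left in one, top-left in the other). Check your provider's spec.
    """
    return x * 120 / 100, 80 - y * 80 / 100

# The centre spot should map to the centre spot
print(opta_to_statsbomb(50, 50))
```

A quick sanity check like the centre-spot test above is worth running on any conversion before trusting a shot map built from mixed sources.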
3.3 Data Visualization Basics for Football
In football, the pitch is the axis. A good football visual places data in its spatial context so a coach reads it in seconds. The mplsoccer library makes professional pitch plots straightforward.
Plotting a shot map on a pitch
from mplsoccer import Pitch
import matplotlib.pyplot as plt
pitch = Pitch(pitch_type="statsbomb", line_color="#c4d6c9", pitch_color="#f4f8f5")
fig, ax = pitch.draw(figsize=(8, 5.2))
team_shots = shots[shots["team"] == "Arsenal"]  # substitute a team from the match you loaded
goals = team_shots[team_shots["shot_outcome"] == "Goal"]
# every shot: marker size scaled by xG, semi-transparent so overlaps stay visible
pitch.scatter(team_shots["location"].str[0], team_shots["location"].str[1],
              s=team_shots["shot_statsbomb_xg"] * 900 + 40,
              c="#b4532a", alpha=0.55, edgecolors="white", ax=ax, label="Shots")
# goals highlighted in green on top
pitch.scatter(goals["location"].str[0], goals["location"].str[1],
              s=goals["shot_statsbomb_xg"] * 900 + 40,
              c="#0f5132", edgecolors="white", ax=ax, label="Goals")
ax.set_title("Arsenal — shot map (size ∝ xG)", fontsize=14)
ax.legend(loc="lower left")
plt.show()
The same shot map shown schematically in 2.2 — here generated from real coordinates with mplsoccer.
The player radar
Radars (spider charts) are the standard way to show a player's statistical fingerprint against positional peers. Each spoke is a metric, scaled to a percentile so the shape, not the raw value, carries the meaning.
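The percentile scaling behind each spoke is simple to compute. The sketch below uses one common convention (ties count as half); other definitions exist, and the peer values are made up for illustration.

```python
from bisect import bisect_left, bisect_right

def percentile_rank(value, peers):
    """Percentile of `value` within `peers` (0-100), counting ties as half."""
    ordered = sorted(peers)
    below = bisect_left(ordered, value)
    equal = bisect_right(ordered, value) - below
    return 100 * (below + 0.5 * equal) / len(ordered)

# Hypothetical progressive passes per 90 for ten midfielders
peers = [2.1, 3.4, 3.9, 4.4, 4.8, 5.0, 5.5, 6.2, 6.9, 7.3]
print(percentile_rank(5.5, peers))  # 6 peers below, 1 equal -> 65.0
```

The choice of peer group matters more than the formula: a percentile against all outfield players and one against central midfielders in comparable leagues can tell very different stories.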
Principles for football charts
- Percentiles beat raw numbers for comparison — "83rd percentile for progressive passes" is instantly meaningful; "6.2 per 90" is not until you know the distribution.
- Always show the reference — versus league average, peers, or the opponent. A lone number is not a story.
- Direction-encode the pitch — make clear which way a team attacks; mirror opponent data so both read left-to-right consistently.
- Less ink, more signal — a coach has thirty seconds. Strip chart-junk; highlight the one thing that matters.
- Colour with meaning and access — consistent team colours, outcome-based encodings, and palettes that survive colour-blindness and a projector.
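The "direction-encode the pitch" principle often comes down to a single transformation. As a small sketch on a 120×80 pitch, mirroring a point through the centre spot turns right-to-left data into left-to-right data. Whether you need it depends on the feed, since some providers already normalise attacking direction; the coordinates below are hypothetical.

```python
PITCH_LENGTH, PITCH_WIDTH = 120, 80  # StatsBomb-style dimensions

def mirror(x, y):
    """Rotate a point 180 degrees about the centre spot."""
    return PITCH_LENGTH - x, PITCH_WIDTH - y

# Hypothetical opponent shots recorded while attacking towards x = 0
opponent_shots = [(18.0, 35.0), (9.5, 44.0)]
print([mirror(x, y) for x, y in opponent_shots])
```

Flipping both axes (rather than x alone) keeps left and right flanks correct from the attacking team's point of view.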
3.4 Combining Video and Data Using Professional Tools
Numbers convince analysts; video convinces players. The decisive skill in a club is welding the two so a statistic becomes a coachable moment on screen.
A PPDA of 7 means nothing to a winger. Three clips of the opponent's front three springing the press — each tagged to the exact second the data flagged — means everything. The pipeline that connects them is the core of applied performance analysis:
Tagging platforms
Hudl Sportscode, Wyscout and similar let analysts code events against video, building searchable timelines. Increasingly, provider event data auto-syncs to footage, so a data filter instantly returns the matching clips.
Synced data + film
Because events carry timestamps, you can query the data ("all opponent high turnovers leading to a shot") and jump straight to those video moments — analysis and evidence in one motion.
The data-to-video loop in practice
1 · Find the pattern in the data
e.g. "we concede 0.6 xG per game from balls into the left half-space."
2 · Filter to the moments
Pull every event matching the pattern with its timestamp.
3 · Pull the clips
The tagging tool returns those exact video windows.
4 · Build the package
A short, sharp reel for the unit meeting, ideally with on-screen overlays (the run, the space, the number).
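The middle of that loop, going from flagged events to video windows, can be sketched in a few lines. This is an illustration only: the event tuples, the tag name and the padding values are assumptions, and real tagging platforms each have their own import formats.

```python
def clip_windows(timestamps, before=8.0, after=5.0):
    """Turn event times (seconds of match video) into merged (start, end) windows.

    The padding values are assumed defaults, not a standard.
    """
    windows = []
    for t in sorted(timestamps):
        start, end = max(0.0, t - before), t + after
        if windows and start <= windows[-1][1]:
            # Overlaps the previous clip: extend it instead of cutting a new one
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))
    return windows

# Hypothetical events: (video time in seconds, tag)
events = [
    (312.0, "high_turnover"),
    (318.5, "high_turnover"),
    (1450.0, "pass"),
    (2711.0, "high_turnover"),
]
flagged = [t for t, tag in events if tag == "high_turnover"]
print(clip_windows(flagged))
# [(304.0, 323.5), (2703.0, 2716.0)]
```

Merging overlapping windows keeps the reel short: two turnovers six seconds apart become one clip rather than two near-duplicates.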
Why this is the job
The analyst who can say "here's the pattern, here's the proof on film, here's what we do about it" is worth far more than one who only produces charts. Video is how data clears the final, hardest hurdle: getting a player to believe it and change behaviour.
3.5 A Day in the Life of a Professional Football Data Scientist
If the analyst (1.4) lives on the match-week calendar, the data scientist lives on the model-and-pipeline calendar. The work is less about the next fixture and more about the systems everyone else depends on.
Check overnight jobs. Did every match feed ingest? Did the nightly model run finish? Data engineering is most of data science, and broken pipelines block the whole department.
Improve an in-house model — re-train the recruitment-value model on new season data, validate an xT surface against held-out matches, debug why a feature drifted.
A model is only trusted if it's tested. Back-test predictions against what actually happened; quantify uncertainty; resist the temptation to ship something that fit last season but won't generalise.
Build the internal app, query layer or notebook template that lets non-coding analysts self-serve. Force-multiplying the department is the highest-leverage work here.
Read a new paper (possession value, tracking models), prototype an idea, present findings to the sporting director. The frontier moves fast; staying current is part of the job.
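The back-testing step in that day can be made concrete with a simple scoring rule. The sketch below uses the Brier score, one of several reasonable choices, to compare a model's held-out predictions with a naive baseline. All numbers are made up, and in real work the baseline rate would come from the training data rather than the held-out set.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes.

    Lower is better; 0 is a perfect forecast.
    """
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

# Hypothetical held-out shots: model xG and whether each was scored
predicted = [0.05, 0.10, 0.35, 0.70, 0.08, 0.22]
scored = [0, 0, 1, 1, 0, 0]

# Naive baseline: predict the same conversion rate for every shot
# (taken from this sample purely to keep the example short)
base_rate = sum(scored) / len(scored)
baseline = [base_rate] * len(scored)

model_score = brier_score(predicted, scored)
baseline_score = brier_score(baseline, scored)
print(round(model_score, 4), round(baseline_score, 4))
```

A model that cannot beat the constant-rate baseline on matches it has never seen is not ready to inform a decision, however well it fit last season.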
| | Data analyst | Data scientist |
|---|---|---|
| Cadence | The fixture list | The model/release cycle |
| Output | Reports, clips, answers | Models, pipelines, tools |
| Audience | Coaches, scouts | Analysts, sporting director |
| Core skills | Football + comms + analysis | ML + engineering + stats |
In small clubs one person wears both hats; in elite setups they are distinct teams. Both need football understanding — a data scientist who can't tell a meaningful feature from a noisy one builds elegant, useless models.
4.1 Introduction to Use Cases
Everything so far — data types, metrics, tools — exists to serve a decision. Part 4 walks four realistic jobs an analyst is actually handed, each ending in something a coach or director can use.
The four cases below are deliberately the everyday core of the role, not exotic projects. They are illustrated with representative (not real) numbers so you can follow the reasoning, which is the transferable part. Notice that each follows the analytical loop from 1.2: a question, a hypothesis, evidence weighed against a baseline, and a clearly communicated answer with its uncertainty.
- 4.2 Reading the opponent: pre-match. How do they play, and where can we hurt them?
- 4.3 Post-match review: did our process work, beyond the scoreline?
- 4.4 Player analysis: is this player a fit for our system and our budget?
- 4.5 The tactical report: pulling it together into a document staff act on.
4.2 Reading Your Opponent: A Practical Case Study
The brief: "We play Riverside FC on Saturday. Tell the staff how they play and how we beat them." You have their last eight matches of event and tracking data.
Step 1 — Establish identity (the baseline)
Start broad. Over eight games, Riverside average 54% possession, a PPDA of 9.2 (aggressive press), and a 4-3-3 that becomes a 2-3-5 in possession with both full-backs high. Already a picture forms: a proactive, high-pressing, possession side that commits numbers forward. The tactical opportunity with such teams is usually the space they vacate behind the full-backs.
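For readers who want to see where a number like that PPDA comes from, here is a minimal sketch of one common formulation: opponent passes divided by the pressing team's defensive actions, both counted in the part of the pitch furthest from the pressing team's goal. The event schema, the set of action types and the zone boundary are assumptions for illustration; providers differ in the details.

```python
DEFENSIVE_ACTIONS = {"tackle", "interception", "foul", "challenge"}

def ppda(events, pressing_team, zone_start=48.0):
    """Passes allowed per defensive action in the opponent's build-up zone.

    Assumptions: x runs 0-120 from the pressing team's point of view (they
    attack towards x=120), and the zone is x >= 48, i.e. the 60% of the pitch
    furthest from the pressing team's own goal.
    """
    passes = sum(
        1 for e in events
        if e["team"] != pressing_team and e["type"] == "pass" and e["x"] >= zone_start
    )
    actions = sum(
        1 for e in events
        if e["team"] == pressing_team
        and e["type"] in DEFENSIVE_ACTIONS
        and e["x"] >= zone_start
    )
    return passes / actions if actions else float("inf")

# Hypothetical events, all expressed in the pressing team's coordinate frame
events = (
    [{"team": "Opponent", "type": "pass", "x": 70.0}] * 6
    + [{"team": "Opponent", "type": "pass", "x": 30.0}]
    + [
        {"team": "Riverside", "type": "tackle", "x": 80.0},
        {"team": "Riverside", "type": "interception", "x": 60.0},
        {"team": "Riverside", "type": "tackle", "x": 20.0},
    ]
)
print(ppda(events, "Riverside"))  # 6 passes / 2 actions = 3.0
```

Because the zone and the action list are definitional choices, PPDA values are only comparable when they come from the same source or the same code.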
Step 2 — Find the weakness (test the hypothesis)
Hypothesis: their high press leaves them open to counters in behind. The evidence backs it — Riverside concede 1.6 xG per game despite dominating the ball, and 63% of that xG comes from transitions, most of it down their left where the full-back pushes highest. The defensive-action map confirms their line sits high; the tracking data shows their recovery runs are slow on the left side.
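A figure such as "63% of that xG comes from transitions" is a share-of-total calculation once each shot conceded carries a phase label. The sketch below shows the arithmetic on made-up shots; how a shot gets tagged as a transition is itself a modelling choice and is simply assumed here.

```python
from collections import defaultdict

# Hypothetical shots conceded: (phase, channel, xg)
conceded = [
    ("transition", "left", 0.40),
    ("transition", "left", 0.25),
    ("transition", "right", 0.15),
    ("set_piece", "centre", 0.20),
    ("settled", "centre", 0.25),
    ("settled", "right", 0.10),
]

by_phase = defaultdict(float)
for phase, _channel, xg in conceded:
    by_phase[phase] += xg

total = sum(by_phase.values())
shares = {phase: round(100 * xg / total, 1) for phase, xg in by_phase.items()}
print(shares)
```

Grouping on the channel instead of the phase, or on both together, answers the follow-up question of which side the danger comes down.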
Step 3 — Pressing triggers & build-up
When we have the ball, where will they press? The data shows Riverside trigger their press on our centre-backs' first touch and aggressively jump our full-backs. Their weakness: the single pivot leaves the space between their lines open if we can play through the first wave. Plan: bait the press, then break it with a forward pass into the No. 10 dropping between their midfield and defence.
Step 4 — Set pieces & individuals
Granularity wins matches. Riverside concede a high share of their chances from near-post corners and zonal-mark the six-yard box, which gives us a routine to exploit. Individually: their left-back (the one bombing forward) is beaten 1v1 more than any defender in the division; their keeper's PSxG–GA is negative, meaning he has conceded more goals than the quality of the on-target shots he faced would predict. Both go on the dossier.
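The goalkeeper figure is a straightforward subtraction once post-shot xG is available for each on-target shot faced. A minimal sketch with made-up shots (the `psxg` and `goal` keys are hypothetical field names):

```python
def psxg_minus_ga(on_target_shots):
    """Post-shot xG faced minus goals allowed.

    Positive: fewer goals conceded than the shots suggested.
    Negative: more goals conceded than the shots suggested.
    """
    psxg = sum(s["psxg"] for s in on_target_shots)
    goals_allowed = sum(s["goal"] for s in on_target_shots)
    return psxg - goals_allowed

# Hypothetical on-target shots faced by one keeper
faced = [
    {"psxg": 0.15, "goal": 0},
    {"psxg": 0.30, "goal": 1},
    {"psxg": 0.05, "goal": 1},
    {"psxg": 0.60, "goal": 1},
]
print(round(psxg_minus_ga(faced), 2))  # 1.10 - 3 = -1.9
```

Four shots is far too few to judge a keeper; the same calculation only becomes informative over a large number of shots faced.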
The answer, in one line
"Absorb their press patiently, then attack the space behind their high left-back on the counter; target near-post corners." Everything else in the report is evidence for that sentence. A coach can build a session around it on Monday — which is the test of a good opponent analysis.
4.3 Post-match Analysis: A Practical Case Study
The brief: "We drew 1–1 with a side we should beat. The coach is frustrated. Was the performance actually bad?" This is where analytics most directly protects a club from over-reacting to a result.
Separate process from outcome
The scoreline says "dropped points." The xG story may say something completely different. Suppose the match xG was 2.3 – 0.7 in our favour: we created chances worth more than two goals, conceded almost nothing, and drew because of one wonder-strike against the run of play and a string of saved efforts. That is a good performance with a bad result — and the correct message to the dressing room is "keep doing this," not "tear it up."
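To put a rough number on "good performance, bad result", you can turn shot-level xG into outcome probabilities. The sketch below treats every shot as an independent event, which is a simplifying assumption (rebounds and game state break it), and uses hypothetical shot lists chosen only so that they sum to 2.3 and 0.7.

```python
def goal_distribution(shot_xgs):
    """Probability of scoring 0, 1, 2, ... goals, treating shots as independent."""
    dist = [1.0]
    for p in shot_xgs:
        nxt = [0.0] * (len(dist) + 1)
        for goals, prob in enumerate(dist):
            nxt[goals] += prob * (1 - p)   # this shot is missed
            nxt[goals + 1] += prob * p     # this shot is scored
        dist = nxt
    return dist

def outcome_probabilities(ours, theirs):
    """Return (win, draw, loss) probabilities from our point of view."""
    us, them = goal_distribution(ours), goal_distribution(theirs)
    win = sum(pu * pt for i, pu in enumerate(us) for j, pt in enumerate(them) if i > j)
    draw = sum(pu * pt for i, pu in enumerate(us) for j, pt in enumerate(them) if i == j)
    return win, draw, 1 - win - draw

# Hypothetical shot lists summing to 2.3 xG and 0.7 xG
ours = [0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05, 0.05]
theirs = [0.30, 0.15, 0.10, 0.08, 0.07]

win, draw, loss = outcome_probabilities(ours, theirs)
print(f"win {win:.2f}, draw {draw:.2f}, loss {loss:.2f}")
```

The same total xG spread over different shots gives different probabilities, so the result describes these particular shot lists rather than every 2.3 v 0.7 match.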
The post-match checklist
- xG for & against — did we deserve more? Were chances high quality or did we inflate xG with volume of poor shots?
- The shape of chance creation — open play vs set pieces, sustained vs one-off. A flattering xG built on one penalty is not the same as a steady stream of box entries.
- Finishing vs creation — was the problem getting into good areas (a tactical issue) or converting from them (a finishing issue, often variance)?
- Defensive concession — was the goal we conceded a systemic breakdown or a freak? One tells you to fix something; the other tells you to do nothing.
- Individual underlying numbers — who progressed the ball, who created, who lost it in dangerous areas.
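Several items on the checklist above reduce to a few summary numbers per match. As a small sketch with made-up shots (the situation labels and the 0.30 "big chance" threshold are assumptions):

```python
# Hypothetical shots for one team in one match: (situation, xg)
match_shots = [
    ("open_play", 0.08), ("open_play", 0.35), ("open_play", 0.04),
    ("set_piece", 0.12), ("penalty", 0.76), ("open_play", 0.05),
]

total_xg = sum(xg for _, xg in match_shots)
non_penalty_xg = sum(xg for kind, xg in match_shots if kind != "penalty")
xg_per_shot = total_xg / len(match_shots)
big_chances = sum(1 for _, xg in match_shots if xg >= 0.30)  # assumed threshold

print(round(total_xg, 2), round(non_penalty_xg, 2), round(xg_per_shot, 2), big_chances)
```

Here more than half of the total comes from a single penalty, which is the "flattering xG" case the checklist warns about.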
The discipline of not over-reacting
Outcome bias is the post-match analyst's chief enemy. A win can hide a poor performance the data would flag; a loss can mask a strong one. The job is to report the process honestly — and, just as importantly, to know when the result genuinely did reflect a real, fixable problem rather than variance. One match is a tiny sample; trends across several are where truth lives.
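Looking at trends rather than single matches usually means a rolling average. A minimal sketch on hypothetical xG-difference values, with the four-match window chosen arbitrarily:

```python
def rolling_mean(values, window):
    """Mean of each full trailing window; the first window - 1 matches are skipped."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Hypothetical xG difference (for minus against) over eight matches
xg_diff = [1.6, -0.4, 0.9, 0.3, 1.1, -0.2, 0.8, 1.4]
trend = rolling_mean(xg_diff, window=4)
print([round(v, 3) for v in trend])
```

A longer window is smoother but slower to reveal a genuine change, so the window length is a judgement call to state alongside the chart.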
4.4 Player Analysis: A Practical Case Study
The brief: "Our deep-lying playmaker is leaving. Here's a shortlist of three replacements within budget. Who fits, and what are the risks?"
Step 1 — Define the role, not the position
"Replace the No. 6" is too vague. From 2.5, define the role by function: this player receives under pressure, progresses the ball through passes and carries, dictates tempo, and screens the defence. That becomes a target statistical profile — high progressive passes and pass completion under pressure, strong xT generated, solid (not elite) defensive volume.
Step 2 — Compare like-for-like, in context
Now the discipline of 1.2 bites. Candidate B has gaudy progression numbers — but he plays for a dominant side in a weaker league, against deep blocks, with time on the ball he won't get here. Adjust for level of competition, team style and minutes before comparing. Percentile radars against positional peers in comparable leagues make the styles legible at a glance.
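The adjustment described above can be sketched as a per-90 rate, a minimum-minutes filter and a discount for league strength. Treat this purely as an illustration of the shape of the calculation: the `LEAGUE_FACTOR` multipliers are invented placeholders, and estimating real ones is a modelling project in its own right.

```python
# Hypothetical league-strength multipliers; real values would have to be
# estimated, for example from players who moved between the leagues.
LEAGUE_FACTOR = {"target_league": 1.00, "weaker_league": 0.85}

def adjusted_per90(total, minutes, league, min_minutes=900):
    """Per-90 rate discounted by an assumed league factor.

    Returns None when the sample is below the (assumed) minutes threshold.
    """
    if minutes < min_minutes:
        return None
    return total / minutes * 90 * LEAGUE_FACTOR[league]

# Made-up season totals of progressive passes for three candidates
candidates = {
    "A": adjusted_per90(210, 2700, "target_league"),
    "B": adjusted_per90(260, 2700, "weaker_league"),
    "C": adjusted_per90(40, 450, "target_league"),
}
print(candidates)
```

Even a crude discount changes the conversation: Candidate B's raw lead over A shrinks once the league is accounted for, and C is excluded until there are enough minutes to say anything.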
Step 3 — Weigh fit, quality, risk and value
| Candidate | Style fit | Risk flags | Verdict |
|---|---|---|---|
| A | Excellent — mirrors the role | Age 30, one year left | Best fit, short-term; strong value |
| B | Good progression, weak screen | Weaker league, defensive gaps | Higher ceiling, higher risk |
| C | More destroyer than creator | Different role entirely | Only if we change system |
The honest recommendation
"A is the safest like-for-like and excellent value, but age limits resale; B is the upside bet if we can coach the defensive side; C only makes sense if we shift to a double pivot." Notice the data didn't make the decision — it framed the trade-offs clearly and stopped the club over-rating B's league-inflated numbers. Pair every model output with the eye test and live scouting before signing.
4.5 From Data to Tactics: Building a Tactical Report
The tactical report is where the whole handbook converges: data types, metrics, tools and interpretation, compressed into a document that changes how a team plays on Saturday.
A report nobody reads is a failure regardless of its analytical quality. The craft is ruthless prioritisation — surfacing the two or three things that will decide the match and burying everything else in an appendix. Structure beats volume.
Anatomy of a tactical report
1 · Executive summary — the one-slide answer
Three bullets a coach can act on, up front. "Counter the space behind their high left-back; survive their press through our pivot; attack near-post corners." If they read nothing else, this is the report.
2 · Opponent in & out of possession
Their shape with and without the ball, how they build, where they progress, pressing intensity (PPDA) and triggers — each claim backed by a chart or a clip, never an assertion alone.
3 · Strengths to neutralise
Their dangerous patterns and players, with the specific defensive adjustments to limit them. Honesty about their quality earns trust.
4 · Weaknesses to exploit
The heart of the report: where they concede xG, which match-ups favour us, the routes to goal — translated into instructions, not just observations.
5 · Set pieces
Attacking and defensive dead-ball tendencies — increasingly a decisive, coachable edge worth its own section.
6 · Our game plan
It loops back to us: concrete principles for our shape, our build-up against their press, our pressing scheme, and our transition triggers.
From insight to instruction
The final translation is the one that matters most. A data insight ("they concede 63% of their xG in transition") is useless to a player until it becomes an instruction ("win the ball and get it forward within three seconds; look for the run in behind their left-back"). The best analysts live in this last mile: turning a probability into a behaviour.
"The report is not the deliverable. The changed decision on the pitch is the deliverable. Everything else is plumbing."
The closing principle of football data analysis
Putting it all together
You now have the full arc: data (Part 1) is collected and made trustworthy; models (Part 2) turn it into meaning; tools (Part 3) make the workflow repeatable; and interpretation (Part 4) converts meaning into decisions. The fundamentals end here — but the craft is a lifetime. The fastest way to learn it is to take a real match from StatsBomb's open data and walk it through all four parts yourself.
Glossary & Sources
A quick-reference glossary of the metrics and terms used in this handbook, followed by the sources consulted.
Core metrics & terms
| Term | Meaning |
|---|---|
| xG — Expected Goals | Probability (0–1) that a shot becomes a goal, given its characteristics. |
| PSxG — Post-shot xG | xG for on-target shots including shot placement; isolates finishing & shot-stopping. |
| xA — Expected Assists | xG of the shot a completed pass created; credits chance quality created. |
| xT — Expected Threat | Possession-value model; values ball progression via a pitch value-surface (Karun Singh, 2018). |
| VAEP | Valuing Actions by Estimating Probabilities; context-aware value for nearly all actions. |
| SPADL | Common action language that normalises providers' event feeds (socceraction). |
| PPDA | Passes Per Defensive Action; proxy for pressing intensity (lower = more aggressive). |
| Packing | Number of opponents taken out of play by a pass or carry (Impect). |
| Event data | Time-stamped, located on-ball actions (typically 3,400 or more per match). |
| Tracking data | Position of all 22 players + ball, sampled ~25×/second; the off-ball game. |
| Freeze-frame | Snapshot of all player positions at a key event (e.g. a shot); enriches xG. |
| Clustering | Grouping players by statistical style to define data-driven roles. |
Sources & further reading
Hudl — Expected Goals (xG) explained
Stats Perform — Expected Goals
Hudl — Possession value models (xT) explained
Karun Singh — Introducing Expected Threat (xT)
socceraction (SPADL, xT, VAEP)
Hudl — Defensive metrics & the high press
Stats Perform — How we measure pressure
PITCH IQ — K-Means player clustering
Liam Henshaw — Where to find football data