This is a passion project of mine that I did during an Independent Study in my Master's program. I only had a short amount of time the first time around so I built a proof-of-concept, but I plan to grow this into a much larger project now. This project will be a foray into using network analysis and graph machine learning to analyze flow sports, specifically Basketball because it’s my favorite sport. I hope this project inspires other sports analytics enthusiasts to keep pushing the craft forward.
I wrote a Python script that pulls the playoff games for a desired season; you can find it here (NBA Graph Machine Learning GitHub). Check out the README for directions on how to use the script to get your own playoff data if you so desire. The data is compiled from two primary sources: the nba_api client and the NBA games dataset on Kaggle (both listed in the references at the end).
In this story, I’ll discuss the network topology of the player passing network in the 2019–2020 NBA Playoffs and showcase a new way of analyzing player & team performance.
Table of Contents
- Introduction & Inspiration
- Network Topology
- Flow & Centrality Metrics
- Analysis and Evaluation
Introduction & Inspiration
I was inspired to pursue this project after I read Luís Amaral’s paper on analyzing team-based activities using network analysis. Amaral & his team analyzed EuroCup Soccer games based on how the ball was passed in each game. He quantified a new metric for player performance coined “Flow Centrality”, and proposed that it can be a suitable metric to average at the team level as well. Amaral concluded that his approach can be a sound way to measure the performance of individual entities in a flow pattern where information, content, data, items, etc. pass from one entity to the next.
I chose to apply this approach, and expand on it, for the sport I’m passionate about: Basketball. Basketball is quite different from Soccer so it likely won’t transfer over perfectly, but I do believe that the way the ball is moved (or not moved) is a strong indicator of a team’s offensive and defensive strategy and performance.
The original script provided us the mean and standard deviation of each player’s passing frequency for that season. We use these in our data ingestion script to normalize the pass frequency (the edges between nodes), which matters because some players naturally pass more often due to their position (e.g. Point Guard).
Now that we’re working with a sensible frequency of passing, we can do some further data cleaning & manipulation by taking the log and normalizing the frequency. This yields a roughly Gaussian distribution for our frequency, which makes it much easier to model.
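To make the log-and-normalize step concrete, here is a minimal sketch using only the standard library. The pass counts are made up for illustration, and `log_normalize` is a hypothetical helper, not the function from the repository; it assumes "normalizing" means z-scoring the logged values.

```python
import math
from statistics import mean, stdev

def log_normalize(frequencies):
    """Log-transform positive pass counts, then z-score the logged values."""
    # Zero counts are dropped here because log(0) is undefined (an assumption).
    logged = [math.log(f) for f in frequencies if f > 0]
    mu, sigma = mean(logged), stdev(logged)
    return [(x - mu) / sigma for x in logged]

# Hypothetical pass counts between pairs of players in one game.
pass_counts = [1, 2, 2, 3, 5, 8, 13, 21, 40]
normalized = log_normalize(pass_counts)

for count, z in zip(pass_counts, normalized):
    print(f"{count:>3} passes -> {z:+.2f}")
```

Raw pass counts are typically right-skewed, so taking the log first pulls in the long tail before the values are centered and scaled.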
Now we can work to actually create our graphs for each game and calculate the metrics. This has to be done in a systematic way since we want a different graph for each team for each game. I created a number of helper functions and wrapped them together with a controller function that yields all the relevant information we’d need for each game.
The controller function iterates through each game and creates a dictionary that stores each game’s relevant dataframe of player passing data (and other stats), the network for the game, the node attributes, and the player performance metrics (I’ll cover these in the next section).
With an additional function to just plot the network, this makes it really easy to then pick a set of games and visualize each network while also obtaining the pertinent data to analyze.
These networks have player nodes and 2 non-player nodes (Shot Made and Shot Miss). The edges between player nodes are encoded via their normalized log pass frequency. The player nodes are connected to the shot nodes via their make or miss frequencies, respectively. One side of the graph is one team and the other side is the other team. Furthermore, the player nodes are encoded (size and color) by Flow Centrality (top) and Flow Dominance (bottom). These are two flow metrics we can use to measure player performance, which we’ll dive deeper into in the next section.
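The structure described above can be sketched with NetworkX as a weighted directed graph. The player names, weights, and the `build_game_graph` helper are all hypothetical stand-ins for illustration, not the helper functions from the repository.

```python
import networkx as nx

SHOT_NODES = ["Shot Made", "Shot Miss"]

# Hypothetical rows for one team in one game: (passer, receiver, weight).
passes = [
    ("Player A", "Player B", 1.4),
    ("Player B", "Player A", 1.1),
    ("Player B", "Player C", 0.7),
    ("Player C", "Player A", 0.9),
]
# Hypothetical shot outcomes: (player, made frequency, missed frequency).
shots = [
    ("Player A", 0.6, 0.4),
    ("Player B", 0.3, 0.7),
    ("Player C", 0.5, 0.5),
]

def build_game_graph(passes, shots):
    """Build a directed graph of pass edges plus edges into the two shot nodes."""
    G = nx.DiGraph()
    G.add_nodes_from(SHOT_NODES)
    for passer, receiver, weight in passes:
        G.add_edge(passer, receiver, weight=weight)
    for player, made, missed in shots:
        G.add_edge(player, "Shot Made", weight=made)
        G.add_edge(player, "Shot Miss", weight=missed)
    return G

game_graph = build_game_graph(passes, shots)
print(game_graph.number_of_nodes(), game_graph.number_of_edges())
```

The shot nodes only ever receive edges, so they act as sinks: every sequence of passes eventually ends in a make or a miss.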
Number of nodes : 18
Number of edges : 82
Maximum degree : 15
Minimum degree : 1
Average degree : 9.11111111111111
Median degree : 10.0
Strongly Connected Components : 8
Weakly Connected Components : 1
Average Distance : 0.5294117647058824
Transitivity : 0.47277227722772275
Average Clustering Coefficient : 0.43494245763770245
The above stats are just for the Lakers vs Heat 2020-10-09 network. It is interesting to note how this sports network differs from common networks like social ones. Those often have high triadic closure (if person A is connected to person B and person B is connected to person C, then person A likely would want to be connected with person C) and are very large. A network for each game is kind of the opposite: it tends to have low triadic closure (because it’s driven by the basketball acumen of the players & the coach and not just social connectivity) and is also relatively small. This can lead us into new territory of how to reason from and interpret graph information.
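Most of the statistics listed above can be computed with standard NetworkX calls. The sketch below runs on a tiny made-up graph, so it does not reproduce the numbers for the actual game; `topology_summary` is a hypothetical helper, and computing transitivity and clustering on an undirected copy of the graph is an assumption on my part.

```python
from statistics import mean, median
import networkx as nx

def topology_summary(G):
    """Collect basic topology statistics for a directed game network."""
    degrees = [d for _, d in G.degree()]  # in-degree + out-degree per node
    U = G.to_undirected()                 # assumption: triangles ignore direction
    return {
        "nodes": G.number_of_nodes(),
        "edges": G.number_of_edges(),
        "max_degree": max(degrees),
        "min_degree": min(degrees),
        "average_degree": mean(degrees),
        "median_degree": median(degrees),
        "strongly_connected_components": nx.number_strongly_connected_components(G),
        "weakly_connected_components": nx.number_weakly_connected_components(G),
        "transitivity": nx.transitivity(U),
        "average_clustering": nx.average_clustering(U),
    }

# Tiny hypothetical network: three players and one shot node.
G = nx.DiGraph([
    ("A", "B"), ("B", "A"), ("B", "C"), ("C", "A"),
    ("A", "Shot Made"), ("B", "Shot Made"), ("C", "Shot Made"),
])
summary = topology_summary(G)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Because the shot nodes have no outgoing edges, each one forms its own strongly connected component, which is one reason a game network can have several strongly connected components while remaining a single weakly connected one.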
When we look at the topology across all the Playoff games, here’s what we find.
Across multiple games, we do find the degree frequency to follow a slight power law distribution, which seems natural. When we take the log of the degree frequency, we see less of a smooth distribution and more of a handful of specific values that spike. This is likely because each game has certain players who stand out, certain players who perform about average, and certain players who barely contribute at all, and they all fall into very similar buckets. It could also be because we are only looking at one playoff season; with more data, we would likely see more variation.
What are we really measuring though? What are we looking at?
When we model basketball games as a graph whose edges encode pass and shot frequencies, we are modeling the fundamental strategy each team plays. Each team’s objective is to score more points than the opposing team, and how players move the ball from one entity to the next is the main lever they have to optimize for that objective, so this way of encoding a game captures the objective well.
Technically, this can be even more effective if we draw a graph for a level even more granular than the game but that can be incredibly computationally intensive to monitor and calculate.
Flow & Centrality Metrics
When we want to reason from this graph, we can use centrality metrics and variants of them to assess importance of players and teams.
An additional centrality measure that can serve as a good measure of performance is Load Centrality, computed for each game network. The Load Centrality of a node is the fraction of all shortest paths that pass through that node. This makes sense in Basketball, as you ultimately want to find the shortest path by which the ball can reach the Shot Made node.
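As a minimal sketch, NetworkX exposes this measure as `nx.load_centrality`. The graph below is hypothetical, and the call is left unweighted because NetworkX treats edge weights as distances; whether and how the pass weights should be passed in is an open choice that this sketch does not settle.

```python
import networkx as nx

# Hypothetical game: every pass and the only shot go through Player B.
G = nx.DiGraph([
    ("Player A", "Player B"),
    ("Player C", "Player B"),
    ("Player B", "Player A"),
    ("Player B", "Player C"),
    ("Player B", "Shot Made"),
])

# Fraction of shortest paths passing through each node (unweighted here).
load = nx.load_centrality(G)
top_player = max(load, key=load.get)

for node, value in sorted(load.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{node}: {value:.3f}")
```

In this toy network Player B sits on every shortest path between the other nodes, so B carries all of the load, while the Shot Made node, being a pure endpoint, carries none.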
Amaral coins the metric of Flow Centrality in his paper, which is a normalized log of betweenness centrality — essentially how often a player comes in between paths when you have shot made/miss nodes at the ends.
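One way to approximate this description is sketched below, and it should be read as an assumption rather than Amaral's exact definition or the code from the repository. It counts only shortest paths that start at a player and end at a shot node, via `nx.betweenness_centrality_subset`, then takes `log1p` (to cope with zeros) and z-scores the result. The graph and the `flow_centrality_sketch` helper are hypothetical.

```python
import math
from statistics import mean, pstdev
import networkx as nx

SHOT_NODES = ["Shot Made", "Shot Miss"]

def flow_centrality_sketch(G):
    """Normalized log of betweenness on player-to-shot shortest paths."""
    players = [n for n in G if n not in SHOT_NODES]
    raw = nx.betweenness_centrality_subset(G, sources=players, targets=SHOT_NODES)
    logged = {p: math.log1p(raw[p]) for p in players}  # log1p handles zero counts
    mu = mean(logged.values())
    sigma = pstdev(logged.values()) or 1.0             # avoid dividing by zero
    return {p: (value - mu) / sigma for p, value in logged.items()}

# Hypothetical game in which most paths to a shot run through Player B.
G = nx.DiGraph([
    ("Player A", "Player B"), ("Player C", "Player B"),
    ("Player B", "Player A"), ("Player B", "Player C"),
    ("Player B", "Shot Made"), ("Player B", "Shot Miss"),
    ("Player A", "Shot Miss"),
])
flow = flow_centrality_sketch(G)
print(max(flow, key=flow.get))
```

Betweenness excludes path endpoints, so a player is credited only when the ball passes through them on the way from a teammate to a shot.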
In my initial analysis of this metric applied to Basketball (Amaral showed it can be effective when applied to Soccer), I found that it wasn’t as effective a measure, mainly due to the differences between Basketball and Soccer. In Soccer, the game is primarily spent passing the ball and occasionally scoring. This is starkly different from Basketball, which has fewer players on a smaller court. In Basketball, shots happen much more frequently, and this has cascading effects on the measurement of importance.
Basketball promotes a rich-get-richer effect with the ball: players who start to make more shots will tend to get the ball more on the next play, and players who start to miss will tend to get it less. There is a more frequent, shorter feedback cycle in Basketball, which is why I coin a metric called “Flow Dominance”, a variant of Hubs & Authorities.
According to Stanford NLP Group, “This approach stems from a particular insight into the creation of web pages, that there are two primary kinds of web pages useful as results for broad-topic searches. By a broad topic search, we mean an informational query such as ‘I wish to learn about leukemia’. There are authoritative sources of information on the topic; in this case, the National Cancer Institute’s page on leukemia would be such a page. We will call such pages authorities; in the computation we are about to describe, they are the pages that will emerge with high authority scores.”
On the other hand, there are many pages on the Web that are hand-compiled lists of links to authoritative web pages on a specific topic. These hub pages are not in themselves authoritative sources of topic-specific information, but rather compilations that someone with an interest in the topic has spent time putting together.
A key assumption here is that nodes that have incoming edges from good hubs are good authorities, and nodes that have outgoing edges to good authorities are good hubs. For basketball, this seems to model our ‘rich get richer’ phenomenon well:
Player X starts doing well -> Player X gets passed the ball more often (becomes a good authority) -> Strategy shifts around Player X -> Player X has more people to pass to by virtue of having the ball more (good hub)
If at any point Player X starts missing more, then the flow goes in the other direction: Player X has the ball less and has fewer people to pass to (low authority and low hub scores). So, an ideal player would be both a good hub and a good authority. In my study, I multiply each node’s hub and authority scores and then apply a Yeo-Johnson Power Transformation, and this becomes Flow Dominance.
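The recipe above can be sketched with `nx.hits` and `scipy.stats.yeojohnson`. This is an illustration under assumptions, not the code from the repository: the graph and weights are made up, the helper names are hypothetical, the transform's lambda is fitted by SciPy's default maximum-likelihood estimate, and scores are reported for players only.

```python
import networkx as nx
from scipy.stats import yeojohnson

SHOT_NODES = ["Shot Made", "Shot Miss"]

def hub_authority_product(G):
    """Multiply each player's HITS hub score by their authority score."""
    hubs, authorities = nx.hits(G, max_iter=1000)
    return {n: hubs[n] * authorities[n] for n in G if n not in SHOT_NODES}

def yeo_johnson_scores(products):
    """Apply a Yeo-Johnson power transform to the raw products."""
    names = list(products)
    transformed, _fitted_lambda = yeojohnson([products[n] for n in names])
    return {n: float(v) for n, v in zip(names, transformed)}

# Hypothetical weighted game network with four players.
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("A", "B", 3), ("C", "B", 2), ("D", "B", 1),
    ("B", "A", 1), ("B", "C", 2), ("B", "D", 1),
    ("A", "C", 1), ("C", "D", 1), ("D", "A", 1),
    ("B", "Shot Made", 2), ("C", "Shot Made", 1),
    ("A", "Shot Miss", 1), ("D", "Shot Miss", 1),
])
products = hub_authority_product(G)
dominance = yeo_johnson_scores(products)
players = list(products)
print(sorted(players, key=dominance.get, reverse=True))
```

The Yeo-Johnson transform is monotonic, so it reshapes the distribution of scores without changing the ranking of players. Multiplying the two scores means a player needs both a high hub score and a high authority score to rank well.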
Analysis and Evaluation
We can start to assess these new metrics by measuring their Pearson correlation with other measures of success that we already have (Points, Assists, Blocks, Rebounds, Steals, etc.).
As we found, Flow Centrality doesn’t correlate with the other notable metrics as consistently as Load Centrality and Flow Dominance do. Flow Dominance seems moderately correlated, but Load Centrality has very high correlation. This is really fascinating because we may have found a new way to appropriately measure performance and strategy. How does this metric scale up to the team level?
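For reference, the Pearson correlation between a network metric and a box-score stat can be sketched in a few lines. The numbers are invented for illustration and `pearson` is a hypothetical helper; with the data in a pandas DataFrame, `DataFrame.corr()` computes the same coefficient by default.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Hypothetical per-player values for one game.
load_centrality = [0.42, 0.31, 0.18, 0.05, 0.02]
assists = [9, 6, 4, 1, 0]

r = pearson(load_centrality, assists)
print(f"Pearson r = {r:.3f}")
```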
We can average these metrics up for each team in a game, and see if the team with the higher average wins or not. A simple way to evaluate this is via a ROC Curve and assess the AUC (Area Under Curve). We have a field in the original dataset of whether the Home Team wins or not, so we can use that as “y_true” and our new field based on the average aggregated up can be our “y_pred”.
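A minimal sketch of this evaluation follows, with several assumptions: the game data is invented, `top_k_mean` and `roc_auc` are hypothetical helpers, and the score for each game is taken to be the home team's average minus the away team's average. The AUC is computed with the pairwise-ranking definition, which is the quantity `sklearn.metrics.roc_auc_score` reports for binary labels.

```python
def top_k_mean(values, k=5):
    """Average the k largest player values for one team."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)

def roc_auc(y_true, y_score):
    """Probability that a random positive outranks a random negative."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical games: (home player metrics, away player metrics, home team won).
games = [
    ([0.9, 0.8, 0.7, 0.6, 0.5, 0.1], [0.5, 0.4, 0.3, 0.2, 0.1], 1),
    ([0.4, 0.3, 0.2, 0.2, 0.1], [0.8, 0.6, 0.5, 0.4, 0.3], 0),
    ([0.6, 0.5, 0.5, 0.4, 0.3], [0.7, 0.5, 0.4, 0.3, 0.2], 1),
    ([0.5, 0.5, 0.4, 0.4, 0.3], [0.6, 0.4, 0.3, 0.3, 0.2], 0),
]
y_true = [home_win for _, _, home_win in games]
y_pred = [top_k_mean(home) - top_k_mean(away) for home, away, _ in games]

auc = roc_auc(y_true, y_pred)
print(f"AUC = {auc:.2f}")
```

An AUC of 0.5 means the aggregated metric ranks winners no better than chance, while 1.0 means it always gives the winning home team the higher score.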
Each of the metrics we’re assessing here seems to have moderate success when aggregated up to the team level (this is averaging the top 5 players). Flow Dominance has the best AUC score here at 0.6. For context, here are the AUC scores for Assists, Steals, Blocks, and Rebounds (I did not look at Points because that should be an obvious 1.0).
AST AUC: 0.7212209302325582
BLK AUC: 0.5863372093023256
STL AUC: 0.5523255813953488
REB AUC: 0.6854651162790697
Considering these are standard metrics that we commonly use to measure performance, we can reasonably accept the AUC we find with our new network metrics. Considering these metrics alongside the correlation matrix, we can see how metrics like Load Centrality and Flow Dominance are not only measures that reflect performance but they have a unique component of encoding game strategy in them as well. This is due to network metrics being specifically based on the relationship that nodes (players) have with each other, and in this case it reflects how the ball moves in a game.
Here’s what the top 10 players come out to be for the 2019–20 Playoffs with these metrics:
You do see notable players stick out here like LeBron James, Nikola Jokic, Giannis Antetokounmpo, and Jimmy Butler. This was the finals in which the Lakers beat the Miami Heat, so it makes sense to see prominent players from those teams. We also see players stick out whom we may not think of as all-stars, like Monte Morris. He played decently, but he is not someone I’d expect to be in the top 10.
In this story, I dove into NBA Player Passing & Shooting data by modeling each game’s frequencies as a graph. This opens the door for us to look at the sport as it should be measured: relationally. Sports, specifically flow sports, should be measured not just individualistically but as a network. This allows us to use a larger array of metrics to look at performance and strategy, which can yield new insights into how the game of Basketball is played. Reasoning via graphs forces you to expand your scope of inference by at least one more dimension. It forces you to think about your data and the concept you’re studying differently than you would have otherwise; most of the time this forces a deeper study and review of the true nature of the underlying concept.
I only measured this for the 2019–20 Playoffs, but for future iterations I think it would be really fascinating to see how flow metrics may differ between Regular Season games and Playoffs/Finals. Furthermore, I think it would be especially useful if we could measure graphs at a more granular level than the end of a game. How the flow of a game’s network evolves adds a fourth dimension to this: time. If that proves too computationally intensive, it could also be approximated with our method by measuring rate metrics such as Average Points per Minute.
If you read this and take inspiration to move this analysis forward, please do let me know! I’d love to see how network analysis of sports can go even further.
This story is connected to the Graph Machine Learning series I’m working on right now. Here are Part 1 and Part 2:
Graph Machine Learning with Python Part 1: Basics, Metrics, and Algorithms
An introduction to networks via key metrics and algorithms on a Football dataset
Graph Machine Learning with Python Part 2: Random Graphs and Diffusion Models of CryptoPunks…
Simulating and modeling the CryptoPunks trading data via a graph
In Part 3 of my series, I’ll use this NBA dataset to go into Unsupervised Learning, Supervised Learning, and Neural Networks. Thanks much for reading along!
As a bonus, here’s what the network of teams playing in the 2019–20 NBA Finals looks like. The node size and color are encoded by the team’s wins.
 nba_api Client (Swar Patel)
 NBA games data (Nathan Lauga, Kaggle)
 Quantifying the Performance of Individual Players in a Team Activity (Jordi Duch, Joshua S. Waitzman, Luís A. Nunes Amaral)
 Easley, David and Kleinberg, Jon. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World
 NBA Statistics