Summary

Gyms play an important role in the development of the rapidly evolving meta strategy of mixed martial arts (MMA). Motivated by curiosity as a casual fan and hobbyist of the sport, I aggregate and visualize publicly available data to provide insight into gym performance and distribution across California.

Scraping and Tabulating: Pandas and BeautifulSoup

California’s amateur MMA governing body maintains a website with fighter information and bout statistics. Queries are limited to searching for individual fighters.

The data as it is presented on the website of the governing body for amateur MMA in California.

The data exist as a paginated table, making it relatively straightforward to scrape. I used the Pandas function read_html() to scrape and pipe the data directly into a dataframe. It uses BeautifulSoup4 on the back-end, which itself is just a wrapper for a set of html parsers. This loop goes through each page of the table and appends the data to the main dataframe object:

import pandas as pd

page_size = 20 #number of rows per page
row_count = 1
keep_going = True
df = pd.DataFrame()

while keep_going:
    url = f'https://camomma.org/Users-fighters&from={row_count}'
    temp_df = pd.read_html(url)[0]    
    df = pd.concat([df,temp_df])    
    row_count += page_size

    if len(temp_df) == 0:
        keep_going = False

The resulting dataframe contains the fighter’s name, nickname, age, record, weight class, gym association and city:

Name Weight Class Association / Ring Name City
Derrick Ko MMA record: 0-0-0-0 Age: 25 Bantamweight - 125 to 135 pounds Goon Squad MMADerrick "Cash" Ko San Francisco
Kaien Cicero MMA record: 0-0-0-0 Age: 20 Lightweight - 145.1 to 155 pounds None Brea
Darrian Williams MMA record: 0-0-0-0 Age: 28 Heavyweight - 230.1 to 265 pounds Samurai Dojo Exeter
Christian Zubiate MMA record: 0-0-0-0 Age: 25 Welterweight - 155.1 to 170 pounds 10th planetHead Hunter Newport Beach
Kevin Wright MMA record: 0-0-0-0 Age: 25 Middleweight - 170.1 to 185 pounds 10th PlanetKevin Wright phoenix

The formatting isn’t quite right, as the Name column also contains the fighter’s record and age. However, this can be fixed with a little bit of regex’ing:

df[['name', 'record', 'age']] = df['Name'].str.extract('(?i)(?P<name>.+?(?=\sMMA))\sMMA\srecord:\s(?P<record>(?<=MMA\srecord:\s)\d+-\d+-\d+-\d+).*?\sAge:\s(?P<age>\d+)', expand=True)
df.pop('Name')

#reorder columns
cols = df.columns.to_list()
cols = cols[-3:] + cols[:-3]
df = df[cols]

Here is a breakdown of the expression:

(?i)(?P<name>.+?(?=\sMMA))\sMMA\srecord:\s(?P<record>(?<=MMA\srecord:\s)\d+-\d+-\d+-\d+).*?\sAge:\s(?P<age>\d+)

(?i) : global case insensitive modifier

(?P<name>.+?(?=\sMMA)) : lazy match everything until reaching 'MMA'

(?P<record>(?<=MMA\srecord:\s)\d+-\d+-\d+-\d+) :  After 'MMA record: ', capture the 4 digits that are separated by dashes

.*?\s : Lazy catch for rows where a pankration/combat grappling record was also listed for a fighter. This data wasn't used.

Age:\s(?P<age>\d+) : Capture the digits after 'Age: '

After the initial formatting, the table looks like:

name record age Weight Class Association / Ring Name City
Derrick Ko 0-0-0-0 25 Bantamweight - 125 to 135 pounds Goon Squad MMADerrick "Cash" Ko San Francisco
Kaien Cicero 0-0-0-0 20 Lightweight - 145.1 to 155 pounds None Brea
Darrian Williams 0-0-0-0 28 Heavyweight - 230.1 to 265 pounds Samurai Dojo Exeter
Christian Zubiate 0-0-0-0 25 Welterweight - 155.1 to 170 pounds 10th planetHead Hunter Newport Beach
Kevin Wright 0-0-0-0 25 Middleweight - 170.1 to 185 pounds 10th PlanetKevin Wright phoenix

After a bit more re-structuring and filtering, I get a table that looks like:

name age weight gym city mma_w mma_l mma_d mma_nc mma_total
Gabriella Sullivan 26 Flyweight - 115.1 to 125 pounds Azteca Lakeside 0 1 0 0 1
Jeremiah Garber 27 Lightweight - 145.1 to 155 pounds 10th Planet HQ - The Yard Los Angeles 0 1 0 0 1
Rodrigo Oliveira 32 Lightweight - 145.1 to 155 pounds Triunfo Jiu-jitsu & mma Pico Rivera 1 0 0 0 1
Ronell White 20 Lightweight - 145.1 to 155 pounds Team Quest Oceanside 0 1 0 0 1
Enrique Valdez 22 Middleweight - 170.1 to 185 pounds Yuma top team Yuma 1 0 0 0 1

Grouping and Aggregation

The first step to any sort of gym-wise analysis is to group the data by gym. If the data veracity is perfect, this is accomplished with a simple call to groupby(). However, the gym name field is a human-typed string, subject to a variety of inconsistencies including:

  1. typos, e.g.
    ['systems training center', 'syatems training center']
    
  2. abbreviations, e.g.
    ['american kickboxing academy', 'american kickboxing academy (aka)',
     'aka (american kickboxing academy)',
     'american kickboxing academy a.k.a', 'aka', 'a.k.a.']
    
  3. prefix/suffix, e.g.
    ['bear pit', 'the bear pit', 'bear pit mma']
    
  4. semantics, e.g.
    ['independent', 'no team', 'solo', 'myself']
    

The semantics issue was relatively minor, so I handled it manually. I approached the other three issues collectively as a clustering problem. This immediately gave rise to two questions. Firstly, clustering necessitates a notion of distance between elements. What is a good way to define distance between strings? Secondly, which type of clustering is appropriate here?

A Distance Metric for Strings

There have been a wide variety of string metrics developed to address the problem of approximate string matching. Most approaches devise a score by counting the number of some well-defined operations needed to transform string A to string B. As an example, Levenshtein distance counts the number of substitutions, additions and deletions needed to transform one string into another.

\[lev(\text{kitten},\text{sitting}) = 3 \\ \textbf{k}\text{itten} \rightarrow \textbf{s}\text{itten} \\ \text{sitt}\textbf{e}\text{n} \rightarrow \text{sitt}\textbf{i}\text{n} \\ \text{sittin} \rightarrow \text{sittin} \textbf{g} \\\]

After testing a few different metrics, Jaro similarity produced the most reliable groupings.

Agglomerative Clustering

Clustering refers to the process of grouping data into subsets based on some criteria of similarity, i.e. distance, mean value, density, connectivity etc. There are many different algorithms to chose from because a cluster is not a well-defined object – the best way to define a cluster depends on the context of the problem.

A comparison of various clustering algorithms. Figure reproduced from the scikit-learn documentation.

Hierarchical clustering aims to build nested groups of data as a function of a pairwise distance metric. This grouping permits an ordering of the clusters into levels, forming a hierarchy. Agglomerative clustering begins with each data point in its own cluster, and iteratively merges clusters based on distance and connectivity constraints. This is in contrast to divisive clustering, which is also hierarchical, but begins with all the data in a single cluster and iteratively divides them into individual clusters. The flexibility in choosing a distance metric, and not needing a priori knowledge about the number of clusters makes hierarchical clustering a good choice for the gym data.

Before clustering, I cleaned the list of gym names by removing elements that would dilute the metric, such as spaces and commonly used terms:

gym_names_full = df['gym'].str.lower().str.replace(" ","").tolist()

import re
gym_names = [re.sub('\W', '', i) for i in gym_names_full]

#remove common string patterns to improve clustering accuracy
common_gym_terms = ['jiujitsu', 'mma', 'team', 'fighter',
                    'gym', 'fight', 'center', 'fighting']

for i, gym in enumerate(gym_names):

    for j in common_gym_terms:
        gym = gym.replace(j, '')

    gym_names[i] = gym

Then, I pre-computed the distance matrix and performed clustering:

from sklearn import cluster
from jellyfish import jaro_distance

#pre-compute the pairwise distance matrix
X_jaro_distance = np.zeros((len(gym_names), len(gym_names)))
for ii in np.arange(len(gym_names)):
    for jj in np.arange(len(gym_names)):
        X_jaro_distance[ii][jj] = 1 - jaro_distance(str(gym_names[ii]),
                                                    str(gym_names[jj]))

clustering = cluster.AgglomerativeClustering(n_clusters=None,
                                             distance_threshold=0.2,
                                             affinity='precomputed',
                                             linkage='complete'
                                             ).fit(X_jaro_distance)

Note: It is important to understand the effect of the linkage criterion on the clustering result. Linkage is how the distance between sets is calculated when determining whether or not to merge them. Complete linkage means that we consider the maximum pairwise distance between all members in the two sets as the cluster distance. In contrast, single linkage uses the minimum pairwise distance between members of the two sets. Single linkage can cause runaway cluster behavior, where one cluster grows extremely large.

A dendrogram illustrating the results of agglomerative clustering by Jaro similarity on a subset of the gym names.

The object returned by AgglomerativeClustering() includes the cluster labels for each item, which I append to my dataframe as the column gym_cluster_labels.

df['gym_cluster_labels'] = clustering.labels_
df.head()
name age weight gym city mma_w mma_l mma_d mma_nc mma_total gym_cluster_labels
Gabriella Sullivan 26 Flyweight - 115.1 to 125 pounds Azteca Lakeside 0 1 0 0 1 1255
Jeremiah Garber 27 Lightweight - 145.1 to 155 pounds 10th Planet HQ - The Yard Los Angeles 0 1 0 0 1 324
Rodrigo Oliveira 32 Lightweight - 145.1 to 155 pounds Triunfo Jiu-jitsu & mma Pico Rivera 1 0 0 0 1 206
Ronell White 20 Lightweight - 145.1 to 155 pounds Team Quest Oceanside 0 1 0 0 1 719
Enrique Valdez 22 Middleweight - 170.1 to 185 pounds Yuma top team Yuma 1 0 0 0 1 785

Now with reliable labels, it becomes possible to aggregate and get descriptive statistics. A natural question may be: how successful are the largest gyms? This code returns g as a dataframe containing the name, total mma fight count and win rate of each gym:

g = df.groupby(['gym_cluster_labels']).agg(
        gym_name = pd.NamedAgg('gym', lambda x: x.iloc[0]), #use the first gym name in the cluster as a representative, for readability
        N_wins = pd.NamedAgg('mma_w', 'sum'),
        N_total = pd.NamedAgg('mma_total', 'sum'))

Note: NamedAgg() returns g as a single-indexed dataframe, which will be easier to query later. It can be thought of as a drop-in replacement for SQL AS.

To improve readability, I modify the styler object corresponding to g so that the win rate cell is colored in proportion to its value.

s = g.sort_values(by='N_total', ascending=False).iloc[0:10,:].style
s.format(precision=2)
s.background_gradient(cmap='RdYlGn',vmin=0, vmax=1, subset=['win_rate'], axis=1)

Here is a look at the 10 largest gyms in the dataset:

gym_name N_wins N_total win_rate
Independent 289 880 0.33
Team Quest 181 355 0.51
DragonHouse MMA 101 215 0.47
Millennia 125 203 0.62
The Arena 131 202 0.65
Team Alpha Male 115 167 0.69
Bodyshop 86 153 0.56
CSW 88 145 0.61
CMMA 86 142 0.61
Victory MMA 82 130 0.63

According to the data, the largest “gym” is actually the group of fighters with no gym affiliation. This group performs notably worse than the next largest gyms. This may imply that the benefits of belonging to a gym – coaching, infrastructure, training partners – have an effect on fighter outcomes. Alternatively, good fighters may be seeking out the best gyms and vice-versa. It is difficult to infer causality without individual fight and gym affiliation information as a function of time.

The tabular representation is fine for a few rows of information, but quickly becomes overwhelming as the data grow larger.

Geospatial Visualization

A good visualization of large, multi-dimensional data is one that compresses multiple features into a relatively small number of plotted objects in an appealing way, without sacrificing information content. In this case, I use geography as the foundation to build a global visual representation of the data.

Geopy

First, I had to geocode the data – i.e., convert a list of gym names into map coordinates. GeoPy is a python library which allows a user to query the API of a variety of geocoding services. I used Google’s geocoding API, which is not free in general, but allows a certain number of free queries per month.

I query the API sequentially with every gym name in a given cluster until I get the coordinates for that gym, then go to the next cluster. This is a lazy way to to avoid excessively redundant calls to the API. The returned object is a geopy.Point() containing a latitude and longitude.

#a null column to populate with coordinates
g['gym_lat'] = pd.NA
g['gym_lon'] = pd.NA

from geopy.geocoders import GoogleV3
geolocator = GoogleV3(api_key='<MY_API_KEY>')

from geopy.extra.rate_limiter import RateLimiter
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1, max_retries=1, error_wait_seconds=10)

for gcl in g.index:
      gym_group = df[(df['gym_cluster_labels'] == gcl)]['gym']

      for gym_name in gym_group:
          location = geocode(query=gym_name, components=[('administrative_area', 'CA'), ('country', 'US')])

          #filter out null and default response from API
          if (location is not None) and (location.raw['place_id'] != 'ChIJPV4oX_65j4ARVW8IJ6IJUYs'):
              g.loc[gcl, 'gym_lat'] = location.latitude
              g.loc[gcl, 'gym_lon'] = location.longitude
              break

Sidenote: Setting up the Geocoding API

On GCP, geocoding is handled by the Geocoding API. To access the API, you have to first generate an API key. From the cloud console dashboard, enter the navigation menu, then go to APIs & ServicesCredentials

On the top of the Credentials page, click the Create Credentials button and select API key from the dropdown menu. The new key will show up on your credentials dashboard. In the row with the newly created key, select ActionsEdit API Key. Under API restrictions, select Restrict Key, then in the dropdown menu, check the Geocoding API and OK.

Now if you click SHOW KEY on the credentials dashboard, you’ll see the alphanumeric string you need to pass to geopy to authenticate the calls to the API.

Warning: The API key is linked to your billing account, so keep it protected!

Folium

With the geocoded data, I generated an interactive map of the MMA gyms across California with Folium, a python wrapper for leaflet.js. First, I generate the folium.Map object with a default map center and starting zoom value. Then, I define a RGBA colormap to color the map markers. Each marker is a circle centered on a gym location, with its size and color corresponding to the gym’s total fight count and win rate, respectively. The unaffiliated fighters are represented by a marker placed off the coast of Santa Cruz:

marker_data = g
marker_data.loc[413, 'gym_lon'] = -125. #shift the 'Independent' entry

import folium
ca_coordinates = (37.41928,-119.30639)
map = folium.Map(location=ca_coordinates,
                 zoom_start=6,
                 max_zoom=9,
                 tiles='cartodb dark_matter')

import branca.colormap as cm
#Red-White-Green colormap
gradient = cm.LinearColormap([(255,0,0,255), (255,255,255,125),(0,255,0,255)],
                              vmin=0., vmax=1.,
                              caption='win rate')

for idx,row in marker_data.iterrows():
    folium.Circle(
        location=[row.gym_lat, row.gym_lon],
        radius=row.N_total*100,
        weight=0, #stroke weight
        fill_color= gradient.rgba_hex_str(row.win_rate),
        fill_opacity=gradient.rgba_floats_tuple(row.win_rate)[-1]
    ).add_to(map)

map.add_child(gradient)

map
A map of MMA gyms in California. The size of each marker is scaled to the total number of fights of the gym, and the color corresponds to the percentage of fights won. The unaffiliated fighters are aggregated into a single marker shown off the coast of Santa Cruz.

If you are curious about the thumbnail image for this project, see here

Updated: