NOTE: The accompanying code for this project can be found here

# Summary

In this project, I am provided with a minimal dataset containing locations and sizes of self-storage facilities across the country. The objective is to identify demographic variables that influence the self-storage market. Using data from the American Community Survey, I perform non-parametric robust regression to identify key factors that influence self-storage demand. I also explain and demonstrate the importance of choosing an optimization scheme that matches the business context of a problem. I conclude with a geospatial visualization of the results and a recommendation of specific regions with high unmet demand.

# Intro

Self-storage is an expansive, multi-billion dollar industry utilized by 1 in 11 Americans (ref). In contrast to residential real estate, self-storage is a relatively stable investment, as indicated by a low SBA loan default rate (ref). Major industry players compete for control of metropolitan areas due to their high profitability. In this case study, a self-storage company gives me a .csv containing the locations and square footage of self-storage facilities, with an open-ended objective: explore demographic trends that influence the self-storage market and provide recommendations for the company's investment strategy.

|   | Market      | Owner/Operator, Franchise | ADDRESS | CTY         | ST | Zip   | Area  | Year |
|---|-------------|---------------------------|---------|-------------|----|-------|-------|------|
| 0 | Albuquerque | #####                     | #####   | Santa Fe    | NM | 87505 | 73934 | 2000 |
| 1 | Albuquerque | #####                     | #####   | Rio Rancho  | NM | 87124 | 72836 | 2000 |
| 2 | Albuquerque | #####                     | #####   | Albuquerque | NM | 87114 | 80889 | 1998 |
| 3 | Albuquerque | #####                     | #####   | Albuquerque | NM | 87111 | 62697 | 1997 |
| 4 | Albuquerque | #####                     | #####   | Albuquerque | NM | 87114 | 60821 | 1998 |

Scenarios like this are an accurate representation of real-world data science, where a solution involves business context, messy data from multiple sources, and a healthy amount of trial and error. Market strategy is a complex, dynamic system involving expansion, contraction, and homeostasis, so the first step is to form a quantitative problem statement. My plan was to find localized demographic data and regress it against a demand metric derived from the given data.

## Model Selection

Gradient-boosted decision trees (GBDT) are the state of the art for regression and classification on structured data, as demonstrated by the sustained popularity of libraries like XGBoost and LightGBM (ref). I used LightGBM with 3-fold cross-validation for all results shown in this project.

# Modeling Demand

One challenge in this project is modeling demand from the given data, which contains no direct measure of it. According to financial models of self-storage, cash flow for a facility is roughly proportional to its square footage, so the realized demand for a given geographical region $A$ can be approximated as

$D_{A} = \sum_{i \in A} S_i$

where $S_i$ is the leasable square footage of facility $i$ in region $A$ (ref). This simple model assumes a national average rental price.
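In pandas, this sum is a one-line groupby. The facility table below is hypothetical (column names and values are illustrative, loosely following the data preview above), but the aggregation is exactly the formula for $D_A$:

```python
import pandas as pd

# Hypothetical facility table: one row per facility,
# with its region and leasable square footage S_i
facilities = pd.DataFrame({
    'region': ['Albuquerque', 'Albuquerque', 'Santa Fe'],
    'sqft':   [80889, 62697, 73934],
})

# D_A = sum of S_i over all facilities i located in region A
demand = facilities.groupby('region')['sqft'].sum()
```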

The Census Bureau defines many different geographical areas. While zip codes offer granularity, core-based statistical areas (CBSAs) are more useful from a business perspective because they indicate areas of high population density and economic activity. So, the demand target was modeled as an aggregation over CBSAs, using a crosswalk table provided by the Department of Housing and Urban Development to convert zip codes to CBSAs (ref).
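The zip-to-CBSA conversion amounts to a left join against the crosswalk followed by the same aggregation. The tiny crosswalk and facility tables below are hypothetical stand-ins for the HUD file and the given dataset (the CBSA codes 10740 and 42140 are the real Albuquerque and Santa Fe CBSAs; the square footages echo the preview above):

```python
import pandas as pd

# Hypothetical slice of the HUD ZIP-to-CBSA crosswalk
crosswalk = pd.DataFrame({
    'zip':  ['87505', '87124', '87114', '87111'],
    'cbsa': ['42140', '10740', '10740', '10740'],  # Santa Fe, Albuquerque
})

# Facility data keyed by zip code
facilities = pd.DataFrame({
    'zip':  ['87505', '87124', '87114', '87111', '87114'],
    'sqft': [73934, 72836, 80889, 62697, 60821],
})

# Attach the CBSA code to each facility, then aggregate square footage per CBSA
merged = facilities.merge(crosswalk, on='zip', how='left')
demand_by_cbsa = merged.groupby('cbsa')['sqft'].sum()
```

Facilities whose zip codes fall outside any CBSA would get a missing `cbsa` value after the left join and drop out of the groupby, which is the desired behavior when the analysis is restricted to metro and micro areas.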

# Demographic Data

The demographic data was collected from the American Community Survey (ACS), which contains detailed demographic information about housing, employment, and education. The following code snippet was used to download the data:

NOTE: Use of the Census API requires an API key

```python
import requests
import pandas as pd

# read the Census API key from a local file
with open('API_census.txt') as key_file:
    API_KEY = key_file.read().strip()

def get_acs5_2020_group(group):
    year_ = '2020'
    source_ = 'acs'
    name_ = 'acs5'      # 5-year estimates
    table_ = 'profile'  # one of: detailed, subject, profile
    base_ = f'https://api.census.gov/data/{year_}/{source_}/{name_}/{table_}'
    geog_type = 'metropolitan%20statistical%20area/micropolitan%20statistical%20area'
    resp = requests.get(f'{base_}?get=group({group})&for={geog_type}:*&key={API_KEY}')
    data = resp.json()
    acs_data = pd.DataFrame(data[1:], columns=data[0]).set_index(
        ' '.join(geog_type.split(sep='%20')))
    # only keep 'estimate' columns (ACS estimate variable codes end in 'E')
    return acs_data.filter(regex='E$')
```
