Data Retrieval
You can get the dataset directly from Kaggle or use kaggle-cli; both require a Kaggle login to download datasets. Use the unofficial Kaggle command line interface like so:
$ pip install kaggle-cli
$ kg download -u <username> -p <password> -c <competition>
For the data analysis, it is assumed the dataset bird-strikes.zip resides in ${pwd}/kaggle/.
pandas complains about mixed column types when processing the dataset in chunks (the default). With low_memory=False, the dataset is loaded all at once, which apparently helps pandas guess the proper column types.
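The extract helper isn't shown in this excerpt; here's a minimal sketch of what it plausibly does, assuming it unpacks every CSV inside the archive into a dict of DataFrames (the signature and the low_memory=False choice are my assumptions):

import zipfile
from pathlib import Path
import pandas as pd

def extract(archive: Path) -> dict:
    # Hypothetical sketch: read every CSV inside the zip archive into a
    # DataFrame, keyed by file name; low_memory=False as discussed above.
    with zipfile.ZipFile(archive) as zf:
        return {name: pd.read_csv(zf.open(name), low_memory=False)
                for name in zf.namelist() if name.endswith('.csv')}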
data = extract(Path() / 'kaggle' / 'bird-strikes.zip')['Bird Strikes Test.csv']
pd.options.display.max_columns = None
data.head(5)
data.info()
Which columns have missing or incomplete data? Note that this does not count the dataset's own chosen encodings of missing data for a given column.
data.isna().sum().sort_values().plot.barh(figsize=(10,12));
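pandas' isna only sees proper missing values; a quick complement is to tally one common sentinel spelling per column. UNKNOWN is one such sentinel in this dataset, but other spellings may exist:

# Count the dataset's own 'UNKNOWN' sentinel per column; this covers only
# one of possibly several sentinel spellings.
(data == 'UNKNOWN').sum().sort_values(ascending=False).head(10)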
Let's look for categorical values. We can assume that categorical data has way fewer unique values than a column's total values. It's trivial to automate that assumption.
For each categorical candidate, we'd like a tally of each possible category. This can reveal problematic categories, but it also indicates columns that can readily be used for groupby/pivot tables. I use pd.DataFrame as a result type mainly for improved readability, but also because it feels as if I were building my own database catalogue.
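Neither helper is defined in this excerpt. A minimal sketch of both, assuming findCategoricalCandidates flags columns whose unique-value count stays below some fraction of the row count (the 0.05 threshold is a guess) and categoricalTally returns one row per (column, category) pair:

def findCategoricalCandidates(df: pd.DataFrame, ratio: float = 0.05) -> pd.DataFrame:
    # Columns with few unique values relative to the row count are
    # categorical candidates; the threshold is an assumption.
    return pd.DataFrame([{'name': col, 'unique': df[col].nunique()}
                         for col in df.columns
                         if df[col].nunique() / len(df) < ratio])

def categoricalTally(df: pd.DataFrame, columns) -> pd.DataFrame:
    # One row per (column, category) pair with its occurrence count.
    return pd.DataFrame([{'column': col, 'category': cat, 'count': n}
                         for col in columns
                         for cat, n in df[col].value_counts().items()])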
categoricalTally(data, findCategoricalCandidates(data)['name'])
data.describe()
Only two non-object columns: that's not what we'd expect from this dataset. Let's fix the columns that were parsed as strings but should be numerical instead.
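Before coercing, it can help to peek at the values that actually fail to parse; a quick check (the column choice is arbitrary, and typical offenders are thousands separators or textual sentinels):

col = 'Cost: Total $'
bad = pd.to_numeric(data[col], errors='coerce').isna() & data[col].notna()
data.loc[bad, col].unique()[:10]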
for col in ['Speed (IAS) in knots', 'Feet above ground',
            'Cost: Aircraft time out of service (hours)', 'Cost: Other (inflation adj)',
            'Cost: Repair (inflation adj)', 'Cost: Total $']:
    data[col] = pd.to_numeric(data[col], errors='coerce')
data.describe()
The dataset uses its own values to fill in for missing data. It can be useful to fold those into n/a values so that pandas knows how to deal with them, i.e., exclude them from groupby/crosstab.
data.loc[data['Airport: Name'] == 'UNKNOWN', 'Airport: Name'] = None  # target only this column; data[mask] = None would blank entire rows
data['Airport: Name'].isna().sum()
We don't want the FAA's own record id to be accidentally used for correlation analysis, so we exclude the column by making it the dataset's index. The much easier fix would be to drop the column but that just feels wrong.
There are a couple of missing record indices, so we'd expect pd.DataFrame.set_index to fail, but pandas doesn't require the index to be unique. This can become problematic once you try to use data exchange formats such as feather. We fill up the missing indices with negative values; that way, we can still differentiate between original and synthetic record indices.
def negativeCounter(start: int):
    count = start
    while True:
        yield count
        count -= 1

nc = negativeCounter(-1)
if data.index.name != 'Record ID':
    data.set_index('Record ID', inplace=True)
    data.index = data.index.map(lambda idx: next(nc) if pd.isna(idx) else int(idx))
data.index
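As an aside, negativeCounter is equivalent to a stock generator from the standard library:

import itertools
nc = itertools.count(-1, -1)  # yields -1, -2, -3, ...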
Let's also fix the FlightDate column:
data['FlightDate'] = pd.to_datetime(data['FlightDate'], infer_datetime_format=True)
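infer_datetime_format makes pandas guess the layout; if you know it up front, an explicit format string is faster and stricter. The format below is an unverified assumption about this dataset:

# Hypothetical explicit parse; adjust the format to the column's actual layout.
data['FlightDate'] = pd.to_datetime(data['FlightDate'], format='%m/%d/%Y %H:%M', errors='coerce')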
As a first approach, a correlation matrix isn't going to produce much insight, simply because categorical columns are ignored in this analysis.
sns.heatmap(data.corr(), cmap='coolwarm', annot=True, vmin=-1, vmax=1);
sns.pairplot(data);
We want to know what contributes to an increased risk of bird strikes, and we have useful categorical data; mainly, we look for clustering in our data. After-the-fact data such as Cost: Aircraft time out of service (hours) can safely be ignored. Effect: {Impact to flight, Indicated Damage} can be seen as our target variables, to a degree. We can imagine multiple factors influencing the result:
- regionality: wildlife prefers some regions over others,
- aircraft make/model: a Cessna is probably more affected by bird strikes than a Boeing 747,
- number of engines: purely from a redundancy standpoint, more engines are likely better,
- altitude: birds fly at certain heights,
- altitude: aircraft could be more vulnerable at certain heights, with lower altitudes probably being riskier,
- time of day: wildlife has behavioural patterns,
- airline operators: some airline operators might service their aircraft better than others, or use more reliable models,
- ...
Not all of the above-mentioned factors increase the risk of a bird strike, however. In that sense, we can use the effect columns as indicators, but not as the final result. The FAA also compiled an FAQ that could help us understand the data, and the FAA's form for reporting bird strikes can give clues as to why some columns might contain bad or unreliable data.
data[['When: Time of day', 'When: Time (HHMM)']].isna().sum() / len(data)
We can attempt to fix missing time-of-day data by using the airport's distribution of flights over the day and applying that distribution to the missing data. Not every record carries an airport name, so this won't fix all records with respect to time of day. We exploit how pd.qcut distributes records among its categories: once we know the probabilities for a given airport, we can select one of the When: cat columns in our crosstab and generate a When: Time (HHMM) entry.
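The timeOfDay helper used below isn't defined in this excerpt; a sketch of what it plausibly does, mapping an HHMM interval from pd.qcut onto the dataset's time-of-day labels. The label set and hour boundaries are assumptions:

def timeOfDay(interval) -> str:
    # Interval bounds are HHMM values; classify by the midpoint's hour.
    # Both the labels and the cut-off hours are assumptions.
    hour = interval.mid // 100
    if 5 <= hour < 8:
        return 'Dawn'
    if 8 <= hour < 18:
        return 'Day'
    if 18 <= hour < 21:
        return 'Dusk'
    return 'Night'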
The next code cell takes a while to compute: it's at least three nested loops, with quite a few memory allocations!
data['When: cat'] = pd.qcut(data['When: Time (HHMM)'], 24)
data_tod = data.copy()
tod_ct = pd.crosstab(data_tod['Airport: Name'], data_tod['When: cat'], margins=True)
tod_na = data_tod[data_tod['When: Time (HHMM)'].isna()]
for idx in tod_ct[:-1].index:  # skip the 'All' margin row
    row = tod_ct.loc[idx]
    total = row['All']
    prob = [i / total for i in row[:-1]]  # per-airport distribution over the 24 bins
    to_fix = tod_na.loc[tod_na['Airport: Name'] == idx]
    for idx_to_fix in to_fix.index:
        tod_interval = np.random.choice(tod_ct.columns[:-1], p=prob)
        mt_hours = np.random.randint(tod_interval.left, tod_interval.right) // 100 * 100
        mt_minutes = np.random.randint(tod_interval.left, tod_interval.right) % 60  # only an approximation, not the correct way to do it
        data_tod.loc[idx_to_fix, ['When: Time (HHMM)', 'When: cat', 'When: Time of day']] = (
            mt_hours + mt_minutes, tod_interval, timeOfDay(tod_interval))
data_tod[['When: Time of day', 'When: Time (HHMM)', 'When: cat']].isna().sum() / len(data)
Aircraft: Make/Model → UNKNOWN

UNKNOWN aircraft at unknown altitude can be missing data, but the value probably also covers military aircraft. Let's set those UNKNOWN entries that are near known military bases, or that mention a military operator, to 'MIL. AIRCRAFT'. The remaining UNKNOWNs we can set to None.
pred_mil_ac = (data_tod['Aircraft: Make/Model'] == 'UNKNOWN') & (
    data_tod['Aircraft: Airline/Operator'] == 'MILITARY')
pred_mil_ap = (data_tod['Aircraft: Make/Model'] == 'UNKNOWN') & (
    data_tod['Airport: Name'].isin(['DENVER INTL AIRPORT', 'DALLAS/FORT WORTH INTL ARPT']))
data_tod.loc[pred_mil_ac, 'Aircraft: Make/Model'] = 'MIL. AIRCRAFT'
data_tod.loc[pred_mil_ap, 'Aircraft: Make/Model'] = 'MIL. AIRCRAFT'
data_tod.loc[data_tod['Aircraft: Make/Model'] == 'UNKNOWN', 'Aircraft: Make/Model'] = None
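The save helper isn't shown here either; a minimal sketch, assuming a feather file next to the Kaggle archive. Feather (mentioned above) rejects a non-default index, hence the reset_index:

def save(df: pd.DataFrame, name: str, folder: Path = Path() / 'kaggle') -> None:
    # Hypothetical sketch: feather requires a default RangeIndex, so move
    # 'Record ID' back into a regular column before writing.
    df.reset_index().to_feather(folder / f'{name}.feather')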
save(data_tod, 'bird-strikes')
The analysis & visualization follows in the next notebook.