Data cleanup, imputation, feature engineering and visualization.

Data Retrieval

You can get the dataset directly from Kaggle or use kaggle-cli. Both require a Kaggle login to download datasets. Use the inofficial Kaggle command line interface like so:

$ pip install kaggle-cli
$ kg download -u <username> -p <password> -c <competition>

For the data analysis, it is assumed the dataset bird-strikes.zip resides in ${pwd}/kaggle/.

Load Data

pandas complains about mixed column types when processing the dataset in chunks (default). With low_memory=False the dataset is loaded all at once, apparently helping with guessing the proper column type.

extract[source]

extract(fn:Path)

data = extract(Path() / 'kaggle' / 'bird-strikes.zip')['Bird Strikes Test.csv']
pd.options.display.max_columns = None
data.head(5)
Airport: Name Altitude bin Aircraft: Make/Model Effect: Impact to flight FlightDate Record ID Effect: Indicated Damage Aircraft: Number of engines? Aircraft: Airline/Operator Origin State When: Phase of flight Wildlife: Size Wildlife: Species When: Time (HHMM) When: Time of day Pilot warned of birds or wildlife? Cost: Aircraft time out of service (hours) Cost: Other (inflation adj) Cost: Repair (inflation adj) Cost: Total $ Miles from airport Feet above ground Speed (IAS) in knots
0 DETROIT METRO WAYNE COUNTY ARPT < 1000 ft A-319 NaN 1/1/2005 0:00 227413 No damage 2 NORTHWEST AIRLINES Michigan Climb Medium Unknown bird - medium NaN NaN NaN NaN 0 0 0 NaN 400 NaN
1 DANE COUNTY REGIONAL ARPT-TRUAX FLD < 1000 ft A-320 NaN 1/1/2005 0:00 226578 No damage 2 NORTHWEST AIRLINES Wisconsin Take-off run Medium Unknown bird - medium NaN NaN NaN NaN 0 0 0 0 0 NaN
2 HOGUE ARPT < 1000 ft PA-46 MALIBU NaN 1/1/2005 0:00 268383 Caused damage 1 PRIVATELY OWNED Tennessee Landing Roll Large White-tailed deer NaN Day NaN NaN 0 0 0 0 0 20
3 SACRAMENTO INTL < 1000 ft B-737-300 None 1/1/2005 0:00 229514 Caused damage 2 SOUTHWEST AIRLINES California Approach Large Greater white-fronted goose 2140.0 Night Y NaN 0 117,135 117,135 NaN 500 142
4 GUARULHOS INTL < 1000 ft B-767-300 NaN 1/1/2005 0:00 225661 No damage 2 UNITED AIRLINES NaN Landing Roll Medium Unknown bird - medium NaN NaN NaN NaN 0 0 0 0 0 NaN

Data Summary

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65610 entries, 0 to 65609
Data columns (total 23 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   Airport: Name                               65601 non-null  object 
 1   Altitude bin                                65610 non-null  object 
 2   Aircraft: Make/Model                        65610 non-null  object 
 3   Effect: Impact to flight                    36516 non-null  object 
 4   FlightDate                                  65610 non-null  object 
 5   Record ID                                   65610 non-null  int64  
 6   Effect: Indicated Damage                    65610 non-null  object 
 7   Aircraft: Number of engines?                42766 non-null  object 
 8   Aircraft: Airline/Operator                  65610 non-null  object 
 9   Origin State                                59163 non-null  object 
 10  When: Phase of flight                       41051 non-null  object 
 11  Wildlife: Size                              56684 non-null  object 
 12  Wildlife: Species                           65593 non-null  object 
 13  When: Time (HHMM)                           36636 non-null  float64
 14  When: Time of day                           41858 non-null  object 
 15  Pilot warned of birds or wildlife?          28361 non-null  object 
 16  Cost: Aircraft time out of service (hours)  4363 non-null   object 
 17  Cost: Other (inflation adj)                 65610 non-null  object 
 18  Cost: Repair (inflation adj)                65610 non-null  object 
 19  Cost: Total $                               65610 non-null  object 
 20  Miles from airport                          37867 non-null  object 
 21  Feet above ground                           35069 non-null  object 
 22  Speed (IAS) in knots                        25315 non-null  object 
dtypes: float64(1), int64(1), object(21)
memory usage: 11.5+ MB

Which columns are missing data/incomplete? Note that this does not include the dataset's own chosen encoding of missing data for a given column.

data.isna().sum().sort_values().plot.barh(figsize=(10,12));

Categorical Data

Let's look for categorical values. We can assume that categorical data has way fewer unique values than a column's total values. It's trivial to automate that assumption.

findCategoricalCandidates[source]

findCategoricalCandidates(data:DataFrame, ratio:float=0.01)

For each categorical candidate, we'd like to have a tally for each possible category. This can show us problematic categories but also indicates columns that can be readily used for groupby/pivot tables. I use pd.DataFrame as a result type mainly for improved readability but also because it feels as if I was building my own database catalogue.

categoricalTally[source]

categoricalTally(data:DataFrame, cat_candidates:list, limit:int=10)

categoricalTally(data, findCategoricalCandidates(data)['name'])
column cat00 cat01 cat02 cat03 cat04 cat05 cat06 cat07 cat08 cat09
0 Effect: Indicated Damage (No damage, 61535) (Caused damage, 4075) None None None None None None None None
1 Altitude bin (< 1000 ft, 27523) (Unknown, 30541) (> 1000 ft, 7546) None None None None None None None
2 Pilot warned of birds or wildlife? (nan, 0) (Y, 13100) (N, 15261) None None None None None None None
3 Wildlife: Size (Medium, 18064) (Large, 3416) (Small, 35204) (nan, 0) None None None None None None
4 When: Time of day (nan, 0) (Day, 25269) (Night, 13623) (Dusk, 1759) (Dawn, 1207) None None None None None
5 Effect: Impact to flight (nan, 0) (None, 33247) (Aborted Take-off, 594) (Precautionary Landing, 1767) (Other, 812) (Engine Shut Down, 96) None None None None
6 Aircraft: Number of engines? (2, 37127) (1, 2114) (4, 1068) (nan, 0) (3, 2456) (S, 1) None None None None
7 When: Phase of flight (Climb, 6317) (Take-off run, 7239) (Landing Roll, 7304) (Approach, 16914) (nan, 0) (Taxi, 126) (Descent, 1261) (En Route, 1533) (Parked, 42) (Landing, 315)
8 Miles from airport (nan, 0) (0, 34610) (12, 77) (13, 13) (20, 172) (3, 341) (8, 127) (5, 394) (50, 10) (7, 121)
9 Origin State (Michigan, 1373) (Wisconsin, 695) (Tennessee, 2097) (California, 5278) (nan, 0) (Prince Edward Island, 97) (Ohio, 2297) (Florida, 3634) (Louisiana, 1409) (Georgia, 1083)
10 Cost: Aircraft time out of service (hours) (nan, 0) (1, 1056) (0, 1348) (504, 19) (24, 139) (113, 1) (162, 1) (288, 3) (4, 103) (2, 335)
11 Speed (IAS) in knots (nan, 0) (20, 22) (142, 75) (100, 1469) (15, 8) (180, 1006) (135, 1351) (65, 109) (120, 1817) (220, 313)
12 Aircraft: Airline/Operator (NORTHWEST AIRLINES, 868) (PRIVATELY OWNED, 547) (SOUTHWEST AIRLINES, 5995) (UNITED AIRLINES, 2805) (GOVERNMENT, 357) (UNKNOWN, 16342) (ASTAR AIR CARGO, 87) (DELTA AIR LINES, 1573) (ATLANTIC SOUTHEAST, 496) (COMAIR AIRLINES, 340)
13 Feet above ground (400, 551) (0, 14184) (500, 1260) (50, 1485) (nan, 0) (100, 1639) (1,500, 759) (300, 895) (75, 137) (1,850, 6)
14 Aircraft: Make/Model (A-319, 1967) (A-320, 2825) (PA-46 MALIBU, 20) (B-737-300, 2912) (B-767-300, 907) (LOCKHEED C-130, 84) (B-757-200, 2363) (UNKNOWN, 19540) (A-300, 1748) (CL-RJ100/200, 3133)
15 Cost: Other (inflation adj) (0, 64576) (53, 20) (5,857, 3) (82,727, 1) (35,141, 1) (29,284, 3) (2,343, 7) (1,171, 4) (545, 1) (122,992, 1)
16 Wildlife: Species (Unknown bird - medium, 11513) (White-tailed deer, 198) (Greater white-fronted goose, 25) (Laysan albatross, 19) (Unknown bird - small, 12645) (Herring gull, 511) (Ring-billed gull, 624) (Mourning dove, 3265) (Egrets, 74) (Unknown bird - large, 957)

Data Repair & Cleanup

data.describe()
Record ID When: Time (HHMM)
count 65610.000000 36636.000000
mean 246208.681116 1389.074271
std 80226.508941 613.410508
min 9005.000000 0.000000
25% 240765.250000 915.000000
50% 257226.500000 1410.000000
75% 303635.750000 1930.000000
max 322935.000000 2359.000000

Only two non-object columns is not what we expected from this dataset. Let's fix the columns that were parsed as strings but should have been numerical instead.

for col in ['Speed (IAS) in knots', 'Feet above ground', 'Cost: Other (inflation adj)',
            'Cost: Aircraft time out of service (hours)', 'Cost: Other (inflation adj)',
            'Cost: Repair (inflation adj)', 'Cost: Total $']:
    data[col] = pd.to_numeric(data[col], errors='coerce')
data.describe()
Record ID When: Time (HHMM) Cost: Aircraft time out of service (hours) Cost: Other (inflation adj) Cost: Repair (inflation adj) Cost: Total $ Feet above ground Speed (IAS) in knots
count 65610.000000 36636.000000 4294.000000 65063.000000 64258.000000 64081.000000 26351.000000 25296.000000
mean 246208.681116 1389.074271 30.902422 2.069686 1.268745 2.786192 106.772912 144.463631
std 80226.508941 613.410508 102.120082 29.903886 26.112340 35.504005 199.300885 47.737767
min 9005.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 240765.250000 915.000000 0.000000 0.000000 0.000000 0.000000 0.000000 120.000000
50% 257226.500000 1410.000000 1.000000 0.000000 0.000000 0.000000 0.000000 140.000000
75% 303635.750000 1930.000000 10.000000 0.000000 0.000000 0.000000 100.000000 160.000000
max 322935.000000 2359.000000 984.000000 998.000000 984.000000 984.000000 995.000000 500.000000

The dataset uses its own values to fill in for missing data. It can be useful to just fold those into n/a values, so that Pandas knows how to deal with them, i.e, exlude them from groupby/crosstab.

data[data['Airport: Name'] == 'UNKNOWN'] = None
data['Airport: Name'].isna().sum()
5998

We don't want the FAA's own record id to be accidentally used for correlation analysis, so we exclude the column by making it the dataset's index. The much easier fix would be to drop the column but that just feels wrong.

There are a couple missing record indices, so we'd expect pd.DataFrame.set_index to fail but pandas doesn't require the index to be unique. This can become problematic once you try to use data exchange formats such as feather. We fill up the missing indices with negative values. That way, we can still differentiate between original and synthetic record indices.

def negativeCounter(start:int):
    count = start
    while True:
        yield count
        count -= 1

nc = negativeCounter(-1)

if data.index.name != 'Record ID':
    data.set_index('Record ID', inplace=True)
    data.index = data.index.map(lambda idx: next(nc) if pd.isna(idx) else int(idx))

data.index
Int64Index([227413, 226578, 268383, 229514, 225661, 230694, 234288, 231042,
                -1, 233011,
            ...
             -5987,  -5988,  -5989, 319982, 319981, 319591, 319983, 319593,
            319589, 319594],
           dtype='int64', name='Record ID', length=65610)

Let's also fix the FlightDate:

data['FlightDate'] = pd.to_datetime(data['FlightDate'], infer_datetime_format=True)

Correlation

As a first approach, a correlation matrix isn't going to produce much insight, simply because categorical columns are ignored in this analysis.

sns.heatmap(data.corr(), cmap='coolwarm', annot=True, vmin=-1, vmax=1);
sns.pairplot(data);

Preliminary Analysis

We want to know what contributes to an increased risk of bird strikes. We have useful categorical data. Mainly we look for clustering in our data. After-the-fact data such as Cost: Aircraft time out of service (hours) can be safely ignored. Effect: {Impact to flight, Indicated Damage} can be seen as our target variables, to a degree. We can imagine multiple factors influencing the result:

  • regionality: wildlife prefers some region over others,
  • aircraft make/model: A Cesna is probably more affected by bird strikes than a Boeing 747,
  • number of engines: Purely from a redundancy standpoint, more engines is likely to be better,
  • altitude: birds fly at certain heights,
  • altitude: aircraft could be more vulnerable at certain heights, with lower altitudes probably being riskier,
  • time of day: wildlife has behavioural patterns,
  • airline operators: some airline operators might service their aircrafts better than others, or use more reliable models,
  • ...

Not all of the above mentioned factors increase the risk of a bird strike, however. In that sense, we can use the effect columns as indicators but not as final result. Also, the FAA compiled a FAQ that could help us understand the data. The FAA's form to report bird strikes can give us clues as to why columns might have bad or unreliable data.

Data Imputation & Feature Engineering

Time of Day

Querying by time/time of day is problematic due to the high percentage of missing data.

data[['When: Time of day', 'When: Time (HHMM)']].isna().sum() / len(data)
When: Time of day    0.396037
When: Time (HHMM)    0.457186
dtype: float64

We can attempt to fix missing time of day data by using the airport's distribution of flights over the day and apply that distribution to the missing data. Not every record carries an airport name, so this won't fix all records wrt. time of day. We exploit how pd.qcut distributes records among its categories, meaning that once we know our probabilities for a given airport, we can select one of the When: cat columns in our crosstable and generate a When: Time (HHMM) entry.

timeOfDay[source]

timeOfDay(time_iv:Interval)

The next code cell takes a while to compute, it's at least 3 nested loops, with quite a few memory allocations!

data['When: cat'] = pd.qcut(data['When: Time (HHMM)'], 24)
data_tod = data.copy()
tod_ct = pd.crosstab(data_tod['Airport: Name'], data_tod['When: cat'], margins=True)
tod_na = data_tod[data_tod['When: Time (HHMM)'].isna()]

for idx in tod_ct[:-1].index:
    row = tod_ct.loc[idx]
    total = row['All']
    prob = [i / total for i in row[:-1]]
    to_fix = tod_na.loc[tod_na['Airport: Name'] == idx]
    
    for idx_to_fix in to_fix.index:
        tod_interval = np.random.choice(tod_ct.columns[:-1], p=prob)
        mt_hours = np.random.randint(tod_interval.left, tod_interval.right) // 100 * 100
        mt_minutes = np.random.randint(tod_interval.left, tod_interval.right) % 60 # only an approximation, not the correct way to do it
        data_tod.loc[idx_to_fix, ['When: Time (HHMM)', 'When: cat', 'When: Time of day']] = (mt_hours + mt_minutes, tod_interval, timeOfDay(tod_interval))
data_tod[['When: Time of day', 'When: Time (HHMM)', 'When: cat']].isna().sum() / len(data)
When: Time of day    0.109770
When: Time (HHMM)    0.094041
When: cat            0.094041
dtype: float64

Aircraft: Make/Model → UNKNOWN

UNKNOWN aircraft at unknown altitude can be missing data but probably also covers military aircrafts. Let's fix those UNKNOWN near to known military bases or mentioning a military operator to 'MIL. AIRCRAFT'. The remaining UNKNOWNs we can set to None.

pred_mil_ac = (data_tod['Aircraft: Make/Model'] == 'UNKNOWN') & (
    data_tod['Aircraft: Airline/Operator'] == 'MILITARY')
pred_mil_ap = (data_tod['Aircraft: Make/Model'] == 'UNKNOWN') & (
    data_tod['Airport: Name'].isin(['DENVER INTL AIRPORT', 'DALLAS/FORT WORTH INTL ARPT']))
data_tod.loc[pred_mil_ac, 'Aircraft: Make/Model'] = 'MIL. AIRCRAFT'
data_tod.loc[pred_mil_ap, 'Aircraft: Make/Model'] = 'MIL. AIRCRAFT'
data_tod.loc[data_tod['Aircraft: Make/Model'] == 'UNKNOWN', 'Aircraft: Make/Model']  = None

Save/Load functionality

Let's save the current state of our model so we can continue from here in a new notebook. Pickle was the only export format that worked properly; both feather and parquet tripped over trivial problems with the dataframe.

filename[source]

filename(name:str)

save[source]

save(data:DataFrame, name:str)

load[source]

load(name:str)

save(data_tod, 'bird-strikes')

The analysis & visualization follows in the next notebook.