data = eda.load('bird-strikes')
Earlier we tried to look into correlations but failed because too many columns were categorical. We can cast those columns to pd.Categorical and then access their automatically generated numeric codes, so that a numerical analysis such as pd.DataFrame.corr() produces usable results.
data_cc = makeCatCodes(data.copy(), eda.findCategoricalCandidates(data)['name'])
fig, ax = plt.subplots(figsize=(14,14))
sns.heatmap(data_cc.corr(), vmin=-1, vmax=1, cmap='coolwarm', annot=True, ax=ax);
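The helper makeCatCodes is not shown in this section. The following is a minimal sketch of what such a helper might look like, under the assumption that it simply swaps each listed column for its category codes. The function name make_cat_codes and the toy columns are hypothetical.

```python
import pandas as pd

def make_cat_codes(df, columns):
    """Return a copy of df with the listed columns replaced by integer category codes."""
    out = df.copy()
    for col in columns:
        # Codes run from 0 to n-1 in sorted category order; missing values become -1.
        out[col] = out[col].astype('category').cat.codes
    return out

# Toy data standing in for the bird strike table (values are made up).
toy = pd.DataFrame({
    'size': ['Small', 'Large', 'Small', 'Medium'],
    'damage': ['No damage', 'Caused damage', 'No damage', 'No damage'],
    'altitude': [100, 2500, 300, 800],
})

toy_cc = make_cat_codes(toy, ['size', 'damage'])
corr = toy_cc.corr()  # every column is numeric now, so corr() covers all of them
print(corr)
```

Keep in mind that the codes of an unordered categorical follow the sorted category labels, so the sign and size of a correlation on such codes is only a hint about where to look, not a measurement.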
Several clusters can be identified directly:
{'Effect: Indicated Damage', 'Altitude bin', 'Wildlife: Size', 'When: Time of day'}²
{'Aircraft: Airline/Operator', 'Aircraft: Make/Model', 'Wildlife: Species'} x {'Effect: Indicated Damage', 'Altitude bin', 'Wildlife: Size'}
{'When: Time of day', 'Effect: Impact to flight', 'Aircraft: Number of engines?'}²
{'When: Time of day', 'Effect: Impact to flight', 'Aircraft: Number of engines?'} x {'Speed (IAS) in knots', 'Aircraft: Make/Model', 'Wildlife: Species'}
Effect: {Impact to flight, Indicated Damage} are related, yet one is missing data for nearly half the records, whereas the other is mostly complete. Instead of attempting to fix Effect: Impact to flight, we will stick to Effect: Indicated Damage.
data['Effect: Impact to flight'].isna().sum() / len(data), data['Effect: Indicated Damage'].isna().sum() / len(data)
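The share of missing values can also be computed for several columns at once, since the mean of a boolean mask is the fraction of True entries. A small sketch on made-up data (the cell values are placeholders, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with the two column names used above; values are invented.
toy = pd.DataFrame({
    'Effect: Impact to flight': ['None', np.nan, np.nan, 'Other'],
    'Effect: Indicated Damage': ['No damage', 'No damage', 'Caused damage', 'No damage'],
})

# isna() yields booleans; their column-wise mean is the missing share per column.
missing_share = toy.isna().mean()
print(missing_share)
```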
Airport: Name → Effect: Indicated Damage
Notice that we use crosstabs to get margins on each aggregated sub-table. Margins show up as an additional All column. This simplifies sorting and makes for easier "top x"-style queries.
data_ct = pd.crosstab([data['Airport: Name']], data['Effect: Indicated Damage'], margins=True)
data_ct.sort_values(by='All', ascending=False)[1:].head(5)
We limit the result to the top 20 records via dataframe slicing like so: some_df[-21:-1]. The slice skips the last record because, in a dataframe sorted in ascending order, that row holds the totals (margins) for each column.
data_ct.sort_values(by='All', ascending=True)[-21:-1].plot.barh(figsize=(10,10));
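The positional slices work because the All row always sorts to one end. An alternative that does not depend on position is to drop the margin row by label before sorting. This is a sketch on toy data, with hypothetical column names airport and damage:

```python
import pandas as pd

# Toy reports: one row per bird strike (values are made up).
toy = pd.DataFrame({
    'airport': ['A', 'A', 'B', 'A', 'C', 'B'],
    'damage': ['No damage', 'Caused damage', 'No damage',
               'No damage', 'No damage', 'No damage'],
})

# margins=True appends an 'All' row and an 'All' column holding the totals.
ct = pd.crosstab(toy['airport'], toy['damage'], margins=True)

# Drop the totals row by label, then sort by the totals column for a "top 2" query.
top = ct.drop(index='All').sort_values(by='All', ascending=False).head(2)
print(top)
```

Dropping by label keeps the query correct even if the sort order is changed later.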
We can see that certain airports among the top 20, such as Denver INTL Airport or Dallas/Fort Worth INTL ARPT, report bird strikes at a much higher rate. A quick check reveals those airports to be near military (air) bases.
The majority of reported bird strikes cause no damage to the aircraft.
Aircraft: Airline/Operator, Airport: Name → Effect: Indicated Damage
data_op_ct = pd.crosstab([data['Aircraft: Airline/Operator'], data['Airport: Name']], data['Effect: Indicated Damage'], margins=True)
data_op_ct.sort_values(by='All', ascending=True)[-21:-1].plot.barh(figsize=(8,10));
Southwest Airlines shows up often, with FedEx Express at Memphis International taking the 2nd spot and UPS Airlines at Louisville INTL ARPT taking the 4th spot. Dallas/Fort Worth INTL ARPT and Denver INTL Airport are dominated by 'unknown'. FedEx Express and UPS Airlines are air freight carriers, and Southwest Airlines runs a domestic air freight branch. FedEx Express also accounts for the majority of bird strike reports at Memphis International (86.7%):
data_op_ct.loc['FEDEX EXPRESS', 'MEMPHIS INTL']['All'] / data_ct.loc['MEMPHIS INTL']['All']
UPS Airlines dominates bird strike reports at Louisville International Airport (76.4%):
data_op_ct.loc['UPS AIRLINES', 'LOUISVILLE INTL ARPT']['All'] / data_ct.loc['LOUISVILLE INTL ARPT']['All']
Let's check if we find other airports where bird strike reports are dominated by a single airline operator.
operators = []
airports = []
ratios = []
totals = []
bounds = (.1, 1)
min_reports = 100
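# Walk over every (operator, airport) pair; the last index entry is the 'All' margin row.
for idx in data_op_ct.index[:-1]:
    op = idx[0]
    ap = idx[1]
    # Total number of reports at this airport, across all operators.
    t = data_ct.loc[ap]['All']
    # Share of this operator in the airport's reports; max() guards against division by zero.
    r = data_op_ct.loc[idx]['All'] / max(1, t)
    if r >= bounds[0] and r < bounds[1] and t >= min_reports:
        operators.append(op)
        airports.append(ap)
        ratios.append(r)
        totals.append(t)
op_ap_r = pd.DataFrame({'operator': operators,
                        'airports': airports,
                        'ratio': ratios,
                        'totals': totals})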
op_ap_r.sort_values(by='ratio', ascending=False).head(10)
Military bases obviously dominate bird strike reports at their respective airport, so let's filter them out.
op_ap_r[op_ap_r['operator'] != 'MILITARY'].sort_values(by='ratio', ascending=False).head(10)
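The same operator share per airport can be computed without an explicit loop. The sketch below is an alternative formulation on toy data, not the code used above. The column names, the operator and airport codes, and the thresholds are placeholders.

```python
import pandas as pd

# Toy reports: one row per bird strike (operators and airports are made up).
toy = pd.DataFrame({
    'operator': ['FEDEX', 'FEDEX', 'FEDEX', 'UPS', 'SWA', 'SWA'],
    'airport':  ['MEM',   'MEM',   'MEM',   'MEM', 'DAL', 'DAL'],
})

# Number of reports per (operator, airport) pair.
counts = toy.groupby(['operator', 'airport']).size().reset_index(name='reports')

# Total reports per airport, broadcast back onto every pair of that airport.
counts['total'] = counts.groupby('airport')['reports'].transform('sum')
counts['ratio'] = counts['reports'] / counts['total']

# Keep pairs where one operator holds at least half of a sufficiently busy airport.
min_reports = 3
dominant = counts[(counts['ratio'] >= 0.5) & (counts['total'] >= min_reports)]
print(dominant.sort_values(by='ratio', ascending=False))
```

Working from the raw rows also avoids any bookkeeping around the margin row.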
Memphis INTL and Louisville INTL ARPT look like air freight hubs (see operators). If air freight operations mostly take place at night, their unusual flight schedule could carry a higher risk of bird strikes, too.
airfreight_pred = data['Aircraft: Airline/Operator'].isin(['FEDEX EXPRESS', 'UPS AIRLINES'])
data.groupby('When: cat').count()['When: Time (HHMM)'].plot.barh(figsize=(12,12))
data[airfreight_pred].groupby('When: cat').count()['When: Time (HHMM)'].plot.barh(color='orange');
The time of day categories generated by pd.qcut are often close to one hour wide, apart from the late night/early hours categories. This indicates that flight schedules are mostly stable from noon to midnight (in terms of frequency).
We can assume that airfreight carriers mostly operate during night time, which is why for them the risk of bird strikes is naturally higher in those hours.
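To illustrate why bin width says something about frequency: pd.qcut builds bins that each hold roughly the same number of records, so a narrow bin marks a busy stretch of the day and a wide bin a quiet one. The sketch below uses made-up HHMM values and four bins. The parameters used to build When: cat are not shown in this section, so q=4 is an assumption.

```python
import pandas as pd

# Made-up report times in HHMM notation.
hhmm = pd.Series([30, 615, 640, 705, 730, 815, 900, 1130, 1745, 1810, 1900, 2230])

# Four quantile-based bins: each one receives the same number of reports.
bins = pd.qcut(hhmm, q=4)

# Width of each bin in HHMM units; narrow bins correspond to dense periods.
widths = bins.cat.categories.right - bins.cat.categories.left

print(bins.value_counts(sort=False))
print(list(widths))
```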
Wildlife: Size, Altitude bin → Effect: Indicated Damage
pd.crosstab([data['Wildlife: Size'],
data['Altitude bin']],
data['Effect: Indicated Damage'],
margins=True).sort_values(by='All', ascending=True)[:-1].plot.barh(figsize=(10, 4));
Bird strikes seem to happen with much higher frequency at lower altitudes. Smaller birds cause more reports! Large birds might be rarer, or perhaps it's the nature of smaller birds to fly in large flocks.
When: Time (HHMM) → Effect: Indicated Damage
Notice that we fold When: Time (HHMM) into hours for improved clarity. The column uses military (24-hour) time.
pd.crosstab(data['When: Time (HHMM)'] // 100 * 100,
data['Effect: Indicated Damage'],
margins=True)[:-1].plot.bar(figsize=(12, 8));
We see a strong daily increase in bird strike reports from 0600 until 1100, then a decline around noon and in the early afternoon (1200-1600), another rise in the evening and late night (1700-2300), and a sharp drop during the night hours (0000-0500).
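The folding step relies on integer division: dividing an HHMM value by 100 drops the minutes, and multiplying by 100 restores the HHMM scale. A tiny sketch with made-up times:

```python
import pandas as pd

# Made-up times in 24-hour HHMM notation (5 means 00:05).
hhmm = pd.Series([5, 659, 700, 1230, 2359])

# Floor each value to the start of its hour, keeping the HHMM scale.
hour_bucket = hhmm // 100 * 100

# Plain hour of day (0-23), if a compact axis label is preferred.
hour = hhmm // 100

print(hour_bucket.tolist())
print(hour.tolist())
```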
When: Time of day, Aircraft: Number of engines? → Effect: Indicated Damage
pd.crosstab([data['When: Time of day'],
data['Aircraft: Number of engines?']],
data['Effect: Indicated Damage'],
margins=True).sort_values(by='All', ascending=True)[:-1].plot.barh(figsize=(12, 8));
What shows up as a strong correlation might just be a result of frequency: twin-engine aircraft (read: almost all commercial airliners and modern military jets) account for most flights during the day.
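One way to separate sheer frequency from an actual difference in damage risk is to normalize each row of the crosstab, so that every group is shown as shares instead of counts. This is a sketch on toy data with hypothetical column names, not part of the analysis above:

```python
import pandas as pd

# Toy data: twin-engine aircraft are far more common than four-engine ones.
toy = pd.DataFrame({
    'engines': [2, 2, 2, 2, 2, 2, 4, 4],
    'damage': ['No damage'] * 5 + ['Caused damage'] + ['No damage', 'Caused damage'],
})

# Raw counts are dominated by the most frequent group.
counts = pd.crosstab(toy['engines'], toy['damage'])

# normalize='index' rescales each row to sum to 1, giving per-group damage shares.
rates = pd.crosstab(toy['engines'], toy['damage'], normalize='index')

print(counts)
print(rates)
```

With counts alone the twin-engine group looks dominant in every column. The shares make it visible whether its damage rate actually differs.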
Aircraft: Make/Model → Effect: Indicated Damage
pd.crosstab(data['Aircraft: Make/Model'],
data['Effect: Indicated Damage'],
margins=True).sort_values(by='All', ascending=True)[-21:-1].plot.barh(figsize=(12, 8));
Aircraft models favored by (commercial) operators fly more often and therefore naturally report more bird strikes. Military aircraft report the majority of bird strikes.
It's difficult to see whether the size of an aircraft, such as the massive Boeing 737 variants as compared to a much smaller model such as the DHC8 DASH 8, has an effect. Type of engine, such as turboprop vs jet engines, which is a distinctive feature among aircraft models, could also influence this figure.
Speed (IAS) in knots, When: Phase of flight → Effect: Indicated Damage
pd.crosstab(data['Speed (IAS) in knots'],
data['Effect: Indicated Damage'],
margins=True).sort_values(by='All', ascending=True)[-21:-1].plot.bar(figsize=(12, 8));
The majority of bird strike reports fall into a speed interval of roughly 100-180 knots.
pd.crosstab(data['Speed (IAS) in knots'],
data['When: Phase of flight'],
margins=True).sort_values(by='All', ascending=True)[-11:-1].plot.bar(figsize=(12, 8));
Most bird strikes seemingly happen during approach, at a speed (IAS) between 120 and 140 knots. The climb phase (at slightly higher speeds) and descent (at approx. twice the speed) also show an increase in bird strike reports.
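Crosstabs over raw speed values produce one row per distinct speed. Grouping speeds into intervals first can make the relation to the phase of flight easier to read. The sketch below uses pd.cut on made-up data. The bin edges, labels, and column names are arbitrary choices for illustration.

```python
import pandas as pd

# Toy reports with speed in knots and phase of flight (values are made up).
toy = pd.DataFrame({
    'speed': [95, 120, 130, 135, 140, 160, 250, 260],
    'phase': ['Approach', 'Approach', 'Approach', 'Approach',
              'Climb', 'Climb', 'Descent', 'Descent'],
})

# 40-knot wide bins; right=False makes each bin include its lower edge.
edges = [80, 120, 160, 200, 240, 280]
labels = ['80-119', '120-159', '160-199', '200-239', '240-279']
toy['speed_bin'] = pd.cut(toy['speed'], bins=edges, labels=labels, right=False)

# Reports per speed interval and phase of flight.
ct = pd.crosstab(toy['speed_bin'], toy['phase'])
print(ct)
```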
data.groupby(['FlightDate']).count()['When: Time of day'].plot(figsize=(12,8));
data['FlightDate: Month'] = data['FlightDate'].dt.strftime('%B') # dependent on locale, use .dt.month elsewhere!
data.groupby(['FlightDate: Month']).count()['FlightDate'].sort_values().plot.bar(figsize=(12,8));
We see a seasonal increase in bird strikes during summer and autumn, from July to October. Most birds probably migrate to the south during the colder seasons (given that the data is from the US, which is in the northern hemisphere), so this is entirely expected.
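Month names depend on the locale and sort alphabetically, so a calendar-ordered view is easier to get from the numeric month. A minimal sketch on made-up dates:

```python
import pandas as pd

# Made-up flight dates.
dates = pd.Series(pd.to_datetime(
    ['2005-07-04', '2005-08-15', '2005-08-20', '2005-01-10', '2006-10-01']))

# Count reports per numeric month, order by calendar month,
# and fill months without any report with zero.
per_month = (dates.dt.month
             .value_counts()
             .sort_index()
             .reindex(range(1, 13), fill_value=0))
print(per_month)
```

Plotting per_month as a bar chart then shows January to December from left to right, which makes a seasonal pattern easier to spot than an axis sorted by count.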
When: Time (HHMM) → Effect: Indicated Damage over Bird Season vs. ~Bird Season
bird_season = [7, 8, 9, 10]
data_bs = data[data['FlightDate'].dt.month.isin(bird_season)]
data_not_bs = data[~data['FlightDate'].dt.month.isin(bird_season)]
(pd.crosstab(data_bs['When: Time (HHMM)'] // 100 * 100,
data_bs['Effect: Indicated Damage'],
margins=True) - pd.crosstab(data_not_bs['When: Time (HHMM)'] // 100 * 100,
data_not_bs['Effect: Indicated Damage'],
margins=True))[:-1].plot.bar(figsize=(12, 8));
Earlier we showed a figure of bird strike likelihood distributed over the whole day. Knowing that there is a bird season, we can now get a clearer picture of the same relationship. We see a much stronger likelihood of bird strike reports during the early morning and the evening. Surprisingly, perhaps, the risk of bird strikes causing damage to the aircraft is higher during the off-season.
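One caveat when subtracting the two crosstabs: the bird season spans four months and the off-season eight, so raw counts are not directly comparable. Hours that appear in only one table also yield NaN. A possible refinement, sketched on toy data with hypothetical column names, is to compare reports per month and subtract with a fill value. It assumes the data covers all months evenly.

```python
import pandas as pd

bird_season = [7, 8, 9, 10]

# Toy reports: month, hour bucket (HHMM) and damage flag (values are made up).
toy = pd.DataFrame({
    'month': [7, 8, 9, 10, 1, 2, 3, 7],
    'hour': [600, 600, 1800, 1800, 600, 1200, 600, 600],
    'damage': ['No damage', 'No damage', 'Caused damage', 'No damage',
               'Caused damage', 'No damage', 'No damage', 'No damage'],
})
in_season = toy['month'].isin(bird_season)

def reports_per_month(df, n_months):
    # Counts per hour and damage class, scaled to an average month.
    return pd.crosstab(df['hour'], df['damage']) / n_months

season = reports_per_month(toy[in_season], len(bird_season))
off_season = reports_per_month(toy[~in_season], 12 - len(bird_season))

# fill_value=0 treats an hour without reports in one table as zero, not NaN.
diff = season.sub(off_season, fill_value=0)
print(diff)
```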
Once the data had been cleaned up, with feature engineering around time of day and (military) aircraft, the analysis proved to be much easier. Making proper use of categorical data was essential. Although it seemed a roadblock at first, pd.DataFrame.corr() could be made to operate well on categorical data via pd.Categorical() and the codes Pandas generates automatically for each categorical dtype. The resulting heatmap could then guide us towards visualizing strong correlations.
Features that correlate with increased bird strike reports
- Airport: Name,
- Aircraft: Airline/Operator in combination with Airport: Name,
- Whether Airline/Operator is an air freight carrier or operates air freight hubs,
- Altitude in combination with Wildlife: Size,
- When: Time (HHMM),
- When: Time of day in combination with Aircraft: Number of engines,
- Aircraft: Make/Model,
- Speed (IAS) in knots in combination with When: Phase of flight,
- Seasonal effects wrt. bird season.