Managing Risk in Commercial Aviation: Strategies to Minimize Fatalities

Anudeepa Reddy
8 min read · Jun 7, 2022

Table of Contents:

· Overview

· Problem Statement

· Data set

· Mapping the real-world problem to an ML problem

· Exploratory Data Analysis and Feature Engineering

· Feature Importance

· Modelling

· Future Works

· References

Overview:

An aviation fatality is defined as an occurrence associated with the operation of an aircraft, taking place between the time any person boards the aircraft and the time all such persons have disembarked, in which:

a) A person is fatally or seriously injured,

b) The aircraft sustains significant damage or structural failure, or

c) The aircraft goes missing or becomes completely inaccessible.

Such accidents may be caused by equipment failure, weather or pilot error.

Statistics show that 56% of accidents are due to human error.

https://www.1001crash.com/stats/graph/cause_en.gif

Data set:

The train and test data contain the following fields:

a. Id — A unique identifier for a crew + time combination.

b. Crew — A unique id for a pair of pilots. There are 9 crews in the data.

c. Experiment — One of CA, DA, SS, or LOFT. The first three make up the training set; LOFT appears only in the test set.

d. Time — Seconds into the experiment.

e. Seat — Whether the pilot is in the left seat (0) or the right seat (1).

f. EEG — Electroencephalogram. eeg_fp1, eeg_f7, eeg_f8, eeg_t4, eeg_t6, eeg_t5, eeg_t3, eeg_fp2, eeg_o1, eeg_p3, eeg_pz, eeg_f3, eeg_fz, eeg_f4, eeg_c4, eeg_p4, eeg_poz, eeg_c3, eeg_cz and eeg_o2 are the voltages collected at different points on the scalp.

g. ECG — 3-point electrocardiogram signal. The data is provided in microvolts.

h. r — Respiration, a measure of the rise and fall of the chest. The data is provided in microvolts.

i. GSR — Galvanic Skin Response, a measure of electrodermal activity. The data is provided in microvolts.

j. Event — The state of the pilot at the given time: one of A = Baseline/No event, B = SS, C = CA, D = DA.

During the experiments, the pilots were subjected to distractions intended to induce one of the following three cognitive states:

  • Channelized Attention (CA) is, roughly speaking, the state of being focused on one task to the exclusion of all others. This is induced in benchmarking by having the subjects play an engaging puzzle-based video game.
  • Diverted Attention (DA) is the state of having one’s attention diverted by actions or thought processes associated with a decision. This is induced by having the subjects perform a display monitoring task. Periodically, a math problem showed up which had to be solved before returning to the monitoring task.
  • Startle/Surprise (SS) is induced by having the subjects watch movie clips with jump scares.

EEG, ECG, Galvanic Skin Response, and respiration devices are used to collect the physiological signals from the crew at the time of the experiment.

Mapping the real-world problem to an ML problem:

a. For each data point, given the medical signals, crew and time, we need to build a model that predicts the state of the pilot in real-world situations. There are four states: A (No event), B (SS), C (CA) and D (DA).

b. So, this is a multiclass classification problem.

c. Objective: Predict the probability of each event for every data point.

EDA & Feature engineering:

First, it is important to check whether the data contain any null values.

As there are no nulls, we can proceed to EDA.

Next, let us look at the distribution of the class labels in the data.

The events are not equally distributed, which shows that the data is imbalanced.

Most data points are labelled as event ‘A’, which means that most of the time the pilot is in the normal state.
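The two checks above can be sketched in a few lines of pandas. This is only a minimal illustration on a made-up data frame; the lowercase column names (crew, seat, ecg, event) are my assumption about how the fields are named in the file.

```python
import pandas as pd

# Tiny stand-in for the training data; column names are assumptions.
train = pd.DataFrame({
    "crew": [1, 1, 2, 2, 3, 3],
    "seat": [0, 1, 0, 1, 0, 1],
    "ecg": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "event": ["A", "A", "A", "C", "A", "D"],
})

# Total number of missing values across all columns.
n_missing = int(train.isnull().sum().sum())

# Share of rows belonging to each event class.
class_share = train["event"].value_counts(normalize=True)

print("missing values:", n_missing)
print(class_share)
```

A strongly uneven class_share, as in this toy frame, is what is meant by imbalanced data.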

Univariate Analysis:

Let’s look at how each feature, taken on its own, relates to the event class.

SEAT:

Whether the seat is 0 or 1 has no visible effect on the event.

CREW:

Crew shows little variation with respect to the event. Crew 1 has fewer recordings of pilots in event 1 than the other crews.

EXPERIMENT:

This plot gives information about the events that occur during each experiment. For example, when the experiment is ‘CA’, ‘DA’ or ‘SS’, the state of the pilot may be either C, D or B respectively, or A.

Here, when the experiment is ‘CA’, the event is C most of the time. This can be understood as follows: when the pilot is focused on one task to the exclusion of all others, he/she is less focused on monitoring operations.

ECG, R, EEG, GSR:

In this Kaggle problem, three experiments (SS, CA and DA) were conducted on the pilots. During these experiments, heart, brain, skin and respiration activity was recorded using ECG, EEG, GSR and respiration (r) sensors.

These instruments are so sensitive to noise that even nearby electrical or electronic equipment can easily distort the signal.

So, it is important to filter the noise out of the signal.

Noise is an unwanted signal, and here it is dominant at high frequencies.

By contrast, the respiration rate in adults is 12–16 breaths per minute, i.e. 12/60 to 16/60 Hz (about 0.2 to 0.27 Hz). Similarly, the heart beats about 72 times per minute, i.e. 72/60 = 1.2 Hz. So, these are low-frequency signals.

So, to de-noise these signals, we can use a low-pass filter, which passes the low-frequency signal and blocks the high-frequency noise.
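A low-pass filter of this kind can be sketched with SciPy. This is a minimal example on a synthetic signal, not the exact filter used in my notebook: the sampling rate FS, the cutoff frequency and the filter order are all assumptions chosen for illustration, and the cutoff in particular would need tuning for each type of signal.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 256.0     # assumed sampling rate in Hz (not stated in this article)
CUTOFF = 5.0   # assumed cutoff in Hz, chosen only for illustration

def low_pass(signal, fs=FS, cutoff=CUTOFF, order=4):
    """Apply a zero-phase Butterworth low-pass filter."""
    b, a = butter(order, cutoff, btype="low", fs=fs)
    # filtfilt runs the filter forwards and backwards, so there is no phase shift.
    return filtfilt(b, a, signal)

# Synthetic example: a 1.2 Hz sine (72 cycles per minute) plus 60 Hz interference.
t = np.arange(0, 10, 1 / FS)
clean = np.sin(2 * np.pi * 1.2 * t)
noisy = clean + 0.5 * np.sin(2 * np.pi * 60 * t)
filtered = low_pass(noisy)

# Mean squared error against the clean signal, before and after filtering.
mse_before = float(np.mean((noisy - clean) ** 2))
mse_after = float(np.mean((filtered - clean) ** 2))
print(mse_before, mse_after)
```

The slow 1.2 Hz component passes through almost unchanged while the 60 Hz component is suppressed, which is the behaviour described above.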

These recorded signals are voltages (the data set provides them in microvolts), so it is important to know how to interpret them.

ECG:

An ECG records the electrical signals of the heart, which can be analysed to learn about the state of a person.

The ECG values are in almost the same range for every event.

The ecg.ecg function from the biosppy module helps to obtain the heart rate from the ECG signal.

This gives the heart rate at only a few time stamps. So, we use an interpolation function to obtain the heart rate at all the time stamps.
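The interpolation step can be sketched as follows. The sparse heart-rate values here are made up; in practice they would come from the heart_rate_ts and heart_rate fields returned by biosppy.signals.ecg.ecg. I use np.interp (linear interpolation) as a stand-in, which may differ from the interpolation function used in the original code.

```python
import numpy as np

# Time stamps (seconds) of every sample in the recording (toy example).
t_all = np.arange(0.0, 10.0, 0.5)

# Sparse heart-rate estimates: time stamps and values in beats per minute.
# These numbers are invented for the example.
hr_ts = np.array([2.0, 4.0, 8.0])
hr = np.array([60.0, 70.0, 90.0])

# Linear interpolation onto every time stamp. Outside the known range,
# np.interp holds the first and last values constant.
hr_all = np.interp(t_all, hr_ts, hr)

print(hr_all)
```

After this step every row of the data has a heart-rate value, so it can be added as a new feature column.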

Respiration:

Most of the respiration values are also in a similar range for every event, so we cannot classify the event from these signals alone.

Using the biosppy module, we can also obtain the respiration rate.

GSR:

GSR refers to changes in the electrical characteristics of the skin, which are driven by the nervous system without conscious control.

When there is a change or arousal in the nervous system, sweat gland activity increases, which in turn increases skin conductance. As skin conductance is not under conscious control, it can be used to detect the emotional state of a person.

Galvanic skin potential (GSP) is the potential between the nodes observed when no external current is applied.

The GSR values alone cannot differentiate between the events, but they do help in classification when combined with other features.

EEG:

EEG gives information about brain activity.

The differences between node voltages, called montages, are calculated to get an idea of the brain activity in the region between the nodes. These potential differences are added as new features.
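As a rough sketch, montage features are just column differences. The electrode pairs below are hypothetical and picked only for illustration; the voltage values are also made up.

```python
import pandas as pd

# Toy EEG readings for four of the electrodes listed in the data set.
eeg = pd.DataFrame({
    "eeg_fp1": [10.0, 12.0],
    "eeg_f7": [4.0, 5.0],
    "eeg_f8": [3.0, 2.0],
    "eeg_fp2": [9.0, 8.0],
})

# Hypothetical electrode pairs; this article does not list the pairs used.
pairs = [("eeg_fp1", "eeg_f7"), ("eeg_fp2", "eeg_f8")]

for a, b in pairs:
    # Strip the "eeg_" prefix to build a feature name such as "fp1_f7".
    name = f"{a[4:]}_{b[4:]}"
    # The montage feature is the potential difference between the two nodes.
    eeg[name] = eeg[a] - eeg[b]

print(eeg[["fp1_f7", "fp2_f8"]])
```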

Brain activity can also be studied by analysing the frequency content of the EEG signals. This is called the firing rate. It is divided into 5 bands: Delta (<4 Hz), Theta (4–7 Hz), Alpha (8–15 Hz), Beta (16–31 Hz) and Gamma (>32 Hz).

These frequency features of the EEG can be obtained using get_power_features from the biosppy module, and they are also added as new features.
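To show what a band-power feature is, here is a small sketch that uses Welch's method from SciPy instead of biosppy, so it is not the same computation as get_power_features. The sampling rate FS is an assumption, and the band edges are written as half-open intervals that approximate the ranges listed above.

```python
import numpy as np
from scipy.signal import welch

FS = 256.0  # assumed sampling rate in Hz

# Half-open [low, high) band edges in Hz, approximating the listed ranges.
BANDS = {
    "delta": (0.0, 4.0),
    "theta": (4.0, 8.0),
    "alpha": (8.0, 16.0),
    "beta": (16.0, 32.0),
    "gamma": (32.0, float("inf")),
}

def band_powers(signal, fs=FS):
    """Estimate the power in each EEG band from a Welch periodogram."""
    freqs, psd = welch(signal, fs=fs, nperseg=int(2 * fs))
    df = freqs[1] - freqs[0]  # frequency resolution in Hz
    powers = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        # Sum of the spectral density over the band times the bin width.
        powers[name] = float(psd[mask].sum() * df)
    return powers

# Synthetic check: a pure 10 Hz sine should land in the alpha band.
t = np.arange(0, 8, 1 / FS)
alpha_wave = np.sin(2 * np.pi * 10 * t)
powers = band_powers(alpha_wave)
strongest = max(powers, key=powers.get)
print(strongest, powers)
```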

Bivariate Analysis:

In the univariate analysis, seat and crew did not tell us much about the event class. So, in the bivariate analysis, we look at the medical signals against time.

The data is easier to understand if the signal is plotted separately for each class.

From these plots, we can say that when the ECG values stay constant until about halfway through the experiment, the class is ‘D’.

If the values decrease gradually after the first second of the experiment, the class is ‘B’.

The plots look the same for classes A and C.

A similar analysis can be done for the respiration and GSR signals.

Pilot_ID:

We are not given a pilot id in the data, only the crew and the seat. Each crew has 2 pilots, and there are 9 crews.

So, we can easily obtain the pilot id by combining the crew number and the seat number. For example, if data point ‘i’ has crew number 3 and seat 0, then the pilot id is 30.
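One simple way to combine the two numbers, matching the example above, is crew * 10 + seat. This is a minimal sketch; the column names crew, seat and pilot_id are assumptions.

```python
import pandas as pd

# Toy rows; column names are assumptions about the data file.
df = pd.DataFrame({"crew": [3, 3, 5], "seat": [0, 1, 1]})

# Seat is 0 or 1, so crew * 10 + seat is unique for every pilot.
df["pilot_id"] = df["crew"] * 10 + df["seat"]

print(df)
```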

Feature Importance:

It is useful to know which features matter most for predicting the event, so that we can train the model only on those features, which reduces training time and memory.

Features with the least or zero importance are removed from the data.
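One common way to do this, shown here only as a sketch on synthetic data, is to fit a tree ensemble and read its feature_importances_. The choice of a random forest for ranking and the THRESHOLD value are my assumptions for this example, not a record of the exact procedure in the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Three synthetic features: one that determines the label, one that is
# pure noise, and one that is constant and therefore carries no information.
informative = rng.normal(size=n)
noise = rng.normal(size=n)
constant = np.zeros(n)
X = np.column_stack([informative, noise, constant])
y = (informative > 0).astype(int)
names = ["informative", "noise", "constant"]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importance = dict(zip(names, model.feature_importances_))

THRESHOLD = 0.01  # hypothetical cutoff for dropping a feature
kept = [name for name in names if importance[name] > THRESHOLD]

print(importance)
print("kept:", kept)
```

The constant column can never be used in a split, so its importance is exactly zero and it is dropped.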

Modelling:

Modelling is performed on the data after feature engineering.

The train data is split into train and validation sets.

The model is trained on the train set and evaluated on the validation set.

Machine learning models like logistic regression, decision tree, AdaBoost and also stacking of those models can be used for training. All these models are saved as pickle files so that they can be used on the test data later.

The test data is loaded and all the feature engineering steps are applied to it, so that the same new features are available on the test data too.

Now the trained models are used to predict the class labels of the test data and to get the log loss score.

The best score was obtained with the random forest classifier.
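The workflow above can be sketched end to end with scikit-learn. This is only an illustration on synthetic data with four classes standing in for events A to D. The two models, their hyperparameters and the split ratio are assumptions, and the pickle round trip is done in memory (pickle.dump to a file works the same way).

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
# Four synthetic classes (0 to 3) defined by the signs of the first two features.
y = (X[:, 0] > 0).astype(int) * 2 + (X[:, 1] > 0).astype(int)

# Stratified split keeps the class proportions the same in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_val)  # one probability per class
    scores[name] = log_loss(y_val, proba, labels=model.classes_)

# Serialize a trained model and load it back, as done before scoring test data.
blob = pickle.dumps(models["random_forest"])
restored = pickle.loads(blob)

print(scores)
```

Lower log loss is better; a model that always predicts 25% for each of four classes scores log(4), about 1.386, which is a useful baseline to compare against.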

Future works:

· As the data is imbalanced, techniques like SMOTE can be used to balance it.

· R-peak and GSR-peak features can be extracted.

You can find my code in my GitHub Repository and if you have any suggestions, please contact me via LinkedIn.

References:

https://www.kaggle.com/stuartbman/introduction-to-physiological-data

https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course

https://www.kaggle.com/hanhdao123/reducing-aviation-fatalities
