
Stroke Prediction Analysis - Regression, Tree and Boosting Models with Regularization and Resampling

Summary

The primary objective of this analysis is to develop a predictive model for stroke risk using various health and demographic attributes, focusing on maximizing recall to minimize missed stroke cases. The dataset, sourced from Kaggle, contains information on 5,110 patients with attributes such as age, gender, hypertension, heart disease, and smoking status. Data exploration revealed age, hypertension, and heart disease as significant predictors of stroke risk. Various classification models were evaluated, with the AdaBoost model showing the best overall performance in terms of balancing true and false positives. The analysis recommends incorporating additional health-related features and using longitudinal data to improve model accuracy. Future work should also focus on validating the model across diverse populations and regularly updating it with new data to maintain its relevance and accuracy.

1. Main Objective of the Analysis

The primary objective of this analysis is to build a model that predicts the likelihood of stroke in patients based on various health and demographic attributes. The best model should maximize recall while not wrongly classifying too many healthy patients as at risk, since the cost of misclassifying a patient at stroke risk as healthy far outweighs the cost of misclassifying a healthy patient as at risk. The analysis aims to provide insights for stroke risk classification that allow:

  • Early Identification: Helping healthcare providers identify individuals at higher risk of stroke for timely intervention.
  • Resource Allocation: Assisting in the efficient allocation of medical resources to those most in need.
  • Preventive Measures: Providing insights for developing targeted preventive measures to reduce the incidence of strokes.

2. Dataset Description

Dataset Overview

The Stroke Prediction dataset is available on Kaggle (URL: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset) and contains medical and demographic attributes of patients.

Attributes

Attribute Description
id Unique identifier for each patient.
gender Gender of the patient (Male, Female, Other).
age Age of the patient.
hypertension Whether the patient has hypertension (0: No, 1: Yes).
heart_disease Whether the patient has heart disease (0: No, 1: Yes).
ever_married Marital status of the patient (No, Yes).
work_type Type of occupation (children, Govt_job, Never_worked, Private, Self-employed).
residence_type Type of residence (Rural, Urban).
avg_glucose_level Average glucose level in the blood.
bmi Body mass index.
smoking_status Smoking status (formerly smoked, never smoked, smokes, Unknown).
stroke Target variable indicating whether the patient had a stroke (0: No, 1: Yes).
  • Number of Instances: 5,110

  • Number of Features: 11

  • Target Variable: stroke (0: No Stroke, 1: Stroke)

Analysis Approach

The analysis aims to:

  • Explore and visualize the data to understand the distribution of attributes and identify any missing or anomalous values.
  • Engineer features and prepare the data for modeling.
  • Train multiple classifier models to predict stroke risk and evaluate the performance of the models.
  • Identify the best-performing model based on support for stroke risk management.
  • Provide recommendations for next steps and further optimization.

3. Data Exploration and Cleaning

Data Exploration

Besides the id, the dataset includes the ten features listed above plus the target variable stroke. There are three numerical features: age, avg_glucose_level and bmi; the remaining seven features are categorical. Of the 5,110 observations in the dataset, 4,861 show no stroke and 249 patients had a stroke. The dataset is therefore clearly imbalanced, which has to be addressed before model training. The variables of the dataset and the distribution of the target variable are shown below.

count unique top freq mean std min 25% 50% 75% max
gender 5110 3 Female 2994 NaN NaN NaN NaN NaN NaN NaN
age 5110.0 NaN NaN NaN 43.23 22.61 0.08 25.0 45.0 61.0 82.0
hypertension 5110.0 NaN NaN NaN 0.1 0.3 0.0 0.0 0.0 0.0 1.0
heart_disease 5110.0 NaN NaN NaN 0.05 0.23 0.0 0.0 0.0 0.0 1.0
ever_married 5110 2 Yes 3353 NaN NaN NaN NaN NaN NaN NaN
work_type 5110 5 Private 2925 NaN NaN NaN NaN NaN NaN NaN
Residence_type 5110 2 Urban 2596 NaN NaN NaN NaN NaN NaN NaN
avg_glucose_level 5110.0 NaN NaN NaN 106.15 45.28 55.12 77.24 91.88 114.09 271.74
bmi 4909.0 NaN NaN NaN 28.89 7.85 10.3 23.5 28.1 33.1 97.6
smoking_status 5110 4 never smoked 1892 NaN NaN NaN NaN NaN NaN NaN
stroke 5110.0 NaN NaN NaN 0.05 0.22 0.0 0.0 0.0 0.0 1.0

Distribution of stroke cases in the dataset

To analyze the distribution and correlation of the data we prepared a set of four plots for each variable, depending on its type, as follows:

  • Numerical Variables: The overall distribution of the variable, the distribution of the variable for non stroke observations, the distribution of the variable for stroke observations and the density distribution separated by stroke cases.

  • Categorical Variables: The overall distribution of the variable, the distribution of the variable for non stroke observations, the distribution of the variable for stroke observations and the distribution of stroke cases within the groups of the categorical variable.
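A minimal matplotlib sketch of such a four-panel grid for a numerical variable, using randomly generated stand-in data (the helper name and layout are illustrative, not the original plotting code):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in for the stroke DataFrame (illustrative only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.uniform(1, 82, 500),
    "stroke": rng.integers(0, 2, 500),
})

def plot_numerical(df, col):
    """Four panels: overall, non-stroke, stroke, and density by class."""
    fig, axes = plt.subplots(1, 4, figsize=(16, 3))
    df[col].hist(ax=axes[0]); axes[0].set_title(f"{col} - all")
    df.loc[df.stroke == 0, col].hist(ax=axes[1]); axes[1].set_title(f"{col} - no stroke")
    df.loc[df.stroke == 1, col].hist(ax=axes[2]); axes[2].set_title(f"{col} - stroke")
    for cls, grp in df.groupby("stroke"):
        axes[3].hist(grp[col], density=True, alpha=0.5, label=f"stroke={cls}")
    axes[3].legend(); axes[3].set_title(f"{col} - density by class")
    return fig

fig = plot_numerical(df, "age")
```

The same helper pattern carries over to categorical variables by replacing the histograms with bar charts of group counts.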

The distribution of the target variable is shown below.

Numerical Variables

The graphs for the three numerical variables are shown below. There is no meaningful correlation visible for BMI or the Average Glucose Level. In contrast, Age shows a fairly strong dependency on the target variable.

Distribution of numerical variables

Stroke cases plotted versus Age are not evenly distributed: reported stroke cases become more frequent with increasing age. Body Mass Index and Average Glucose Level do not show an obvious influence on stroke. This observation is confirmed in the pairplot below.

Pairplot of numerical variables

As seen before, there is no meaningful correlation visible for BMI or the Average Glucose Level, but Age again shows a strong dependency: the distribution of stroke cases is clearly skewed toward older ages.

Categorical Variables

The distributions of the seven categorical variables are shown below.

Distribution of categorical variables

If we analyze the group percentages and compare the distributions of the variables for stroke and non-stroke cases, we can identify Hypertension, Heart Disease and Marital Status as potential influences on stroke risk. We analyze the correlations further below.

Correlation Analysis

We can pairplot the entire dataset after encoding, adding some noise to the categorical variables for better illustration.

Pairplot of all variables with random noise

The graphs below show the result of the calculated correlation matrix for the entire dataset next to the correlation values of the variables in relation to the target variable stroke.

Correlation matrix and correlation coefficient for stroke as linear model

The correlation matrix shows Age, Heart Disease, Average Glucose Level and Hypertension as the main independent variables for stroke risk. Ever married shows a strong correlation with age, possibly indicating that marriage was more common in older generations. As such, Ever Married is likely to be a confounder in the context of stroke risk analysis.
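A sketch of how the correlation matrix and the per-feature correlations with stroke can be computed with pandas (the toy data only mimics the age/ever_married relationship for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the encoded stroke DataFrame (illustrative only):
# marriage tracks age, and stroke probability rises with age.
rng = np.random.default_rng(1)
age = rng.uniform(0, 82, 1000)
df = pd.DataFrame({
    "age": age,
    "ever_married": (age > 25).astype(int),
    "stroke": (rng.uniform(0, 82, 1000) < age / 10).astype(int),
})

corr = df.corr()  # full pairwise correlation matrix
# Correlation of each feature with the target, strongest first.
stroke_corr = corr["stroke"].drop("stroke").sort_values(ascending=False)
print(stroke_corr)
```

In such a setup ever_married correlates with stroke mainly through its strong correlation with age, which is exactly the confounding pattern described above.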

Data Cleaning and Feature Engineering

To prepare the data for the further analysis and the modeling phase we will perform the following steps:

  • Handling Missing Values: Address missing values as appropriate.
  • Discretization - Handling of Outliers: Transform continuous variables into categories for improved performance and explainability.
  • Encoding Categorical Variables: Convert categorical variables into numerical format using one-hot encoding.
  • Data Splitting: Split the data into training and testing sets.
  • Feature Scaling: Scale features to ensure they are on a similar scale.
  • Addressing unbalanced data: Balance the classes in the training set.

Handling Missing Values

The only variable with missing values in the dataset is bmi. There are 201 records with a missing BMI value; of these, 40 are stroke cases (stroke=1). We will impute the missing values with the group mean BMI by gender, age and glucose level.
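A sketch of such a group-mean imputation with pandas (the toy frame and its grouping columns are illustrative; the analysis groups by gender, age and glucose level):

```python
import numpy as np
import pandas as pd

# Toy frame with missing bmi values (illustrative stand-in for the dataset).
df = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female", "Female"],
    "age_group": ["40-50", "40-50", "40-50", "65-75", "65-75"],
    "bmi": [27.0, np.nan, 31.0, np.nan, 29.0],
})

# Fill each missing bmi with the mean of its (gender, age_group) cell,
# falling back to the overall mean if a cell has no observed values.
group_mean = df.groupby(["gender", "age_group"])["bmi"].transform("mean")
df["bmi"] = df["bmi"].fillna(group_mean).fillna(df["bmi"].mean())

print(df["bmi"].tolist())
```

Group-mean imputation keeps the filled values plausible for each subpopulation instead of pulling every missing BMI toward the global mean.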

Discretization - Handling of Outliers

By categorizing continuous variables into discrete groups, we can enhance the interpretability of the data and improve the performance of classification models by capturing non-linear relationships and reducing the influence of outliers.

For age, the dataset is divided into categories from 0-18 years, 18-30 years, 30-40 years, 40-50 years, 50-65 years, 65-75 years and more than 75 years. These categories reflect different life stages, which may correlate differently with stroke risk due to varying health behaviors and biological changes.

For BMI (Body Mass Index), the classification follows standard health guidelines: Underweight (BMI < 18.5), Normal Weight (18.5 ≤ BMI < 24.9), Overweight (25 ≤ BMI < 29.9), and Obese (BMI ≥ 30).

For glucose levels, categories are Normal (glucose < 140 mg/dL), Prediabetes (140 ≤ glucose < 200 mg/dL), and Diabetes (glucose ≥ 200 mg/dL).
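With pandas, these cuts can be expressed via pd.cut; the bin edges below mirror the categories described above (the labels and the open upper edges are illustrative choices):

```python
import pandas as pd

# Toy values to discretize (illustrative only).
age = pd.Series([5, 22, 45, 70, 80])
glucose = pd.Series([90.0, 150.0, 230.0])

# Age bins reflecting the life stages described above; bins are right-inclusive.
age_group = pd.cut(age, bins=[0, 18, 30, 40, 50, 65, 75, 200],
                   labels=["0-18", "18-30", "30-40", "40-50", "50-65", "65-75", "75+"])

# Glucose bins following the Normal / Prediabetes / Diabetes thresholds.
glucose_group = pd.cut(glucose, bins=[0, 140, 200, 1000],
                       labels=["Normal", "Prediabetes", "Diabetes"])

print(age_group.tolist())
print(glucose_group.tolist())
```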

The resulting distributions are shown below.

Distribution for grouped variables Age, BMI and Diabetes(Glucose Level)

Encoding categorical variables

To prepare the data for the modeling phase we apply one-hot encoding to all categorical variables. Two categorical variables, hypertension and heart_disease, are already encoded as numbers. After performing the one-hot encoding while dropping the first category of each variable (drop_first=True), the dataset is widened to 17 columns including the target variable.

count mean std min 25% 50% 75% max
hypertension 5110.0 0.097456 0.296607 0.0 0.0 0.0 0.0 1.0
heart_disease 5110.0 0.054012 0.226063 0.0 0.0 0.0 0.0 1.0
stroke 5110.0 0.048728 0.215320 0.0 0.0 0.0 0.0 1.0
glucose_group 5110.0 0.245597 0.595996 0.0 0.0 0.0 0.0 2.0
age_group 5110.0 2.846184 1.915983 0.0 1.0 3.0 4.0 6.0
bmi_group 5110.0 2.028963 0.952761 0.0 1.0 2.0 3.0 3.0
gender_Male 5110.0 0.413894 0.492578 0.0 0.0 0.0 1.0 1.0
gender_Other 5110.0 0.000196 0.013989 0.0 0.0 0.0 0.0 1.0
ever_married_Yes 5110.0 0.656164 0.475034 0.0 0.0 1.0 1.0 1.0
work_type_Never_worked 5110.0 0.004305 0.065480 0.0 0.0 0.0 0.0 1.0
work_type_Private 5110.0 0.572407 0.494778 0.0 0.0 1.0 1.0 1.0
work_type_Self-employed 5110.0 0.160274 0.366896 0.0 0.0 0.0 0.0 1.0
work_type_children 5110.0 0.134442 0.341160 0.0 0.0 0.0 0.0 1.0
Residence_type_Urban 5110.0 0.508023 0.499985 0.0 0.0 1.0 1.0 1.0
smoking_status_formerly smoked 5110.0 0.173190 0.378448 0.0 0.0 0.0 0.0 1.0
smoking_status_never smoked 5110.0 0.370254 0.482920 0.0 0.0 0.0 1.0 1.0
smoking_status_smokes 5110.0 0.154403 0.361370 0.0 0.0 0.0 0.0 1.0
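The encoding step can be sketched with pandas.get_dummies on a toy frame (column names mirror the dataset; the values are illustrative):

```python
import pandas as pd

# Toy frame with two categorical columns and one already-numeric column.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Other", "Female"],
    "smoking_status": ["smokes", "never smoked", "Unknown", "formerly smoked"],
    "hypertension": [0, 1, 0, 0],  # already numeric, left untouched
})

# drop_first=True omits the first category of each variable, so one column
# per variable becomes the implicit baseline (here gender_Female is dropped).
encoded = pd.get_dummies(df, columns=["gender", "smoking_status"], drop_first=True)
print(sorted(encoded.columns))
```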

Data splitting

In the next step we split the data into training and test sets in a 70/30 ratio, while maintaining the class distribution with stratify=y.

Dataset Shape
Training Features (3577, 16)
Test Features (1533, 16)
Training Target (3577,)
Test Target (1533,)
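A sketch of the stratified split on toy data of similar imbalance (the shapes here are illustrative, not the stroke data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and an imbalanced target (~5% positive, like the stroke data).
X = np.arange(1000 * 4).reshape(1000, 4)
y = np.array([1] * 50 + [0] * 950)

# stratify=y keeps the positive-class share identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())  # class ratio preserved in both splits
```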

Feature Scaling

Scaling is crucial for ensuring that algorithms that are sensitive to the scale of the input data perform optimally and produce reliable results. We apply a MinMax scaler to the stroke data to normalize the features to values between 0 and 1.

count mean std min 25% 50% 75% max
hypertension 3577.0 0.096170 0.294865 0.0 0.000000 0.000000 0.000000 1.0
heart_disease 3577.0 0.053956 0.225962 0.0 0.000000 0.000000 0.000000 1.0
glucose_group 3577.0 0.120352 0.295112 0.0 0.000000 0.000000 0.000000 1.0
age_group 3577.0 0.474839 0.317024 0.0 0.166667 0.500000 0.666667 1.0
bmi_group 3577.0 0.675240 0.319805 0.0 0.333333 0.666667 1.000000 1.0
gender_Male 3577.0 0.414593 0.492721 0.0 0.000000 0.000000 1.000000 1.0
gender_Other 3577.0 0.000280 0.016720 0.0 0.000000 0.000000 0.000000 1.0
ever_married_Yes 3577.0 0.660050 0.473758 0.0 0.000000 1.000000 1.000000 1.0
work_type_Never_worked 3577.0 0.003355 0.057831 0.0 0.000000 0.000000 0.000000 1.0
work_type_Private 3577.0 0.571429 0.494941 0.0 0.000000 1.000000 1.000000 1.0
work_type_Self-employed 3577.0 0.164663 0.370928 0.0 0.000000 0.000000 0.000000 1.0
work_type_children 3577.0 0.134470 0.341205 0.0 0.000000 0.000000 0.000000 1.0
Residence_type_Urban 3577.0 0.504054 0.500053 0.0 0.000000 1.000000 1.000000 1.0
smoking_status_formerly smoked 3577.0 0.176405 0.381217 0.0 0.000000 0.000000 0.000000 1.0
smoking_status_never smoked 3577.0 0.370981 0.483135 0.0 0.000000 0.000000 1.000000 1.0
smoking_status_smokes 3577.0 0.149567 0.356696 0.0 0.000000 0.000000 0.000000 1.0
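A sketch of the scaling step, fitting the scaler on the training set only so that no test-set information leaks into training (toy data for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 40.0]])
X_test = np.array([[1.0, 25.0], [5.0, 15.0]])  # may fall outside [0, 1]

# Fit on the training set only, then apply the same transform to the test set.
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.min(axis=0), X_train_s.max(axis=0))  # exactly 0 and 1 per column
print(X_test_s)  # test values can exceed [0, 1] if outside the training range
```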

Addressing unbalanced dataset

The imbalance of the dataset can lead to biased models that are heavily skewed towards predicting the majority class, thereby compromising the model's ability to correctly identify and predict strokes. We will apply SMOTE (Synthetic Minority Over-sampling Technique) and alternatively Random Undersampling to the stroke dataset to address the significant class imbalance. This should enhance recall results, which is critical for developing predictive models in healthcare.
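SMOTE and random undersampling are typically applied via the imbalanced-learn package. The self-contained numpy sketch below illustrates the two ideas in simplified form (real SMOTE interpolates toward one of the k nearest minority neighbours rather than a random minority partner):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy imbalanced data: 200 majority (y=0) and 20 minority (y=1) samples.
X = np.vstack([rng.normal(0, 1, (200, 3)), rng.normal(3, 1, (20, 3))])
y = np.array([0] * 200 + [1] * 20)

def random_undersample(X, y, rng):
    """Drop majority samples until both classes have the minority count."""
    keep1 = np.flatnonzero(y == 1)
    keep0 = rng.choice(np.flatnonzero(y == 0), size=keep1.size, replace=False)
    idx = np.concatenate([keep0, keep1])
    return X[idx], y[idx]

def smote_like_oversample(X, y, rng):
    """Simplified SMOTE: interpolate between random minority pairs."""
    minority = X[y == 1]
    n_new = (y == 0).sum() - (y == 1).sum()
    a = minority[rng.integers(0, len(minority), n_new)]
    b = minority[rng.integers(0, len(minority), n_new)]
    synthetic = a + rng.uniform(0, 1, (n_new, 1)) * (b - a)
    return np.vstack([X, synthetic]), np.concatenate([y, np.ones(n_new, dtype=int)])

X_u, y_u = random_undersample(X, y, rng)
X_o, y_o = smote_like_oversample(X, y, rng)
print(np.bincount(y_u), np.bincount(y_o))  # balanced class counts either way
```

Resampling is applied to the training split only; the test set keeps its natural imbalance so that the reported metrics remain realistic.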

4. Model Training

During model training we will evaluate six classification models as listed below. To optimize these models, we will tune hyperparameters using GridSearchCV with a 5-fold cross-validation. This approach ensures that our models are robust and generalize well to unseen data. In addition to hyperparameter tuning, we will explore different resampling methods to address class imbalance, specifically using SMOTE (Synthetic Minority Over-sampling Technique) and random undersampling. To comprehensively evaluate model performance, we will vary the scoring metrics used in GridSearchCV, including F1 score, recall, and F-beta scores with beta values of 2 and 4. These varied scoring metrics will help us assess the models' ability to balance precision and recall, particularly emphasizing recall with the higher beta values in the F-beta score. This extensive evaluation process aims to identify the most effective model scoring and resampling strategy for predicting stroke risk.
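The combination of GridSearchCV with a custom F-beta scorer can be sketched roughly as follows (the dataset and parameter grid are illustrative, not the ones used in the analysis):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Toy imbalanced classification problem (~10% positive class).
X, y = make_classification(n_samples=400, weights=[0.9], random_state=0)

# F-beta with beta=4 weights recall far more heavily than precision.
f4_scorer = make_scorer(fbeta_score, beta=4)

grid = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring=f4_scorer,   # swap in "f1", "recall", or a beta=2 scorer as needed
    cv=5,                # 5-fold cross-validation
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Running the same grid once per scorer and per resampling method yields the 48 combinations evaluated below.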

Classification Models

The following classification models have been selected for the analysis.

  1. Logistic Regression with Regularization: Used as a baseline model for its simplicity and interpretability.
  2. KNN: A non-parametric method used for classification by comparing a test sample to the 'k' nearest neighbors in the feature space.
  3. Decision Tree: A model that uses a tree-like graph of decisions and their possible consequences, known for its simplicity and ability to handle non-linear relationships.
  4. Naive Bayes: A probabilistic classifier based on Bayes' theorem, assuming independence between predictors. It's particularly effective for problems with categorical input data.
  5. AdaBoost: An ensemble method that combines multiple weak classifiers to create a strong classifier. It works by iteratively training classifiers and adjusting their weights to focus on misclassified instances, improving overall model performance.
  6. Random Forest: Another ensemble method that constructs multiple decision trees during training and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees. This method controls overfitting by averaging multiple decision trees, each built from a random subset of the training data and features.

Model Evaluation

The models were evaluated using the same training and test splits to ensure a fair comparison. The following evaluation methods were used:

Performance Indicators

  • Accuracy
  • Precision
  • Recall
  • F1 score as F1
  • FBeta score for beta=2 as F2
  • FBeta score for beta=4 as F4

Confusion Matrix

  • True positive (True 1) and false positive (False 1) counts
  • True negative (True 0) and false negative (False 0) counts
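These indicators can be recomputed directly from the confusion counts. As a sanity check, the following sketch reproduces the metrics from the counts reported later for AdaBoost Model 16:

```python
# Confusion-matrix counts reported for AdaBoost Model 16 (TP=65, TN=879, FP=579, FN=10).
tp, tn, fp, fn = 65, 879, 579, 10

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

def fbeta(p, r, beta):
    """F-beta score: weighted harmonic mean of precision and recall,
    with recall weighted by beta**2."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(round(precision, 4), round(recall, 4), round(accuracy, 4))
# → 0.1009 0.8667 0.6158
print(round(fbeta(precision, recall, 1), 4),
      round(fbeta(precision, recall, 2), 4),
      round(fbeta(precision, recall, 4), 4))
# → 0.1808 0.3443 0.5992, matching the F1/F2/F4 values in the table
```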

The results of these metrics for a total of 48 different combinations of resampling method, cross validation scoring and model are shown in the tables below sorted by the F4 score. In each combination the best hyperparameters for the model were determined before calculating the performance indicators.

Results - Performance Indicators

Resampling Scoring Model Precision Recall Accuracy F1 F2 F4
0 SMOTE f1 Logistic Regression 0.1096 0.8400 0.6582 0.1938 0.3600 0.6034
6 SMOTE recall Logistic Regression 0.1096 0.8400 0.6582 0.1938 0.3600 0.6034
12 SMOTE f2 Logistic Regression 0.1096 0.8400 0.6582 0.1938 0.3600 0.6034
18 SMOTE f4 Logistic Regression 0.1096 0.8400 0.6582 0.1938 0.3600 0.6034
16 SMOTE f2 AdaBoost 0.1009 0.8667 0.6158 0.1808 0.3443 0.5992
22 SMOTE f4 AdaBoost 0.0988 0.8667 0.6067 0.1774 0.3392 0.5947
38 Undersampling f2 Decision Tree 0.1204 0.7867 0.7084 0.2088 0.3734 0.5935
44 Undersampling f4 Decision Tree 0.1204 0.7867 0.7084 0.2088 0.3734 0.5935
32 Undersampling recall Decision Tree 0.1204 0.7867 0.7084 0.2088 0.3734 0.5935
26 Undersampling f1 Decision Tree 0.1204 0.7867 0.7084 0.2088 0.3734 0.5935
40 Undersampling f2 AdaBoost 0.0973 0.8667 0.6001 0.1750 0.3357 0.5915
42 Undersampling f4 Logistic Regression 0.0973 0.8667 0.6001 0.1750 0.3357 0.5915
34 Undersampling recall AdaBoost 0.0973 0.8667 0.6001 0.1750 0.3357 0.5915
30 Undersampling recall Logistic Regression 0.0973 0.8667 0.6001 0.1750 0.3357 0.5915
46 Undersampling f4 AdaBoost 0.0973 0.8667 0.6001 0.1750 0.3357 0.5915
28 Undersampling f1 AdaBoost 0.0973 0.8667 0.6001 0.1750 0.3357 0.5915
36 Undersampling f2 Logistic Regression 0.0973 0.8667 0.6001 0.1750 0.3357 0.5915
24 Undersampling f1 Logistic Regression 0.0973 0.8667 0.6001 0.1750 0.3357 0.5915
10 SMOTE recall AdaBoost 0.0973 0.8667 0.6001 0.1750 0.3357 0.5915
3 SMOTE f1 Naive Bayes 0.0808 0.9467 0.4703 0.1488 0.3011 0.5806
15 SMOTE f2 Naive Bayes 0.0723 0.9867 0.3803 0.1348 0.2797 0.5659
39 Undersampling f2 Naive Bayes 0.0711 0.9867 0.3686 0.1326 0.2759 0.5614
27 Undersampling f1 Naive Bayes 0.0841 0.8667 0.5316 0.1533 0.3029 0.5601
47 Undersampling f4 Random Forest 0.1071 0.7600 0.6784 0.1878 0.3425 0.5595
41 Undersampling f2 Random Forest 0.1071 0.7600 0.6784 0.1878 0.3425 0.5595
29 Undersampling f1 Random Forest 0.1071 0.7600 0.6784 0.1878 0.3425 0.5595
43 Undersampling f4 K-Nearest Neighbors 0.0988 0.7867 0.6386 0.1756 0.3289 0.5582
25 Undersampling f1 K-Nearest Neighbors 0.0988 0.7867 0.6386 0.1756 0.3289 0.5582
31 Undersampling recall K-Nearest Neighbors 0.0988 0.7867 0.6386 0.1756 0.3289 0.5582
37 Undersampling f2 K-Nearest Neighbors 0.0988 0.7867 0.6386 0.1756 0.3289 0.5582
35 Undersampling recall Random Forest 0.1059 0.7600 0.6745 0.1860 0.3401 0.5575
21 SMOTE f4 Naive Bayes 0.0672 0.9867 0.3294 0.1259 0.2641 0.5467
45 Undersampling f4 Naive Bayes 0.0661 0.9867 0.3170 0.1238 0.2606 0.5422
4 SMOTE f1 AdaBoost 0.0966 0.7600 0.6406 0.1714 0.3202 0.5413
33 Undersampling recall Naive Bayes 0.0651 0.9867 0.3059 0.1221 0.2575 0.5383
9 SMOTE recall Naive Bayes 0.0562 0.9867 0.1885 0.1063 0.2288 0.4998
19 SMOTE f4 K-Nearest Neighbors 0.0911 0.4667 0.7462 0.1525 0.2558 0.3756
7 SMOTE recall K-Nearest Neighbors 0.0911 0.4667 0.7462 0.1525 0.2558 0.3756
13 SMOTE f2 K-Nearest Neighbors 0.0911 0.4667 0.7462 0.1525 0.2558 0.3756
1 SMOTE f1 K-Nearest Neighbors 0.0985 0.4400 0.7756 0.1610 0.2598 0.3655
8 SMOTE recall Decision Tree 0.0930 0.2667 0.8369 0.1379 0.1942 0.2403
14 SMOTE f2 Decision Tree 0.0930 0.2667 0.8369 0.1379 0.1942 0.2403
20 SMOTE f4 Decision Tree 0.0930 0.2667 0.8369 0.1379 0.1942 0.2403
2 SMOTE f1 Decision Tree 0.0930 0.2667 0.8369 0.1379 0.1942 0.2403
11 SMOTE recall Random Forest 0.0941 0.2533 0.8441 0.1372 0.1892 0.2304
23 SMOTE f4 Random Forest 0.0941 0.2533 0.8441 0.1372 0.1892 0.2304
17 SMOTE f2 Random Forest 0.0838 0.2133 0.8474 0.1203 0.1629 0.1955
5 SMOTE f1 Random Forest 0.0789 0.2000 0.8467 0.1132 0.1531 0.1835

Performance Results for SMOTE resampling and different cross validation scoring

Performance Results for Random Undersampling resampling and different cross validation scoring

The models were optimized with different scoring criteria, focusing on recall and a weighted F-beta score, to favor avoiding the misclassification of true stroke-risk patients. The results in the tables are sorted by the F4 score (F-beta with beta=4). They show the best performance for Logistic Regression models trained on SMOTE-resampled data, followed by two AdaBoost models also trained on SMOTE-resampled data, ahead of four Decision Tree models trained on randomly undersampled data. Regarding the resampling method, Decision Tree, Random Forest and KNN models perform better when trained on randomly undersampled data. In contrast, Logistic Regression, Naive Bayes and AdaBoost show better results when trained with SMOTE oversampling.

Results - Confusion Matrix

Resampling Scoring Model F4 True 1 True 0 False 1 False 0
0 SMOTE f1 Logistic Regression 0.6034 63 946 512 12
6 SMOTE recall Logistic Regression 0.6034 63 946 512 12
12 SMOTE f2 Logistic Regression 0.6034 63 946 512 12
18 SMOTE f4 Logistic Regression 0.6034 63 946 512 12
16 SMOTE f2 AdaBoost 0.5992 65 879 579 10
22 SMOTE f4 AdaBoost 0.5947 65 865 593 10
38 Undersampling f2 Decision Tree 0.5935 59 1027 431 16
44 Undersampling f4 Decision Tree 0.5935 59 1027 431 16
32 Undersampling recall Decision Tree 0.5935 59 1027 431 16
26 Undersampling f1 Decision Tree 0.5935 59 1027 431 16
40 Undersampling f2 AdaBoost 0.5915 65 855 603 10
42 Undersampling f4 Logistic Regression 0.5915 65 855 603 10
34 Undersampling recall AdaBoost 0.5915 65 855 603 10
30 Undersampling recall Logistic Regression 0.5915 65 855 603 10
46 Undersampling f4 AdaBoost 0.5915 65 855 603 10
28 Undersampling f1 AdaBoost 0.5915 65 855 603 10
36 Undersampling f2 Logistic Regression 0.5915 65 855 603 10
24 Undersampling f1 Logistic Regression 0.5915 65 855 603 10
10 SMOTE recall AdaBoost 0.5915 65 855 603 10
3 SMOTE f1 Naive Bayes 0.5806 71 650 808 4
15 SMOTE f2 Naive Bayes 0.5659 74 509 949 1
39 Undersampling f2 Naive Bayes 0.5614 74 491 967 1
27 Undersampling f1 Naive Bayes 0.5601 65 750 708 10
47 Undersampling f4 Random Forest 0.5595 57 983 475 18
41 Undersampling f2 Random Forest 0.5595 57 983 475 18
29 Undersampling f1 Random Forest 0.5595 57 983 475 18
43 Undersampling f4 K-Nearest Neighbors 0.5582 59 920 538 16
25 Undersampling f1 K-Nearest Neighbors 0.5582 59 920 538 16
31 Undersampling recall K-Nearest Neighbors 0.5582 59 920 538 16
37 Undersampling f2 K-Nearest Neighbors 0.5582 59 920 538 16
35 Undersampling recall Random Forest 0.5575 57 977 481 18
21 SMOTE f4 Naive Bayes 0.5467 74 431 1027 1
45 Undersampling f4 Naive Bayes 0.5422 74 412 1046 1
4 SMOTE f1 AdaBoost 0.5413 57 925 533 18
33 Undersampling recall Naive Bayes 0.5383 74 395 1063 1
9 SMOTE recall Naive Bayes 0.4998 74 215 1243 1
19 SMOTE f4 K-Nearest Neighbors 0.3756 35 1109 349 40
7 SMOTE recall K-Nearest Neighbors 0.3756 35 1109 349 40
13 SMOTE f2 K-Nearest Neighbors 0.3756 35 1109 349 40
1 SMOTE f1 K-Nearest Neighbors 0.3655 33 1156 302 42
8 SMOTE recall Decision Tree 0.2403 20 1263 195 55
14 SMOTE f2 Decision Tree 0.2403 20 1263 195 55
20 SMOTE f4 Decision Tree 0.2403 20 1263 195 55
2 SMOTE f1 Decision Tree 0.2403 20 1263 195 55
11 SMOTE recall Random Forest 0.2304 19 1275 183 56
23 SMOTE f4 Random Forest 0.2304 19 1275 183 56
17 SMOTE f2 Random Forest 0.1955 16 1283 175 59
5 SMOTE f1 Random Forest 0.1835 15 1283 175 60

Confusion Matrices for SMOTE and f2 cross validation scoring

Confusion Matrices for Random Undersampling and f2 cross validation scoring

If we aimed purely to minimize the Type II error, that is, to avoid misclassifying true stroke-risk cases, we would find the smallest Type II error counts in the Naive Bayes models. Model 15, for example, misclassifies only one true stroke case of the test dataset as stroke=0. Unfortunately, this strength in finding true stroke cases comes at the cost of a large number of Type I errors, i.e. non-stroke patients identified as stroke-risk patients: Naive Bayes Model 15 produced 949 Type I errors out of 1,458 non-stroke cases. The Logistic Regression models deliver a more balanced result. The best of the eight Logistic Regression models, Model 0, produced 12 Type II errors out of 75 stroke cases in the test data and 512 Type I errors out of 1,458 non-stroke cases. The AdaBoost and Decision Tree models also show a good balance of Type II versus Type I error counts: AdaBoost Model 16 produced 10 Type II errors and 579 Type I errors, while Decision Tree Model 38 produced 16 Type II errors but only 431 Type I errors.

Results - Influence of features

To understand how the different algorithms used the feature values to build a classification model, we select the best-performing model for each algorithm and extract the available information about feature influence. The results are shown in the bar graphs above the model-specific confusion matrices below.

Feature Influence for Best in Class Models

Confusion Matrix for Best in Class Models

We can see that age is the predominant influencing feature for all models. Logistic Regression Model 0 uses L1 regularization and eliminates most of the other features. The Random Forest and Naive Bayes models use a wider range of features for their classification and allow a better explanation than age alone. From the exploratory analysis we would have concluded that age is the most important factor, but that BMI, diabetes condition and heart disease also influence the stroke risk distribution. The feature extraction does not contradict this; it mainly confirms the dominant influence of age.
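A rough sketch of how such feature influences can be extracted from scikit-learn estimators (linear models expose signed coefficients via coef_, tree ensembles expose feature_importances_); the data and feature names are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Toy classification data standing in for the encoded stroke features.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
names = [f"feat_{i}" for i in range(5)]

# L1-regularized logistic regression tends to zero out weak features;
# random forests spread importance across features used in their splits.
lr = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

lr_influence = dict(zip(names, np.abs(lr.coef_[0])))
rf_influence = dict(zip(names, rf.feature_importances_))
print(max(lr_influence, key=lr_influence.get),
      max(rf_influence, key=rf_influence.get))
```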

5. Recommended Models

After training and evaluating the models, it is hard to identify a single best model. We would choose AdaBoost Model 16 as the best overall model, as it combines a low Type II error count with decent Type I error performance. As a benchmark we define an ensemble of models that predicts based on the majority of the models' votes. For the ensemble we include the following models, each the best in terms of:

  • Best overall: Model 16 - AdaBoost

  • Precision: Model 38 - Decision Tree

  • Type I error: Model 17 - Random Forest

  • Type II error: Model 15 - Naive Bayes

  • F4 score: Model 0 - Logistic Regression with L1 Regularization

  • F1 score: Model 41 - Random Forest

Best Model - Overall

The AdaBoost Model 16 shows a low Type II error count but also a decent Type I error performance, good accuracy and robust performance across different metrics.

Model 16 - AdaBoost
Resampling SMOTE
Scoring f2
Model AdaBoost
Precision 0.100932
Recall 0.866667
Accuracy 0.615786
F1 0.180807
F2 0.34428
F4 0.599241
True 1 65
True 0 879
False 1 579
False 0 10

Confusion Matrix and Feature Influence for AdaBoost Model 16

Best Model - Precision

The Decision Tree Model 38 provides the highest Precision score, good accuracy and robust performance across different metrics.

Model 38 - Decision Tree
Resampling Undersampling
Scoring f2
Model Decision Tree
Precision 0.120408
Recall 0.786667
Accuracy 0.708415
F1 0.20885
F2 0.373418
F4 0.593491
True 1 59
True 0 1027
False 1 431
False 0 16

Confusion Matrix and Feature Influence for Decision Tree Model 38

Best Model - Type I error

The Random Forest Model 17 shows the lowest Type I error count, good explainability and good accuracy.

Model 17 - Random Forest
Resampling SMOTE
Scoring f2
Model Random Forest
Precision 0.08377
Recall 0.213333
Accuracy 0.847358
F1 0.120301
F2 0.162933
F4 0.195543
True 1 16
True 0 1283
False 1 175
False 0 59

Confusion Matrix and Feature Influence for Random Forest Model 17

Best Model - Type II error

The Naive Bayes Model 15 classifier achieves the best recall performance, which helps avoid Type II errors.

Model 15 - Naive Bayes
Resampling SMOTE
Scoring f2
Model Naive Bayes
Precision 0.072336
Recall 0.986667
Accuracy 0.3803
F1 0.134791
F2 0.279667
F4 0.565902
True 1 74
True 0 509
False 1 949
False 0 1

Confusion Matrix and Feature Influence for Naive Bayes Model 15

Best Model - F4 Score

The Logistic Regression Model 0 with L1 regularization provides the highest F4 score, good accuracy and robust performance across different metrics.

Model 0 - Logistic Regression
Resampling SMOTE
Scoring f1
Model Logistic Regression
Precision 0.109565
Recall 0.84
Accuracy 0.658187
F1 0.193846
F2 0.36
F4 0.60338
True 1 63
True 0 946
False 1 512
False 0 12

Confusion Matrix and Feature Influence for Logistic Regression Model 0

Best Model - F1 score

The Random Forest Model 41 offers the best F1 score after the Decision Tree and Logistic Regression Models that have already been selected. It shows good interpretability through the feature importance, making it easier to understand the key drivers of stroke risk.

Model 41 - Random Forest
Resampling Undersampling
Scoring f2
Model Random Forest
Precision 0.107143
Recall 0.76
Accuracy 0.678408
F1 0.187809
F2 0.342548
F4 0.559469
True 1 57
True 0 983
False 1 475
False 0 18

Confusion Matrix and Feature Influence for Random Forest Model 41

Ensemble Voting

All six models were combined in an ensemble vote, with ties decided in favor of stroke risk. The results show only a very minor improvement compared to the overall best AdaBoost Model 16.
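A minimal sketch of such a majority vote with a tie-break toward the positive class (the prediction matrix here is a toy illustration, not the actual model outputs):

```python
import numpy as np

# Toy per-model predictions for 6 test samples (rows: models, cols: samples).
predictions = np.array([
    [1, 0, 1, 0, 1, 0],   # Model 16 - AdaBoost
    [1, 0, 1, 0, 0, 0],   # Model 0  - Logistic Regression
    [1, 1, 1, 1, 1, 0],   # Model 15 - Naive Bayes
    [1, 0, 0, 0, 1, 0],   # Model 38 - Decision Tree
    [0, 0, 0, 0, 0, 0],   # Model 17 - Random Forest
    [1, 0, 1, 0, 1, 0],   # Model 41 - Random Forest
])

# Majority vote; a 3-3 tie resolves in favor of stroke risk (class 1).
votes = predictions.sum(axis=0)
ensemble = (votes >= predictions.shape[0] / 2).astype(int)
print(ensemble.tolist())
```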

ID Model Precision Recall Accuracy F1 F2 F4
0 Model 16 AdaBoost 0.101 0.867 0.616 0.181 0.344 0.599
1 Model 0 Logistic Regression 0.110 0.840 0.658 0.194 0.360 0.603
2 Model 15 Naive Bayes 0.072 0.987 0.380 0.135 0.280 0.566
3 Model 38 Decision Tree 0.120 0.787 0.708 0.209 0.373 0.593
4 Model 17 Random Forest 0.084 0.213 0.847 0.120 0.163 0.196
5 Model 41 Random Forest 0.107 0.760 0.678 0.188 0.343 0.559
6 Model E Ensemble 0.101 0.867 0.616 0.181 0.345 0.600
ID Model F4 True 1 True 0 False 1 False 0
0 Model 16 AdaBoost 0.599 65 879 579 10
1 Model 0 Logistic Regression 0.603 63 946 512 12
2 Model 15 Naive Bayes 0.566 74 509 949 1
3 Model 38 Decision Tree 0.593 59 1027 431 16
4 Model 17 Random Forest 0.196 16 1283 175 59
5 Model 41 Random Forest 0.559 57 983 475 18
6 Model E Ensemble 0.600 65 880 578 10

6. Key Findings and Insights

The analysis of the stroke prediction model has revealed several critical factors that significantly influence the likelihood of stroke in patients. Understanding these drivers allows for better-targeted interventions and more effective prevention strategies. However, to further enhance the accuracy and reliability of the model, additional data and features are essential.

Main Drivers influencing Stroke Risk

  • Age: Older patients have a higher risk of stroke. This finding underscores the importance of age-related health monitoring and interventions, as the likelihood of experiencing a stroke increases with age, necessitating enhanced medical vigilance for the elderly.

  • Hypertension: Presence of hypertension increases stroke risk. Hypertension, or high blood pressure, is a well-established risk factor for stroke, emphasizing the need for strict blood pressure control through medication, lifestyle changes, and regular monitoring.

  • Heart Disease: Patients with heart disease are more likely to experience a stroke. The strong correlation between cardiovascular conditions and stroke highlights the necessity for comprehensive care plans that address both heart disease management and stroke prevention.

  • Average Glucose Level: Higher average glucose levels are associated with Diabetes and increase stroke risk. Elevated glucose levels indicate poor diabetes control, which can lead to vascular damage and increased stroke risk, highlighting the importance of maintaining optimal glucose levels through diet, exercise, and medication adherence.

Insights

  • Preventive Measures: Targeted interventions for patients with hypertension and heart disease could reduce stroke incidence. Implementing comprehensive care plans that include lifestyle modifications, medication adherence, and regular health check-ups is crucial for mitigating stroke risk in these high-risk populations.

  • Public Health Strategies: Programs aimed at managing blood glucose levels and promoting healthy aging could be beneficial. Public health initiatives should focus on widespread screening for diabetes and hypertension, coupled with campaigns that encourage physical activity, healthy eating, and smoking cessation to reduce stroke risk at a population level.

  • Holistic Health Approach: Adopting a holistic approach that considers the interplay between various risk factors can enhance stroke prevention efforts. By addressing lifestyle factors such as diet, exercise, and stress management, healthcare providers can simultaneously mitigate risks associated with hypertension, heart disease, and diabetes, leading to better overall health outcomes.

  • Technology and Monitoring: Leveraging technology, such as wearable devices and telemedicine, can aid in the continuous monitoring of at-risk individuals. These technologies provide real-time data on blood pressure, glucose levels, and heart rate, allowing for timely interventions and personalized care plans that can significantly reduce stroke risk.

  • Education and Awareness: Raising awareness about the risk factors and preventive measures for stroke is crucial. Public health campaigns and educational programs should aim to inform individuals about the importance of regular health screenings, recognizing early symptoms of stroke, and seeking immediate medical attention, empowering them to take proactive steps towards stroke prevention.

Future Directions

To further improve the understanding and prediction of stroke risk, it is imperative to gather more comprehensive data and incorporate additional features into the model. Including a wider range of demographic, genetic, and lifestyle factors can provide a more nuanced view of stroke risk. Additionally, longitudinal data tracking patients over time could offer insights into how risk factors evolve and interact. By expanding the dataset and refining the features used, we can develop more accurate and robust models that enhance our ability to prevent and manage stroke.

7. Suggestions for Next Steps

  • Feature Enhancement: Incorporate additional health-related features, such as cholesterol levels, physical activity, and diet, to improve the model's predictive performance. Including more comprehensive lifestyle and biometric data can help create a more accurate and holistic risk assessment for stroke.

  • Longitudinal Data: Utilize longitudinal data to track changes in patient health over time, which can provide deeper insights into the progression and interaction of risk factors. Longitudinal studies allow for the observation of how individual risk profiles evolve, leading to more precise and personalized predictions.

  • Assessment of Existing Studies and Scores: Evaluate and integrate findings from established studies and scoring systems, such as the Framingham Heart Study and the CHA₂DS₂-VASc score, which are widely used for predicting cardiovascular and stroke risk. Comparing our model's performance with these well-regarded benchmarks can provide validation and highlight areas for improvement. Additionally, exploring datasets from these studies can offer valuable insights and potential features to enhance our model.

  • Collaborative Research: Engage in collaborative research with other institutions and researchers to leverage a broader range of expertise and datasets. By pooling resources and knowledge, we can develop more robust and generalizable models that are applicable across diverse populations.

  • Validation Across Diverse Populations: Test and validate the model across different demographic and geographic populations to ensure its applicability and reliability. Understanding how the model performs in various contexts can help identify any biases or limitations, leading to more equitable and effective stroke risk prediction tools.

  • Model Re-evaluation: Regularly update and re-evaluate the model as new data becomes available to ensure its continued relevance and accuracy. Incorporating the latest research findings and medical advancements will help maintain the model's effectiveness in predicting stroke risk.
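The CHA₂DS₂-VASc score suggested above as a benchmark is a simple additive point scheme, which makes it straightforward to compare against a learned model. A minimal sketch of the published point assignments (the function name and argument layout are illustrative):

```python
def cha2ds2_vasc(age: int, female: bool, chf: bool, hypertension: bool,
                 diabetes: bool, prior_stroke_tia: bool,
                 vascular_disease: bool) -> int:
    """Illustrative tally of the published CHA2DS2-VASc point scheme."""
    score = 0
    score += 1 if chf else 0                                    # C: congestive heart failure
    score += 1 if hypertension else 0                           # H: hypertension
    score += 2 if age >= 75 else (1 if 65 <= age < 75 else 0)   # A2 / A: age bands
    score += 1 if diabetes else 0                               # D: diabetes mellitus
    score += 2 if prior_stroke_tia else 0                       # S2: prior stroke/TIA
    score += 1 if vascular_disease else 0                       # V: vascular disease
    score += 1 if female else 0                                 # Sc: sex category
    return score

# e.g. a 70-year-old woman with hypertension: 1 (age 65-74) + 1 + 1 = 3
print(cha2ds2_vasc(70, True, False, True, False, False, False))  # 3
```

Scoring the dataset's patients with such a rule-based baseline would give a clinically grounded reference point for judging whether the trained models add predictive value.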