
Stroke Prediction Analysis - Regression, Tree and Boosting Models with Regularization and Resampling

Summary

The primary objective of this analysis is to develop a predictive model for stroke risk using various health and demographic attributes, focusing on maximizing recall to minimize missed stroke cases. The dataset, sourced from Kaggle, contains information on 5,110 patients with attributes such as age, gender, hypertension, heart disease, and smoking status. Data exploration revealed age, hypertension, and heart disease as significant predictors of stroke risk. Various classification models were evaluated, with the AdaBoost model showing the best overall performance in terms of balancing true and false positives. The analysis recommends incorporating additional health-related features and using longitudinal data to improve model accuracy. Future work should also focus on validating the model across diverse populations and regularly updating it with new data to maintain its relevance and accuracy.

1. Main Objective of the Analysis

The primary objective of this analysis is to build a model that predicts the likelihood of stroke in patients based on various health and demographic attributes. The best model should maximize recall while not wrongly classifying too many healthy patients as at risk, since the cost of misclassifying a patient at stroke risk as healthy far outweighs the cost of misclassifying a healthy patient as at risk. The analysis aims to provide insights for stroke risk classification that allow:

  • Early Identification: Helping healthcare providers identify individuals at higher risk of stroke for timely intervention.
  • Resource Allocation: Assisting in the efficient allocation of medical resources to those most in need.
  • Preventive Measures: Providing insights for developing targeted preventive measures to reduce the incidence of strokes.

2. Dataset Description

Dataset Overview

The Stroke Prediction dataset is available on Kaggle (URL: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset) and contains medical and demographic attributes of patients.

Attributes

Attribute Description
id Unique identifier for each patient.
gender Gender of the patient (Male, Female, Other).
age Age of the patient.
hypertension Whether the patient has hypertension (0: No, 1: Yes).
heart_disease Whether the patient has heart disease (0: No, 1: Yes).
ever_married Marital status of the patient (No, Yes).
work_type Type of occupation (children, Govt_job, Never_worked, Private, Self-employed).
residence_type Type of residence (Rural, Urban).
avg_glucose_level Average glucose level in the blood.
bmi Body mass index.
smoking_status Smoking status (formerly smoked, never smoked, smokes, Unknown).
stroke Target variable indicating whether the patient had a stroke (0: No, 1: Yes).
  • Number of Instances: 5,110

  • Number of Features: 11

  • Target Variable: stroke (0: No Stroke, 1: Stroke)

Analysis Approach

The analysis aims to:

  • Explore and visualize the data to understand the distribution of attributes and identify any missing or anomalous values.
  • Engineer features and prepare the data for modeling.
  • Train multiple classifier models to predict stroke risk and evaluate the performance of the models.
  • Identify the best-performing model based on support for stroke risk management.
  • Provide recommendations for next steps and further optimization.

3. Data Exploration and Cleaning

Data Exploration

Besides the id, the dataset includes the ten features listed above plus the target variable stroke. There are three numerical features: age, avg_glucose_level and bmi; the remaining seven features are categorical. Of the 5,110 observations in the dataset, 4,861 show no stroke and 249 patients had a stroke. The dataset is therefore clearly imbalanced, which has to be addressed before model training. The variables of the dataset and the distribution of the target variable are shown below.

count unique top freq mean std min 25% 50% 75% max
gender 5110 3 Female 2994 NaN NaN NaN NaN NaN NaN NaN
age 5110.0 NaN NaN NaN 43.23 22.61 0.08 25.0 45.0 61.0 82.0
hypertension 5110.0 NaN NaN NaN 0.1 0.3 0.0 0.0 0.0 0.0 1.0
heart_disease 5110.0 NaN NaN NaN 0.05 0.23 0.0 0.0 0.0 0.0 1.0
ever_married 5110 2 Yes 3353 NaN NaN NaN NaN NaN NaN NaN
work_type 5110 5 Private 2925 NaN NaN NaN NaN NaN NaN NaN
Residence_type 5110 2 Urban 2596 NaN NaN NaN NaN NaN NaN NaN
avg_glucose_level 5110.0 NaN NaN NaN 106.15 45.28 55.12 77.24 91.88 114.09 271.74
bmi 4909.0 NaN NaN NaN 28.89 7.85 10.3 23.5 28.1 33.1 97.6
smoking_status 5110 4 never smoked 1892 NaN NaN NaN NaN NaN NaN NaN
stroke 5110.0 NaN NaN NaN 0.05 0.22 0.0 0.0 0.0 0.0 1.0

Distribution of stroke cases in the dataset

To analyze the distribution and correlation of the data we prepared a set of four plots for each variable, depending on its type, as follows:

  • Numerical Variables: The overall distribution of the variable, the distribution of the variable for non stroke observations, the distribution of the variable for stroke observations and the density distribution separated by stroke cases.

  • Categorical Variables: The overall distribution of the variable, the distribution of the variable for non stroke observations, the distribution of the variable for stroke observations and the distribution of stroke cases within the groups of the categorical variable.
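A minimal matplotlib sketch of such a four-panel grid for a numerical variable, using randomly generated stand-in data (the helper name and layout are illustrative, not the original plotting code):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in for the stroke DataFrame (illustrative only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.uniform(1, 82, 500),
    "stroke": rng.integers(0, 2, 500),
})

def plot_numerical(df, col):
    """Four panels: overall, non-stroke, stroke, and density by class."""
    fig, axes = plt.subplots(1, 4, figsize=(16, 3))
    df[col].hist(ax=axes[0]); axes[0].set_title(f"{col} - all")
    df.loc[df.stroke == 0, col].hist(ax=axes[1]); axes[1].set_title(f"{col} - no stroke")
    df.loc[df.stroke == 1, col].hist(ax=axes[2]); axes[2].set_title(f"{col} - stroke")
    for cls, grp in df.groupby("stroke"):
        axes[3].hist(grp[col], density=True, alpha=0.5, label=f"stroke={cls}")
    axes[3].legend(); axes[3].set_title(f"{col} - density by class")
    return fig

fig = plot_numerical(df, "age")
```

The same helper pattern carries over to categorical variables by replacing the histograms with bar charts of group counts.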

The distribution of the target variable is shown below.

Numerical Variables

The graphs for the three numerical variables are shown below. There is no meaningful correlation visible for BMI or the Average Glucose Level. In contrast, Age shows a fairly strong dependency on the target variable.

Distribution of numerical variables

Stroke cases plotted versus Age are not evenly distributed: reported stroke cases become more frequent with increasing age. Body Mass Index and Average Glucose Level do not show an obvious influence on stroke. This observation is confirmed in the pairplot below.

Pairplot of numerical variables

As seen before, there is no meaningful correlation visible for BMI or the Average Glucose Level, but Age again shows a strong dependency: the distribution of stroke cases is clearly skewed toward older ages.

Categorical Variables

The distributions of the seven categorical variables are shown below.

Distribution of categorical variables

If we analyze the group percentages and compare the distributions of the variables for stroke and non-stroke cases, we can identify Hypertension, Heart Disease and Marital Status as potential influences on stroke risk. We analyze the correlations further below.

Correlation Analysis

We can pairplot the entire dataset after encoding, adding some noise to the categorical variables for better illustration.

Pairplot of all variables with random noise

The graphs below show the result of the calculated correlation matrix for the entire dataset next to the correlation values of the variables in relation to the target variable stroke.

Correlation matrix and correlation coefficient for stroke as linear model

The correlation matrix shows Age, Heart Disease, Average Glucose Level and Hypertension as the main independent variables for stroke risk. Ever married shows a strong correlation with age, possibly indicating that marriage was more common in older generations. As such, Ever Married is likely to be a confounder in the context of stroke risk analysis.
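A sketch of how the correlation matrix and the per-feature correlations with stroke can be computed with pandas (the toy data only mimics the age/ever_married relationship for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the encoded stroke DataFrame (illustrative only):
# marriage tracks age, and stroke probability rises with age.
rng = np.random.default_rng(1)
age = rng.uniform(0, 82, 1000)
df = pd.DataFrame({
    "age": age,
    "ever_married": (age > 25).astype(int),
    "stroke": (rng.uniform(0, 82, 1000) < age / 10).astype(int),
})

corr = df.corr()  # full pairwise correlation matrix
# Correlation of each feature with the target, strongest first.
stroke_corr = corr["stroke"].drop("stroke").sort_values(ascending=False)
print(stroke_corr)
```

In such a setup ever_married correlates with stroke mainly through its strong correlation with age, which is exactly the confounding pattern described above.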

Data Cleaning and Feature Engineering

To prepare the data for the further analysis and the modeling phase we will perform the following steps:

  • Handling Missing Values: Address missing values as appropriate.
  • Discretization - Handling of Outliers: Transform continuous variables into categories for improved performance and explainability.
  • Encoding Categorical Variables: Convert categorical variables into numerical format using one-hot encoding.
  • Data Splitting: Split the data into training and testing sets.
  • Feature Scaling: Scale features to ensure they are on a similar scale.
  • Addressing unbalanced data: Balance the classes in the training set.

Handling Missing Values

The only variable with missing values in the dataset is bmi. There are 201 records with a missing BMI value; of these, 40 are stroke cases (stroke=1). We will impute the missing values with the group mean BMI by gender, age and glucose level.
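A sketch of such a group-mean imputation with pandas (the toy frame and its grouping columns are illustrative; the analysis groups by gender, age and glucose level):

```python
import numpy as np
import pandas as pd

# Toy frame with missing bmi values (illustrative stand-in for the dataset).
df = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female", "Female"],
    "age_group": ["40-50", "40-50", "40-50", "65-75", "65-75"],
    "bmi": [27.0, np.nan, 31.0, np.nan, 29.0],
})

# Fill each missing bmi with the mean of its (gender, age_group) cell,
# falling back to the overall mean if a cell has no observed values.
group_mean = df.groupby(["gender", "age_group"])["bmi"].transform("mean")
df["bmi"] = df["bmi"].fillna(group_mean).fillna(df["bmi"].mean())

print(df["bmi"].tolist())
```

Group-mean imputation keeps the filled values plausible for each subpopulation instead of pulling every missing BMI toward the global mean.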

Discretization - Handling of Outliers

By categorizing continuous variables into discrete groups, we can enhance the interpretability of the data and improve the performance of classification models by capturing non-linear relationships and reducing the influence of outliers.

For age, the dataset is divided into categories from 0-18 years, 18-30 years, 30-40 years, 40-50 years, 50-65 years, 65-75 years and more than 75 years. These categories reflect different life stages, which may correlate differently with stroke risk due to varying health behaviors and biological changes.

For BMI (Body Mass Index), the classification follows standard health guidelines: Underweight (BMI < 18.5), Normal Weight (18.5 ≤ BMI < 24.9), Overweight (25 ≤ BMI < 29.9), and Obese (BMI ≥ 30).

For glucose levels, categories are Normal (glucose < 140 mg/dL), Prediabetes (140 ≤ glucose < 200 mg/dL), and Diabetes (glucose ≥ 200 mg/dL).
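With pandas, these cuts can be expressed via pd.cut; the bin edges below mirror the categories described above (the labels and the open upper edges are illustrative choices):

```python
import pandas as pd

# Toy values to discretize (illustrative only).
age = pd.Series([5, 22, 45, 70, 80])
glucose = pd.Series([90.0, 150.0, 230.0])

# Age bins reflecting the life stages described above; bins are right-inclusive.
age_group = pd.cut(age, bins=[0, 18, 30, 40, 50, 65, 75, 200],
                   labels=["0-18", "18-30", "30-40", "40-50", "50-65", "65-75", "75+"])

# Glucose bins following the Normal / Prediabetes / Diabetes thresholds.
glucose_group = pd.cut(glucose, bins=[0, 140, 200, 1000],
                       labels=["Normal", "Prediabetes", "Diabetes"])

print(age_group.tolist())
print(glucose_group.tolist())
```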

The resulting distributions are shown below.

Distribution for grouped variables Age, BMI and Diabetes(Glucose Level)

Encoding categorical variables

To prepare the data for the modeling phase we apply one-hot encoding to all categorical variables. Two categorical variables, hypertension and heart_disease, are already encoded as numbers. After performing the one-hot encoding while dropping the first category of each variable (drop_first=True), the dataset is widened to 17 columns including the target variable.

count mean std min 25% 50% 75% max
hypertension 5110.0 0.097456 0.296607 0.0 0.0 0.0 0.0 1.0
heart_disease 5110.0 0.054012 0.226063 0.0 0.0 0.0 0.0 1.0
stroke 5110.0 0.048728 0.215320 0.0 0.0 0.0 0.0 1.0
glucose_group 5110.0 0.245597 0.595996 0.0 0.0 0.0 0.0 2.0
age_group 5110.0 2.846184 1.915983 0.0 1.0 3.0 4.0 6.0
bmi_group 5110.0 2.028963 0.952761 0.0 1.0 2.0 3.0 3.0
gender_Male 5110.0 0.413894 0.492578 0.0 0.0 0.0 1.0 1.0
gender_Other 5110.0 0.000196 0.013989 0.0 0.0 0.0 0.0 1.0
ever_married_Yes 5110.0 0.656164 0.475034 0.0 0.0 1.0 1.0 1.0
work_type_Never_worked 5110.0 0.004305 0.065480 0.0 0.0 0.0 0.0 1.0
work_type_Private 5110.0 0.572407 0.494778 0.0 0.0 1.0 1.0 1.0
work_type_Self-employed 5110.0 0.160274 0.366896 0.0 0.0 0.0 0.0 1.0
work_type_children 5110.0 0.134442 0.341160 0.0 0.0 0.0 0.0 1.0
Residence_type_Urban 5110.0 0.508023 0.499985 0.0 0.0 1.0 1.0 1.0
smoking_status_formerly smoked 5110.0 0.173190 0.378448 0.0 0.0 0.0 0.0 1.0
smoking_status_never smoked 5110.0 0.370254 0.482920 0.0 0.0 0.0 1.0 1.0
smoking_status_smokes 5110.0 0.154403 0.361370 0.0 0.0 0.0 0.0 1.0
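The encoding step can be sketched with pandas.get_dummies on a toy frame (column names mirror the dataset; the values are illustrative):

```python
import pandas as pd

# Toy frame with two categorical columns and one already-numeric column.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Other", "Female"],
    "smoking_status": ["smokes", "never smoked", "Unknown", "formerly smoked"],
    "hypertension": [0, 1, 0, 0],  # already numeric, left untouched
})

# drop_first=True omits the first category of each variable, so one column
# per variable becomes the implicit baseline (here gender_Female is dropped).
encoded = pd.get_dummies(df, columns=["gender", "smoking_status"], drop_first=True)
print(sorted(encoded.columns))
```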

Data splitting

In the next step we split the data into training and test sets in a 70/30 ratio, while maintaining the class distribution with stratify=y.

Dataset Shape
Training Features (3577, 16)
Test Features (1533, 16)
Training Target (3577,)
Test Target (1533,)
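A sketch of the stratified split on toy data of similar imbalance (the shapes here are illustrative, not the stroke data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and an imbalanced target (~5% positive, like the stroke data).
X = np.arange(1000 * 4).reshape(1000, 4)
y = np.array([1] * 50 + [0] * 950)

# stratify=y keeps the positive-class share identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())  # class ratio preserved in both splits
```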

Feature Scaling

Scaling is crucial for ensuring that algorithms that are sensitive to the scale of the input data perform optimally and produce reliable results. We apply a MinMax scaler to the stroke data to normalize the features to values between 0 and 1.

count mean std min 25% 50% 75% max
hypertension 3577.0 0.096170 0.294865 0.0 0.000000 0.000000 0.000000 1.0
heart_disease 3577.0 0.053956 0.225962 0.0 0.000000 0.000000 0.000000 1.0
glucose_group 3577.0 0.120352 0.295112 0.0 0.000000 0.000000 0.000000 1.0
age_group 3577.0 0.474839 0.317024 0.0 0.166667 0.500000 0.666667 1.0
bmi_group 3577.0 0.675240 0.319805 0.0 0.333333 0.666667 1.000000 1.0
gender_Male 3577.0 0.414593 0.492721 0.0 0.000000 0.000000 1.000000 1.0
gender_Other 3577.0 0.000280 0.016720 0.0 0.000000 0.000000 0.000000 1.0
ever_married_Yes 3577.0 0.660050 0.473758 0.0 0.000000 1.000000 1.000000 1.0
work_type_Never_worked 3577.0 0.003355 0.057831 0.0 0.000000 0.000000 0.000000 1.0
work_type_Private 3577.0 0.571429 0.494941 0.0 0.000000 1.000000 1.000000 1.0
work_type_Self-employed 3577.0 0.164663 0.370928 0.0 0.000000 0.000000 0.000000 1.0
work_type_children 3577.0 0.134470 0.341205 0.0 0.000000 0.000000 0.000000 1.0
Residence_type_Urban 3577.0 0.504054 0.500053 0.0 0.000000 1.000000 1.000000 1.0
smoking_status_formerly smoked 3577.0 0.176405 0.381217 0.0 0.000000 0.000000 0.000000 1.0
smoking_status_never smoked 3577.0 0.370981 0.483135 0.0 0.000000 0.000000 1.000000 1.0
smoking_status_smokes 3577.0 0.149567 0.356696 0.0 0.000000 0.000000 0.000000 1.0
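A sketch of the scaling step, fitting the scaler on the training set only so that no test-set information leaks into training (toy data for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 40.0]])
X_test = np.array([[1.0, 25.0], [5.0, 15.0]])  # may fall outside [0, 1]

# Fit on the training set only, then apply the same transform to the test set.
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.min(axis=0), X_train_s.max(axis=0))  # exactly 0 and 1 per column
print(X_test_s)  # test values can exceed [0, 1] if outside the training range
```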

Addressing unbalanced dataset

The imbalance of the dataset can lead to biased models that are heavily skewed towards predicting the majority class, thereby compromising the model's ability to correctly identify and predict strokes. We will apply SMOTE (Synthetic Minority Over-sampling Technique) and alternatively Random Undersampling to the stroke dataset to address the significant class imbalance. This should enhance recall results, which is critical for developing predictive models in healthcare.
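SMOTE and random undersampling are typically applied via the imbalanced-learn package. The self-contained numpy sketch below illustrates the two ideas in simplified form (real SMOTE interpolates toward one of the k nearest minority neighbours rather than a random minority partner):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy imbalanced data: 200 majority (y=0) and 20 minority (y=1) samples.
X = np.vstack([rng.normal(0, 1, (200, 3)), rng.normal(3, 1, (20, 3))])
y = np.array([0] * 200 + [1] * 20)

def random_undersample(X, y, rng):
    """Drop majority samples until both classes have the minority count."""
    keep1 = np.flatnonzero(y == 1)
    keep0 = rng.choice(np.flatnonzero(y == 0), size=keep1.size, replace=False)
    idx = np.concatenate([keep0, keep1])
    return X[idx], y[idx]

def smote_like_oversample(X, y, rng):
    """Simplified SMOTE: interpolate between random minority pairs."""
    minority = X[y == 1]
    n_new = (y == 0).sum() - (y == 1).sum()
    a = minority[rng.integers(0, len(minority), n_new)]
    b = minority[rng.integers(0, len(minority), n_new)]
    synthetic = a + rng.uniform(0, 1, (n_new, 1)) * (b - a)
    return np.vstack([X, synthetic]), np.concatenate([y, np.ones(n_new, dtype=int)])

X_u, y_u = random_undersample(X, y, rng)
X_o, y_o = smote_like_oversample(X, y, rng)
print(np.bincount(y_u), np.bincount(y_o))  # balanced class counts either way
```

Resampling is applied to the training split only; the test set keeps its natural imbalance so that the reported metrics remain realistic.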

4. Model Training

During model training we will evaluate six classification models as listed below. To optimize these models, we will tune hyperparameters using GridSearchCV with a 5-fold cross-validation. This approach ensures that our models are robust and generalize well to unseen data. In addition to hyperparameter tuning, we will explore different resampling methods to address class imbalance, specifically using SMOTE (Synthetic Minority Over-sampling Technique) and random undersampling. To comprehensively evaluate model performance, we will vary the scoring metrics used in GridSearchCV, including F1 score, recall, and F-beta scores with beta values of 2 and 4. These varied scoring metrics will help us assess the models' ability to balance precision and recall, particularly emphasizing recall with the higher beta values in the F-beta score. This extensive evaluation process aims to identify the most effective model scoring and resampling strategy for predicting stroke risk.
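The combination of GridSearchCV with a custom F-beta scorer can be sketched roughly as follows (the dataset and parameter grid are illustrative, not the ones used in the analysis):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Toy imbalanced classification problem (~10% positive class).
X, y = make_classification(n_samples=400, weights=[0.9], random_state=0)

# F-beta with beta=4 weights recall far more heavily than precision.
f4_scorer = make_scorer(fbeta_score, beta=4)

grid = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring=f4_scorer,   # swap in "f1", "recall", or a beta=2 scorer as needed
    cv=5,                # 5-fold cross-validation
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Running the same grid once per scorer and per resampling method yields the 48 combinations evaluated below.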

Classification Models

The following classification models have been selected for the analysis.

  1. Logistic Regression with Regularization: Used as a baseline model for its simplicity and interpretability.
  2. KNN: A non-parametric method used for classification by comparing a test sample to the 'k' nearest neighbors in the feature space.
  3. Decision Tree: A model that uses a tree-like graph of decisions and their possible consequences, known for its simplicity and ability to handle non-linear relationships.
  4. Naive Bayes: A probabilistic classifier based on Bayes' theorem, assuming independence between predictors. It's particularly effective for problems with categorical input data.
  5. AdaBoost: An ensemble method that combines multiple weak classifiers to create a strong classifier. It works by iteratively training classifiers and adjusting their weights to focus on misclassified instances, improving overall model performance.
  6. Random Forest: Another ensemble method that constructs multiple decision trees during training and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees. This method controls overfitting by averaging multiple decision trees, each built from a random subset of the training data and features.

Model Evaluation

The models were evaluated using the same training and test splits to ensure a fair comparison. The following evaluation methods were used:

Performance Indicators

  • Accuracy
  • Precision
  • Recall
  • F1 score as F1
  • FBeta score for beta=2 as F2
  • FBeta score for beta=4 as F4

Confusion Matrix

  • True positive (True 1) and false positive (False 1) counts
  • True negative (True 0) and false negative (False 0) counts
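These indicators can be recomputed directly from the confusion counts. As a sanity check, the following sketch reproduces the metrics from the counts reported later for AdaBoost Model 16:

```python
# Confusion-matrix counts reported for AdaBoost Model 16 (TP=65, TN=879, FP=579, FN=10).
tp, tn, fp, fn = 65, 879, 579, 10

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

def fbeta(p, r, beta):
    """F-beta score: weighted harmonic mean of precision and recall,
    with recall weighted by beta**2."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(round(precision, 4), round(recall, 4), round(accuracy, 4))
# → 0.1009 0.8667 0.6158
print(round(fbeta(precision, recall, 1), 4),
      round(fbeta(precision, recall, 2), 4),
      round(fbeta(precision, recall, 4), 4))
# → 0.1808 0.3443 0.5992, matching the F1/F2/F4 values in the table
```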

The results of these metrics for a total of 48 different combinations of resampling method, cross validation scoring and model are shown in the tables below sorted by the F4 score. In each combination the best hyperparameters for the model were determined before calculating the performance indicators.

Results - Performance Indicators

Resampling Scoring Model Precision Recall Accuracy F1 F2 F4
0 SMOTE f1 Logistic Regression 0.1096 0.8400 0.6582 0.1938 0.3600 0.6034
6 SMOTE recall Logistic Regression 0.1096 0.8400 0.6582 0.1938 0.3600 0.6034
12 SMOTE f2 Logistic Regression 0.1096 0.8400 0.6582 0.1938 0.3600 0.6034
18 SMOTE f4 Logistic Regression 0.1096 0.8400 0.6582 0.1938 0.3600 0.6034
16 SMOTE f2 AdaBoost 0.1009 0.8667 0.6158 0.1808 0.3443 0.5992
22 SMOTE f4 AdaBoost 0.0988 0.8667 0.6067 0.1774 0.3392 0.5947
38 Undersampling f2 Decision Tree 0.1204 0.7867 0.7084 0.2088 0.3734 0.5935
44 Undersampling f4 Decision Tree 0.1204 0.7867 0.7084 0.2088 0.3734 0.5935
32 Undersampling recall Decision Tree 0.1204 0.7867 0.7084 0.2088 0.3734 0.5935
26 Undersampling f1 Decision Tree 0.1204 0.7867 0.7084 0.2088 0.3734 0.5935
40 Undersampling f2 AdaBoost 0.0973 0.8667 0.6001 0.1750 0.3357 0.5915
42 Undersampling f4 Logistic Regression 0.0973 0.8667 0.6001 0.1750 0.3357 0.5915
34 Undersampling recall AdaBoost 0.0973 0.8667 0.6001 0.1750 0.3357 0.5915
30 Undersampling recall Logistic Regression 0.0973 0.8667 0.6001 0.1750 0.3357 0.5915
46 Undersampling f4 AdaBoost 0.0973 0.8667 0.6001 0.1750 0.3357 0.5915
28 Undersampling f1 AdaBoost 0.0973 0.8667 0.6001 0.1750 0.3357 0.5915
36 Undersampling f2 Logistic Regression 0.0973 0.8667 0.6001 0.1750 0.3357 0.5915
24 Undersampling f1 Logistic Regression 0.0973 0.8667 0.6001 0.1750 0.3357 0.5915
10 SMOTE recall AdaBoost 0.0973 0.8667 0.6001 0.1750 0.3357 0.5915
3 SMOTE f1 Naive Bayes 0.0808 0.9467 0.4703 0.1488 0.3011 0.5806
15 SMOTE f2 Naive Bayes 0.0723 0.9867 0.3803 0.1348 0.2797 0.5659
39 Undersampling f2 Naive Bayes 0.0711 0.9867 0.3686 0.1326 0.2759 0.5614
27 Undersampling f1 Naive Bayes 0.0841 0.8667 0.5316 0.1533 0.3029 0.5601
47 Undersampling f4 Random Forest 0.1071 0.7600 0.6784 0.1878 0.3425 0.5595
41 Undersampling f2 Random Forest 0.1071 0.7600 0.6784 0.1878 0.3425 0.5595
29 Undersampling f1 Random Forest 0.1071 0.7600 0.6784 0.1878 0.3425 0.5595
43 Undersampling f4 K-Nearest Neighbors 0.0988 0.7867 0.6386 0.1756 0.3289 0.5582
25 Undersampling f1 K-Nearest Neighbors 0.0988 0.7867 0.6386 0.1756 0.3289 0.5582
31 Undersampling recall K-Nearest Neighbors 0.0988 0.7867 0.6386 0.1756 0.3289 0.5582
37 Undersampling f2 K-Nearest Neighbors 0.0988 0.7867 0.6386 0.1756 0.3289 0.5582
35 Undersampling recall Random Forest 0.1059 0.7600 0.6745 0.1860 0.3401 0.5575
21 SMOTE f4 Naive Bayes 0.0672 0.9867 0.3294 0.1259 0.2641 0.5467
45 Undersampling f4 Naive Bayes 0.0661 0.9867 0.3170 0.1238 0.2606 0.5422
4 SMOTE f1 AdaBoost 0.0966 0.7600 0.6406 0.1714 0.3202 0.5413
33 Undersampling recall Naive Bayes 0.0651 0.9867 0.3059 0.1221 0.2575 0.5383
9 SMOTE recall Naive Bayes 0.0562 0.9867 0.1885 0.1063 0.2288 0.4998
19 SMOTE f4 K-Nearest Neighbors 0.0911 0.4667 0.7462 0.1525 0.2558 0.3756
7 SMOTE recall K-Nearest Neighbors 0.0911 0.4667 0.7462 0.1525 0.2558 0.3756
13 SMOTE f2 K-Nearest Neighbors 0.0911 0.4667 0.7462 0.1525 0.2558 0.3756
1 SMOTE f1 K-Nearest Neighbors 0.0985 0.4400 0.7756 0.1610 0.2598 0.3655
8 SMOTE recall Decision Tree 0.0930 0.2667 0.8369 0.1379 0.1942 0.2403
14 SMOTE f2 Decision Tree 0.0930 0.2667 0.8369 0.1379 0.1942 0.2403
20 SMOTE f4 Decision Tree 0.0930 0.2667 0.8369 0.1379 0.1942 0.2403
2 SMOTE f1 Decision Tree 0.0930 0.2667 0.8369 0.1379 0.1942 0.2403
11 SMOTE recall Random Forest 0.0941 0.2533 0.8441 0.1372 0.1892 0.2304
23 SMOTE f4 Random Forest 0.0941 0.2533 0.8441 0.1372 0.1892 0.2304
17 SMOTE f2 Random Forest 0.0838 0.2133 0.8474 0.1203 0.1629 0.1955
5 SMOTE f1 Random Forest 0.0789 0.2000 0.8467 0.1132 0.1531 0.1835

Performance Results for SMOTE resampling and different cross validation scoring

Performance Results for Random Undersampling resampling and different cross validation scoring

The models were optimized with different scoring criteria, focusing on recall and a weighted F-beta score, to favor avoiding the misclassification of true stroke-risk patients. The results in the tables are sorted by the F4 score (F-beta with beta=4). They show the best performance for Logistic Regression models trained on SMOTE-resampled data, followed by two AdaBoost models also trained on SMOTE-resampled data, ahead of four Decision Tree models trained on randomly undersampled data. Regarding the resampling method, Decision Tree, Random Forest and KNN models perform better when trained on randomly undersampled data. In contrast, Logistic Regression, Naive Bayes and AdaBoost show better results when trained with SMOTE oversampling.

Results - Confusion Matrix

Resampling Scoring Model F4 True 1 True 0 False 1 False 0
0 SMOTE f1 Logistic Regression 0.6034 63 946 512 12
6 SMOTE recall Logistic Regression 0.6034 63 946 512 12
12 SMOTE f2 Logistic Regression 0.6034 63 946 512 12
18 SMOTE f4 Logistic Regression 0.6034 63 946 512 12
16 SMOTE f2 AdaBoost 0.5992 65 879 579 10
22 SMOTE f4 AdaBoost 0.5947 65 865 593 10
38 Undersampling f2 Decision Tree 0.5935 59 1027 431 16
44 Undersampling f4 Decision Tree 0.5935 59 1027 431 16
32 Undersampling recall Decision Tree 0.5935 59 1027 431 16
26 Undersampling f1 Decision Tree 0.5935 59 1027 431 16
40 Undersampling f2 AdaBoost 0.5915 65 855 603 10
42 Undersampling f4 Logistic Regression 0.5915 65 855 603 10
34 Undersampling recall AdaBoost 0.5915 65 855 603 10
30 Undersampling recall Logistic Regression 0.5915 65 855 603 10
46 Undersampling f4 AdaBoost 0.5915 65 855 603 10
28 Undersampling f1 AdaBoost 0.5915 65 855 603 10
36 Undersampling f2 Logistic Regression 0.5915 65 855 603 10
24 Undersampling f1 Logistic Regression 0.5915 65 855 603 10
10 SMOTE recall AdaBoost 0.5915 65 855 603 10
3 SMOTE f1 Naive Bayes 0.5806 71 650 808 4
15 SMOTE f2 Naive Bayes 0.5659 74 509 949 1
39 Undersampling f2 Naive Bayes 0.5614 74 491 967 1
27 Undersampling f1 Naive Bayes 0.5601 65 750 708 10
47 Undersampling f4 Random Forest 0.5595 57 983 475 18
41 Undersampling f2 Random Forest 0.5595 57 983 475 18
29 Undersampling f1 Random Forest 0.5595 57 983 475 18
43 Undersampling f4 K-Nearest Neighbors 0.5582 59 920 538 16
25 Undersampling f1 K-Nearest Neighbors 0.5582 59 920 538 16
31 Undersampling recall K-Nearest Neighbors 0.5582 59 920 538 16
37 Undersampling f2 K-Nearest Neighbors 0.5582 59 920 538 16
35 Undersampling recall Random Forest 0.5575 57 977 481 18
21 SMOTE f4 Naive Bayes 0.5467 74 431 1027 1
45 Undersampling f4 Naive Bayes 0.5422 74 412 1046 1
4 SMOTE f1 AdaBoost 0.5413 57 925 533 18
33 Undersampling recall Naive Bayes 0.5383 74 395 1063 1
9 SMOTE recall Naive Bayes 0.4998 74 215 1243 1
19 SMOTE f4 K-Nearest Neighbors 0.3756 35 1109 349 40
7 SMOTE recall K-Nearest Neighbors 0.3756 35 1109 349 40
13 SMOTE f2 K-Nearest Neighbors 0.3756 35 1109 349 40
1 SMOTE f1 K-Nearest Neighbors 0.3655 33 1156 302 42
8 SMOTE recall Decision Tree 0.2403 20 1263 195 55
14 SMOTE f2 Decision Tree 0.2403 20 1263 195 55
20 SMOTE f4 Decision Tree 0.2403 20 1263 195 55
2 SMOTE f1 Decision Tree 0.2403 20 1263 195 55
11 SMOTE recall Random Forest 0.2304 19 1275 183 56
23 SMOTE f4 Random Forest 0.2304 19 1275 183 56
17 SMOTE f2 Random Forest 0.1955 16 1283 175 59
5 SMOTE f1 Random Forest 0.1835 15 1283 175 60

Confusion Matrices for SMOTE and f2 cross validation scoring

Confusion Matrices for Random Undersampling and f2 cross validation scoring

If we aimed purely to minimize the Type II error, that is, to avoid misclassifying true stroke-risk cases, we would find the smallest Type II error counts in the Naive Bayes models. Model 15, for example, misclassifies only one true stroke case of the test dataset as stroke=0. Unfortunately, this strength in finding true stroke cases comes at the cost of a large number of Type I errors, i.e. non-stroke patients identified as stroke-risk patients: Naive Bayes Model 15 produced 949 Type I errors out of 1,458 non-stroke cases. The Logistic Regression models deliver a more balanced result. The best of the eight Logistic Regression models, Model 0, produced 12 Type II errors out of 75 stroke cases in the test data and 512 Type I errors out of 1,458 non-stroke cases. The AdaBoost and Decision Tree models also show a good balance of Type II versus Type I error counts: AdaBoost Model 16 produced 10 Type II errors and 579 Type I errors, while Decision Tree Model 38 produced 16 Type II errors but only 431 Type I errors.

Results - Influence of features

To understand how the different algorithms used the feature values to build a classification model, we select the best-performing model for each algorithm and extract the available information about feature influence. The results are shown in the bar graphs above the model-specific confusion matrices below.

Feature Influence for Best in Class Models

Confusion Matrix for Best in Class Models

We can see that age is the predominant influencing feature for all models. Logistic Regression Model 0 uses L1 regularization and eliminates most of the other features. The Random Forest and Naive Bayes models use a wider range of features for their classification and allow a better explanation than age alone. From the exploratory analysis we would have concluded that age is the most important factor, but that BMI, diabetes condition and heart disease also influence the stroke risk distribution. The feature extraction does not contradict this; it mainly confirms the dominant influence of age.
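A rough sketch of how such feature influences can be extracted from scikit-learn estimators (linear models expose signed coefficients via coef_, tree ensembles expose feature_importances_); the data and feature names are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Toy classification data standing in for the encoded stroke features.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
names = [f"feat_{i}" for i in range(5)]

# L1-regularized logistic regression tends to zero out weak features;
# random forests spread importance across features used in their splits.
lr = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

lr_influence = dict(zip(names, np.abs(lr.coef_[0])))
rf_influence = dict(zip(names, rf.feature_importances_))
print(max(lr_influence, key=lr_influence.get),
      max(rf_influence, key=rf_influence.get))
```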

5. Recommended Models

After training and evaluating the models, it is hard to identify a single best model. We would choose AdaBoost Model 16 as the best overall model, as it combines a low Type II error count with decent Type I error performance. As a benchmark we define an ensemble of models that predicts based on the majority of the models' votes. For the ensemble we include the following models, each the best in terms of:

  • Best overall: Model 16 - AdaBoost

  • Precision: Model 38 - Decision Tree

  • Type I error: Model 17 - Random Forest

  • Type II error: Model 15 - Naive Bayes

  • F4 score: Model 0 - Logistic Regression with L1 Regularization

  • F1 score: Model 41 - Random Forest

Best Model - Overall

The AdaBoost Model 16 shows a low Type II error count but also a decent Type I error performance, good accuracy and robust performance across different metrics.

Model 16 - AdaBoost
Resampling SMOTE
Scoring f2
Model AdaBoost
Precision 0.100932
Recall 0.866667
Accuracy 0.615786
F1 0.180807
F2 0.34428
F4 0.599241
True 1 65
True 0 879
False 1 579
False 0 10

Confusion Matrix and Feature Influence for AdaBoost Model 16

Best Model - Precision

The Decision Tree Model 38 provides the highest Precision score, good accuracy and robust performance across different metrics.

Model 38 - Decision Tree
Resampling Undersampling
Scoring f2
Model Decision Tree
Precision 0.120408
Recall 0.786667
Accuracy 0.708415
F1 0.20885
F2 0.373418
F4 0.593491
True 1 59
True 0 1027
False 1 431
False 0 16

Confusion Matrix and Feature Influence for Decision Tree Model 38

Best Model - Type I error

The Random Forest Model 17 shows the lowest Type I error count, good explainability and good accuracy.

Model 17 - Random Forest
Resampling SMOTE
Scoring f2
Model Random Forest
Precision 0.08377
Recall 0.213333
Accuracy 0.847358
F1 0.120301
F2 0.162933
F4 0.195543
True 1 16
True 0 1283
False 1 175
False 0 59

Confusion Matrix and Feature Influence for Random Forest Model 17

Best Model - Type II error

The Naive Bayes Model 15 classifier achieves the best recall performance, which helps avoid Type II errors.

Model 15 - Naive Bayes
Resampling SMOTE
Scoring f2
Model Naive Bayes
Precision 0.072336
Recall 0.986667
Accuracy 0.3803
F1 0.134791
F2 0.279667
F4 0.565902
True 1 74
True 0 509
False 1 949
False 0 1

Confusion Matrix and Feature Influence for Naive Bayes Model 15

Best Model - F4 Score

The Logistic Regression Model 0 with L1 regularization provides the highest F4 score, good accuracy and robust performance across different metrics.

Model 0 - Logistic Regression
Resampling SMOTE
Scoring f1
Model Logistic Regression
Precision 0.109565
Recall 0.84
Accuracy 0.658187
F1 0.193846
F2 0.36
F4 0.60338
True 1 63
True 0 946
False 1 512
False 0 12

Confusion Matrix and Feature Influence for Logistic Regression Model 0

Best Model - F1 score

The Random Forest Model 41 offers the best F1 score after the Decision Tree and Logistic Regression Models that have already been selected. It shows good interpretability through the feature importance, making it easier to understand the key drivers of stroke risk.

Model 41 - Random Forest
Resampling Undersampling
Scoring f2
Model Random Forest
Precision 0.107143
Recall 0.76
Accuracy 0.678408
F1 0.187809
F2 0.342548
F4 0.559469
True 1 57
True 0 983
False 1 475
False 0 18

Confusion Matrix and Feature Influence for Random Forest Model 41

Ensemble Voting

All six models were combined in an ensemble vote, with ties decided in favor of stroke risk. The results show only a very minor improvement compared to the overall best AdaBoost Model 16.
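A minimal sketch of such a majority vote with a tie-break toward the positive class (the prediction matrix here is a toy illustration, not the actual model outputs):

```python
import numpy as np

# Toy per-model predictions for 6 test samples (rows: models, cols: samples).
predictions = np.array([
    [1, 0, 1, 0, 1, 0],   # Model 16 - AdaBoost
    [1, 0, 1, 0, 0, 0],   # Model 0  - Logistic Regression
    [1, 1, 1, 1, 1, 0],   # Model 15 - Naive Bayes
    [1, 0, 0, 0, 1, 0],   # Model 38 - Decision Tree
    [0, 0, 0, 0, 0, 0],   # Model 17 - Random Forest
    [1, 0, 1, 0, 1, 0],   # Model 41 - Random Forest
])

# Majority vote; a 3-3 tie resolves in favor of stroke risk (class 1).
votes = predictions.sum(axis=0)
ensemble = (votes >= predictions.shape[0] / 2).astype(int)
print(ensemble.tolist())
```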

ID Model Precision Recall Accuracy F1 F2 F4
0 Model 16 AdaBoost 0.101 0.867 0.616 0.181 0.344 0.599
1 Model 0 Logistic Regression 0.110 0.840 0.658 0.194 0.360 0.603
2 Model 15 Naive Bayes 0.072 0.987 0.380 0.135 0.280 0.566
3 Model 38 Decision Tree 0.120 0.787 0.708 0.209 0.373 0.593
4 Model 17 Random Forest 0.084 0.213 0.847 0.120 0.163 0.196
5 Model 41 Random Forest 0.107 0.760 0.678 0.188 0.343 0.559
6 Model E Ensemble 0.101 0.867 0.616 0.181 0.345 0.600
ID Model F4 True 1 True 0 False 1 False 0
0 Model 16 AdaBoost 0.599 65 879 579 10
1 Model 0 Logistic Regression 0.603 63 946 512 12
2 Model 15 Naive Bayes 0.566 74 509 949 1
3 Model 38 Decision Tree 0.593 59 1027 431 16
4 Model 17 Random Forest 0.196 16 1283 175 59
5 Model 41 Random Forest 0.559 57 983 475 18
6 Model E Ensemble 0.600 65 880 578 10

6. Key Findings and Insights

The analysis of the stroke prediction model has revealed several critical factors that significantly influence the likelihood of stroke in patients. Understanding these drivers allows for better-targeted interventions and more effective prevention strategies. However, to further enhance the accuracy and reliability of the model, additional data and features are essential.

Main Drivers influencing Stroke Risk

  • Age: Older patients have a higher risk of stroke. This finding underscores the importance of age-related health monitoring and interventions, as the likelihood of experiencing a stroke increases with age, necessitating enhanced medical vigilance for the elderly.

  • Hypertension: Presence of hypertension increases stroke risk. Hypertension, or high blood pressure, is a well-established risk factor for stroke, emphasizing the need for strict blood pressure control through medication, lifestyle changes, and regular monitoring.

  • Heart Disease: Patients with heart disease are more likely to experience a stroke. The strong correlation between cardiovascular conditions and stroke highlights the necessity for comprehensive care plans that address both heart disease management and stroke prevention.

  • Average Glucose Level: Higher average glucose levels are associated with Diabetes and increase stroke risk. Elevated glucose levels indicate poor diabetes control, which can lead to vascular damage and increased stroke risk, highlighting the importance of maintaining optimal glucose levels through diet, exercise, and medication adherence.

Insights

  • Preventive Measures: Targeted interventions for patients with hypertension and heart disease could reduce stroke incidence. Implementing comprehensive care plans that include lifestyle modifications, medication adherence, and regular health check-ups is crucial for mitigating stroke risk in these high-risk populations.

  • Public Health Strategies: Programs aimed at managing blood glucose levels and promoting healthy aging could be beneficial. Public health initiatives should focus on widespread screening for diabetes and hypertension, coupled with campaigns that encourage physical activity, healthy eating, and smoking cessation to reduce stroke risk at a population level.

  • Holistic Health Approach: Adopting a holistic approach that considers the interplay between various risk factors can enhance stroke prevention efforts. By addressing lifestyle factors such as diet, exercise, and stress management, healthcare providers can simultaneously mitigate risks associated with hypertension, heart disease, and diabetes, leading to better overall health outcomes.

  • Technology and Monitoring: Leveraging technology, such as wearable devices and telemedicine, can aid in the continuous monitoring of at-risk individuals. These technologies provide real-time data on blood pressure, glucose levels, and heart rate, allowing for timely interventions and personalized care plans that can significantly reduce stroke risk.

  • Education and Awareness: Raising awareness about the risk factors and preventive measures for stroke is crucial. Public health campaigns and educational programs should aim to inform individuals about the importance of regular health screenings, recognizing early symptoms of stroke, and seeking immediate medical attention, empowering them to take proactive steps towards stroke prevention.

Future Directions

To further improve the understanding and prediction of stroke risk, it is imperative to gather more comprehensive data and incorporate additional features into the model. Including a wider range of demographic, genetic, and lifestyle factors can provide a more nuanced view of stroke risk. Additionally, longitudinal data tracking patients over time could offer insights into how risk factors evolve and interact. By expanding the dataset and refining the features used, we can develop more accurate and robust models that enhance our ability to prevent and manage stroke.

7. Suggestions for Next Steps

  • Feature Enhancement: Incorporate additional health-related features, such as cholesterol levels, physical activity, and diet, to improve the model's predictive performance. Including more comprehensive lifestyle and biometric data can help create a more accurate and holistic risk assessment for stroke.

  • Longitudinal Data: Utilize longitudinal data to track changes in patient health over time, which can provide deeper insights into the progression and interaction of risk factors. Longitudinal studies allow for the observation of how individual risk profiles evolve, leading to more precise and personalized predictions.

  • Assessment of Existing Studies and Scores: Evaluate and integrate findings from established studies and scoring systems, such as the Framingham Heart Study and the CHA₂DS₂-VASc score, which are widely used for predicting cardiovascular and stroke risk. Comparing our model's performance with these well-regarded benchmarks can provide validation and highlight areas for improvement. Additionally, exploring datasets from these studies can offer valuable insights and potential features to enhance our model.

  • Collaborative Research: Engage in collaborative research with other institutions and researchers to leverage a broader range of expertise and datasets. By pooling resources and knowledge, we can develop more robust and generalizable models that are applicable across diverse populations.

  • Validation Across Diverse Populations: Test and validate the model across different demographic and geographic populations to ensure its applicability and reliability. Understanding how the model performs in various contexts can help identify any biases or limitations, leading to more equitable and effective stroke risk prediction tools.

  • Model Re-evaluation: Regularly update and re-evaluate the model as new data becomes available to ensure its continued relevance and accuracy. Incorporating the latest research findings and medical advancements will help maintain the model's effectiveness in predicting stroke risk.
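The CHA₂DS₂-VASc score suggested above as a benchmark is a simple additive point scheme, which makes it straightforward to compare against a learned model. A minimal sketch of the published point assignments (the function name and argument layout are illustrative):

```python
def cha2ds2_vasc(age: int, female: bool, chf: bool, hypertension: bool,
                 diabetes: bool, prior_stroke_tia: bool,
                 vascular_disease: bool) -> int:
    """Illustrative tally of the published CHA2DS2-VASc point scheme."""
    score = 0
    score += 1 if chf else 0                                    # C: congestive heart failure
    score += 1 if hypertension else 0                           # H: hypertension
    score += 2 if age >= 75 else (1 if 65 <= age < 75 else 0)   # A2 / A: age bands
    score += 1 if diabetes else 0                               # D: diabetes mellitus
    score += 2 if prior_stroke_tia else 0                       # S2: prior stroke/TIA
    score += 1 if vascular_disease else 0                       # V: vascular disease
    score += 1 if female else 0                                 # Sc: sex category
    return score

# e.g. a 70-year-old woman with hypertension: 1 (age 65-74) + 1 + 1 = 3
print(cha2ds2_vasc(70, True, False, True, False, False, False))  # 3
```

Scoring the dataset's patients with such a rule-based baseline would give a clinically grounded reference point for judging whether the trained models add predictive value.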