Stroke Prediction - Cluster Analysis and Dimensionality Reduction
This analysis aimed to enhance stroke data classification models by integrating unsupervised learning techniques such as clustering and dimensionality reduction. By employing clustering, we segmented patients into groups with similar features to improve stroke risk classification, and applied dimensionality reduction to explore patterns in the data. Despite the dataset's imbalance, we observed that cluster-based classification could slightly improve results. Specifically, clustering approaches such as DBSCAN, combined with classification algorithms, showed potential for better identifying stroke risk while balancing accuracy and recall. The ultimate goal is to enable early identification of high-risk individuals, optimize resource allocation, and support targeted preventive measures.


1. Main Objective
The main objective of this analysis is to improve the base classification models derived in the Stroke Data Analysis by applying unsupervised learning methods, namely clustering and dimensionality reduction. With clustering we analyze the data and aim to segment the population into groups of patients with similar features to support the classification of stroke risk. Dimensionality reduction transforms the observations into a different space to see whether we can identify structure or patterns. The metric for finding the best prediction is again to maximize recall while not classifying too many healthy patients as being at stroke risk. The true cost of misclassifying a patient at stroke risk as healthy outweighs the cost of misclassifying a healthy patient as at risk for stroke. The analysis aims to improve the basic classification models and provide further insights into stroke risk to allow:
- Early Identification: Helping healthcare providers identify individuals at higher risk of stroke for timely intervention.
- Resource Allocation: Assisting in the efficient allocation of medical resources to those most in need.
- Preventive Measures: Providing insights for developing targeted preventive measures to reduce the incidence of strokes.
2. Dataset Description
The Stroke Prediction dataset is available at Kaggle (URL : https://www.kaggle.com/fedesoriano/stroke-prediction-dataset) and contains data on patients, including their medical and demographic attributes.
Attributes
Attribute | Description |
---|---|
id | Unique identifier for each patient. |
gender | Gender of the patient (Male, Female, Other). |
age | Age of the patient. |
hypertension | Whether the patient has hypertension (0: No, 1: Yes). |
heart_disease | Whether the patient has heart disease (0: No, 1: Yes). |
ever_married | Marital status of the patient (No, Yes). |
work_type | Type of occupation (children, Govt_job, Never_worked, Private, Self-employed). |
residence_type | Type of residence (Rural, Urban). |
avg_glucose_level | Average glucose level in the blood. |
bmi | Body mass index. |
smoking_status | Smoking status (formerly smoked, never smoked, smokes, Unknown). |
stroke | Target variable indicating whether the patient had a stroke (0: No, 1: Yes). |
- Number of Instances: 5,110
- Number of Features: 11
- Target Variable: stroke (0: No Stroke, 1: Stroke)
Analysis Objectives
The analysis aims to:
- Explore and visualize the data to understand the distribution of attributes and identify any missing or anomalous values.
- Engineer features and prepare the data.
- Train multiple clustering models on the newly engineered dataset and evaluate the performance of classification based on the clusters.
- Train multiple dimensionality reduction models and evaluate the performance of classification based on the transformed data.
- Identify the best-performing model and feature engineering approach.
- Provide recommendations for next steps and further optimization.
3. Data Exploration and Cleaning
Data Exploration
Besides the id, the dataset includes the ten features listed above plus the target variable stroke. There are three numerical features: age, avg_glucose_level and bmi. The remaining seven features are categorical.
Out of the 5,110 observations in the dataset, 4,861 were observed with no stroke and 249 patients had a stroke. The dataset is therefore clearly imbalanced, which has to be addressed before model training.
The variables of the dataset are shown below.
count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|---|---|---|
gender | 5110 | 3 | Female | 2994 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
age | 5110.0 | NaN | NaN | NaN | 43.23 | 22.61 | 0.08 | 25.0 | 45.0 | 61.0 | 82.0 |
hypertension | 5110.0 | NaN | NaN | NaN | 0.1 | 0.3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
heart_disease | 5110.0 | NaN | NaN | NaN | 0.05 | 0.23 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
ever_married | 5110 | 2 | Yes | 3353 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
work_type | 5110 | 5 | Private | 2925 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Residence_type | 5110 | 2 | Urban | 2596 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
avg_glucose_level | 5110.0 | NaN | NaN | NaN | 106.15 | 45.28 | 55.12 | 77.24 | 91.88 | 114.09 | 271.74 |
bmi | 4909.0 | NaN | NaN | NaN | 28.89 | 7.85 | 10.3 | 23.5 | 28.1 | 33.1 | 97.6 |
smoking_status | 5110 | 4 | never smoked | 1892 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
stroke | 5110.0 | NaN | NaN | NaN | 0.05 | 0.22 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
To analyze the distribution and correlation of the data we prepared a set of four plots for each variable, depending on its type, as follows:
- Numerical Variables: the overall distribution of the variable, the distribution of the variable for non-stroke observations, the distribution of the variable for stroke observations, and the density distribution separated by stroke cases.
- Categorical Variables: the overall distribution of the variable, the distribution of the variable for non-stroke observations, the distribution of the variable for stroke observations, and the distribution of stroke cases within the groups of the categorical variable.
Numerical Variables
The graphs for the three numerical variables are shown below. There is no meaningful correlation visible for BMI or the Average Glucose Level. In contrast, Age shows a fairly strong dependency on the target variable.



Distribution of numerical variables
Stroke cases are not equally distributed across Age: reported stroke cases become more frequent with increasing age. Body Mass Index and Average Glucose Level do not show an obvious influence on stroke. This observation is confirmed in the pairplot below.

Correlation of numerical variables
Categorical Variables
The graphs for the seven categorical variables are shown below.







Distribution of categorical variables
If we analyze the group percentages and compare the distributions of the variables for stroke and non-stroke cases, we can identify Hypertension, Heart Disease and marital status as potential influences on stroke. We refer to the further correlation analysis in Part 1.
Data Cleaning and Feature Engineering
To prepare the data for the further analysis and the modeling phase we will perform the following steps:
- Handling Missing Values and Outliers: Address missing values as appropriate.
- Encoding Categorical Variables: Convert categorical variables into numerical format using one-hot encoding.
- Data Splitting: Split the data into training and testing sets.
- Feature Scaling: Scale features to ensure they are on a similar scale.
- Addressing Imbalanced Data: Balance the classes in the training set.
Handling Missing Values and Outliers
There are 201 records with a missing BMI value; 40 of these are stroke cases (stroke = 1). We will impute the missing values with the group mean BMI by gender, age and glucose level. Further, we have to address two outlier stroke cases as well as some unrealistic BMI outliers.
There are two stroke cases at very young ages which can be regarded as outliers. These cases are most likely caused by very rare circumstances and should not be part of a systematic data analysis. We drop the records with id 69768 and 49669.
162 | 245 | |
---|---|---|
id | 69768 | 49669 |
gender | Female | Female |
age | 1.32 | 14.0 |
hypertension | 0 | 0 |
heart_disease | 0 | 0 |
ever_married | No | No |
work_type | children | children |
Residence_type | Urban | Rural |
avg_glucose_level | 70.37 | 57.93 |
bmi | NaN | 30.9 |
smoking_status | Unknown | Unknown |
stroke | 1 | 1 |
We will also drop some unrealistic BMI (Body Mass Index) records. A BMI over 70 is extremely high and generally not realistic for most individuals. BMI is calculated as weight in kilograms divided by height in meters squared. For context, a BMI between 18.5 and 24.9 is considered normal weight, a BMI between 25 and 29.9 is considered overweight, and a BMI of 30 or above is considered obese. A BMI over 70 would indicate severe obesity, which is very rare. For example, an individual with a BMI of 70 who is 1.75 meters tall (about 5 feet 9 inches) would weigh approximately 473 lbs. Accordingly, we drop the records with ids 545, 41097, 56420 and 51856.
544 | 928 | 2128 | 4209 | |
---|---|---|---|---|
id | 545 | 41097 | 56420 | 51856 |
gender | Male | Female | Male | Male |
age | 42.0 | 23.0 | 17.0 | 38.0 |
hypertension | 0 | 1 | 1 | 1 |
heart_disease | 0 | 0 | 0 | 0 |
ever_married | Yes | No | No | Yes |
work_type | Private | Private | Private | Private |
Residence_type | Rural | Urban | Rural | Rural |
avg_glucose_level | 210.48 | 70.03 | 61.67 | 56.9 |
bmi | 71.9 | 78.0 | 97.6 | 92.0 |
smoking_status | never smoked | smokes | Unknown | never smoked |
stroke | 0 | 0 | 0 | 0 |
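A minimal sketch of the cleaning steps described above, assuming the raw data is in a pandas DataFrame `df`; the exact grouping (binned age and glucose) is an assumption, as the text only states that the group mean is taken by gender, age and glucose level:

```python
import pandas as pd

# Assumed: df holds the raw Kaggle stroke data with the columns listed in Section 2.
# Bin age and glucose so that a group mean can be formed per (gender, age bin, glucose bin).
df["age_bin"] = pd.cut(df["age"], bins=10)
df["glucose_bin"] = pd.cut(df["avg_glucose_level"], bins=5)

# Fill missing BMI with the mean BMI of the matching group,
# falling back to the overall mean if a group has no observed BMI.
group_mean = df.groupby(["gender", "age_bin", "glucose_bin"], observed=True)["bmi"].transform("mean")
df["bmi"] = df["bmi"].fillna(group_mean).fillna(df["bmi"].mean())

# Drop the two early-age stroke outliers and the four unrealistic BMI records.
outlier_ids = [69768, 49669, 545, 41097, 56420, 51856]
df = df[~df["id"].isin(outlier_ids)].drop(columns=["age_bin", "glucose_bin"])
```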
4. Base model
In a prior analysis we evaluated different classification models with various resampling and cross-validation approaches. During model training we evaluated four classification models: Logistic Regression, Decision Tree, Naive Bayes and AdaBoost. To optimize these models, we tuned hyperparameters using GridSearchCV with 5-fold cross-validation. This approach ensures that our models are robust and generalize well to unseen data. In addition to hyperparameter tuning, we explored different resampling methods to address class imbalance, specifically AdaSyn oversampling and TomekLinks undersampling. To comprehensively evaluate model performance, we varied the scoring metrics used in GridSearchCV, including F1 score, recall, and F-beta scores with beta values of 2 and 4. These varied scoring metrics help us assess the models' ability to balance precision and recall, with the higher beta values in the F-beta score placing particular emphasis on recall. This extensive evaluation process aims to identify the most effective combination of model, scoring metric and resampling strategy for predicting stroke risk.
To prepare the data for the modeling we apply one-hot encoding to all categorical variables. Two categorical variables are already encoded as numbers: 'hypertension' and 'heart_disease'. After performing the one-hot encoding and omitting the default value (drop_first=True), the dataset is widened to 17 features including the target variable. In the next step we split the data into training and test sets in a ratio of 70/30, while maintaining the class distribution with stratify=y. Scaling is crucial for ensuring that algorithms that are sensitive to the scale of the input data perform optimally and produce reliable results. We apply a MinMax scaler to the stroke data to normalize the features so that they take values between 0 and 1.
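A minimal sketch of this preparation step, assuming the cleaned data is in a DataFrame `df` with the column names listed in Section 2:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# One-hot encode the categorical variables, dropping the first level of each.
categorical = ["gender", "ever_married", "work_type", "Residence_type", "smoking_status"]
encoded = pd.get_dummies(df.drop(columns=["id"]), columns=categorical, drop_first=True)

X = encoded.drop(columns=["stroke"])
y = encoded["stroke"]

# 70/30 split, preserving the class distribution of the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Scale all features to the [0, 1] range; fit only on the training data.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```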
Model Evaluation
The models were evaluated using the same training and test splits for all models to ensure a fair comparison. The following evaluation methods were used:
Performance Indicators
- Accuracy
- Precision
- Recall
- F1 score as F1
- FBeta score for beta=2 as F2
- FBeta score for beta=4 as F4
Confusion Matrix
- True positive (1) and False positive (1) counts
- True negative (0) and False negative (0) counts
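These indicators can be computed from the test-set predictions, for example with scikit-learn; the sketch below assumes `y_test` and `y_pred` are already available:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             fbeta_score, confusion_matrix)

# F1 is the F-beta score with beta=1; F2 and F4 weight recall more heavily.
metrics = {
    "Accuracy": accuracy_score(y_test, y_pred),
    "Precision": precision_score(y_test, y_pred),
    "Recall": recall_score(y_test, y_pred),
    "F1": fbeta_score(y_test, y_pred, beta=1),
    "F2": fbeta_score(y_test, y_pred, beta=2),
    "F4": fbeta_score(y_test, y_pred, beta=4),
}

# Confusion matrix counts: rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
```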
Below are the results for the best classification models found with a recall score larger than 0.7.
Resampling | Scoring | Model | Precision | Recall | Accuracy | F1 | F2 | F4 | |
---|---|---|---|---|---|---|---|---|---|
19 | TomekLinks | f1 | AdaBoost | 0.167 | 0.716 | 0.814 | 0.271 | 0.432 | 0.600 |
28 | TomekLinks | f4 | Logistic Regression | 0.140 | 0.784 | 0.757 | 0.237 | 0.408 | 0.617 |
20 | TomekLinks | recall | Logistic Regression | 0.140 | 0.784 | 0.757 | 0.237 | 0.408 | 0.617 |
24 | TomekLinks | f2 | Logistic Regression | 0.140 | 0.784 | 0.757 | 0.237 | 0.408 | 0.617 |
44 | None | f4 | Logistic Regression | 0.137 | 0.757 | 0.758 | 0.232 | 0.398 | 0.598 |
36 | None | recall | Logistic Regression | 0.137 | 0.757 | 0.758 | 0.232 | 0.398 | 0.598 |
45 | None | f4 | Decision Tree | 0.136 | 0.757 | 0.756 | 0.230 | 0.395 | 0.596 |
41 | None | f2 | Decision Tree | 0.136 | 0.757 | 0.756 | 0.230 | 0.395 | 0.596 |
37 | None | recall | Decision Tree | 0.136 | 0.757 | 0.756 | 0.230 | 0.395 | 0.596 |
33 | None | f1 | Decision Tree | 0.136 | 0.757 | 0.756 | 0.230 | 0.395 | 0.596 |
27 | TomekLinks | f2 | AdaBoost | 0.135 | 0.757 | 0.755 | 0.230 | 0.394 | 0.596 |
43 | None | f2 | AdaBoost | 0.134 | 0.743 | 0.757 | 0.228 | 0.390 | 0.587 |
21 | TomekLinks | recall | Decision Tree | 0.124 | 0.784 | 0.723 | 0.215 | 0.381 | 0.598 |
29 | TomekLinks | f4 | Decision Tree | 0.124 | 0.784 | 0.723 | 0.215 | 0.381 | 0.598 |
25 | TomekLinks | f2 | Decision Tree | 0.124 | 0.784 | 0.723 | 0.215 | 0.381 | 0.598 |
47 | None | f4 | AdaBoost | 0.104 | 0.905 | 0.618 | 0.186 | 0.356 | 0.623 |
31 | TomekLinks | f4 | AdaBoost | 0.104 | 0.905 | 0.617 | 0.186 | 0.355 | 0.622 |
16 | TomekLinks | f1 | Logistic Regression | 0.101 | 0.824 | 0.636 | 0.179 | 0.338 | 0.579 |
30 | TomekLinks | f4 | Naive Bayes | 0.086 | 0.959 | 0.505 | 0.158 | 0.316 | 0.600 |
46 | None | f4 | Naive Bayes | 0.086 | 0.959 | 0.503 | 0.157 | 0.316 | 0.600 |
11 | Adasyn | f2 | AdaBoost | 0.089 | 0.838 | 0.578 | 0.161 | 0.312 | 0.561 |
15 | Adasyn | f4 | AdaBoost | 0.072 | 1.000 | 0.381 | 0.135 | 0.281 | 0.570 |
2 | Adasyn | f1 | Naive Bayes | 0.079 | 0.770 | 0.554 | 0.143 | 0.279 | 0.508 |
10 | Adasyn | f2 | Naive Bayes | 0.070 | 0.959 | 0.379 | 0.130 | 0.270 | 0.548 |
14 | Adasyn | f4 | Naive Bayes | 0.065 | 1.000 | 0.309 | 0.123 | 0.259 | 0.543 |
32 | None | f1 | Logistic Regression | 0.064 | 1.000 | 0.290 | 0.120 | 0.254 | 0.536 |
40 | None | f2 | Logistic Regression | 0.064 | 1.000 | 0.290 | 0.120 | 0.254 | 0.536 |
38 | None | recall | Naive Bayes | 0.058 | 1.000 | 0.213 | 0.109 | 0.235 | 0.511 |
39 | None | recall | AdaBoost | 0.058 | 1.000 | 0.213 | 0.109 | 0.235 | 0.511 |
23 | TomekLinks | recall | AdaBoost | 0.058 | 1.000 | 0.213 | 0.109 | 0.235 | 0.511 |
22 | TomekLinks | recall | Naive Bayes | 0.058 | 1.000 | 0.213 | 0.109 | 0.235 | 0.511 |
7 | Adasyn | recall | AdaBoost | 0.057 | 1.000 | 0.204 | 0.108 | 0.233 | 0.508 |
6 | Adasyn | recall | Naive Bayes | 0.057 | 1.000 | 0.204 | 0.108 | 0.233 | 0.508 |
Resampling | Scoring | Model | F2 | True 1 | True 0 | False 1 | False 0 | |
---|---|---|---|---|---|---|---|---|
19 | TomekLinks | f1 | AdaBoost | 0.432 | 53 | 1194 | 264 | 21 |
28 | TomekLinks | f4 | Logistic Regression | 0.408 | 58 | 1101 | 357 | 16 |
20 | TomekLinks | recall | Logistic Regression | 0.408 | 58 | 1101 | 357 | 16 |
24 | TomekLinks | f2 | Logistic Regression | 0.408 | 58 | 1101 | 357 | 16 |
44 | None | f4 | Logistic Regression | 0.398 | 56 | 1106 | 352 | 18 |
36 | None | recall | Logistic Regression | 0.398 | 56 | 1106 | 352 | 18 |
45 | None | f4 | Decision Tree | 0.395 | 56 | 1102 | 356 | 18 |
41 | None | f2 | Decision Tree | 0.395 | 56 | 1102 | 356 | 18 |
37 | None | recall | Decision Tree | 0.395 | 56 | 1102 | 356 | 18 |
33 | None | f1 | Decision Tree | 0.395 | 56 | 1102 | 356 | 18 |
27 | TomekLinks | f2 | AdaBoost | 0.394 | 56 | 1100 | 358 | 18 |
43 | None | f2 | AdaBoost | 0.390 | 55 | 1104 | 354 | 19 |
21 | TomekLinks | recall | Decision Tree | 0.381 | 58 | 1050 | 408 | 16 |
29 | TomekLinks | f4 | Decision Tree | 0.381 | 58 | 1050 | 408 | 16 |
25 | TomekLinks | f2 | Decision Tree | 0.381 | 58 | 1050 | 408 | 16 |
47 | None | f4 | AdaBoost | 0.356 | 67 | 880 | 578 | 7 |
31 | TomekLinks | f4 | AdaBoost | 0.355 | 67 | 878 | 580 | 7 |
16 | TomekLinks | f1 | Logistic Regression | 0.338 | 61 | 913 | 545 | 13 |
30 | TomekLinks | f4 | Naive Bayes | 0.316 | 71 | 702 | 756 | 3 |
46 | None | f4 | Naive Bayes | 0.316 | 71 | 700 | 758 | 3 |
11 | Adasyn | f2 | AdaBoost | 0.312 | 62 | 824 | 634 | 12 |
15 | Adasyn | f4 | AdaBoost | 0.281 | 74 | 510 | 948 | 0 |
2 | Adasyn | f1 | Naive Bayes | 0.279 | 57 | 791 | 667 | 17 |
10 | Adasyn | f2 | Naive Bayes | 0.270 | 71 | 510 | 948 | 3 |
14 | Adasyn | f4 | Naive Bayes | 0.259 | 74 | 399 | 1059 | 0 |
32 | None | f1 | Logistic Regression | 0.254 | 74 | 370 | 1088 | 0 |
40 | None | f2 | Logistic Regression | 0.254 | 74 | 370 | 1088 | 0 |
38 | None | recall | Naive Bayes | 0.235 | 74 | 253 | 1205 | 0 |
39 | None | recall | AdaBoost | 0.235 | 74 | 253 | 1205 | 0 |
23 | TomekLinks | recall | AdaBoost | 0.235 | 74 | 253 | 1205 | 0 |
22 | TomekLinks | recall | Naive Bayes | 0.235 | 74 | 253 | 1205 | 0 |
7 | Adasyn | recall | AdaBoost | 0.233 | 74 | 238 | 1220 | 0 |
6 | Adasyn | recall | Naive Bayes | 0.233 | 74 | 238 | 1220 | 0 |
The best models depending on the scoring metric are shown below.
Best Accuracy | Best Precision | Best Recall | Best F1 | Best F2 | Best F4 | |
---|---|---|---|---|---|---|
Model ID | 1 | 17 | 15 | 19 | 19 | 47 |
Resampling | Adasyn | TomekLinks | Adasyn | TomekLinks | TomekLinks | None |
Scoring | f1 | f1 | f4 | f1 | f1 | f4 |
Model | Decision Tree | Decision Tree | AdaBoost | AdaBoost | AdaBoost | AdaBoost |
Precision | 0.1298 | 0.1673 | 0.0724 | 0.1672 | 0.1672 | 0.1039 |
Recall | 0.2297 | 0.5676 | 1.0 | 0.7162 | 0.7162 | 0.9054 |
Accuracy | 0.8884 | 0.8427 | 0.3812 | 0.814 | 0.814 | 0.6181 |
F1 | 0.1659 | 0.2585 | 0.135 | 0.2711 | 0.2711 | 0.1864 |
F2 | 0.1991 | 0.3839 | 0.2807 | 0.4323 | 0.4323 | 0.356 |
F4 | 0.2198 | 0.4976 | 0.5703 | 0.6003 | 0.6003 | 0.6227 |
True 1 | 17 | 42 | 74 | 53 | 53 | 67 |
True 0 | 1344 | 1249 | 510 | 1194 | 1194 | 880 |
False 1 | 114 | 209 | 948 | 264 | 264 | 578 |
False 0 | 57 | 32 | 0 | 21 | 21 | 7 |

Best base classification models Confusion matrix
5. K Means, DBSCAN and Agglomerative Clustering
Having understood the complexity of the data, we will try to segment it into clusters, which we plan to use in our improvement approaches for the classification models. We will apply three different clustering algorithms: KMeans, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and Agglomerative Clustering. For all methods we include all features except the target variable stroke. As a result we enrich our dataset with one column per approach that holds the cluster label of each record. In the case of DBSCAN, a label of -1 marks noise points that do not belong to any cluster.
KMeans
We will perform KMeans clustering over all features of the dataset and determine the best number of clusters K. First we apply a StandardScaler and recode all categorical features to numeric format.
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
gender | 5104.0 | 0.0 | 1.0 | -0.840 | -0.840 | -0.840 | 1.188 | 3.217 |
age | 5104.0 | -0.0 | 1.0 | -1.910 | -0.807 | 0.077 | 0.785 | 1.714 |
hypertension | 5104.0 | -0.0 | 1.0 | -0.328 | -0.328 | -0.328 | -0.328 | 3.051 |
heart_disease | 5104.0 | 0.0 | 1.0 | -0.239 | -0.239 | -0.239 | -0.239 | 4.182 |
ever_married | 5104.0 | -0.0 | 1.0 | -1.383 | -1.383 | 0.723 | 0.723 | 0.723 |
work_type | 5104.0 | -0.0 | 1.0 | -1.988 | -0.153 | -0.153 | 0.764 | 1.681 |
Residence_type | 5104.0 | 0.0 | 1.0 | -1.017 | -1.017 | 0.984 | 0.984 | 0.984 |
smoking_status | 5104.0 | -0.0 | 1.0 | -1.286 | -1.286 | 0.581 | 0.581 | 1.515 |
glucose_group | 5104.0 | 0.0 | 1.0 | -0.412 | -0.412 | -0.412 | -0.412 | 2.945 |
bmi_group | 5104.0 | -0.0 | 1.0 | -2.129 | -1.079 | -0.030 | 1.020 | 1.020 |
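A minimal sketch of the scaling and elbow computation, assuming the recoded feature matrix (without the target) is in `X_cluster`:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X_scaled = StandardScaler().fit_transform(X_cluster)

# Inertia (within-cluster sum of squares) for a range of K values;
# the elbow of this curve suggests a suitable number of clusters.
inertias = []
for k in range(2, 21):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias.append((k, km.inertia_))
```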
Below you can see the resulting graph showing inertia values over number of clusters K.

Elbow curve - Inertia over number of clusters
From the elbow curve we identify K=10 as a significant candidate for the elbow point; we rerun KMeans with K=10 and add the resulting cluster labels to our dataset. The resulting cluster counts per label are shown below.
KMeanAF | Total | Stroke | Not_Stroke | |
---|---|---|---|---|
0 | 0 | 761 | 32 | 729 |
1 | 1 | 850 | 3 | 847 |
2 | 2 | 463 | 24 | 439 |
3 | 3 | 274 | 24 | 250 |
4 | 4 | 470 | 23 | 447 |
5 | 5 | 717 | 0 | 717 |
6 | 6 | 276 | 47 | 229 |
7 | 7 | 498 | 24 | 474 |
8 | 8 | 364 | 17 | 347 |
9 | 9 | 431 | 53 | 378 |

Cluster distribution
We can see that cluster label 5 contains no stroke cases. Cluster label 6 shows the highest percentage of stroke cases per segment. Below you can see the resulting distributions of the continuous variables Age, Avg Glucose Level and BMI as violin plots differentiated by cluster, as well as the distributions of the categorical variables Hypertension and Heart Disease.



KMeans clusters - Distribution of Age, Glucose and BMI Groups



KMeans clusters - Distribution of Hypertension, Heart disease and Stroke
DBSCAN
Next we will perform DBSCAN clustering over all features of the dataset. The number of clusters is determined by DBSCAN itself. As before, we apply a StandardScaler and recode all categorical features to numeric format.
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
gender | 5104.0 | 0.0 | 1.0 | -0.840 | -0.840 | -0.840 | 1.188 | 3.217 |
age | 5104.0 | -0.0 | 1.0 | -1.910 | -0.807 | 0.077 | 0.785 | 1.714 |
hypertension | 5104.0 | -0.0 | 1.0 | -0.328 | -0.328 | -0.328 | -0.328 | 3.051 |
heart_disease | 5104.0 | 0.0 | 1.0 | -0.239 | -0.239 | -0.239 | -0.239 | 4.182 |
ever_married | 5104.0 | -0.0 | 1.0 | -1.383 | -1.383 | 0.723 | 0.723 | 0.723 |
work_type | 5104.0 | -0.0 | 1.0 | -1.988 | -0.153 | -0.153 | 0.764 | 1.681 |
Residence_type | 5104.0 | 0.0 | 1.0 | -1.017 | -1.017 | 0.984 | 0.984 | 0.984 |
smoking_status | 5104.0 | -0.0 | 1.0 | -1.286 | -1.286 | 0.581 | 0.581 | 1.515 |
glucose_group | 5104.0 | 0.0 | 1.0 | -0.412 | -0.412 | -0.412 | -0.412 | 2.945 |
bmi_group | 5104.0 | -0.0 | 1.0 | -2.129 | -1.079 | -0.030 | 1.020 | 1.020 |
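A minimal sketch of the DBSCAN run on the scaled features; the `eps` and `min_samples` values below are illustrative assumptions, as the tuned values are not stated in this section:

```python
from sklearn.cluster import DBSCAN

# eps and min_samples are hypothetical; DBSCAN derives the number of clusters
# itself and labels low-density points as noise with the value -1.
dbscan = DBSCAN(eps=1.5, min_samples=10)
labels = dbscan.fit_predict(X_scaled)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = (labels == -1).sum()
```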
Below you can see the result. DBSCAN found 4 clusters and identified 3 records as noise.
DBSCANAF | Total | Stroke | Not_Stroke | |
---|---|---|---|---|
0 | -1 | 3 | 0 | 3 |
1 | 0 | 211 | 34 | 177 |
2 | 1 | 4397 | 147 | 4250 |
3 | 2 | 431 | 53 | 378 |
4 | 3 | 62 | 13 | 49 |

Cluster distribution
We can see that cluster label 1 holds the majority of all observations and also has the smallest stroke percentage. Three observations were classified as noise, all non-stroke patients. Below you can see the resulting distributions of the continuous variables Age, Avg Glucose Level and BMI as violin plots differentiated by cluster, as well as the distributions of the categorical variables Hypertension and Heart Disease.



DBSCAN cluster-distribution for Age, glucose level and bmi



DBSCAN cluster-distribution for Hypertension, Heart disease and stroke
Agglomerative Clustering
Next we will perform Agglomerative Clustering over all features of the dataset and determine the best number of clusters n. As before, we apply a StandardScaler and recode all categorical features to numeric format.
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
gender | 5104.0 | 0.0 | 1.0 | -0.840 | -0.840 | -0.840 | 1.188 | 3.217 |
age | 5104.0 | -0.0 | 1.0 | -1.910 | -0.807 | 0.077 | 0.785 | 1.714 |
hypertension | 5104.0 | -0.0 | 1.0 | -0.328 | -0.328 | -0.328 | -0.328 | 3.051 |
heart_disease | 5104.0 | 0.0 | 1.0 | -0.239 | -0.239 | -0.239 | -0.239 | 4.182 |
ever_married | 5104.0 | -0.0 | 1.0 | -1.383 | -1.383 | 0.723 | 0.723 | 0.723 |
work_type | 5104.0 | -0.0 | 1.0 | -1.988 | -0.153 | -0.153 | 0.764 | 1.681 |
Residence_type | 5104.0 | 0.0 | 1.0 | -1.017 | -1.017 | 0.984 | 0.984 | 0.984 |
smoking_status | 5104.0 | -0.0 | 1.0 | -1.286 | -1.286 | 0.581 | 0.581 | 1.515 |
glucose_group | 5104.0 | 0.0 | 1.0 | -0.412 | -0.412 | -0.412 | -0.412 | 2.945 |
bmi_group | 5104.0 | -0.0 | 1.0 | -2.129 | -1.079 | -0.030 | 1.020 | 1.020 |
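A minimal sketch of the silhouette evaluation over different cluster counts and linkage settings, assuming `X_scaled` holds the standardized features from the step above:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Silhouette score for each combination of linkage and number of clusters.
scores = {}
for linkage in ["ward", "complete", "average"]:
    for n in range(2, 15):
        labels = AgglomerativeClustering(n_clusters=n, linkage=linkage).fit_predict(X_scaled)
        scores[(linkage, n)] = silhouette_score(X_scaled, labels)

# Final clustering with the chosen parameters (n=9, ward linkage).
agglo_labels = AgglomerativeClustering(n_clusters=9, linkage="ward").fit_predict(X_scaled)
```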
Below you can see the resulting graph showing silhouette scores over the number of clusters n for different linkage parameters.

Silhouette scores for different linkage parameters
We select n=9 as the best candidate and rerun the Agglomerative Clustering with n=9 and ward linkage, adding the cluster labels to our dataset. Below you can see the resulting distribution and cluster counts.
AggloAF | Total | Stroke | Not_Stroke | |
---|---|---|---|---|
0 | 0 | 966 | 13 | 953 |
1 | 1 | 645 | 0 | 645 |
2 | 2 | 696 | 35 | 661 |
3 | 3 | 431 | 53 | 378 |
4 | 4 | 587 | 21 | 566 |
5 | 5 | 505 | 42 | 463 |
6 | 6 | 276 | 47 | 229 |
7 | 7 | 377 | 11 | 366 |
8 | 8 | 621 | 25 | 596 |

Cluster distribution
We can see that cluster label 1 contains no stroke cases. Cluster label 6 shows the highest percentage of stroke cases per segment. Below you can see the resulting distributions of the continuous variables Age, Avg Glucose Level and BMI as violin plots differentiated by cluster, as well as the distributions of the categorical variables Hypertension and Heart Disease.



Agglomerative Clustering cluster-distribution for Age, glucose level and bmi



Agglomerative Clustering cluster-distribution for Hypertension, Heart disease and stroke
Cluster based classification
We can now analyze the quality of the clusters, use the results to create additional features, and include the cluster labels in the feature set for classification. We will apply the same classification algorithms as for the base models and the same resampling methods as before; a minimal sketch of the feature augmentation is shown below. The results table that follows lists the models with a recall greater than 0.7.
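A minimal sketch of the feature augmentation, assuming the base feature matrix `X` and a DataFrame `df` with the cluster label columns `KMeanAF`, `DBSCANAF` and `AggloAF` share the same index:

```python
import pandas as pd

# One-hot encode the chosen cluster label column and append it to the base features.
cluster_col = "KMeanAF"  # or "DBSCANAF" / "AggloAF"
cluster_dummies = pd.get_dummies(df[cluster_col], prefix=cluster_col)
X_augmented = pd.concat([X, cluster_dummies], axis=1)

# The augmented matrix is then split, scaled and fed into the same
# GridSearchCV / resampling pipeline used for the base models.
```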
Clustering | Resampling | Scoring | Model | Precision | Recall | Accuracy | F1 | F2 | F4 | |
---|---|---|---|---|---|---|---|---|---|---|
83 | AggloAF | TomekLinks | f1 | AdaBoost | 0.167 | 0.703 | 0.816 | 0.269 | 0.428 | 0.591 |
51 | DBSCANAF | TomekLinks | f1 | AdaBoost | 0.154 | 0.703 | 0.799 | 0.253 | 0.411 | 0.581 |
59 | DBSCANAF | TomekLinks | f2 | AdaBoost | 0.133 | 0.743 | 0.752 | 0.225 | 0.387 | 0.585 |
60 | DBSCANAF | TomekLinks | f4 | Logistic Regression | 0.130 | 0.716 | 0.754 | 0.220 | 0.376 | 0.566 |
56 | DBSCANAF | TomekLinks | f2 | Logistic Regression | 0.130 | 0.716 | 0.754 | 0.220 | 0.376 | 0.566 |
95 | AggloAF | TomekLinks | f4 | AdaBoost | 0.118 | 0.824 | 0.694 | 0.206 | 0.375 | 0.610 |
91 | AggloAF | TomekLinks | f2 | AdaBoost | 0.118 | 0.824 | 0.694 | 0.206 | 0.375 | 0.610 |
27 | KMeanAF | TomekLinks | f2 | AdaBoost | 0.116 | 0.824 | 0.687 | 0.203 | 0.371 | 0.606 |
19 | KMeanAF | TomekLinks | f1 | AdaBoost | 0.116 | 0.824 | 0.687 | 0.203 | 0.371 | 0.606 |
31 | KMeanAF | TomekLinks | f4 | AdaBoost | 0.116 | 0.824 | 0.687 | 0.203 | 0.371 | 0.606 |
48 | DBSCANAF | TomekLinks | f1 | Logistic Regression | 0.121 | 0.757 | 0.722 | 0.208 | 0.368 | 0.578 |
20 | KMeanAF | TomekLinks | recall | Logistic Regression | 0.109 | 0.892 | 0.644 | 0.195 | 0.367 | 0.628 |
28 | KMeanAF | TomekLinks | f4 | Logistic Regression | 0.109 | 0.892 | 0.644 | 0.195 | 0.367 | 0.628 |
24 | KMeanAF | TomekLinks | f2 | Logistic Regression | 0.109 | 0.892 | 0.644 | 0.195 | 0.367 | 0.628 |
16 | KMeanAF | TomekLinks | f1 | Logistic Regression | 0.109 | 0.892 | 0.644 | 0.195 | 0.367 | 0.628 |
63 | DBSCANAF | TomekLinks | f4 | AdaBoost | 0.106 | 0.919 | 0.620 | 0.189 | 0.362 | 0.632 |
21 | KMeanAF | TomekLinks | recall | Decision Tree | 0.113 | 0.784 | 0.692 | 0.197 | 0.358 | 0.581 |
66 | AggloAF | Adasyn | f1 | Naive Bayes | 0.112 | 0.784 | 0.689 | 0.196 | 0.356 | 0.579 |
89 | AggloAF | TomekLinks | f2 | Decision Tree | 0.117 | 0.716 | 0.726 | 0.202 | 0.354 | 0.551 |
75 | AggloAF | Adasyn | f2 | AdaBoost | 0.102 | 0.919 | 0.605 | 0.184 | 0.353 | 0.625 |
26 | KMeanAF | TomekLinks | f2 | Naive Bayes | 0.106 | 0.770 | 0.676 | 0.187 | 0.342 | 0.563 |
18 | KMeanAF | TomekLinks | f1 | Naive Bayes | 0.106 | 0.770 | 0.676 | 0.187 | 0.342 | 0.563 |
15 | KMeanAF | Adasyn | f4 | AdaBoost | 0.093 | 0.932 | 0.560 | 0.170 | 0.334 | 0.610 |
11 | KMeanAF | Adasyn | f2 | AdaBoost | 0.093 | 0.932 | 0.560 | 0.170 | 0.334 | 0.610 |
43 | DBSCANAF | Adasyn | f2 | AdaBoost | 0.095 | 0.892 | 0.584 | 0.172 | 0.333 | 0.597 |
94 | AggloAF | TomekLinks | f4 | Naive Bayes | 0.094 | 0.905 | 0.573 | 0.170 | 0.332 | 0.600 |
80 | AggloAF | TomekLinks | f1 | Logistic Regression | 0.091 | 0.932 | 0.547 | 0.166 | 0.327 | 0.604 |
62 | DBSCANAF | TomekLinks | f4 | Naive Bayes | 0.093 | 0.838 | 0.596 | 0.167 | 0.322 | 0.569 |
30 | KMeanAF | TomekLinks | f4 | Naive Bayes | 0.084 | 0.946 | 0.496 | 0.154 | 0.309 | 0.589 |
79 | AggloAF | Adasyn | f4 | AdaBoost | 0.079 | 0.986 | 0.445 | 0.146 | 0.299 | 0.589 |
34 | DBSCANAF | Adasyn | f1 | Naive Bayes | 0.085 | 0.784 | 0.580 | 0.153 | 0.296 | 0.528 |
47 | DBSCANAF | Adasyn | f4 | AdaBoost | 0.074 | 1.000 | 0.398 | 0.138 | 0.287 | 0.577 |
14 | KMeanAF | Adasyn | f4 | Naive Bayes | 0.075 | 0.959 | 0.428 | 0.139 | 0.286 | 0.567 |
6 | KMeanAF | Adasyn | recall | Naive Bayes | 0.075 | 0.959 | 0.428 | 0.139 | 0.286 | 0.567 |
42 | DBSCANAF | Adasyn | f2 | Naive Bayes | 0.076 | 0.865 | 0.483 | 0.139 | 0.280 | 0.536 |
10 | KMeanAF | Adasyn | f2 | Naive Bayes | 0.078 | 0.811 | 0.525 | 0.142 | 0.280 | 0.521 |
22 | KMeanAF | TomekLinks | recall | Naive Bayes | 0.070 | 1.000 | 0.362 | 0.131 | 0.274 | 0.563 |
46 | DBSCANAF | Adasyn | f4 | Naive Bayes | 0.070 | 0.932 | 0.401 | 0.131 | 0.270 | 0.542 |
7 | KMeanAF | Adasyn | recall | AdaBoost | 0.068 | 1.000 | 0.336 | 0.127 | 0.267 | 0.553 |
78 | AggloAF | Adasyn | f4 | Naive Bayes | 0.064 | 0.973 | 0.315 | 0.121 | 0.254 | 0.531 |
74 | AggloAF | Adasyn | f2 | Naive Bayes | 0.065 | 0.905 | 0.366 | 0.121 | 0.252 | 0.514 |
70 | AggloAF | Adasyn | recall | Naive Bayes | 0.061 | 0.973 | 0.279 | 0.115 | 0.245 | 0.519 |
55 | DBSCANAF | TomekLinks | recall | AdaBoost | 0.059 | 1.000 | 0.233 | 0.112 | 0.239 | 0.517 |
Clustering | Resampling | Scoring | Model | F4 | True 1 | True 0 | False 1 | False 0 | |
---|---|---|---|---|---|---|---|---|---|
83 | AggloAF | TomekLinks | f1 | AdaBoost | 0.591 | 52 | 1198 | 260 | 22 |
51 | DBSCANAF | TomekLinks | f1 | AdaBoost | 0.581 | 52 | 1172 | 285 | 22 |
59 | DBSCANAF | TomekLinks | f2 | AdaBoost | 0.585 | 55 | 1097 | 360 | 19 |
60 | DBSCANAF | TomekLinks | f4 | Logistic Regression | 0.566 | 53 | 1102 | 355 | 21 |
56 | DBSCANAF | TomekLinks | f2 | Logistic Regression | 0.566 | 53 | 1102 | 355 | 21 |
95 | AggloAF | TomekLinks | f4 | AdaBoost | 0.610 | 61 | 1002 | 456 | 13 |
91 | AggloAF | TomekLinks | f2 | AdaBoost | 0.610 | 61 | 1002 | 456 | 13 |
27 | KMeanAF | TomekLinks | f2 | AdaBoost | 0.606 | 61 | 992 | 466 | 13 |
19 | KMeanAF | TomekLinks | f1 | AdaBoost | 0.606 | 61 | 992 | 466 | 13 |
31 | KMeanAF | TomekLinks | f4 | AdaBoost | 0.606 | 61 | 992 | 466 | 13 |
48 | DBSCANAF | TomekLinks | f1 | Logistic Regression | 0.578 | 56 | 1049 | 408 | 18 |
20 | KMeanAF | TomekLinks | recall | Logistic Regression | 0.628 | 66 | 921 | 537 | 8 |
28 | KMeanAF | TomekLinks | f4 | Logistic Regression | 0.628 | 66 | 921 | 537 | 8 |
24 | KMeanAF | TomekLinks | f2 | Logistic Regression | 0.628 | 66 | 921 | 537 | 8 |
16 | KMeanAF | TomekLinks | f1 | Logistic Regression | 0.628 | 66 | 921 | 537 | 8 |
63 | DBSCANAF | TomekLinks | f4 | AdaBoost | 0.632 | 68 | 881 | 576 | 6 |
21 | KMeanAF | TomekLinks | recall | Decision Tree | 0.581 | 58 | 1002 | 456 | 16 |
66 | AggloAF | Adasyn | f1 | Naive Bayes | 0.579 | 58 | 997 | 461 | 16 |
89 | AggloAF | TomekLinks | f2 | Decision Tree | 0.551 | 53 | 1059 | 399 | 21 |
75 | AggloAF | Adasyn | f2 | AdaBoost | 0.625 | 68 | 859 | 599 | 6 |
26 | KMeanAF | TomekLinks | f2 | Naive Bayes | 0.563 | 57 | 978 | 480 | 17 |
18 | KMeanAF | TomekLinks | f1 | Naive Bayes | 0.563 | 57 | 978 | 480 | 17 |
15 | KMeanAF | Adasyn | f4 | AdaBoost | 0.610 | 69 | 789 | 669 | 5 |
11 | KMeanAF | Adasyn | f2 | AdaBoost | 0.610 | 69 | 789 | 669 | 5 |
43 | DBSCANAF | Adasyn | f2 | AdaBoost | 0.597 | 66 | 828 | 629 | 8 |
94 | AggloAF | TomekLinks | f4 | Naive Bayes | 0.600 | 67 | 811 | 647 | 7 |
80 | AggloAF | TomekLinks | f1 | Logistic Regression | 0.604 | 69 | 769 | 689 | 5 |
62 | DBSCANAF | TomekLinks | f4 | Naive Bayes | 0.569 | 62 | 851 | 606 | 12 |
30 | KMeanAF | TomekLinks | f4 | Naive Bayes | 0.589 | 70 | 690 | 768 | 4 |
79 | AggloAF | Adasyn | f4 | AdaBoost | 0.589 | 73 | 608 | 850 | 1 |
34 | DBSCANAF | Adasyn | f1 | Naive Bayes | 0.528 | 58 | 830 | 627 | 16 |
47 | DBSCANAF | Adasyn | f4 | AdaBoost | 0.577 | 74 | 536 | 921 | 0 |
14 | KMeanAF | Adasyn | f4 | Naive Bayes | 0.567 | 71 | 585 | 873 | 3 |
6 | KMeanAF | Adasyn | recall | Naive Bayes | 0.567 | 71 | 585 | 873 | 3 |
42 | DBSCANAF | Adasyn | f2 | Naive Bayes | 0.536 | 64 | 676 | 781 | 10 |
10 | KMeanAF | Adasyn | f2 | Naive Bayes | 0.521 | 60 | 744 | 714 | 14 |
22 | KMeanAF | TomekLinks | recall | Naive Bayes | 0.563 | 74 | 480 | 978 | 0 |
46 | DBSCANAF | Adasyn | f4 | Naive Bayes | 0.542 | 69 | 545 | 912 | 5 |
7 | KMeanAF | Adasyn | recall | AdaBoost | 0.553 | 74 | 440 | 1018 | 0 |
78 | AggloAF | Adasyn | f4 | Naive Bayes | 0.531 | 72 | 410 | 1048 | 2 |
74 | AggloAF | Adasyn | f2 | Naive Bayes | 0.514 | 67 | 493 | 965 | 7 |
70 | AggloAF | Adasyn | recall | Naive Bayes | 0.519 | 72 | 355 | 1103 | 2 |
55 | DBSCANAF | TomekLinks | recall | AdaBoost | 0.517 | 74 | 282 | 1175 | 0 |
The best models depending on the scoring metric are shown below.
Best Accuracy | Best Precision | Best Recall | Best F1 | Best F2 | Best F4 | |
---|---|---|---|---|---|---|
Model ID | 84 | 64 | 47 | 83 | 83 | 63 |
Clustering | AggloAF | AggloAF | DBSCANAF | AggloAF | AggloAF | DBSCANAF |
Resampling | TomekLinks | Adasyn | Adasyn | TomekLinks | TomekLinks | TomekLinks |
Scoring | recall | f1 | f4 | f1 | f1 | f4 |
Model | Logistic Reg | Logistic Reg | AdaBoost | AdaBoost | AdaBoost | AdaBoost |
Precision | 0.0 | 0.1901 | 0.0744 | 0.1667 | 0.1667 | 0.1056 |
Recall | 0.0 | 0.3108 | 1.0 | 0.7027 | 0.7027 | 0.9189 |
Accuracy | 0.9517 | 0.9027 | 0.3984 | 0.8159 | 0.8159 | 0.6199 |
F1 | 0.0 | 0.2359 | 0.1384 | 0.2694 | 0.2694 | 0.1894 |
F2 | 0.0 | 0.2758 | 0.2866 | 0.4276 | 0.4276 | 0.3617 |
F4 | 0.0 | 0.2996 | 0.5773 | 0.5909 | 0.5909 | 0.6324 |
True 1 | 0 | 23 | 74 | 52 | 52 | 68 |
True 0 | 1458 | 1360 | 536 | 1198 | 1198 | 881 |
False 1 | 0 | 98 | 921 | 260 | 260 | 576 |
False 0 | 74 | 51 | 0 | 22 | 22 | 6 |

Confusion matrix - Best models for Cluster based Classification
Cluster only classification
Next we evaluate the clustering results using only the cluster labels as features and the stroke variable as the target.
Clustering | Resampling | Scoring | Model | Precision | Recall | Accuracy | F1 | F2 | F4 | |
---|---|---|---|---|---|---|---|---|---|---|
1 | KMeanAF | Adasyn | f1 | Decision Tree | 0.082 | 0.716 | 0.597 | 0.146 | 0.280 | 0.491 |
9 | KMeanAF | Adasyn | f2 | Decision Tree | 0.082 | 0.716 | 0.597 | 0.146 | 0.280 | 0.491 |
13 | KMeanAF | Adasyn | f4 | Decision Tree | 0.082 | 0.716 | 0.597 | 0.146 | 0.280 | 0.491 |
95 | AggloAF | TomekLinks | f4 | AdaBoost | 0.074 | 0.919 | 0.441 | 0.137 | 0.280 | 0.550 |
5 | KMeanAF | Adasyn | recall | Decision Tree | 0.082 | 0.716 | 0.597 | 0.146 | 0.280 | 0.491 |
11 | KMeanAF | Adasyn | f2 | AdaBoost | 0.070 | 1.000 | 0.361 | 0.131 | 0.274 | 0.562 |
8 | KMeanAF | Adasyn | f2 | Logistic Regression | 0.070 | 1.000 | 0.361 | 0.131 | 0.274 | 0.562 |
6 | KMeanAF | Adasyn | recall | Naive Bayes | 0.070 | 1.000 | 0.361 | 0.131 | 0.274 | 0.562 |
3 | KMeanAF | Adasyn | f1 | AdaBoost | 0.070 | 1.000 | 0.361 | 0.131 | 0.274 | 0.562 |
2 | KMeanAF | Adasyn | f1 | Naive Bayes | 0.070 | 1.000 | 0.361 | 0.131 | 0.274 | 0.562 |
10 | KMeanAF | Adasyn | f2 | Naive Bayes | 0.070 | 1.000 | 0.361 | 0.131 | 0.274 | 0.562 |
4 | KMeanAF | Adasyn | recall | Logistic Regression | 0.070 | 1.000 | 0.361 | 0.131 | 0.274 | 0.562 |
23 | KMeanAF | TomekLinks | recall | AdaBoost | 0.070 | 1.000 | 0.361 | 0.131 | 0.274 | 0.562 |
0 | KMeanAF | Adasyn | f1 | Logistic Regression | 0.070 | 1.000 | 0.361 | 0.131 | 0.274 | 0.562 |
15 | KMeanAF | Adasyn | f4 | AdaBoost | 0.070 | 1.000 | 0.361 | 0.131 | 0.274 | 0.562 |
22 | KMeanAF | TomekLinks | recall | Naive Bayes | 0.070 | 1.000 | 0.361 | 0.131 | 0.274 | 0.562 |
30 | KMeanAF | TomekLinks | f4 | Naive Bayes | 0.070 | 1.000 | 0.361 | 0.131 | 0.274 | 0.562 |
31 | KMeanAF | TomekLinks | f4 | AdaBoost | 0.070 | 1.000 | 0.361 | 0.131 | 0.274 | 0.562 |
12 | KMeanAF | Adasyn | f4 | Logistic Regression | 0.070 | 1.000 | 0.361 | 0.131 | 0.274 | 0.562 |
14 | KMeanAF | Adasyn | f4 | Naive Bayes | 0.070 | 1.000 | 0.361 | 0.131 | 0.274 | 0.562 |
67 | AggloAF | Adasyn | f1 | AdaBoost | 0.067 | 0.946 | 0.364 | 0.126 | 0.262 | 0.535 |
7 | KMeanAF | Adasyn | recall | AdaBoost | 0.057 | 1.000 | 0.198 | 0.108 | 0.232 | 0.506 |
84 | AggloAF | TomekLinks | recall | Logistic Regression | 0.056 | 1.000 | 0.187 | 0.106 | 0.229 | 0.502 |
94 | AggloAF | TomekLinks | f4 | Naive Bayes | 0.056 | 1.000 | 0.187 | 0.106 | 0.229 | 0.502 |
92 | AggloAF | TomekLinks | f4 | Logistic Regression | 0.056 | 1.000 | 0.187 | 0.106 | 0.229 | 0.502 |
87 | AggloAF | TomekLinks | recall | AdaBoost | 0.056 | 1.000 | 0.187 | 0.106 | 0.229 | 0.502 |
86 | AggloAF | TomekLinks | recall | Naive Bayes | 0.056 | 1.000 | 0.187 | 0.106 | 0.229 | 0.502 |
70 | AggloAF | Adasyn | recall | Naive Bayes | 0.056 | 1.000 | 0.187 | 0.106 | 0.229 | 0.502 |
79 | AggloAF | Adasyn | f4 | AdaBoost | 0.056 | 1.000 | 0.187 | 0.106 | 0.229 | 0.502 |
78 | AggloAF | Adasyn | f4 | Naive Bayes | 0.056 | 1.000 | 0.187 | 0.106 | 0.229 | 0.502 |
75 | AggloAF | Adasyn | f2 | AdaBoost | 0.056 | 1.000 | 0.187 | 0.106 | 0.229 | 0.502 |
74 | AggloAF | Adasyn | f2 | Naive Bayes | 0.056 | 1.000 | 0.187 | 0.106 | 0.229 | 0.502 |
66 | AggloAF | Adasyn | f1 | Naive Bayes | 0.056 | 1.000 | 0.187 | 0.106 | 0.229 | 0.502 |
71 | AggloAF | Adasyn | recall | AdaBoost | 0.056 | 1.000 | 0.187 | 0.106 | 0.229 | 0.502 |
Clustering | Resampling | Scoring | Model | F4 | True 1 | True 0 | False 1 | False 0 | |
---|---|---|---|---|---|---|---|---|---|
1 | KMeanAF | Adasyn | f1 | Decision Tree | 0.491 | 53 | 861 | 597 | 21 |
9 | KMeanAF | Adasyn | f2 | Decision Tree | 0.491 | 53 | 861 | 597 | 21 |
13 | KMeanAF | Adasyn | f4 | Decision Tree | 0.491 | 53 | 861 | 597 | 21 |
95 | AggloAF | TomekLinks | f4 | AdaBoost | 0.550 | 68 | 608 | 850 | 6 |
5 | KMeanAF | Adasyn | recall | Decision Tree | 0.491 | 53 | 861 | 597 | 21 |
11 | KMeanAF | Adasyn | f2 | AdaBoost | 0.562 | 74 | 479 | 979 | 0 |
8 | KMeanAF | Adasyn | f2 | Logistic Regression | 0.562 | 74 | 479 | 979 | 0 |
6 | KMeanAF | Adasyn | recall | Naive Bayes | 0.562 | 74 | 479 | 979 | 0 |
3 | KMeanAF | Adasyn | f1 | AdaBoost | 0.562 | 74 | 479 | 979 | 0 |
2 | KMeanAF | Adasyn | f1 | Naive Bayes | 0.562 | 74 | 479 | 979 | 0 |
10 | KMeanAF | Adasyn | f2 | Naive Bayes | 0.562 | 74 | 479 | 979 | 0 |
4 | KMeanAF | Adasyn | recall | Logistic Regression | 0.562 | 74 | 479 | 979 | 0 |
23 | KMeanAF | TomekLinks | recall | AdaBoost | 0.562 | 74 | 479 | 979 | 0 |
0 | KMeanAF | Adasyn | f1 | Logistic Regression | 0.562 | 74 | 479 | 979 | 0 |
15 | KMeanAF | Adasyn | f4 | AdaBoost | 0.562 | 74 | 479 | 979 | 0 |
22 | KMeanAF | TomekLinks | recall | Naive Bayes | 0.562 | 74 | 479 | 979 | 0 |
30 | KMeanAF | TomekLinks | f4 | Naive Bayes | 0.562 | 74 | 479 | 979 | 0 |
31 | KMeanAF | TomekLinks | f4 | AdaBoost | 0.562 | 74 | 479 | 979 | 0 |
12 | KMeanAF | Adasyn | f4 | Logistic Regression | 0.562 | 74 | 479 | 979 | 0 |
14 | KMeanAF | Adasyn | f4 | Naive Bayes | 0.562 | 74 | 479 | 979 | 0 |
67 | AggloAF | Adasyn | f1 | AdaBoost | 0.535 | 70 | 488 | 970 | 4 |
7 | KMeanAF | Adasyn | recall | AdaBoost | 0.506 | 74 | 230 | 1228 | 0 |
84 | AggloAF | TomekLinks | recall | Logistic Regression | 0.502 | 74 | 212 | 1246 | 0 |
94 | AggloAF | TomekLinks | f4 | Naive Bayes | 0.502 | 74 | 212 | 1246 | 0 |
92 | AggloAF | TomekLinks | f4 | Logistic Regression | 0.502 | 74 | 212 | 1246 | 0 |
87 | AggloAF | TomekLinks | recall | AdaBoost | 0.502 | 74 | 212 | 1246 | 0 |
86 | AggloAF | TomekLinks | recall | Naive Bayes | 0.502 | 74 | 212 | 1246 | 0 |
70 | AggloAF | Adasyn | recall | Naive Bayes | 0.502 | 74 | 212 | 1246 | 0 |
79 | AggloAF | Adasyn | f4 | AdaBoost | 0.502 | 74 | 212 | 1246 | 0 |
78 | AggloAF | Adasyn | f4 | Naive Bayes | 0.502 | 74 | 212 | 1246 | 0 |
75 | AggloAF | Adasyn | f2 | AdaBoost | 0.502 | 74 | 212 | 1246 | 0 |
74 | AggloAF | Adasyn | f2 | Naive Bayes | 0.502 | 74 | 212 | 1246 | 0 |
66 | AggloAF | Adasyn | f1 | Naive Bayes | 0.502 | 74 | 212 | 1246 | 0 |
71 | AggloAF | Adasyn | recall | AdaBoost | 0.502 | 74 | 212 | 1246 | 0 |
The best models depending on the scoring metric are shown below.
Best Accuracy | Best Precision | Best Recall | Best F1 | Best F2 | Best F4 | |
---|---|---|---|---|---|---|
Model ID | 48 | 50 | 0 | 32 | 32 | 0 |
Clustering | DBSCANAF | DBSCANAF | KMeanAF | DBSCANAF | DBSCANAF | KMeanAF |
Resampling | TomekLinks | TomekLinks | Adasyn | Adasyn | Adasyn | Adasyn |
Scoring | f1 | f1 | f1 | f1 | f1 | f1 |
Model | Logistic Reg | Naive Bayes | Logistic Reg | Logistic Reg | Logistic Reg | Logistic Reg |
Precision | 0.1636 | 0.188 | 0.0703 | 0.1809 | 0.1809 | 0.0703 |
Recall | 0.1216 | 0.3378 | 1.0 | 0.4595 | 0.4595 | 1.0 |
Accuracy | 0.9275 | 0.8975 | 0.361 | 0.8733 | 0.8733 | 0.361 |
F1 | 0.1395 | 0.2415 | 0.1313 | 0.2595 | 0.2595 | 0.1313 |
F2 | 0.1282 | 0.2914 | 0.2743 | 0.3512 | 0.3512 | 0.2743 |
F4 | 0.1235 | 0.3227 | 0.5624 | 0.4213 | 0.4213 | 0.5624 |
True 1 | 9 | 25 | 74 | 34 | 34 | 74 |
True 0 | 1411 | 1349 | 479 | 1303 | 1303 | 479 |
False 1 | 46 | 108 | 979 | 154 | 154 | 979 |
False 0 | 65 | 49 | 0 | 40 | 40 | 0 |

Confusion Matrix - Best models for Cluster only Classification
Classification within the clusters
As suggested, we will now adjust our approach and fit classification models for each cluster individually. We evaluated either over-sampling or no resampling, because with under-sampling the size of the clusters would leave too few observations. We evaluate the results in two variations (a sketch of the per-cluster procedure follows the list):
- Same model for all clusters: We apply the same model to all clusters and calculate the performance based on the aggregated confusion matrix.
- Best model per cluster: We select the best-performing model per cluster label and calculate the performance based on the aggregated confusion matrix.
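A minimal sketch of the per-cluster procedure, assuming `X`, `y` and a Series of cluster labels `clusters` share the same index; LogisticRegression stands in for whichever classifier and resampling combination is being evaluated:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

total_cm = np.zeros((2, 2), dtype=int)

for label in sorted(clusters.unique()):
    mask = clusters == label
    X_c, y_c = X[mask], y[mask]
    # Skip clusters that contain only one class; they cannot be split with stratification.
    if y_c.nunique() < 2:
        continue
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_c, y_c, test_size=0.3, stratify=y_c, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    total_cm += confusion_matrix(y_te, model.predict(X_te), labels=[0, 1])

# Aggregated counts across clusters: rows = true class, columns = predicted class.
tn, fp, fn, tp = total_cm.ravel()
```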
Results for the Same Model for all Clusters
Clustering | Model | Resampling | Scoring | Precision | Recall | Accuracy | F1 | F2 | F4 | |
---|---|---|---|---|---|---|---|---|---|---|
22 | AggloAF | Logistic Regression | None | f4 | 0.1163 | 0.6667 | 0.7741 | 0.1980 | 0.3425 | 0.5215 |
84 | KMeanAF | Logistic Regression | None | f1 | 0.1201 | 0.5811 | 0.8100 | 0.1991 | 0.3287 | 0.4741 |
85 | KMeanAF | Logistic Regression | None | f2 | 0.1201 | 0.5811 | 0.8100 | 0.1991 | 0.3287 | 0.4741 |
45 | DBSCANAF | Decision Tree | None | f2 | 0.1250 | 0.5541 | 0.7913 | 0.2040 | 0.3285 | 0.4610 |
77 | KMeanAF | Decision Tree | None | f2 | 0.1333 | 0.5135 | 0.8446 | 0.2117 | 0.3270 | 0.4398 |
44 | DBSCANAF | Decision Tree | None | f1 | 0.1329 | 0.5135 | 0.8147 | 0.2111 | 0.3265 | 0.4395 |
86 | KMeanAF | Logistic Regression | None | f4 | 0.1103 | 0.6216 | 0.7809 | 0.1874 | 0.3226 | 0.4884 |
29 | AggloAF | Naive Bayes | None | f2 | 0.1242 | 0.5200 | 0.8265 | 0.2005 | 0.3176 | 0.4379 |
47 | DBSCANAF | Decision Tree | None | recall | 0.1114 | 0.5541 | 0.7652 | 0.1855 | 0.3087 | 0.4491 |
46 | DBSCANAF | Decision Tree | None | f4 | 0.1114 | 0.5541 | 0.7652 | 0.1855 | 0.3087 | 0.4491 |
3 | AggloAF | AdaBoost | Adasyn | recall | 0.0996 | 0.6000 | 0.7563 | 0.1708 | 0.2992 | 0.4631 |
93 | KMeanAF | Naive Bayes | None | f2 | 0.1176 | 0.4865 | 0.8309 | 0.1895 | 0.2990 | 0.4107 |
2 | AggloAF | AdaBoost | Adasyn | f4 | 0.1153 | 0.4933 | 0.8204 | 0.1869 | 0.2979 | 0.4135 |
21 | AggloAF | Logistic Regression | None | f2 | 0.0928 | 0.6400 | 0.7234 | 0.1622 | 0.2938 | 0.4752 |
23 | AggloAF | Logistic Regression | None | recall | 0.1020 | 0.5467 | 0.7797 | 0.1719 | 0.2920 | 0.4351 |
78 | KMeanAF | Decision Tree | None | f4 | 0.1095 | 0.5000 | 0.8144 | 0.1796 | 0.2918 | 0.4133 |
61 | DBSCANAF | Naive Bayes | None | f2 | 0.1023 | 0.5405 | 0.7489 | 0.1720 | 0.2911 | 0.4317 |
20 | AggloAF | Logistic Regression | None | f1 | 0.0935 | 0.6133 | 0.7351 | 0.1623 | 0.2904 | 0.4622 |
94 | KMeanAF | Naive Bayes | None | f4 | 0.1066 | 0.5000 | 0.8094 | 0.1758 | 0.2877 | 0.4108 |
33 | DBSCANAF | AdaBoost | Adasyn | f2 | 0.0854 | 0.7027 | 0.6223 | 0.1523 | 0.2873 | 0.4930 |
87 | KMeanAF | Logistic Regression | None | recall | 0.1134 | 0.4459 | 0.8358 | 0.1808 | 0.2811 | 0.3803 |
79 | KMeanAF | Decision Tree | None | recall | 0.1016 | 0.5000 | 0.8001 | 0.1689 | 0.2803 | 0.4063 |
62 | DBSCANAF | Naive Bayes | None | f4 | 0.0817 | 0.6622 | 0.6243 | 0.1454 | 0.2734 | 0.4669 |
30 | AggloAF | Naive Bayes | None | f4 | 0.0870 | 0.5600 | 0.7356 | 0.1505 | 0.2682 | 0.4242 |
67 | KMeanAF | AdaBoost | Adasyn | recall | 0.1006 | 0.4595 | 0.8111 | 0.1650 | 0.2681 | 0.3798 |
95 | KMeanAF | Naive Bayes | None | recall | 0.0823 | 0.6081 | 0.7084 | 0.1449 | 0.2669 | 0.4419 |
91 | KMeanAF | Naive Bayes | Adasyn | recall | 0.0751 | 0.5676 | 0.6985 | 0.1327 | 0.2456 | 0.4096 |
27 | AggloAF | Naive Bayes | Adasyn | recall | 0.0743 | 0.5733 | 0.6832 | 0.1315 | 0.2446 | 0.4109 |
89 | KMeanAF | Naive Bayes | Adasyn | f2 | 0.0752 | 0.5541 | 0.7051 | 0.1325 | 0.2438 | 0.4031 |
90 | KMeanAF | Naive Bayes | Adasyn | f4 | 0.0750 | 0.5541 | 0.7040 | 0.1320 | 0.2432 | 0.4027 |
38 | DBSCANAF | AdaBoost | None | f4 | 0.0673 | 0.6622 | 0.5408 | 0.1222 | 0.2393 | 0.4357 |
39 | DBSCANAF | AdaBoost | None | recall | 0.0673 | 0.6622 | 0.5408 | 0.1222 | 0.2393 | 0.4357 |
56 | DBSCANAF | Naive Bayes | Adasyn | f1 | 0.0643 | 0.6216 | 0.5453 | 0.1166 | 0.2275 | 0.4118 |
70 | KMeanAF | AdaBoost | None | f4 | 0.0662 | 0.5270 | 0.6787 | 0.1176 | 0.2203 | 0.3739 |
Clustering | Model | Resampling | Scoring | F2 | True 1 | True 0 | False 1 | False 0 | |
---|---|---|---|---|---|---|---|---|---|
22 | AggloAF | Logistic Regression | None | f4 | 0.3425 | 50 | 1338 | 380 | 25 |
84 | KMeanAF | Logistic Regression | None | f1 | 0.3287 | 43 | 1432 | 315 | 31 |
85 | KMeanAF | Logistic Regression | None | f2 | 0.3287 | 43 | 1432 | 315 | 31 |
45 | DBSCANAF | Decision Tree | None | f2 | 0.3285 | 41 | 1172 | 287 | 33 |
77 | KMeanAF | Decision Tree | None | f2 | 0.3270 | 38 | 1500 | 247 | 36 |
44 | DBSCANAF | Decision Tree | None | f1 | 0.3265 | 38 | 1211 | 248 | 36 |
86 | KMeanAF | Logistic Regression | None | f4 | 0.3226 | 46 | 1376 | 371 | 28 |
29 | AggloAF | Naive Bayes | None | f2 | 0.3176 | 39 | 1443 | 275 | 36 |
47 | DBSCANAF | Decision Tree | None | recall | 0.3087 | 41 | 1132 | 327 | 33 |
46 | DBSCANAF | Decision Tree | None | f4 | 0.3087 | 41 | 1132 | 327 | 33 |
3 | AggloAF | AdaBoost | Adasyn | recall | 0.2992 | 45 | 1311 | 407 | 30 |
93 | KMeanAF | Naive Bayes | None | f2 | 0.2990 | 36 | 1477 | 270 | 38 |
2 | AggloAF | AdaBoost | Adasyn | f4 | 0.2979 | 37 | 1434 | 284 | 38 |
21 | AggloAF | Logistic Regression | None | f2 | 0.2938 | 48 | 1249 | 469 | 27 |
23 | AggloAF | Logistic Regression | None | recall | 0.2920 | 41 | 1357 | 361 | 34 |
78 | KMeanAF | Decision Tree | None | f4 | 0.2918 | 37 | 1446 | 301 | 37 |
61 | DBSCANAF | Naive Bayes | None | f2 | 0.2911 | 40 | 1108 | 351 | 34 |
20 | AggloAF | Logistic Regression | None | f1 | 0.2904 | 46 | 1272 | 446 | 29 |
94 | KMeanAF | Naive Bayes | None | f4 | 0.2877 | 37 | 1437 | 310 | 37 |
33 | DBSCANAF | AdaBoost | Adasyn | f2 | 0.2873 | 52 | 902 | 557 | 22 |
87 | KMeanAF | Logistic Regression | None | recall | 0.2811 | 33 | 1489 | 258 | 41 |
79 | KMeanAF | Decision Tree | None | recall | 0.2803 | 37 | 1420 | 327 | 37 |
62 | DBSCANAF | Naive Bayes | None | f4 | 0.2734 | 49 | 908 | 551 | 25 |
30 | AggloAF | Naive Bayes | None | f4 | 0.2682 | 42 | 1277 | 441 | 33 |
67 | KMeanAF | AdaBoost | Adasyn | recall | 0.2681 | 34 | 1443 | 304 | 40 |
95 | KMeanAF | Naive Bayes | None | recall | 0.2669 | 45 | 1245 | 502 | 29 |
91 | KMeanAF | Naive Bayes | Adasyn | recall | 0.2456 | 42 | 1230 | 517 | 32 |
27 | AggloAF | Naive Bayes | Adasyn | recall | 0.2446 | 43 | 1182 | 536 | 32 |
89 | KMeanAF | Naive Bayes | Adasyn | f2 | 0.2438 | 41 | 1243 | 504 | 33 |
90 | KMeanAF | Naive Bayes | Adasyn | f4 | 0.2432 | 41 | 1241 | 506 | 33 |
38 | DBSCANAF | AdaBoost | None | f4 | 0.2393 | 49 | 780 | 679 | 25 |
39 | DBSCANAF | AdaBoost | None | recall | 0.2393 | 49 | 780 | 679 | 25 |
56 | DBSCANAF | Naive Bayes | Adasyn | f1 | 0.2275 | 46 | 790 | 669 | 28 |
70 | KMeanAF | AdaBoost | None | f4 | 0.2203 | 39 | 1197 | 550 | 35 |
The best models depending on the scoring metric are shown below.
Best Accuracy | Best Precision | Best Recall | Best F1 | Best F2 | Best F4 | |
---|---|---|---|---|---|---|
Model ID | 77 | 77 | 34 | 77 | 22 | 22 |
Clustering | KMeanAF | KMeanAF | DBSCANAF | KMeanAF | AggloAF | AggloAF |
Model | Decision Tree | Decision Tree | AdaBoost | Decision Tree | Logistic Reg | Logistic Reg |
Resampling | None | None | Adasyn | None | None | None |
Scoring | f2 | f2 | f4 | f2 | f4 | f4 |
Precision | 0.1333 | 0.1333 | 0.0509 | 0.1333 | 0.1163 | 0.1163 |
Recall | 0.5135 | 0.5135 | 0.8108 | 0.5135 | 0.6667 | 0.6667 |
Accuracy | 0.8446 | 0.8446 | 0.2616 | 0.8446 | 0.7741 | 0.7741 |
F1 | 0.2117 | 0.2117 | 0.0958 | 0.2117 | 0.198 | 0.198 |
F2 | 0.327 | 0.327 | 0.2035 | 0.327 | 0.3425 | 0.3425 |
F4 | 0.4398 | 0.4398 | 0.4318 | 0.4398 | 0.5215 | 0.5215 |
True 1 | 38 | 38 | 60 | 38 | 50 | 50 |
True 0 | 1500 | 1500 | 341 | 1500 | 1338 | 1338 |
False 1 | 247 | 247 | 1118 | 247 | 380 | 380 |
False 0 | 36 | 36 | 14 | 36 | 25 | 25 |

Confusion matrix - Best models for Cluster individual classification
Results for Best Model per Cluster
Clustering | CID | Resampling | Scoring | Model | Precision | Recall | Accuracy | F1 | F2 | F4 | |
---|---|---|---|---|---|---|---|---|---|---|---|
448 | AggloAF | 0 | Adasyn | f1 | Log Reg | 0.3333 | 0.7500 | 0.9759 | 0.4615 | 0.6000 | 0.6986 |
480 | AggloAF | 1 | Adasyn | f1 | Log Reg | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
519 | AggloAF | 2 | Adasyn | recall | AdaBoost | 0.0900 | 0.8182 | 0.5550 | 0.1622 | 0.3125 | 0.5543 |
560 | AggloAF | 3 | None | f1 | Log Reg | 0.1887 | 0.6250 | 0.6231 | 0.2899 | 0.4274 | 0.5502 |
594 | AggloAF | 4 | None | f1 | Naive Bayes | 0.1176 | 0.3333 | 0.8927 | 0.1739 | 0.2439 | 0.3009 |
628 | AggloAF | 5 | None | recall | Log Reg | 0.1719 | 0.8462 | 0.6382 | 0.2857 | 0.4741 | 0.6875 |
666 | AggloAF | 6 | None | f2 | Naive Bayes | 0.1750 | 1.0000 | 0.2048 | 0.2979 | 0.5147 | 0.7829 |
687 | AggloAF | 7 | Adasyn | f4 | AdaBoost | 0.0741 | 0.6667 | 0.7719 | 0.1333 | 0.2564 | 0.4533 |
721 | AggloAF | 8 | None | f1 | Dec. Tree | 0.1667 | 0.5000 | 0.8717 | 0.2500 | 0.3571 | 0.4474 |
343 | DBSCANAF | 0 | None | recall | AdaBoost | 0.2174 | 1.0000 | 0.4375 | 0.3571 | 0.5814 | 0.8252 |
377 | DBSCANAF | 1 | None | f2 | Dec. Tree | 0.1042 | 0.5682 | 0.8227 | 0.1761 | 0.3005 | 0.4502 |
400 | DBSCANAF | 2 | None | f1 | Log Reg | 0.1887 | 0.6250 | 0.6231 | 0.2899 | 0.4274 | 0.5502 |
438 | DBSCANAF | 3 | None | recall | Naive Bayes | 0.2500 | 1.0000 | 0.3684 | 0.4000 | 0.6250 | 0.8500 |
21 | KMeanAF | 0 | None | recall | Dec. Tree | 0.0794 | 0.5000 | 0.7249 | 0.1370 | 0.2427 | 0.3812 |
32 | KMeanAF | 1 | Adasyn | f1 | Log Reg | 0.5000 | 1.0000 | 0.9961 | 0.6667 | 0.8333 | 0.9444 |
64 | KMeanAF | 2 | Adasyn | f1 | Log Reg | 0.2500 | 0.4286 | 0.9065 | 0.3158 | 0.3750 | 0.4113 |
113 | KMeanAF | 3 | None | f1 | Dec. Tree | 0.1875 | 0.8571 | 0.6747 | 0.3077 | 0.5000 | 0.7083 |
144 | KMeanAF | 4 | None | f1 | Log Reg | 0.0870 | 0.8571 | 0.5461 | 0.1579 | 0.3093 | 0.5635 |
160 | KMeanAF | 5 | Adasyn | f1 | Log Reg | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
218 | KMeanAF | 6 | None | f2 | Naive Bayes | 0.1750 | 1.0000 | 0.2048 | 0.2979 | 0.5147 | 0.7829 |
243 | KMeanAF | 7 | None | f1 | AdaBoost | 0.1852 | 0.7143 | 0.8400 | 0.2941 | 0.4545 | 0.6115 |
274 | KMeanAF | 8 | None | f1 | Naive Bayes | 0.1818 | 0.4000 | 0.8909 | 0.2500 | 0.3226 | 0.3736 |
304 | KMeanAF | 9 | None | f1 | Log Reg | 0.1887 | 0.6250 | 0.6231 | 0.2899 | 0.4274 | 0.5502 |
Clustering | CID | Resampling | Scoring | Model | F2 | True 1 | True 0 | False 1 | False 0 | |
---|---|---|---|---|---|---|---|---|---|---|
448 | AggloAF | 0 | Adasyn | f1 | Log Reg | 0.6000 | 3 | 280 | 6 | 1 |
480 | AggloAF | 1 | Adasyn | f1 | Log Reg | 1.0000 | 0 | 451 | 0 | 0 |
519 | AggloAF | 2 | Adasyn | recall | AdaBoost | 0.3125 | 9 | 107 | 91 | 2 |
560 | AggloAF | 3 | None | f1 | Log Reg | 0.4274 | 10 | 71 | 43 | 6 |
594 | AggloAF | 4 | None | f1 | Naive Bayes | 0.2439 | 2 | 156 | 15 | 4 |
628 | AggloAF | 5 | None | recall | Log Reg | 0.4741 | 11 | 86 | 53 | 2 |
666 | AggloAF | 6 | None | f2 | Naive Bayes | 0.5147 | 14 | 3 | 66 | 0 |
687 | AggloAF | 7 | Adasyn | f4 | AdaBoost | 0.2564 | 2 | 86 | 25 | 1 |
721 | AggloAF | 8 | None | f1 | Dec. Tree | 0.3571 | 4 | 159 | 20 | 4 |
343 | DBSCANAF | 0 | None | recall | AdaBoost | 0.5814 | 10 | 18 | 36 | 0 |
377 | DBSCANAF | 1 | None | f2 | Dec. Tree | 0.3005 | 25 | 1061 | 215 | 19 |
400 | DBSCANAF | 2 | None | f1 | Log Reg | 0.4274 | 10 | 71 | 43 | 6 |
438 | DBSCANAF | 3 | None | recall | Naive Bayes | 0.6250 | 4 | 3 | 12 | 0 |
21 | KMeanAF | 0 | None | recall | Dec. Tree | 0.2427 | 5 | 161 | 58 | 5 |
32 | KMeanAF | 1 | Adasyn | f1 | Log Reg | 0.8333 | 1 | 253 | 1 | 0 |
64 | KMeanAF | 2 | Adasyn | f1 | Log Reg | 0.3750 | 3 | 123 | 9 | 4 |
113 | KMeanAF | 3 | None | f1 | Dec. Tree | 0.5000 | 6 | 50 | 26 | 1 |
144 | KMeanAF | 4 | None | f1 | Log Reg | 0.3093 | 6 | 71 | 63 | 1 |
160 | KMeanAF | 5 | Adasyn | f1 | Log Reg | 1.0000 | 0 | 501 | 0 | 0 |
218 | KMeanAF | 6 | None | f2 | Naive Bayes | 0.5147 | 14 | 3 | 66 | 0 |
243 | KMeanAF | 7 | None | f1 | AdaBoost | 0.4545 | 5 | 121 | 22 | 2 |
274 | KMeanAF | 8 | None | f1 | Naive Bayes | 0.3226 | 2 | 96 | 9 | 3 |
304 | KMeanAF | 9 | None | f1 | Log Reg | 0.4274 | 10 | 71 | 43 | 6 |
Clustering | Precision | Recall | Accuracy | F1 | F2 | F4 | |
---|---|---|---|---|---|---|---|
0 | AggloAF | 0.1471 | 0.7333 | 0.8109 | 0.2450 | 0.4080 | 0.5940 |
2 | KMeanAF | 0.1490 | 0.7027 | 0.8248 | 0.2459 | 0.4031 | 0.5766 |
1 | DBSCANAF | 0.1380 | 0.6622 | 0.7841 | 0.2284 | 0.3763 | 0.5413 |
Clustering | F2 | True 1 | True 0 | False 1 | False 0 | |
---|---|---|---|---|---|---|
0 | AggloAF | 0.4080 | 55 | 1399 | 319 | 20 |
2 | KMeanAF | 0.4031 | 52 | 1450 | 297 | 22 |
1 | DBSCANAF | 0.3763 | 49 | 1153 | 306 | 25 |
The best models depending on the scoring metric are shown below.
Best Accuracy | Best Precision | Best Recall | Best F1 | Best F2 | Best F4 | |
---|---|---|---|---|---|---|
Model ID | 2 | 2 | 0 | 2 | 0 | 0 |
Clustering | KMeanAF | KMeanAF | AggloAF | KMeanAF | AggloAF | AggloAF |
Precision | 0.149 | 0.149 | 0.1471 | 0.149 | 0.1471 | 0.1471 |
Recall | 0.7027 | 0.7027 | 0.7333 | 0.7027 | 0.7333 | 0.7333 |
Accuracy | 0.8248 | 0.8248 | 0.8109 | 0.8248 | 0.8109 | 0.8109 |
F1 | 0.2459 | 0.2459 | 0.245 | 0.2459 | 0.245 | 0.245 |
F2 | 0.4031 | 0.4031 | 0.408 | 0.4031 | 0.408 | 0.408 |
F4 | 0.5766 | 0.5766 | 0.594 | 0.5766 | 0.594 | 0.594 |
True 1 | 52 | 52 | 55 | 52 | 55 | 55 |
True 0 | 1450 | 1450 | 1399 | 1450 | 1399 | 1399 |
False 1 | 297 | 297 | 319 | 297 | 319 | 319 |
False 0 | 22 | 22 | 20 | 22 | 20 | 20 |

Confusion matrix - Best models for Best Model per Cluster
6. Dimensionality reduction - PCA and KPCA
Dimensionality reduction is crucial when working with high-dimensional datasets like the stroke dataset because it helps simplify the data, making it easier to visualize and analyze. High-dimensional data can be complex and computationally expensive to process, and it often contains redundant or irrelevant features that can obscure meaningful patterns. By reducing the number of dimensions, we can highlight the most important features that capture the majority of the data's variance, thus facilitating more efficient and insightful analysis. Reducing dimensions aids in visualizing the data, which is essential for understanding the underlying structure and relationships within the dataset. Visualizing the data in two dimensions can help identify clusters of patients with similar characteristics or risk profiles. This can provide valuable insights for developing targeted prevention and treatment strategies, ultimately improving patient outcomes. We will apply two widely used dimensionality reduction methods:
- PCA: Principal Component Analysis
- KPCA: Kernel Principal Component Analysis
PCA
Principal Component Analysis (PCA) is a widely used technique for dimensionality reduction that transforms the original high-dimensional data into a new set of uncorrelated variables called principal components. These principal components are linear combinations of the original features, ordered by the amount of variance they capture in the data. The first principal component captures the most variance, followed by the second, and so on. By selecting a subset of these components, we can reduce the dimensionality of the dataset while retaining most of its variability. PCA is particularly useful for identifying patterns and trends in the data, as well as for noise reduction and feature extraction.
To support visualization, we first scale the data and then run PCA to reduce it to two dimensions; a sketch of this step follows, together with the summary statistics of the scaled features.
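A minimal sketch of this step, assuming the engineered features live in a pandas DataFrame `X` and the target in a Series `y` (names hypothetical), using scikit-learn's `StandardScaler` and `PCA`:

```python
# Sketch: scale the engineered features and project them onto 2 principal components.
# Assumes X is a DataFrame of the 10 engineered features and y is the stroke target.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)   # mean 0, std 1 per feature (see table below)
pca_2d = PCA(n_components=2).fit(X_scaled)
components = pca_2d.transform(X_scaled)        # shape (n_samples, 2)

# Scatterplot of the two principal components, coloured by the stroke target.
sns.scatterplot(x=components[:, 0], y=components[:, 1], hue=y, alpha=0.5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("2D PCA projection coloured by stroke")
plt.show()

print("Explained variance ratio:", pca_2d.explained_variance_ratio_)
```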
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
gender | 5104.0 | 0.0 | 1.0 | -0.840 | -0.840 | -0.840 | 1.188 | 3.217 |
age | 5104.0 | -0.0 | 1.0 | -1.910 | -0.807 | 0.077 | 0.785 | 1.714 |
hypertension | 5104.0 | -0.0 | 1.0 | -0.328 | -0.328 | -0.328 | -0.328 | 3.051 |
heart_disease | 5104.0 | 0.0 | 1.0 | -0.239 | -0.239 | -0.239 | -0.239 | 4.182 |
ever_married | 5104.0 | -0.0 | 1.0 | -1.383 | -1.383 | 0.723 | 0.723 | 0.723 |
work_type | 5104.0 | -0.0 | 1.0 | -1.988 | -0.153 | -0.153 | 0.764 | 1.681 |
Residence_type | 5104.0 | 0.0 | 1.0 | -1.017 | -1.017 | 0.984 | 0.984 | 0.984 |
smoking_status | 5104.0 | -0.0 | 1.0 | -1.286 | -1.286 | 0.581 | 0.581 | 1.515 |
glucose_group | 5104.0 | 0.0 | 1.0 | -0.412 | -0.412 | -0.412 | -0.412 | 2.945 |
bmi_group | 5104.0 | -0.0 | 1.0 | -2.129 | -1.079 | -0.030 | 1.020 | 1.020 |

Scatterplot of 2D PCA principal components and target variable stroke as hue
The plot does not show good separation with respect to the target variable, and the cumulative explained variance ratio of the two components is low at 0.399.
Principal Component | Explained Variance Ratio | Cumulative EVR | Eigenvalues | |
---|---|---|---|---|
0 | PC1 | 0.274 | 0.274 | 2.738 |
1 | PC2 | 0.125 | 0.399 | 1.252 |
We therefore rerun PCA with the aim of explaining 95% of the variance and plot the cumulative explained variance.

PCA cumulative explained variance
Nine components are required to reach this threshold, explaining about 97% of the variance of the 10 features in our dataset; a sketch of this run is shown below, followed by the full explained-variance table.
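A sketch of this run, reusing `X_scaled` from above; passing a float between 0 and 1 as `n_components` tells scikit-learn's `PCA` to keep the smallest number of components whose cumulative explained variance reaches that threshold:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Keep as many components as needed to explain at least 95% of the variance.
pca_95 = PCA(n_components=0.95).fit(X_scaled)

evr = pca_95.explained_variance_ratio_
summary = pd.DataFrame({
    "Principal Component": [f"PC{i + 1}" for i in range(len(evr))],
    "Explained Variance Ratio": evr.round(3),
    "Cumulative EVR": np.cumsum(evr).round(3),
    "Eigenvalues": pca_95.explained_variance_.round(3),
})
print(summary)  # with this dataset, 9 components reach roughly 97% cumulative EVR
```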
Principal Component | Explained Variance Ratio | Cumulative EVR | Eigenvalues | |
---|---|---|---|---|
0 | PC1 | 0.274 | 0.274 | 2.738 |
1 | PC2 | 0.125 | 0.399 | 1.252 |
2 | PC3 | 0.100 | 0.499 | 1.001 |
3 | PC4 | 0.096 | 0.595 | 0.962 |
4 | PC5 | 0.088 | 0.683 | 0.882 |
5 | PC6 | 0.082 | 0.766 | 0.825 |
6 | PC7 | 0.080 | 0.846 | 0.799 |
7 | PC8 | 0.065 | 0.911 | 0.648 |
8 | PC9 | 0.060 | 0.971 | 0.603 |
We now fit classification models using the principal components as features; a sketch of one such fit is shown below, and the results are listed in the tables that follow.
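A sketch of one such fit, again assuming `X_scaled` and `y` from above and using imbalanced-learn's `ADASYN` inside a pipeline; the specific split, resampler, and classifier here are illustrative rather than the exact grid behind the tables below:

```python
from imblearn.over_sampling import ADASYN
from imblearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix, fbeta_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, stratify=y, random_state=42
)

# The imblearn pipeline applies the resampler only during fitting, never at prediction time.
pipe = Pipeline([
    ("pca", PCA(n_components=0.95)),
    ("resample", ADASYN(random_state=42)),
    ("clf", AdaBoostClassifier(random_state=42)),
])
pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("F2:", round(fbeta_score(y_test, y_pred, beta=2), 3))
print("F4:", round(fbeta_score(y_test, y_pred, beta=4), 3))
```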
Resampling | Scoring | Model | Precision | Recall | Accuracy | F1 | F2 | F4 | |
---|---|---|---|---|---|---|---|---|---|
3 | Adasyn | f1 | AdaBoost | 0.114 | 0.784 | 0.696 | 0.199 | 0.361 | 0.583 |
0 | Adasyn | f1 | Logistic Regression | 0.115 | 0.743 | 0.711 | 0.199 | 0.355 | 0.562 |
8 | Adasyn | f2 | Logistic Regression | 0.115 | 0.743 | 0.711 | 0.199 | 0.355 | 0.562 |
4 | Adasyn | recall | Logistic Regression | 0.111 | 0.730 | 0.704 | 0.193 | 0.345 | 0.549 |
12 | Adasyn | f4 | Logistic Regression | 0.111 | 0.730 | 0.704 | 0.193 | 0.345 | 0.549 |
7 | Adasyn | recall | AdaBoost | 0.075 | 0.946 | 0.435 | 0.139 | 0.285 | 0.563 |
11 | Adasyn | f2 | AdaBoost | 0.075 | 0.946 | 0.435 | 0.139 | 0.285 | 0.563 |
15 | Adasyn | f4 | AdaBoost | 0.075 | 0.946 | 0.435 | 0.139 | 0.285 | 0.563 |
Resampling | Scoring | Model | F2 | True 1 | True 0 | False 1 | False 0 | |
---|---|---|---|---|---|---|---|---|
3 | Adasyn | f1 | AdaBoost | 0.361 | 58 | 1008 | 450 | 16 |
0 | Adasyn | f1 | Logistic Regression | 0.355 | 55 | 1034 | 424 | 19 |
8 | Adasyn | f2 | Logistic Regression | 0.355 | 55 | 1034 | 424 | 19 |
4 | Adasyn | recall | Logistic Regression | 0.345 | 54 | 1025 | 433 | 20 |
12 | Adasyn | f4 | Logistic Regression | 0.345 | 54 | 1025 | 433 | 20 |
7 | Adasyn | recall | AdaBoost | 0.285 | 70 | 597 | 861 | 4 |
11 | Adasyn | f2 | AdaBoost | 0.285 | 70 | 597 | 861 | 4 |
15 | Adasyn | f4 | AdaBoost | 0.285 | 70 | 597 | 861 | 4 |
The best models depending on the scoring metric are shown below.
Best Accuracy | Best Precision | Best Recall | Best F1 | Best F2 | Best F4 | |
---|---|---|---|---|---|---|
Model ID | 0 | 0 | 7 | 3 | 3 | 3 |
Resampling | Adasyn | Adasyn | Adasyn | Adasyn | Adasyn | Adasyn |
Scoring | f1 | f1 | recall | f1 | f1 | f1 |
Model | Logistic Reg | Logistic Reg | AdaBoost | AdaBoost | AdaBoost | AdaBoost |
Precision | 0.115 | 0.115 | 0.075 | 0.114 | 0.114 | 0.114 |
Recall | 0.743 | 0.743 | 0.946 | 0.784 | 0.784 | 0.784 |
Accuracy | 0.711 | 0.711 | 0.435 | 0.696 | 0.696 | 0.696 |
F1 | 0.199 | 0.199 | 0.139 | 0.199 | 0.199 | 0.199 |
F2 | 0.355 | 0.355 | 0.285 | 0.361 | 0.361 | 0.361 |
F4 | 0.562 | 0.562 | 0.563 | 0.583 | 0.583 | 0.583 |
True 1 | 55 | 55 | 70 | 58 | 58 | 58 |
True 0 | 1034 | 1034 | 597 | 1008 | 1008 | 1008 |
False 1 | 424 | 424 | 861 | 450 | 450 | 450 |
False 0 | 19 | 19 | 4 | 16 | 16 | 16 |

Confusion matrix - Best models for PCA-based classification
KPCA
Kernel PCA (KPCA) extends the traditional PCA method by using kernel functions to handle non-linear relationships in the data. While PCA is limited to linear transformations, KPCA projects the data into a higher-dimensional space using a kernel function, allowing it to capture complex, non-linear structures. The kernel function implicitly computes the principal components in this higher-dimensional space without the need to explicitly perform the transformation, making it computationally efficient. KPCA is especially useful when the data has intricate non-linear patterns that cannot be captured by linear methods like PCA. It provides a more flexible approach to dimensionality reduction, enabling better feature extraction and improved performance of machine learning algorithms on complex datasets.
As with PCA, we first scale the data (the scaled feature summary is identical to the one shown above for PCA) and then run KPCA to reduce to two dimensions; a sketch of this step follows.
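A sketch of the projection, reusing `X_scaled` and `y`; the RBF kernel used here is an assumption, as the kernel choice is not encoded in the result tables:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import KernelPCA

# Project the scaled features onto 2 kernel principal components.
kpca_2d = KernelPCA(n_components=2, kernel="rbf")   # kernel choice is illustrative
Z = kpca_2d.fit_transform(X_scaled)

sns.scatterplot(x=Z[:, 0], y=Z[:, 1], hue=y, alpha=0.5)
plt.xlabel("KPC1")
plt.ylabel("KPC2")
plt.title("2D Kernel PCA projection coloured by stroke")
plt.show()
```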

Scatterplot of 2D KPCA principal components and target variable stroke as hue
At first glance the plot appears to show good separation with respect to the target variable, but when we zoom in on the range where the stroke cases lie and take a closer look, we can see that the classes are hardly separated.


Zoomed in Scatterplots of 2D KPCA principal components and target variable stroke as hue
When we rerun KPCA without specifying the number of components, we can inspect the magnitudes of the resulting eigenvalues to see how much variance each principal component captures.

Eigenvalues of KPCA Principal Components
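A minimal sketch of how these eigenvalues can be inspected; note that recent scikit-learn versions expose them as `eigenvalues_` (older releases used `lambdas_`):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import KernelPCA

# Fit KPCA without fixing n_components to keep all components with non-zero eigenvalues.
kpca_full = KernelPCA(kernel="rbf")   # kernel choice is illustrative
kpca_full.fit(X_scaled)

eigvals = kpca_full.eigenvalues_      # lambdas_ in scikit-learn < 1.0
plt.plot(range(1, len(eigvals) + 1), eigvals, marker=".")
plt.xlabel("Principal component")
plt.ylabel("Eigenvalue")
plt.title("Eigenvalues of KPCA principal components")
plt.show()
```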
We can evaluate the Kernel PCA results by applying our different classification models, using the principal components as features and the stroke variable as the target. We evaluate runs with 20, 40, and 60 principal components; a sketch of this loop is shown below, with the results in the tables that follow.
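A sketch of the evaluation loop with ADASYN and logistic regression as one representative resampling/model combination; the full grid behind the tables below also covers TomekLinks, AdaBoost, decision trees, and Naive Bayes:

```python
from imblearn.over_sampling import ADASYN
from imblearn.pipeline import Pipeline
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, stratify=y, random_state=42
)

for n_kpc in (20, 40, 60):
    pipe = Pipeline([
        ("kpca", KernelPCA(n_components=n_kpc, kernel="rbf")),  # kernel is an assumption
        ("resample", ADASYN(random_state=42)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print(f"KPC={n_kpc}: F2={fbeta_score(y_test, y_pred, beta=2):.3f}, "
          f"F4={fbeta_score(y_test, y_pred, beta=4):.3f}")
```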
KPC | Resampling | Scoring | Model | Precision | Recall | Accuracy | F1 | F2 | F4 | |
---|---|---|---|---|---|---|---|---|---|---|
3 | 20 | Adasyn | f1 | AdaBoost | 0.1231 | 0.7703 | 0.7239 | 0.2123 | 0.3755 | 0.5883 |
35 | 40 | Adasyn | f1 | AdaBoost | 0.1223 | 0.7568 | 0.7258 | 0.2105 | 0.3714 | 0.5798 |
32 | 40 | Adasyn | f1 | Logistic Regression | 0.1225 | 0.7432 | 0.7304 | 0.2103 | 0.3691 | 0.5726 |
36 | 40 | Adasyn | recall | Logistic Regression | 0.1225 | 0.7432 | 0.7304 | 0.2103 | 0.3691 | 0.5726 |
40 | 40 | Adasyn | f2 | Logistic Regression | 0.1225 | 0.7432 | 0.7304 | 0.2103 | 0.3691 | 0.5726 |
44 | 40 | Adasyn | f4 | Logistic Regression | 0.1225 | 0.7432 | 0.7304 | 0.2103 | 0.3691 | 0.5726 |
0 | 20 | Adasyn | f1 | Logistic Regression | 0.1204 | 0.7432 | 0.7252 | 0.2072 | 0.3652 | 0.5698 |
4 | 20 | Adasyn | recall | Logistic Regression | 0.1198 | 0.7432 | 0.7239 | 0.2064 | 0.3642 | 0.5691 |
8 | 20 | Adasyn | f2 | Logistic Regression | 0.1198 | 0.7432 | 0.7239 | 0.2064 | 0.3642 | 0.5691 |
12 | 20 | Adasyn | f4 | Logistic Regression | 0.1198 | 0.7432 | 0.7239 | 0.2064 | 0.3642 | 0.5691 |
68 | 60 | Adasyn | recall | Logistic Regression | 0.1189 | 0.7297 | 0.7258 | 0.2045 | 0.3600 | 0.5604 |
64 | 60 | Adasyn | f1 | Logistic Regression | 0.1189 | 0.7297 | 0.7258 | 0.2045 | 0.3600 | 0.5604 |
72 | 60 | Adasyn | f2 | Logistic Regression | 0.1189 | 0.7297 | 0.7258 | 0.2045 | 0.3600 | 0.5604 |
76 | 60 | Adasyn | f4 | Logistic Regression | 0.1189 | 0.7297 | 0.7258 | 0.2045 | 0.3600 | 0.5604 |
60 | 40 | TomekLinks | f4 | Logistic Regression | 0.1182 | 0.7297 | 0.7239 | 0.2034 | 0.3586 | 0.5594 |
52 | 40 | TomekLinks | recall | Logistic Regression | 0.1182 | 0.7297 | 0.7239 | 0.2034 | 0.3586 | 0.5594 |
56 | 40 | TomekLinks | f2 | Logistic Regression | 0.1182 | 0.7297 | 0.7239 | 0.2034 | 0.3586 | 0.5594 |
48 | 40 | TomekLinks | f1 | Logistic Regression | 0.1182 | 0.7297 | 0.7239 | 0.2034 | 0.3586 | 0.5594 |
16 | 20 | TomekLinks | f1 | Logistic Regression | 0.1048 | 0.8243 | 0.6514 | 0.1860 | 0.3474 | 0.5872 |
43 | 40 | Adasyn | f2 | AdaBoost | 0.0870 | 0.8514 | 0.5614 | 0.1579 | 0.3088 | 0.5613 |
181 | 60 | TomekLinks | recall | Decision Tree | 0.0832 | 0.7432 | 0.5920 | 0.1497 | 0.2874 | 0.5068 |
131 | 40 | Adasyn | f1 | AdaBoost | 0.0804 | 0.7703 | 0.5633 | 0.1456 | 0.2836 | 0.5119 |
103 | 20 | Adasyn | recall | AdaBoost | 0.0791 | 0.7703 | 0.5555 | 0.1434 | 0.2802 | 0.5087 |
111 | 20 | Adasyn | f4 | AdaBoost | 0.0791 | 0.7703 | 0.5555 | 0.1434 | 0.2802 | 0.5087 |
107 | 20 | Adasyn | f2 | AdaBoost | 0.0782 | 0.7703 | 0.5503 | 0.1420 | 0.2780 | 0.5065 |
163 | 60 | Adasyn | f1 | AdaBoost | 0.0794 | 0.7297 | 0.5783 | 0.1432 | 0.2766 | 0.4925 |
99 | 20 | Adasyn | f1 | AdaBoost | 0.0779 | 0.7568 | 0.5555 | 0.1412 | 0.2759 | 0.5003 |
KPC | Resampling | Scoring | Model | F2 | True 1 | True 0 | False 1 | False 0 | |
---|---|---|---|---|---|---|---|---|---|
3 | 20 | Adasyn | f1 | AdaBoost | 0.3755 | 57 | 1052 | 406 | 17 |
35 | 40 | Adasyn | f1 | AdaBoost | 0.3714 | 56 | 1056 | 402 | 18 |
32 | 40 | Adasyn | f1 | Logistic Regression | 0.3691 | 55 | 1064 | 394 | 19 |
36 | 40 | Adasyn | recall | Logistic Regression | 0.3691 | 55 | 1064 | 394 | 19 |
40 | 40 | Adasyn | f2 | Logistic Regression | 0.3691 | 55 | 1064 | 394 | 19 |
44 | 40 | Adasyn | f4 | Logistic Regression | 0.3691 | 55 | 1064 | 394 | 19 |
0 | 20 | Adasyn | f1 | Logistic Regression | 0.3652 | 55 | 1056 | 402 | 19 |
4 | 20 | Adasyn | recall | Logistic Regression | 0.3642 | 55 | 1054 | 404 | 19 |
8 | 20 | Adasyn | f2 | Logistic Regression | 0.3642 | 55 | 1054 | 404 | 19 |
12 | 20 | Adasyn | f4 | Logistic Regression | 0.3642 | 55 | 1054 | 404 | 19 |
68 | 60 | Adasyn | recall | Logistic Regression | 0.3600 | 54 | 1058 | 400 | 20 |
64 | 60 | Adasyn | f1 | Logistic Regression | 0.3600 | 54 | 1058 | 400 | 20 |
72 | 60 | Adasyn | f2 | Logistic Regression | 0.3600 | 54 | 1058 | 400 | 20 |
76 | 60 | Adasyn | f4 | Logistic Regression | 0.3600 | 54 | 1058 | 400 | 20 |
60 | 40 | TomekLinks | f4 | Logistic Regression | 0.3586 | 54 | 1055 | 403 | 20 |
52 | 40 | TomekLinks | recall | Logistic Regression | 0.3586 | 54 | 1055 | 403 | 20 |
56 | 40 | TomekLinks | f2 | Logistic Regression | 0.3586 | 54 | 1055 | 403 | 20 |
48 | 40 | TomekLinks | f1 | Logistic Regression | 0.3586 | 54 | 1055 | 403 | 20 |
16 | 20 | TomekLinks | f1 | Logistic Regression | 0.3474 | 61 | 937 | 521 | 13 |
43 | 40 | Adasyn | f2 | AdaBoost | 0.3088 | 63 | 797 | 661 | 11 |
181 | 60 | TomekLinks | recall | Decision Tree | 0.2874 | 55 | 852 | 606 | 19 |
131 | 40 | Adasyn | f1 | AdaBoost | 0.2836 | 57 | 806 | 652 | 17 |
103 | 20 | Adasyn | recall | AdaBoost | 0.2802 | 57 | 794 | 664 | 17 |
111 | 20 | Adasyn | f4 | AdaBoost | 0.2802 | 57 | 794 | 664 | 17 |
107 | 20 | Adasyn | f2 | AdaBoost | 0.2780 | 57 | 786 | 672 | 17 |
163 | 60 | Adasyn | f1 | AdaBoost | 0.2766 | 54 | 832 | 626 | 20 |
99 | 20 | Adasyn | f1 | AdaBoost | 0.2759 | 56 | 795 | 663 | 18 |
The best models depending on the scoring metric are shown below.
Best Accuracy | Best Precision | Best Recall | Best F1 | Best F2 | Best F4 | |
---|---|---|---|---|---|---|
Model ID | 32 | 3 | 43 | 3 | 3 | 3 |
KPC | 40 | 20 | 40 | 20 | 20 | 20 |
Resampling | Adasyn | Adasyn | Adasyn | Adasyn | Adasyn | Adasyn |
Scoring | f1 | f1 | f2 | f1 | f1 | f1 |
Model | Logistic Reg | AdaBoost | AdaBoost | AdaBoost | AdaBoost | AdaBoost |
Precision | 0.1225 | 0.1231 | 0.087 | 0.1231 | 0.1231 | 0.1231 |
Recall | 0.7432 | 0.7703 | 0.8514 | 0.7703 | 0.7703 | 0.7703 |
Accuracy | 0.7304 | 0.7239 | 0.5614 | 0.7239 | 0.7239 | 0.7239 |
F1 | 0.2103 | 0.2123 | 0.1579 | 0.2123 | 0.2123 | 0.2123 |
F2 | 0.3691 | 0.3755 | 0.3088 | 0.3755 | 0.3755 | 0.3755 |
F4 | 0.5726 | 0.5883 | 0.5613 | 0.5883 | 0.5883 | 0.5883 |
True 1 | 55 | 57 | 63 | 57 | 57 | 57 |
True 0 | 1064 | 1052 | 797 | 1052 | 1052 | 1052 |
False 1 | 394 | 406 | 661 | 406 | 406 | 406 |
False 0 | 19 | 17 | 11 | 17 | 17 | 17 |

Confusion matrix - Best model for KPCA-based classification
7. Recommended Model
After training and evaluating the models, it is hard to identify a single best model, as the choice depends strongly on how one weighs the misclassification of a stroke patient. We did not gain large improvements with the unsupervised learning approaches, but we did identify a slight improvement in the F4 score for cluster-based classification. Dimensionality reduction did not improve results compared to the base models. Based on our performance metric we can propose two models as promising. The best F2 value was achieved with the base-classification-only AdaBoost model number 19 with under-sampling. The best F4 value was realized by applying AdaBoost model number 63 with under-sampling and enriching the classification with the cluster labels from the DBSCAN clustering. Both models could be a good starting point for further evaluation. Below you can find the confusion matrices for both models.
Best F2 Score
Model 19 - AdaBoost | |
---|---|
Resampling | TomekLinks |
Scoring | f1 |
Model | AdaBoost |
Precision | 0.167192 |
Recall | 0.716216 |
Accuracy | 0.813969 |
F1 | 0.2711 |
F2 | 0.4323 |
F4 | 0.600266 |
True 1 | 53 |
True 0 | 1194 |
False 1 | 264 |
False 0 | 21 |

Confusion matrix Best F2 Score - AdaBoost Base model 19
Best F4 Score
Model 63 - AdaBoost DBSCAN | |
---|---|
Clustering | DBSCANAF |
Resampling | TomekLinks |
Scoring | f4 |
Model | AdaBoost |
Precision | 0.10559 |
Recall | 0.918919 |
Accuracy | 0.619856 |
F1 | 0.189415 |
F2 | 0.361702 |
F4 | 0.632385 |
True 1 | 68 |
True 0 | 881 |
False 1 | 576 |
False 0 | 6 |

Confusion matrix Best F4 Score - AdaBoost DBSCAN Cluster model 63
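As a quick cross-check, the headline metrics of both recommended models can be reproduced directly from their confusion-matrix counts with the standard F-beta formula; a minimal sketch:

```python
# Reproduce precision, recall and F-beta from the confusion-matrix counts reported above.
def fbeta_from_counts(tp, fp, fn, beta):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Model 19 (AdaBoost base model): TP=53, FP=264, FN=21
print(round(fbeta_from_counts(53, 264, 21, beta=2), 4))   # ~0.4323 (best F2)
# Model 63 (AdaBoost + DBSCAN clusters): TP=68, FP=576, FN=6
print(round(fbeta_from_counts(68, 576, 6, beta=4), 4))    # ~0.6324 (best F4)
```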
8. Key Findings and Insights
The analysis of the stroke prediction model has revealed several critical factors that significantly influence the likelihood of stroke in patients. Understanding these drivers allows for better-targeted interventions and more effective prevention strategies. However, to further enhance the accuracy and reliability of the model, additional data and features are essential.
Main Drivers influencing Stroke Risk
- Age: Older patients have a higher risk of stroke. This finding underscores the importance of age-related health monitoring and interventions, as the likelihood of experiencing a stroke increases with age, necessitating enhanced medical vigilance for the elderly.
- Hypertension: Presence of hypertension increases stroke risk. Hypertension, or high blood pressure, is a well-established risk factor for stroke, emphasizing the need for strict blood pressure control through medication, lifestyle changes, and regular monitoring.
- Heart Disease: Patients with heart disease are more likely to experience a stroke. The strong correlation between cardiovascular conditions and stroke highlights the necessity for comprehensive care plans that address both heart disease management and stroke prevention.
- Average Glucose Level: Higher average glucose levels are associated with diabetes and increase stroke risk. Elevated glucose levels indicate poor diabetes control, which can lead to vascular damage and increased stroke risk, highlighting the importance of maintaining optimal glucose levels through diet, exercise, and medication adherence.
Insights
- Preventive Measures: Targeted interventions for patients with hypertension and heart disease could reduce stroke incidence. Implementing comprehensive care plans that include lifestyle modifications, medication adherence, and regular health check-ups is crucial for mitigating stroke risk in these high-risk populations.
- Public Health Strategies: Programs aimed at managing blood glucose levels and promoting healthy aging could be beneficial. Public health initiatives should focus on widespread screening for diabetes and hypertension, coupled with campaigns that encourage physical activity, healthy eating, and smoking cessation to reduce stroke risk at a population level.
- Holistic Health Approach: Adopting a holistic approach that considers the interplay between various risk factors can enhance stroke prevention efforts. By addressing lifestyle factors such as diet, exercise, and stress management, healthcare providers can simultaneously mitigate risks associated with hypertension, heart disease, and diabetes, leading to better overall health outcomes.
- Technology and Monitoring: Leveraging technology, such as wearable devices and telemedicine, can aid in the continuous monitoring of at-risk individuals. These technologies provide real-time data on blood pressure, glucose levels, and heart rate, allowing for timely interventions and personalized care plans that can significantly reduce stroke risk.
- Education and Awareness: Raising awareness about the risk factors and preventive measures for stroke is crucial. Public health campaigns and educational programs should aim to inform individuals about the importance of regular health screenings, recognizing early symptoms of stroke, and seeking immediate medical attention, empowering them to take proactive steps towards stroke prevention.
Future Directions
To further improve the understanding and prediction of stroke risk, it is imperative to gather more comprehensive data and incorporate additional features into the model. Including a wider range of demographic, genetic, and lifestyle factors can provide a more nuanced view of stroke risk. Additionally, longitudinal data tracking patients over time could offer insights into how risk factors evolve and interact. By expanding the dataset and refining the features used, we can develop more accurate and robust models that enhance our ability to prevent and manage stroke.
9. Suggestions for Next Steps
- Feature Enhancement: Incorporate additional health-related features, such as cholesterol levels, physical activity, and diet, to improve the model's predictive performance. Including more comprehensive lifestyle and biometric data can help create a more accurate and holistic risk assessment for stroke.
- Longitudinal Data: Utilize longitudinal data to track changes in patient health over time, which can provide deeper insights into the progression and interaction of risk factors. Longitudinal studies allow for the observation of how individual risk profiles evolve, leading to more precise and personalized predictions.
- Assessment of Existing Studies and Scores: Evaluate and integrate findings from established studies and scoring systems, such as the Framingham Heart Study and the CHA₂DS₂-VASc score, which are widely used for predicting cardiovascular and stroke risk. Comparing our model's performance with these well-regarded benchmarks can provide validation and highlight areas for improvement. Additionally, exploring datasets from these studies can offer valuable insights and potential features to enhance our model.
- Collaborative Research: Engage in collaborative research with other institutions and researchers to leverage a broader range of expertise and datasets. By pooling resources and knowledge, we can develop more robust and generalizable models that are applicable across diverse populations.
- Validation Across Diverse Populations: Test and validate the model across different demographic and geographic populations to ensure its applicability and reliability. Understanding how the model performs in various contexts can help identify any biases or limitations, leading to more equitable and effective stroke risk prediction tools.
- Model Re-evaluation: Regularly update and re-evaluate the model as new data becomes available to ensure its continued relevance and accuracy. Incorporating the latest research findings and medical advancements will help maintain the model's effectiveness in predicting stroke risk.