
Stroke Prediction - Cluster Analysis and Dimensionality Reduction

This analysis aimed to enhance stroke data classification models by integrating unsupervised learning techniques, namely clustering and dimensionality reduction. By employing clustering, we segmented patients into groups with similar features to improve stroke risk classification, and applied dimensionality reduction to explore patterns in the data. Despite the dataset's imbalance, we observed that cluster-based classification could slightly improve results. Specifically, clustering approaches such as DBSCAN, combined with classification algorithms, showed potential for better identifying stroke risk while balancing accuracy and recall. The ultimate goal is to provide early identification of high-risk individuals, optimize resource allocation, and support targeted preventive measures.

1. Main Objective

The main objective of this analysis is to improve the base classification models derived in the Stroke Data Analysis by applying unsupervised learning methods, namely clustering and dimensionality reduction. With clustering we aim to segment the population into groups of patients with similar features to support the classification of stroke risk. Dimensionality reduction transforms the observations into a different space to see whether we can identify structure or patterns. The metric for selecting the best model is again to maximize recall while not misclassifying too many healthy patients as being at stroke risk: the true cost of misclassifying a patient at stroke risk as healthy outweighs the cost of misclassifying a healthy patient as at risk. The analysis aims at improving the basic classification models and providing further insights into stroke risk to allow:

  • Early Identification: Helping healthcare providers identify individuals at higher risk of stroke for timely intervention.
  • Resource Allocation: Assisting in the efficient allocation of medical resources to those most in need.
  • Preventive Measures: Providing insights for developing targeted preventive measures to reduce the incidence of strokes.

2. Dataset Description

The Stroke Prediction dataset is available at Kaggle (URL : https://www.kaggle.com/fedesoriano/stroke-prediction-dataset) and contains data on patients, including their medical and demographic attributes.

Attributes

Attribute Description
id Unique identifier for each patient.
gender Gender of the patient (Male, Female, Other).
age Age of the patient.
hypertension Whether the patient has hypertension (0: No, 1: Yes).
heart_disease Whether the patient has heart disease (0: No, 1: Yes).
ever_married Marital status of the patient (No, Yes).
work_type Type of occupation (children, Govt_job, Never_worked, Private, Self-employed).
residence_type Type of residence (Rural, Urban).
avg_glucose_level Average glucose level in the blood.
bmi Body mass index.
smoking_status Smoking status (formerly smoked, never smoked, smokes, Unknown).
stroke Target variable indicating whether the patient had a stroke (0: No, 1: Yes).
  • Number of Instances: 5,110
  • Number of Features: 11
  • Target Variable: stroke (0: No Stroke, 1: Stroke)

Analysis Objectives

The analysis aims to:

  • Explore and visualize the data to understand the distribution of attributes and identify any missing or anomalous values.

  • Engineer features and prepare data.

  • Train multiple clustering models on the new engineered dataset and evaluate the performance of classification based on the clusters.

  • Train multiple dimensionality reduction models and evaluate the performance of classification based on the transformed data.

  • Identify the best-performing model and Feature engineering approach.

  • Provide recommendations for next steps and further optimization.

3. Data Exploration and Cleaning

Data Exploration

Besides the id, the dataset includes ten features as listed above plus the target variable stroke. There are three numerical features: age, avg_glucose_level and bmi. The remaining seven features are all categorical. Out of the 5,110 observations in the dataset, 4,861 were observed with no stroke and 249 patients had a stroke. The dataset is therefore clearly imbalanced, which has to be addressed before model training. The variables of the dataset are summarized below.

count unique top freq mean std min 25% 50% 75% max
gender 5110 3 Female 2994 NaN NaN NaN NaN NaN NaN NaN
age 5110.0 NaN NaN NaN 43.23 22.61 0.08 25.0 45.0 61.0 82.0
hypertension 5110.0 NaN NaN NaN 0.1 0.3 0.0 0.0 0.0 0.0 1.0
heart_disease 5110.0 NaN NaN NaN 0.05 0.23 0.0 0.0 0.0 0.0 1.0
ever_married 5110 2 Yes 3353 NaN NaN NaN NaN NaN NaN NaN
work_type 5110 5 Private 2925 NaN NaN NaN NaN NaN NaN NaN
Residence_type 5110 2 Urban 2596 NaN NaN NaN NaN NaN NaN NaN
avg_glucose_level 5110.0 NaN NaN NaN 106.15 45.28 55.12 77.24 91.88 114.09 271.74
bmi 4909.0 NaN NaN NaN 28.89 7.85 10.3 23.5 28.1 33.1 97.6
smoking_status 5110 4 never smoked 1892 NaN NaN NaN NaN NaN NaN NaN
stroke 5110.0 NaN NaN NaN 0.05 0.22 0.0 0.0 0.0 0.0 1.0

To analyze the distribution and correlation of the data we prepared a set of four plots for each variable, depending on its type, as follows:

  • Numerical Variables: The overall distribution of the variable, the distribution of the variable for non stroke observations, the distribution of the variable for stroke observations and the density distribution separated by stroke cases.

  • Categorical Variables: The overall distribution of the variable, the distribution of the variable for non stroke observations, the distribution of the variable for stroke observations and the distribution of stroke cases within the groups of the categorical variable.

Numerical Variables

The graphs for the three numerical variables are shown below. There isn't a meaningful correlation visible for BMI or the Average Glucose Level. In contrast, Age shows a fairly strong dependency on the target variable.

Distribution of numerical variables

Stroke cases plotted versus Age are not equally distributed: reported stroke cases become more frequent with increasing age. Body Mass Index and Average Glucose Level do not show an obvious influence on stroke. This observation is confirmed in the pairplot below.

Correlation of numerical variables

Categorical Variables

The graphs for the seven categorical variables are shown below.

Distribution of categorical variables

If we analyze the group percentages and compare the distributions of the variables for stroke and non-stroke cases, we can identify Hypertension, Heart Disease and Marital Status as potential influences on stroke risk. We refer to Part 1 for further correlation analysis.

Data Cleaning and Feature Engineering

To prepare the data for the further analysis and the modeling phase we will perform the following steps:

  • Handling Missing Values and Outliers: Address missing values and remove implausible outliers as appropriate.
  • Encoding Categorical Variables: Convert categorical variables into numerical format using one-hot encoding.
  • Data Splitting: Split the data into training and testing sets.
  • Feature Scaling: Scale features to ensure they are on a similar scale.
  • Addressing Unbalanced Data: Balance the classes in the training set.

Handling Missing Values and Outliers

There are 201 records with a missing BMI value; 40 of these are stroke cases (stroke = 1). We will impute the missing values with the group mean BMI by gender, age and glucose level. Further, we have to address two stroke-case outliers as well as some unrealistic BMI outliers, as described below.
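A minimal sketch of this group-mean imputation is shown below. The column names follow the Kaggle dataset, while the specific age and glucose binning is an assumption made for illustration, not necessarily the grouping used in the analysis.

```python
import pandas as pd

# Load the raw Kaggle data (column names as in the dataset description above).
df = pd.read_csv("healthcare-dataset-stroke-data.csv")

# Illustrative bins for age and glucose level; the exact grouping used in the
# analysis is not specified, so these are assumptions.
df["age_bin"] = pd.cut(df["age"], bins=[0, 20, 40, 60, 80, 120])
df["glucose_bin"] = pd.qcut(df["avg_glucose_level"], q=4)

# Replace a missing BMI with the mean BMI of its (gender, age, glucose) group,
# falling back to the overall mean if a group has no observed BMI at all.
group_mean = df.groupby(["gender", "age_bin", "glucose_bin"], observed=True)["bmi"].transform("mean")
df["bmi"] = df["bmi"].fillna(group_mean).fillna(df["bmi"].mean())

print(df["bmi"].isna().sum())  # expected: 0
```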

There are two stroke cases at very young ages which can be regarded as outliers. These cases are likely caused by very rare circumstances and should not be part of a systematic data analysis. We drop the records with id 69768 and 49669.

162 245
id 69768 49669
gender Female Female
age 1.32 14.0
hypertension 0 0
heart_disease 0 0
ever_married No No
work_type children children
Residence_type Urban Rural
avg_glucose_level 70.37 57.93
bmi NaN 30.9
smoking_status Unknown Unknown
stroke 1 1

We will also drop some unrealistic BMI (Body Mass Index) records. A BMI over 70 is extremely high and generally not realistic for most individuals. BMI is calculated as weight in kilograms divided by height in meters squared. For context, a BMI between 18.5 and 24.9 is considered normal weight, a BMI between 25 and 29.9 is considered overweight, and a BMI of 30 or above is considered obese. A BMI over 70 would indicate severe obesity, which is very rare. For example, an individual with a BMI of 70 who is 1.75 meters tall (about 5 feet 9 inches) would weigh approximately 214 kg (473 lbs). Accordingly we drop the records with ids 545, 41097, 56420 and 51856.

544 928 2128 4209
id 545 41097 56420 51856
gender Male Female Male Male
age 42.0 23.0 17.0 38.0
hypertension 0 1 1 1
heart_disease 0 0 0 0
ever_married Yes No No Yes
work_type Private Private Private Private
Residence_type Rural Urban Rural Rural
avg_glucose_level 210.48 70.03 61.67 56.9
bmi 71.9 78.0 97.6 92.0
smoking_status never smoked smokes Unknown never smoked
stroke 0 0 0 0

4. Base model

In a prior analysis we evaluated different classification models with various resampling and cross validation approaches. During model training we evaluated four classification models: Logistic Regression, Decision Tree, Naive Bayes and AdaBoost. To optimize these models, we tuned hyperparameters using GridSearchCV with a 5-fold cross-validation. This approach ensures that our models are robust and generalize well to unseen data. In addition to hyperparameter tuning, we explored different resampling methods to address class imbalance, specifically using AdaSyn oversampling and TomekLinks undersampling. To comprehensively evaluate model performance, we varied the scoring metrics used in GridSearchCV, including F1 score, recall, and F-beta scores with beta values of 2 and 4. These varied scoring metrics will help us assess the models' ability to balance precision and recall, particularly emphasizing recall with the higher beta values in the F-beta score. This extensive evaluation process aims to identify the most effective model scoring and resampling strategy for predicting stroke risk.
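As an illustration of this tuning setup, the sketch below wires a resampling step and a classifier into an imbalanced-learn pipeline and searches it with GridSearchCV using F-beta scorers. The synthetic data and the parameter grid are placeholders, not the exact configuration used in the analysis.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import ADASYN
from imblearn.pipeline import Pipeline

# Synthetic imbalanced data standing in for the preprocessed stroke features.
X, y = make_classification(n_samples=2000, n_features=16, weights=[0.95, 0.05], random_state=42)

# Putting the resampler inside the pipeline ensures it is applied to the training folds only.
pipe = Pipeline([
    ("resample", ADASYN(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])

scorers = {
    "recall": "recall",
    "f1": "f1",
    "f2": make_scorer(fbeta_score, beta=2),
    "f4": make_scorer(fbeta_score, beta=4),
}

grid = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.1, 1.0, 10.0]},  # placeholder grid
    scoring=scorers,
    refit="f2",  # refit on the metric we want to optimize
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```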

To prepare the data for modeling we apply one-hot encoding to all categorical variables. Two categorical variables are already encoded as numbers: 'hypertension' and 'heart_disease'. After performing the one-hot encoding while omitting the default value (drop_first=True), the dataset is widened to 17 features including the target variable. In the next step we split the data in a ratio of 70/30 into training and test sets, maintaining the class distribution with stratify=y. Scaling is crucial for ensuring that algorithms which are sensitive to the scale of the input data perform optimally and produce reliable results. We apply a MinMax scaler to the stroke data to normalize the features so that they take values between 0 and 1.
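A short sketch of these preparation steps, assuming `df` is the cleaned stroke DataFrame from the previous section:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# df: cleaned stroke DataFrame from the cleaning step above (assumed to exist).
X = pd.get_dummies(df.drop(columns=["id", "stroke"]), drop_first=True)
y = df["stroke"]

# 70/30 split while preserving the class distribution of the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Scale all features to [0, 1]; fit the scaler on the training set only.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```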

Model Evaluation

The models were evaluated using the same training and test splits for all models to ensure a fair comparison. The following methods were used to evaluate the models:

Performance Indicators

  • Accuracy

  • Precision

  • Recall

  • F1 score as F1

  • FBeta score for beta=2 as F2

  • FBeta score for beta=4 as F4

Confusion Matrix

  • True positive (1) and False positive (1) counts
  • True negative (0) and False negative (0) counts

Below are the results for the best classification models found with a recall score larger than 0.7.

Resampling Scoring Model Precision Recall Accuracy F1 F2 F4
19 TomekLinks f1 AdaBoost 0.167 0.716 0.814 0.271 0.432 0.600
28 TomekLinks f4 Logistic Regression 0.140 0.784 0.757 0.237 0.408 0.617
20 TomekLinks recall Logistic Regression 0.140 0.784 0.757 0.237 0.408 0.617
24 TomekLinks f2 Logistic Regression 0.140 0.784 0.757 0.237 0.408 0.617
44 None f4 Logistic Regression 0.137 0.757 0.758 0.232 0.398 0.598
36 None recall Logistic Regression 0.137 0.757 0.758 0.232 0.398 0.598
45 None f4 Decision Tree 0.136 0.757 0.756 0.230 0.395 0.596
41 None f2 Decision Tree 0.136 0.757 0.756 0.230 0.395 0.596
37 None recall Decision Tree 0.136 0.757 0.756 0.230 0.395 0.596
33 None f1 Decision Tree 0.136 0.757 0.756 0.230 0.395 0.596
27 TomekLinks f2 AdaBoost 0.135 0.757 0.755 0.230 0.394 0.596
43 None f2 AdaBoost 0.134 0.743 0.757 0.228 0.390 0.587
21 TomekLinks recall Decision Tree 0.124 0.784 0.723 0.215 0.381 0.598
29 TomekLinks f4 Decision Tree 0.124 0.784 0.723 0.215 0.381 0.598
25 TomekLinks f2 Decision Tree 0.124 0.784 0.723 0.215 0.381 0.598
47 None f4 AdaBoost 0.104 0.905 0.618 0.186 0.356 0.623
31 TomekLinks f4 AdaBoost 0.104 0.905 0.617 0.186 0.355 0.622
16 TomekLinks f1 Logistic Regression 0.101 0.824 0.636 0.179 0.338 0.579
30 TomekLinks f4 Naive Bayes 0.086 0.959 0.505 0.158 0.316 0.600
46 None f4 Naive Bayes 0.086 0.959 0.503 0.157 0.316 0.600
11 Adasyn f2 AdaBoost 0.089 0.838 0.578 0.161 0.312 0.561
15 Adasyn f4 AdaBoost 0.072 1.000 0.381 0.135 0.281 0.570
2 Adasyn f1 Naive Bayes 0.079 0.770 0.554 0.143 0.279 0.508
10 Adasyn f2 Naive Bayes 0.070 0.959 0.379 0.130 0.270 0.548
14 Adasyn f4 Naive Bayes 0.065 1.000 0.309 0.123 0.259 0.543
32 None f1 Logistic Regression 0.064 1.000 0.290 0.120 0.254 0.536
40 None f2 Logistic Regression 0.064 1.000 0.290 0.120 0.254 0.536
38 None recall Naive Bayes 0.058 1.000 0.213 0.109 0.235 0.511
39 None recall AdaBoost 0.058 1.000 0.213 0.109 0.235 0.511
23 TomekLinks recall AdaBoost 0.058 1.000 0.213 0.109 0.235 0.511
22 TomekLinks recall Naive Bayes 0.058 1.000 0.213 0.109 0.235 0.511
7 Adasyn recall AdaBoost 0.057 1.000 0.204 0.108 0.233 0.508
6 Adasyn recall Naive Bayes 0.057 1.000 0.204 0.108 0.233 0.508
Resampling Scoring Model F2 True 1 True 0 False 1 False 0
19 TomekLinks f1 AdaBoost 0.432 53 1194 264 21
28 TomekLinks f4 Logistic Regression 0.408 58 1101 357 16
20 TomekLinks recall Logistic Regression 0.408 58 1101 357 16
24 TomekLinks f2 Logistic Regression 0.408 58 1101 357 16
44 None f4 Logistic Regression 0.398 56 1106 352 18
36 None recall Logistic Regression 0.398 56 1106 352 18
45 None f4 Decision Tree 0.395 56 1102 356 18
41 None f2 Decision Tree 0.395 56 1102 356 18
37 None recall Decision Tree 0.395 56 1102 356 18
33 None f1 Decision Tree 0.395 56 1102 356 18
27 TomekLinks f2 AdaBoost 0.394 56 1100 358 18
43 None f2 AdaBoost 0.390 55 1104 354 19
21 TomekLinks recall Decision Tree 0.381 58 1050 408 16
29 TomekLinks f4 Decision Tree 0.381 58 1050 408 16
25 TomekLinks f2 Decision Tree 0.381 58 1050 408 16
47 None f4 AdaBoost 0.356 67 880 578 7
31 TomekLinks f4 AdaBoost 0.355 67 878 580 7
16 TomekLinks f1 Logistic Regression 0.338 61 913 545 13
30 TomekLinks f4 Naive Bayes 0.316 71 702 756 3
46 None f4 Naive Bayes 0.316 71 700 758 3
11 Adasyn f2 AdaBoost 0.312 62 824 634 12
15 Adasyn f4 AdaBoost 0.281 74 510 948 0
2 Adasyn f1 Naive Bayes 0.279 57 791 667 17
10 Adasyn f2 Naive Bayes 0.270 71 510 948 3
14 Adasyn f4 Naive Bayes 0.259 74 399 1059 0
32 None f1 Logistic Regression 0.254 74 370 1088 0
40 None f2 Logistic Regression 0.254 74 370 1088 0
38 None recall Naive Bayes 0.235 74 253 1205 0
39 None recall AdaBoost 0.235 74 253 1205 0
23 TomekLinks recall AdaBoost 0.235 74 253 1205 0
22 TomekLinks recall Naive Bayes 0.235 74 253 1205 0
7 Adasyn recall AdaBoost 0.233 74 238 1220 0
6 Adasyn recall Naive Bayes 0.233 74 238 1220 0

The best models depending on the scoring metric are shown below.

Best Accuracy Best Precision Best Recall Best F1 Best F2 Best F4
Model ID 1 17 15 19 19 47
Resampling Adasyn TomekLinks Adasyn TomekLinks TomekLinks None
Scoring f1 f1 f4 f1 f1 f4
Model Decision Tree Decision Tree AdaBoost AdaBoost AdaBoost AdaBoost
Precision 0.1298 0.1673 0.0724 0.1672 0.1672 0.1039
Recall 0.2297 0.5676 1.0 0.7162 0.7162 0.9054
Accuracy 0.8884 0.8427 0.3812 0.814 0.814 0.6181
F1 0.1659 0.2585 0.135 0.2711 0.2711 0.1864
F2 0.1991 0.3839 0.2807 0.4323 0.4323 0.356
F4 0.2198 0.4976 0.5703 0.6003 0.6003 0.6227
True 1 17 42 74 53 53 67
True 0 1344 1249 510 1194 1194 880
False 1 114 209 948 264 264 578
False 0 57 32 0 21 21 7

Best base classification models Confusion matrix

5. K Means, DBSCAN and Agglomerative Clustering

Having understood the complexity of the data, we will try to segment it into clusters, which we plan to use in our approaches to improve the classification models. We will apply three different clustering algorithms: KMeans, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and Agglomerative Clustering. For all methods we include all features but not the target variable stroke. As a result we enrich our dataset with one column per approach that holds the cluster label of each record. In the case of DBSCAN, a label of -1 marks noise points that do not belong to any cluster.

KMeans

We will perform KMeans clustering over all features of the dataset and determine the best K as number of clusters. First we will apply a StandardScaler and recode all categorical features to numeric format.

count mean std min 25% 50% 75% max
gender 5104.0 0.0 1.0 -0.840 -0.840 -0.840 1.188 3.217
age 5104.0 -0.0 1.0 -1.910 -0.807 0.077 0.785 1.714
hypertension 5104.0 -0.0 1.0 -0.328 -0.328 -0.328 -0.328 3.051
heart_disease 5104.0 0.0 1.0 -0.239 -0.239 -0.239 -0.239 4.182
ever_married 5104.0 -0.0 1.0 -1.383 -1.383 0.723 0.723 0.723
work_type 5104.0 -0.0 1.0 -1.988 -0.153 -0.153 0.764 1.681
Residence_type 5104.0 0.0 1.0 -1.017 -1.017 0.984 0.984 0.984
smoking_status 5104.0 -0.0 1.0 -1.286 -1.286 0.581 0.581 1.515
glucose_group 5104.0 0.0 1.0 -0.412 -0.412 -0.412 -0.412 2.945
bmi_group 5104.0 -0.0 1.0 -2.129 -1.079 -0.030 1.020 1.020
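A sketch of the elbow search is shown below; `X_cluster` stands for the encoded and engineered feature matrix summarized in the table above and is an assumed variable name, as are the chosen K range and random state.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# X_cluster: encoded feature matrix without the target (assumed from the step above).
X_scaled = StandardScaler().fit_transform(X_cluster)

# Compute the inertia for a range of K to draw the elbow curve.
inertias = {}
for k in range(2, 16):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = km.inertia_

# Refit with the K chosen from the elbow curve and attach the labels as a new column.
df["KMeanAF"] = KMeans(n_clusters=10, n_init=10, random_state=42).fit_predict(X_scaled)
```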

Below you can see the resulting graph showing inertia values over number of clusters K.

Elbow curve - Inertia over number of clusters

From the elbow curve we identify K=10 as the most plausible elbow point, rerun KMeans with K=10, and add the resulting cluster labels to our dataset. The resulting cluster counts are shown below.

KMeanAF Total Stroke Not_Stroke
0 0 761 32 729
1 1 850 3 847
2 2 463 24 439
3 3 274 24 250
4 4 470 23 447
5 5 717 0 717
6 6 276 47 229
7 7 498 24 474
8 8 364 17 347
9 9 431 53 378

Cluster distribution

We can see that cluster 5 contains no stroke cases, while cluster 6 shows the highest percentage of stroke cases per segment. Below you can see the distributions of the continuous variables Age, Avg Glucose Level and BMI as violin plots differentiated by cluster, as well as the distributions of the categorical variables Hypertension and Heart Disease.

KMean cluster - Distribution of Age, Glucose and BMI Groups

KMean cluster - Distribution of Hypertension, Heart disease and Stroke

DBSCAN

Next we will perform DBSCAN clustering over all features of the dataset. The number of clusters is determined by DBSCAN itself. As before, we will apply a StandardScaler and recode all categorical features to numeric format.

count mean std min 25% 50% 75% max
gender 5104.0 0.0 1.0 -0.840 -0.840 -0.840 1.188 3.217
age 5104.0 -0.0 1.0 -1.910 -0.807 0.077 0.785 1.714
hypertension 5104.0 -0.0 1.0 -0.328 -0.328 -0.328 -0.328 3.051
heart_disease 5104.0 0.0 1.0 -0.239 -0.239 -0.239 -0.239 4.182
ever_married 5104.0 -0.0 1.0 -1.383 -1.383 0.723 0.723 0.723
work_type 5104.0 -0.0 1.0 -1.988 -0.153 -0.153 0.764 1.681
Residence_type 5104.0 0.0 1.0 -1.017 -1.017 0.984 0.984 0.984
smoking_status 5104.0 -0.0 1.0 -1.286 -1.286 0.581 0.581 1.515
glucose_group 5104.0 0.0 1.0 -0.412 -0.412 -0.412 -0.412 2.945
bmi_group 5104.0 -0.0 1.0 -2.129 -1.079 -0.030 1.020 1.020
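The corresponding DBSCAN step could look like the sketch below; `eps` and `min_samples` are placeholder values, since the parameters actually used in the analysis are not stated.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# X_cluster: encoded feature matrix without the target (assumed from the step above).
X_scaled = StandardScaler().fit_transform(X_cluster)

# eps / min_samples are illustrative; DBSCAN derives the number of clusters itself.
labels = DBSCAN(eps=1.5, min_samples=10).fit_predict(X_scaled)
df["DBSCANAF"] = labels  # -1 marks noise points that belong to no cluster

print(pd.Series(labels).value_counts())
```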

Below you can see the result. DBSCAN found 4 clusters and identified 3 records as noise.

DBSCANAF Total Stroke Not_Stroke
0 -1 3 0 3
1 0 211 34 177
2 1 4397 147 4250
3 2 431 53 378
4 3 62 13 49

Cluster distribution

We can see that cluster 1 holds the majority of all observations and also has the smallest stroke percentage. Three observations were classified as noise, all of them non-stroke patients. Below you can see the distributions of the continuous variables Age, Avg Glucose Level and BMI as violin plots differentiated by cluster, as well as the distributions of the categorical variables Hypertension and Heart Disease.

DBSCAN cluster-distribution for Age, glucose level and bmi

DBSCAN cluster-distribution for Hypertension, Heart disease and stroke

Agglomerative Clustering

Next we will perform Agglomerative Clustering over all features of the dataset and determine the best n as number of clusters. As before, we will apply a StandardScaler and recode all categorical features to numeric format.

count mean std min 25% 50% 75% max
gender 5104.0 0.0 1.0 -0.840 -0.840 -0.840 1.188 3.217
age 5104.0 -0.0 1.0 -1.910 -0.807 0.077 0.785 1.714
hypertension 5104.0 -0.0 1.0 -0.328 -0.328 -0.328 -0.328 3.051
heart_disease 5104.0 0.0 1.0 -0.239 -0.239 -0.239 -0.239 4.182
ever_married 5104.0 -0.0 1.0 -1.383 -1.383 0.723 0.723 0.723
work_type 5104.0 -0.0 1.0 -1.988 -0.153 -0.153 0.764 1.681
Residence_type 5104.0 0.0 1.0 -1.017 -1.017 0.984 0.984 0.984
smoking_status 5104.0 -0.0 1.0 -1.286 -1.286 0.581 0.581 1.515
glucose_group 5104.0 0.0 1.0 -0.412 -0.412 -0.412 -0.412 2.945
bmi_group 5104.0 -0.0 1.0 -2.129 -1.079 -0.030 1.020 1.020
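A sketch of the silhouette-based search over linkage types and cluster counts, under the same assumptions as before (`X_cluster` and `df` are assumed variable names):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# X_cluster: encoded feature matrix without the target (assumed from the step above).
X_scaled = StandardScaler().fit_transform(X_cluster)

# Evaluate silhouette scores for several linkage parameters and cluster counts.
scores = {}
for linkage in ["ward", "complete", "average"]:
    for n in range(2, 15):
        labels = AgglomerativeClustering(n_clusters=n, linkage=linkage).fit_predict(X_scaled)
        scores[(linkage, n)] = silhouette_score(X_scaled, labels)

# Refit with the chosen configuration and store the cluster labels.
df["AggloAF"] = AgglomerativeClustering(n_clusters=9, linkage="ward").fit_predict(X_scaled)
```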

Below you can see the resulting graph showing silhouette scores over the number of clusters n for different linkage parameters.

Silhouette Scores for different linkage parameters

We select n=9 with ward linkage as the best candidate, rerun Agglomerative Clustering with these parameters and add the resulting cluster labels to our dataset. Below you can see the resulting distribution and cluster counts.

AggloAF Total Stroke Not_Stroke
0 0 966 13 953
1 1 645 0 645
2 2 696 35 661
3 3 431 53 378
4 4 587 21 566
5 5 505 42 463
6 6 276 47 229
7 7 377 11 366
8 8 621 25 596

Cluster distribution

We can see that cluster 1 contains no stroke cases, while cluster 6 shows the highest percentage of stroke cases per segment. Below you can see the distributions of the continuous variables Age, Avg Glucose Level and BMI as violin plots differentiated by cluster, as well as the distributions of the categorical variables Hypertension and Heart Disease.

Agglomerative Clustering cluster-distribution for Age, glucose level and bmi

Agglomerative Clustering cluster-distribution for Hypertension, Heart disease and stroke

Cluster based classification

We can now analyze the quality of the clusters, turn the cluster labels into additional features, and use these enriched feature sets for classification; a sketch of the enrichment is shown below. We apply the same classification algorithms and resampling methods as for the base models. The results table further below lists all models with a recall greater than 0.7.
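The feature enrichment itself can be sketched as follows, assuming `df` carries the cluster label columns from the previous section and `X` is the one-hot encoded feature matrix of the base models:

```python
import pandas as pd

# One-hot encode the cluster labels and append them to the existing features
# (shown here for the DBSCAN labels; the same applies to KMeanAF and AggloAF).
cluster_dummies = pd.get_dummies(df["DBSCANAF"], prefix="dbscan_cluster")
X_enriched = pd.concat([X, cluster_dummies], axis=1)

# X_enriched then replaces X in the same split / resampling / GridSearchCV
# procedure that was used for the base models.
```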

Clustering Resampling Scoring Model Precision Recall Accuracy F1 F2 F4
83 AggloAF TomekLinks f1 AdaBoost 0.167 0.703 0.816 0.269 0.428 0.591
51 DBSCANAF TomekLinks f1 AdaBoost 0.154 0.703 0.799 0.253 0.411 0.581
59 DBSCANAF TomekLinks f2 AdaBoost 0.133 0.743 0.752 0.225 0.387 0.585
60 DBSCANAF TomekLinks f4 Logistic Regression 0.130 0.716 0.754 0.220 0.376 0.566
56 DBSCANAF TomekLinks f2 Logistic Regression 0.130 0.716 0.754 0.220 0.376 0.566
95 AggloAF TomekLinks f4 AdaBoost 0.118 0.824 0.694 0.206 0.375 0.610
91 AggloAF TomekLinks f2 AdaBoost 0.118 0.824 0.694 0.206 0.375 0.610
27 KMeanAF TomekLinks f2 AdaBoost 0.116 0.824 0.687 0.203 0.371 0.606
19 KMeanAF TomekLinks f1 AdaBoost 0.116 0.824 0.687 0.203 0.371 0.606
31 KMeanAF TomekLinks f4 AdaBoost 0.116 0.824 0.687 0.203 0.371 0.606
48 DBSCANAF TomekLinks f1 Logistic Regression 0.121 0.757 0.722 0.208 0.368 0.578
20 KMeanAF TomekLinks recall Logistic Regression 0.109 0.892 0.644 0.195 0.367 0.628
28 KMeanAF TomekLinks f4 Logistic Regression 0.109 0.892 0.644 0.195 0.367 0.628
24 KMeanAF TomekLinks f2 Logistic Regression 0.109 0.892 0.644 0.195 0.367 0.628
16 KMeanAF TomekLinks f1 Logistic Regression 0.109 0.892 0.644 0.195 0.367 0.628
63 DBSCANAF TomekLinks f4 AdaBoost 0.106 0.919 0.620 0.189 0.362 0.632
21 KMeanAF TomekLinks recall Decision Tree 0.113 0.784 0.692 0.197 0.358 0.581
66 AggloAF Adasyn f1 Naive Bayes 0.112 0.784 0.689 0.196 0.356 0.579
89 AggloAF TomekLinks f2 Decision Tree 0.117 0.716 0.726 0.202 0.354 0.551
75 AggloAF Adasyn f2 AdaBoost 0.102 0.919 0.605 0.184 0.353 0.625
26 KMeanAF TomekLinks f2 Naive Bayes 0.106 0.770 0.676 0.187 0.342 0.563
18 KMeanAF TomekLinks f1 Naive Bayes 0.106 0.770 0.676 0.187 0.342 0.563
15 KMeanAF Adasyn f4 AdaBoost 0.093 0.932 0.560 0.170 0.334 0.610
11 KMeanAF Adasyn f2 AdaBoost 0.093 0.932 0.560 0.170 0.334 0.610
43 DBSCANAF Adasyn f2 AdaBoost 0.095 0.892 0.584 0.172 0.333 0.597
94 AggloAF TomekLinks f4 Naive Bayes 0.094 0.905 0.573 0.170 0.332 0.600
80 AggloAF TomekLinks f1 Logistic Regression 0.091 0.932 0.547 0.166 0.327 0.604
62 DBSCANAF TomekLinks f4 Naive Bayes 0.093 0.838 0.596 0.167 0.322 0.569
30 KMeanAF TomekLinks f4 Naive Bayes 0.084 0.946 0.496 0.154 0.309 0.589
79 AggloAF Adasyn f4 AdaBoost 0.079 0.986 0.445 0.146 0.299 0.589
34 DBSCANAF Adasyn f1 Naive Bayes 0.085 0.784 0.580 0.153 0.296 0.528
47 DBSCANAF Adasyn f4 AdaBoost 0.074 1.000 0.398 0.138 0.287 0.577
14 KMeanAF Adasyn f4 Naive Bayes 0.075 0.959 0.428 0.139 0.286 0.567
6 KMeanAF Adasyn recall Naive Bayes 0.075 0.959 0.428 0.139 0.286 0.567
42 DBSCANAF Adasyn f2 Naive Bayes 0.076 0.865 0.483 0.139 0.280 0.536
10 KMeanAF Adasyn f2 Naive Bayes 0.078 0.811 0.525 0.142 0.280 0.521
22 KMeanAF TomekLinks recall Naive Bayes 0.070 1.000 0.362 0.131 0.274 0.563
46 DBSCANAF Adasyn f4 Naive Bayes 0.070 0.932 0.401 0.131 0.270 0.542
7 KMeanAF Adasyn recall AdaBoost 0.068 1.000 0.336 0.127 0.267 0.553
78 AggloAF Adasyn f4 Naive Bayes 0.064 0.973 0.315 0.121 0.254 0.531
74 AggloAF Adasyn f2 Naive Bayes 0.065 0.905 0.366 0.121 0.252 0.514
70 AggloAF Adasyn recall Naive Bayes 0.061 0.973 0.279 0.115 0.245 0.519
55 DBSCANAF TomekLinks recall AdaBoost 0.059 1.000 0.233 0.112 0.239 0.517
Clustering Resampling Scoring Model F4 True 1 True 0 False 1 False 0
83 AggloAF TomekLinks f1 AdaBoost 0.591 52 1198 260 22
51 DBSCANAF TomekLinks f1 AdaBoost 0.581 52 1172 285 22
59 DBSCANAF TomekLinks f2 AdaBoost 0.585 55 1097 360 19
60 DBSCANAF TomekLinks f4 Logistic Regression 0.566 53 1102 355 21
56 DBSCANAF TomekLinks f2 Logistic Regression 0.566 53 1102 355 21
95 AggloAF TomekLinks f4 AdaBoost 0.610 61 1002 456 13
91 AggloAF TomekLinks f2 AdaBoost 0.610 61 1002 456 13
27 KMeanAF TomekLinks f2 AdaBoost 0.606 61 992 466 13
19 KMeanAF TomekLinks f1 AdaBoost 0.606 61 992 466 13
31 KMeanAF TomekLinks f4 AdaBoost 0.606 61 992 466 13
48 DBSCANAF TomekLinks f1 Logistic Regression 0.578 56 1049 408 18
20 KMeanAF TomekLinks recall Logistic Regression 0.628 66 921 537 8
28 KMeanAF TomekLinks f4 Logistic Regression 0.628 66 921 537 8
24 KMeanAF TomekLinks f2 Logistic Regression 0.628 66 921 537 8
16 KMeanAF TomekLinks f1 Logistic Regression 0.628 66 921 537 8
63 DBSCANAF TomekLinks f4 AdaBoost 0.632 68 881 576 6
21 KMeanAF TomekLinks recall Decision Tree 0.581 58 1002 456 16
66 AggloAF Adasyn f1 Naive Bayes 0.579 58 997 461 16
89 AggloAF TomekLinks f2 Decision Tree 0.551 53 1059 399 21
75 AggloAF Adasyn f2 AdaBoost 0.625 68 859 599 6
26 KMeanAF TomekLinks f2 Naive Bayes 0.563 57 978 480 17
18 KMeanAF TomekLinks f1 Naive Bayes 0.563 57 978 480 17
15 KMeanAF Adasyn f4 AdaBoost 0.610 69 789 669 5
11 KMeanAF Adasyn f2 AdaBoost 0.610 69 789 669 5
43 DBSCANAF Adasyn f2 AdaBoost 0.597 66 828 629 8
94 AggloAF TomekLinks f4 Naive Bayes 0.600 67 811 647 7
80 AggloAF TomekLinks f1 Logistic Regression 0.604 69 769 689 5
62 DBSCANAF TomekLinks f4 Naive Bayes 0.569 62 851 606 12
30 KMeanAF TomekLinks f4 Naive Bayes 0.589 70 690 768 4
79 AggloAF Adasyn f4 AdaBoost 0.589 73 608 850 1
34 DBSCANAF Adasyn f1 Naive Bayes 0.528 58 830 627 16
47 DBSCANAF Adasyn f4 AdaBoost 0.577 74 536 921 0
14 KMeanAF Adasyn f4 Naive Bayes 0.567 71 585 873 3
6 KMeanAF Adasyn recall Naive Bayes 0.567 71 585 873 3
42 DBSCANAF Adasyn f2 Naive Bayes 0.536 64 676 781 10
10 KMeanAF Adasyn f2 Naive Bayes 0.521 60 744 714 14
22 KMeanAF TomekLinks recall Naive Bayes 0.563 74 480 978 0
46 DBSCANAF Adasyn f4 Naive Bayes 0.542 69 545 912 5
7 KMeanAF Adasyn recall AdaBoost 0.553 74 440 1018 0
78 AggloAF Adasyn f4 Naive Bayes 0.531 72 410 1048 2
74 AggloAF Adasyn f2 Naive Bayes 0.514 67 493 965 7
70 AggloAF Adasyn recall Naive Bayes 0.519 72 355 1103 2
55 DBSCANAF TomekLinks recall AdaBoost 0.517 74 282 1175 0

The best models depending on the scoring metric are shown below.

Best Accuracy Best Precision Best Recall Best F1 Best F2 Best F4
Model ID 84 64 47 83 83 63
Clustering AggloAF AggloAF DBSCANAF AggloAF AggloAF DBSCANAF
Resampling TomekLinks Adasyn Adasyn TomekLinks TomekLinks TomekLinks
Scoring recall f1 f4 f1 f1 f4
Model Logistic Reg Logistic Reg AdaBoost AdaBoost AdaBoost AdaBoost
Precision 0.0 0.1901 0.0744 0.1667 0.1667 0.1056
Recall 0.0 0.3108 1.0 0.7027 0.7027 0.9189
Accuracy 0.9517 0.9027 0.3984 0.8159 0.8159 0.6199
F1 0.0 0.2359 0.1384 0.2694 0.2694 0.1894
F2 0.0 0.2758 0.2866 0.4276 0.4276 0.3617
F4 0.0 0.2996 0.5773 0.5909 0.5909 0.6324
True 1 0 23 74 52 52 68
True 0 1458 1360 536 1198 1198 881
False 1 0 98 921 260 260 576
False 0 74 51 0 22 22 6

Confusion matrix - Best models for Cluster based Classification

Cluster only classification

Next we will evaluate the clustering results using only the cluster labels as features and the stroke variable as the target.

Clustering Resampling Scoring Model Precision Recall Accuracy F1 F2 F4
1 KMeanAF Adasyn f1 Decision Tree 0.082 0.716 0.597 0.146 0.280 0.491
9 KMeanAF Adasyn f2 Decision Tree 0.082 0.716 0.597 0.146 0.280 0.491
13 KMeanAF Adasyn f4 Decision Tree 0.082 0.716 0.597 0.146 0.280 0.491
95 AggloAF TomekLinks f4 AdaBoost 0.074 0.919 0.441 0.137 0.280 0.550
5 KMeanAF Adasyn recall Decision Tree 0.082 0.716 0.597 0.146 0.280 0.491
11 KMeanAF Adasyn f2 AdaBoost 0.070 1.000 0.361 0.131 0.274 0.562
8 KMeanAF Adasyn f2 Logistic Regression 0.070 1.000 0.361 0.131 0.274 0.562
6 KMeanAF Adasyn recall Naive Bayes 0.070 1.000 0.361 0.131 0.274 0.562
3 KMeanAF Adasyn f1 AdaBoost 0.070 1.000 0.361 0.131 0.274 0.562
2 KMeanAF Adasyn f1 Naive Bayes 0.070 1.000 0.361 0.131 0.274 0.562
10 KMeanAF Adasyn f2 Naive Bayes 0.070 1.000 0.361 0.131 0.274 0.562
4 KMeanAF Adasyn recall Logistic Regression 0.070 1.000 0.361 0.131 0.274 0.562
23 KMeanAF TomekLinks recall AdaBoost 0.070 1.000 0.361 0.131 0.274 0.562
0 KMeanAF Adasyn f1 Logistic Regression 0.070 1.000 0.361 0.131 0.274 0.562
15 KMeanAF Adasyn f4 AdaBoost 0.070 1.000 0.361 0.131 0.274 0.562
22 KMeanAF TomekLinks recall Naive Bayes 0.070 1.000 0.361 0.131 0.274 0.562
30 KMeanAF TomekLinks f4 Naive Bayes 0.070 1.000 0.361 0.131 0.274 0.562
31 KMeanAF TomekLinks f4 AdaBoost 0.070 1.000 0.361 0.131 0.274 0.562
12 KMeanAF Adasyn f4 Logistic Regression 0.070 1.000 0.361 0.131 0.274 0.562
14 KMeanAF Adasyn f4 Naive Bayes 0.070 1.000 0.361 0.131 0.274 0.562
67 AggloAF Adasyn f1 AdaBoost 0.067 0.946 0.364 0.126 0.262 0.535
7 KMeanAF Adasyn recall AdaBoost 0.057 1.000 0.198 0.108 0.232 0.506
84 AggloAF TomekLinks recall Logistic Regression 0.056 1.000 0.187 0.106 0.229 0.502
94 AggloAF TomekLinks f4 Naive Bayes 0.056 1.000 0.187 0.106 0.229 0.502
92 AggloAF TomekLinks f4 Logistic Regression 0.056 1.000 0.187 0.106 0.229 0.502
87 AggloAF TomekLinks recall AdaBoost 0.056 1.000 0.187 0.106 0.229 0.502
86 AggloAF TomekLinks recall Naive Bayes 0.056 1.000 0.187 0.106 0.229 0.502
70 AggloAF Adasyn recall Naive Bayes 0.056 1.000 0.187 0.106 0.229 0.502
79 AggloAF Adasyn f4 AdaBoost 0.056 1.000 0.187 0.106 0.229 0.502
78 AggloAF Adasyn f4 Naive Bayes 0.056 1.000 0.187 0.106 0.229 0.502
75 AggloAF Adasyn f2 AdaBoost 0.056 1.000 0.187 0.106 0.229 0.502
74 AggloAF Adasyn f2 Naive Bayes 0.056 1.000 0.187 0.106 0.229 0.502
66 AggloAF Adasyn f1 Naive Bayes 0.056 1.000 0.187 0.106 0.229 0.502
71 AggloAF Adasyn recall AdaBoost 0.056 1.000 0.187 0.106 0.229 0.502
Clustering Resampling Scoring Model F4 True 1 True 0 False 1 False 0
1 KMeanAF Adasyn f1 Decision Tree 0.491 53 861 597 21
9 KMeanAF Adasyn f2 Decision Tree 0.491 53 861 597 21
13 KMeanAF Adasyn f4 Decision Tree 0.491 53 861 597 21
95 AggloAF TomekLinks f4 AdaBoost 0.550 68 608 850 6
5 KMeanAF Adasyn recall Decision Tree 0.491 53 861 597 21
11 KMeanAF Adasyn f2 AdaBoost 0.562 74 479 979 0
8 KMeanAF Adasyn f2 Logistic Regression 0.562 74 479 979 0
6 KMeanAF Adasyn recall Naive Bayes 0.562 74 479 979 0
3 KMeanAF Adasyn f1 AdaBoost 0.562 74 479 979 0
2 KMeanAF Adasyn f1 Naive Bayes 0.562 74 479 979 0
10 KMeanAF Adasyn f2 Naive Bayes 0.562 74 479 979 0
4 KMeanAF Adasyn recall Logistic Regression 0.562 74 479 979 0
23 KMeanAF TomekLinks recall AdaBoost 0.562 74 479 979 0
0 KMeanAF Adasyn f1 Logistic Regression 0.562 74 479 979 0
15 KMeanAF Adasyn f4 AdaBoost 0.562 74 479 979 0
22 KMeanAF TomekLinks recall Naive Bayes 0.562 74 479 979 0
30 KMeanAF TomekLinks f4 Naive Bayes 0.562 74 479 979 0
31 KMeanAF TomekLinks f4 AdaBoost 0.562 74 479 979 0
12 KMeanAF Adasyn f4 Logistic Regression 0.562 74 479 979 0
14 KMeanAF Adasyn f4 Naive Bayes 0.562 74 479 979 0
67 AggloAF Adasyn f1 AdaBoost 0.535 70 488 970 4
7 KMeanAF Adasyn recall AdaBoost 0.506 74 230 1228 0
84 AggloAF TomekLinks recall Logistic Regression 0.502 74 212 1246 0
94 AggloAF TomekLinks f4 Naive Bayes 0.502 74 212 1246 0
92 AggloAF TomekLinks f4 Logistic Regression 0.502 74 212 1246 0
87 AggloAF TomekLinks recall AdaBoost 0.502 74 212 1246 0
86 AggloAF TomekLinks recall Naive Bayes 0.502 74 212 1246 0
70 AggloAF Adasyn recall Naive Bayes 0.502 74 212 1246 0
79 AggloAF Adasyn f4 AdaBoost 0.502 74 212 1246 0
78 AggloAF Adasyn f4 Naive Bayes 0.502 74 212 1246 0
75 AggloAF Adasyn f2 AdaBoost 0.502 74 212 1246 0
74 AggloAF Adasyn f2 Naive Bayes 0.502 74 212 1246 0
66 AggloAF Adasyn f1 Naive Bayes 0.502 74 212 1246 0
71 AggloAF Adasyn recall AdaBoost 0.502 74 212 1246 0

The best models depending on the scoring metric are shown below.

Best Accuracy Best Precision Best Recall Best F1 Best F2 Best F4
Model ID 48 50 0 32 32 0
Clustering DBSCANAF DBSCANAF KMeanAF DBSCANAF DBSCANAF KMeanAF
Resampling TomekLinks TomekLinks Adasyn Adasyn Adasyn Adasyn
Scoring f1 f1 f1 f1 f1 f1
Model Logistic Reg Naive Bayes Logistic Reg Logistic Reg Logistic Reg Logistic Reg
Precision 0.1636 0.188 0.0703 0.1809 0.1809 0.0703
Recall 0.1216 0.3378 1.0 0.4595 0.4595 1.0
Accuracy 0.9275 0.8975 0.361 0.8733 0.8733 0.361
F1 0.1395 0.2415 0.1313 0.2595 0.2595 0.1313
F2 0.1282 0.2914 0.2743 0.3512 0.3512 0.2743
F4 0.1235 0.3227 0.5624 0.4213 0.4213 0.5624
True 1 9 25 74 34 34 74
True 0 1411 1349 479 1303 1303 479
False 1 46 108 979 154 154 979
False 0 65 49 0 40 40 0

Confusion Matrix - Best models for Cluster only Classification

Classification within the clusters

As suggested, we will now adjust our approach and fit classification models for each cluster individually. We evaluated either over-sampling or no resampling, since under-sampling would leave too few observations given the size of some clusters. We evaluate the results in two variations (a sketch of the per-cluster procedure follows the list):

  • Same model for all clusters: We apply the same model to all clusters and calculate the performance based on the aggregated confusion matrix.

  • Best model per cluster: We select the best performing model per cluster label and calculate the performance based on the aggregated confusion matrix.
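A minimal sketch of the per-cluster variant is shown below. It assumes that `X_train`/`X_test` still contain the cluster label column (here `KMeanAF`) from the clustering step, and it uses AdaBoost for every cluster purely for illustration.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix

# Aggregate one confusion matrix over all clusters.
total_cm = np.zeros((2, 2), dtype=int)
for label in sorted(X_train["KMeanAF"].unique()):
    tr = X_train["KMeanAF"] == label
    te = X_test["KMeanAF"] == label
    # Skip clusters without test samples or with a single class in training.
    if te.sum() == 0 or y_train[tr].nunique() < 2:
        continue
    clf = AdaBoostClassifier(random_state=42)
    clf.fit(X_train.loc[tr].drop(columns="KMeanAF"), y_train[tr])
    y_pred = clf.predict(X_test.loc[te].drop(columns="KMeanAF"))
    total_cm += confusion_matrix(y_test[te], y_pred, labels=[0, 1])

print(total_cm)  # aggregated confusion matrix across all clusters
```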

Results for Same Model for all Clusters

Clustering Model Resampling Scoring Precision Recall Accuracy F1 F2 F4
22 AggloAF Logistic Regression None f4 0.1163 0.6667 0.7741 0.1980 0.3425 0.5215
84 KMeanAF Logistic Regression None f1 0.1201 0.5811 0.8100 0.1991 0.3287 0.4741
85 KMeanAF Logistic Regression None f2 0.1201 0.5811 0.8100 0.1991 0.3287 0.4741
45 DBSCANAF Decision Tree None f2 0.1250 0.5541 0.7913 0.2040 0.3285 0.4610
77 KMeanAF Decision Tree None f2 0.1333 0.5135 0.8446 0.2117 0.3270 0.4398
44 DBSCANAF Decision Tree None f1 0.1329 0.5135 0.8147 0.2111 0.3265 0.4395
86 KMeanAF Logistic Regression None f4 0.1103 0.6216 0.7809 0.1874 0.3226 0.4884
29 AggloAF Naive Bayes None f2 0.1242 0.5200 0.8265 0.2005 0.3176 0.4379
47 DBSCANAF Decision Tree None recall 0.1114 0.5541 0.7652 0.1855 0.3087 0.4491
46 DBSCANAF Decision Tree None f4 0.1114 0.5541 0.7652 0.1855 0.3087 0.4491
3 AggloAF AdaBoost Adasyn recall 0.0996 0.6000 0.7563 0.1708 0.2992 0.4631
93 KMeanAF Naive Bayes None f2 0.1176 0.4865 0.8309 0.1895 0.2990 0.4107
2 AggloAF AdaBoost Adasyn f4 0.1153 0.4933 0.8204 0.1869 0.2979 0.4135
21 AggloAF Logistic Regression None f2 0.0928 0.6400 0.7234 0.1622 0.2938 0.4752
23 AggloAF Logistic Regression None recall 0.1020 0.5467 0.7797 0.1719 0.2920 0.4351
78 KMeanAF Decision Tree None f4 0.1095 0.5000 0.8144 0.1796 0.2918 0.4133
61 DBSCANAF Naive Bayes None f2 0.1023 0.5405 0.7489 0.1720 0.2911 0.4317
20 AggloAF Logistic Regression None f1 0.0935 0.6133 0.7351 0.1623 0.2904 0.4622
94 KMeanAF Naive Bayes None f4 0.1066 0.5000 0.8094 0.1758 0.2877 0.4108
33 DBSCANAF AdaBoost Adasyn f2 0.0854 0.7027 0.6223 0.1523 0.2873 0.4930
87 KMeanAF Logistic Regression None recall 0.1134 0.4459 0.8358 0.1808 0.2811 0.3803
79 KMeanAF Decision Tree None recall 0.1016 0.5000 0.8001 0.1689 0.2803 0.4063
62 DBSCANAF Naive Bayes None f4 0.0817 0.6622 0.6243 0.1454 0.2734 0.4669
30 AggloAF Naive Bayes None f4 0.0870 0.5600 0.7356 0.1505 0.2682 0.4242
67 KMeanAF AdaBoost Adasyn recall 0.1006 0.4595 0.8111 0.1650 0.2681 0.3798
95 KMeanAF Naive Bayes None recall 0.0823 0.6081 0.7084 0.1449 0.2669 0.4419
91 KMeanAF Naive Bayes Adasyn recall 0.0751 0.5676 0.6985 0.1327 0.2456 0.4096
27 AggloAF Naive Bayes Adasyn recall 0.0743 0.5733 0.6832 0.1315 0.2446 0.4109
89 KMeanAF Naive Bayes Adasyn f2 0.0752 0.5541 0.7051 0.1325 0.2438 0.4031
90 KMeanAF Naive Bayes Adasyn f4 0.0750 0.5541 0.7040 0.1320 0.2432 0.4027
38 DBSCANAF AdaBoost None f4 0.0673 0.6622 0.5408 0.1222 0.2393 0.4357
39 DBSCANAF AdaBoost None recall 0.0673 0.6622 0.5408 0.1222 0.2393 0.4357
56 DBSCANAF Naive Bayes Adasyn f1 0.0643 0.6216 0.5453 0.1166 0.2275 0.4118
70 KMeanAF AdaBoost None f4 0.0662 0.5270 0.6787 0.1176 0.2203 0.3739
Clustering Model Resampling Scoring F2 True 1 True 0 False 1 False 0
22 AggloAF Logistic Regression None f4 0.3425 50 1338 380 25
84 KMeanAF Logistic Regression None f1 0.3287 43 1432 315 31
85 KMeanAF Logistic Regression None f2 0.3287 43 1432 315 31
45 DBSCANAF Decision Tree None f2 0.3285 41 1172 287 33
77 KMeanAF Decision Tree None f2 0.3270 38 1500 247 36
44 DBSCANAF Decision Tree None f1 0.3265 38 1211 248 36
86 KMeanAF Logistic Regression None f4 0.3226 46 1376 371 28
29 AggloAF Naive Bayes None f2 0.3176 39 1443 275 36
47 DBSCANAF Decision Tree None recall 0.3087 41 1132 327 33
46 DBSCANAF Decision Tree None f4 0.3087 41 1132 327 33
3 AggloAF AdaBoost Adasyn recall 0.2992 45 1311 407 30
93 KMeanAF Naive Bayes None f2 0.2990 36 1477 270 38
2 AggloAF AdaBoost Adasyn f4 0.2979 37 1434 284 38
21 AggloAF Logistic Regression None f2 0.2938 48 1249 469 27
23 AggloAF Logistic Regression None recall 0.2920 41 1357 361 34
78 KMeanAF Decision Tree None f4 0.2918 37 1446 301 37
61 DBSCANAF Naive Bayes None f2 0.2911 40 1108 351 34
20 AggloAF Logistic Regression None f1 0.2904 46 1272 446 29
94 KMeanAF Naive Bayes None f4 0.2877 37 1437 310 37
33 DBSCANAF AdaBoost Adasyn f2 0.2873 52 902 557 22
87 KMeanAF Logistic Regression None recall 0.2811 33 1489 258 41
79 KMeanAF Decision Tree None recall 0.2803 37 1420 327 37
62 DBSCANAF Naive Bayes None f4 0.2734 49 908 551 25
30 AggloAF Naive Bayes None f4 0.2682 42 1277 441 33
67 KMeanAF AdaBoost Adasyn recall 0.2681 34 1443 304 40
95 KMeanAF Naive Bayes None recall 0.2669 45 1245 502 29
91 KMeanAF Naive Bayes Adasyn recall 0.2456 42 1230 517 32
27 AggloAF Naive Bayes Adasyn recall 0.2446 43 1182 536 32
89 KMeanAF Naive Bayes Adasyn f2 0.2438 41 1243 504 33
90 KMeanAF Naive Bayes Adasyn f4 0.2432 41 1241 506 33
38 DBSCANAF AdaBoost None f4 0.2393 49 780 679 25
39 DBSCANAF AdaBoost None recall 0.2393 49 780 679 25
56 DBSCANAF Naive Bayes Adasyn f1 0.2275 46 790 669 28
70 KMeanAF AdaBoost None f4 0.2203 39 1197 550 35

The best models depending on the scoring metric are shown below.

Best Accuracy Best Precision Best Recall Best F1 Best F2 Best F4
Model ID 77 77 34 77 22 22
Clustering KMeanAF KMeanAF DBSCANAF KMeanAF AggloAF AggloAF
Model Decision Tree Decision Tree AdaBoost Decision Tree Logistic Reg Logistic Reg
Resampling None None Adasyn None None None
Scoring f2 f2 f4 f2 f4 f4
Precision 0.1333 0.1333 0.0509 0.1333 0.1163 0.1163
Recall 0.5135 0.5135 0.8108 0.5135 0.6667 0.6667
Accuracy 0.8446 0.8446 0.2616 0.8446 0.7741 0.7741
F1 0.2117 0.2117 0.0958 0.2117 0.198 0.198
F2 0.327 0.327 0.2035 0.327 0.3425 0.3425
F4 0.4398 0.4398 0.4318 0.4398 0.5215 0.5215
True 1 38 38 60 38 50 50
True 0 1500 1500 341 1500 1338 1338
False 1 247 247 1118 247 380 380
False 0 36 36 14 36 25 25

Confusion matrix - Best models for Cluster individual classification

Results for Best Model per Cluster

Clustering CID Resampling Scoring Model Precision Recall Accuracy F1 F2 F4
448 AggloAF 0 Adasyn f1 Log Reg 0.3333 0.7500 0.9759 0.4615 0.6000 0.6986
480 AggloAF 1 Adasyn f1 Log Reg 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
519 AggloAF 2 Adasyn recall AdaBoost 0.0900 0.8182 0.5550 0.1622 0.3125 0.5543
560 AggloAF 3 None f1 Log Reg 0.1887 0.6250 0.6231 0.2899 0.4274 0.5502
594 AggloAF 4 None f1 Naive Bayes 0.1176 0.3333 0.8927 0.1739 0.2439 0.3009
628 AggloAF 5 None recall Log Reg 0.1719 0.8462 0.6382 0.2857 0.4741 0.6875
666 AggloAF 6 None f2 Naive Bayes 0.1750 1.0000 0.2048 0.2979 0.5147 0.7829
687 AggloAF 7 Adasyn f4 AdaBoost 0.0741 0.6667 0.7719 0.1333 0.2564 0.4533
721 AggloAF 8 None f1 Dec. Tree 0.1667 0.5000 0.8717 0.2500 0.3571 0.4474
343 DBSCANAF 0 None recall AdaBoost 0.2174 1.0000 0.4375 0.3571 0.5814 0.8252
377 DBSCANAF 1 None f2 Dec. Tree 0.1042 0.5682 0.8227 0.1761 0.3005 0.4502
400 DBSCANAF 2 None f1 Log Reg 0.1887 0.6250 0.6231 0.2899 0.4274 0.5502
438 DBSCANAF 3 None recall Naive Bayes 0.2500 1.0000 0.3684 0.4000 0.6250 0.8500
21 KMeanAF 0 None recall Dec. Tree 0.0794 0.5000 0.7249 0.1370 0.2427 0.3812
32 KMeanAF 1 Adasyn f1 Log Reg 0.5000 1.0000 0.9961 0.6667 0.8333 0.9444
64 KMeanAF 2 Adasyn f1 Log Reg 0.2500 0.4286 0.9065 0.3158 0.3750 0.4113
113 KMeanAF 3 None f1 Dec. Tree 0.1875 0.8571 0.6747 0.3077 0.5000 0.7083
144 KMeanAF 4 None f1 Log Reg 0.0870 0.8571 0.5461 0.1579 0.3093 0.5635
160 KMeanAF 5 Adasyn f1 Log Reg 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
218 KMeanAF 6 None f2 Naive Bayes 0.1750 1.0000 0.2048 0.2979 0.5147 0.7829
243 KMeanAF 7 None f1 AdaBoost 0.1852 0.7143 0.8400 0.2941 0.4545 0.6115
274 KMeanAF 8 None f1 Naive Bayes 0.1818 0.4000 0.8909 0.2500 0.3226 0.3736
304 KMeanAF 9 None f1 Log Reg 0.1887 0.6250 0.6231 0.2899 0.4274 0.5502
Clustering CID Resampling Scoring Model F2 True 1 True 0 False 1 False 0
448 AggloAF 0 Adasyn f1 Log Reg 0.6000 3 280 6 1
480 AggloAF 1 Adasyn f1 Log Reg 1.0000 0 451 0 0
519 AggloAF 2 Adasyn recall AdaBoost 0.3125 9 107 91 2
560 AggloAF 3 None f1 Log Reg 0.4274 10 71 43 6
594 AggloAF 4 None f1 Naive Bayes 0.2439 2 156 15 4
628 AggloAF 5 None recall Log Reg 0.4741 11 86 53 2
666 AggloAF 6 None f2 Naive Bayes 0.5147 14 3 66 0
687 AggloAF 7 Adasyn f4 AdaBoost 0.2564 2 86 25 1
721 AggloAF 8 None f1 Dec. Tree 0.3571 4 159 20 4
343 DBSCANAF 0 None recall AdaBoost 0.5814 10 18 36 0
377 DBSCANAF 1 None f2 Dec. Tree 0.3005 25 1061 215 19
400 DBSCANAF 2 None f1 Log Reg 0.4274 10 71 43 6
438 DBSCANAF 3 None recall Naive Bayes 0.6250 4 3 12 0
21 KMeanAF 0 None recall Dec. Tree 0.2427 5 161 58 5
32 KMeanAF 1 Adasyn f1 Log Reg 0.8333 1 253 1 0
64 KMeanAF 2 Adasyn f1 Log Reg 0.3750 3 123 9 4
113 KMeanAF 3 None f1 Dec. Tree 0.5000 6 50 26 1
144 KMeanAF 4 None f1 Log Reg 0.3093 6 71 63 1
160 KMeanAF 5 Adasyn f1 Log Reg 1.0000 0 501 0 0
218 KMeanAF 6 None f2 Naive Bayes 0.5147 14 3 66 0
243 KMeanAF 7 None f1 AdaBoost 0.4545 5 121 22 2
274 KMeanAF 8 None f1 Naive Bayes 0.3226 2 96 9 3
304 KMeanAF 9 None f1 Log Reg 0.4274 10 71 43 6
Clustering Precision Recall Accuracy F1 F2 F4
0 AggloAF 0.1471 0.7333 0.8109 0.2450 0.4080 0.5940
2 KMeanAF 0.1490 0.7027 0.8248 0.2459 0.4031 0.5766
1 DBSCANAF 0.1380 0.6622 0.7841 0.2284 0.3763 0.5413
Clustering F2 True 1 True 0 False 1 False 0
0 AggloAF 0.4080 55 1399 319 20
2 KMeanAF 0.4031 52 1450 297 22
1 DBSCANAF 0.3763 49 1153 306 25

The best models depending on the scoring metric are shown below.

Best Accuracy Best Precision Best Recall Best F1 Best F2 Best F4
Model ID 2 2 0 2 0 0
Clustering KMeanAF KMeanAF AggloAF KMeanAF AggloAF AggloAF
Precision 0.149 0.149 0.1471 0.149 0.1471 0.1471
Recall 0.7027 0.7027 0.7333 0.7027 0.7333 0.7333
Accuracy 0.8248 0.8248 0.8109 0.8248 0.8109 0.8109
F1 0.2459 0.2459 0.245 0.2459 0.245 0.245
F2 0.4031 0.4031 0.408 0.4031 0.408 0.408
F4 0.5766 0.5766 0.594 0.5766 0.594 0.594
True 1 52 52 55 52 55 55
True 0 1450 1450 1399 1450 1399 1399
False 1 297 297 319 297 319 319
False 0 22 22 20 22 20 20

Confusion matrix - Best Models for Best Model per Cluster

6. Dimensionality reduction - PCA and KPCA

Dimensionality reduction is crucial when working with high-dimensional datasets like the stroke dataset because it helps simplify the data, making it easier to visualize and analyze. High-dimensional data can be complex and computationally expensive to process, and it often contains redundant or irrelevant features that can obscure meaningful patterns. By reducing the number of dimensions, we can highlight the most important features that capture the majority of the data's variance, thus facilitating more efficient and insightful analysis. Reducing dimensions aids in visualizing the data, which is essential for understanding the underlying structure and relationships within the dataset. Visualizing the data in two dimensions can help identify clusters of patients with similar characteristics or risk profiles. This can provide valuable insights for developing targeted prevention and treatment strategies, ultimately improving patient outcomes. We will apply two widely used dimensionality reduction methods:

  • PCA : Principal Component Analysis and

  • KPCA : Kernel Function Principal Component Analysis.

PCA

Principal Component Analysis (PCA) is a widely used technique for dimensionality reduction that transforms the original high-dimensional data into a new set of uncorrelated variables called principal components. These principal components are linear combinations of the original features, ordered by the amount of variance they capture in the data. The first principal component captures the most variance, followed by the second, and so on. By selecting a subset of these components, we can reduce the dimensionality of the dataset while retaining most of its variability. PCA is particularly useful for identifying patterns and trends in the data, as well as for noise reduction and feature extraction.

In a first attempt to support visualization we will first scale the data and then run PCA to reduce it to two dimensions.

count mean std min 25% 50% 75% max
gender 5104.0 0.0 1.0 -0.840 -0.840 -0.840 1.188 3.217
age 5104.0 -0.0 1.0 -1.910 -0.807 0.077 0.785 1.714
hypertension 5104.0 -0.0 1.0 -0.328 -0.328 -0.328 -0.328 3.051
heart_disease 5104.0 0.0 1.0 -0.239 -0.239 -0.239 -0.239 4.182
ever_married 5104.0 -0.0 1.0 -1.383 -1.383 0.723 0.723 0.723
work_type 5104.0 -0.0 1.0 -1.988 -0.153 -0.153 0.764 1.681
Residence_type 5104.0 0.0 1.0 -1.017 -1.017 0.984 0.984 0.984
smoking_status 5104.0 -0.0 1.0 -1.286 -1.286 0.581 0.581 1.515
glucose_group 5104.0 0.0 1.0 -0.412 -0.412 -0.412 -0.412 2.945
bmi_group 5104.0 -0.0 1.0 -2.129 -1.079 -0.030 1.020 1.020
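A sketch of this projection, assuming `X_cluster` is the encoded feature matrix summarized above:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X_cluster: encoded feature matrix without the target (assumed from the step above).
X_scaled = StandardScaler().fit_transform(X_cluster)

pca = PCA(n_components=2)
components_2d = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)        # variance explained per component
print(pca.explained_variance_ratio_.sum())  # cumulative EVR (0.399 in this analysis)

# Passing a fraction instead of an integer keeps as many components as needed
# to reach that share of the variance.
pca_95 = PCA(n_components=0.95).fit(X_scaled)
print(pca_95.n_components_)                 # number of components for 95% of the variance
```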

Scatterplot of 2D PCA principal components and target variable stroke as hue

The plot doesn't show good separation regarding the target variable. Also, the cumulative explained variance ratio of the two components is rather low at 0.399.

Principal Component Explained Variance Ratio Cumulative EVR Eigenvalues
0 PC1 0.274 0.274 2.738
1 PC2 0.125 0.399 1.252

So we rerun PCA with the aim of explaining 95% of the variance and plot the cumulative explained variance.

PCA Cumulative explained variance

We require 9 of the 10 components to reach the 95% target; together they explain 97.1% of the variance, as shown in the table below.

Principal Component Explained Variance Ratio Cumulative EVR Eigenvalues
0 PC1 0.274 0.274 2.738
1 PC2 0.125 0.399 1.252
2 PC3 0.100 0.499 1.001
3 PC4 0.096 0.595 0.962
4 PC5 0.088 0.683 0.882
5 PC6 0.082 0.766 0.825
6 PC7 0.080 0.846 0.799
7 PC8 0.065 0.911 0.648
8 PC9 0.060 0.971 0.603

Now we fit classification models using the principal components as features. The results are listed below.

Resampling Scoring Model Precision Recall Accuracy F1 F2 F4
3 Adasyn f1 AdaBoost 0.114 0.784 0.696 0.199 0.361 0.583
0 Adasyn f1 Logistic Regression 0.115 0.743 0.711 0.199 0.355 0.562
8 Adasyn f2 Logistic Regression 0.115 0.743 0.711 0.199 0.355 0.562
4 Adasyn recall Logistic Regression 0.111 0.730 0.704 0.193 0.345 0.549
12 Adasyn f4 Logistic Regression 0.111 0.730 0.704 0.193 0.345 0.549
7 Adasyn recall AdaBoost 0.075 0.946 0.435 0.139 0.285 0.563
11 Adasyn f2 AdaBoost 0.075 0.946 0.435 0.139 0.285 0.563
15 Adasyn f4 AdaBoost 0.075 0.946 0.435 0.139 0.285 0.563
Resampling Scoring Model F2 True 1 True 0 False 1 False 0
3 Adasyn f1 AdaBoost 0.361 58 1008 450 16
0 Adasyn f1 Logistic Regression 0.355 55 1034 424 19
8 Adasyn f2 Logistic Regression 0.355 55 1034 424 19
4 Adasyn recall Logistic Regression 0.345 54 1025 433 20
12 Adasyn f4 Logistic Regression 0.345 54 1025 433 20
7 Adasyn recall AdaBoost 0.285 70 597 861 4
11 Adasyn f2 AdaBoost 0.285 70 597 861 4
15 Adasyn f4 AdaBoost 0.285 70 597 861 4

The best models depending on the scoring metric are shown below.

Best Accuracy Best Precision Best Recall Best F1 Best F2 Best F4
Model ID 0 0 7 3 3 3
Resampling Adasyn Adasyn Adasyn Adasyn Adasyn Adasyn
Scoring f1 f1 recall f1 f1 f1
Model Logistic Reg Logistic Reg AdaBoost AdaBoost AdaBoost AdaBoost
Precision 0.115 0.115 0.075 0.114 0.114 0.114
Recall 0.743 0.743 0.946 0.784 0.784 0.784
Accuracy 0.711 0.711 0.435 0.696 0.696 0.696
F1 0.199 0.199 0.139 0.199 0.199 0.199
F2 0.355 0.355 0.285 0.361 0.361 0.361
F4 0.562 0.562 0.563 0.583 0.583 0.583
True 1 55 55 70 58 58 58
True 0 1034 1034 597 1008 1008 1008
False 1 424 424 861 450 450 450
False 0 19 19 4 16 16 16

Confusion matrix Best models for PCA based classification

KPCA

Kernel PCA (KPCA) extends the traditional PCA method by using kernel functions to handle non-linear relationships in the data. While PCA is limited to linear transformations, KPCA projects the data into a higher-dimensional space using a kernel function, allowing it to capture complex, non-linear structures. The kernel function implicitly computes the principal components in this higher-dimensional space without the need to explicitly perform the transformation, making it computationally efficient. KPCA is especially useful when the data has intricate non-linear patterns that cannot be captured by linear methods like PCA. It provides a more flexible approach to dimensionality reduction, enabling better feature extraction and improved performance of machine learning algorithms on complex datasets.

As with PCA, we will first scale the data and then run KPCA to reduce the data to two dimensions.

count mean std min 25% 50% 75% max
gender 5104.0 0.0 1.0 -0.840 -0.840 -0.840 1.188 3.217
age 5104.0 -0.0 1.0 -1.910 -0.807 0.077 0.785 1.714
hypertension 5104.0 -0.0 1.0 -0.328 -0.328 -0.328 -0.328 3.051
heart_disease 5104.0 0.0 1.0 -0.239 -0.239 -0.239 -0.239 4.182
ever_married 5104.0 -0.0 1.0 -1.383 -1.383 0.723 0.723 0.723
work_type 5104.0 -0.0 1.0 -1.988 -0.153 -0.153 0.764 1.681
Residence_type 5104.0 0.0 1.0 -1.017 -1.017 0.984 0.984 0.984
smoking_status 5104.0 -0.0 1.0 -1.286 -1.286 0.581 0.581 1.515
glucose_group 5104.0 0.0 1.0 -0.412 -0.412 -0.412 -0.412 2.945
bmi_group 5104.0 -0.0 1.0 -2.129 -1.079 -0.030 1.020 1.020
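A corresponding Kernel PCA sketch is shown below; the RBF kernel is an assumption, as the kernel used in the analysis is not stated, and `X_cluster` is again the assumed encoded feature matrix.

```python
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler

# X_cluster: encoded feature matrix without the target (assumed from the step above).
X_scaled = StandardScaler().fit_transform(X_cluster)

# Two components for the scatterplot; the RBF kernel is an illustrative choice.
kpca = KernelPCA(n_components=2, kernel="rbf")
components_2d = kpca.fit_transform(X_scaled)

# Without n_components, Kernel PCA keeps all non-zero components; the magnitude
# of the eigenvalues indicates how much variance each component captures
# (attribute is eigenvalues_ in scikit-learn >= 1.0, lambdas_ in older versions).
kpca_full = KernelPCA(kernel="rbf").fit(X_scaled)
print(kpca_full.eigenvalues_[:10])
```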

Scatterplot of 2D KPCA principal components and target variable stroke as hue

At first glance the plot appears to show good separation regarding the target variable. However, when we zoom in and take a closer look, we can see that the classes are hardly separated.

Zoomed in Scatterplots of 2D KPCA principal components and target variable stroke as hue

When we rerun KPCA without specifying the number of components, we can see how much variance each principal component explains by looking at the magnitude of its eigenvalue.

Eigenvalues of KPCA Principal Components

We evaluate the Kernel PCA results by applying our different classification models, using the principal components as features and the stroke variable as target. We evaluate 20, 40 and 60 principal components.

KPC Resampling Scoring Model Precision Recall Accuracy F1 F2 F4
3 20 Adasyn f1 AdaBoost 0.1231 0.7703 0.7239 0.2123 0.3755 0.5883
35 40 Adasyn f1 AdaBoost 0.1223 0.7568 0.7258 0.2105 0.3714 0.5798
32 40 Adasyn f1 Logistic Regression 0.1225 0.7432 0.7304 0.2103 0.3691 0.5726
36 40 Adasyn recall Logistic Regression 0.1225 0.7432 0.7304 0.2103 0.3691 0.5726
40 40 Adasyn f2 Logistic Regression 0.1225 0.7432 0.7304 0.2103 0.3691 0.5726
44 40 Adasyn f4 Logistic Regression 0.1225 0.7432 0.7304 0.2103 0.3691 0.5726
0 20 Adasyn f1 Logistic Regression 0.1204 0.7432 0.7252 0.2072 0.3652 0.5698
4 20 Adasyn recall Logistic Regression 0.1198 0.7432 0.7239 0.2064 0.3642 0.5691
8 20 Adasyn f2 Logistic Regression 0.1198 0.7432 0.7239 0.2064 0.3642 0.5691
12 20 Adasyn f4 Logistic Regression 0.1198 0.7432 0.7239 0.2064 0.3642 0.5691
68 60 Adasyn recall Logistic Regression 0.1189 0.7297 0.7258 0.2045 0.3600 0.5604
64 60 Adasyn f1 Logistic Regression 0.1189 0.7297 0.7258 0.2045 0.3600 0.5604
72 60 Adasyn f2 Logistic Regression 0.1189 0.7297 0.7258 0.2045 0.3600 0.5604
76 60 Adasyn f4 Logistic Regression 0.1189 0.7297 0.7258 0.2045 0.3600 0.5604
60 40 TomekLinks f4 Logistic Regression 0.1182 0.7297 0.7239 0.2034 0.3586 0.5594
52 40 TomekLinks recall Logistic Regression 0.1182 0.7297 0.7239 0.2034 0.3586 0.5594
56 40 TomekLinks f2 Logistic Regression 0.1182 0.7297 0.7239 0.2034 0.3586 0.5594
48 40 TomekLinks f1 Logistic Regression 0.1182 0.7297 0.7239 0.2034 0.3586 0.5594
16 20 TomekLinks f1 Logistic Regression 0.1048 0.8243 0.6514 0.1860 0.3474 0.5872
43 40 Adasyn f2 AdaBoost 0.0870 0.8514 0.5614 0.1579 0.3088 0.5613
181 60 TomekLinks recall Decision Tree 0.0832 0.7432 0.5920 0.1497 0.2874 0.5068
131 40 Adasyn f1 AdaBoost 0.0804 0.7703 0.5633 0.1456 0.2836 0.5119
103 20 Adasyn recall AdaBoost 0.0791 0.7703 0.5555 0.1434 0.2802 0.5087
111 20 Adasyn f4 AdaBoost 0.0791 0.7703 0.5555 0.1434 0.2802 0.5087
107 20 Adasyn f2 AdaBoost 0.0782 0.7703 0.5503 0.1420 0.2780 0.5065
163 60 Adasyn f1 AdaBoost 0.0794 0.7297 0.5783 0.1432 0.2766 0.4925
99 20 Adasyn f1 AdaBoost 0.0779 0.7568 0.5555 0.1412 0.2759 0.5003
| Model ID | KPC | Resampling | Scoring | Model | F2 | True 1 | True 0 | False 1 | False 0 |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 20 | Adasyn | f1 | AdaBoost | 0.3755 | 57 | 1052 | 406 | 17 |
| 35 | 40 | Adasyn | f1 | AdaBoost | 0.3714 | 56 | 1056 | 402 | 18 |
| 32 | 40 | Adasyn | f1 | Logistic Regression | 0.3691 | 55 | 1064 | 394 | 19 |
| 36 | 40 | Adasyn | recall | Logistic Regression | 0.3691 | 55 | 1064 | 394 | 19 |
| 40 | 40 | Adasyn | f2 | Logistic Regression | 0.3691 | 55 | 1064 | 394 | 19 |
| 44 | 40 | Adasyn | f4 | Logistic Regression | 0.3691 | 55 | 1064 | 394 | 19 |
| 0 | 20 | Adasyn | f1 | Logistic Regression | 0.3652 | 55 | 1056 | 402 | 19 |
| 4 | 20 | Adasyn | recall | Logistic Regression | 0.3642 | 55 | 1054 | 404 | 19 |
| 8 | 20 | Adasyn | f2 | Logistic Regression | 0.3642 | 55 | 1054 | 404 | 19 |
| 12 | 20 | Adasyn | f4 | Logistic Regression | 0.3642 | 55 | 1054 | 404 | 19 |
| 68 | 60 | Adasyn | recall | Logistic Regression | 0.3600 | 54 | 1058 | 400 | 20 |
| 64 | 60 | Adasyn | f1 | Logistic Regression | 0.3600 | 54 | 1058 | 400 | 20 |
| 72 | 60 | Adasyn | f2 | Logistic Regression | 0.3600 | 54 | 1058 | 400 | 20 |
| 76 | 60 | Adasyn | f4 | Logistic Regression | 0.3600 | 54 | 1058 | 400 | 20 |
| 60 | 40 | TomekLinks | f4 | Logistic Regression | 0.3586 | 54 | 1055 | 403 | 20 |
| 52 | 40 | TomekLinks | recall | Logistic Regression | 0.3586 | 54 | 1055 | 403 | 20 |
| 56 | 40 | TomekLinks | f2 | Logistic Regression | 0.3586 | 54 | 1055 | 403 | 20 |
| 48 | 40 | TomekLinks | f1 | Logistic Regression | 0.3586 | 54 | 1055 | 403 | 20 |
| 16 | 20 | TomekLinks | f1 | Logistic Regression | 0.3474 | 61 | 937 | 521 | 13 |
| 43 | 40 | Adasyn | f2 | AdaBoost | 0.3088 | 63 | 797 | 661 | 11 |
| 181 | 60 | TomekLinks | recall | Decision Tree | 0.2874 | 55 | 852 | 606 | 19 |
| 131 | 40 | Adasyn | f1 | AdaBoost | 0.2836 | 57 | 806 | 652 | 17 |
| 103 | 20 | Adasyn | recall | AdaBoost | 0.2802 | 57 | 794 | 664 | 17 |
| 111 | 20 | Adasyn | f4 | AdaBoost | 0.2802 | 57 | 794 | 664 | 17 |
| 107 | 20 | Adasyn | f2 | AdaBoost | 0.2780 | 57 | 786 | 672 | 17 |
| 163 | 60 | Adasyn | f1 | AdaBoost | 0.2766 | 54 | 832 | 626 | 20 |
| 99 | 20 | Adasyn | f1 | AdaBoost | 0.2759 | 56 | 795 | 663 | 18 |

The best model for each scoring metric is shown below.

| | Best Accuracy | Best Precision | Best Recall | Best F1 | Best F2 | Best F4 |
|---|---|---|---|---|---|---|
| Model ID | 32 | 3 | 43 | 3 | 3 | 3 |
| KPC | 40 | 20 | 40 | 20 | 20 | 20 |
| Resampling | Adasyn | Adasyn | Adasyn | Adasyn | Adasyn | Adasyn |
| Scoring | f1 | f1 | f2 | f1 | f1 | f1 |
| Model | Logistic Reg | AdaBoost | AdaBoost | AdaBoost | AdaBoost | AdaBoost |
| Precision | 0.1225 | 0.1231 | 0.0870 | 0.1231 | 0.1231 | 0.1231 |
| Recall | 0.7432 | 0.7703 | 0.8514 | 0.7703 | 0.7703 | 0.7703 |
| Accuracy | 0.7304 | 0.7239 | 0.5614 | 0.7239 | 0.7239 | 0.7239 |
| F1 | 0.2103 | 0.2123 | 0.1579 | 0.2123 | 0.2123 | 0.2123 |
| F2 | 0.3691 | 0.3755 | 0.3088 | 0.3755 | 0.3755 | 0.3755 |
| F4 | 0.5726 | 0.5883 | 0.5613 | 0.5883 | 0.5883 | 0.5883 |
| True 1 | 55 | 57 | 63 | 57 | 57 | 57 |
| True 0 | 1064 | 1052 | 797 | 1052 | 1052 | 1052 |
| False 1 | 394 | 406 | 661 | 406 | 406 | 406 |
| False 0 | 19 | 17 | 11 | 17 | 17 | 17 |

Confusion matrix - best model for KPCA-based classification
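For reference, below is a minimal sketch of how such a KPCA-based pipeline can be assembled with scikit-learn and imbalanced-learn. It assumes that KPC denotes the number of kernel PCA components, uses synthetic stand-in data rather than the engineered stroke features, and the kernel and other parameters are illustrative choices, not the exact configuration used in this analysis.

```python
# Illustrative sketch of a KPCA-based classification pipeline (hypothetical parameters).
from imblearn.over_sampling import ADASYN
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the engineered stroke features.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

pipeline = Pipeline(steps=[
    ("kpca", KernelPCA(n_components=20, kernel="rbf")),  # KPC = 20 / 40 / 60 in the grid above
    ("resample", ADASYN(random_state=42)),               # or TomekLinks() for under-sampling
    ("clf", LogisticRegression(max_iter=1000)),          # or AdaBoostClassifier()
])

pipeline.fit(X_train, y_train)      # the resampler is applied during fit only
y_pred = pipeline.predict(X_test)   # at prediction time, only KPCA + classifier run
```

Using the imbalanced-learn Pipeline keeps the resampling step inside cross-validation and fitting, so the test data is never resampled.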

7. Recommended Model

After training and evaluating the models, it is hard to single out one best model, as the choice depends strongly on how heavily one weighs the misclassification of a stroke patient. The unsupervised learning approaches did not yield large improvements, but cluster-based classification did show a slight gain in the F4 score, while dimensionality reduction did not improve on the base models. Based on our performance metrics, we propose two promising models. The best F2 score was achieved by the classification-only AdaBoost base model number 19 with TomekLinks under-sampling. The best F4 score was achieved by AdaBoost model number 63, also with TomekLinks under-sampling, where the classification was enriched with the cluster labels produced by DBSCAN. Both models are good starting points for further evaluation. A minimal sketch of the cluster-enriched pipeline and the confusion matrices for both models are shown below.
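The sketch below illustrates the cluster-enriched approach behind model 63: DBSCAN cluster labels are appended to the feature matrix before TomekLinks under-sampling and AdaBoost classification. Synthetic stand-in data replaces the engineered stroke features, and eps, min_samples, and the remaining parameters are assumptions for illustration, not the values used in this analysis.

```python
# Illustrative sketch: append DBSCAN cluster labels as a feature,
# under-sample with TomekLinks, then classify with AdaBoost.
import numpy as np
from imblearn.under_sampling import TomekLinks
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the engineered stroke features.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Unsupervised step: DBSCAN labels (-1 marks noise); eps/min_samples must be tuned per dataset.
cluster_labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X_scaled)
X_enriched = np.column_stack([X_scaled, cluster_labels])

X_train, X_test, y_train, y_test = train_test_split(
    X_enriched, y, stratify=y, test_size=0.3, random_state=42)

# Remove majority-class samples that form Tomek links, on the training set only.
X_res, y_res = TomekLinks().fit_resample(X_train, y_train)

clf = AdaBoostClassifier(random_state=42).fit(X_res, y_res)
y_pred = clf.predict(X_test)
```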

Best F2 Score

Model 19 - AdaBoost

| Setting / Metric | Value |
|---|---|
| Resampling | TomekLinks |
| Scoring | f1 |
| Model | AdaBoost |
| Precision | 0.167192 |
| Recall | 0.716216 |
| Accuracy | 0.813969 |
| F1 | 0.2711 |
| F2 | 0.4323 |
| F4 | 0.600266 |
| True 1 | 53 |
| True 0 | 1194 |
| False 1 | 264 |
| False 0 | 21 |

Confusion matrix for the best F2 score - AdaBoost base model 19

Best F4 Score

Model 63 - AdaBoost DBSCAN

| Setting / Metric | Value |
|---|---|
| Clustering | DBSCANAF |
| Resampling | TomekLinks |
| Scoring | f4 |
| Model | AdaBoost |
| Precision | 0.10559 |
| Recall | 0.918919 |
| Accuracy | 0.619856 |
| F1 | 0.189415 |
| F2 | 0.361702 |
| F4 | 0.632385 |
| True 1 | 68 |
| True 0 | 881 |
| False 1 | 576 |
| False 0 | 6 |

Confusion matrix for the best F4 score - AdaBoost DBSCAN cluster model 63
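The reported F2 and F4 values can be reproduced directly from the confusion-matrix counts above, where True 1, False 1, and False 0 correspond to true positives, false positives, and false negatives. The short check below computes the F-beta score from those counts.

```python
# Recompute F-beta from confusion-matrix counts to verify the reported scores.
def f_beta(tp, fp, fn, beta):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Model 19 (best F2): True 1 = 53, False 1 = 264, False 0 = 21
print(round(f_beta(53, 264, 21, beta=2), 4))  # 0.4323
# Model 63 (best F4): True 1 = 68, False 1 = 576, False 0 = 6
print(round(f_beta(68, 576, 6, beta=4), 4))   # 0.6324
```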

8. Key Findings and Insights

The analysis of the stroke prediction model has revealed several critical factors that significantly influence the likelihood of stroke in patients. Understanding these drivers allows for better-targeted interventions and more effective prevention strategies. However, to further enhance the accuracy and reliability of the model, additional data and features are essential.

Main Drivers Influencing Stroke Risk

  • Age: Older patients have a higher risk of stroke. This finding underscores the importance of age-related health monitoring and interventions, as the likelihood of experiencing a stroke increases with age, necessitating enhanced medical vigilance for the elderly.

  • Hypertension: Presence of hypertension increases stroke risk. Hypertension, or high blood pressure, is a well-established risk factor for stroke, emphasizing the need for strict blood pressure control through medication, lifestyle changes, and regular monitoring.

  • Heart Disease: Patients with heart disease are more likely to experience a stroke. The strong correlation between cardiovascular conditions and stroke highlights the necessity for comprehensive care plans that address both heart disease management and stroke prevention.

  • Average Glucose Level: Higher average glucose levels are associated with diabetes and increase stroke risk. Elevated glucose levels indicate poor diabetes control, which can lead to vascular damage and increased stroke risk, highlighting the importance of maintaining optimal glucose levels through diet, exercise, and medication adherence.

Insights

  • Preventive Measures: Targeted interventions for patients with hypertension and heart disease could reduce stroke incidence. Implementing comprehensive care plans that include lifestyle modifications, medication adherence, and regular health check-ups is crucial for mitigating stroke risk in these high-risk populations.

  • Public Health Strategies: Programs aimed at managing blood glucose levels and promoting healthy aging could be beneficial. Public health initiatives should focus on widespread screening for diabetes and hypertension, coupled with campaigns that encourage physical activity, healthy eating, and smoking cessation to reduce stroke risk at a population level.

  • Holistic Health Approach: Adopting a holistic approach that considers the interplay between various risk factors can enhance stroke prevention efforts. By addressing lifestyle factors such as diet, exercise, and stress management, healthcare providers can simultaneously mitigate risks associated with hypertension, heart disease, and diabetes, leading to better overall health outcomes.

  • Technology and Monitoring: Leveraging technology, such as wearable devices and telemedicine, can aid in the continuous monitoring of at-risk individuals. These technologies provide real-time data on blood pressure, glucose levels, and heart rate, allowing for timely interventions and personalized care plans that can significantly reduce stroke risk.

  • Education and Awareness: Raising awareness about the risk factors and preventive measures for stroke is crucial. Public health campaigns and educational programs should aim to inform individuals about the importance of regular health screenings, recognizing early symptoms of stroke, and seeking immediate medical attention, empowering them to take proactive steps towards stroke prevention.

Future Directions

To further improve the understanding and prediction of stroke risk, it is imperative to gather more comprehensive data and incorporate additional features into the model. Including a wider range of demographic, genetic, and lifestyle factors can provide a more nuanced view of stroke risk. Additionally, longitudinal data tracking patients over time could offer insights into how risk factors evolve and interact. By expanding the dataset and refining the features used, we can develop more accurate and robust models that enhance our ability to prevent and manage stroke.

9. Suggestions for Next Steps

  • Feature Enhancement: Incorporate additional health-related features, such as cholesterol levels, physical activity, and diet, to improve the model's predictive performance. Including more comprehensive lifestyle and biometric data can help create a more accurate and holistic risk assessment for stroke.

  • Longitudinal Data: Utilize longitudinal data to track changes in patient health over time, which can provide deeper insights into the progression and interaction of risk factors. Longitudinal studies allow for the observation of how individual risk profiles evolve, leading to more precise and personalized predictions.

  • Assessment of Existing Studies and Scores: Evaluate and integrate findings from established studies and scoring systems, such as the Framingham Heart Study and the CHA₂DS₂-VASc score, which are widely used for predicting cardiovascular and stroke risk. Comparing our model's performance with these well-regarded benchmarks can provide validation and highlight areas for improvement. Additionally, exploring datasets from these studies can offer valuable insights and potential features to enhance our model.

  • Collaborative Research: Engage in collaborative research with other institutions and researchers to leverage a broader range of expertise and datasets. By pooling resources and knowledge, we can develop more robust and generalizable models that are applicable across diverse populations.

  • Validation Across Diverse Populations: Test and validate the model across different demographic and geographic populations to ensure its applicability and reliability. Understanding how the model performs in various contexts can help identify any biases or limitations, leading to more equitable and effective stroke risk prediction tools.

  • Model Re-evaluation: Regularly update and re-evaluate the model as new data becomes available to ensure its continued relevance and accuracy. Incorporating the latest research findings and medical advancements will help maintain the model's effectiveness in predicting stroke risk.