Exploring heart disease prediction is a challenging but rewarding task, particularly because data quality and preparation play an essential role in building effective models. In this post, I’ll be using the Heart Disease dataset from the UCI data repository (2021), a comprehensive collection of clinical attributes. Let’s dive into a structured approach to examining and preparing this dataset for analysis.
Initial Data Exploration
Before we start our analysis, let’s have a preliminary understanding of the data.
The Variables
# Display the variables
print(data.columns)
## Index(['id', 'age', 'sex', 'dataset', 'cp', 'trestbps', 'chol', 'fbs',
## 'restecg', 'thalch', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num'],
## dtype='object')
The columns provide clinical features essential for heart disease analysis. Here’s a quick overview:
- id (Unique id for each patient)
- age (Age of the patient in years)
- dataset (place of study)
- sex (Male/Female)
- cp (chest pain type: typical angina, atypical angina, non-anginal, asymptomatic)
- trestbps (resting blood pressure, in mm Hg on admission to the hospital)
- chol (serum cholesterol in mg/dl)
- fbs (whether fasting blood sugar > 120 mg/dl)
- restecg (resting electrocardiographic results: normal, st-t abnormality, lv hypertrophy)
- thalch (maximum heart rate achieved)
- exang (exercise-induced angina: True/False)
- oldpeak (ST depression induced by exercise relative to rest)
- slope (the slope of the peak exercise ST segment)
- ca (number of major vessels (0-3) colored by fluoroscopy)
- thal (normal, fixed defect, reversible defect)
- num (the predicted attribute)
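Since num is the attribute we will predict, a quick look at its class balance is worthwhile before going further. A minimal sketch, using a small made-up frame in place of the full dataset that the post loads as `data`:

```python
import pandas as pd

# Illustrative stand-in for the full UCI frame; in the post, `data` has 920 rows
data = pd.DataFrame({'num': [0, 2, 1, 0, 0, 3, 0, 1]})

# num = 0 means no heart disease; higher values indicate increasing severity
counts = data['num'].value_counts().sort_index()
print(counts)
```

On the real dataset, the same two lines reveal whether the classes are imbalanced, which matters later when choosing evaluation metrics.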
Sample Data
Here’s a glimpse of the data itself to better understand its structure:
# Display the first few rows of the dataset
print(data.head())
## id age sex dataset ... slope ca thal num
## 0 1 63 Male Cleveland ... downsloping 0.0 fixed defect 0
## 1 2 67 Male Cleveland ... flat 3.0 normal 2
## 2 3 67 Male Cleveland ... flat 2.0 reversable defect 1
## 3 4 37 Male Cleveland ... downsloping 0.0 normal 0
## 4 5 41 Female Cleveland ... upsloping 0.0 normal 0
##
## [5 rows x 16 columns]
Obtaining more information
# Describing the data frame
summary = data.describe()
# Extract specific statistics
age_mean = round(summary.loc['mean', 'age'],2)
age_min = summary.loc['min', 'age']
age_max = summary.loc['max', 'age']
trestbps_mean = round(summary.loc['mean', 'trestbps'],2)
trestbps_min = summary.loc['min', 'trestbps']
ca_missing = data['ca'].isnull().sum()
This summary highlights several observations:
- Age: The average age is around 53.51 years, ranging from 28 to 77.
- trestbps (Resting Blood Pressure): Mean of 132.13 mm Hg, but values as low as 0 suggest outliers or incorrect entries, since a blood pressure of zero is not possible.
- ca (Number of Major Vessels): This feature has a substantial number of missing values.
These insights underscore the need to address data quality issues, including missing values and outliers.
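One quick way to surface the impossible zero blood-pressure readings mentioned above is a boolean filter. A minimal sketch with illustrative values; on the real data frame, the toy column is replaced by the loaded dataset:

```python
import pandas as pd

# Illustrative values; the real trestbps column has 861 non-null entries
data = pd.DataFrame({'trestbps': [120.0, 0.0, 140.0, None, 200.0]})

# A resting blood pressure of 0 mm Hg is physiologically impossible,
# so such rows are best treated as missing before imputation
suspect = data[data['trestbps'] == 0]
print(len(suspect))  # → 1
```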
Finding out more information about the variables
# Information about the data set
print(data.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 920 entries, 0 to 919
## Data columns (total 16 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 id 920 non-null int64
## 1 age 920 non-null int64
## 2 sex 920 non-null object
## 3 dataset 920 non-null object
## 4 cp 920 non-null object
## 5 trestbps 861 non-null float64
## 6 chol 890 non-null float64
## 7 fbs 830 non-null object
## 8 restecg 918 non-null object
## 9 thalch 865 non-null float64
## 10 exang 865 non-null object
## 11 oldpeak 858 non-null float64
## 12 slope 611 non-null object
## 13 ca 309 non-null float64
## 14 thal 434 non-null object
## 15 num 920 non-null int64
## dtypes: float64(5), int64(3), object(8)
## memory usage: 115.1+ KB
## None
Data Quality Assessment
- Missing Values: Several features, such as ca, slope, and thal, have missing values that may impact the analysis. Missing values in critical variables like these require careful handling.
- Outliers: Some variables, like trestbps and oldpeak, may contain outliers that could affect statistical analysis and modeling performance.
- Categorical Variables: Categorical variables (sex, cp, etc.) may need to be converted into a numerical format (e.g., one-hot encoding) for machine learning algorithms.
- Target Variable: The target variable (num, which indicates the presence or absence of heart disease) is fully populated, which is essential for training a predictive model.
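The missing-value picture described above can be quantified per column in one line. A sketch on a toy frame mirroring the kinds of gaps seen in ca and slope (the real frame has 920 rows):

```python
import pandas as pd

# Toy frame; in the real dataset the gaps in ca, slope, and thal are much larger
data = pd.DataFrame({
    'ca':    [0.0, None, 2.0, None],
    'slope': ['flat', None, 'upsloping', 'flat'],
    'num':   [0, 1, 0, 2],
})

# Percentage of missing values per column, worst first
missing_pct = (data.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```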
Exploratory Data Analysis
Summary Statistics
- Count: The count for some columns is less than the total number of entries (920). For example, trestbps has 861 non-null entries, indicating missing values in this column. The columns ca, chol, fbs, restecg, thalch, exang, oldpeak, and slope also show missing values, which may require handling during data preprocessing.
- Mean, Standard Deviation, Minimum, and Maximum:
  - Age: The average age of participants is approximately 53.5 years, with a minimum of 28 and a maximum of 77, suggesting a diverse age range.
  - trestbps (Resting Blood Pressure): The mean is around 132 mm Hg, with values ranging from 0 to 200, indicating potential outliers, especially since a resting blood pressure of 0 is not possible.
  - oldpeak: The average oldpeak (ST depression induced by exercise relative to rest) is about 0.88, with a minimum of -2.6 (which may also indicate an outlier) and a maximum of 6.2.
  - ca (Number of Major Vessels): Values range from 0 to 3, with a mean of approximately 0.6, suggesting that a significant number of patients have at least one major vessel affected.
- Percentiles:
The 25th, 50th (median), and 75th percentiles provide insight into the distribution of continuous variables. For example, the median age is 54 years, while 75% of the participants are aged 60 or younger.
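The percentile figures quoted above come from describe(), but they can also be computed directly with quantile(). A sketch with made-up ages spanning the dataset's observed range:

```python
import pandas as pd

# Made-up ages between the dataset's observed minimum (28) and maximum (77)
ages = pd.Series([28, 37, 41, 54, 54, 60, 63, 67, 77])

# 25th, 50th (median), and 75th percentiles
print(ages.quantile([0.25, 0.50, 0.75]))
```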
Data Types and Non-Null Counts
- Data Types:
The dataset consists of 16 columns, with a mix of int64, float64, and object types. Continuous variables (like age, trestbps, chol, etc.) are of type float64 or int64, while categorical variables (like sex, cp, etc.) are of type object.
- Non-Null Counts:
The non-null counts indicate that not all variables are fully populated. For instance:
The slope column has only 611 non-null entries, which means more than 30% of the data is missing in this feature. Similarly, ca and thal also have significant missing values. These missing values need to be addressed, either through imputation, removal, or other strategies, depending on the specific context of your analysis.
Data Preprocessing
The next steps will involve data pre-processing: handling missing values, exploring relationships between features and the target variable, and selecting or extracting features to improve model performance.
Data Cleaning and Imputation
It is clear from the observations above that we will have to impute the data for multiple variables. Let’s begin the exercise of imputing missing values:
- Categorical Variables
Let’s start by imputing restecg, a categorical variable with values like normal, st-t abnormality, and lv hypertrophy. A simple imputation method that respects its categorical nature is appropriate here.
# Impute 'restecg' with its mode
mode_value = data['restecg'].mode()[0]
data_imputed = data.assign(restecg = data['restecg'].fillna(mode_value))
data = data_imputed
- Continuous Variables
For trestbps, chol, and thalch, which measure resting blood pressure, serum cholesterol, and maximum heart rate achieved respectively, we can use median imputation.
from sklearn.impute import SimpleImputer
cols = ['trestbps', 'chol', 'thalch']
imputer = SimpleImputer(strategy = 'median')
data[cols] = imputer.fit_transform(data[cols])
- Complex Features
In the case of ca (number of major vessels involved), we cannot use a simple mean/median imputation, especially when a significant number of values are missing.
From subject matter experts, we know that ca correlates with age, chol (cholesterol), trestbps (resting blood pressure), and thalch (maximum heart rate). So we use an advanced method called KNN imputation. The idea is that if two patients are close together on all of these variables, their values of ca are also likely to be close.
from sklearn.impute import KNNImputer
# Features known to influence ca in the real world
features = ['age', 'chol', 'trestbps', 'thalch', 'ca']
knn_imputer = KNNImputer(n_neighbors = 5)
data_for_impute = data[features]
# making a copy so that the original dataset is not changed
imputed_data = knn_imputer.fit_transform(data_for_impute)
# Rebuild as a DataFrame and reset the indices so that
# imputed_df and data stay aligned row by row
imputed_df = pd.DataFrame(imputed_data, columns = features)
imputed_df = imputed_df.reset_index(drop = True)
data = data.reset_index(drop = True)
# updating on 'ca' with the imputed data
data['ca'] = imputed_df['ca']
data[['ca']].head()
## ca
## 0 0.0
## 1 3.0
## 2 2.0
## 3 0.0
## 4 0.0
Next, for slope: given that its values are “flat”, “downsloping”, and “upsloping”, it can be treated as an ordinal categorical variable, since there is a clear order or ranking between these categories. It will need a more sophisticated imputation method due to its high missing rate and its correlation with other variables.
- Order of the Categories:
- “downsloping” (worse condition)
- “flat” (neutral condition)
- “upsloping” (better condition)
- Encoding
Before applying machine learning algorithms, you’ll need to encode slope numerically while keeping its order intact, using ordinal encoding:
- “downsloping” is 0
- “flat” is 1
- “upsloping” is 2
from sklearn.preprocessing import OrdinalEncoder
categories = [['downsloping', 'flat', 'upsloping']]
# because the OrdinalEncoder cannot handle NaNs
ordinal_encoder = OrdinalEncoder(categories=categories,
                                 handle_unknown='use_encoded_value',
                                 unknown_value=-1)
data['slope_encoded'] = ordinal_encoder.fit_transform(data[['slope']])
# Changing back the -1 encoded into np.nan
data['slope_encoded'] = data['slope_encoded'].replace(-1.0, np.nan)
print(data[['slope', 'slope_encoded']])
## slope slope_encoded
## 0 downsloping 0.0
## 1 flat 1.0
## 2 flat 1.0
## 3 downsloping 0.0
## 4 upsloping 2.0
## .. ... ...
## 915 NaN NaN
## 916 NaN NaN
## 917 NaN NaN
## 918 NaN NaN
## 919 NaN NaN
##
## [920 rows x 2 columns]
We encoded unknown values, including the NaNs, as -1 and then converted them back to np.nan. After imputation, we will replace slope with its encoded values for machine learning compatibility.
We will include age, trestbps (resting blood pressure), chol (cholesterol), and thalch (maximum heart rate achieved), since these are most likely to influence the value of slope. Also, the MICE imputer cannot handle non-numeric types.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputation_columns = ['age', 'trestbps', 'chol', 'thalch', 'slope_encoded']
mice_imputer = IterativeImputer(max_iter = 50, random_state = 0)
imputed_data = mice_imputer.fit_transform(data[imputation_columns])
imputed_df = pd.DataFrame(imputed_data, columns=imputation_columns)
data['slope_encoded'] = round(imputed_df['slope_encoded'])
data['slope_encoded'] = data['slope_encoded'].astype("float64")
data['slope'] = data['slope_encoded']
data = data.drop(columns=['slope_encoded'])
data.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 920 entries, 0 to 919
## Data columns (total 16 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 id 920 non-null int64
## 1 age 920 non-null int64
## 2 sex 920 non-null object
## 3 dataset 920 non-null object
## 4 cp 920 non-null object
## 5 trestbps 920 non-null float64
## 6 chol 920 non-null float64
## 7 fbs 830 non-null object
## 8 restecg 920 non-null object
## 9 thalch 920 non-null float64
## 10 exang 865 non-null object
## 11 oldpeak 858 non-null float64
## 12 slope 920 non-null float64
## 13 ca 920 non-null float64
## 14 thal 434 non-null object
## 15 num 920 non-null int64
## dtypes: float64(6), int64(3), object(7)
## memory usage: 115.1+ KB
Next, we will impute oldpeak (ST depression induced by exercise relative to rest) and exang (exercise-induced angina). The thalch (maximum heart rate achieved) and slope variables are likely to have a strong relationship with oldpeak, as they directly relate to the heart’s response during exercise. Similarly, both variables are relevant for assessing the heart’s performance under exercise stress, which can influence the likelihood of angina represented by exang.
Imputing Missing Variables
We observe that fasting blood sugar, fbs (whether fasting blood sugar > 120 mg/dl), has a significant number of rows that need to be imputed.
High fasting blood sugar levels can indicate poor metabolic control and are associated with diabetes, a significant risk factor for cardiovascular disease. Studies have shown that elevated fasting blood sugar levels correlate with an increased risk of cardiovascular events.
The inclusion of this variable can improve the predictive power of models aimed at assessing heart disease risk.
from sklearn.impute import KNNImputer
import pandas as pd
knn_imputer = KNNImputer(n_neighbors = 5)
variables = ['age', 'trestbps', 'chol', 'fbs', 'ca']
imputed_data = knn_imputer.fit_transform(data[variables])
imputed_df = pd.DataFrame(imputed_data, columns = variables)
imputed_df['fbs'] = imputed_df['fbs'].round().astype(int)
data['fbs'] = imputed_df['fbs']
data['fbs'] = data['fbs'].astype("int64")
Next, with almost 50% of its data missing, the variable thal poses a challenge. It is not a good candidate for imputation due to the high imputation uncertainty; we may also introduce bias if we try to impute it.
Predictive imputation methods like KNN or classification models depend on relationships with other features to infer missing values. For a variable with this much missingness, there may be an insufficient basis to make reliable predictions, especially if its relationship with other features is weak.
We will therefore drop this variable from our analysis.
# Drop the 'thal' column from the dataset
data = data.drop(columns=['thal'])
Advanced Imputation Techniques
Next, we will use thalch and slope for imputing oldpeak and exang. They both assess the heart’s performance under stress (exercise), potentially influencing angina likelihood.
def get_dtype(col):
    return "category" if len(col.unique()) < 20 and col.dtype == "object" else col.dtype

def convert_dtype(col):
    return col.astype(get_dtype(col))

data = data.apply(convert_dtype)
# Using miceforest which uses random forests to model complex relationships
import miceforest as mf
# Define the variable schema to prioritize 'thalch' and 'slope'
# for imputing 'oldpeak' and 'exang'
variable_schema = {
    'oldpeak': ['thalch', 'slope', 'age', 'trestbps', 'chol'],
    'exang': ['thalch', 'slope', 'age', 'trestbps', 'chol'],
}
# initializing the MICE kernel
kernel = mf.ImputationKernel(
    data,
    random_state=1991,  # used as convention, can be replaced with any int
    num_datasets=4,
    variable_schema=variable_schema,
)
kernel.mice(iterations=10)
imputed_data = kernel.complete_data(dataset=0)
imputed_data.head()
## id age sex dataset cp ... exang oldpeak slope ca num
## 0 1 63 Male Cleveland typical angina ... False 2.3 0.0 0.0 0
## 1 2 67 Male Cleveland asymptomatic ... True 1.5 1.0 3.0 2
## 2 3 67 Male Cleveland asymptomatic ... True 2.6 1.0 2.0 1
## 3 4 37 Male Cleveland non-anginal ... False 3.5 0.0 0.0 0
## 4 5 41 Female Cleveland atypical angina ... False 1.4 2.0 0.0 0
##
## [5 rows x 15 columns]
data['oldpeak'] = imputed_data['oldpeak']
data['exang'] = imputed_data['exang']
data['exang'] = data['exang'].astype("boolean")
data.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 920 entries, 0 to 919
## Data columns (total 15 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 id 920 non-null int64
## 1 age 920 non-null int64
## 2 sex 920 non-null category
## 3 dataset 920 non-null category
## 4 cp 920 non-null category
## 5 trestbps 920 non-null float64
## 6 chol 920 non-null float64
## 7 fbs 920 non-null int64
## 8 restecg 920 non-null category
## 9 thalch 920 non-null float64
## 10 exang 920 non-null boolean
## 11 oldpeak 920 non-null float64
## 12 slope 920 non-null float64
## 13 ca 920 non-null float64
## 14 num 920 non-null int64
## dtypes: boolean(1), category(4), float64(6), int64(4)
## memory usage: 78.0 KB
Summary
Data preparation reveals essential insights and patterns that allow us to make informed choices in imputation and handling missing values. Here’s a summary of what we accomplished so far:
- Addressed missing values using appropriate imputation techniques for categorical and continuous variables.
- Identified outliers in critical clinical features, such as impossible resting blood pressure values, to be handled before modeling.
- Prepared the dataset for the modeling phase by encoding categorical values.
In the next phase, we’ll focus on feature engineering, including transforming categorical variables, creating new features, and scaling continuous variables to build a model that effectively predicts heart disease. Stay tuned as we transform this data into valuable insights for heart disease prediction!
References
UCI data repository. 2021. “UCI Heart Disease Data.” Kaggle. https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data.