
Heart Disease Dataset: An Exercise in Data Preparation

Exploring heart disease prediction is a challenging but rewarding task, particularly because data quality and preparation play an essential role in building effective models. In this post, I’ll be using the Heart Disease dataset from the UCI data repository (2021), a comprehensive collection of clinical attributes. Let’s dive into a structured approach to examining and preparing this dataset for analysis.

Initial Data Exploration

Before we start our analysis, let’s have a preliminary understanding of the data.
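Throughout this post, data refers to a pandas DataFrame holding the combined multi-site records. Here is a minimal loading sketch, assuming the CSV from the Kaggle mirror has been saved locally as heart_disease_uci.csv (the filename is an assumption; adjust it to your copy):

import pandas as pd

# Load the combined UCI heart disease records
data = pd.read_csv('heart_disease_uci.csv')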

The Variables

# Display the variables
print(data.columns)
## Index(['id', 'age', 'sex', 'dataset', 'cp', 'trestbps', 'chol', 'fbs',
##        'restecg', 'thalch', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num'],
##       dtype='object')

The columns provide clinical features essential for heart disease analysis. Here’s a quick overview:

  1. id (unique id for each patient)
  2. age (age of the patient in years)
  3. dataset (place of study)
  4. sex (Male/Female)
  5. cp (chest pain type: typical angina, atypical angina, non-anginal, asymptomatic)
  6. trestbps (resting blood pressure in mm Hg on admission to the hospital)
  7. chol (serum cholesterol in mg/dl)
  8. fbs (whether fasting blood sugar > 120 mg/dl)
  9. restecg (resting electrocardiographic results: normal, st-t abnormality, lv hypertrophy)
  10. thalch (maximum heart rate achieved)
  11. exang (exercise-induced angina: True/False)
  12. oldpeak (ST depression induced by exercise relative to rest)
  13. slope (the slope of the peak exercise ST segment)
  14. ca (number of major vessels (0-3) colored by fluoroscopy)
  15. thal (normal, fixed defect, reversible defect)
  16. num (the predicted attribute)

Sample Data

Here’s a glimpse of the data itself to better understand its structure:

# Display the first few rows of the dataset
print(data.head())
##    id  age     sex    dataset  ...        slope   ca               thal num
## 0   1   63    Male  Cleveland  ...  downsloping  0.0       fixed defect   0
## 1   2   67    Male  Cleveland  ...         flat  3.0             normal   2
## 2   3   67    Male  Cleveland  ...         flat  2.0  reversable defect   1
## 3   4   37    Male  Cleveland  ...  downsloping  0.0             normal   0
## 4   5   41  Female  Cleveland  ...    upsloping  0.0             normal   0
## 
## [5 rows x 16 columns]

Obtaining More Information

# Describing the data frame
summary = data.describe()

# Extract specific statistics
age_mean = round(summary.loc['mean', 'age'],2)
age_min = summary.loc['min', 'age']
age_max = summary.loc['max', 'age']
trestbps_mean = round(summary.loc['mean', 'trestbps'],2)
trestbps_min = summary.loc['min', 'trestbps']
ca_missing = data['ca'].isnull().sum()

This summary highlights several observations:

  • Age: Average age is around 53.51 years, ranging from 28 to 77.

  • trestbps (Resting Blood Pressure): Mean of 132.13 mm Hg, but values as low as 0 suggest outliers or incorrect entries, since a blood pressure of zero is not physiologically possible.

  • ca (Number of Major Vessels): This feature has a substantial number of missing values.

These insights underscore the need to address data quality issues, including missing values and outliers.
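Since these figures were extracted programmatically, it is worth printing them explicitly; a quick sketch using the variables computed in the previous block:

# Report the extracted statistics
print(f"Age: mean {age_mean}, range {age_min} to {age_max}")
print(f"Resting blood pressure: mean {trestbps_mean}, min {trestbps_min}")
print(f"Missing 'ca' values: {ca_missing}")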

Finding Out More About the Variables

# Information about the data set

print(data.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 920 entries, 0 to 919
## Data columns (total 16 columns):
##  #   Column    Non-Null Count  Dtype  
## ---  ------    --------------  -----  
##  0   id        920 non-null    int64  
##  1   age       920 non-null    int64  
##  2   sex       920 non-null    object 
##  3   dataset   920 non-null    object 
##  4   cp        920 non-null    object 
##  5   trestbps  861 non-null    float64
##  6   chol      890 non-null    float64
##  7   fbs       830 non-null    object 
##  8   restecg   918 non-null    object 
##  9   thalch    865 non-null    float64
##  10  exang     865 non-null    object 
##  11  oldpeak   858 non-null    float64
##  12  slope     611 non-null    object 
##  13  ca        309 non-null    float64
##  14  thal      434 non-null    object 
##  15  num       920 non-null    int64  
## dtypes: float64(5), int64(3), object(8)
## memory usage: 115.1+ KB
## None

Data Quality Assessment

  • Missing Values

    Several features, such as ca, slope, and thal, have missing values that may impact the analysis. Missing values in critical variables like these require careful handling; a quick audit sketch follows this list.

  • Outliers

    Some variables, like trestbps and oldpeak, may contain outliers that could affect statistical analysis and modeling performance.

  • Categorical Variables

    Categorical variables (sex, cp, etc.) may need to be converted into a numerical format (e.g., one-hot encoding) for machine learning algorithms.

  • Target Variable

    The target variable (num, which indicates the presence or absence of heart disease) is fully populated, which is essential for training a predictive model.
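A quick audit makes the first two issues concrete; a minimal sketch that reports the share of missing values per column and counts the physiologically impossible zero blood-pressure readings:

# Share of missing values per column, highest first
missing_share = data.isnull().mean().sort_values(ascending = False)
print(missing_share[missing_share > 0])

# A resting blood pressure of 0 is not physiologically possible
print((data['trestbps'] == 0).sum(), "rows with trestbps == 0")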

Exploratory Data Analysis

Summary Statistics

  1. Count:

The count for some columns is less than the total number of entries (920). For example, trestbps has 861 non-null entries, indicating that there are missing values in this column.

The columns ca, chol, fbs, restecg, thalch, exang, oldpeak, and slope also show missing values, which may require handling during data preprocessing.

  2. Mean, Standard Deviation, Minimum, and Maximum:
  • Age: The average age of participants is approximately 53.5 years, with a minimum of 28 and a maximum of 77, suggesting a diverse age range.

  • Trestbps (Resting Blood Pressure): The mean is around 132 mm Hg, with values ranging from 0 to 200, indicating some potential outliers, especially since a resting blood pressure of 0 is unusual.

  • Oldpeak: The average oldpeak (depression induced by exercise relative to rest) is about 0.88, with a minimum of -2.6 (which may also indicate an outlier) and a maximum of 6.2.

  • Number of Major Vessels (ca): This column has values ranging from 0 to 3, with a mean of approximately 0.6, suggesting that a significant number of patients have at least one major vessel affected.

  3. Percentiles:

The 25th, 50th (median), and 75th percentiles provide insight into the distribution of continuous variables. For example, the median age is 54 years, while 75% of the participants are aged 60 or younger.
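These percentile figures come straight from the describe() summary computed earlier; for example:

# Percentiles of age from the summary statistics
print(summary.loc[['25%', '50%', '75%'], 'age'])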

Data Types and Non-Null Counts

  1. Data Types:

The dataset consists of 16 columns, with a mix of int64, float64, and object types. Continuous variables (like age, trestbps, chol, etc.) are of type float64 or int64, while categorical variables (like sex, cp, etc.) are of type object.

  2. Non-Null Counts:

The non-null counts indicate that not all variables are fully populated. For instance:

The slope column has only 611 non-null entries, which means more than 30% of the data is missing in this feature. Similarly, ca and thal also have significant missing values. These missing values need to be addressed, either through imputation, removal, or other strategies, depending on the specific context of your analysis.

Data Preprocessing

The next steps involve data pre-processing: handling missing values, exploring relationships between features and the target variable, and selecting or extracting features to improve model performance.

Data Cleaning and Imputation

It is clear from the observations above that we will have to impute the data for multiple variables. Let’s begin the exercise of imputing missing values:

  1. Categorical Variables

Let’s impute restecg first. For a categorical variable with values like normal, st-t abnormality, and lv hypertrophy, a simple imputation method that respects its categorical nature is appropriate; here we fill missing entries with the mode.


# Impute 'restecg' with its mode
mode_value = data['restecg'].mode()[0]
data_imputed = data.assign(restecg = data['restecg'].fillna(mode_value))
data = data_imputed

  2. Continuous Variables

For trestbps, chol, and thalch, which measure resting blood pressure, serum cholesterol, and maximum heart rate achieved respectively, we can use median imputation; the median is robust to the outliers we observed earlier.

from sklearn.impute import SimpleImputer

cols = ['trestbps', 'chol', 'thalch']
imputer = SimpleImputer(strategy = 'median')
data[cols] = imputer.fit_transform(data[cols])

  3. Complex Features

In the case of ca (number of major vessels involved), we cannot use a simple mean/median imputation, especially when a significant number of values are missing.

From subject matter experts, we know that ca correlates with age, chol (cholesterol levels), trestbps (resting blood pressure), and thalch (maximum heart rate achieved). So we use a more advanced method, KNN imputation. The idea behind this imputation is that if two patients are close together on all of these variables, their values of ca are also likely to be close.

from sklearn.impute import KNNImputer

# Features known to influence 'ca' in the real world
features = ['age', 'chol', 'trestbps', 'thalch', 'ca']

knn_imputer = KNNImputer(n_neighbors = 5)
data_for_impute = data[features]

# fit_transform returns a new array, so the original
# dataset is not modified in place
imputed_data = knn_imputer.fit_transform(data_for_impute)

# Reset the indices so that the rows of imputed_df
# and data stay aligned
imputed_df = pd.DataFrame(imputed_data, columns = features)
imputed_df = imputed_df.reset_index(drop = True)
data = data.reset_index(drop = True)

# Update 'ca' with the imputed values
data['ca'] = imputed_df['ca']

data[['ca']].head()
##     ca
## 0  0.0
## 1  3.0
## 2  2.0
## 3  0.0
## 4  0.0
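One caveat worth checking: KNN imputation averages the ca values of the five nearest neighbors, so imputed entries can be fractional even though the original feature takes only the whole values 0 through 3. A quick sanity check, with a rounding step if whole vessel counts are needed downstream:

# Imputed 'ca' values are neighbor averages, so verify the range
print(data['ca'].between(0, 3).all())

# Snap imputed values back to whole vessel counts
data['ca'] = data['ca'].round().clip(0, 3)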

Next, consider slope. Given that its values are “flat”, “downsloping”, and “upsloping”, it can be treated as an ordinal categorical variable, since there is a clear order between these categories. It will need more sophisticated imputation due to its high missing rate and its correlation with other variables.

  1. Order of the Categories:
  • “downsloping” (worse condition)
  • “flat” (neutral condition)
  • “upsloping” (better condition)
  2. Encoding

Before applying machine learning algorithms, you’ll need to encode slope numerically while keeping its order intact, using ordinal encoding:

  • “downsloping” is 0
  • “flat” is 1
  • “upsloping” is 2

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

categories = [['downsloping', 'flat', 'upsloping']]

# NaNs are not in the category list, so they get mapped
# to the unknown_value (-1)
ordinal_encoder = OrdinalEncoder(categories=categories, 
                      handle_unknown = 'use_encoded_value', 
                      unknown_value = -1)
data['slope_encoded'] = ordinal_encoder.fit_transform(data[['slope']])

# Change the -1 codes back into np.nan
data['slope_encoded'] = data['slope_encoded'].replace(-1.0, np.nan)
print(data[['slope', 'slope_encoded']])
##            slope  slope_encoded
## 0    downsloping            0.0
## 1           flat            1.0
## 2           flat            1.0
## 3    downsloping            0.0
## 4      upsloping            2.0
## ..           ...            ...
## 915          NaN            NaN
## 916          NaN            NaN
## 917          NaN            NaN
## 918          NaN            NaN
## 919          NaN            NaN
## 
## [920 rows x 2 columns]

We encoded the NaNs as unknown values mapped to -1, and then converted them back to NaN. After imputing, we will replace slope with its encoded values for machine learning compatibility.

We will include the variables age, trestbps (resting blood pressure), chol (cholesterol), and thalch (maximum heart rate achieved), since these are most likely to influence the value of slope. Also, the MICE imputer cannot handle non-numeric types, which is why we impute the encoded column.


from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputation_columns = ['age', 'trestbps', 'chol', 
                      'thalch', 'slope_encoded']
mice_imputer = IterativeImputer(max_iter = 50, random_state = 0)

imputed_data = mice_imputer.fit_transform(data[imputation_columns])
imputed_df = pd.DataFrame(imputed_data, columns=imputation_columns)

# Snap the imputed values back to the ordinal codes 0, 1, 2
data['slope_encoded'] = imputed_df['slope_encoded'].round()
data['slope_encoded'] = data['slope_encoded'].astype("float64")

# Replace 'slope' with its encoded version
data['slope'] = data['slope_encoded']
data = data.drop(columns=['slope_encoded'])

data.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 920 entries, 0 to 919
## Data columns (total 16 columns):
##  #   Column    Non-Null Count  Dtype  
## ---  ------    --------------  -----  
##  0   id        920 non-null    int64  
##  1   age       920 non-null    int64  
##  2   sex       920 non-null    object 
##  3   dataset   920 non-null    object 
##  4   cp        920 non-null    object 
##  5   trestbps  920 non-null    float64
##  6   chol      920 non-null    float64
##  7   fbs       830 non-null    object 
##  8   restecg   920 non-null    object 
##  9   thalch    920 non-null    float64
##  10  exang     865 non-null    object 
##  11  oldpeak   858 non-null    float64
##  12  slope     920 non-null    float64
##  13  ca        920 non-null    float64
##  14  thal      434 non-null    object 
##  15  num       920 non-null    int64  
## dtypes: float64(6), int64(3), object(7)
## memory usage: 115.1+ KB

Next, we will impute oldpeak (ST depression induced by exercise relative to rest) and exang (exercise-induced angina). The thalch (maximum heart rate achieved) and slope variables are likely to have a strong relationship with oldpeak, as they directly relate to the heart’s response during exercise. Similarly, both variables are relevant for assessing the heart’s performance under exercise stress, which can influence the likelihood of the angina represented by exang.

Imputing Missing Variables

We observe that fasting blood sugar, fbs (whether fasting blood sugar > 120 mg/dl), has a significant number of rows that need to be imputed.

High fasting blood sugar levels can indicate poor metabolic control and are associated with diabetes, a significant risk factor for cardiovascular disease. Studies have shown that elevated fasting blood sugar levels correlate with an increased risk of cardiovascular events.

The inclusion of this variable can improve the predictive power of models aimed at assessing heart disease risk.


from sklearn.impute import KNNImputer
import pandas as pd

knn_imputer = KNNImputer(n_neighbors = 5)
variables = ['age', 'trestbps', 'chol', 'fbs', 'ca']
imputed_data = knn_imputer.fit_transform(data[variables])
imputed_df = pd.DataFrame(imputed_data, columns = variables)

# 'fbs' is True/False, which the imputer treats as 0/1 floats,
# so round the imputed values back to {0, 1}
imputed_df['fbs'] = imputed_df['fbs'].round().astype(int)
data['fbs'] = imputed_df['fbs']
data['fbs'] = data['fbs'].astype("int64")

Next, with more than half of its values missing, the variable thal poses a challenge. It is not a good candidate for imputation due to the high imputation uncertainty, and we may also introduce bias if we try to impute it.

Predictive imputation methods like KNN or classification models depend on relationships with other features to infer missing values. For a variable with this much missingness, there may be an insufficient basis to make reliable predictions, especially if its relationship with other features is weak.
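A quick check confirms the scale of the problem:

# Fraction of 'thal' entries that are missing
print(round(data['thal'].isnull().mean() * 100, 1), "% missing")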

We will therefore drop this variable from our analysis.


# Drop the 'thal' column from the dataset
data = data.drop(columns=['thal'])

Advanced Imputation Techniques

Next, we will use thalch and slope for imputing oldpeak and exang. They both assess the heart’s performance under stress (exercise), potentially influencing angina likelihood.

def get_dtype(col):
  # Treat low-cardinality object columns as categoricals
  return "category" if len(col.unique()) < 20 and col.dtype == "object" else col.dtype

def convert_dtype(col):
  return col.astype(get_dtype(col))

# miceforest needs categorical columns typed as 'category'
data = data.apply(convert_dtype)

# Using miceforest which uses random forests to model complex relationships

import miceforest as mf

# Define the variable schema so that 'oldpeak' and 'exang'
# are imputed primarily from 'thalch' and 'slope'
variable_schema = {
    'oldpeak': ['thalch', 'slope', 'age', 'trestbps', 'chol'],
    'exang': ['thalch', 'slope', 'age', 'trestbps', 'chol'],
}

# initializing the MICE kernel
kernel = mf.ImputationKernel(
  data,
  random_state=1991, # any fixed integer makes the run reproducible
  num_datasets=4,
  variable_schema = variable_schema,
)

kernel.mice(iterations=10)
imputed_data = kernel.complete_data(dataset=0)
imputed_data.head()
##    id  age     sex    dataset               cp  ...  exang  oldpeak  slope   ca  num
## 0   1   63    Male  Cleveland   typical angina  ...  False      2.3    0.0  0.0    0
## 1   2   67    Male  Cleveland     asymptomatic  ...   True      1.5    1.0  3.0    2
## 2   3   67    Male  Cleveland     asymptomatic  ...   True      2.6    1.0  2.0    1
## 3   4   37    Male  Cleveland      non-anginal  ...  False      3.5    0.0  0.0    0
## 4   5   41  Female  Cleveland  atypical angina  ...  False      1.4    2.0  0.0    0
## 
## [5 rows x 15 columns]
data['oldpeak'] = imputed_data['oldpeak']
data['exang'] = imputed_data['exang']
data['exang'] = data['exang'].astype("boolean")
data.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 920 entries, 0 to 919
## Data columns (total 15 columns):
##  #   Column    Non-Null Count  Dtype   
## ---  ------    --------------  -----   
##  0   id        920 non-null    int64   
##  1   age       920 non-null    int64   
##  2   sex       920 non-null    category
##  3   dataset   920 non-null    category
##  4   cp        920 non-null    category
##  5   trestbps  920 non-null    float64 
##  6   chol      920 non-null    float64 
##  7   fbs       920 non-null    int64   
##  8   restecg   920 non-null    category
##  9   thalch    920 non-null    float64 
##  10  exang     920 non-null    boolean 
##  11  oldpeak   920 non-null    float64 
##  12  slope     920 non-null    float64 
##  13  ca        920 non-null    float64 
##  14  num       920 non-null    int64   
## dtypes: boolean(1), category(4), float64(6), int64(4)
## memory usage: 78.0 KB

Summary

Data preparation reveals essential insights and patterns that allow us to make informed choices in imputation and handling missing values. Here’s a summary of what we accomplished so far:

  • Addressed missing values using appropriate imputation techniques for categorical and continuous variables.

  • Identified outliers in critical clinical features (such as resting blood pressure values of 0) for treatment in the modeling phase.

  • Prepared the dataset for the modeling phase by encoding categorical values.

In the next phase, we’ll focus on feature engineering, including transforming categorical variables, creating new features, and scaling continuous variables to build a model that effectively predicts heart disease. Stay tuned as we transform this data into valuable insights for heart disease prediction!
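As a preview of that phase, here is a minimal sketch of one-hot encoding the remaining categorical columns with pandas. The column list reflects the cleaned dataset above, and drop_first is one common choice for avoiding redundant dummy columns:

# One-hot encode the remaining categorical features
categorical_cols = ['sex', 'dataset', 'cp', 'restecg']
model_data = pd.get_dummies(data, columns = categorical_cols, drop_first = True)
print(model_data.shape)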

References

UCI Data Repository. 2021. “UCI Heart Disease Data.” Kaggle. https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data.