Heart Disease Modeling Project

This project analyzes the UCI Heart Disease dataset (Cleveland subset) using Python and machine learning techniques. It includes steps like exploratory data analysis (EDA), preprocessing, classification, regression, PCA, and clustering.


Introduction

The aim of this project is to explore, clean, and model the Cleveland Heart Disease dataset, a widely used real-world clinical dataset. The dataset is sourced from the UCI Machine Learning Repository and is used to predict the presence or absence of heart disease in patients based on 13 clinical features.


Step 1: Dataset Reference

We use the processed.cleveland.data file, which contains 303 instances and 14 columns (13 features + 1 target).

Dataset Structure

Column Description
age Age in years
sex Sex (1 = male; 0 = female)
cp Chest pain type (0–3)
trestbps Resting blood pressure (mm Hg)
chol Serum cholesterol (mg/dl)
fbs Fasting blood sugar > 120 mg/dl
restecg Resting ECG results
thalach Max heart rate achieved
exang Exercise-induced angina (1 = yes)
oldpeak ST depression induced by exercise
slope Slope of peak ST segment
ca Number of major vessels (0–3)
thal Thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)
num Diagnosis (0 = no disease, 1–4 = disease)
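The file has no header row and encodes missing values as "?" (these occur in the ca and thal columns), so column names and missing-value markers must be supplied when loading. A minimal loading sketch with pandas, using a tiny inline sample in place of the real file path (an assumption for self-containment; in practice you would pass "processed.cleveland.data"):

```python
import io
import pandas as pd

# Column names for processed.cleveland.data (the file itself has no header row).
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]

# Two sample rows standing in for the real file; replace the StringIO object
# with the path "processed.cleveland.data" to load the full 303 instances.
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,?,2\n"
)

# Missing values are encoded as "?"; na_values maps them to NaN on read.
df = pd.read_csv(sample, header=None, names=columns, na_values="?")

# Collapse the 0-4 diagnosis into a binary label: 0 = no disease, 1 = disease.
df["target"] = (df["num"] > 0).astype(int)
print(df[["age", "thal", "num", "target"]])
```

The binary `target` column is a common convention for this dataset, since most published work predicts presence vs. absence rather than the four severity grades.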

Step 2: EDA & Data Preprocessing

This step focuses on understanding and preparing the dataset for machine learning. It involves exploring the structure of the data, identifying missing values, and applying necessary transformations to clean and standardize the features.

Exploratory Data Analysis (EDA)

In the EDA phase, we examined the dataset's structure and data types, reviewed summary statistics and feature distributions, and checked for missing values. The purpose of EDA was to gain insight into the data, spot any inconsistencies, and decide how to handle them during preprocessing.
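The inspection steps above can be sketched with standard pandas calls. The small frame below is a synthetic stand-in for the loaded dataset, not the project's actual data:

```python
import pandas as pd

# Synthetic stand-in for the loaded Cleveland dataset.
df = pd.DataFrame({
    "age": [63, 67, 41],
    "chol": [233.0, 286.0, None],  # one missing value, as found in ca/thal
    "num": [0, 2, 1],
})

# Structure and dtypes of every column.
df.info()

# Summary statistics for the numeric features.
print(df.describe())

# Count of missing values per column; this guides the preprocessing step.
print(df.isna().sum())
```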

Data Preprocessing

Based on the findings from EDA, missing values were handled and the features were cleaned and standardized for modeling.
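A minimal preprocessing sketch, under stated assumptions: the exact imputation strategy is not given in this README, so mode imputation for the categorical ca/thal codes is shown as one common choice, and scikit-learn's StandardScaler is used for standardization. The frame below is synthetic stand-in data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loaded frame; "?" entries have already been
# read in as NaN (they occur only in ca and thal in the real file).
df = pd.DataFrame({
    "age": [63.0, 67.0, 41.0, 56.0],
    "trestbps": [145.0, 160.0, 130.0, 120.0],
    "ca": [0.0, 3.0, None, 1.0],
    "num": [0, 2, 1, 0],
})

# One plausible choice (an assumption, not the project's documented method):
# fill the handful of missing ca values with the column mode, since ca is a
# categorical code. With ties, pandas' mode() returns the smallest value first.
df["ca"] = df["ca"].fillna(df["ca"].mode()[0])

# Standardize the continuous features to zero mean and unit variance.
features = ["age", "trestbps", "ca"]
df[features] = StandardScaler().fit_transform(df[features])

print(df[features].mean().round(6).tolist())  # each column now averages ~0
```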

Outcome


Step 3.1: Heart Disease Prediction

In this step, we built two supervised machine learning models — Logistic Regression and Random Forest Classifier — to predict the presence of heart disease.

After splitting the dataset into training and test sets, both models were trained and evaluated using standard classification metrics including Accuracy, Precision, Recall, F1-Score, and Confusion Matrix.
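The train/evaluate loop described above can be sketched as follows. Synthetic data via `make_classification` stands in for the preprocessed features, and the hyperparameters shown (seed, test size, `max_iter`) are illustrative assumptions, not the project's actual settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic 13-feature data standing in for the preprocessed Cleveland set.
X, y = make_classification(n_samples=303, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

results = {}
for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("Random Forest", RandomForestClassifier(random_state=42))]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = accuracy_score(y_test, pred)
    print(name,
          "acc=%.3f" % accuracy_score(y_test, pred),
          "prec=%.3f" % precision_score(y_test, pred),
          "rec=%.3f" % recall_score(y_test, pred),
          "f1=%.3f" % f1_score(y_test, pred))
    print(confusion_matrix(y_test, pred))
```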

Logistic Regression achieved slightly better balanced performance across all metrics, while Random Forest showed strong precision.

This step demonstrated the practical use of ML models in clinical risk prediction.


Step 3.2: Cholesterol Level Prediction

In this task, we built a Multiple Linear Regression model to predict the serum cholesterol level (chol) based on the remaining 12 clinical features (excluding the target label num and the chol column itself).
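A sketch of the regression setup described above, using scikit-learn's LinearRegression. The data here is fabricated (12 random predictor columns with a synthetic linear target) purely so the example runs end to end; it does not reproduce the project's results:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 12 predictor columns (everything except chol and num).
rng = np.random.default_rng(0)
X = rng.normal(size=(303, 12))
# Fabricated linear target with noise, centered near a plausible chol mean.
chol = X @ rng.normal(size=12) + rng.normal(scale=0.5, size=303) + 246

X_train, X_test, y_train, y_test = train_test_split(
    X, chol, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
r2 = r2_score(y_test, pred)
print("R^2: %.3f" % r2)
print("MSE: %.3f" % mean_squared_error(y_test, pred))
```

R² and mean squared error are the usual evaluation metrics for a regression task like this one.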

Methodology

Results

Key Findings


Step 3.3: Principal Component Analysis (PCA)

The goal of this step was to reduce the dataset’s dimensionality while retaining as much variance as possible, in preparation for unsupervised learning tasks like clustering.
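A minimal PCA sketch with scikit-learn. The 95% variance threshold is an illustrative assumption (the README does not state the cutoff used), and the feature matrix is synthetic stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 13-feature matrix standing in for the preprocessed features.
rng = np.random.default_rng(0)
X = rng.normal(size=(303, 13))

# PCA is scale-sensitive, so standardize before fitting.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps just enough components to explain
# that fraction of the total variance (here, 95%).
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print("components kept:", pca.n_components_)
print("cumulative variance: %.3f" % pca.explained_variance_ratio_.sum())
```

The reduced matrix `X_reduced` is what the clustering step would then operate on.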

Methodology

Results


Step 3.4: Grouping Patients Based on Health Profiles (Clustering)

In this step, we applied K-Means Clustering on the PCA-reduced dataset to group patients based on their health profiles.
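A K-Means sketch on stand-in data. The choice of k = 3 and the three well-separated synthetic blobs are assumptions made for illustration; on the real PCA-reduced features, k would be chosen with the elbow method or silhouette score:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for PCA-reduced data: three separated blobs in 2D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [4, 0], [0, 4])])

# Fit K-Means with k=3 clusters; n_init restarts guard against bad seeds.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("cluster sizes:", np.bincount(labels))
```

Each cluster can then be profiled by averaging the original clinical features within it, which is how the patient "health profiles" are interpreted.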

Methodology

Insights