This project analyzes the UCI Heart Disease dataset (Cleveland subset) using Python and machine learning techniques. It includes steps like exploratory data analysis (EDA), preprocessing, classification, regression, PCA, and clustering.
The aim of this project is to explore, clean, and model the Cleveland Heart Disease dataset, a widely used real-world clinical dataset. The dataset is sourced from the UCI Machine Learning Repository and is used to predict the presence or absence of heart disease in patients based on 13 clinical features.
Target: `num` (0 = no disease, 1–4 = presence of disease)

We use the `processed.cleveland.data` file, which contains 303 instances and 14 columns (13 features + 1 target).

| Column | Description |
|---|---|
| age | Age in years |
| sex | Sex (1 = male; 0 = female) |
| cp | Chest pain type (1–4) |
| trestbps | Resting blood pressure (mm Hg) |
| chol | Serum cholesterol (mg/dl) |
| fbs | Fasting blood sugar > 120 mg/dl (1 = true; 0 = false) |
| restecg | Resting ECG results |
| thalach | Max heart rate achieved |
| exang | Exercise-induced angina (1 = yes) |
| oldpeak | ST depression induced by exercise |
| slope | Slope of peak ST segment |
| ca | Number of major vessels (0–3) |
| thal | Thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect) |
| num | Diagnosis (0 = no disease, 1–4 = disease) |
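
As a rough illustration of how the columns above map onto the file, the sketch below reads comma-separated, header-less rows into a DataFrame with these 14 names. The `load_cleveland` helper and the three sample rows are hypothetical stand-ins (the rows are made up in the file's layout, not copied from it); in practice the path to `processed.cleveland.data` would be passed instead.

```python
import io

import pandas as pd

# Column names in the order given in the table above.
COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num",
]

# Made-up rows in the file's layout (no header, "?" marks a missing value).
SAMPLE = """\
60.0,1.0,4.0,140.0,240.0,0.0,2.0,150.0,0.0,1.2,2.0,0.0,3.0,0
55.0,0.0,3.0,130.0,250.0,0.0,0.0,160.0,0.0,0.5,1.0,?,3.0,1
48.0,1.0,2.0,120.0,210.0,0.0,0.0,170.0,0.0,0.0,1.0,1.0,?,0
"""


def load_cleveland(source):
    """Read the header-less CSV and attach the column names (hypothetical helper)."""
    return pd.read_csv(source, header=None, names=COLUMNS)


df = load_cleveland(io.StringIO(SAMPLE))
print(df.shape)
print(df[["ca", "thal", "num"]])
```

Read this way, `ca` and `thal` come in as text because of the `?` placeholders, which is why they need separate handling later.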
This step focuses on understanding and preparing the dataset for machine learning. It involves exploring the structure of the data, identifying missing values, and applying necessary transformations to clean and standardize the features.
In the EDA phase, we:
- Identified missing values in the `ca` and `thal` columns.
- Examined the target column `num`.

The purpose of EDA was to gain insights into the data, spot any inconsistencies, and decide how to handle them during preprocessing.
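
A minimal sketch of two such checks follows, assuming the raw file marks missing values with `?`. The small DataFrame is a made-up stand-in with only four of the columns; the variable names are illustrative.

```python
import pandas as pd

# Toy stand-in for the raw data: "?" marks a missing value, as in the file.
df = pd.DataFrame(
    {
        "age": [60.0, 55.0, 48.0, 62.0, 51.0],
        "ca": ["0.0", "?", "1.0", "2.0", "0.0"],
        "thal": ["3.0", "3.0", "?", "7.0", "6.0"],
        "num": [0, 2, 1, 0, 3],
    }
)

# Count "?" placeholders per column; numeric columns simply yield 0.
missing_markers = (df == "?").sum()

# Class distribution of the raw (not yet binarized) target.
target_counts = df["num"].value_counts().sort_index()

print(missing_markers)
print(target_counts)
```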
Based on the findings from EDA, the following actions were taken:
- Replaced missing-value placeholders (`?`) with proper `NaN` entries.
- Converted `ca` and `thal` from string to numeric.
- Converted `num` to binary:
  - `0` remained as `0` (no heart disease)
  - `1`, `2`, `3`, and `4` were replaced with `1` (presence of heart disease)
- Applied `StandardScaler` to the features to scale them to a standard range (mean = 0, std = 1), ensuring models aren’t biased by differing feature scales.

In this step, we built two supervised machine learning models, Logistic Regression and Random Forest Classifier, to predict the presence of heart disease.
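
Before either model is trained, the preprocessing actions listed above could look roughly like the sketch below. It runs on a made-up five-column stand-in for the data. Two choices are assumptions rather than the project's documented method: rows that still contain `NaN` are dropped, and every feature column is scaled.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the raw data ("?" marks a missing value).
df = pd.DataFrame(
    {
        "age": [63.0, 57.0, 45.0, 61.0, 52.0, 48.0],
        "chol": [233.0, 250.0, 210.0, 280.0, 199.0, 240.0],
        "ca": ["0.0", "?", "1.0", "2.0", "0.0", "3.0"],
        "thal": ["6.0", "3.0", "?", "7.0", "3.0", "7.0"],
        "num": [0, 1, 0, 3, 0, 2],
    }
)

# 1. Replace "?" placeholders with NaN.
df = df.replace("?", np.nan)

# 2. Convert ca and thal from string to numeric.
for col in ("ca", "thal"):
    df[col] = pd.to_numeric(df[col])

# 3. Binarize the target: 0 stays 0, values 1-4 become 1.
df["num"] = (df["num"] > 0).astype(int)

# ASSUMPTION: drop rows that still contain NaN (the handling is not stated).
df = df.dropna().reset_index(drop=True)

# 4. Scale the features to mean 0 and standard deviation 1.
features = df.drop(columns="num")
scaled = pd.DataFrame(
    StandardScaler().fit_transform(features), columns=features.columns
)

print(df["num"].tolist())
print(scaled.round(2))
```

Note that `StandardScaler` uses the population standard deviation, so `scaled.std(ddof=0)` is the quantity that equals 1.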
After splitting the dataset into training and test sets, both models were trained and evaluated using standard classification metrics including Accuracy, Precision, Recall, F1-Score, and Confusion Matrix.
Logistic Regression achieved slightly better balanced performance across all metrics, while Random Forest showed strong precision.
This step demonstrated the practical use of ML models in clinical risk prediction.
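
The training and evaluation loop described in this step can be sketched as follows. Synthetic data from `make_classification` stands in for the preprocessed features and the binary target, so the printed scores say nothing about the real dataset. The 80/20 stratified split and the hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 303 samples, 13 features, binary target.
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=0
)

# ASSUMPTION: 80/20 stratified split; the actual ratio is not stated.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "confusion_matrix": confusion_matrix(y_test, pred),
    }

for name, metrics in results.items():
    print(name, {k: v for k, v in metrics.items() if k != "confusion_matrix"})
    print(metrics["confusion_matrix"])
```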
In this task, we built a Multiple Linear Regression model to predict the serum cholesterol level (`chol`) based on the remaining 12 clinical features (excluding the target label `num` and the `chol` column itself).
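
A minimal sketch of such a model is shown below, with scaling and regression chained in one pipeline. The data is synthetic (`make_regression`), so the feature names are attached only to show how coefficients could be ranked; the values carry no clinical meaning. The split ratio and the choice of R² and MSE as metrics are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The 12 predictors: every column except chol (the target) and num.
FEATURES = [
    "age", "sex", "cp", "trestbps", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal",
]

# Synthetic stand-in for the 12 predictors and the chol target.
X, y = make_regression(
    n_samples=303, n_features=12, n_informative=5, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale the features, then fit ordinary least squares.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

pred = model.predict(X_test)
mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)

# Rank features by absolute coefficient size (comparable because of scaling).
coefs = model.named_steps["linearregression"].coef_
ranking = sorted(zip(FEATURES, coefs), key=lambda pair: abs(pair[1]), reverse=True)

print(f"MSE={mse:.2f}  R2={r2:.3f}")
for feature, coef in ranking[:5]:
    print(f"{feature:>10}: {coef:+.2f}")
```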
- The `chol` column was used as the regression target.
- Features were scaled with `StandardScaler` prior to model training.
- The features most strongly associated with the target (`chol`) were identified and can be used to understand which health metrics most influence cholesterol levels.

The goal of this step was to reduce the dataset’s dimensionality while retaining as much variance as possible, in preparation for unsupervised learning tasks like clustering.
- The target column (`num`) was excluded from the analysis.
- PCA was performed using `scikit-learn`.

In this step, we applied K-Means Clustering on the PCA-reduced dataset to group patients based on their health profiles.
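
Since the clustering runs on PCA output, the reduction step can be sketched first. The example uses synthetic features in place of the scaled clinical data. The 90% explained-variance threshold is an assumption; the number of components the project actually kept is not stated.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 13 features (the target is discarded).
X, _ = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=0
)
X_scaled = StandardScaler().fit_transform(X)

# ASSUMPTION: keep the fewest components explaining at least 90% of variance.
pca = PCA(n_components=0.90, svd_solver="full")
X_pca = pca.fit_transform(X_scaled)

cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
print(f"components kept: {pca.n_components_}")
print(cumulative_variance.round(3))
```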
- Clustering was evaluated for `k = 2` to `10` to assess clustering quality.
- Using the selected number of clusters (`k = 3`), final clustering was performed.
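
One way to run such a sweep is sketched below. The silhouette score is used as the quality measure, which is an assumption, since the metric the project used is not named here. `make_blobs` data stands in for the PCA-reduced features, so the scores are illustrative only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the PCA-reduced data.
X_pca, _ = make_blobs(n_samples=303, centers=3, n_features=2, random_state=0)

# Score each candidate k from 2 to 10.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_pca)
    scores[k] = silhouette_score(X_pca, labels)

for k, score in scores.items():
    print(f"k={k:2d}  silhouette={score:.3f}")

# Final clustering with k = 3, the value reported in this README.
final_model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_pca)
cluster_labels = final_model.labels_
```

A higher silhouette score indicates tighter, better separated clusters, so the sweep is typically read by looking for the `k` with the highest value.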