---
datasets:
language:
- en
pipeline_tag: tabular-classification
---
# Stroke Prediction Model

This project implements a machine learning pipeline for predicting stroke risk from tabular patient data. Multiple models are trained and compared to select the best-performing one. Below is a detailed explanation of how each key consideration was implemented.

### Data Set

This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.

### Attribute Information

1. id: unique identifier
2. gender: "Male", "Female" or "Other"
3. age: age of the patient
4. hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5. heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6. ever_married: "No" or "Yes"
7. work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8. Residence_type: "Rural" or "Urban"
9. avg_glucose_level: average glucose level in blood
10. bmi: body mass index
11. smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"\*
12. stroke: 1 if the patient had a stroke or 0 if not

\*"Unknown" means the smoking information is unavailable for this patient.

## Key Considerations Implementation

### Data Cleaning

#### Drop id column

The id column is dropped: it serves as a unique identifier for each row but does not contribute to the predictive power of the model.

#### Remove missing values

Rows with missing 'bmi' values are removed; because they make up only a small fraction of the data, dropping them has negligible impact on model accuracy.
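The two cleaning steps above can be sketched with pandas. The toy DataFrame below stands in for the real CSV; column names follow the attribute list:

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the dataset schema; the real data is loaded from CSV.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [67, 54, 80],
    "bmi": [36.6, np.nan, 32.5],
    "stroke": [1, 0, 1],
})

df = df.drop(columns=["id"])    # id has no predictive power
df = df.dropna(subset=["bmi"])  # few rows are missing bmi, so drop them
```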

### Feature Engineering

#### Binary Encoding

Convert categorical features with only two unique values into binary numeric format for easier processing by machine learning models:

- ever_married: Encoded as 0 for “No” and 1 for “Yes”.
- Residence_type: Encoded as 0 for “Rural” and 1 for “Urban”.
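A minimal sketch of this mapping with pandas (the exact encoding call used in the project may differ):

```python
import pandas as pd

df = pd.DataFrame({
    "ever_married": ["Yes", "No", "Yes"],
    "Residence_type": ["Urban", "Rural", "Urban"],
})

# Map each two-valued category to 0/1 as described above.
df["ever_married"] = df["ever_married"].map({"No": 0, "Yes": 1})
df["Residence_type"] = df["Residence_type"].map({"Rural": 0, "Urban": 1})
```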

#### One-Hot Encoding for Multi-Class Categorical Features

- For features with more than two categories, such as gender, work_type, and smoking_status, apply one-hot encoding to create separate binary columns for each category.
- The onehot_encode function is assumed to handle the transformation, creating additional columns for each category while dropping the original column.
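A minimal version of the assumed onehot_encode helper could look like the following; the function name comes from the description above, and `pd.get_dummies` is one common way to do the expansion:

```python
import pandas as pd

def onehot_encode(df, column):
    """One-hot encode `column`, adding one 0/1 column per category
    and dropping the original column."""
    dummies = pd.get_dummies(df[column], prefix=column).astype(int)
    return pd.concat([df.drop(columns=[column]), dummies], axis=1)

df = pd.DataFrame({"work_type": ["Private", "Govt_job", "Private"]})
df = onehot_encode(df, "work_type")
# Columns are now work_type_Govt_job and work_type_Private.
```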

#### Split Dataset into Features and Target

- Separate the target variable (stroke) from the features:
  - X: Contains all feature columns used as input for the model.
  - y: Contains the target column, which indicates whether a stroke occurred.

#### Train-Test Split

- Split the dataset into training and testing sets to evaluate model performance effectively. This ensures the model is tested on unseen data and helps prevent overfitting.
- The specific split ratio (e.g., 70% train, 30% test) can be customized as needed.
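Both steps can be sketched with scikit-learn; the 70/30 ratio and the `stratify` option (which keeps the stroke ratio similar in both sets, useful on imbalanced data) are illustrative choices, not necessarily the project's exact settings:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy preprocessed frame standing in for the cleaned dataset.
df = pd.DataFrame({
    "age": [67, 54, 80, 49, 61, 45, 78, 32],
    "hypertension": [0, 0, 1, 0, 0, 1, 0, 0],
    "stroke": [1, 0, 1, 0, 1, 0, 1, 0],
})

X = df.drop(columns=["stroke"])  # feature columns
y = df["stroke"]                 # target column

# 70% train / 30% test, stratified on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```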

### Model Selection

The following models are evaluated:

- Logistic Regression
- K-Nearest Neighbors
- Support Vector Machine (Linear Kernel)
- Support Vector Machine (RBF Kernel)
- Neural Network
- Gradient Boosting

Each candidate is evaluated on the following criteria:

- Handles both numerical and categorical features
- Resistant to overfitting
- Provides feature importance
- Good performance on imbalanced data
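One way to run the comparison is a simple loop over candidate estimators. This is a sketch with scikit-learn defaults on synthetic data; the project's actual hyperparameters are not documented here, so these choices are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic imbalanced stand-in for the preprocessed stroke features.
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "SVM (Linear Kernel)": SVC(kernel="linear"),
    "SVM (RBF Kernel)": SVC(kernel="rbf"),
    "Neural Network": MLPClassifier(max_iter=1000, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each candidate and score it on the held-out test set.
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in models.items()}
best = max(scores, key=scores.get)
```

On imbalanced data like this, recall or F1 for the stroke class is usually more informative than plain accuracy, which `score` reports by default.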

### Software Engineering Best Practices

#### A. Logging

A comprehensive logging system is configured at startup:

```python
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
```

Logging features:

- Timestamp for each operation
- Different log levels (INFO, ERROR)
- Operation tracking
- Error capture and reporting

#### B. Documentation

- Docstrings for all classes and methods
- Clear code structure with comments
- This README file
- Logging outputs for tracking
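The docstring convention might look like this; the function below is illustrative, not the project's actual code:

```python
def clean_data(df):
    """Drop the id column and rows with missing bmi.

    Args:
        df: Raw patient DataFrame with the columns listed above.

    Returns:
        A cleaned copy of the DataFrame ready for feature engineering.
    """
    return df.drop(columns=["id"]).dropna(subset=["bmi"])
```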