srieas committed · Commit 89ac66e · verified · 1 Parent(s): 3f4ff30

Update README.md

Files changed (1):
  1. README.md +103 -1

README.md CHANGED
@@ -5,4 +5,106 @@ datasets:
  language:
  - en
  pipeline_tag: tabular-classification
- ---
+ ---
+
+ # Stroke Prediction Model
+
+ This project implements a machine learning pipeline for predicting stroke risk from tabular patient data. Multiple models are trained and the best-performing one is selected. Below is a detailed explanation of how each key consideration was implemented.
+
+ ### Dataset
+
+ The dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.
+
+ ### Attribute Information
+
+ 1. id: unique identifier
+ 2. gender: "Male", "Female" or "Other"
+ 3. age: age of the patient
+ 4. hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
+ 5. heart_disease: 0 if the patient doesn't have any heart disease, 1 if the patient has a heart disease
+ 6. ever_married: "No" or "Yes"
+ 7. work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
+ 8. Residence_type: "Rural" or "Urban"
+ 9. avg_glucose_level: average glucose level in blood
+ 10. bmi: body mass index
+ 11. smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"
+ 12. stroke: 1 if the patient had a stroke, 0 if not
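+
+ For orientation, the snippet below loads the data with pandas and inspects these attributes. The file name `healthcare-dataset-stroke-data.csv` is only an assumption about how the dataset is stored locally.
+
+ ```python
+ import pandas as pd
+
+ # Assumed local file name; adjust the path to wherever the dataset is stored.
+ df = pd.read_csv("healthcare-dataset-stroke-data.csv")
+
+ # Check the attributes listed above, missing values, and the target balance.
+ print(df.dtypes)
+ print(df.isna().sum())
+ print(df["stroke"].value_counts())
+ ```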
+
+ ## Key Considerations Implementation
+
+ ### Data Cleaning
+
+ #### Drop id column
+
+ The id column is dropped: it serves only as a unique identifier for each row and does not contribute to the predictive power of the model.
+
+ #### Remove missing values
+
+ Rows with a missing 'bmi' value are removed; because they are few in number, dropping them has negligible impact on model accuracy.
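+
+ A minimal sketch of these two cleaning steps, assuming the data is held in the pandas DataFrame `df` from the previous snippet:
+
+ ```python
+ # Drop the identifier column; it carries no predictive signal.
+ df = df.drop(columns=["id"])
+
+ # Drop the small number of rows where 'bmi' is missing.
+ df = df.dropna(subset=["bmi"])
+ ```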
+
+ ### Feature Engineering
+
+ #### Binary Encoding
+
+ Convert categorical features with only two unique values into binary numeric format for easier processing by machine learning models, as sketched below:
+
+ - ever_married: Encoded as 0 for “No” and 1 for “Yes”.
+ - Residence_type: Encoded as 0 for “Rural” and 1 for “Urban”.
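+
+ A minimal sketch of the binary encoding with pandas (an assumption about the implementation, not the project's exact code):
+
+ ```python
+ # Map the two-category columns to 0/1.
+ df["ever_married"] = df["ever_married"].map({"No": 0, "Yes": 1})
+ df["Residence_type"] = df["Residence_type"].map({"Rural": 0, "Urban": 1})
+ ```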
+
+ #### One-Hot Encoding for Multi-Class Categorical Features
+
+ - For features with more than two categories, such as gender, work_type, and smoking_status, apply one-hot encoding to create separate binary columns for each category.
+ - The onehot_encode function is assumed to handle the transformation, creating additional columns for each category while dropping the original column (one possible implementation is sketched below).
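+
+ One possible, pandas-based implementation of the onehot_encode helper described above; this is an assumption for illustration, not necessarily the project's actual code:
+
+ ```python
+ import pandas as pd
+
+ def onehot_encode(df: pd.DataFrame, column: str, prefix: str) -> pd.DataFrame:
+     """One-hot encode `column`, append the new binary columns, and drop the original."""
+     dummies = pd.get_dummies(df[column], prefix=prefix)
+     return pd.concat([df.drop(columns=[column]), dummies], axis=1)
+
+ # Apply to the multi-class categorical features.
+ for col, prefix in [("gender", "gender"), ("work_type", "work"), ("smoking_status", "smoking")]:
+     df = onehot_encode(df, column=col, prefix=prefix)
+ ```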
+
+ #### Split Dataset into Features and Target
+
+ - Separate the target variable (stroke) from the features:
+   - X: Contains all feature columns used as input for the model.
+   - y: Contains the target column, which indicates whether a stroke occurred.
+
+ #### Train-Test Split
+
+ - Split the dataset into training and testing sets to evaluate model performance effectively. This ensures the model is tested on unseen data and helps prevent overfitting.
+ - The specific split ratio (e.g., 70% train, 30% test) can be customized as needed; both steps are sketched below.
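+
+ A minimal sketch of both steps, assuming scikit-learn is available and `df` is the preprocessed DataFrame from the previous sketches:
+
+ ```python
+ from sklearn.model_selection import train_test_split
+
+ # Separate features (X) and target (y).
+ X = df.drop(columns=["stroke"])
+ y = df["stroke"]
+
+ # 70/30 split; stratifying on y preserves the class ratio of the imbalanced target.
+ X_train, X_test, y_train, y_test = train_test_split(
+     X, y, test_size=0.3, random_state=42, stratify=y
+ )
+ ```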
+
+ ### Model Selection
+
+ The following models are evaluated:
+
+ - Logistic Regression
+ - K-Nearest Neighbors
+ - Support Vector Machine (Linear Kernel)
+ - Support Vector Machine (RBF Kernel)
+ - Neural Network
+ - Gradient Boosting
+
+ Each model is evaluated against the following criteria (a training and comparison sketch follows the list):
+
+ - Handles both numerical and categorical features
+ - Resistant to overfitting
+ - Provides feature importance
+ - Good performance on imbalanced data
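+
+ A minimal sketch of how these candidates could be trained and compared with scikit-learn, reusing the split from the previous sketch; the specific estimators and hyperparameters here are assumptions, not the project's verified configuration:
+
+ ```python
+ from sklearn.linear_model import LogisticRegression
+ from sklearn.neighbors import KNeighborsClassifier
+ from sklearn.svm import SVC
+ from sklearn.neural_network import MLPClassifier
+ from sklearn.ensemble import GradientBoostingClassifier
+ from sklearn.metrics import classification_report
+
+ models = {
+     "Logistic Regression": LogisticRegression(max_iter=1000),
+     "K-Nearest Neighbors": KNeighborsClassifier(),
+     "SVM (Linear Kernel)": SVC(kernel="linear"),
+     "SVM (RBF Kernel)": SVC(kernel="rbf"),
+     "Neural Network": MLPClassifier(max_iter=1000),
+     "Gradient Boosting": GradientBoostingClassifier(),
+ }
+
+ # Train each candidate and report precision/recall/F1, which matter on imbalanced data.
+ for name, model in models.items():
+     model.fit(X_train, y_train)
+     y_pred = model.predict(X_test)
+     print(f"=== {name} ===")
+     print(classification_report(y_test, y_pred, zero_division=0))
+ ```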
+
+ ### Software Engineering Best Practices
+
+ #### A. Logging
+
+ A comprehensive logging system is configured via:
+
+ ```python
+ import logging
+
+ logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
+ ```
+
+ Logging features (see the usage sketch after this list):
+
+ - Timestamp for each operation
+ - Different log levels (INFO, ERROR)
+ - Operation tracking
+ - Error capture and reporting
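+
+ A brief, illustrative sketch of how these features are typically used with the configuration above; the messages and the failing step are placeholders, not actual project output:
+
+ ```python
+ import logging
+
+ logger = logging.getLogger(__name__)
+
+ logger.info("Starting model training")  # operation tracking with timestamps
+
+ try:
+     raise ValueError("example failure")  # placeholder for a step that might fail
+ except ValueError as exc:
+     logger.error("Training failed: %s", exc)  # error capture and reporting
+ ```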
+
+ #### B. Documentation
+
+ - Docstrings for all classes and methods (a short example follows below)
+ - Clear code structure with comments
+ - This README file
+ - Logging outputs for tracking
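+
+ A small, hypothetical illustration of the docstring convention (not actual project code):
+
+ ```python
+ def remove_missing_bmi(df):
+     """Drop rows where 'bmi' is missing and return the cleaned DataFrame."""
+     return df.dropna(subset=["bmi"])
+ ```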