OHCA Classifier v3.0 - Improved Methodology
A BERT-based classifier for detecting Out-of-Hospital Cardiac Arrest (OHCA) cases in medical discharge notes using improved natural language processing methodology that addresses key methodological concerns in medical AI.
Key Improvements in v3.0
This version implements significant methodological improvements based on data science best practices:
Patient-Level Data Splits - Prevents data leakage by ensuring all notes from the same patient stay in one split
Proper Train/Validation/Test - Uses independent test set for unbiased evaluation
Optimal Threshold Finding - Finds and saves optimal decision threshold during training
Larger Training Samples - 800+ training samples instead of 264
Enhanced Clinical Decision Support - Improved confidence categories and workflow integration
Unbiased Evaluation - Eliminates threshold tuning on test data
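The patient-level split idea can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the package's `create_patient_level_splits()` implementation: the function name `patient_level_split`, the 70/15/15 fractions, and the toy `notes` list are assumptions made for the example.

```python
import random
from collections import defaultdict

def patient_level_split(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Split note rows into train/val/test so that no patient spans two splits.

    `rows` is a list of dicts with at least a 'subject_id' key.
    The fractions apply to patients, not notes (an illustrative choice).
    """
    by_patient = defaultdict(list)
    for row in rows:
        by_patient[row["subject_id"]].append(row)

    patients = sorted(by_patient)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(patients)

    n_train = int(len(patients) * train_frac)
    n_val = int(len(patients) * val_frac)
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    # Every note follows its patient into exactly one split.
    return {name: [r for p in ids for r in by_patient[p]] for name, ids in groups.items()}

# Toy data: 60 notes from 20 patients (3 notes each).
notes = [{"hadm_id": 1000 + i, "subject_id": i % 20, "clean_text": "..."} for i in range(60)]
splits = patient_level_split(notes)
```

Because the shuffle operates on patient IDs rather than on notes, a patient with several admissions can never contribute notes to both training and test data.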
Overview
This package provides two main modules with v3.0 enhancements:
- Training Pipeline (ohca_training_pipeline.py) - Complete workflow with improved methodology
- Inference Module (ohca_inference.py) - Apply models with optimal threshold support
Features
Training Pipeline (Enhanced v3.0)
- Patient-Level Splits: Prevents data leakage between training and test sets
- Dual Annotation Strategy: Separate training and validation annotation files
- Intelligent Sampling: Two-stage sampling strategy (keyword-enriched + random)
- Larger Sample Sizes: 800 training + 200 validation samples
- BERT-based Training: Uses PubMedBERT optimized for medical text
- Optimal Threshold Finding: Automatically finds best decision threshold
- Unbiased Evaluation: Independent test set for reliable performance estimates
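One way to picture the two-stage sampling strategy is the sketch below. It is an assumption-laden illustration, not the pipeline's code: the `KEYWORDS` tuple, the function name `two_stage_sample`, and the stage sizes are hypothetical, since this README does not list the actual keywords or proportions.

```python
import random

# Hypothetical keyword list; the package's real keywords are not given in this README.
KEYWORDS = ("cardiac arrest", "cpr", "rosc", "found down")

def two_stage_sample(rows, n_keyword, n_random, seed=42):
    """Stage 1: sample notes that mention a keyword.
    Stage 2: sample at random from the notes not picked in stage 1."""
    rng = random.Random(seed)
    hits = [r for r in rows if any(k in r["clean_text"].lower() for k in KEYWORDS)]
    stage1 = rng.sample(hits, min(n_keyword, len(hits)))
    chosen = {r["hadm_id"] for r in stage1}
    rest = [r for r in rows if r["hadm_id"] not in chosen]
    stage2 = rng.sample(rest, min(n_random, len(rest)))
    return stage1 + stage2

# Toy data: every tenth note mentions an arrest.
notes = [
    {"hadm_id": i, "clean_text": "Cardiac arrest at home" if i % 10 == 0 else "Chest pain"}
    for i in range(100)
]
sample = two_stage_sample(notes, n_keyword=5, n_random=15)
```

The keyword stage raises the share of likely positives so annotators see enough OHCA cases, while the random stage keeps the sample representative of ordinary notes.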
Inference Module (Enhanced v3.0)
- Optimal Threshold Usage: Automatically uses threshold found during training
- Enhanced Clinical Priorities: Improved confidence categories for clinical workflow
- Batch Processing: Efficient inference on large datasets
- Clinical Decision Support: Evidence-based probability thresholds
- Backward Compatibility: Works with both v3.0 and legacy models
Installation
Prerequisites
- Python 3.8+
- PyTorch
- CUDA (optional, for GPU acceleration)
Install from source
- Clone the repository:
git clone https://github.com/monajm36/ohca-classifier-3.0.git
cd ohca-classifier-3.0
- Set up virtual environment:
python3 -m venv .venv/
source .venv/bin/activate
- Install dependencies:
pip install -r requirements.txt
pip install -e .
Note for Windows users: Replace source .venv/bin/activate with .venv\Scripts\activate
Quick Start
Training a New Model (v3.0 Methodology - RECOMMENDED)
from src.ohca_training_pipeline import complete_improved_training_pipeline
import pandas as pd
# Step 1: Create patient-level splits and annotation samples
results = complete_improved_training_pipeline(
    data_path="your_discharge_notes.csv",  # Must have: hadm_id, subject_id, clean_text
    annotation_dir="./annotation_v3",
    train_sample_size=800,  # Much larger than legacy
    val_sample_size=200  # Separate validation sample
)
# Step 2: Manually annotate BOTH Excel files:
# - annotation_v3/train_annotation.xlsx (800 cases)
# - annotation_v3/validation_annotation.xlsx (200 cases)
# Label each case: 1=OHCA, 0=Non-OHCA
# Step 3: Complete training (after annotation)
from src.ohca_training_pipeline import complete_annotation_and_train_v3
model_results = complete_annotation_and_train_v3(
    train_annotation_file="./annotation_v3/train_annotation.xlsx",
    val_annotation_file="./annotation_v3/validation_annotation.xlsx",
    test_file="./annotation_v3/test_set_DO_NOT_ANNOTATE.csv",
    model_save_path="./my_ohca_model_v3",
    num_epochs=3
)
print(f"Optimal threshold: {model_results['optimal_threshold']:.3f}")
print("The model uses this threshold automatically during inference")
Using a Pre-trained v3.0 Model
from src.ohca_inference import quick_inference_with_optimal_threshold
import pandas as pd
# Apply v3.0 model to new data (uses optimal threshold automatically)
new_data = pd.read_csv("new_discharge_notes.csv") # Must have: hadm_id, clean_text
results = quick_inference_with_optimal_threshold(
    model_path="./my_ohca_model_v3",  # v3.0 model with metadata
    data_path=new_data,
    output_path="ohca_predictions.csv"
)
# Enhanced v3.0 results with clinical priorities
immediate_review = results[results['clinical_priority'] == 'Immediate Review']
priority_review = results[results['clinical_priority'] == 'Priority Review']
print(f"Immediate review needed: {len(immediate_review)} cases")
print(f"Priority review needed: {len(priority_review)} cases")
print(f"Optimal threshold used: {results['optimal_threshold_used'].iloc[0]:.3f}")
Backward Compatibility (Legacy Models)
from src.ohca_inference import quick_inference
# Works with both v3.0 and legacy models
results = quick_inference(
    model_path="./any_model",  # Auto-detects model version
    data_path="new_data.csv"
)
Data Format
Input Requirements (Enhanced for v3.0)
Your CSV file must contain:
- hadm_id: Unique identifier for each hospital admission
- subject_id: Patient identifier (for patient-level splits to prevent data leakage)
- clean_text: Preprocessed discharge note text
Example:
hadm_id,subject_id,clean_text
12345,101,"Chief complaint: Cardiac arrest at home. Patient found down by family..."
12346,102,"Chief complaint: Chest pain. Patient presents with acute onset chest pain..."
12347,101,"Follow-up visit. Patient doing well after recent arrest..."
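If you don't have patient IDs, add this line to your preprocessing. Each admission is then treated as its own patient, so the split cannot guard against leakage across repeat admissions of the same patient:
df['subject_id'] = df['hadm_id'] # Use admission ID as patient ID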
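Before running the pipeline it can help to confirm that the required columns are present. The check below is a small standard-library sketch; `REQUIRED_COLUMNS` and `check_columns` are names invented for this example and are not part of the package.

```python
import csv
import io

REQUIRED_COLUMNS = {"hadm_id", "subject_id", "clean_text"}

def check_columns(csv_file):
    """Return the set of required columns missing from the CSV header."""
    reader = csv.DictReader(csv_file)
    return REQUIRED_COLUMNS - set(reader.fieldnames or [])

# In-memory stand-in for a CSV file that lacks the patient identifier.
sample = io.StringIO('hadm_id,clean_text\n12345,"Chief complaint: Cardiac arrest at home."\n')
missing = check_columns(sample)
print(missing)  # {'subject_id'}
```

With a real file, the same function can be called on `open("your_discharge_notes.csv", newline="")`.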
Annotation Labels
- 1: OHCA case (cardiac arrest outside hospital, primary reason for admission)
- 0: Non-OHCA case (everything else, including transfers and historical arrests)
Module Documentation
Training Pipeline (Enhanced v3.0)
Main v3.0 Functions (RECOMMENDED):
- complete_improved_training_pipeline() - Create patient-level splits and annotation samples
- complete_annotation_and_train_v3() - Train with optimal threshold finding
- create_patient_level_splits() - Create proper data splits
- find_optimal_threshold() - Find optimal decision threshold
- evaluate_on_test_set() - Unbiased final evaluation
Legacy Functions (Backward Compatible):
- create_training_sample() - Legacy single-file annotation
- complete_annotation_and_train() - Legacy training workflow
Example Usage (v3.0):
from src.ohca_training_pipeline import complete_improved_training_pipeline
# Enhanced training with proper methodology
result = complete_improved_training_pipeline(
    data_path="discharge_notes.csv",
    annotation_dir="./annotation_v3",
    train_sample_size=800,
    val_sample_size=200
)
Inference Module (Enhanced v3.0)
Main v3.0 Functions (RECOMMENDED):
- quick_inference_with_optimal_threshold() - Uses optimal threshold automatically
- load_ohca_model_with_metadata() - Load model with optimal threshold
- run_inference_with_optimal_threshold() - Enhanced inference
- analyze_predictions_enhanced() - Improved prediction analysis
Legacy Functions (Backward Compatible):
- quick_inference() - Auto-detects model version
- load_ohca_model() - Basic model loading
- run_inference() - Basic inference
Example Usage (v3.0):
from src.ohca_inference import load_ohca_model_with_metadata, run_inference_with_optimal_threshold
# Load v3.0 model with optimal threshold
model, tokenizer, optimal_threshold, metadata = load_ohca_model_with_metadata("./trained_model")
# Run inference with optimal threshold
results = run_inference_with_optimal_threshold(model, tokenizer, new_data_df, optimal_threshold)
Model Architecture
- Base Model: PubMedBERT (microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract)
- Task: Binary classification (OHCA vs Non-OHCA)
- Max Sequence Length: 512 tokens
- Optimization: AdamW with linear learning rate scheduling
- Class Balancing: Weighted loss + minority class oversampling
- Threshold Selection: Optimal threshold found via validation set (v3.0)
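The README does not say how the loss weights for class balancing are derived. A common heuristic, shown here purely as an assumption, is inverse class frequency: each class gets weight n_samples / (n_classes * class_count). The function name `balanced_class_weights` and the 90/10 label mix are illustrative.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy annotation set: 90 non-OHCA notes and 10 OHCA notes.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
# The minority class (1) receives nine times the weight of the majority class (0).
```

Weights of this kind would typically be passed to a weighted cross-entropy loss so that errors on the rare OHCA class cost more during training.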
Performance Metrics
v3.0 Enhanced Evaluation
The model provides unbiased performance estimates using:
- Independent test set for final evaluation
- Optimal threshold found on validation set only
- Patient-level splits preventing data leakage
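Threshold selection on the validation set can be sketched as a simple scan. The criterion used by `find_optimal_threshold()` is not stated here, so this example assumes F1 maximization; `f1_at`, `best_threshold`, and the toy probabilities are hypothetical.

```python
def f1_at(threshold, probs, labels):
    """F1 score when every probability >= threshold is predicted as OHCA."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(probs, labels):
    """Try each observed probability as a cutoff and keep the one with the best F1."""
    candidates = sorted(set(probs))
    return max(candidates, key=lambda t: f1_at(t, probs, labels))

# Validation-set predictions only; the test set is never consulted here.
val_probs = [0.05, 0.10, 0.30, 0.35, 0.60, 0.80, 0.90]
val_labels = [0, 0, 0, 1, 0, 1, 1]
threshold = best_threshold(val_probs, val_labels)
print(threshold)  # 0.35
```

The chosen value is then frozen and applied unchanged to the test set, which is what keeps the final evaluation free of threshold tuning.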
Clinical Metrics:
- Sensitivity (Recall): Percentage of OHCA cases correctly identified
- Specificity: Percentage of non-OHCA cases correctly identified
- Precision (PPV): When model predicts OHCA, percentage that are correct
- NPV: When model predicts non-OHCA, percentage that are correct
- F1-Score: Harmonic mean of precision and recall
- AUC-ROC: Area under the receiver operating characteristic curve
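All of the metrics above except AUC-ROC follow directly from the four confusion-matrix counts. The sketch below shows the arithmetic; the function name `clinical_metrics` and the counts are made up for illustration and are not results from this model.

```python
def clinical_metrics(tp, fp, tn, fn):
    """Compute threshold-dependent metrics from confusion-matrix counts."""
    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),   # recall on true OHCA cases
        "specificity": ratio(tn, tn + fp),   # recall on non-OHCA cases
        "precision": ratio(tp, tp + fp),     # PPV
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }

# Illustrative counts only.
m = clinical_metrics(tp=18, fp=6, tn=170, fn=2)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

AUC-ROC is left out because it needs the full set of predicted probabilities rather than counts at a single threshold.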
Clinical Usage
Enhanced v3.0 Clinical Decision Support
Clinical Priorities (v3.0):
- Immediate Review: Very high probability cases requiring urgent attention
- Priority Review: High probability cases for clinical team review
- Clinical Review: Medium-high probability cases above optimal threshold
- Consider Review: Medium probability cases for potential review
- Routine Processing: Low probability cases
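The mapping from probability to priority category might look like the sketch below. The numeric cutoffs (0.9, 0.7, 0.3) are assumptions chosen for illustration; this README does not state the values the package uses, and the sketch assumes the optimal threshold lies between the medium and high cutoffs.

```python
def clinical_priority(prob, optimal_threshold, very_high=0.9, high=0.7, medium=0.3):
    """Map a predicted OHCA probability to a review category.

    Cutoffs are hypothetical; only the category names come from the README.
    """
    if prob >= very_high:
        return "Immediate Review"
    if prob >= high:
        return "Priority Review"
    if prob >= optimal_threshold:
        return "Clinical Review"
    if prob >= medium:
        return "Consider Review"
    return "Routine Processing"

for p in (0.95, 0.75, 0.55, 0.35, 0.10):
    print(p, clinical_priority(p, optimal_threshold=0.5))
```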
Optimal Threshold Usage:
- Model automatically uses threshold found during validation
- Consistent decision-making across all datasets
- Typically a better sensitivity/specificity balance than a fixed default threshold
Workflow Integration:
- Run inference on new discharge notes (uses optimal threshold)
- Prioritize "Immediate Review" cases for urgent manual review
- Schedule "Priority Review" cases for clinical team evaluation
- Use "Clinical Review" cases for quality improvement
- Monitor routine cases for false negatives
Repository Structure
ohca-classifier-3.0/
├── src/
│ ├── __init__.py
│ ├── ohca_training_pipeline.py # Enhanced v3.0 training workflow
│ └── ohca_inference.py # Enhanced v3.0 inference
├── examples/
│ ├── training_example.py # v3.0 training examples
│ ├── inference_example.py # v3.0 inference examples
│ └── clif_dataset_example.py # Cross-institutional deployment
├── docs/
│ └── annotation_guidelines.md # Enhanced annotation guidelines
├── requirements.txt
├── setup.py
├── README.md
└── LICENSE
Examples
Complete v3.0 Training Example
cd examples
python training_example.py
# Choose option 1: v3.0 Training with Improved Methodology
Enhanced v3.0 Inference Examples
cd examples
python inference_example.py
# Choose option 1: v3.0 Inference with Optimal Threshold
Cross-Institutional Deployment
cd examples
python clif_dataset_example.py
# Apply v3.0 model to external datasets
Advanced Usage
Large Dataset Processing (v3.0)
from src.ohca_inference import process_large_dataset_with_optimal_threshold
# Process with optimal threshold automatically
process_large_dataset_with_optimal_threshold(
model_path="./trained_model_v3",
data_path="large_dataset.csv",
output_path="results.csv",
chunk_size=5000
)
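The idea behind `chunk_size` is to hold only a slice of the dataset in memory at a time. The sketch below illustrates that pattern with the standard library; it is not the package's implementation, and `iter_chunks`, the placeholder `score` function, and the `ohca_probability` column name are assumptions.

```python
import csv
import io
from itertools import islice

def iter_chunks(reader, chunk_size):
    """Yield lists of at most chunk_size rows from a csv reader."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk

def score(text):
    """Placeholder scorer standing in for the model; not the real classifier."""
    return 1.0 if "arrest" in text.lower() else 0.0

# In-memory stand-ins for the input and output files (12 toy notes).
source = io.StringIO("hadm_id,clean_text\n" + "".join(f"{i},note {i}\n" for i in range(12)))
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["hadm_id", "ohca_probability"])

chunk_sizes = []
for chunk in iter_chunks(csv.DictReader(source), chunk_size=5):
    chunk_sizes.append(len(chunk))
    # Results are written as soon as each chunk is scored, then the chunk is discarded.
    writer.writerows((row["hadm_id"], score(row["clean_text"])) for row in chunk)
```

Writing each chunk's results immediately keeps memory use bounded by the chunk size rather than by the size of the whole dataset.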
Model Testing with v3.0 Features
from src.ohca_inference import test_model_on_sample
# Test with optimal threshold support
test_cases = {
    'case1': "Chief complaint: Cardiac arrest at home...",
    'case2': "Chief complaint: Chest pain, no arrest..."
}
results = test_model_on_sample("./trained_model_v3", test_cases)
# Results include optimal threshold predictions and clinical priorities
Performance Benchmarks
v3.0 Methodology Performance
Typical performance with improved methodology:
- AUC-ROC: 0.85-0.95 (unbiased estimates)
- Sensitivity: 85-95% (at optimal threshold)
- Specificity: 85-95% (at optimal threshold)
- F1-Score: 0.7-0.9 (optimized via validation)
Key Improvements over Legacy:
- Unbiased evaluation using independent test set
- Optimal threshold provides better sensitivity/specificity balance
- Larger training sets (800 vs 264) improve generalization
- Patient-level splits prevent overoptimistic performance estimates
Performance varies with data quality and annotation consistency.
Migration from Legacy Versions
Upgrading from Legacy to v3.0
Benefits of Upgrading:
- More reliable performance estimates
- Better clinical decision support
- Optimal threshold usage
- Enhanced workflow integration
Migration Steps:
- Retrain with v3.0 methodology using complete_improved_training_pipeline()
- Add patient IDs to your data (subject_id column)
- Use v3.0 inference functions for new predictions
- Update workflows to use clinical priorities
Backward Compatibility:
- Legacy models continue to work
- Legacy functions automatically detect model version
- Gradual migration supported
Citation
If you use this code in your research, please cite:
@software{nlp_ohca_classifier_v3,
title={NLP OHCA Classifier v3.0: BERT-based Detection of Out-of-Hospital Cardiac Arrest with Enhanced Methodology},
author={Mona Moukaddem},
year={2025},
url={https://github.com/monajm36/ohca-classifier-3.0},
note={Enhanced methodology addressing data leakage, threshold optimization, and evaluation bias}
}
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
- Fork the repository
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Support
For questions or issues:
- Check the Issues page
- Create a new issue if needed
- Review examples in the examples/ folder
Methodology References
The v3.0 improvements are based on established machine learning best practices:
- Patient-level data splits prevent data leakage in healthcare AI
- Proper train/validation/test methodology ensures unbiased evaluation
- Optimal threshold finding improves clinical performance
- Larger sample sizes enhance model generalization
Acknowledgments
- PubMedBERT model from Microsoft Research
- MIMIC-III dataset for model development
- Transformers library by Hugging Face
- PyTorch for deep learning framework
- Data science community for methodological guidance