OHCA Classifier v3.0 - Improved Methodology

NLP OHCA Classifier v3.0

A BERT-based classifier for detecting Out-of-Hospital Cardiac Arrest (OHCA) cases in medical discharge notes. Version 3.0 revises the training and evaluation methodology to address data leakage, threshold selection, and evaluation bias.

Key Improvements in v3.0

This version implements significant methodological improvements based on data science best practices:

Patient-Level Data Splits - Prevents data leakage by ensuring all notes from the same patient stay in one split
Proper Train/Validation/Test - Uses independent test set for unbiased evaluation
Optimal Threshold Finding - Finds and saves optimal decision threshold during training
Larger Training Samples - 800+ training samples instead of 264
Enhanced Clinical Decision Support - Improved confidence categories and workflow integration
Unbiased Evaluation - Eliminates threshold tuning on test data
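
To illustrate the patient-level split idea, here is a minimal standard-library sketch. It is not the package's create_patient_level_splits() implementation; the function name patient_level_split, the split fractions, and the seed are assumptions chosen for the example. The point is that patient IDs are shuffled and assigned, not individual notes.

```python
import random
from collections import defaultdict


def patient_level_split(rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Split note rows into train/val/test so no patient spans two splits.

    Hypothetical helper for illustration; fractions and seed are assumptions.
    """
    # Group notes by patient so each patient is assigned as a unit.
    by_patient = defaultdict(list)
    for row in rows:
        by_patient[row["subject_id"]].append(row)

    # Shuffle patient IDs (not notes) with a fixed seed for reproducibility.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    n_test = int(len(patients) * test_frac)
    n_val = int(len(patients) * val_frac)
    groups = {
        "test": patients[:n_test],
        "val": patients[n_test:n_test + n_val],
        "train": patients[n_test + n_val:],
    }
    # Expand each group of patient IDs back into that group's notes.
    return {
        name: [row for pid in ids for row in by_patient[pid]]
        for name, ids in groups.items()
    }


rows = [
    {"hadm_id": 12345, "subject_id": 101},
    {"hadm_id": 12346, "subject_id": 102},
    {"hadm_id": 12347, "subject_id": 101},
    {"hadm_id": 12348, "subject_id": 103},
    {"hadm_id": 12349, "subject_id": 104},
]
splits = patient_level_split(rows, val_frac=0.25, test_frac=0.25)
```

Both admissions of patient 101 end up in the same split, whichever one that is.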

Overview

This package provides two main modules with v3.0 enhancements:

Training Pipeline (ohca_training_pipeline.py) - Complete workflow with improved methodology
Inference Module (ohca_inference.py) - Apply models with optimal threshold support

Features

Training Pipeline (Enhanced v3.0)

Patient-Level Splits: Prevents data leakage between training and test sets
Dual Annotation Strategy: Separate training and validation annotation files
Intelligent Sampling: Two-stage sampling strategy (keyword-enriched + random)
Larger Sample Sizes: 800 training + 200 validation samples
BERT-based Training: Uses PubMedBERT optimized for medical text
Optimal Threshold Finding: Automatically finds best decision threshold
Unbiased Evaluation: Independent test set for reliable performance estimates
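
The two-stage sampling strategy can be sketched roughly as follows. This is an assumption about how keyword-enriched plus random sampling might look, not the pipeline's actual code: the keyword list, the function name two_stage_sample, and its parameters are all hypothetical.

```python
import random

# Hypothetical keyword list; the pipeline's real keywords are not stated here.
KEYWORDS = ("cardiac arrest", "cpr", "rosc", "found down")


def two_stage_sample(notes, n_keyword, n_random, seed=0):
    """Draw keyword-matching notes first, then fill with random other notes."""
    rng = random.Random(seed)

    # Stage 1: notes whose text contains any keyword (case-insensitive).
    hits = [n for n in notes
            if any(k in n["clean_text"].lower() for k in KEYWORDS)]
    stage1 = rng.sample(hits, min(n_keyword, len(hits)))

    # Stage 2: random draw from everything not already chosen (no duplicates).
    chosen = {n["hadm_id"] for n in stage1}
    rest = [n for n in notes if n["hadm_id"] not in chosen]
    stage2 = rng.sample(rest, min(n_random, len(rest)))
    return stage1 + stage2


notes = [
    {"hadm_id": 1, "clean_text": "Cardiac arrest at home, CPR by family."},
    {"hadm_id": 2, "clean_text": "Acute onset chest pain."},
    {"hadm_id": 3, "clean_text": "Found down, ROSC after 10 minutes."},
    {"hadm_id": 4, "clean_text": "Elective knee replacement."},
    {"hadm_id": 5, "clean_text": "Pneumonia, treated with antibiotics."},
]
sample = two_stage_sample(notes, n_keyword=2, n_random=2)
```

Enriching with keyword hits raises the share of likely OHCA cases in the annotation sample, while the random stage keeps ordinary notes represented.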

Inference Module (Enhanced v3.0)

Optimal Threshold Usage: Automatically uses threshold found during training
Enhanced Clinical Priorities: Improved confidence categories for clinical workflow
Batch Processing: Efficient inference on large datasets
Clinical Decision Support: Evidence-based probability thresholds
Backward Compatibility: Works with both v3.0 and legacy models

Installation

Prerequisites

Python 3.8+
PyTorch
CUDA (optional, for GPU acceleration)

Install from source

Clone the repository:

git clone https://github.com/monajm36/ohca-classifier-3.0.git
cd ohca-classifier-3.0

Set up virtual environment:

python3 -m venv .venv/
source .venv/bin/activate

Install dependencies:

pip install -r requirements.txt
pip install -e .

Note for Windows users: Replace source .venv/bin/activate with .venv\Scripts\activate

Quick Start

Training a New Model (v3.0 Methodology - RECOMMENDED)

from src.ohca_training_pipeline import complete_improved_training_pipeline
import pandas as pd

# Step 1: Create patient-level splits and annotation samples
results = complete_improved_training_pipeline(
    data_path="your_discharge_notes.csv",  # Must have: hadm_id, subject_id, clean_text
    annotation_dir="./annotation_v3",
    train_sample_size=800,    # Much larger than legacy
    val_sample_size=200       # Separate validation sample
)

# Step 2: Manually annotate BOTH Excel files:
# - annotation_v3/train_annotation.xlsx (800 cases)
# - annotation_v3/validation_annotation.xlsx (200 cases)
# Label each case: 1=OHCA, 0=Non-OHCA

# Step 3: Complete training (after annotation)
from src.ohca_training_pipeline import complete_annotation_and_train_v3

model_results = complete_annotation_and_train_v3(
    train_annotation_file="./annotation_v3/train_annotation.xlsx",
    val_annotation_file="./annotation_v3/validation_annotation.xlsx",
    test_file="./annotation_v3/test_set_DO_NOT_ANNOTATE.csv",
    model_save_path="./my_ohca_model_v3",
    num_epochs=3
)

print(f"Optimal threshold: {model_results['optimal_threshold']:.3f}")
print("The model uses this threshold automatically during inference")

Using a Pre-trained v3.0 Model

from src.ohca_inference import quick_inference_with_optimal_threshold
import pandas as pd

# Apply v3.0 model to new data (uses optimal threshold automatically)
new_data = pd.read_csv("new_discharge_notes.csv")  # Must have: hadm_id, clean_text
results = quick_inference_with_optimal_threshold(
    model_path="./my_ohca_model_v3",  # v3.0 model with metadata
    data_path=new_data,
    output_path="ohca_predictions.csv"
)

# Enhanced v3.0 results with clinical priorities
immediate_review = results[results['clinical_priority'] == 'Immediate Review']
priority_review = results[results['clinical_priority'] == 'Priority Review']

print(f"Immediate review needed: {len(immediate_review)} cases")
print(f"Priority review needed: {len(priority_review)} cases")
print(f"Optimal threshold used: {results['optimal_threshold_used'].iloc[0]:.3f}")

Backward Compatibility (Legacy Models)

from src.ohca_inference import quick_inference

# Works with both v3.0 and legacy models
results = quick_inference(
    model_path="./any_model",  # Auto-detects model version
    data_path="new_data.csv"
)

Data Format

Input Requirements (Enhanced for v3.0)

Your CSV file must contain:

hadm_id: Unique identifier for each hospital admission
subject_id: Patient identifier (for patient-level splits to prevent data leakage)
clean_text: Preprocessed discharge note text

Example:

hadm_id,subject_id,clean_text
12345,101,"Chief complaint: Cardiac arrest at home. Patient found down by family..."
12346,102,"Chief complaint: Chest pain. Patient presents with acute onset chest pain..."
12347,101,"Follow-up visit. Patient doing well after recent arrest..."
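
Before running the pipeline, it may help to confirm that a file has the three required columns. The following is a small sketch using only the standard library; the helper name missing_columns is hypothetical and not part of the package.

```python
import csv
import io

REQUIRED_COLUMNS = {"hadm_id", "subject_id", "clean_text"}


def missing_columns(csv_file):
    """Return the required columns that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    return REQUIRED_COLUMNS - set(reader.fieldnames or [])


# In practice csv_file would come from open("your_discharge_notes.csv", newline="").
sample = io.StringIO(
    "hadm_id,subject_id,clean_text\n"
    '12345,101,"Chief complaint: Cardiac arrest at home."\n'
)
print(missing_columns(sample))  # set() when all columns are present
```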

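If you don't have patient IDs, add the line below to your preprocessing. Note that it treats every admission as a separate patient, so the splits can no longer keep multiple admissions of the same patient together: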

df['subject_id'] = df['hadm_id']  # Use admission ID as patient ID

Annotation Labels

1: OHCA case (cardiac arrest outside hospital, primary reason for admission)
0: Non-OHCA case (everything else, including transfers and historical arrests)

Module Documentation

Training Pipeline (Enhanced v3.0)

Main v3.0 Functions (RECOMMENDED):

complete_improved_training_pipeline() - Create patient-level splits and annotation samples
complete_annotation_and_train_v3() - Train with optimal threshold finding
create_patient_level_splits() - Create proper data splits
find_optimal_threshold() - Find optimal decision threshold
evaluate_on_test_set() - Unbiased final evaluation

Legacy Functions (Backward Compatible):

create_training_sample() - Legacy single-file annotation
complete_annotation_and_train() - Legacy training workflow

Example Usage (v3.0):

from src.ohca_training_pipeline import complete_improved_training_pipeline

# Enhanced training with proper methodology
result = complete_improved_training_pipeline(
    data_path="discharge_notes.csv",
    annotation_dir="./annotation_v3",
    train_sample_size=800,
    val_sample_size=200
)

Inference Module (Enhanced v3.0)

Main v3.0 Functions (RECOMMENDED):

quick_inference_with_optimal_threshold() - Uses optimal threshold automatically
load_ohca_model_with_metadata() - Load model with optimal threshold
run_inference_with_optimal_threshold() - Enhanced inference
analyze_predictions_enhanced() - Improved prediction analysis

Legacy Functions (Backward Compatible):

quick_inference() - Auto-detects model version
load_ohca_model() - Basic model loading
run_inference() - Basic inference

Example Usage (v3.0):

from src.ohca_inference import load_ohca_model_with_metadata, run_inference_with_optimal_threshold

# Load v3.0 model with optimal threshold
model, tokenizer, optimal_threshold, metadata = load_ohca_model_with_metadata("./trained_model")

# Run inference with optimal threshold
results = run_inference_with_optimal_threshold(model, tokenizer, new_data_df, optimal_threshold)

Model Architecture

Base Model: PubMedBERT (microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract)
Task: Binary classification (OHCA vs Non-OHCA)
Max Sequence Length: 512 tokens
Optimization: AdamW with linear learning rate scheduling
Class Balancing: Weighted loss + minority class oversampling
Threshold Selection: Optimal threshold found via validation set (v3.0)
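
As an illustration of the weighted-loss idea, the sketch below computes inverse-frequency class weights. The exact weighting scheme used in training is not stated here, so the formula N / (K * n_c) and the helper name balanced_class_weights are assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by N / (K * n_c), so rarer classes weigh more.

    N = number of samples, K = number of classes, n_c = samples in class c.
    """
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    return {c: total / (n_classes * n) for c, n in counts.items()}


# Made-up label distribution: 10% OHCA (1), 90% non-OHCA (0).
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
```

Weights like these could be passed, for instance, as the weight tensor of torch.nn.CrossEntropyLoss so that errors on the minority class cost more.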

Performance Metrics

v3.0 Enhanced Evaluation

v3.0 is designed to produce unbiased performance estimates by using:

Independent test set for final evaluation
Optimal threshold found on validation set only
Patient-level splits preventing data leakage
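
The sketch below shows one way a threshold could be chosen on validation data only and then held fixed for the test set. The selection criterion is not specified here, so maximizing F1 is an assumption, and f1_at and find_threshold are hypothetical names rather than the package's find_optimal_threshold().

```python
def f1_at(threshold, probs, labels):
    """F1 score when predicting OHCA for probabilities >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def find_threshold(val_probs, val_labels):
    """Pick the candidate threshold with the highest validation F1."""
    candidates = [i / 100 for i in range(5, 96)]  # 0.05 .. 0.95
    return max(candidates, key=lambda t: f1_at(t, val_probs, val_labels))


# Made-up validation probabilities and labels, for illustration only.
val_probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.62, 0.70, 0.90]
val_labels = [0, 0, 0, 0, 1, 0, 1, 1]
threshold = find_threshold(val_probs, val_labels)

# The test set is scored once with this fixed threshold; it is never
# used to choose the threshold.
```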

Clinical Metrics:

Sensitivity (Recall): Percentage of OHCA cases correctly identified
Specificity: Percentage of non-OHCA cases correctly identified
Precision (PPV): When model predicts OHCA, percentage that are correct
NPV: When model predicts non-OHCA, percentage that are correct
F1-Score: Harmonic mean of precision and recall
AUC-ROC: Area under the receiver operating characteristic curve
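
The count-based metrics above follow directly from the confusion matrix. A minimal sketch, with made-up counts that are not results from this model:

```python
def clinical_metrics(tp, fp, tn, fn):
    """Compute the count-based metrics from confusion-matrix counts."""
    def ratio(num, den):
        # Return NaN instead of raising when a denominator is zero.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),   # recall
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),           # precision
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Illustrative counts only.
m = clinical_metrics(tp=18, fp=4, tn=170, fn=2)
```

AUC-ROC is not included because it depends on the ranking of predicted probabilities rather than on counts at a single threshold.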

Clinical Usage

Enhanced v3.0 Clinical Decision Support

Clinical Priorities (v3.0):

Immediate Review: Very high probability cases requiring urgent attention
Priority Review: High probability cases for clinical team review
Clinical Review: Medium-high probability cases above optimal threshold
Consider Review: Medium probability cases for potential review
Routine Processing: Low probability cases
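
A mapping from probability to these categories might look like the sketch below. The numeric cutoffs (0.9, 0.7, 0.3) are assumptions for illustration; the actual boundaries are not given here, and the function name clinical_priority is hypothetical. Only the "Clinical Review" tier is tied to the optimal threshold, as described above.

```python
def clinical_priority(prob, optimal_threshold,
                      very_high=0.9, high=0.7, medium=0.3):
    """Map an OHCA probability to a review category.

    very_high, high, and medium are assumed cutoffs, not the package's values.
    """
    if prob >= very_high:
        return "Immediate Review"
    if prob >= high:
        return "Priority Review"
    if prob >= optimal_threshold:
        return "Clinical Review"
    if prob >= medium:
        return "Consider Review"
    return "Routine Processing"


for p in (0.95, 0.75, 0.50, 0.35, 0.10):
    print(p, clinical_priority(p, optimal_threshold=0.45))
```

With these assumed cutoffs, the tiers stay distinct only when the optimal threshold falls between the medium and high values.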

Optimal Threshold Usage:

Model automatically uses threshold found during validation
Consistent decision-making across all datasets
Threshold is tuned to the validation data rather than fixed at a default such as 0.5

Workflow Integration:

Run inference on new discharge notes (uses optimal threshold)
Prioritize "Immediate Review" cases for urgent manual review
Schedule "Priority Review" cases for clinical team evaluation
Use "Clinical Review" cases for quality improvement
Monitor routine cases for false negatives

Repository Structure

ohca-classifier-3.0/
├── src/
│   ├── __init__.py
│   ├── ohca_training_pipeline.py    # Enhanced v3.0 training workflow
│   └── ohca_inference.py            # Enhanced v3.0 inference
├── examples/
│   ├── training_example.py          # v3.0 training examples
│   ├── inference_example.py         # v3.0 inference examples
│   └── clif_dataset_example.py      # Cross-institutional deployment
├── docs/
│   └── annotation_guidelines.md     # Enhanced annotation guidelines
├── requirements.txt
├── setup.py
├── README.md
└── LICENSE

Examples

Complete v3.0 Training Example

cd examples
python training_example.py
# Choose option 1: v3.0 Training with Improved Methodology

Enhanced v3.0 Inference Examples

cd examples
python inference_example.py
# Choose option 1: v3.0 Inference with Optimal Threshold

Cross-Institutional Deployment

cd examples
python clif_dataset_example.py
# Apply v3.0 model to external datasets

Advanced Usage

Large Dataset Processing (v3.0)

from src.ohca_inference import process_large_dataset_with_optimal_threshold

# Process with optimal threshold automatically
process_large_dataset_with_optimal_threshold(
    model_path="./trained_model_v3",
    data_path="large_dataset.csv",
    output_path="results.csv",
    chunk_size=5000
)
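
The chunk_size parameter suggests the data is processed in fixed-size batches so that the full dataset never has to be held in memory at once. A generic sketch of that pattern follows; iter_chunks is a hypothetical helper, not the package's internal code.

```python
import itertools


def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size items from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


# Each chunk could be scored and appended to the output file in turn.
chunks = list(iter_chunks(range(12), 5))
```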

Model Testing with v3.0 Features

from src.ohca_inference import test_model_on_sample

# Test with optimal threshold support
test_cases = {
    'case1': "Chief complaint: Cardiac arrest at home...",
    'case2': "Chief complaint: Chest pain, no arrest..."
}

results = test_model_on_sample("./trained_model_v3", test_cases)
# Results include optimal threshold predictions and clinical priorities

Performance Benchmarks

v3.0 Methodology Performance

Typical performance with improved methodology:

AUC-ROC: 0.85-0.95 (unbiased estimates)
Sensitivity: 85-95% (at optimal threshold)
Specificity: 85-95% (at optimal threshold)
F1-Score: 0.7-0.9 (optimized via validation)

Key Improvements over Legacy:

Unbiased evaluation using independent test set
Optimal threshold provides better sensitivity/specificity balance
Larger training sets (800 vs 264) improve generalization
Patient-level splits prevent overly optimistic performance estimates

Performance varies based on data quality and annotation consistency

Migration from Legacy Versions

Upgrading from Legacy to v3.0

Benefits of Upgrading:

More reliable performance estimates
Better clinical decision support
Optimal threshold usage
Enhanced workflow integration

Migration Steps:

Retrain with v3.0 methodology using complete_improved_training_pipeline()
Add patient IDs to your data (subject_id column)
Use v3.0 inference functions for new predictions
Update workflows to use clinical priorities

Backward Compatibility:

Legacy models continue to work
Legacy functions automatically detect model version
Gradual migration supported

Citation

If you use this code in your research, please cite:

@software{nlp_ohca_classifier_v3,
    title={NLP OHCA Classifier v3.0: BERT-based Detection of Out-of-Hospital Cardiac Arrest with Enhanced Methodology},
    author={Mona Moukaddem},
    year={2025},
    url={https://github.com/monajm36/ohca-classifier-3.0},
    note={Enhanced methodology addressing data leakage, threshold optimization, and evaluation bias}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

Support

For questions or issues:

Check the Issues page
Create a new issue if needed
Review examples in the examples/ folder

Methodology References

The v3.0 improvements are based on established machine learning best practices:

Patient-level data splits prevent data leakage in healthcare AI
Proper train/validation/test methodology ensures unbiased evaluation
Optimal threshold finding improves clinical performance
Larger sample sizes enhance model generalization

Acknowledgments

PubMedBERT model from Microsoft Research
MIMIC-III dataset for model development
Transformers library by Hugging Face
PyTorch for deep learning framework
Data science community for methodological guidance
