High-Accuracy Email Classifier
Model Description
This model classifies emails into 6 categories, with a reported training accuracy of 98.13% and validation accuracy of 98%+. It combines a multi-scale CNN, a bidirectional GRU, and multi-head attention, and is also provided as a 7.9MB TFLite export for edge deployment.
Categories
The model classifies emails into the following categories:
- 📱 Social Media - Notifications from social platforms (Facebook, Instagram, Twitter, etc.)
- 🛒 Promotions - Marketing emails, sales, offers, and advertisements
- 🗣️ Forum - Forum posts, discussions, and community notifications
- ⚠️ Spam - Unwanted emails, scams, and phishing attempts
- 🔐 Verify Code - Authentication codes and verification emails
- 🔄 Updates - System updates, security patches, and maintenance notices
Model Architecture
- Base Architecture: CNN + Bidirectional GRU with Multi-Head Attention
- Vocabulary Size: 25,000 words
- Sequence Length: 250 tokens
- Embedding Dimension: 300
- Model Size: 94MB (H5), 7.9MB (TFLite)
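To put the stated sizes in context, the short calculation below estimates how much of the model is taken up by the embedding table alone. It uses only the vocabulary size and embedding dimension listed above. The byte widths (4 bytes for float32, 1 byte for int8) are standard, but this card does not say whether the TFLite export is quantized, so treat the int8 figure as a point of comparison rather than a description of the shipped file.

```python
# Back-of-the-envelope size of the embedding table, from the figures above.
vocab_size = 25_000
embedding_dim = 300

embedding_params = vocab_size * embedding_dim

# Storage cost at two common weight precisions (decimal megabytes).
float32_mb = embedding_params * 4 / 1e6
int8_mb = embedding_params * 1 / 1e6

print(f"{embedding_params:,} embedding parameters")
print(f"{float32_mb:.1f} MB as float32, {int8_mb:.1f} MB as int8")
```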
Architecture Details
Input Layer (250,)
↓
Embedding Layer (25000 → 300)
↓
Multi-scale CNN (kernels: 3, 4, 5)
↓
Bidirectional GRU (256 units)
↓
Multi-Head Attention (8 heads)
↓
Dense Layers + Dropout
↓
Output Layer (6 classes)
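The layer stack above could be expressed in Keras roughly as follows. This is a minimal sketch, not the released training code: the number of convolution filters, the padding mode, the attention key_dim, the pooling step, and the dense layer width and dropout rate are all assumptions, since the card does not state them. Only the vocabulary size, sequence length, embedding dimension, kernel sizes, GRU units, head count, and class count come from the card.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_sketch(vocab_size=25000, seq_len=250, embed_dim=300, num_classes=6):
    """Approximate CNN + BiGRU + attention stack (hyperparameters partly assumed)."""
    inputs = layers.Input(shape=(seq_len,), dtype="int32")
    x = layers.Embedding(vocab_size, embed_dim)(inputs)

    # Multi-scale CNN: parallel branches with kernel sizes 3, 4, 5.
    # filters=128 and padding="same" are assumptions.
    branches = [
        layers.Conv1D(128, k, padding="same", activation="relu")(x)
        for k in (3, 4, 5)
    ]
    x = layers.Concatenate()(branches)

    # Bidirectional GRU with 256 units per direction, keeping the time axis.
    x = layers.Bidirectional(layers.GRU(256, return_sequences=True))(x)

    # Self-attention over the GRU outputs; key_dim=64 is an assumption.
    x = layers.MultiHeadAttention(num_heads=8, key_dim=64)(x, x)

    # Collapse the time axis (pooling choice is an assumption).
    x = layers.GlobalMaxPooling1D()(x)

    # Dense + Dropout head; width and rate are assumptions.
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)


model_sketch = build_sketch()
model_sketch.summary()
```

Because several hyperparameters are guessed, the parameter count of this sketch will not necessarily match the released weights.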
Performance
- Training Accuracy: 98.13%
- Validation Accuracy: 98%+
- Model Size: 94MB (H5 format), 7.9MB (TFLite)
- Inference Speed: Optimized for mobile/edge deployment
Quick Start
Loading the Model
import tensorflow as tf
import json
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the model
model = tf.keras.models.load_model('best_high_accuracy_model.h5')

# Load tokenizer configuration
with open('high_accuracy_tokenizer_config.json', 'r') as f:
    config = json.load(f)

categories = config['categories']
word_index = config['word_index']
max_len = config['max_len']
Preprocessing Function
import re

def preprocess_text(text):
    """Preprocess text exactly as done during training"""
    # Convert to lowercase
    text = text.lower()
    # Replace URLs
    text = re.sub(r'http[s]?://\S+', 'URL', text)
    text = re.sub(r'www\.\S+', 'URL', text)
    # Replace email addresses
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b', 'EMAIL', text)
    # Replace numbers
    text = re.sub(r'\b\d+\b', 'NUMBER', text)
    # Remove punctuation
    text = re.sub(r'[^\w\s]', ' ', text)
    # Remove extra spaces
    text = ' '.join(text.split())
    return text

def text_to_sequence(text, word_index, max_len):
    """Convert text to padded sequence"""
    words = text.split()
    sequence = [word_index.get(word, 1) for word in words]  # 1 is OOV token
    return pad_sequences([sequence], maxlen=max_len, padding='post', truncating='post')
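To make the lookup and padding behaviour concrete, here is a dependency-free sketch of what the conversion step does to a single preprocessed string. The helper name to_padded_ids and the toy_word_index dictionary are hypothetical and exist only for illustration. The real mapping comes from high_accuracy_tokenizer_config.json. Note that the placeholders URL, EMAIL, and NUMBER are inserted in uppercase after lowercasing, so they only resolve to real ids if word_index stores them with that exact casing; a quick membership check like the one below can confirm this for the shipped config.

```python
def to_padded_ids(text, word_index, max_len, oov_id=1, pad_id=0):
    """Word lookup followed by post-truncation and post-padding."""
    ids = [word_index.get(word, oov_id) for word in text.split()]
    ids = ids[:max_len]                            # truncating='post' keeps the first max_len ids
    return ids + [pad_id] * (max_len - len(ids))   # padding='post' appends zeros


# Hypothetical vocabulary, for illustration only.
toy_word_index = {"your": 2, "verification": 3, "code": 4, "is": 5, "NUMBER": 6}

print(to_padded_ids("your verification code is NUMBER please", toy_word_index, 8))
# → [2, 3, 4, 5, 6, 1, 0, 0]

# Placeholders missing from the vocabulary would silently map to the OOV id.
missing = [tok for tok in ("URL", "EMAIL", "NUMBER") if tok not in toy_word_index]
print(missing)
# → ['URL', 'EMAIL']
```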
Making Predictions
def predict_email_category(text, model, word_index, categories, max_len):
    """Predict email category with confidence scores"""
    # Preprocess text
    processed_text = preprocess_text(text)
    # Convert to sequence
    sequence = text_to_sequence(processed_text, word_index, max_len)
    # Get prediction
    prediction = model.predict(sequence, verbose=0)
    probabilities = prediction[0]
    # Get predicted class
    predicted_idx = np.argmax(probabilities)
    predicted_category = categories[predicted_idx]
    confidence = probabilities[predicted_idx]
    # Return all probabilities
    results = {
        'predicted_category': predicted_category,
        'confidence': float(confidence),
        'all_probabilities': {
            category: float(prob)
            for category, prob in zip(categories, probabilities)
        }
    }
    return results

# Example usage
email_text = "Your verification code is 123456. Please enter this code."
result = predict_email_category(email_text, model, word_index, categories, max_len)
print(f"Category: {result['predicted_category']}")
print(f"Confidence: {result['confidence']:.4f}")
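The returned dictionary can be post-processed without touching the model. The sketch below ranks categories by probability and falls back to a neutral label when confidence is low. Both helper names, the 0.8 threshold, and the "Uncertain" label are hypothetical choices, and the probabilities in example_result are made-up illustrative values, not real model output. The category strings are taken from the list in this card and may be spelled differently in the tokenizer config.

```python
def top_k(all_probabilities, k=3):
    """Return the k most probable (category, probability) pairs."""
    ranked = sorted(all_probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def label_or_fallback(result, threshold=0.8, fallback="Uncertain"):
    """Keep the predicted label only when confidence clears the threshold."""
    if result["confidence"] >= threshold:
        return result["predicted_category"]
    return fallback


# Illustrative values only, shaped like the dictionary returned above.
example_result = {
    "predicted_category": "Verify Code",
    "confidence": 0.91,
    "all_probabilities": {
        "Social Media": 0.01, "Promotions": 0.04, "Forum": 0.01,
        "Spam": 0.02, "Verify Code": 0.91, "Updates": 0.01,
    },
}

print(top_k(example_result["all_probabilities"], k=2))
print(label_or_fallback(example_result))
```

A threshold like this is useful when a wrong label is costlier than no label, for example before auto-filing mail into a Spam folder.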
TFLite Mobile Deployment
For mobile/edge deployment, use the optimized TFLite version:
import numpy as np
import tensorflow as tf

# Load TFLite model
interpreter = tf.lite.Interpreter(model_path='high_accuracy_email_classifier.tflite')
interpreter.allocate_tensors()

# Get input/output details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

def predict_tflite(text, interpreter, word_index, categories, max_len):
    """Predict using TFLite model"""
    # Preprocess and convert to sequence
    processed_text = preprocess_text(text)
    sequence = text_to_sequence(processed_text, word_index, max_len)
    # Cast to the dtype the interpreter reports for its input tensor;
    # set_tensor raises an error on a dtype mismatch
    input_dtype = input_details[0]['dtype']
    # Run inference
    interpreter.set_tensor(input_details[0]['index'], sequence.astype(input_dtype))
    interpreter.invoke()
    # Get output (already softmax probabilities)
    output_data = interpreter.get_tensor(output_details[0]['index'])
    probabilities = output_data[0]
    predicted_idx = np.argmax(probabilities)
    return categories[predicted_idx], probabilities
Training Details
Data Augmentation
- Synonym replacement
- Random word deletion
- Word position swapping
- Contextual word insertion
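Two of the augmentations listed above, random word deletion and word position swapping, are simple enough to sketch in plain Python. The function names, the deletion probability, and the number of swaps are assumptions made for illustration; the card does not state the settings used in training. Synonym replacement and contextual insertion need an external lexicon or language model and are not shown.

```python
import random


def random_deletion(words, p, rng):
    """Drop each word independently with probability p, keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]


def random_swap(words, n_swaps, rng):
    """Swap the positions of two randomly chosen words, n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


rng = random.Random(0)  # fixed seed so augmentation runs are reproducible
tokens = "your verification code is NUMBER please enter this code".split()
print(" ".join(random_deletion(tokens, p=0.2, rng=rng)))
print(" ".join(random_swap(tokens, n_swaps=2, rng=rng)))
```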
Advanced Techniques
- Multi-scale CNN filters (3, 4, 5)
- Bidirectional GRU with attention
- Class weight balancing
- Cosine annealing learning rate
- Early stopping with patience
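For readers unfamiliar with two of the techniques above, the following sketch shows the standard cosine annealing formula and the common "balanced" class weighting heuristic, n_samples / (n_classes * class_count). The learning rate bounds, step counts, and label list are hypothetical. The card does not state the values used, nor whether the weights were computed exactly this way.

```python
import math
from collections import Counter


def cosine_annealing_lr(step, total_steps, lr_max=1e-3, lr_min=1e-5):
    """Decay the learning rate from lr_max to lr_min along a half cosine."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Learning rate at the start, middle, and end of a hypothetical 100-step schedule.
for step in (0, 50, 100):
    print(step, cosine_annealing_lr(step, total_steps=100))

# Hypothetical imbalanced label list: rarer classes receive larger weights.
print(balanced_class_weights(["Spam"] * 3 + ["Forum"]))
```

A dictionary like the one returned by balanced_class_weights, keyed by integer class index, is the form Keras accepts through the class_weight argument of model.fit.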
Preprocessing
- URL/Email/Number standardization
- Punctuation removal
- Case normalization
- OOV token handling
Files Included
- best_high_accuracy_model.h5 - Main Keras model (94MB)
- high_accuracy_email_classifier.tflite - Mobile-optimized TFLite model (7.9MB)
- high_accuracy_tokenizer_config.json - Tokenizer configuration and word mappings
- android_config.json - Mobile deployment configuration
- confusion_matrix.png - Model performance visualization
Requirements
tensorflow>=2.19.0
numpy>=1.21.0
scikit-learn>=1.0.0
matplotlib>=3.5.0
seaborn>=0.11.0
License
This model is released under the Apache 2.0 License.
Citation
@misc{high_accuracy_email_classifier,
title={High-Accuracy Email Classifier with CNN-GRU Architecture},
author={Email Classification Team},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/your-username/high-accuracy-email-classifier}
}
Model Card Contact
For questions and issues, please open an issue in the repository or contact the model authors.