High-Accuracy Email Classifier

Model Description

This model classifies emails into 6 categories, with a reported training accuracy of 98.13% and validation accuracy above 98%. It combines a multi-scale CNN and a bidirectional GRU with multi-head attention, and ships with a TFLite build intended for mobile and edge deployment.

Model Architecture

Base Architecture: CNN + Bidirectional GRU with Multi-Head Attention
Vocabulary Size: 25,000 words
Sequence Length: 250 tokens
Embedding Dimension: 300
Model Size: 94MB (H5), 7.9MB (TFLite)
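As a rough check on the figures above, the embedding table alone accounts for 25,000 × 300 weights. The sketch below works out that count and its storage footprint; the 4 bytes per weight assumes unquantized float32 storage, which the model card does not state explicitly.

```python
# Embedding table size implied by the listed hyperparameters.
vocab_size = 25_000      # from "Vocabulary Size"
embedding_dim = 300      # from "Embedding Dimension"
bytes_per_weight = 4     # assumption: float32 storage

embedding_params = vocab_size * embedding_dim
embedding_mb = embedding_params * bytes_per_weight / 1_000_000

print(f"Embedding parameters: {embedding_params:,}")      # 7,500,000
print(f"Approximate float32 size: {embedding_mb:.1f} MB")  # 30.0 MB
```

Under that assumption the embedding would be roughly a third of the 94MB H5 file, with the remainder coming from the convolutional, recurrent, attention, and dense layers, plus any optimizer state saved alongside the weights.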

Architecture Details

Input Layer (250,) 
    ↓
Embedding Layer (25000 → 300)
    ↓
Multi-scale CNN (kernels: 3, 4, 5)
    ↓
Bidirectional GRU (256 units)
    ↓
Multi-Head Attention (8 heads)
    ↓
Dense Layers + Dropout
    ↓
Output Layer (6 classes)
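
The layer stack above could be expressed in Keras roughly as follows. This is a minimal sketch, not the released model's definition: the filter count (128), attention key_dim (64), the pooling step, the dense width (128), and the dropout rate (0.5) are all assumptions, since the model card does not list them.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_sketch_model(vocab_size=25000, max_len=250, embed_dim=300, num_classes=6):
    inputs = layers.Input(shape=(max_len,), dtype="int32")
    x = layers.Embedding(vocab_size, embed_dim)(inputs)

    # Multi-scale CNN: parallel convolutions with kernel sizes 3, 4, 5.
    # 'same' padding keeps the sequence length so the branches can be concatenated.
    branches = [
        layers.Conv1D(128, k, padding="same", activation="relu")(x)  # 128 filters: assumption
        for k in (3, 4, 5)
    ]
    x = layers.Concatenate()(branches)

    # Bidirectional GRU returning the full sequence for the attention layer.
    x = layers.Bidirectional(layers.GRU(256, return_sequences=True))(x)

    # Self-attention with 8 heads (query and value are the same tensor).
    x = layers.MultiHeadAttention(num_heads=8, key_dim=64)(x, x)  # key_dim: assumption

    # Collapse the time axis before the classifier head (pooling choice: assumption).
    x = layers.GlobalMaxPooling1D()(x)
    x = layers.Dense(128, activation="relu")(x)   # width: assumption
    x = layers.Dropout(0.5)(x)                    # rate: assumption
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model_sketch = build_sketch_model()
model_sketch.summary()
```

Because the hidden sizes here are guesses, the parameter count printed by summary() should not be expected to match the published model.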

Performance

Training Accuracy: 98.13%
Validation Accuracy: 98%+
Model Size: 94MB (H5 format), 7.9MB (TFLite)
Inference Speed: No latency figures are reported; the TFLite build targets mobile/edge deployment

Quick Start

Loading the Model

import tensorflow as tf
import json
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the model
model = tf.keras.models.load_model('best_high_accuracy_model.h5')

# Load tokenizer configuration
with open('high_accuracy_tokenizer_config.json', 'r') as f:
    config = json.load(f)

categories = config['categories']
word_index = config['word_index']
max_len = config['max_len']

Preprocessing Function

import re

def preprocess_text(text):
    """Preprocess text exactly as done during training"""
    # Convert to lowercase
    text = text.lower()
    
    # Replace URLs
    text = re.sub(r'http[s]?://\S+', 'URL', text)
    text = re.sub(r'www\.\S+', 'URL', text)
    
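    # Replace email addresses (the '|' in the final character class is a literal
    # pipe, not alternation; left unchanged so the regex matches training)
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b', 'EMAIL', text)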
    
    # Replace numbers
    text = re.sub(r'\b\d+\b', 'NUMBER', text)
    
    # Remove punctuation
    text = re.sub(r'[^\w\s]', ' ', text)
    
    # Remove extra spaces
    text = ' '.join(text.split())
    
    return text

def text_to_sequence(text, word_index, max_len):
    """Convert text to padded sequence"""
    words = text.split()
    sequence = [word_index.get(word, 1) for word in words]  # 1 is OOV token
    return pad_sequences([sequence], maxlen=max_len, padding='post', truncating='post')
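
To make the padding and out-of-vocabulary behaviour concrete, the sketch below reproduces the lookup and the post-padding/post-truncation step in plain Python, without TensorFlow. The helper name to_padded_ids and the toy vocabulary are hypothetical; the pad value 0 is the pad_sequences default and the OOV index 1 follows the comment in the function above.

```python
def to_padded_ids(text, word_index, max_len, oov_id=1, pad_id=0):
    """Pure-Python equivalent of the lookup plus post-padding/post-truncation."""
    ids = [word_index.get(word, oov_id) for word in text.split()]
    ids = ids[:max_len]                              # truncating='post' drops the tail
    return ids + [pad_id] * (max_len - len(ids))     # padding='post' appends zeros

# Hypothetical toy vocabulary; the real mapping is config['word_index'].
toy_index = {"your": 2, "code": 3, "is": 4, "NUMBER": 5}

# preprocess_text("Your code is 123456! See https://example.com/reset")
# yields the string below: placeholders stay uppercase because they are
# substituted after lowercasing.
cleaned = "your code is NUMBER see URL"

print(to_padded_ids(cleaned, toy_index, max_len=8))   # [2, 3, 4, 5, 1, 1, 0, 0]
print(to_padded_ids(cleaned, toy_index, max_len=4))   # [2, 3, 4, 5]
```

Note that "see" and "URL" both map to 1 here only because the toy vocabulary omits them.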

Making Predictions

def predict_email_category(text, model, word_index, categories, max_len):
    """Predict email category with confidence scores"""
    # Preprocess text
    processed_text = preprocess_text(text)
    
    # Convert to sequence
    sequence = text_to_sequence(processed_text, word_index, max_len)
    
    # Get prediction
    prediction = model.predict(sequence, verbose=0)
    probabilities = prediction[0]
    
    # Get predicted class
    predicted_idx = np.argmax(probabilities)
    predicted_category = categories[predicted_idx]
    confidence = probabilities[predicted_idx]
    
    # Return all probabilities
    results = {
        'predicted_category': predicted_category,
        'confidence': float(confidence),
        'all_probabilities': {
            category: float(prob) 
            for category, prob in zip(categories, probabilities)
        }
    }
    
    return results

# Example usage
email_text = "Your verification code is 123456. Please enter this code."
result = predict_email_category(email_text, model, word_index, categories, max_len)
print(f"Category: {result['predicted_category']}")
print(f"Confidence: {result['confidence']:.4f}")
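
When the top prediction is not clearly ahead, it can help to inspect the runner-up categories or to flag the result for a fallback. The helper below is one way to do that with the all_probabilities dictionary returned above; the function name, the 0.6 threshold, and the category names in the example are hypothetical and not taken from the model.

```python
def rank_categories(all_probabilities, top_k=3, min_confidence=0.6):
    """Return the top_k (category, probability) pairs and a low-confidence flag."""
    ranked = sorted(all_probabilities.items(), key=lambda item: item[1], reverse=True)
    is_uncertain = ranked[0][1] < min_confidence
    return ranked[:top_k], is_uncertain

# Made-up scores for illustration only.
example = {"verification": 0.55, "promotions": 0.30, "updates": 0.10, "spam": 0.05}
top, uncertain = rank_categories(example, top_k=2)
print(top)         # [('verification', 0.55), ('promotions', 0.3)]
print(uncertain)   # True
```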

TFLite Mobile Deployment

For mobile/edge deployment, use the optimized TFLite version:

import numpy as np
import tensorflow as tf

# Load TFLite model
interpreter = tf.lite.Interpreter(model_path='high_accuracy_email_classifier.tflite')
interpreter.allocate_tensors()

# Get input/output details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

def predict_tflite(text, interpreter, word_index, categories, max_len):
    """Predict using TFLite model"""
    # Preprocess and convert to sequence
    processed_text = preprocess_text(text)
    sequence = text_to_sequence(processed_text, word_index, max_len)
    
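    # Run inference; cast to the dtype the interpreter reports for its input
    # (this may be int32 or float32 depending on how the model was converted)
    interpreter.set_tensor(input_details[0]['index'], sequence.astype(input_details[0]['dtype']))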
    interpreter.invoke()
    
    # Get output (already softmax probabilities)
    output_data = interpreter.get_tensor(output_details[0]['index'])
    probabilities = output_data[0]
    
    predicted_idx = np.argmax(probabilities)
    return categories[predicted_idx], probabilities

Training Details

Data Augmentation

Synonym replacement
Random word deletion
Word position swapping
Contextual word insertion
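
Two of the augmentations listed above, random word deletion and word position swapping, are simple enough to sketch with the standard library. This is an illustrative version only: the deletion probability, the number of swaps, and the function names are assumptions, and the actual training code is not part of this model card. Synonym replacement and contextual insertion need an external lexicon or language model and are omitted.

```python
import random

def random_deletion(words, p=0.1, rng=random):
    """Drop each word with probability p, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]

def random_swap(words, n_swaps=1, rng=random):
    """Swap the positions of two randomly chosen words, n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

rng = random.Random(0)  # seeded so the augmented samples are repeatable
tokens = "your verification code is NUMBER please enter this code".split()
print(" ".join(random_deletion(tokens, p=0.2, rng=rng)))
print(" ".join(random_swap(tokens, n_swaps=2, rng=rng)))
```

Both functions operate on already preprocessed tokens, so placeholder tokens such as NUMBER are treated like any other word.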

Advanced Techniques

Multi-scale CNN filters (3, 4, 5)
Bidirectional GRU with attention
Class weight balancing
Cosine annealing learning rate
Early stopping with patience
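
Two items in this list have compact closed forms. The sketch below computes a cosine-annealed learning rate and inverse-frequency class weights; the learning-rate bounds, epoch count, and label counts are placeholder values, and the weighting formula n_samples / (n_classes × class_count) is an assumption about how the class weights were derived (it is the heuristic scikit-learn uses for class_weight='balanced').

```python
import math
from collections import Counter

def cosine_annealing_lr(epoch, total_epochs, lr_max=1e-3, lr_min=1e-5):
    """Learning rate decaying from lr_max to lr_min along half a cosine wave."""
    progress = min(epoch, total_epochs) / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Placeholder schedule: 20 epochs, sampled at the start, middle, and end.
for epoch in (0, 10, 20):
    print(epoch, f"{cosine_annealing_lr(epoch, 20):.6f}")

# Placeholder labels: class 0 is common, class 2 is rare.
labels = [0] * 50 + [1] * 30 + [2] * 20
print(balanced_class_weights(labels))
```

A dictionary of this form is what Keras model.fit accepts through its class_weight argument, so rarer classes contribute more to the loss.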

Preprocessing

URL/Email/Number standardization
Punctuation removal
Case normalization
OOV token handling

Files Included

best_high_accuracy_model.h5 - Main Keras model (94MB)
high_accuracy_email_classifier.tflite - Mobile-optimized TFLite model (7.9MB)
high_accuracy_tokenizer_config.json - Tokenizer configuration and word mappings
android_config.json - Mobile deployment configuration
confusion_matrix.png - Model performance visualization

Requirements

tensorflow>=2.19.0
numpy>=1.21.0
scikit-learn>=1.0.0
matplotlib>=3.5.0
seaborn>=0.11.0

License

This model is released under the Apache 2.0 License.

Citation

@misc{high_accuracy_email_classifier,
  title={High-Accuracy Email Classifier with CNN-GRU Architecture},
  author={Email Classification Team},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/your-username/high-accuracy-email-classifier}
}

Model Card Contact

For questions and issues, please open an issue in the repository or contact the model authors.

jason23322/high-accuracy-email-classifier
