Nikeytas/Test Upload Model

This model is a fine-tuned version of MCG-NJU/videomae-base, trained on a subset of the UCF Crime dataset for event-based binary classification. It achieves the following results on the evaluation set:

  • Loss: 0.5847
  • Accuracy: 0.5000
  • Precision: 0.2500
  • Recall: 0.5000
  • F1 Score: 0.3333

🎯 Model Overview

This VideoMAE model has been fine-tuned for binary violence detection in video content. The model classifies videos into two categories:

  • Violent Crime (1): Videos containing violent criminal activities
  • Non-Violent Incident (0): Videos with non-violent or normal activities
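
A quick way to verify the label mapping is to read it from the model config. A minimal sketch (the exact label strings depend on how the config was exported and may differ from the names above):

from transformers import AutoConfig

# Inspect the id-to-label mapping stored with the model
config = AutoConfig.from_pretrained("Nikeytas/test-upload-model")
print(config.id2label)  # expected to resemble {0: "Non-Violent Incident", 1: "Violent Crime"}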

The model is based on the VideoMAE architecture and has been specifically trained on a curated subset of the UCF Crime dataset with event-based categorization for realistic crime detection scenarios.

📊 Dataset & Training

Dataset Composition

Total Videos: 20

  • Violent Crime Videos: 10
  • Non-Violent Incident Videos: 10

Class Balance: 50.0% violent crimes

Event Distribution:

  • Arrest: 20 videos
  • Arson: 20 videos

Data Splits:

  • Training: 12 videos
  • Validation: 4 videos
  • Test: 4 videos
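
The 12/4/4 split corresponds to a 60/20/20 stratified split over the 20 videos. A minimal sketch of how such a split could be produced (file names are illustrative; the actual split script was not released with the model):

from sklearn.model_selection import train_test_split

# 20 illustrative video paths: 10 violent (label 1), 10 non-violent (label 0)
video_paths = [f"video_{i:02d}.mp4" for i in range(20)]
labels = [1] * 10 + [0] * 10

# 60% train, then split the remaining 40% evenly into validation and test
train, temp, y_train, y_temp = train_test_split(
    video_paths, labels, test_size=0.4, stratify=labels, random_state=42
)
val, test, y_val, y_test = train_test_split(
    temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)
print(len(train), len(val), len(test))  # 12 4 4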

🎯 Performance

Validation Performance:

  • eval_loss: 0.5847
  • eval_accuracy: 0.5000
  • eval_precision: 0.2500
  • eval_recall: 0.5000
  • eval_f1: 0.3333
  • eval_runtime: 0.6636
  • eval_samples_per_second: 6.0270
  • eval_steps_per_second: 3.0140
  • epoch: 1.0000

Test Performance:

  • eval_loss: 0.6700
  • eval_accuracy: 0.5000
  • eval_precision: 0.2500
  • eval_recall: 0.5000
  • eval_f1: 0.3333
  • eval_runtime: 0.4271
  • eval_samples_per_second: 9.3660
  • eval_steps_per_second: 4.6830
  • epoch: 1.0000
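
The identical accuracy/precision/recall/F1 pattern on validation and test is consistent with macro-averaged metrics computed over a small balanced split where every prediction falls in one class. A hedged sketch that reproduces the reported numbers under that assumption (the averaging mode and predictions here are assumptions, not released artifacts):

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# 4-sample balanced split with every prediction in class 0 (assumption)
y_true = [0, 0, 1, 1]
y_pred = [0, 0, 0, 0]

acc = accuracy_score(y_true, y_pred)  # 0.5
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(acc, prec, rec, f1)  # 0.5 0.25 0.5 0.3333...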

Training Information:

  • Training Time: 0.1 minutes
  • Best Accuracy Achieved: 0.5000
  • Model Architecture: VideoMAE Base (fine-tuned)
  • Fine-tuning Approach: Event-based binary classification

🚀 Training Procedure

Training Hyperparameters

The following hyperparameters were used during training:

  • Learning Rate: 5e-05
  • Train Batch Size: 2
  • Eval Batch Size: 2
  • Optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
  • LR Scheduler Type: Linear
  • Training Epochs: 1
  • Weight Decay: 0.01
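
For reference, a minimal sketch of how these hyperparameters map onto HuggingFace TrainingArguments (the output directory name is illustrative, and dataset/model wiring is omitted):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="videomae-ucf-crime",   # illustrative name
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,
    weight_decay=0.01,
    lr_scheduler_type="linear",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)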

Training Results

Training Loss | Epoch | Step | Validation Loss | Accuracy
0.5           | 1.00  | N/A  | 0.5847          | 0.5000

Framework Versions

  • Transformers: 4.30.2+
  • PyTorch: 2.0.1+
  • Datasets: Latest
  • Device: Apple Silicon MPS / CUDA / CPU (Auto-detected)
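
The device auto-detection noted above typically follows a pattern like this sketch (assuming PyTorch 2.x, where the MPS backend is available on Apple Silicon):

import torch

# Prefer CUDA, fall back to Apple Silicon MPS, then CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")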

🚀 Quick Start

Installation

pip install transformers torch torchvision opencv-python pillow

Basic Usage

import torch
from transformers import AutoModelForVideoClassification, AutoProcessor
import cv2
import numpy as np

# Load model and processor
model = AutoModelForVideoClassification.from_pretrained("Nikeytas/test-upload-model")
processor = AutoProcessor.from_pretrained("Nikeytas/test-upload-model")

# Process video
def classify_video(video_path, num_frames=16):
    # Sample num_frames frames evenly across the video
    cap = cv2.VideoCapture(video_path)
    frames = []

    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total_frames <= 0:
        cap.release()
        raise ValueError(f"Could not read frames from {video_path}")
    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)

    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ret, frame = cap.read()
        if ret:
            # OpenCV decodes to BGR; the processor expects RGB
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

    cap.release()

    # Pad by repeating the last frame if any reads failed;
    # the model expects exactly num_frames frames per clip
    while frames and len(frames) < num_frames:
        frames.append(frames[-1])

    # Process with model
    inputs = processor(frames, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(predictions, dim=-1).item()
        confidence = predictions[0][predicted_class].item()

    label = "Violent Crime" if predicted_class == 1 else "Non-Violent"
    return label, confidence

# Example usage
video_path = "path/to/your/video.mp4"
prediction, confidence = classify_video(video_path)
print(f"Prediction: {prediction} (Confidence: {confidence:.3f})")

Batch Processing

from pathlib import Path

def process_video_directory(video_dir, output_file="results.txt"):
    results = []
    
    for video_file in Path(video_dir).glob("*.mp4"):
        try:
            prediction, confidence = classify_video(str(video_file))
            results.append({
                "file": video_file.name,
                "prediction": prediction,
                "confidence": confidence
            })
            print(f"✅ {video_file.name}: {prediction} ({confidence:.3f})")
        except Exception as e:
            print(f"❌ Error processing {video_file.name}: {e}")
    
    # Save results
    with open(output_file, "w") as f:
        for result in results:
            f.write(f"{result['file']}: {result['prediction']} ({result['confidence']:.3f})\n")
    
    return results

# Process all videos in a directory
results = process_video_directory("./videos/")

📈 Technical Specifications

  • Base Model: MCG-NJU/videomae-base
  • Architecture: Vision Transformer (ViT) adapted for video
  • Input Resolution: 224x224 pixels per frame
  • Temporal Resolution: 16 frames per video clip
  • Output Classes: 2 (Binary classification)
  • Training Framework: HuggingFace Transformers
  • Optimization: AdamW optimizer with learning rate 5e-5
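
These specifications imply model inputs of shape (batch, frames, channels, height, width) = (1, 16, 3, 224, 224). A quick sanity-check sketch using a dummy clip (the expected shape is an inference from the specs above, not a released test):

import numpy as np
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Nikeytas/test-upload-model")

# 16 blank 224x224 RGB frames standing in for a real clip
dummy_video = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(16)]
inputs = processor(dummy_video, return_tensors="pt")
print(inputs["pixel_values"].shape)  # expected: torch.Size([1, 16, 3, 224, 224])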

⚠️ Limitations

  1. Dataset Scope: Trained on a small subset of the UCF Crime dataset and may not generalize to all types of violence
  2. Temporal Context: Uses 16-frame clips, which may miss context in longer sequences (see the sliding-window sketch after this list)
  3. Environmental Bias: Performance may vary with different lighting, camera angles, and video quality
  4. False Positives: May misclassify intense but non-violent activities (sports, action movies)
  5. Real-time Performance: Processing time depends on hardware capabilities
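
One common mitigation for the temporal-context limitation is to slide a 16-frame window across the video and aggregate clip-level scores. A minimal sketch reusing the model and processor loaded in Quick Start (the windowing and averaging strategy is an illustration, not part of the released code):

import cv2
import torch

def classify_sliding_windows(video_path, window=16, stride=8):
    # Read the whole video into memory (fine for short clips)
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    if not frames:
        raise ValueError(f"Could not read frames from {video_path}")

    # Classify each overlapping window and average the softmax scores
    probs = []
    for start in range(0, max(len(frames) - window + 1, 1), stride):
        clip = frames[start:start + window]
        while len(clip) < window:          # pad short tails
            clip.append(clip[-1])
        inputs = processor(clip, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        probs.append(torch.softmax(logits, dim=-1)[0])

    avg = torch.stack(probs).mean(dim=0)
    label = "Violent Crime" if avg.argmax().item() == 1 else "Non-Violent"
    return label, avg.max().item()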

🔒 Ethical Considerations

Intended Use

  • Primary: Research and development in video analysis
  • Secondary: Security system enhancement with human oversight
  • Educational: Computer vision and AI safety research

Prohibited Uses

  • Surveillance without consent: Do not use for unauthorized monitoring
  • Discriminatory profiling: Avoid bias against specific groups or communities
  • Automated punishment: Never use for automated legal or disciplinary actions
  • Privacy violation: Respect privacy laws and individual rights

Bias and Fairness

  • Model trained on specific dataset that may not represent all populations
  • Regular evaluation needed for bias detection and mitigation
  • Human oversight required for critical applications
  • Consider demographic representation in deployment scenarios

📝 Model Card Information

  • Developed by: Research Team
  • Model Type: Video Classification (Binary)
  • Training Data: UCF Crime Dataset (Subset)
  • Training Date: 2025-06-08 15:19:08 UTC
  • Evaluation Metrics: Accuracy, Precision, Recall, F1-Score
  • Intended Users: Researchers, Security Professionals, Developers

📚 Citation

If you use this model in your research, please cite:

@misc{Nikeytas_test_upload_model,
    title={VideoMAE Fine-tuned for Crime Detection},
    author={Research Team},
    year={2025},
    publisher={Hugging Face},
    url={https://huggingface.co/Nikeytas/test-upload-model}
}

🤝 Contributing

We welcome contributions to improve the model! Please:

  1. Report issues with specific examples
  2. Suggest improvements for bias reduction
  3. Share evaluation results on new datasets
  4. Contribute to documentation and examples

📞 Contact

For questions, issues, or collaboration opportunities, please open an issue in the model repository or contact the development team.


Last updated: 2025-06-08 15:19:08 UTC
Model version: 1.0
Framework: HuggingFace Transformers
