Qwen2-VL-2B-Instruct-LoRA-SurveillanceVideo-Classification-250207

This model is a fine-tuned version of Qwen/Qwen2-VL-2B-Instruct on the Surveillance Video Classification dataset.

Model description

This model takes a video as input and classifies it into one of six classes: [1. loitering, 2. breaking and entering, 3. abandonment, 4. falling down, 5. fighting, 6. arson].
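Because the model is prompted to answer with a single digit, downstream code usually needs to map that digit back to a class name. The following is a minimal sketch of such a mapping; `CLASSES` and `parse_prediction` are hypothetical names introduced here for illustration and are not part of the released model.

```python
import re

# Class ids and names exactly as listed in the model description.
CLASSES = {
    1: "loitering",
    2: "breaking and entering",
    3: "abandonment",
    4: "falling down",
    5: "fighting",
    6: "arson",
}


def parse_prediction(generated_text):
    """Return (class_id, class_name) for the first digit 1-6 in the text.

    Returns None when the text contains no valid class digit, so the caller
    can decide how to handle an unparseable answer.
    """
    match = re.search(r"[1-6]", generated_text)
    if match is None:
        return None
    class_id = int(match.group())
    return class_id, CLASSES[class_id]
```

For example, `parse_prediction("5")` yields the id 5 together with the name "fighting".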

The model was trained with LLaMA-Factory using the hyperparameters listed below.

Intended uses & limitations

The model was fine-tuned with the prompt below. Use the same prompt when running inference.

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "max_pixels": 640 * 360,
                # "fps": 1.0   # maybe default fps = 1.0
            },
            {
                "type": "text",
                "text": (
                    "<video>\nWatch the video and choose the six behaviours that apply to you. "
                    "[1. loitering, 2. breaking and entering, 3. abandonment, 4. falling down, 5. fighting, 6. arson]. "
                    "Your answer must be a single digit, the number of the behaviour."
                )
            }
        ]
    }
]
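One way to run the prompt above end to end is sketched below. It is not the author's script: it assumes that the repository `Jeckmu/Qwen2-VL-2B-Instruct-lora-SurveillanceVideo-250207` hosts a LoRA adapter that `peft` can load on top of the base model, and that the `qwen-vl-utils` package is installed for video loading. The helper names `build_messages` and `classify` are hypothetical.

```python
# The prompt text is copied verbatim from the model card and must not be edited,
# since the model was fine-tuned on this exact wording.
PROMPT = (
    "<video>\nWatch the video and choose the six behaviours that apply to you. "
    "[1. loitering, 2. breaking and entering, 3. abandonment, 4. falling down, 5. fighting, 6. arson]. "
    "Your answer must be a single digit, the number of the behaviour."
)

BASE_ID = "Qwen/Qwen2-VL-2B-Instruct"
# Assumption: this repo contains a PEFT LoRA adapter for the base model.
ADAPTER_ID = "Jeckmu/Qwen2-VL-2B-Instruct-lora-SurveillanceVideo-250207"


def build_messages(video_path):
    """Build the chat messages in the same layout as the model card."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path, "max_pixels": 640 * 360},
                {"type": "text", "text": PROMPT},
            ],
        }
    ]


def classify(video_path, max_new_tokens=4):
    """Return the raw text generated for one video (expected: a single digit)."""
    # Heavy imports are kept local so build_messages works without them.
    import torch
    from peft import PeftModel
    from qwen_vl_utils import process_vision_info
    from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = AutoProcessor.from_pretrained(BASE_ID)
    model = Qwen2VLForConditionalGeneration.from_pretrained(BASE_ID, torch_dtype="auto")
    model = PeftModel.from_pretrained(model, ADAPTER_ID).to(device)
    model.eval()

    messages = build_messages(video_path)
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    ).to(device)

    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated answer is decoded.
    new_tokens = output_ids[:, inputs.input_ids.shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0].strip()
```

Calling `classify("clip.mp4")` with a local video file should return a short string; treating anything other than a digit from 1 to 6 as a failed prediction is a reasonable safeguard.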

Training and evaluation data

The training data was sampled from the original video dataset so that the classes are balanced: 100 videos per class, except for class 6 (arson), for which only 65 videos were used.
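The balanced sampling described above can be sketched as follows. The sampling code actually used is not given in this card, so the function name `sample_balanced`, the `(video_path, class_id)` input layout, and the seed value are all assumptions made for illustration. A class with fewer than 100 videos simply contributes everything it has, which matches the 65 arson videos.

```python
import random
from collections import defaultdict


def sample_balanced(items, per_class=100, seed=42):
    """Pick at most `per_class` videos for each class.

    items: iterable of (video_path, class_id) pairs.
    Returns a list of (video_path, class_id) pairs.
    """
    by_class = defaultdict(list)
    for path, class_id in items:
        by_class[class_id].append(path)

    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    selected = []
    for class_id in sorted(by_class):
        paths = sorted(by_class[class_id])
        k = min(per_class, len(paths))  # small classes keep all their videos
        selected.extend((p, class_id) for p in rng.sample(paths, k))
    return selected
```

With five classes capped at 100 videos and 65 arson videos, this gives 5 * 100 + 65 = 565 videos in total.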

Each video was preprocessed to a resolution of 640x360 with the option fps=3.0. A 10-second segment in which the behavior occurs, located using the metadata, was cut from each video and used for training, so each clip contributes about 30 frames (10 seconds at 3 frames per second).
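The clip extraction can be approximated with ffmpeg, as in the sketch below. The card does not say which tool was used, nor whether fps=3.0 was applied when re-encoding or only when sampling frames, so both the use of ffmpeg and the `fps` filter are assumptions. `ffmpeg_clip_command` and `expected_frames` are hypothetical helper names.

```python
def ffmpeg_clip_command(src, dst, start_s, duration_s=10.0,
                        width=640, height=360, fps=3.0):
    """Build an ffmpeg argument list that cuts and rescales one clip.

    start_s is the segment start in seconds, e.g. taken from the metadata.
    """
    return [
        "ffmpeg", "-y",
        "-ss", str(start_s),      # seek to the start of the behavior
        "-t", str(duration_s),    # keep a fixed-length segment
        "-i", src,
        "-vf", f"scale={width}:{height},fps={fps}",  # resize, then resample
        "-an",                    # drop the audio track
        dst,
    ]


def expected_frames(duration_s=10.0, fps=3.0):
    """Number of frames a clip of this length yields at this frame rate."""
    return round(duration_s * fps)
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where ffmpeg is installed.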

At inference time, use the same prompt as above. For training, each sample used the same prompt, followed by the class number as the answer.
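A training record in this prompt-plus-answer form might look like the sketch below. It assumes the `messages`/`videos` layout that LLaMA-Factory uses in its multimodal demo datasets; the exact dataset file used for this model is not shown in the card, and `make_training_record` is a hypothetical helper. In that layout, the `<video>` placeholder at the start of the prompt appears to mark where the video is inserted.

```python
import json

# Prompt copied verbatim from the model card.
PROMPT = (
    "<video>\nWatch the video and choose the six behaviours that apply to you. "
    "[1. loitering, 2. breaking and entering, 3. abandonment, 4. falling down, 5. fighting, 6. arson]. "
    "Your answer must be a single digit, the number of the behaviour."
)


def make_training_record(video_path, class_id):
    """Pair the fixed prompt with the class digit as the assistant answer."""
    if class_id not in range(1, 7):
        raise ValueError(f"class_id must be 1-6, got {class_id}")
    return {
        "messages": [
            {"role": "user", "content": PROMPT},
            {"role": "assistant", "content": str(class_id)},
        ],
        "videos": [video_path],
    }


if __name__ == "__main__":
    # Print one example record; a dataset file would hold a list of these.
    print(json.dumps(make_training_record("clips/example.mp4", 5), indent=2))
```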

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 2
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 8
total_train_batch_size: 16
optimizer: OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08; no additional optimizer arguments
lr_scheduler_type: cosine
num_epochs: 3.0
mixed_precision_training: Native AMP
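Two of the values above follow from the others and can be checked with a few lines of arithmetic. The sketch below assumes a single training device (consistent with 2 * 8 = 16) and a cosine schedule with no warmup, since no warmup setting is listed; the total number of optimizer steps is not reported in this card, so `total_steps` is left as a parameter. Both function names are hypothetical.

```python
import math


def effective_batch_size(per_device, grad_accum, num_devices=1):
    """Samples consumed per optimizer step."""
    return per_device * grad_accum * num_devices


def cosine_lr(step, total_steps, base_lr=5e-5):
    """Learning rate at `step` for a cosine decay from base_lr to 0.

    Assumes zero warmup steps, which is not confirmed by the card.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions, the learning rate starts at 5e-05, reaches half of that value at the midpoint of training, and decays to zero at the final step.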

Training results

Framework versions

PEFT 0.12.0
Transformers 4.48.2
Pytorch 2.5.1+cu121
Datasets 3.1.0
Tokenizers 0.21.0

Model repository: Jeckmu/Qwen2-VL-2B-Instruct-lora-SurveillanceVideo-250207
