CrabInHoney committed on
Commit 7eb0432 · verified · 1 Parent(s): ecd8797

Upload README.md

Files changed (1): README.md (+118 -3)
README.md CHANGED
@@ -1,3 +1,118 @@
- ---
- license: apache-2.0
- ---

---
language:
- en
base_model:
- CrabInHoney/urlbert-tiny-base-v4
pipeline_tag: text-classification
tags:
- url
- cybersecurity
- urls
- links
- classification
- phishing-detection
- tiny
- phishing
- malware
- defacement
- transformers
- urlbert
- bert
- malicious
license: apache-2.0
---

# URLBERT-Tiny-v4 Malicious URL Classifier

This is a lightweight version of BERT, specifically fine-tuned for classifying URLs into four categories: benign, phishing, malware, and defacement.

## Model Details

- **Model size**: 3.69M parameters
- **Tensor type**: F32
- **Model weight size**: 14.8 MB
- **Base model**: [CrabInHoney/urlbert-tiny-base-v4](https://huggingface.co/CrabInHoney/urlbert-tiny-base-v4)
- **Dataset**: [Malicious URLs Dataset](https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset)

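The parameter count and weight size above can be verified directly after loading the checkpoint. This is a minimal sketch; the use of `AutoModelForSequenceClassification` here is an illustrative choice, and the `BertForSequenceClassification` class used in the usage example below works the same way:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "CrabInHoney/urlbert-tiny-v4-malicious-url-classifier"
)

# Total parameters: roughly 3.69M, stored as float32 (~4 bytes each, ~14.8 MB on disk).
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.2f}M parameters")
```
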
## Model Evaluation Results

The model was evaluated on a test set with the following classification metrics:

| Metric | Model V3 | Model V4 (this model) |
|--------|----------|-----------------------|
| **Overall Accuracy** | 0.9837 | **0.9922** |
| **F1-score (Benign)** | 0.9907 | **0.9955** |
| **F1-score (Defacement)** | 0.9937 | **0.9984** |
| **F1-score (Malware)** | 0.9741 | **0.9845** |
| **F1-score (Phishing)** | 0.9444 | **0.9734** |
| **Weighted Average F1-score** | 0.9836 | **0.9922** |

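The same kind of report (per-class F1 plus the weighted average) can be computed with `sklearn.metrics.classification_report`. The snippet below is a minimal sketch that assumes you supply your own labeled URL list, since the exact test split is not published here; it reuses the LABEL_0–LABEL_3 mapping shown in the usage example below:

```python
from transformers import pipeline
from sklearn.metrics import classification_report

classifier = pipeline(
    "text-classification",
    model="CrabInHoney/urlbert-tiny-v4-malicious-url-classifier",
)

# Raw model labels -> class names (same mapping as in the usage example below).
id2name = {"LABEL_0": "benign", "LABEL_1": "defacement",
           "LABEL_2": "malware", "LABEL_3": "phishing"}

# Hypothetical labeled evaluation data: (url, true_label) pairs.
eval_data = [
    ("wikiobits.com/Obits/TonyProudfoot", "benign"),
    ("http://www.824555.com/app/member/SportOption.php?uid=guest&langx=gb", "malware"),
]

urls, y_true = zip(*eval_data)
y_pred = [id2name[r["label"]] for r in classifier(list(urls))]

# Prints per-class precision/recall/F1 plus the weighted average, as in the table above.
print(classification_report(y_true, y_pred, digits=4))
```
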
## Usage Example

Below is an example of how to use the model for URL classification with the Hugging Face `transformers` library:

```python
from transformers import BertTokenizerFast, BertForSequenceClassification, pipeline
import torch

# Select the device (GPU if available, otherwise CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Load the model and tokenizer
model_name = "CrabInHoney/urlbert-tiny-v4-malicious-url-classifier"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)
model.to(device)

# Build a text-classification pipeline that returns scores for all classes
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,
    return_all_scores=True
)

# Example URLs to test
test_urls = [
    "wikiobits.com/Obits/TonyProudfoot",
    "http://www.824555.com/app/member/SportOption.php?uid=guest&langx=gb",
]

# Map raw model labels to human-readable class names
label_mapping = {
    "LABEL_0": "benign",
    "LABEL_1": "defacement",
    "LABEL_2": "malware",
    "LABEL_3": "phishing"
}

# Classify each URL and print the score for every class
for url in test_urls:
    results = classifier(url)
    print(f"\nURL: {url}")
    for result in results[0]:
        label = result['label']
        score = result['score']
        friendly_label = label_mapping.get(label, label)
        print(f"{friendly_label}, %: {score:.4f}")
```

### Example Output:
```
URL: wikiobits.com/Obits/TonyProudfoot
benign, %: 0.9996
defacement, %: 0.0000
malware, %: 0.0000
phishing, %: 0.0003

URL: http://www.824555.com/app/member/SportOption.php?uid=guest&langx=gb
benign, %: 0.0000
defacement, %: 0.0001
malware, %: 0.9998
phishing, %: 0.0001
```
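
If the `pipeline` wrapper is not wanted (for example, for batch scoring or export workflows), the same probabilities can be obtained by calling the model directly and applying a softmax over the logits. This is a minimal sketch; the `labels` list simply spells out the LABEL_0–LABEL_3 order used above:

```python
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

model_name = "CrabInHoney/urlbert-tiny-v4-malicious-url-classifier"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)
model.eval()

# Class order matches LABEL_0..LABEL_3 in the mapping above.
labels = ["benign", "defacement", "malware", "phishing"]
urls = [
    "wikiobits.com/Obits/TonyProudfoot",
    "http://www.824555.com/app/member/SportOption.php?uid=guest&langx=gb",
]

# Tokenize the whole batch and run one forward pass without gradient tracking.
inputs = tokenizer(urls, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Softmax converts the logits into the per-class probabilities the pipeline reports.
probs = torch.softmax(logits, dim=-1)
for url, p in zip(urls, probs):
    print(f"{url} -> {labels[int(p.argmax())]} ({p.max().item():.4f})")
```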