codechrl committed on
Commit e46bd87 · verified · 1 Parent(s): 05f0253

Training update: 12,676/238,520 rows (5.31%) | +2059 new @ 2025-10-23 07:43:06

Files changed (4)
  1. README.md +19 -21
  2. model.safetensors +1 -1
  3. training_args.bin +1 -1
  4. training_metadata.json +7 -7
README.md CHANGED
@@ -18,43 +18,43 @@ library_name: transformers
  pipeline_tag: fill-mask
  ---
  # bert-micro-cybersecurity
+
  ## 1. Model Details
  **Model description**
  "bert-micro-cybersecurity" is a compact transformer model adapted for cybersecurity text classification tasks (e.g., threat detection, incident reports, malicious vs benign content).
  - Model type: fine-tuned lightweight BERT variant
  - Languages: English & Indonesia
  - Finetuned from: `boltuix/bert-micro`
- - Status: **Early version** — trained on **3.25%** of planned data.
+ - Status: **Early version** — trained on **5.31%** of planned data.
  **Model sources**
  - Base model: [boltuix/bert-micro](https://huggingface.co/boltuix/bert-micro)
  - Data: Cybersecurity Data
+
  ## 2. Uses
  ### Direct use
  You can use this model to classify cybersecurity-related text — for example, whether a given message, report or log entry indicates malicious intent, abnormal behaviour, or threat presence.
  ### Downstream use
- - Embedding extraction for clustering or anomaly detection in security logs.
+ - Embedding extraction for clustering.
+ - Named Entity Recognition on log or security data.
+ - Classification of security data.
+ - Anomaly detection in security logs.
  - As part of a pipeline for phishing detection, malicious email filtering, incident triage.
  - As a feature extractor feeding a downstream system (e.g., alert-generation, SOC dashboard).
  ### Out-of-scope use
  - Not meant for high-stakes automated blocking decisions without human review.
  - Not optimized for languages other than English and Indonesian.
  - Not tested for non-cybersecurity domains or out-of-distribution data.
+
+ ### Downstream Usecase in Development using this model
+ - NER on security log, botnet data, and json data.
+ - Early classification of SIEM alert & events.
+
  ## 3. Bias, Risks, and Limitations
- Because the model is based on a small subset (3.25%) of planned data, performance is preliminary and may degrade on unseen or specialized domains (industrial control, IoT logs, foreign language).
+ Because the model is based on a small subset (5.31%) of planned data, performance is preliminary and may degrade on unseen or specialized domains (industrial control, IoT logs, foreign language).
  - Inherits any biases present in the base model (`boltuix/bert-micro`) and in the fine-tuning data — e.g., over-representation of certain threat types, vendor or tooling-specific vocabulary.
- - Should not be used as sole authority for incident decisions; only as an aid to human analysts.
- ## 4. How to Get Started with the Model
- ```python
- from transformers import AutoTokenizer, AutoModelForSequenceClassification
- tokenizer = AutoTokenizer.from_pretrained("codechrl/bert-micro-cybersecurity")
- model = AutoModelForSequenceClassification.from_pretrained("codechrl/bert-micro-cybersecurity")
- inputs = tokenizer("The server logged an unusual outbound connection to 123.123.123.123",
- return_tensors="pt", truncation=True, padding=True)
- outputs = model(**inputs)
- logits = outputs.logits
- predicted_class = logits.argmax(dim=-1).item()
- ```
- ## 5. Training Details
+ - **Should not be used as sole authority for incident decisions; only as an aid to human analysts.**
+
+ ## 4. Training Details

  ### Text Processing & Chunking
  Since cybersecurity data often contains lengthy alert descriptions and execution logs that exceed BERT's 512 token limit, we implement an overlapping chunking strategy:
@@ -74,11 +74,9 @@ Since cybersecurity data often contains lengthy alert descriptions and execution
  - **LR scheduler**: Linear with warmup

  ### Training Data
- - **Total database rows**: 238,469
- - **Rows processed (cumulative)**: 7,754 (3.25%)
- - **Rows in this session**: 4,922
- - **Training samples (after chunking)**: 5,000
- - **Training date**: 2025-10-23 04:53:41
+ - **Total database rows**: 238,520
+ - **Rows processed (cumulative)**: 12,676 (5.31%)
+ - **Training date**: 2025-10-23 07:43:06

  ### Post-Training Metrics
  - **Final training loss**:
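The README hunk above mentions an overlapping chunking strategy for inputs that exceed BERT's 512-token limit, but the commit does not include the chunking code. Below is a minimal sketch of one such strategy, assuming the `boltuix/bert-micro` tokenizer named in the README and illustrative window/overlap sizes (510 content tokens, 128-token overlap) that are not stated in the commit:

```python
from transformers import AutoTokenizer

# Tokenizer of the base model named in the README; window and overlap sizes
# below are illustrative, not values taken from this commit.
tokenizer = AutoTokenizer.from_pretrained("boltuix/bert-micro")

def chunk_text(text: str, max_tokens: int = 510, overlap: int = 128) -> list[str]:
    """Split a long alert/log into overlapping windows that fit BERT's limit.

    max_tokens is kept below 512 to leave room for [CLS]/[SEP] when each
    chunk is later re-tokenized with special tokens.
    """
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    step = max_tokens - overlap  # how far the window slides each iteration
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(tokenizer.decode(ids[start:start + max_tokens]))
        if start + max_tokens >= len(ids):
            break
    return chunks
```

Each decoded chunk can then be passed to the model independently and the per-chunk predictions aggregated downstream.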
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:e34d67e8d356f460943f98143843d64ac1d706bd3d53a639e1776fafddce67af
+ oid sha256:68ee1fc100b841633b706f0d5222d91775561369465c8b1c96afd2cde4b55c33
  size 17671560
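The README lists embedding extraction for clustering and anomaly detection among the downstream uses, and the weights updated above are the checkpoint that would serve it. A minimal sketch under common assumptions: it uses the repo id `codechrl/bert-micro-cybersecurity` from the README's earlier usage example, loads the base encoder via `AutoModel`, and mean-pools the last hidden state, which is one conventional pooling choice rather than the author's documented pipeline:

```python
import torch
from transformers import AutoTokenizer, AutoModel

repo = "codechrl/bert-micro-cybersecurity"  # repo id from the README's usage example
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo)  # base encoder, no task head

# Illustrative log lines; the first is taken from the README's example.
logs = [
    "The server logged an unusual outbound connection to 123.123.123.123",
    "Scheduled nightly backup completed successfully",
]

inputs = tokenizer(logs, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state          # (batch, seq_len, hidden)

# Mean-pool over non-padding tokens only, using the attention mask.
mask = inputs["attention_mask"].unsqueeze(-1).float()   # (batch, seq_len, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # one vector per log line, ready for clustering
```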
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:1b6723809aae4d6e81df5bc3af0fef812aef6e4e443ec397ab0eca959b1da657
+ oid sha256:55311dd962a4c0afe20204036e6312d1bcb507b0775ded50c5441d844c8c8238
  size 5905
training_metadata.json CHANGED
@@ -1,11 +1,11 @@
  {
- "trained_at": 1761195221.323231,
- "trained_at_readable": "2025-10-23 04:53:41",
- "samples_this_session": 5000,
- "new_rows_this_session": 4922,
- "trained_rows_total": 7754,
- "total_db_rows": 238469,
- "percentage": 3.2515756765030255,
+ "trained_at": 1761205386.5296822,
+ "trained_at_readable": "2025-10-23 07:43:06",
+ "samples_this_session": 7524,
+ "new_rows_this_session": 2059,
+ "trained_rows_total": 12676,
+ "total_db_rows": 238520,
+ "percentage": 5.314439040751299,
  "final_loss": 0,
  "epochs": 3,
  "learning_rate": 5e-05,