---
language:
- en
- id
tags:
- text-classification
- cybersecurity
base_model: boltuix/bert-micro
---

# bert-micro-cybersecurity

## 1. Model Details

**Model description**

"bert-micro-cybersecurity" is a compact transformer model derived from `boltuix/bert-micro`, adapted for cybersecurity text classification tasks (e.g., threat detection, incident reports, malicious vs. benign content).

- Model type: fine-tuned lightweight BERT variant
- Languages: English & Indonesian
- Fine-tuned from: `boltuix/bert-micro`
- Status: **Early version**, trained on **0.16%** of the planned data

**Model sources**

- Base model: [boltuix/bert-micro](https://huggingface.co/boltuix/bert-micro)
- Data: Cybersecurity Data

## 2. Uses

### Direct use

You can use this model to classify cybersecurity-related text, for example to decide whether a given message, report, or log entry indicates malicious intent, abnormal behaviour, or the presence of a threat.

### Downstream use

- Embedding extraction for clustering or anomaly detection in security logs (see the sketch at the end of this card).
- As part of a pipeline for phishing detection, malicious email filtering, or incident triage.
- As a feature extractor feeding a downstream system (e.g., alert generation or a SOC dashboard).

### Out-of-scope use

- Not meant for high-stakes automated blocking decisions without human review.
- Not optimized for languages other than English and Indonesian.
- Not tested on non-cybersecurity domains or out-of-distribution data.

## 3. Bias, Risks, and Limitations

Because the model has so far been fine-tuned on only a small subset (0.16%) of the planned data, performance is preliminary and may degrade on unseen or specialized domains (industrial control systems, IoT logs, other languages).

- Inherits any biases present in the base model (`boltuix/bert-micro`) and in the fine-tuning data (e.g., over-representation of certain threat types, or vendor- and tooling-specific vocabulary).
- Should not be used as the sole authority for incident decisions; treat its output as an aid to human analysts.

## 4. How to Get Started with the Model

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("codechrl/bert-micro-cybersecurity")
model = AutoModelForSequenceClassification.from_pretrained("codechrl/bert-micro-cybersecurity")

# Truncation keeps the input within the model's 512-token limit.
inputs = tokenizer("The server logged an unusual outbound connection to 123.123.123.123",
                   return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_class])  # label name stored in the model config
```

## 5. Training Details

- **Trained records**: 110 / 67,618 (0.16%)
- **Learning rate**: 5e-05
- **Epochs**: 3
- **Batch size**: 8
- **Max sequence length**: 512
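
The hyperparameters above correspond to a standard Hugging Face `Trainer` fine-tuning run. The sketch below shows what such a run looks like with those values; it is illustrative only, and the example records and binary label scheme (`num_labels=2`) are placeholder assumptions, since the actual corpus and label set are not specified in this card.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder records; the real fine-tuning corpus is not included in this card.
records = [
    {"text": "Multiple failed SSH logins from a single source IP", "label": 1},
    {"text": "Scheduled nightly backup completed successfully", "label": 0},
]
dataset = Dataset.from_list(records)

tokenizer = AutoTokenizer.from_pretrained("boltuix/bert-micro")
model = AutoModelForSequenceClassification.from_pretrained("boltuix/bert-micro", num_labels=2)

def tokenize(batch):
    # Max sequence length of 512, as listed above.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

args = TrainingArguments(
    output_dir="bert-micro-cybersecurity",
    learning_rate=5e-5,             # as listed above
    num_train_epochs=3,             # as listed above
    per_device_train_batch_size=8,  # as listed above
)

trainer = Trainer(model=model, args=args, train_dataset=dataset.map(tokenize, batched=True))
trainer.train()
```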
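
As noted under Downstream use, the fine-tuned encoder can also serve as an embedding extractor for clustering or anomaly detection over security logs. A minimal sketch, assuming mean pooling over non-padding tokens (a common convention; this card does not prescribe a pooling strategy):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codechrl/bert-micro-cybersecurity")
# AutoModel loads the encoder only, dropping the classification head.
encoder = AutoModel.from_pretrained("codechrl/bert-micro-cybersecurity")

logs = [
    "Multiple failed SSH logins from 10.0.0.5",
    "Nightly backup completed successfully",
]
inputs = tokenizer(logs, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (batch, seq_len, hidden_size)

# Mean-pool over non-padding tokens to get one vector per log line.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)
```

The resulting vectors can then be fed to any clustering or nearest-neighbour method.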