|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- Shuu12121/python-treesitter-filtered-datasetsV2 |
|
- Shuu12121/javascript-treesitter-filtered-datasetsV2 |
|
- Shuu12121/ruby-treesitter-filtered-datasetsV2 |
|
- Shuu12121/php-treesitter-filtered-datasetsV2 |
|
- Shuu12121/rust-treesitter-filtered-datasetsV2 |
|
- Shuu12121/typescript-treesitter-filtered-datasetsV2 |
|
- Shuu12121/java-treesitter-dedupe_doc-filtered-dataset |
|
- Shuu12121/go-treesitter-dedupe_doc-filtered-dataset |
|
- code-search-net/code_search_net |
|
language: |
|
- en |
|
tags: |
|
- code |
|
- python |
|
- php |
|
- java |
|
- javascript |
|
- go |
|
- ruby |
|
- rust |
|
- typescript |
|
pipeline_tag: fill-mask |
|
--- |
|
|
|
# 🦉CodeModernBERT-Owl-4.1 |
|
|
|
**CodeModernBERT-Owl-4.1** is a pre-trained multilingual, long-context encoder model in the CodeModernBERT series. It is optimized for downstream code-related tasks such as code search, code summarization, bug repair, and representation learning. |
|
|
|
This model builds on the pretraining checkpoint **CodeModernBERT-Owl-4.1-Pre** and was further pre-trained to better capture structural patterns and semantics in source code across multiple programming languages. |
|
|
|
--- |
|
|
|
## 🚀 Model Highlights |
|
|
|
- 2048-token context window for long code understanding |
|
- Trained on 9.9M functions in 8 programming languages |
|
- Fine-tuned for downstream usability |
|
- Ideal for code search, semantic embedding, summarization, and cloze-style bug repair (see the fill-mask sketch after this list) |
|
- Multilingual support: Python, JavaScript, Java, TypeScript, PHP, Go, Ruby, and Rust |
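
Because the model is trained with a masked-language-modeling objective (hence `pipeline_tag: fill-mask`), cloze-style repair amounts to masking a suspect token and letting the model rank replacements. A minimal sketch using the standard `fill-mask` pipeline; the `add` snippet below is an illustrative example, not from the model card:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="Shuu12121/CodeModernBERT-Owl-4.1")

# Mask the operator in a suspect line and ask the model for candidates
masked = f"def add(a, b):\n    return a {fill.tokenizer.mask_token} b"

for pred in fill(masked, top_k=3):
    print(pred["token_str"], pred["score"])
```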
|
|
|
--- |
|
|
|
## Architecture |
|
|
|
- Base: ModernBERT-style encoder |
|
- Hidden size: 768 (see the config check after this list) |
|
- Layers: 12 |
|
- Attention heads: 12 |
|
- Parameters: ~150M |
|
- Pretraining: Masked Language Modeling (MLM) |
|
- Fine-tuning: Domain-specific code tasks |
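
These figures can be spot-checked against the published configuration. A small sketch using `AutoConfig`; the attribute names follow the standard ModernBERT config in `transformers` and are assumptions here:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Shuu12121/CodeModernBERT-Owl-4.1")

# Expect hidden_size=768, num_hidden_layers=12, num_attention_heads=12
print(cfg.hidden_size, cfg.num_hidden_layers, cfg.num_attention_heads)

# Context window; expect 2048 per the highlights above
print(cfg.max_position_embeddings)
```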
|
|
|
--- |
|
|
|
## 🧪 Usage (Hugging Face Transformers) |
|
|
|
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Owl-4.1")
model = AutoModel.from_pretrained("Shuu12121/CodeModernBERT-Owl-4.1")

code = "def factorial(n):\n    if n <= 1:\n        return 1\n    return n * factorial(n - 1)"
inputs = tokenizer(code, return_tensors="pt", padding=True, truncation=True)

# Inference only: no gradients needed for embedding extraction
with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling: average the token embeddings, masking out padding positions
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

embeddings = mean_pooling(outputs, inputs["attention_mask"])
```
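
Since the model targets code search, the pooled embeddings can be compared with cosine similarity. A minimal sketch continuing from the snippet above; the query string and the `embed` helper are illustrative:

```python
import torch.nn.functional as F

def embed(texts):
    # Tokenize a batch of strings and mean-pool the encoder outputs
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**batch)
    return mean_pooling(out, batch["attention_mask"])

query_emb = embed(["compute the factorial of a number recursively"])
code_emb = embed([code])

# Cosine similarity between the natural-language query and the code snippet
print(F.cosine_similarity(query_emb, code_emb).item())
```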