|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- Shuu12121/python-treesitter-filtered-datasetsV2 |
|
- Shuu12121/javascript-treesitter-filtered-datasetsV2 |
|
- Shuu12121/ruby-treesitter-filtered-datasetsV2 |
|
- Shuu12121/php-treesitter-filtered-datasetsV2 |
|
- Shuu12121/rust-treesitter-filtered-datasetsV2 |
|
- Shuu12121/typescript-treesitter-filtered-datasetsV2 |
|
- Shuu12121/java-treesitter-dedupe_doc-filtered-dataset |
|
- Shuu12121/go-treesitter-dedupe_doc-filtered-dataset |
|
- code-search-net/code_search_net |
|
language: |
|
- en |
|
tags: |
|
- code |
|
- python |
|
- php |
|
- java |
|
- javascript |
|
- go |
|
- ruby |
|
- rust |
|
- typescript |
|
pipeline_tag: fill-mask |
|
--- |
|
|
|
# 🦉CodeModernBERT-Owl-4.1 |
|
|
|
**CodeModernBERT-Owl-4.1** is a pre-trained multilingual, long-context encoder model in the CodeModernBERT series. It is optimized for downstream code-related tasks such as code search, code summarization, bug repair, and representation learning. |
|
|
|
This model builds on the pretraining checkpoint **CodeModernBERT-Owl-4.1-Pre** and was further pre-trained to better capture structural patterns and semantics in source code across multiple programming languages. |
|
|
|
--- |
|
|
|
## 🚀 Model Highlights |
|
|
|
- 2048-token context window for long code understanding |
|
- Trained on 9.9M functions in 8 programming languages |
|
- Fine-tuned for downstream usability |
|
- Ideal for code search, semantic embedding, summarization, and cloze-style bug repair (see the fill-mask sketch after this list) |
|
- Multilingual support: Python, JavaScript, Java, TypeScript, PHP, Go, Ruby, and Rust |
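
Because the model is trained with a masked-language-modeling objective (hence `pipeline_tag: fill-mask`), cloze-style repair amounts to masking a suspect token and letting the model rank replacements. A minimal sketch using the standard `fill-mask` pipeline; the `add` snippet below is an illustrative example, not from the model card:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="Shuu12121/CodeModernBERT-Owl-4.1")

# Mask the operator in a suspect line and ask the model for candidates
masked = f"def add(a, b):\n    return a {fill.tokenizer.mask_token} b"

for pred in fill(masked, top_k=3):
    print(pred["token_str"], pred["score"])
```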
|
|
|
--- |
|
|
|
## Architecture |
|
|
|
- Base: ModernBERT-style encoder |
|
- Hidden size: 768 (see the config check after this list) |
|
- Layers: 12 |
|
- Attention heads: 12 |
|
- Parameters: ~150M |
|
- Pretraining: Masked Language Modeling (MLM) |
|
- Fine-tuning: Domain-specific code tasks |
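
These figures can be spot-checked against the published configuration. A small sketch using `AutoConfig`; the attribute names follow the standard ModernBERT config in `transformers` and are assumptions here:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Shuu12121/CodeModernBERT-Owl-4.1")

# Expect hidden_size=768, num_hidden_layers=12, num_attention_heads=12
print(cfg.hidden_size, cfg.num_hidden_layers, cfg.num_attention_heads)

# Context window; expect 2048 per the highlights above
print(cfg.max_position_embeddings)
```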
|
|
|
--- |
|
|
|
## 🧪 Usage (Hugging Face Transformers) |
|
|
|
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Owl-4.1")
model = AutoModel.from_pretrained("Shuu12121/CodeModernBERT-Owl-4.1")

code = "def factorial(n):\n    if n <= 1:\n        return 1\n    return n * factorial(n - 1)"
inputs = tokenizer(code, return_tensors="pt", padding=True, truncation=True)

# Inference only: no gradients needed for embedding extraction
with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling: average the token embeddings, masking out padding positions
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

embeddings = mean_pooling(outputs, inputs["attention_mask"])
```
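
Since the model targets code search, the pooled embeddings can be compared with cosine similarity. A minimal sketch continuing from the snippet above; the query string and the `embed` helper are illustrative:

```python
import torch.nn.functional as F

def embed(texts):
    # Tokenize a batch of strings and mean-pool the encoder outputs
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**batch)
    return mean_pooling(out, batch["attention_mask"])

query_emb = embed(["compute the factorial of a number recursively"])
code_emb = embed([code])

# Cosine similarity between the natural-language query and the code snippet
print(F.cosine_similarity(query_emb, code_emb).item())
```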