File size: 5,092 Bytes

---
base_model: 99eren99/ModernBERT-base-Turkish-uncased-mlm
language:
- tr
library_name: PyLate
pipeline_tag: sentence-similarity
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- generated_from_trainer
- reranker
- bert
license: apache-2.0
---

# Turkish Long Context ColBERT Based Reranker

This is a [PyLate](https://github.com/lightonai/pylate) model finetuned from [99eren99/ModernBERT-base-Turkish-uncased-mlm](99eren99/ModernBERT-base-Turkish-uncased-mlm). It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.

# Model Sources

- **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)
- **Repository:** [PyLate on GitHub](https://github.com/lightonai/pylate)
- **Hugging Face:** [PyLate models on Hugging Face](https://huggingface.co/models?library=PyLate)

# Evaluation Results

nDCG and Recall scores for long context late interaction retrieval models, test code and detailed metrics in ["./assets"](https://huggingface.co/99eren99/ColBERT-ModernBERT-base-Turkish-uncased/tree/main/assets)
<img src="https://huggingface.co/99eren99/ColBERT-ModernBERT-base-Turkish-uncased/resolve/main/assets/tokenlengths.png" 
alt="drawing"/>

# Usage
First install the PyLate library:

```bash
pip install -U einops flash_attn
pip install -U pylate
```
Then normalize your text - > lambda x: x.replace("İ", "i").replace("I", "ı").lower()

# Retrieval 

PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index leverages the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval.

# Indexing documents

First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:

```python
from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model
document_length = 180#some integer [0,8192] for truncating documents, you can maybe try rope scaling for longer inputs  
model = models.ColBERT(
    model_name_or_path="99eren99/ColBERT-ModernBERT-base-Turkish-uncased", document_length=document_length
)
try:
    model.tokenizer.model_input_names.remove("token_type_ids")
except:
    pass
#model.to("cuda")

# Step 2: Initialize the Voyager index
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)
```

Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:

```python
# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
)
```

# Retrieving top-k documents for queries

Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
To do so, initialize the ColBERT retriever with the index you want to search in, encode the queries and then retrieve the top-k documents to get the top matches ids and relevance scores:

```python
# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  #  # Ensure that it is set to False to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings, 
    k=10,  # Retrieve the top 10 matches for each query
)
```

# Reranking
If you only want to use the ColBERT model to perform reranking on top of your first-stage retrieval pipeline without building an index, you can simply use rank function and pass the queries and documents to rerank:

```python
from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path=pylate_model_id,
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```