Update README.md
<p>
</h4>

This is a [ColBERT](https://doi.org/10.48550/arXiv.2112.01488) model that can be used for semantic search in many languages. It encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators, making it suitable for tasks like semantic search and clustering. The model uses an [XMOD](https://huggingface.co/facebook/xmod-base) backbone, which allows it to learn from monolingual fine-tuning in a high-resource language, like English, and perform zero-shot retrieval across multiple languages.
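To make the MaxSim scoring concrete, here is a minimal illustrative sketch (not the library's optimized implementation): each query token embedding is matched against its most similar passage token embedding, and the per-token maxima are summed into the passage score.

```python
import numpy as np

def maxsim_score(q_emb: np.ndarray, d_emb: np.ndarray) -> float:
    """Late-interaction (MaxSim) score: for each query token, take the
    best-matching passage token similarity, then sum over query tokens."""
    sim = q_emb @ d_emb.T  # (num_query_tokens, num_passage_tokens) similarities
    return float(sim.max(axis=1).sum())

# Toy 2-dimensional embeddings: a 2-token query vs. a 3-token passage
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
print(maxsim_score(q, d))  # 0.9 + 1.0 = 1.9
```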

## Usage

Start by installing the [colbert-ir](https://github.com/stanford-futuredata/ColBERT) package and some extra requirements:

```bash
pip install git+https://github.com/stanford-futuredata/ColBERT.git@main#egg=colbert-ir torch==2.1.2 faiss-gpu==1.7.2 langdetect==1.0.9
```

Then, you can use the model like this:

```python
# Use of custom modules that automatically detect the language of the passages to index
# and activate the language-specific adapters accordingly
from .custom import CustomIndexer, CustomSearcher
from colbert.infra import Run, RunConfig

n_gpu: int = 1                # Set your number of available GPUs
experiment: str = "colbert"   # Name of the folder where the logs and created indices will be stored
index_name: str = "my_index"  # The name of your index, i.e. the name of your vector database
documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."]  # Corpus

# Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    indexer = CustomIndexer(checkpoint="antoinelouis/colbert-xm")
    indexer.index(name=index_name, collection=documents)

# Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    searcher = CustomSearcher(index=index_name)  # You don't need to specify the checkpoint again: the model name is stored in the index.
    results = searcher.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
    # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
```
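The returned tuple can be mapped back to the corpus by passage id. A minimal sketch, using a hand-written stand-in for the searcher output in the documented format (not real search results):

```python
documents = ["Ceci est un premier document.", "Voici un second document."]

# Hand-written stand-in in the documented ((passage_id, passage_rank, passage_score), ...) format
results = ((1, 1, 17.3), (0, 2, 12.8))

for passage_id, rank, score in results:
    print(f"{rank}. (score={score}) {documents[passage_id]}")
# → 1. (score=17.3) Voici un second document.
#   2. (score=12.8) Ceci est un premier document.
```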
***