Update README.md
<p>
</h4>

This is a [ColBERT](https://doi.org/10.48550/arXiv.2112.01488) model that can be used for semantic search in many languages. It encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators, making it suitable for tasks like semantic search and clustering. The model uses an [XMOD](https://huggingface.co/facebook/xmod-base) backbone, which allows it to learn from monolingual fine-tuning in a high-resource language, like English, and perform zero-shot retrieval across multiple languages.
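To make the MaxSim scoring concrete, here is a minimal illustrative sketch (not the library's optimized implementation): each query token embedding is matched against its most similar passage token embedding, and the per-token maxima are summed into the passage score.

```python
import numpy as np

def maxsim_score(q_emb: np.ndarray, d_emb: np.ndarray) -> float:
    """Late-interaction (MaxSim) score: for each query token, take the
    best-matching passage token similarity, then sum over query tokens."""
    sim = q_emb @ d_emb.T  # (num_query_tokens, num_passage_tokens) similarities
    return float(sim.max(axis=1).sum())

# Toy 2-dimensional embeddings: a 2-token query vs. a 3-token passage
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
print(maxsim_score(q, d))  # 0.9 + 1.0 = 1.9
```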

## Usage

Start by installing the [colbert-ir](https://github.com/stanford-futuredata/ColBERT) package and some extra requirements:

```bash
pip install git+https://github.com/stanford-futuredata/ColBERT.git@main#egg=colbert-ir torch==2.1.2 faiss-gpu==1.7.2 langdetect==1.0.9
```

Then, you can use the model like this:

```python
# Use of custom modules that automatically detect the language of the passages to index
# and activate the language-specific adapters accordingly
from .custom import CustomIndexer, CustomSearcher
from colbert.infra import Run, RunConfig

n_gpu: int = 1                # Set your number of available GPUs
experiment: str = "colbert"   # Name of the folder where the logs and created indices will be stored
index_name: str = "my_index"  # The name of your index, i.e. the name of your vector database
documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."]  # Corpus

# Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    indexer = CustomIndexer(checkpoint="antoinelouis/colbert-xm")
    indexer.index(name=index_name, collection=documents)

# Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    searcher = CustomSearcher(index=index_name)  # You don't need to specify the checkpoint again: the model name is stored in the index.
    results = searcher.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
    # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
```
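The returned tuple can be mapped back to the corpus by passage id. A minimal sketch, using a hand-written stand-in for the searcher output in the documented format (not real search results):

```python
documents = ["Ceci est un premier document.", "Voici un second document."]

# Hand-written stand-in in the documented ((passage_id, passage_rank, passage_score), ...) format
results = ((1, 1, 17.3), (0, 2, 12.8))

for passage_id, rank, score in results:
    print(f"{rank}. (score={score}) {documents[passage_id]}")
# → 1. (score=17.3) Voici un second document.
#   2. (score=12.8) Ceci est un premier document.
```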
***