Upload README.md with huggingface_hub
README.md (CHANGED)
@@ -70,31 +70,59 @@ with torch.no_grad():
print(attentions[-1].shape) # (2, 20, 11, 11)
```

### Contact prediction
Because attentions can be output with the naive attention implementation, contact prediction is also supported:

```python
with torch.no_grad():
    contact_map = model.predict_contacts(**tokenized).squeeze().cpu().numpy() # (seq_len, seq_len)
```
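
If you want a quick look at the result, the predicted map can be plotted directly. This is just an illustrative sketch (it assumes matplotlib is installed and reuses `contact_map` from the snippet above):

```python
import matplotlib.pyplot as plt

# contact_map is the (seq_len, seq_len) numpy array returned by predict_contacts above
plt.imshow(contact_map, cmap='viridis')
plt.colorbar(label='contact probability')
plt.xlabel('residue index')
plt.ylabel('residue index')
plt.savefig('contact_map.png', dpi=200)
```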
## Embed entire datasets with no new code
To embed a list of protein sequences **fast**, just call embed_dataset. Sequences are sorted to reduce padding tokens, so the initial progress bar estimate is usually much longer than the actual time it will take.

Example:
```python
embedding_dict = model.embed_dataset(
    sequences=[
        'MALWMRLLPLLALLALWGPDPAAA', ... # list of protein sequences
    ],
    batch_size=2, # adjust for your GPU memory
    max_len=512, # adjust for your needs
    full_embeddings=False, # if True, no pooling is performed
    embed_dtype=torch.float32, # cast to the dtype you want
    pooling_type=['mean', 'cls'], # more than one pooling type will be concatenated together
    num_workers=0, # if you have many CPU cores, we find that num_workers=4 is fast for large datasets
    sql=False, # if True, embeddings will be stored in a SQLite database
    sql_db_path='embeddings.db',
    save=True, # if True, embeddings will be saved as a .pth file
    save_path='embeddings.pth',
)
# embedding_dict maps each sequence to its embedding: tensors for .pth, numpy arrays for sql
```
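
With save=True the dictionary is also written to save_path, so it can be reloaded later without re-embedding anything. A minimal sketch, assuming embeddings.pth holds the sequence-to-tensor dictionary described above:

```python
import torch

# load the saved dictionary and look up the pooled embedding for one sequence
embedding_dict = torch.load('embeddings.pth', map_location='cpu')
emb = embedding_dict['MALWMRLLPLLALLALWGPDPAAA']
print(emb.shape)
```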
```
model.embed_dataset()
Args:
    sequences: List of protein sequences
    batch_size: Batch size for processing
    max_len: Maximum sequence length
    full_embeddings: Whether to return full residue-wise embeddings (True) or pooled embeddings (False)
    pooling_type: Type of pooling ('mean' or 'cls')
    num_workers: Number of workers for data loading; 0 uses the main process
    sql: Whether to store embeddings in a SQLite database - stored in float32
    sql_db_path: Path to the SQLite database

Returns:
    Dictionary mapping sequences to embeddings, or None if sql=True

Note:
    - If sql=True, embeddings can only be stored in float32
    - sql is ideal if you need to stream a very large dataset for training in real time
    - save=True is ideal if you can store the entire embedding dictionary in RAM
    - sql is used whenever sql=True, regardless of the save setting
    - If the SQLite database or .pth file already exists, it is scanned first for already-embedded sequences
    - Sequences are truncated to max_len and sorted by length in descending order for faster processing
```
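
When sql=True, the embeddings live in a SQLite database on disk rather than in RAM, which is what makes streaming very large datasets practical. The exact table layout is not documented here, so the sketch below only illustrates the general pattern; the table and column names (embeddings, sequence, embedding) are assumptions to adapt to whatever embed_dataset actually creates, and the stored values are decoded as float32 per the note above:

```python
import sqlite3
import numpy as np

# Hypothetical schema: replace the table/column names with the ones embed_dataset writes.
conn = sqlite3.connect('embeddings.db')
for sequence, blob in conn.execute('SELECT sequence, embedding FROM embeddings'):
    emb = np.frombuffer(blob, dtype=np.float32)  # embeddings are stored as float32
    # ... feed (sequence, emb) into your training pipeline
conn.close()
```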