lhallee committed
Commit 8328c52 · verified · 1 Parent(s): e7f0b63

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +49 -21
README.md CHANGED
@@ -70,31 +70,59 @@ with torch.no_grad():
 print(attentions[-1].shape) # (2, 20, 11, 11)
 ```
 
+### Contact prediction
+Because we can output attentions using the naive attention implementation, contact prediction is also supported:
+```python
+with torch.no_grad():
+    contact_map = model.predict_contacts(**tokenized).squeeze().cpu().numpy() # (seq_len, seq_len)
+```
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/62f2bd3bdb7cbd214b658c48/9707OSXZ3Wdgn0Ni-55T-.png)
+
 ## Embed entire datasets with no new code
-To embed a list of protein sequences **fast**, just call embed_dataset. Sequences are sorted to reduce padding tokens, so the initial progress bar estimation is usually much longer than the actual time.
+To embed a list of protein sequences **fast**, just call embed_dataset. Sequences are sorted to reduce padding tokens, so the initial progress bar estimate is usually much longer than the actual time it will take.
+
+Example:
 ```python
-embeddings = model.embed_dataset(
-    sequences=sequences, # list of protein strings
-    batch_size=16, # embedding batch size
-    max_len=2048, # truncate to max_len
-    full_embeddings=True, # return residue-wise embeddings
-    full_precision=False, # store as float32
-    pooling_type='mean', # use mean pooling if protein-wise embeddings
-    num_workers=0, # data loading num workers
-    sql=False, # return dictionary of sequences and embeddings
+embedding_dict = model.embed_dataset(
+    sequences=[
+        'MALWMRLLPLLALLALWGPDPAAA', ... # list of protein sequences
+    ],
+    batch_size=2, # adjust for your GPU memory
+    max_len=512, # adjust for your needs
+    full_embeddings=False, # if True, no pooling is performed
+    embed_dtype=torch.float32, # cast to the dtype you want
+    pooling_type=['mean', 'cls'], # more than one pooling type will be concatenated together
+    num_workers=0, # if you have many CPU cores, we find that num_workers=4 is fast for large datasets
+    sql=False, # if True, embeddings will be stored in a SQLite database
+    sql_db_path='embeddings.db',
+    save=True, # if True, embeddings will be saved as a .pth file
+    save_path='embeddings.pth',
 )
+# embedding_dict is a dictionary mapping sequences to their embeddings, as tensors for .pth or numpy arrays for sql
+```
 
-_ = model.embed_dataset(
-    sequences=sequences, # list of protein strings
-    batch_size=16, # embedding batch size
-    max_len=2048, # truncate to max_len
-    full_embeddings=True, # return residue-wise embeddings
-    full_precision=False, # store as float32
-    pooling_type='mean', # use mean pooling if protein-wise embeddings
-    num_workers=0, # data loading num workers
-    sql=True, # store sequences in local SQL database
-    sql_db_path='embeddings.db', # path to .db file of choice
-)
+```
+model.embed_dataset()
+Args:
+    sequences: List of protein sequences
+    batch_size: Batch size for processing
+    max_len: Maximum sequence length
+    full_embeddings: Whether to return full residue-wise embeddings (True) or pooled embeddings (False)
+    pooling_type: Type of pooling ('mean' or 'cls')
+    num_workers: Number of workers for data loading, 0 for the main process
+    sql: Whether to store embeddings in a SQLite database - they will be stored in float32
+    sql_db_path: Path to the SQLite database
+
+Returns:
+    Dictionary mapping sequences to embeddings, or None if sql=True
+
+Note:
+    - If sql=True, embeddings can only be stored in float32
+    - sql is ideal if you need to stream a very large dataset for training in real time
+    - save=True is ideal if you can store the entire embedding dictionary in RAM
+    - If sql=True, it takes precedence over save
+    - If your SQLite database or .pth file is already present, it will be scanned first for already-embedded sequences
+    - Sequences will be truncated to max_len and sorted by length in descending order for faster processing
 ```
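
The image above renders the predicted contact map. A minimal sketch for reproducing a similar figure from the `contact_map` array in the snippet, assuming matplotlib is available (the styling choices are illustrative, not from this commit):

```python
import matplotlib.pyplot as plt

# Illustrative plot of the (seq_len, seq_len) contact map computed above;
# the colormap and labels are our choices, not part of the committed README.
plt.imshow(contact_map, cmap='Greys')
plt.xlabel('Residue index')
plt.ylabel('Residue index')
plt.colorbar(label='Contact probability')
plt.tight_layout()
plt.savefig('contact_map.png', dpi=150)
```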
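
The notes above suggest two read-back patterns. First, since `save=True` writes the returned dictionary to a `.pth` file, it can presumably be reloaded with `torch.load`; this sketch assumes the file holds a plain `{sequence: tensor}` dictionary, matching the return description:

```python
import torch

# Reload embeddings written by embed_dataset(..., save=True, save_path='embeddings.pth').
# Assumption: the file stores a plain {sequence: torch.Tensor} dictionary,
# matching the "Dictionary mapping sequences to embeddings" return described above.
embedding_dict = torch.load('embeddings.pth', map_location='cpu')
embedding = embedding_dict['MALWMRLLPLLALLALWGPDPAAA']
print(embedding.shape) # e.g. (pooled_dim,) when full_embeddings=False
```

Second, for the `sql=True` streaming path: the database schema is not shown in this commit, so the table and column names below are assumptions for illustration only; per the note, values are stored in float32 and come back as numpy arrays:

```python
import sqlite3
import numpy as np

def stream_embeddings(db_path='embeddings.db'):
    """Hypothetical reader for the database written by embed_dataset(..., sql=True).
    The table/column names ('embeddings', 'sequence', 'embedding') are assumed;
    inspect the real schema (e.g. `.schema` in the sqlite3 CLI) before relying on them."""
    conn = sqlite3.connect(db_path)
    try:
        for sequence, blob in conn.execute('SELECT sequence, embedding FROM embeddings'):
            yield sequence, np.frombuffer(blob, dtype=np.float32) # float32 per the note above
    finally:
        conn.close()

for sequence, embedding in stream_embeddings():
    print(sequence, embedding.shape)
    break
```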