Update README.md
README.md CHANGED
````diff
@@ -32,7 +32,7 @@ tokenized = tokenizer(sequences, padding=True, return_tensors='pt')
 with torch.no_grad():
     embeddings = model(**tokenized).last_hidden_state
 
-print(embeddings.shape) # (1, 11, 1280)
+print(embeddings.shape) # (2, 11, 1280)
 ```
 
 ### For working with sequence logits
````
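The change in this hunk is the shape comment: the example batches two sequences, so the leading dimension is 2 rather than 1. The 11 comes from the longer sequence 'MSEQWENCE' (9 residues plus the BOS and EOS tokens the ESM2 tokenizer adds), and 1280 is the hidden size of the 650M ESM2 architecture. A quick sanity check of that arithmetic, assuming the README's two example sequences and standard ESM2 tokenization:

```python
# Sanity check for the (2, 11, 1280) shape above (no model required).
# Assumes the README's example sequences and standard ESM2 tokenization:
# one BOS and one EOS token per sequence, shorter sequences padded to the longest.
sequences = ['MPRTEIN', 'MSEQWENCE']
batch_size = len(sequences)                   # 2
seq_len = max(len(s) for s in sequences) + 2  # 9 residues + BOS + EOS = 11
hidden_size = 1280                            # ESM2-650M embedding width
print((batch_size, seq_len, hidden_size))     # (2, 11, 1280)
```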
````diff
@@ -40,18 +40,26 @@ print(embeddings.shape) # (1, 11, 1280)
 import torch
 from transformers import AutoModelForMaskedLM, AutoTokenizer
 
-model_path = 'Synthyra/FastESM2_650'
 model = AutoModelForMaskedLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).eval()
-tokenizer = model.tokenizer
-
-sequences = ['MPRTEIN', 'MSEQWENCE']
-tokenized = tokenizer(sequences, padding=True, return_tensors='pt')
 with torch.no_grad():
     logits = model(**tokenized).logits
 
-print(logits.shape) # (
+print(logits.shape) # (2, 11, 33)
 ```
 
+### For working with attention maps
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+
+model = AutoModel.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).eval()
+with torch.no_grad():
+    attentions = model(**tokenized, output_attentions=True).attentions # tuple of (batch_size, num_heads, seq_len, seq_len)
+
+print(attentions[-1].shape) # (2, 20, 11, 11)
+```
+
+
 ## Embed entire datasets with no new code
 To embed a list of protein sequences **fast**, just call embed_dataset. Sequences are sorted to reduce padding tokens, so the initial progress bar estimate is usually much longer than the actual time.
 ```python
````
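The `embed_dataset` call itself is cut off at the end of this hunk. As a rough sketch of what the paragraph above describes: `embed_dataset` is the helper the README names, but the `batch_size` argument and the dictionary return shown below are assumptions rather than the documented signature, so defer to the model card for the real API.

```python
import torch
from transformers import AutoModel

# Hedged sketch only: embed_dataset ships with the model's remote code; the
# argument names and return layout below are assumptions, not the documented API.
model = AutoModel.from_pretrained(
    'Synthyra/FastESM2_650',
    torch_dtype=torch.float16,
    trust_remote_code=True,
).eval()

sequences = ['MPRTEIN', 'MSEQWENCE']
embedding_dict = model.embed_dataset(
    sequences,     # list of protein sequences; sorted internally to reduce padding
    batch_size=2,  # assumed argument: sequences per forward pass
)
# Assumed return: dict mapping each input sequence to its embedding tensor
for seq, emb in embedding_dict.items():
    print(seq, emb.shape)
```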