---
license: mit
pipeline_tag: any-to-any
library_name: transformers
tags:
- biomolecules
- proteins
- molecules
- multimodal
- language-model
- instruction-tuned
- llama
---

<div align="center">

<h3>InstructBioMol: A Multimodal LLM for Biomolecule Understanding and Design</h3>

<p align="center">
  <a href="https://arxiv.org/abs/2410.07919">Paper</a> |
  <a href="https://github.com/HICAI-ZJU/InstructBioMol">Project</a> |
  <a href="#quickstart">Quickstart</a> |
  <a href="#citation">Citation</a>
</p>
</div>

### Model Description

InstructBioMol is a multimodal large language model that bridges natural language with biomolecules (proteins and small molecules). It achieves any-to-any alignment between natural language, molecules, and proteins through comprehensive instruction tuning.

*For detailed information, please refer to our [paper](https://arxiv.org/abs/2410.07919) and [code repository](https://github.com/HICAI-ZJU/InstructBioMol).*

### Released Variants

| Model Name | Stage | Multimodal | Description |
|------------|-------|------------|-------------|
| [InstructBioMol-base](https://huggingface.co/hicai-zju/InstructBioMol-base) | Pretraining | ❎ | Continually pretrained model on molecular sequences, protein sequences, and scientific literature. |
| [InstructBioMol-instruct-stage1](https://huggingface.co/hicai-zju/InstructBioMol-instruct-stage1) (*this model*) | Instruction tuning (stage 1) | ✅ | Stage 1 instruction-tuned model with biomolecular multimodal processing capabilities (e.g., 3D molecules/proteins). |
| [InstructBioMol-instruct](https://huggingface.co/hicai-zju/InstructBioMol-instruct) | Instruction tuning (stages 1 and 2) | ✅ | Fully instruction-tuned model (stages 1 & 2) with biomolecular multimodal processing capabilities (e.g., 3D molecules/proteins). |

### Training Details

**Base Model**: InstructBioMol-base

**Training Data**:

1. Molecule - Natural Language Alignment:
   - 60 million pairs from PubChem and ChEBI
2. Protein - Natural Language Alignment:
   - 35 million pairs from UniProt (Swiss-Prot and TrEMBL)
3. Molecule - Protein Alignment:
   - 1 million pairs from BindingDB and Rhea
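
This card does not document the on-disk format of those alignment records, so the sketch below is only an illustration of how such instruction-tuning pairs could be organized as instruction/input/output triples; the field names, example annotations, and the `alignment_examples` variable are assumptions, not the released schema.

```python
# Hypothetical record layout (illustration only; not the released data format).
alignment_examples = [
    {   # molecule -> natural language (e.g., a PubChem/ChEBI-style annotation)
        "instruction": "Please describe this molecule.",
        "input": "CC(=O)OC1=CC=CC=C1C(=O)O",   # SMILES (aspirin)
        "output": "A benzoic acid bearing an acetoxy substituent ...",
    },
    {   # natural language -> molecule
        "instruction": "Generate a molecule matching this description.",
        "input": "A non-steroidal anti-inflammatory drug that inhibits COX enzymes.",
        "output": "CC(=O)OC1=CC=CC=C1C(=O)O",
    },
    {   # protein -> natural language (e.g., a Swiss-Prot-style annotation)
        "instruction": "Please describe this protein.",
        "input": "MSKKLVSGTDVAEYLL...",        # amino-acid sequence (truncated)
        "output": "Catalyzes ...",
    },
]
```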


**Training Objective**: Instruction tuning

### Quickstart

You can use InstructBioMol with the `transformers` library by setting `trust_remote_code=True`. The model handles multimodal inputs, specifically proteins and molecules, as demonstrated below.

```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load the model and tokenizer
model_name = "hicai-zju/InstructBioMol-instruct-stage1"  # or "hicai-zju/InstructBioMol-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Example: generate a description for a protein sequence.
# The <PROT>/<MOL> prompt format follows the examples in this card; see the
# project repository for the exact inference interface.
protein_sequence = (
    "MSKKLVSGTDVAEYLLSVQKEELGDLTLEIDELKTVTLTRIAQLKDFGSGSIPVEAVKLINQENILFLLGTLGIGKTTTT"
    "LLKRIISDKDFGFYSSADKLYDYKGYVVFGESVAGAEADWTSKIDVVVAPFTSIDETAKLLAKLTPDVSVLGQAVAVKGA"
    "LRILGMDDAAQRVADIVGLAVTGQIVKLAANAGADLLEALKLPEVVVVGNGVAYALDGRLKAEFSLDTAVADGASEVAGK"
    "LIARNGADGSLKGVLLEELGAAKLKVIAPLTGLAKELKAFESLLAEKKD"
)
prompt = f"Please describe this protein:\n<PROT>{protein_sequence}</PROT>"

input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=100, do_sample=False)
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)

# Example: generate a SMILES string from a molecule description
mol_description = "A molecule with anti-cancer activity and a molecular weight around 300."
prompt = (
    "Generate a SMILES string for a molecule with the following properties:\n"
    f"<MOL>{mol_description}</MOL>"
)

input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=100, do_sample=False)
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)
```
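
For design-oriented prompts it can help to sample several candidates rather than taking the single greedy continuation above. The snippet below is a minimal sketch using standard `transformers` generation arguments (`do_sample`, `temperature`, `top_p`, `num_return_sequences`); the specific values are illustrative, not recommendations from the paper.

```python
# Draw multiple candidate outputs for the same prompt (illustrative settings).
output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,          # stochastic decoding instead of greedy search
    temperature=0.8,         # assumed value; tune per task
    top_p=0.95,              # nucleus sampling
    num_return_sequences=5,  # number of candidates to draw
)
for i, seq in enumerate(output_ids):
    print(f"Candidate {i + 1}:", tokenizer.decode(seq, skip_special_tokens=True))
```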

### Citation

```bibtex
@article{DBLP:journals/corr/abs-2410-07919,
  author       = {Xiang Zhuang and
                  Keyan Ding and
                  Tianwen Lyu and
                  Yinuo Jiang and
                  Xiaotong Li and
                  Zhuoyi Xiang and
                  Zeyuan Wang and
                  Ming Qin and
                  Kehua Feng and
                  Jike Wang and
                  Qiang Zhang and
                  Huajun Chen},
  title        = {InstructBioMol: Advancing Biomolecule Understanding and Design Following
                  Human Instructions},
  journal      = {CoRR},
  volume       = {abs/2410.07919},
  year         = {2024}
}
```