---
license: mit
pipeline_tag: any-to-any
library_name: transformers
tags:
- biomolecules
- proteins
- molecules
- multimodal
- language-model
- instruction-tuned
- llama
---

<div align="center">

<h3>InstructBioMol: A Multimodal LLM for Biomolecule Understanding and Design</h3>

<p align="center">
  <a href="https://arxiv.org/abs/2410.07919">Paper</a> •
  <a href="https://github.com/HICAI-ZJU/InstructBioMol">Project</a> •
  <a href="#quickstart">Quickstart</a> •
  <a href="#citation">Citation</a>
</p>

</div>

### Model Description

InstructBioMol is a multimodal large language model that bridges natural language with biomolecules (proteins and small molecules). It achieves any-to-any alignment between natural language, molecules, and proteins through comprehensive instruction tuning.

*For detailed information, please refer to our [paper](https://arxiv.org/abs/2410.07919) and [code repository](https://github.com/HICAI-ZJU/InstructBioMol).*

### Released Variants

| Model Name | Stage | Multimodal | Description |
|------------|-------|------------|-------------|
| [InstructBioMol-base](https://huggingface.co/hicai-zju/InstructBioMol-base) | Pretraining | ❎ | Continually pretrained on molecular sequences, protein sequences, and scientific literature. |
| [InstructBioMol-instruct-stage1](https://huggingface.co/hicai-zju/InstructBioMol-instruct-stage1) (*This Model*) | Instruction tuning (stage 1) | ✅ | Stage-1 instruction-tuned model with biomolecular multimodal processing capabilities (e.g., 3D molecules/proteins). |
| [InstructBioMol-instruct](https://huggingface.co/hicai-zju/InstructBioMol-instruct) | Instruction tuning (stages 1 and 2) | ✅ | Fully instruction-tuned model (stages 1 and 2) with biomolecular multimodal processing capabilities (e.g., 3D molecules/proteins). |

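All variants load the same way; to try a different one, swap in the repository ID from the table above. The minimal sketch below assumes each variant exposes the same `AutoModel`/`AutoTokenizer` remote-code interface; the full generation walkthrough is in the Quickstart below.

```python
from transformers import AutoModel, AutoTokenizer

# Any repository ID from the table above can be substituted here
# (assumes all variants share the same remote-code loading interface).
repo_id = "hicai-zju/InstructBioMol-instruct"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
```
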
### Training Details

**Base Architecture**: InstructBioMol-base

**Training Data**:

1. Molecule - Natural Language Alignment:
   - 60 million records from PubChem and ChEBI
2. Protein - Natural Language Alignment:
   - 35 million records from UniProt (Swiss-Prot and TrEMBL)
3. Molecule - Protein Alignment:
   - 1 million records from BindingDB and Rhea

**Training Objective**: Instruction tuning

### Quickstart

You can use InstructBioMol with the `transformers` library by setting `trust_remote_code=True`. The model handles multimodal inputs, specifically proteins and molecules, as demonstrated below.

```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load the model and tokenizer (trust_remote_code is required for the custom model code)
model_name = "hicai-zju/InstructBioMol-instruct-stage1"  # or "hicai-zju/InstructBioMol-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Example 1: generate a description for a protein sequence.
# The amino-acid sequence below is illustrative; replace it with your own.
protein_sequence = "MSKKLVSGTDVAEYLLSVQKEELGDLTLEIDELKTVTLTRIAQLKDFGSGSIPVEAVKLINQENILFLLGTLGIGKTTTTLLKRIISDKDFGFYSSADKLYDYKGYVVFGESVAGAEADWTSKIDVVVAPFTSIDETAKLLAKLTPDVSVLGQAVAVKGALRILGMDDAAQRVADIVGLAVTGQIVKLAANAGADLLEALKLPEVVVVGNGVAYALDGRLKAEFSLDTAVADGASEVAGKLIARNGADGSLKGVLLEELGAAKLKVIAPLTGLAKELKAFESLLAEKKD"

# <PROT>...</PROT> marks the protein input in the prompt; see the project repository
# for the complete multimodal prompt format (e.g., 3D structure inputs).
prompt = f"Please describe this protein: <PROT>{protein_sequence}</PROT>"

input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=100, do_sample=False)
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)

# Example 2: generate a SMILES string from a natural-language molecule description.
mol_description = "A molecule with anti-cancer activity and a molecular weight around 300."
prompt = f"Generate a SMILES string for a molecule with the following properties: <MOL>{mol_description}</MOL>"

input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=100, do_sample=False)
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)
```

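For more varied outputs, you can replace greedy decoding with sampling. This is a minimal sketch assuming the remote-code `generate` method forwards the standard Hugging Face sampling arguments (`do_sample`, `temperature`, `top_p`); the values shown are illustrative.

```python
# Minimal sampling sketch (assumes standard `generate` kwargs are supported by the remote code).
output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,   # stochastic sampling instead of greedy decoding
    temperature=0.7,  # lower values -> more deterministic outputs
    top_p=0.9,        # nucleus sampling threshold
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
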
### Citation

```bibtex
@article{DBLP:journals/corr/abs-2410-07919,
  author  = {Xiang Zhuang and
             Keyan Ding and
             Tianwen Lyu and
             Yinuo Jiang and
             Xiaotong Li and
             Zhuoyi Xiang and
             Zeyuan Wang and
             Ming Qin and
             Kehua Feng and
             Jike Wang and
             Qiang Zhang and
             Huajun Chen},
  title   = {InstructBioMol: Advancing Biomolecule Understanding and Design Following
             Human Instructions},
  journal = {CoRR},
  volume  = {abs/2410.07919},
  year    = {2024}
}
```