license: mit
pipeline_tag: any-to-any
library_name: transformers
tags:
- biomolecules
- proteins
- molecules
- multimodal
- language-model
- instruction-tuned
- llama
InstructBioMol: A Multimodal LLM for Biomolecule Understanding and Design
Model Description
InstructBioMol is a multimodal large language model that bridges natural language with biomolecules (proteins and small molecules). It achieves any-to-any alignment between natural language, molecules, and proteins through comprehensive instruction tuning.
For detailed information, please refer to our paper and code repository.
Released Variants
Model Name | Stage | Multimodal | Description |
---|---|---|---|
InstructBioMol-base | Pretraining | ❎ | Continually pretrained model on molecular sequences, protein sequences, and scientific literature. |
InstructBioMol-instruct-stage1 (This Model) | Instruction tuning (stage 1) | ✅ | Stage 1 instruction-tuned model with biomolecular multimodal processing capabilities (e.g., 3D molecules/proteins). |
InstructBioMol-instruct | Instruction tuning (stage 1 and 2) | ✅ | Fully instruction-tuned model (stage 1 and stage 2) with biomolecular multimodal processing capabilities (e.g., 3D molecules/proteins). |
Training Details
Base Architecture: InstructBioMol-base
Training Data:
1. Molecule - Natural Language Alignment:
- 60 million samples from PubChem and ChEBI
2. Protein - Natural Language Alignment:
- 35 million samples from UniProt (Swiss-Prot and TrEMBL)
3. Molecule - Protein Alignment:
- 1 million samples from BindingDB and Rhea
Training Objective: Instruction tuning
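To get a rough sense of how the three alignment tasks are weighted, the listed counts can be turned into shares of the total. This is only arithmetic on the rounded figures above; whether training samples the tasks in proportion to their size is not stated here, and the dictionary keys are illustrative labels rather than names from the project.

```python
# Rounded counts (in millions) taken from the Training Data list above.
counts_millions = {
    "molecule-text": 60,
    "protein-text": 35,
    "molecule-protein": 1,
}

total = sum(counts_millions.values())  # 96 million in total

# Share of each task as a percentage of all alignment data.
shares = {task: 100 * n / total for task, n in counts_millions.items()}

for task, pct in shares.items():
    print(f"{task}: {pct:.1f}%")
```

On these figures, molecule-protein alignment accounts for roughly 1% of the instruction data, with the remainder split about 5:3 between molecule-text and protein-text alignment.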
Quickstart
You can use InstructBioMol with the transformers library by setting trust_remote_code=True. The model handles multimodal inputs, specifically proteins and molecules, as demonstrated below.
from transformers import AutoModel, AutoTokenizer
import torch
# Load the model and tokenizer
model_name = "hicai-zju/InstructBioMol-instruct-stage1" # or "hicai-zju/InstructBioMol-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True)
# Example: Generate a description for a protein sequence
# FASTA-style input: header line, newline, then the amino-acid sequence
protein_sequence = (
    ">sp|P0A7G8|FMT_ECOLI Formyltetrahydrofolate synthetase OS=Escherichia coli (strain K12) PE=1 SV=1\n"
    "MSKKLVSGTDVAEYLLSVQKEELGDLTLEIDELKTVTLTRIAQLKDFGSGSIPVEAVKLINQENILFLLGTLGIGKTTTTLLKRIISDKDFGFYSSADKLYDYKGYVVFGESVAGAEADWTSKIDVVVAPFTSIDETAKLLAKLTPDVSVLGQAVAVKGALRILGMDDAAQRVADIVGLAVTGQIVKLAANAGADLLEALKLPEVVVVGNGVAYALDGRLKAEFSLDTAVADGASEVAGKLIARNGADGSLKGVLLEELGAAKLKVIAPLTGLAKELKAFESLLAEKKD"
)
prompt = f"Please describe this protein:\n<PROT>{protein_sequence}</PROT>"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=100, do_sample=False)
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)
# Example: Generate a SMILES string for a molecule description
mol_description = "A molecule with anti-cancer activity and a molecular weight around 300."
prompt = f"Generate a SMILES string for a molecule with the following properties:\n<MOL>{mol_description}</MOL>"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=100, do_sample=False)
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)
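The two examples above build their prompts by hand and decode the full output sequence. The sketch below factors prompt construction and output trimming into small helpers. The names `wrap_span`, `build_prompt`, and `new_tokens` are hypothetical and not part of the InstructBioMol code; the `<PROT>`/`<MOL>` tag layout is copied from the examples above, and the idea that the output begins with the prompt ids is an assumption based on how `generate` usually behaves for decoder-only models.

```python
# Hypothetical helpers; not part of the InstructBioMol repository.
# The tag names follow the Quickstart examples above.
TAGS = {"protein": "PROT", "molecule": "MOL"}


def wrap_span(kind: str, payload: str) -> str:
    """Wrap a payload in the opening/closing tag for its modality."""
    tag = TAGS[kind]  # raises KeyError for an unknown modality
    return f"<{tag}>{payload}</{tag}>"


def build_prompt(instruction: str, kind: str, payload: str) -> str:
    """Put the instruction on the first line and the tagged payload on the second."""
    return f"{instruction}\n{wrap_span(kind, payload)}"


def new_tokens(output_ids, prompt_length: int):
    """Drop the first `prompt_length` ids so only generated tokens remain.

    Assumption: the output sequence starts with the prompt ids.
    """
    return output_ids[prompt_length:]


prompt = build_prompt("Please describe this protein:", "protein", "MSKKLVSG")
print(prompt)
```

With the Quickstart variables, `tokenizer.decode(new_tokens(output_ids[0], input_ids.shape[1]), skip_special_tokens=True)` would then decode only the continuation, under the same assumption.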
Citation
@article{DBLP:journals/corr/abs-2410-07919,
author = {Xiang Zhuang and
Keyan Ding and
Tianwen Lyu and
Yinuo Jiang and
Xiaotong Li and
Zhuoyi Xiang and
Zeyuan Wang and
Ming Qin and
Kehua Feng and
Jike Wang and
Qiang Zhang and
Huajun Chen},
title = {InstructBioMol: Advancing Biomolecule Understanding and Design Following
Human Instructions},
journal = {CoRR},
volume = {abs/2410.07919},
year = {2024}
}