---
license: mit
---
# InstructBioMol: A Multimodal LLM for Biomolecule Understanding and Design
Paper • Project • Quickstart • Citation
## Model Description
InstructBioMol is a multimodal large language model that bridges natural language with biomolecules (proteins and small molecules). It achieves any-to-any alignment between natural language, molecules, and proteins through comprehensive instruction tuning.
For detailed information, please refer to our paper and code repository.
## Released Variants
| Model Name | Stage | Multimodal | Description |
|---|---|---|---|
| InstructBioMol-base (this model) | Pretraining | ❎ | Continually pretrained model on molecular sequences, protein sequences, and scientific literature. |
| InstructBioMol-instruct-stage1 | Instruction tuning (stage 1) | ✅ | Stage-1 instruction-tuned model with biomolecular multimodal processing capabilities (e.g., 3D molecules/proteins). |
| InstructBioMol-instruct | Instruction tuning (stages 1 & 2) | ✅ | Fully instruction-tuned model (stages 1 & 2) with biomolecular multimodal processing capabilities (e.g., 3D molecules/proteins). |
## Training Details
**Base Architecture:** LLaMA-2-7B

**Training Data:**
1. Molecular Sequences:
   - Format: SELFIES
   - Source: PubChem
   - Size: 100 million (100M) entries
2. Protein Sequences:
   - Format: FASTA-like, with each residue prefixed by `<p>` (e.g., `<p>M<p>A<p>L<p>W...`; see the formatting sketch below)
   - Source: UniRef50
   - Size: 59 million (59M) entries
3. Natural Language Texts:
   - Source: Abstracts from PubMed, bioRxiv, and ChemRxiv
   - Size: 6 million (6M) abstracts
**Training Objective:** Causal language modeling (self-supervised)
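
For illustration, the sketch below shows how raw inputs might be put into the two sequence formats listed above. It assumes the open-source `selfies` package for SMILES-to-SELFIES conversion; the helper functions are illustrative and are not part of the released code or the paper's preprocessing pipeline.

```python
# Minimal sketch of the input formats described above (assumes `pip install selfies`).
import selfies as sf


def smiles_to_selfies(smiles: str) -> str:
    """Convert a SMILES string to the SELFIES format used for molecular sequences."""
    return sf.encoder(smiles)


def format_protein(sequence: str) -> str:
    """Prefix every residue with <p>, matching the protein sequence format above."""
    return "".join(f"<p>{aa}" for aa in sequence)


print(smiles_to_selfies("CCO"))   # ethanol -> e.g. [C][C][O]
print(format_protein("MALW"))     # -> <p>M<p>A<p>L<p>W
```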
## Quick Start
```python
from transformers import LlamaForCausalLM, LlamaTokenizer
import torch

model_name = "hicai-zju/InstructBioMol-base"
tokenizer = LlamaTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(model_name, device_map="cuda:0")

prompt = "<p>M"          # protein sequence
# prompt = "[C]"         # molecule sequence (SELFIES)
# prompt = "Scientific"  # natural language

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
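
Note that protein continuations are generated with the `<p>` residue prefixes intact. A small post-processing sketch to recover a plain amino-acid string (the helper name is hypothetical, not part of the released code):

```python
# Hypothetical helper: strip the <p> residue prefixes from generated protein text.
def strip_protein_prefixes(text: str) -> str:
    return text.replace("<p>", "")


print(strip_protein_prefixes("<p>M<p>A<p>L<p>W"))  # -> MALW
```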
## Citation
```bibtex
@article{zhuang2025advancing,
  author    = {Xiang Zhuang and
               Keyan Ding and
               Tianwen Lyu and
               Yinuo Jiang and
               Xiaotong Li and
               Zhuoyi Xiang and
               Zeyuan Wang and
               Ming Qin and
               Kehua Feng and
               Jike Wang and
               Qiang Zhang and
               Huajun Chen},
  title     = {Advancing biomolecular understanding and design following human instructions},
  journal   = {Nature Machine Intelligence},
  pages     = {1--14},
  year      = {2025},
  publisher = {Nature Publishing Group UK London}
}
```