---
license: mit
---

# InstructBioMol: A Multimodal LLM for Biomolecule Understanding and Design

[Paper](https://arxiv.org/abs/2410.07919) | [Project](https://github.com/HICAI-ZJU/InstructBioMol) | [Quickstart](#quick-start) | [Citation](#citation)

### Model Description

InstructBioMol is a multimodal large language model that bridges natural language with biomolecules (proteins and small molecules). It achieves any-to-any alignment between natural language, molecules, and proteins through comprehensive instruction tuning.

*For detailed information, please refer to our [paper](https://arxiv.org/abs/2410.07919) and [code repository](https://github.com/HICAI-ZJU/InstructBioMol).*

### Released Variants

| Model Name | Stage | Multimodal | Description |
|------------|-------|------------|-------------|
| [InstructBioMol-base](https://huggingface.co/hicai-zju/InstructBioMol-base) (*This Model*) | Pretraining | ❎ | Continually pretrained model on molecular sequences, protein sequences, and scientific literature. |
| [InstructBioMol-instruct-stage1](https://huggingface.co/hicai-zju/InstructBioMol-instruct-stage1) | Instruction tuning (stage 1) | ✅ | Stage-1 instruction-tuned model with biomolecular multimodal processing capabilities (e.g., 3D molecules/proteins). |
| [InstructBioMol-instruct](https://huggingface.co/hicai-zju/InstructBioMol-instruct) | Instruction tuning (stages 1 & 2) | ✅ | Fully instruction-tuned model (stages 1 & 2) with biomolecular multimodal processing capabilities (e.g., 3D molecules/proteins). |

### Training Details

**Base Architecture**: LLaMA-2-7B

**Training Data**:

1. **Molecular Sequences**
   - Format: SELFIES (illustrated in the sketch after this list)
   - Source: PubChem
   - Size: 100 million (100M) entries
2. **Protein Sequences**
   - Format: FASTA-like amino-acid sequences, each prefixed with a dedicated protein tag (e.g., `MALW...`)
   - Source: UniRef50
   - Size: 59 million (59M) entries
3. **Natural Language Texts**
   - Source: abstracts from PubMed, bioRxiv, and ChemRxiv
   - Size: 6 million (6M) abstracts
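The molecular corpus uses SELFIES rather than raw SMILES. As a rough illustration of what that format looks like (not part of the released code, and assuming the third-party `selfies` package is installed), a SMILES string can be converted as follows:

```python
# Illustration only: converting a SMILES string into the SELFIES format used
# for the molecular pretraining corpus (assumes `pip install selfies`).
import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"   # aspirin, written as SMILES
selfies_str = sf.encoder(smiles)   # bracketed token string, e.g. "[C][C][=Branch1]..."
print(selfies_str)
```

Each bracketed token is a unit the model reads and writes for molecules, which is why the molecule prompt in the Quick Start below is simply `"[C]"`.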

**Training Objective**: Causal language modeling (self-supervised)

### Quick Start

```python
from transformers import LlamaForCausalLM, LlamaTokenizer
import torch

model_name = "hicai-zju/InstructBioMol-base"
tokenizer = LlamaTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(model_name, device_map="cuda:0")

# Pick one prompt type: protein, molecule (SELFIES), or natural language.
prompt = "M"              # protein sequence (see the protein format note in Training Data)
# prompt = "[C]"          # molecule sequence (SELFIES)
# prompt = "Scientific"   # natural-language text

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
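When prompting with a SELFIES fragment such as `"[C]"`, the sampled continuation is itself a SELFIES token string. Below is a minimal post-processing sketch, again assuming the third-party `selfies` package; the completion shown is hypothetical, not an actual model output:

```python
# Illustration only: turning a generated SELFIES continuation back into SMILES
# for inspection (assumes `pip install selfies`).
import selfies as sf

generated = "[C][C][O]"   # hypothetical completion when prompted with "[C]"
try:
    print(sf.decoder(generated))   # -> "CCO" for this example
except Exception:
    # Sampled text is not guaranteed to be a well-formed SELFIES string.
    print("Generated text could not be decoded as SELFIES.")
```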

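Because pretraining uses a plain causal language-modeling objective, the base checkpoint can also be scored rather than sampled: passing `labels` equal to the input IDs to a Hugging Face causal LM returns the averaged next-token cross-entropy. A minimal sketch, illustration only:

```python
# Illustration only: measuring the causal-LM loss (the pretraining objective)
# and perplexity of the checkpoint on a short sequence.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_name = "hicai-zju/InstructBioMol-base"
tokenizer = LlamaTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(model_name, device_map="cuda:0")

text = "MALW"   # short amino-acid fragment; a SELFIES or natural-language string works too
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    # labels = input_ids makes the model return the shifted next-token
    # cross-entropy, i.e. the self-supervised pretraining objective.
    out = model(**inputs, labels=inputs["input_ids"])
print(f"loss = {out.loss.item():.3f}, perplexity = {torch.exp(out.loss).item():.1f}")
```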
### Citation

```bibtex
@article{zhuang2025advancing,
  author={Xiang Zhuang and Keyan Ding and Tianwen Lyu and Yinuo Jiang and Xiaotong Li and Zhuoyi Xiang and Zeyuan Wang and Ming Qin and Kehua Feng and Jike Wang and Qiang Zhang and Huajun Chen},
  title={Advancing biomolecular understanding and design following human instructions},
  journal={Nature Machine Intelligence},
  pages={1--14},
  year={2025},
  publisher={Nature Publishing Group UK London}
}
```